Hello, I am (name) from Lernvern. (6 seconds pause, music) This tutorial continues from the last session, so let's carry on. In today's Machine Learning tutorial we look at the implementation of Q-learning, which we have already discussed conceptually under reinforcement learning. In Q-learning we have no data initially, and learning without data is a challenge in itself. Q-learning handles this with a simple technique: as you keep taking actions, you keep receiving data based on those actions, and you use that data to learn and to make later actions better.

So, let us start the implementation. I will do it with NumPy, so `import numpy as np`. We import NumPy because it lets us create arrays. You are aware that Q-learning needs a reward matrix and a Q table, so let us start with the reward matrix. I take a capital `R` and set `R` equal to `np.matrix`, passing the elements in a list. The first row has six elements: -1, -1, -1, then a fourth -1, then a 0, and one more -1.

So I have taken six elements. After these six elements I put a comma and copy the row directly for the rows below, rows 2, 3, 4, 5 and 6, which gives a 6 by 6 matrix. The first row is done. In the second row we set the value to 100, change this 0 to -1, and put a 0 here. Wherever you want to give rewards, you initialize them at the start, and that is what I am doing here. In the next row we again make this one zero, and this one we make -1. In the next one, we first make it -1 at every place and then edit it, so this is -1, this is -1 and this is also -1. So far we have edited up to the fourth row. In the second and third rows the zero is here; in the fourth row we make both of these zero, and this one zero as well. Further down, the second-to-last value can also be made zero. In the next row we make the last value 100, and in this one too we make the last value 100.

**3:27**

For the remaining values, we make this one zero and remove the negative reward, do the same here, and make this one above zero as well. That is how I have initialized the R matrix, the reward matrix. I will execute this, and if you print it you can see the R matrix.

After the reward matrix, the next thing we need is the Q matrix. For the Q matrix I will again take NumPy's help: `Q` is equal to `np.matrix`, and in the brackets we put `np.zeros`, because initially the Q matrix holds zeros. We make it 6 by 6, because the reward matrix above is also 6 by 6. If we print `Q`, we see the matrix that has been formed: a 6 by 6 matrix of zeros. The Q matrix is like a memory. Initially there is no memory, so all values are zero. Rewards, on the other hand, can be present from the beginning: they say which reward is obtained at which step. That is why we initialized the reward matrix with some negative and positive rewards.

Moving ahead, there is a parameter that we name gamma. In this tutorial we call it the learning rate; in the update formula it is the factor that multiplies the best future value, which is usually called the discount factor. It takes a value between 0 and 1, and we write `gamma = 0.8`.

After gamma, the next step is the initial state, meaning where to start from. We name it `initial_state`, and we have to decide what it will be. We take the initial state as 1.
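The setup described so far can be sketched as follows. This is a minimal sketch, not the exact matrix from the video: only the first two rows of `R` follow the values dictated above, and the remaining rows are assumed for illustration.

```python
import numpy as np

# Reward matrix: rows are states, columns are actions.
# -1 = negative reward, 0 = no reward, 100 = positive reward.
# ASSUMPTION: rows 0 and 1 follow the dictated values; rows 2 to 5
# are illustrative and may differ from the matrix shown in the video.
R = np.matrix([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

# Q matrix: the agent's "memory", all zeros before any training.
Q = np.matrix(np.zeros([6, 6]))

gamma = 0.8          # multiplies the best future Q value in the update
initial_state = 1    # state the walkthrough starts from

print(R)
print(Q)
```

Both matrices are 6 by 6, so every state has one row in `R` and one row in `Q`.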

So 1 is our initial state. In the 6 by 6 matrix, from here to here is one state and from here to here is a second state, so there are several possible states, and we have given the name 1 to the initial state.

Next we will make a function whose task is to tell you all the actions that you can take. You cannot take every action in every state; you can take only the actions allowed in the state you are in, and that is what this function is for. As a comment I type: function to return all the available actions in the state given as an argument. (typing) So whatever state we give as an argument, the function returns all possible actions for it. We write `def available_actions` and pass the state in the brackets, so `def available_actions(state):`, and press enter. Inside, we first find the current state's row: `current_state_row` is `R[state,]`, the row of `R` for that state. Next come the available actions.

**8:06**

Here we use `np.where` and pass the current state's row, `current_state_row`, together with the condition to check. The condition is that the value in the row is greater than or equal to zero, `current_state_row >= 0`, so entries of -1 are left out. From the result we take the element at index 1, which holds the matching column positions, and we return it. So the function returns the available actions.

Now see how to get the available actions from this function. We write `available_act = available_actions(initial_state)`: `available_actions` is the name of the function, `available_act` is the short name for the result, and the state we pass is the initial state, which I set to 1 above. Look at the output in `available_act`: we see 3 and 5. So 3 and 5 are the available actions. That is our function.
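A minimal sketch of this function is shown below. The reward matrix is an assumption: its first two rows follow the dictated values and the rest are illustrative, chosen so that state 1 allows actions 3 and 5 as in the walkthrough.

```python
import numpy as np

# ASSUMPTION: illustrative reward matrix (only rows 0 and 1 are dictated).
R = np.matrix([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

def available_actions(state):
    """Return all the available actions in the state given as an argument."""
    current_state_row = R[state, ]
    # For a 1 x 6 np.matrix row, np.where returns (row_idx, col_idx);
    # element [1] holds the column positions, i.e. the actions.
    return np.where(current_state_row >= 0)[1]

initial_state = 1
available_act = available_actions(initial_state)
print(available_act)  # [3 5] for the assumed matrix
```

The index `[1]` matters because `np.matrix` rows stay two-dimensional; with a plain one-dimensional array the column positions would sit at index 0 instead.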

Now we will make one more function. The previous function was for the available actions; eventually we need a function for the Q table, because the Q table is all zeros and we have to update it. Before that, we also have to say which action will actually be taken. It will take a sample action, so let's make a function `sample_next_action`. We pass the available actions to it, so it receives the values 3 and 5 that we saw above. Inside, `next_action` is `int(np.random.choice(available_act, 1))`: we pass all the available actions and choose only one of them. Whatever is available, 3 and 5 as above or, say, 2 and 3, we have to choose only one, so we choose it at random. After choosing, it is essential to return it, so `return next_action`. This gives you the next action. Here we hit an invalid syntax error because `return` was misspelled; with that corrected, we get the next action.

We can try it: `action = sample_next_action(available_act)`, and press enter. Let's print and see which action was taken: we can see that `action` is 3. So between 3 and 5, 3 has been selected. This is how you can choose your next action. We have now also made a function to choose the next action.
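The random choice described above can be sketched like this. One small departure from the spoken version is assumed here: the size argument of `np.random.choice` is omitted, so the call returns a single element that `int()` converts directly.

```python
import numpy as np

def sample_next_action(available_act):
    """Pick one action uniformly at random from the allowed ones."""
    # Without a size argument, np.random.choice returns a single element.
    return int(np.random.choice(available_act))

available_act = np.array([3, 5])   # allowed actions found for state 1
action = sample_next_action(available_act)
print(action)                      # either 3 or 5
```

Because the choice is random, repeated runs can print 3 on one run and 5 on another.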

**13:40**

Now let us continue the implementation. We also have to work on the Q matrix and work out how much learning is done, so let us make a function named `update`. We write `def update` and pass the current state, `current_state`, along with `action` and `gamma` as arguments. Inside `update` we start with `max_index`. We find it with `np.where`: we look for the positions where `Q[action,]` is equal to (a double equals sign) `np.max(Q[action,])`, and take index 1 of the result. That gives us the value of `max_index`.

Next, if `max_index.shape[0]` is greater than one, that is, if more than one position holds the maximum, we set `max_index` to `int(np.random.choice(max_index, size=1))`, choosing one of them at random. Otherwise, in the `else` branch, we simply convert it with `max_index = int(max_index)`, keeping whatever value `max_index` had. After the `else` part, `max_value` is `Q[action, max_index]`. Let me just correct this. Now we will apply the Q-learning formula so that learning can take place.
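The tie-breaking step can be isolated in a small helper, sketched below. The helper name `max_index_with_random_tie_break` and the Q values used in the demonstration are hypothetical, introduced only for illustration.

```python
import numpy as np

def max_index_with_random_tie_break(q_row):
    """Return the column of the largest value in a 1 x n matrix row.

    If several columns share the maximum, one of them is picked at random.
    """
    max_index = np.where(q_row == np.max(q_row))[1]
    if max_index.shape[0] > 1:
        return int(np.random.choice(max_index))
    return int(max_index[0])

Q = np.matrix(np.zeros([6, 6]))
Q[3, 1] = 400.0   # hypothetical values that create a tie in row 3
Q[3, 4] = 400.0
Q[2, 5] = 7.0     # hypothetical single maximum in row 2

print(max_index_with_random_tie_break(Q[3, ]))   # 1 or 4
print(max_index_with_random_tie_break(Q[2, ]))   # 5
```

Indexing with `max_index[0]` in the single-maximum case avoids converting a whole array to an integer, which newer NumPy versions warn about.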

For the Q-learning formula, we start with `Q`, and inside `Q` there will be the current state and some action, isn't it?

**19:12**

The agent is in the current state, and with it there is an action; on the basis of that action, the new state it should go to is decided. So `Q[current_state, action]` is equal to the reward, taken from `R[current_state, action]`, plus gamma, which we set above, multiplied by the max value: `gamma * max_value`. This way we are implementing the Q-learning formula. The entry of Q for the current state and action is determined by two things, the reward for that state and action and the max value multiplied by gamma, and these two decide how the Q matrix is updated.
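The update rule, Q(state, action) = R(state, action) + gamma × max Q(next state, any action), can be sketched as below. Two things are assumed: the reward matrix is illustrative beyond its first two rows, and the row maximum is taken directly with `np.max`, which gives the same `max_value` as the index lookup because every tied index holds the same value.

```python
import numpy as np

# ASSUMPTION: illustrative reward matrix (only rows 0 and 1 are dictated).
R = np.matrix([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.matrix(np.zeros([6, 6]))
gamma = 0.8

def update(current_state, action, gamma):
    """Apply one Q-learning update for the move current_state -> action."""
    # Taking `action` leads to the state with the same number, so the best
    # future value is the largest entry in that state's row of Q.
    max_value = np.max(Q[action, ])
    Q[current_state, action] = R[current_state, action] + gamma * max_value

update(1, 5, gamma)   # Q[1, 5] = 100 + 0.8 * 0   = 100
update(3, 1, gamma)   # Q[3, 1] = 0   + 0.8 * 100 = 80
print(Q[1, 5], Q[3, 1])
```

The second call shows how value flows backward: state 3 earns no direct reward for moving to state 1, but it inherits 80 percent of the best value already recorded for state 1.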

Let's move ahead and update the matrix. It is updated through the `update` function: we call `update` with the initial state, then the action, then `gamma`. Let's execute it. Next we have to do the training, and we will train approximately 10,000 times. So `for i in range(10000)`. Inside the loop, the current state is chosen at random: `current_state = np.random.randint(0, int(Q.shape[0]))`. After this come the available actions, which go in `available_act`. The name of our function is `available_actions`, and we pass it the current state, the same thing we did before. After that, the action is taken from the function `sample_next_action`, to which we pass the list of available actions, `available_act`. Done.

**24:08**

Now that we have taken the action, we have to update. So we call the `update` function in the same way as before, passing three things: `current_state`, `action` and `gamma`. Now it trains 10,000 times. After training we print the Q matrix once. Here I get a maximum recursion depth exceeded error, so I will make it run less: instead of 10,000 I run it 1,000 times and print `Q` to see whether updates have taken place. Why `Q`? Because Q is our memory table. The main thing in Q-learning is that this Q matrix is updated and further actions are taken on its basis. So now we are training it 10,000 times.
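Putting the pieces together, the training loop can be sketched as follows. The reward matrix beyond its first two rows, the fixed random seed, and the simplified `update` (row maximum taken directly) are assumptions made for this sketch.

```python
import numpy as np

np.random.seed(0)   # ASSUMPTION: seeded only to make runs repeatable

# ASSUMPTION: illustrative reward matrix (only rows 0 and 1 are dictated).
R = np.matrix([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.matrix(np.zeros([6, 6]))
gamma = 0.8

def available_actions(state):
    # Column positions of every entry >= 0 in the state's row.
    return np.where(R[state, ] >= 0)[1]

def sample_next_action(available_act):
    # One allowed action, chosen uniformly at random.
    return int(np.random.choice(available_act))

def update(current_state, action, gamma):
    # Reward for this move plus the discounted best value of the next state.
    Q[current_state, action] = (
        R[current_state, action] + gamma * np.max(Q[action, ])
    )

# Train: pick a random state, a random allowed action, and update Q.
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

print(np.round(Q))
```

For this assumed matrix the largest entries settle near 500, with 400 and 320 one and two moves further from the goal, which matches the kind of values read out from the trained matrix.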

**26:00**

After training, let us see what Q values are displayed. You can see that the Q matrix has been updated. This is what is called reinforcement, meaning that updates occur. See, this value is 400 and this one is 320; after 10,000 iterations the Q matrix has been updated. Now let me write it in a normalized way: I print `Q / np.max(Q) * 100`, that is, Q divided by its maximum and multiplied by 100. This gives normalized values rather than very high ones. So our complete Q-learning system is ready.

Now we can test it. First, what is our current state? Let us assume it is 2, so `current_state = 2`. The steps taken start from the current state, so `steps` begins as a list holding `current_state`. Our goal state is 5. So while the current state is not equal to 5, we keep taking the next step. The next step is `next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]`. If `next_step_index.shape[0]` is greater than one, we choose among the candidates available to us with `next_step_index = int(np.random.choice(next_step_index, size=1))`; else we simply write `next_step_index = int(next_step_index)`, without any other change.

**31:06**

After this we need the steps, to record how many steps there are. Earlier we put the current state in `steps`; now we append `next_step_index` to it with `steps.append(next_step_index)`, and we keep appending on every pass. The current state then becomes the next step index: `current_state = next_step_index`. Now let's see which path Q-learning has selected this time.

We print the text "path" followed by `\n`, and then `steps`. Watch here: because of Q-learning, the selected path is 2, 3, 4, 5. So this was the complete concept of Q-learning. You should rewind and revise it once more: check how we first initialized our reward matrix and put rewards in it, including -1, zero meaning no reward, and positive rewards; how we then made a Q matrix in which all values were zero; how, as we trained the model with the learning rate gamma, we got a new matrix in which the values were updated (there would have been more updates had we run it longer); and how we finally tested which path it selects. For that we used the Q matrix, and it returned the path 2, 3, 4, 5, reaching state 5, the goal state.
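The normalization and the test walk can be sketched together as below. The Q matrix here is an assumption: it holds the values that the update rule converges to for an illustrative reward matrix with gamma = 0.8, and it is not output copied from the video.

```python
import numpy as np

# ASSUMPTION: converged Q values for an illustrative reward matrix.
Q = np.matrix([
    [  0.,   0.,   0.,   0., 400.,   0.],
    [  0.,   0.,   0., 320.,   0., 500.],
    [  0.,   0.,   0., 320.,   0.,   0.],
    [  0., 400., 256.,   0., 400.,   0.],
    [320.,   0.,   0., 320.,   0., 500.],
    [  0., 400.,   0.,   0., 400., 500.],
])

# Normalized view: the largest entry becomes 100.
print(Q / np.max(Q) * 100)

current_state = 2
goal_state = 5
steps = [current_state]

# Follow the largest Q value in each row until the goal is reached.
while current_state != goal_state:
    next_step_index = np.where(
        Q[current_state, ] == np.max(Q[current_state, ])
    )[1]
    if next_step_index.shape[0] > 1:
        # Several equally good moves: pick one at random.
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    steps.append(next_step_index)
    current_state = next_step_index

print("Selected path:")
print(steps)
```

With these values, state 3 has two equally good moves (to state 1 and to state 4), so the walk can end up as 2, 3, 1, 5 or 2, 3, 4, 5; both reach the goal in three steps.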

If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to answer your query.
