Namaskar, I am Kushal from LearnVern.
We are continuing our project on reinforcement learning, in which we have already built the complete environment for a cab, a taxi; we have identified the pickup location from where to pick up a passenger and the drop-off location where to drop the passenger, and we have also identified the position of our taxi.
After this, to make the taxi pick up and drop off a passenger, we had also written a sample piece of code in which the taxi does pick up and drop off, but we saw that it takes thousands and thousands of timesteps to do so, and many penalties are also imposed along the way.
So let us now bring in reinforcement learning and see how it can help us, so that the taxi learns from each of its actions and from the reward it receives for each action. Let us watch it here. The algorithm we will use is Q-learning. This Q-learning algorithm will help us create a memory that captures, for each action, the reward obtained and the next state, and it will make the learning of our agent much more efficient. So let us see how it works.
OK, so we have a reward table in our taxi environment. If you look at the taxi environment, by doing env.P you can see the reward table; just see this, this is the reward table. This reward table is going to help us, and it is from this reward table that the agent will learn.
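As a quick sketch of how you could peek at that reward table for a single state (assuming the Taxi-v3 environment from earlier in the project; state 328 is just an example state, the same one we look up later):

import gym

env = gym.make("Taxi-v3").env   # the taxi environment we built earlier in the project
env.P[328]                      # reward table for one state: {action: [(probability, next_state, reward, done)], ...}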
Now let us understand a formula. Q is a function of state and action, written Q(state, action). How is it defined? It is defined as one minus alpha times Q(state, action), meaning you are in some current state and are about to take an action, so there is already a value associated with that; we multiply that value by one minus alpha and add alpha times, in brackets, the reward plus gamma times the max of Q over the next state and all actions. This means a future step has also been taken into account: the current value is there, the reward obtained is there, and one step of look-ahead as well, meaning from the next state we consider the action that would give the maximum reward and choose that one. So this is how the formula is used here. What is alpha here? Alpha is the learning rate, and it is always a small value between 0 and 1. Just like in supervised learning settings, alpha is the extent to which our Q-values are updated in every iteration.
Now, how will these Q-values be updated? We want to control this and change the values bit by bit, and that is why we keep alpha as a value between 0 and 1. Gamma is the discount factor. What is the discount factor for? It is for future rewards: you can see that the formula takes the max, and it takes it over the next state and all possible actions, so the value obtainable from the next state and its actions is multiplied by gamma, and gamma is also a value between 0 and 1.
So let us move further. If we make the discount factor zero, the agent basically becomes greedy and works only on immediate rewards. OK, so this is the formula we will use here.
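Written out, the update rule described above looks like this (just a reconstruction of what was said in words, with alpha as the learning rate and gamma as the discount factor):

Q(state, action) = (1 - alpha) * Q(state, action) + alpha * (reward + gamma * max over all actions a of Q(next state, a))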
Now the Q-table is formed somewhat like this. Here you can see the Q-table: there is a row for every state, from zero to four hundred ninety-nine, so five hundred rows in total, and a column for each action. Every action starts with a reward of zero in the beginning; the Q-table values are initialized to zero and then updated during training to values that optimize the agent's traversal. So initially everything is zero, and as training progresses the value gets updated after every action. Now let us sum it up again.
So first of all we will initialize the Q-table with zeros. After that, from the current state we will start exploring each action, and from the possible set of actions we will decide whether we should move to a particular state or not.
We will go to the next state, and after reaching it we will again check what the next possible actions are from this state and which states they lead to, and among them we will choose the action that gives the highest Q-value and store that value.
Then we update the Q-table values using the equation provided above, storing the new values with the help of alpha and gamma. Next we set the next state as the current state. Now, if the goal has been attained, we are done; otherwise, if the goal has not been attained, we iterate and run it again. So this is how the algorithm runs; a small sketch of it follows below.
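As a minimal sketch of the first and fourth steps above, assuming a NumPy Q-table indexed as q_table[state, action] and purely illustrative values for alpha and gamma (the actual values are set later in the code):

import numpy as np

# Step 1: initialize the Q-table with zeros, one row per state and one column per action
q_table = np.zeros([500, 6])     # the taxi environment has 500 states and 6 actions

# Step 4: update one entry using the Q-learning equation
def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.6):
    old_value = q_table[state, action]        # current estimate Q(state, action)
    next_max = np.max(q_table[next_state])    # best value reachable from the next state
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)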
So now we exploit the learned values: whatever rewards and values we have obtained, we exploit them, we use them, and that is the benefit we get from Q-learning. So let us begin. First of all we will import numpy: import numpy as np. After importing, the next step is to initialize the Q-table: q_table = np.zeros(...), note the spelling Z-E-R-O-S, and inside it we pass one list containing env.observation_space.n and env.action_space.n, so from the observation space and the action space we make a matrix. Let me take this, and here we can see the Q-table: type q_table and you will see it is all zeros because of np.zeros, just as we were reading in the document. Now let us train it. We use %%time at the top to measure the training time, then import random and from IPython.display import clear_output; this IPython display is used for the graphical display. The hyperparameters we have are alpha, gamma and epsilon; we have already seen alpha and gamma, and we will see epsilon as well. Then see all_epochs and all_penalties: for these we have taken two lists. Now, for i in range: this is basically how many times you want to run this, and counting the places, ones, tens, hundreds, thousands, ten thousands, lakh, ten lakh, it will run ten lakh times, so our loop will run for that many iterations. And inside it we have state = env.reset().
So we have reset the environment, epochs, penalties and reward have been set to zero, zero, zero, and done has been set to False.
Now, while not done: if random.uniform(0, 1) is less than epsilon, that is, if the random number drawn between zero and one is smaller than epsilon, then action = env.action_space.sample(), so we take a random action. And if this condition is not met, then action = np.argmax(q_table[state]); the Q-table tells us whether a reward is being attained or a penalty is being imposed. So the action is taken in two ways: one is random, and the other is np.argmax over q_table[state], passing the state there. The first part is the exploration part, when we are exploring, and the second part is exploitation, using what we have already observed. Next, next_state, reward, done, info: this is the same thing we read in the document, that env.step gives us the next state, the reward, whether we are done or not, and information about performance as well.
Now we have old_value = q_table[state, action], so the current value goes into old_value, and next_max = np.max(q_table[next_state]), from which we get the next max. Then new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max); see, we have applied that same formula here. After that we update our Q-table with the new value, so in this way the zeros in the Q-table will gradually get updated and each entry will settle to a justified value.
Now, if reward == -10, then penalties += 1; this is from the earlier code we had written, and you can see it is going to count the penalties. Then state = next_state, so the next state becomes the current state, and epochs += 1. So let me take this piece of code, put it there and execute it, and now you will see how the learning is being performed. See, the episodes are running here: eight hundred, eleven hundred, and so on; the learning will be performed for as many iterations as the range we specified, so let it run. In the output we will get how many episodes were run, whether training finished or not, and the wall time, meaning how much time it took; we used %%time above, so we will be able to see the wall time. We will observe that, and after that we will put a particular state into the Q-table and see whether its values change or not. This is still running, so we will let it run. Here we will check q_table[328], that is, for state 328 we will check whether its values have been updated or whether they are still all zeros. A consolidated sketch of this whole training loop follows below.
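Putting the walkthrough above together, a consolidated sketch of the training loop looks roughly like this; the hyperparameter values and the episode count written here are only placeholders for illustration, and the old Gym step/reset API is assumed, as in the project:

import random
import numpy as np
import gym

env = gym.make("Taxi-v3").env                                       # taxi environment from earlier
q_table = np.zeros([env.observation_space.n, env.action_space.n])   # Q-table initialized to zeros

alpha, gamma, epsilon = 0.1, 0.6, 0.1     # illustrative hyperparameter values

for i in range(1, 100001):                # number of training episodes (placeholder count)
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()    # explore: take a random action
        else:
            action = np.argmax(q_table[state])    # exploit: take the best known action
        next_state, reward, done, info = env.step(action)
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        if reward == -10:                         # an illegal pickup or drop-off gives a -10 penalty
            penalties += 1
        state = next_state
        epochs += 1

print("Training finished.")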
That is done now; the wall time is one minute ten seconds. Now we can execute it, and after execution you will see that yes, the values have changed, the values have been updated. Now we will evaluate the agent. To evaluate it we take total_epochs, total_penalties, episodes = 100, and for _ in range(episodes): state = env.reset(), epochs, penalties, reward = 0, 0, 0, and done = False.
Now, while not done, the action is found the same way, there is no change: we take it from np.argmax. Then state, reward, done, info we get from env.step; again we track the reward and penalties, increase epochs, and then we will check the total epochs and penalties, as in the sketch below.
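And a minimal sketch of this evaluation, assuming env, np and the trained q_table from the training sketch above:

total_epochs, total_penalties = 0, 0
episodes = 100                                    # evaluate over a hundred episodes

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    while not done:
        action = np.argmax(q_table[state])        # always exploit the learned values
        state, reward, done, info = env.step(action)
        if reward == -10:
            penalties += 1
        epochs += 1
    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")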
So let us quickly work on it and evaluate our agent here; this is our evaluation part. It tells us the results after a hundred episodes: the average timesteps per episode are twelve point nine one, and the average penalties per episode, you can see, are zero point zero. If the penalties are zero point zero, that is wonderful, right? So here we can see that because of reinforcement learning the penalties are zero, and after a hundred episodes the average timesteps per episode are twelve point nine one, meaning roughly thirteen timesteps per episode.
So here we understand that if we apply Q-learning in this way, our agent learns quickly, reaches the goal in fewer timesteps, and instead of penalties we get rewards. So this was the benefit of Q-learning. Whether we talk about the average number of penalties, the average number of timesteps per trip, or the average reward per move, all of these were problematic in the earlier model without learning: the average number of penalties and the average number of timesteps per trip were high, and the average reward per move was also varying by quite some measure, and reinforcement learning can handle all of this. So that is it; here we have implemented the entire project and seen how reinforcement learning can help. You can try this, then plan and try more such exciting projects, and wherever you need help you can ask on the forum. Thank you very much, keep learning and remain motivated. So friends, let us conclude here today; we will end today's session here and see the parts ahead in the next session. So keep learning, remain motivated, thank you.