In this machine learning tutorial we continue the cab service project. In this project the cab learns by itself how to move in the environment, how to pick up the passenger in the minimum number of time steps, and how to drop the passenger off at the right location.
First we will try to solve the environment without reinforcement learning and simply run it as it is. We start by setting env.s = 328. In the previous recording we built the complete state and saw that it came out as state 328, so here we pass that number directly.
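As a side note on where a number like 328 comes from: assuming the environment is Gym's Taxi environment, a state is a single integer that packs the taxi row, taxi column, passenger location and destination together. The sketch below re-implements that packing in plain Python. The function names and the example coordinates (row 3, column 1, passenger index 2, destination index 0) are assumptions for illustration and are not shown in this recording.

```python
# Assumed layout: 5 rows x 5 columns, 5 passenger locations
# (4 stops plus "inside the taxi") and 4 destinations,
# which gives 5 * 5 * 5 * 4 = 500 possible states.

def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four components into one state number."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_index
    state = state * 4 + destination_index
    return state


def decode(state):
    """Unpack a state number back into its four components."""
    destination_index = state % 4
    state //= 4
    passenger_index = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
```

Under these assumptions, writing env.s = 328 is a shortcut for placing the taxi, passenger and destination in one step.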
So let us continue. We set env.s = 328, which is the state. After that we set epochs = 0, and then we have penalties and rewards, which we assign together as 0, 0. That much is done, so now let us move on to frames.
The frames list stores the details of each step and will help us animate, that is, visualize, the run. So we create an empty list, frames = []. After this we set done = False. Why False? Because done tells us whether the taxi has made a correct pickup and a correct drop-off. Once it has done both, done becomes True; until then it stays False.
Now let us look at the loop: while not done, action = env.action_space.sample(). I have an action space, and sample picks any action from it at random. That is what this means.
So inside while not done we write action = env.action_space.sample(). In this way, on every pass through the loop, action is assigned a randomly sampled value from the action space.
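To make the random choice concrete, here is a small sketch that does not need the environment at all. It assumes the six-action layout of Gym's Taxi environment (four moves, pickup, drop-off) and uses Python's random module as a stand-in for env.action_space.sample(). The names ACTION_NAMES and sample_action are hypothetical.

```python
import random
from collections import Counter

# Assumed action order; the index is what the sampler returns.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def sample_action(rng):
    """Stand-in for env.action_space.sample(): a uniform random action index."""
    return rng.randrange(len(ACTION_NAMES))


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(ACTION_NAMES[sample_action(rng)] for _ in range(6000))

for name in ACTION_NAMES:
    print(f"{name:8s} {counts[name]}")

# Share of samples that were pickup or drop-off attempts (close to 2/6).
pickup_dropoff_share = (counts["pickup"] + counts["dropoff"]) / 6000
print(round(pickup_dropoff_share, 2))
```

The sampler does not look at the state at all, so it attempts a pickup or drop-off about one time in three wherever the taxi happens to be.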
Next we have state, reward, done, info. Where do we get these from? From env.step. We saw earlier that env.step performs an action for one time step and returns state, reward, done and info, so we take those four values here.
So we write state, reward, done, info = env.step(action). I have executed it. There was a full stop where there should have been a comma; with that fixed, everything runs correctly so far. Let us move ahead: if reward == -10, we increase penalties by one.
So if reward == -10, we write penalties += 1, and with that we have started counting penalties as well. Next we put each rendered frame into a dict for animation. Every time the while loop runs, one dictionary gets appended, so the list grows for as long as the loop runs. We write frames.append, and inside append we create a dictionary that holds the frame and the state along with the other details of the step.
Now we move on to epochs. Every time the loop runs, the epochs count should also increase, so we add an update for epochs. At the end we print how many epochs there were and how many penalties were incurred. Notice that we have not done anything like learning here; we have only run the environment. When we execute it we get an error that frames is not defined. That is because I had named the list frame, so we change it to frames. Now the output shows zero time steps taken and 52 penalties incurred. We can execute it multiple times. We executed it a second time and epochs is again zero. It is not changing, so there must be a reason. Let us look at the while not done loop.
Here we have rewards and here we have epochs, and those are fine. Our epochs were not incrementing because we had to add one here, epochs += 1. That is why the count stayed at zero. Now look: time steps taken 1885, and penalties 634. Let us execute it again. This is random, so it changes every time: see, 2504 time steps and 799 penalties.
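Putting the steps so far together, the loop can be sketched as below. So that it runs without any extra installation, it uses a small hypothetical stand-in, ToyTaxiEnv, instead of the real environment: a taxi in a corridor of five cells, with the passenger at cell 0 and the destination at cell 4. The stand-in copies the old four-value step interface and the reward convention assumed for Gym's Taxi (-1 per step, -10 for a wrong pickup or drop-off, +20 for a correct drop-off). Only the loop at the bottom mirrors what is typed in this recording.

```python
import random


class _ActionSpace:
    """Minimal stand-in for env.action_space with a sample() method."""

    def __init__(self, n, seed):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.randrange(self.n)


class ToyTaxiEnv:
    """Hypothetical toy environment, NOT the Gym Taxi class."""

    ACTIONS = ("left", "right", "pickup", "dropoff")

    def __init__(self, seed=0):
        self.action_space = _ActionSpace(len(self.ACTIONS), seed)
        self.s = 0  # state = position * 2 + (1 if the passenger is in the taxi)

    def render(self):
        position, in_taxi = divmod(self.s, 2)
        cells = ["."] * 5
        cells[position] = "T" if in_taxi else "t"
        return "".join(cells)

    def step(self, action):
        position, in_taxi = divmod(self.s, 2)
        reward, done = -1, False
        name = self.ACTIONS[action]
        if name == "left":
            position = max(position - 1, 0)
        elif name == "right":
            position = min(position + 1, 4)
        elif name == "pickup":
            if position == 0 and not in_taxi:
                in_taxi = 1
            else:
                reward = -10  # illegal pickup
        else:  # dropoff
            if position == 4 and in_taxi:
                reward, done = 20, True
            else:
                reward = -10  # illegal drop-off
        self.s = position * 2 + in_taxi
        return self.s, reward, done, {}


env = ToyTaxiEnv(seed=0)
env.s = 6  # set the start state directly, like env.s = 328 (here: cell 3, no passenger)

epochs = 0
penalties = 0
frames = []  # one dict per step, used later for animation
done = False

while not done:
    action = env.action_space.sample()  # random action, no learning
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1

    frames.append({
        "frame": env.render(),
        "state": state,
        "action": action,
        "reward": reward,
    })

    epochs += 1  # without this line the count stays at zero

print(f"Timesteps taken: {epochs}")
print(f"Penalties incurred: {penalties}")
```

In the two runs reported above, roughly a third of the steps were penalised (634 of 1885 and 799 of 2504). That is about what one would expect if two of six equally likely actions are pickup and drop-off attempts and nearly all of them are illegal.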
So the environment has been set up, but when we run this approach on it, it does not perform well. It takes a lot of time steps and incurs a lot of penalties. Let us display the run once; for that I am taking the whole piece of display code and putting it here. We use IPython.display: from IPython.display import clear_output, and from time import sleep. Then def print_frames, to which we pass frames. Inside it we enumerate frames, which gives us the index i and each frame. Then clear_output(wait=True), that's fine. Then we print the time step, the action, the state and the reward, and that is how the animation is produced. Here the error says the str value has no getvalue. The getvalue call comes from the older version of the library and is not there in the new one, so I will comment it out. Now you can see it is showing us all the values, but the frame itself is not being printed. That is because the print for the frame is the line with .getvalue, which is now commented out, so I will check that part.
Let me check why it is not working that way. So frame.getvalue, the getvalue function; I am uncommenting it again. Now you can see it runs continuously, and it will keep running until all the time steps are completed. There are 2504 of them, so until all 2504 finish it will keep executing.
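One possible way around the getvalue problem is sketched below. It assumes that a stored frame may be either a StringIO object (which has getvalue, as older library versions appear to have returned) or a plain string (as newer versions appear to return). The helper frame_text is a hypothetical name, and the two demo frames are made up purely to exercise the function. If IPython is not installed, a do-nothing fallback replaces clear_output.

```python
import io
from time import sleep

try:
    from IPython.display import clear_output
except ImportError:
    def clear_output(wait=False):
        """Fallback when IPython is not installed: do nothing."""


def frame_text(frame):
    """Return the rendered frame as text, whether it is a StringIO or a str."""
    if hasattr(frame, "getvalue"):
        return frame.getvalue()
    return str(frame)


def print_frames(frames, delay=0.1):
    """Replay the stored frames one by one, clearing the output in between."""
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame_text(frame["frame"]))
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(delay)


# Two made-up frames, one in each format.
demo_frames = [
    {"frame": io.StringIO("old-style frame"), "state": 328, "action": 1, "reward": -1},
    {"frame": "new-style frame", "state": 228, "action": 4, "reward": -10},
]
print_frames(demo_frames, delay=0)
```

Checking for getvalue at run time keeps the same display code usable with either kind of frame, so nothing has to be commented in and out.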
In the meantime let us search for this on Google. Maybe a new function has replaced it that we can use right away; if not, we can leave it here for the time being. So here, in reinforcement learning, yes, this is the error it is pointing to, and this is the error we want to resolve. Whenever you are implementing a project, you can search for errors like this along the way and resolve them with the help of various platforms. You can also discuss them on the discussion forums of our platform, LearnVern.
Let us look at this example: I tried to change the name. OK, this is a different problem. It is related to ours, but it is not the same. So I do not see any ready-made solution that we can use, but no worries.
So what we have understood is this: there is no learning factor here, but when we run this approach we can see how it behaves, step by step, in text form. We will let it run to completion and then move ahead to the next part, according to our action plan.
It is taking many time steps, so obviously this approach is not recommended. What we want is for the taxi to learn correctly and give us its output on the basis of that learning. So the next part is reinforcement learning: we will bring it in and use it to make the whole process smoother and easier. This run will continue until step 2504, so keep watching until then. We will implement reinforcement learning in the next project discussion. Thank you very much.