Namaskar, I am Kushal from LearnVern.
In the last tutorial we understood the project scenario, and now we will implement it. To implement this self-learning taxi we are going to use OpenAI Gym, which gives us the Taxi-v3 environment (earlier versions used Taxi-v2). I will code it here in parallel, and we will keep taking the reference from this document. You can see here that the first step is pip install cmake 'gym[atari]' scipy, so we will have to install these packages; if they are already installed, that is even better. So let us install them once.
So here we run pip install cmake, then the package name in quotes with square brackets, 'gym[atari]' (G-Y-M, atari), and then scipy. This is executing now, and while it runs we will understand the next steps.
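For reference, the install command from the document looks like this (run it in a terminal or notebook cell; the exact package spelling may differ slightly depending on your pip and gym versions):

pip install cmake 'gym[atari]' scipy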
Once it is installed, the next step is to load the game environment and render what it looks like. To load it we do import gym, then env = gym.make(), pass "Taxi-v3" as the parameter, and take its .env attribute; after that env.render() will display the environment. Here it is telling us that the requirement is already satisfied, so we do not need to install scipy again and we only need to import gym. So let's import gym, and after importing, the second step is gym.make(), where we make the environment for Taxi-v3. We have to write the name exactly as it is written in the document so that our execution is proper, and then call .env on it. Now, with its help, when I do env.render(), the environment will be displayed in front of me.
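A minimal sketch of what we just typed, assuming the classic gym API where render() prints the grid to the console:

import gym

# Create the Taxi-v3 environment; .env unwraps the default wrapper
# so training is not cut off after 200 steps (we come back to this below).
env = gym.make("Taxi-v3").env
env.render()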
Now it says env is not defined, so let me cross-check this part. Here the assignment to env was the mistake I had made; it should be env =. So now env.render() will display the complete environment, and you can see that the complete environment has been created here. Let us move forward. What has happened here is env.render(): env is a unified environment interface, and it helps us create and interact with an environment. env has some functions, like env.reset(); when you call env.reset() it resets the environment and returns a random initial state. If you look at the diagram in the document you will see a different initial state: the taxi starts in a different place, and the pink and blue letters, B and R, are at different locations, so this is how it initialises the environment randomly. Next is env.step(action): when we call env.step() it takes an action, which we have to pass here, and by taking that action a new state comes up and a set of possible actions from that next state opens up, so the environment moves forward by one time step. It returns several things. The observation is what the agent observes of the environment. The reward, as we have seen earlier, is positive if the taxi moves in the right direction, and it pays a penalty if it takes more time. done means that it picked up the passenger successfully and also dropped them off; if it has done both pick-up and drop-off, that means done, and we will see this ahead.
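A minimal sketch of the reset/step cycle, assuming the classic gym API that returns a four-tuple from step() (the action value 1 here is just a hypothetical example):

state = env.reset()                            # random initial state
next_state, reward, done, info = env.step(1)   # take one action, move one time step
print(next_state, reward, done, info)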
Now, info is additional information, such as performance and latency; this performance and latency information is used at the end, to check what the performance and the latency were.
OK, so we use .env at the end of make() to avoid the training stopping after two hundred iterations. Now let us move further and discuss our problem again. Our problem has four locations, labelled with different letters. Our job is that the cab should pick up the passenger and, after picking up, drop them off at a different location, and both locations should be correct: the pick-up should happen at the pick-up location and the drop-off at the drop-off location. For a successful drop-off we receive plus twenty points, and we lose one point for every time step, so we keep losing points if we are not moving in the right direction. In this way our complete environment will work.
So let us now call env.reset() and env.render(): first env.reset(), and after that env.render(), so we have rendered it as well. Now let me execute this, and after execution you can see what kind of locations have been displayed; in this run the taxi is overlapping with a pick-up or drop-off location, it is already standing there. In this way we can completely render our environment and display it.
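In code this is just the two calls again (the layout you see will differ between runs, because reset() picks a random initial state):

env.reset()    # new random initial state
env.render()   # draw the grid for this state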
Now let us look at the action space, and along with it the state space: env.action_space and env.observation_space. So let's print them: print the action space with env.action_space (it is an attribute, not a method), and in a similar manner print the second one, the observation or state space, with env.observation_space. Here you can see that we have a discrete action space of six actions and an observation space of five hundred states.
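A minimal sketch of those two prints (the label strings are just ours):

print("Action Space {}".format(env.action_space))        # Discrete(6)
print("State Space {}".format(env.observation_space))    # Discrete(500)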
We had read this a while ago in the document as well, how many states and actions each of our spaces has.
So let us move forward. Here you can see the filled square in yellow: the filled square represents the taxi, and if a passenger is sitting in it then it turns green, so green means the passenger is in the taxi. The pipe symbols here are barricades, the walls we had seen above, and the next thing is R, G, Y, B. This R, G, Y, B is important, so let us understand it at the beginning itself.
R, G, Y and B are the pick-up and destination locations: the blue letter shows the current pick-up location and the purple letter shows the current destination. So blue means where the passenger is picked up and purple means the destination where the passenger has to be dropped. Now let us move ahead and look at the actions 0 to 5: 0 means south, 1 means north, 2 means east, 3 means west, 4 means pick-up and 5 means drop-off, so the actions are numbered in this way too. Now we have to find an optimal policy which, taking the least number of actions, gets us to the destination; that is the objective we have. So we have an important objective here: how we can optimise, and after optimising, how the taxi can quickly pick up a passenger and also drop them off, collecting rewards and avoiding penalties.
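To keep that numbering handy while coding, here is a hypothetical lookup we can define for ourselves (the environment itself only understands the integers 0 to 5):

# Our own helper mapping, not part of gym
action_names = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}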
So till here we have discussed one part, and now we will move towards the next part. Our illustration is already there, and we will initialise it with a specific state. Just see here: we will initialise it with (3, 1, 2, 0), so I write state = env.encode(3, 1, 2, 0). Now what is this (3, 1, 2, 0)? The 3 and 1 are the row number and column number of the taxi; in the initial state also you must have seen that 3, 1 was the taxi's row and column, so let us put 3 comma 1 here in encode. The 2 and 0 are the passenger index and the destination index: we have four coloured locations, the passenger is at location 2 and will get down at location 0. If you initialise it this way and print the state, you can see here it is 328, so out of the 500 states this is the 328th state. Now you can do env.s = state; env has a variable s, so we set env.s = state, and with that it has been initialised. When you do env.render() it will show the environment exactly in the manner that we had expected. So in this manner we can set the environment according to our wish.
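A minimal sketch of that initialisation, following the document's (3, 1, 2, 0) example:

# encode(taxi_row, taxi_col, passenger_index, destination_index)
state = env.encode(3, 1, 2, 0)
print("State:", state)   # 328 in this example

env.s = state            # force the environment into that exact state
env.render()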
Now let us move forward. We have a reward table that we get from env itself: env.P. What is the use of this reward table that is there from the beginning? Its use is to tell you, for the state you are in, which actions you can take, and on taking each action whether you will get a reward or be imposed with a penalty. So from env.P you basically understand, for the state you are in, how many actions are possible from that state, whether performing each possible action will get a reward or a punishment, and whether after that action you reach the destination, that is, whether done will be true or false. All these things have to be taken care of here.
Now see here: there is the action, and for each action there is the probability of that action, what the next state will be, what the reward is, and done or not done; these are the details we are going to get. Let us understand a little more: these actions 0 to 5 are south, north, east, west, pick-up and drop-off. The probability will always be one or below, and in this environment the probability always remains one.
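A minimal sketch of inspecting the reward table for our initialised state 328; each entry has the form {action: [(probability, next_state, reward, done)]}:

# Transition/reward table for state 328
print(env.P[328])
# For every action 0-5 we get a list of (probability, next_state, reward, done)
# tuples; in this environment the probability is always 1.0.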
The next state is the state we will have to work from next: you took an action, after that action the environment went to the next state, and from that next state we will go ahead if it is not the goal state.
Now, here we have stated that we should give a reward of minus one, as a penalty, for any action that is not taking you to the destination. In a similar manner, make it minus ten if you drop the passenger at a wrong place. Strategies like these, that we are setting up, will help the agent learn. So let us move forward, and moving forward we will see how to solve the environment without reinforcement learning.
So this is all for this video. In the next video we will see how we can solve this without reinforcement learning, and after that we will see it with reinforcement learning. So thank you very much, keep watching, and let us work on this project and complete it.
If you have any queries or comments, click the discussion button below the video and post them there. This way you will be able to connect with fellow learners and discuss the course, and our team will also try to solve your query.