Hello, I am (name) from learnvern…( 8 seconds pause ; music )
In the previous Machine Learning tutorial, we learnt the FP Growth algorithm, which is part of Association Rule mining. It identifies frequent patterns, and one of its applications is making recommendations.
Now we are going to study a new topic: Reinforcement Learning. We touched on reinforcement earlier, in the introduction tutorials, with the example of a human being who goes to a new place and adjusts to it without any past experience. Even in a new situation, a person takes a first action, immediately learns from it, and then takes the next action. In the same way, we will see how reinforcement learning works as a step-by-step process.
(01/03)
Now, let's move ahead. In reinforcement learning, as in my example of a human being, you can think of the human as an agent in an environment. The environment is the surroundings we are associated with. Right now I am sitting in this recording studio, so this is my environment. If you are sitting in your study room, that is your environment, and if you are traveling, then the bus you are sitting in is your environment. So the agent is whoever performs the actions: a human being, or, if robots come to mind, a robot. Its surroundings are the environment. Within that boundary we study how the agent behaves and learns. It does this by performing actions, and for these actions it receives rewards or penalties.
So, let's understand this in detail. In this diagram you can see an agent performing an action on the environment. In return, the environment gives back a new state and a reward. You may be confused about what exactly is going on, what an agent is, what a state is, what actions are, so I will now explain each of these one by one in detail.
Let's start with the agent. The agent is the entity that perceives the environment and explores it. Perceiving is what we do with our senses: we see with our eyes, hear with our ears, and smell with our nose. In the same way, a device or an algorithm can take in external data and perceive it. While perceiving, it also explores: what possibilities are there, and what new states can it reach? This is called an agent.
Next is the environment. The surroundings in which the agent is present are called the environment. It can be anything. For an automatic taxi, the environment is everything around the taxi: roads, footpaths along the roads, traffic signals, and other vehicles. Next is the action. The taxi has to take actions: start, stop, move forward, apply the brakes, turn. These are all moves or actions that the taxi will take. If I, as a human being, want to go to a particular place, first I get ready, which is an action, then I take steps forward, then I step out of the house. These are all actions. Our programs work in the same way. In the beginning we will do simple programming, and as your level increases you can make the programs more complex.
04:02
Let's move forward to state. If I am currently sitting in this position, that is one state. If I change my position, that becomes another state. So a state is a particular situation. If I fold my hands, that is one state; if I keep my hands like this, it is a different state. In the taxi example, the taxi standing still is one state, the taxi moving is another state, and the taxi moving at a higher speed is yet another state. So actions cause changes in the environment, and each change results in a new state, OK.
Let's move ahead now to reward. Every action is connected to some reward, so you will always get one. Suppose I am walking on the road: I am the agent, the surroundings are the environment, and each step I take is an action. Now there is a pothole ahead that I do not see. What happens? I get hurt, which is a negative reward, or a penalty. So there is always a reward, but it can be positive or negative. In the same way, a taxi that does not obey a traffic signal may collide with another vehicle or get a ticket; the collision or the ticket is the reward here, a negative one. If instead the taxi brakes properly at every red light, then, since everything is automated, the RTO or the transport office could grant it a reward, for example by waiving its toll because it always follows the traffic rules. So rewards can be negative or positive, and they are a consequence of actions.
6:00
So, let's move ahead to policy. A policy is the strategy used to choose an action. A taxi, too, needs a strategy for when to move and when not to move, and why. This reasoning is called the policy: in each state it helps us decide which action to take, and so it moves us from one state to the next. The objective is to reach a predefined state, which we call the goal state, and the strategy adopted to reach it is the policy.
Next is value. Value is the return we expect to get in the long term. Every step gives some reward, but value looks at the whole, over the long term. For example, if you keep putting in effort toward a goal, every day brings some improvement, and the long-term value is the goal you are eventually able to achieve. We can also work out a value for each individual step: the value of taking a particular action in a particular state is called the Q value.
7:18
So let's move ahead and see how reinforcement learning works, step by step. It is simple, and you have already understood most of it. First, observe the environment. In the taxi example, that means checking whether there are barricades, whether there are two lanes or three, how many signals there are and at what distance, and what turns lie ahead. Then decide how to act using some strategy. Initially there may be no strategy at all; it develops gradually. If you do have a strategy, use it to choose your actions, and then act accordingly.
As you take actions, you get rewards or penalties, and you learn from them: this went wrong, this went right and earned a reward. After learning, refine your strategy and act again. Iterate until an optimal strategy is found, OK. Once you have an optimal strategy, you can stop.
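The observe, decide, act, learn, refine loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not something shown in the tutorial: the `Corridor` environment (five positions, goal at the last one), the +1/-1 reward scheme, and the 20% chance of trying a random action are all assumptions made for the example.

```python
import random


class Corridor:
    """Hypothetical environment: positions 0..4, the goal state is position 4."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (step back) or +1 (step forward)
        new_position = min(4, max(0, self.position + action))
        # Reward +1 for moving closer to the goal, otherwise a penalty of -1
        reward = 1 if new_position > self.position else -1
        self.position = new_position
        done = self.position == 4
        return self.position, reward, done


def run(episodes=50, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    actions = [-1, +1]
    # The strategy starts empty: average reward seen so far per (state, action)
    avg = {(s, a): 0.0 for s in range(5) for a in actions}
    count = {key: 0 for key in avg}

    for _ in range(episodes):
        state = env.reset()                              # observe the environment
        done = False
        while not done:
            if rng.random() < 0.2:                       # sometimes try something new
                action = rng.choice(actions)
            else:                                        # otherwise follow the current strategy
                action = max(actions, key=lambda a: avg[(state, a)])
            next_state, reward, done = env.step(action)  # act, then receive reward or penalty
            count[(state, action)] += 1                  # learn: update the running average
            avg[(state, action)] += (reward - avg[(state, action)]) / count[(state, action)]
            state = next_state                           # repeat from the new state
    return avg


learned = run()
for s in range(4):
    print(s, learned[(s, -1)], learned[(s, +1)])
```

After a few episodes the averages favor stepping forward in every position, which is the refined strategy for this toy environment.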
8:32
Let's move forward and look at some of the characteristics. First, there is no supervisor. In supervised learning there is labeled data that guides the learning; here there is nothing like that. There are only rewards, numbers that act as signals, so learning works without supervision. Second, decision making is sequential: you take an action, and on that basis you take the next decision, and then the next. Decisions are taken one after another, not in parallel. Third, time is very important. Think of the mapping applications used for commuting in the real world: if learning takes too long, you may incur a penalty before you have even taken an action.
The next point is feedback. Feedback is delayed: the reward for an action may arrive only much later, and since learning is based on feedback, this is important to keep in mind. Finally, the agent's actions determine the data it receives subsequently.
So, let's move ahead to some concepts, OK. First, the greedy action. A greedy action is the action with the largest estimated value, that is, the one that currently appears to bring us closest to the goal. The algorithm can choose between a greedy and a non-greedy action. Why would an agent not choose the largest estimated value? There is a reason. It can either take the greedy option and pick the maximum value, or sacrifice that in the expectation that a different action yields more information, so that it may be able to take a better action later. That is the non-greedy choice.
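The choice between a greedy and a non-greedy action can be sketched as follows. The tutorial does not name a specific rule for when to go non-greedy; the epsilon-greedy rule used here (pick a random action with a small probability `epsilon`) is one common option and is an assumption of this sketch, as are the example value estimates.

```python
import random


def greedy_action(values):
    """Return the index of the action with the largest estimated value."""
    return max(range(len(values)), key=lambda a: values[a])


def epsilon_greedy_action(values, epsilon, rng):
    """With probability epsilon pick any action at random, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return greedy_action(values)


estimates = [0.2, 0.8, 0.5]  # hypothetical estimated values for three actions
rng = random.Random(0)

print(greedy_action(estimates))  # 1
picks = [epsilon_greedy_action(estimates, 0.3, rng) for _ in range(1000)]
print([picks.count(a) for a in range(3)])  # mostly action 1, occasionally the others
```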
Now, after this comes exploration. Exploration is the activity by which an agent deepens its knowledge of the environment, and it pays off in the long term. Alongside it is exploitation. In exploitation the agent does not look for new knowledge; it takes the action that, as far as it currently knows, gives the largest reward, which may only be a short-term gain. Now let's move to a new concept called the Upper Confidence Bound, or UCB. The principle is this: often there is uncertainty about whether or not to take an action. Say you are walking and there is a pothole or some muddy water in front of you, and you are unsure which way to go. Even under that uncertainty you can decide by being optimistic. So UCB means thinking optimistically: assume that an action you are uncertain about may turn out to be a good one.
So let's see how it works. It applies a formula to compute an upper confidence bound for each action and takes the action whose upper bound is the highest. This is how it helps us in reinforcement learning.
12:10
In this diagram you can see a confidence interval for each action, with a lower bound and an upper bound. We compare only the upper bounds and pick the action whose upper bound is the highest.
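The tutorial does not spell out the formula, so the sketch below uses the widely known UCB1-style bound as an assumption: the average reward of an action plus a bonus of `c * sqrt(ln(t) / n)`, where `n` is how often the action has been tried and `t` is the total number of tries. The constant `c` and the example numbers are illustrative.

```python
import math


def ucb_action(avg_reward, counts, t, c=2.0):
    """Pick the action with the highest upper confidence bound (UCB1-style)."""
    # An action that has never been tried has unlimited uncertainty: try it first
    for action, n in enumerate(counts):
        if n == 0:
            return action
    bounds = [
        avg_reward[a] + c * math.sqrt(math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(bounds)), key=lambda a: bounds[a])


# Action 0 looks slightly better on average, but action 1 has been tried only twice,
# so its upper bound is much higher and the optimistic choice is action 1.
print(ucb_action([0.6, 0.5], [100, 2], t=102))  # 1
```

Once both actions have been tried equally often, the bonus terms are equal and the choice falls back to the higher average.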
Come, let us move ahead to one more concept: Thompson Sampling. It again involves the two things we discussed, exploration and exploitation. Exploration means trying things out to learn more, and exploitation means not exploring but taking the action where you currently get the maximum reward. Both are necessary. Exploration alone guarantees nothing: you could keep exploring in the hope of finding a better path and never find one. So a balance between the two is needed, and Thompson Sampling is used for this. The problem it addresses is the multi-armed bandit problem. You may have seen this kind of game, including older computer versions: you play again and again and observe your chances of winning. The same approach to that problem is applied in Thompson Sampling.
Here, performing a new action is exploration: you perform an action you have not tried before, you gain experience from it, and that experience is part of exploration.
And now, after this, training information. When I perform an action, I gain some information from the result. When I play, either I succeed or I fail. If I win, the action is evaluated as right; if I lose, the action is evaluated as not right. So the training information gained from taking actions is what evaluates the actions, OK. We make no assumptions in advance, because the principle of reinforcement learning is that you take an action, receive a reward or penalty for it, and choose the next action on that basis. To apply this principle we have to explore continuously, and this exploration is trial and error. Put simply: you are doing some new work and do not know what the result will be, so you pick an approach and try it. You do not get the expected result, but you get somewhere near it. That is exploration. By repeated trial and error you eventually arrive at a good solution. So the technique is: explore, collect rewards and penalties, and on that basis decide whether to take the same action next time or not. This is how you learn, and better behavior and better outcomes are expected over time. This is the basic principle on which Thompson Sampling works: maximize the rewards, which improves performance and brings us to the goal state.
So, now to the problem itself: the multi-armed bandit. Picture a slot machine with many arms, each of which you can pull to play. "Multi-armed bandit" is the term used for such a machine. The aim is to win, so how can we win? You have to select which actions to take, but when you play for the first time you do not know which actions are good, so how do you choose? Every time you take an action you get a payoff: you either win or lose. You have to record that payoff and keep the record: with these actions I am winning, and with these actions I am not. This you have to preserve, OK.
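Keeping a record of which actions win and which do not can be as simple as two counters per arm. The class and method names below are hypothetical, chosen only for this illustration.

```python
class PayoffRecord:
    """Keeps a win/loss record for each arm (names here are illustrative)."""

    def __init__(self, n_arms):
        self.wins = [0] * n_arms
        self.losses = [0] * n_arms

    def record(self, arm, won):
        # Store the payoff of one play: a win or a loss for this arm
        if won:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1

    def win_rate(self, arm):
        plays = self.wins[arm] + self.losses[arm]
        return self.wins[arm] / plays if plays else 0.0

    def best_arm(self):
        # The arm with the highest observed win rate so far
        return max(range(len(self.wins)), key=self.win_rate)


record = PayoffRecord(n_arms=3)
for arm, won in [(0, True), (0, False), (1, True), (1, True), (2, False)]:
    record.record(arm, won)
print(record.best_arm())  # 1
```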
16:50
As you take actions again and again, a strategy forms in your mind: first take these actions, then those, and if this happens, respond in that way. The question is how to take actions so that our winnings are maximized. This is called solving the multi-armed bandit problem, OK. Now see, every slot machine gives different rewards; not all machines give the same reward. Each reward is drawn from some probability distribution, which makes the choice harder: should I keep playing this machine? It is also possible that the strategy I learned on one machine does not work on a different machine. Reinforcement learning follows the same approach as a person whose experience in one place differs from their experience in another, alright.
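To make the balance between exploration and exploitation concrete, here is a minimal sketch of Thompson Sampling on slot machines that pay out with different probabilities. It assumes win/lose payoffs and the standard Beta-distribution form of the method, which the tutorial does not detail; the win probabilities, the number of rounds, and the function name are made up for the example.

```python
import random


def thompson_sampling(true_probs, rounds=2000, seed=0):
    """Play `rounds` times and return the win and loss counts per machine."""
    rng = random.Random(seed)
    n = len(true_probs)
    wins = [0] * n
    losses = [0] * n
    for _ in range(rounds):
        # Draw a plausible win rate for each machine from Beta(wins + 1, losses + 1).
        # Machines with few plays give widely varying draws, which drives exploration.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n)]
        arm = max(range(n), key=lambda a: samples[a])
        # Play the chosen machine; its hidden probability decides win or loss
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


wins, losses = thompson_sampling([0.2, 0.5, 0.8])
plays = [w + l for w, l in zip(wins, losses)]
print(plays)  # the machine with the highest win probability gets most of the plays
```

Early on all machines are tried; as the record grows, the draws concentrate and the best machine is exploited more and more.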
So, now let us look at some applications. The first one I have written here is Netflix. The images of movies and shows that we see below a title while watching it, or after we have finished, are chosen so that the chance of the user watching the displayed title increases. If you are watching a horror movie, they will show you something similar; if it is a recent release, similar recent titles; if a comedy, more comedies. These are the simplest examples, but the algorithm can help in more complex cases as well.
18:25
Next is bidding and the stock exchange, where you predict what stock prices will be; reinforcement learning helps there too. In traffic light control, it helps predict how much delay there may be. In industrial automation there are bots and machines for which you can devise a smart way to transport and deliver items without manual intervention. So, these are some of the applications that we discussed.
Now, there are multiple algorithms for this, but a famous one is the Q-Learning algorithm. It is a value-based algorithm, meaning that it keeps building up values. How does it build them? It makes a table, or matrix, and works on it through a function written Q(S, A). The agent is currently in some state S and is going to take an action A; the function gives a value for taking that action in that state, reflecting how good the state is that the action leads to. Suppose I want to reach my goal state and I can take action 1, action 2, or action 3: this function tells us which of the three is better.
After this, we make a table and fill it in as we go: this action was taken, it led from this state to that state, and it did or did not bring some benefit. Then we perform another action, see whether we got a reward, and update the Q table again. We keep updating the Q table so that when we encounter the same situation again we have prior knowledge: this action was taken before and earned no reward, or did not lead to the goal state, so there is no need to repeat it. This is how the Q-learning algorithm works. So far we have covered reinforcement learning in theory; next we will see it in practice, and after that we will move on to dimensionality reduction.
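The Q table and its repeated updates can be sketched as below. The update rule is the standard tabular Q-learning rule; the corridor environment, the reward of 1 at the goal, and the settings `alpha` (learning rate), `gamma` (discount factor), and `epsilon` (exploration rate) are assumptions of this sketch, since the tutorial does not give concrete values.

```python
import random


def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical corridor; the goal is the last state."""
    rng = random.Random(seed)
    moves = [-1, +1]                             # action 0 = step back, action 1 = step forward
    q = [[0.0, 0.0] for _ in range(n_states)]    # the Q table: q[state][action]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:           # explore: random action
                action = rng.randrange(2)
            else:                                # exploit: best known action (ties go forward)
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(goal, max(0, state + moves[action]))
            reward = 1.0 if next_state == goal else 0.0
            # Move Q(S, A) toward: reward + gamma * best value available in the next state
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


table = q_learning()
for state, (back, forward) in enumerate(table):
    print(state, round(back, 3), round(forward, 3))
```

In every non-goal state the learned value for stepping forward ends up higher than for stepping back, so looking up the table is enough to choose the better action the next time the same state is encountered.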
So friends, we will stop today's session here and continue in the next session. Keep learning, and thank you very much for watching.