Hello, I am (name) from Learnvern… (8-second pause; music)
Welcome to the machine learning course. This tutorial is a continuation of the last session, so let's get started.
Today we will look at the Upper Confidence Bound. The Upper Confidence Bound is a concept from reinforcement learning. Before we get into the program itself, let me give you an example.
So, in this example we have a small baby robot, and we went out with it. On the way it saw a small dog and started running after it; before we even noticed, it had already run quite far. While running it reached a place where it could no longer see the dog anywhere, but by then it had drained almost all of its battery. Without battery it became unsettled: how would it find its guardian now, since finding the way back needs battery? So what is the challenge? The challenge is that the robot is wondering what to do, and at that very moment it spots a shop right there with charging sockets, where it can charge its battery.
So, this was a lucky moment for the robot. It went there and started charging, but then it observed that the charging was very slow; it would have to stay there for about two or three days before it could be fully charged… So it switched to another socket, but that one charged even slower. Then it tried a third one, and now it was unsettled: what should it do? Check each socket one by one? So, see how difficult this problem is.
This problem is called the multi-armed bandit problem. It means that when you have many choices and every choice gives you a different reward, you get confused about which choice to pick. To solve this, we will now discuss the Upper Confidence Bound. The Upper Confidence Bound says that for each action you could take, you consider two things together: the average reward that action has given you so far, and a confidence bonus for how little you have tried it. You then pick the action with the highest combined value, its upper confidence bound. This is the approach taken by the Upper Confidence Bound, and this is the way it works.
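To make the idea concrete, here is a minimal sketch of a UCB score applied to the charging-socket story. All names and numbers are made up for illustration, and I use the classic `sqrt(2 * ln(n) / n_i)` confidence bonus here; the tutorial's own code later uses a slightly different constant.

```python
import math

def ucb_score(total_reward, times_tried, total_rounds):
    average = total_reward / times_tried                         # exploitation term
    bonus = math.sqrt(2 * math.log(total_rounds) / times_tried)  # exploration term
    return average + bonus

# (total charge gained, times tried) for three hypothetical sockets
sockets = [(3.0, 5), (1.0, 2), (0.5, 1)]
total_rounds = sum(t for _, t in sockets)

scores = [ucb_score(r, t, total_rounds) for r, t in sockets]
best = scores.index(max(scores))  # socket with the highest upper bound
```

Notice that the third socket wins here even though its average is not the best, because it has been tried only once, so its confidence bonus is large: UCB nudges the robot to explore it before settling.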
Let us see an example here. This example is about ads, and the ads have to be optimized: we have some ads, and out of those we want to find which ads are being seen the most, so that is what we are practicing to optimize. So, here let me add this file… I will click on this file, so this is the file and I have uploaded it, OK… (pause 3 sec, clicking). Now here: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt, import math. So these relevant libraries we have imported. Then ds = pd.read_csv(…) loads the file, and ds.head() lets us check whether the file has loaded or not. So this file has been loaded.
After this, the total number of ads is 10: d = 10. Then "the number of times ad i was selected", that is numbers_of_selections, is initialized to [0] * d. See what is happening here: the value zero is given for every ad, so for all ten ads it starts out as zero, alright. Next, "the sum of the number of times ad i was correctly selected", meaning how many times each ad got a reward; that is written here as sums_of_rewards = [0] * d, rewards meaning correct selections, so all ten ads start with a reward sum of zero. And the total number of rounds, how many rounds will there be? N = 10,000 rounds. OK. So this is the complete scenario that has been built here. Now, which ads get selected? We will write the logic for that here, and the result will go into ads_selected.
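The initialization just described can be sketched like this; the variable names follow the narration and are my assumption of what the tutorial's notebook uses.

```python
# Setup for the UCB ad-optimization example
d = 10                            # total number of ads
N = 10000                         # total number of rounds
numbers_of_selections = [0] * d   # times each ad i was selected, all start at 0
sums_of_rewards = [0] * d         # sum of rewards of each ad i, all start at 0
ads_selected = []                 # will record which ad was picked in each round
total_reward = 0                  # running total across all rounds
```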
So now you will see what we are doing here. We have set total_reward to zero, as initially the reward will be zero. After that we have implemented a for loop: for n in range(0, N). Where did this N come from? See above, it came from the total number of rounds. So it runs from zero to N, and because the upper bound is exclusive, it effectively goes from 0 to N-1. Then max_upper_bound = 0, and ad = 0 as well. Next: for i in range(0, d). What is d? The number of advertisements, right. So 0 to d means the inner loop will run ten times. Here we check the number of selections: if numbers_of_selections[i] is greater than zero, we calculate the average reward, and we also calculate delta_i. So the upper bound has two parts, the average and the delta. The average part, just see here, is sums_of_rewards[i] divided by numbers_of_selections[i]; we are taking the complete average so far, and that is the average part. The other part, which you can see here, is delta_i: the square root of 3/2 multiplied by math.log(n + 1), divided by numbers_of_selections[i]. This delta_i part, you will see, is telling us about exploration. OK.
So one part is talking about exploitation and the other part is talking about new exploration. After that, if the condition does not hold, meaning the ad has never been selected yet, we give the upper bound a fixed very large value, so every ad gets tried at least once. Then, if upper_bound is greater than max_upper_bound, we set max_upper_bound = upper_bound and ad = i… So what are we doing in this for loop? In this for loop we are doing exploration, meaning choosing new options, and along with that we are also doing exploitation here. OK… So what does exploration mean? Exploration means trying new options; exploitation means that you keep going with the option you have already chosen. Both are done here: exploit for some time, then explore, then exploit, then explore, in this way. Now the ad that is selected is appended to ads_selected, and numbers_of_selections[ad] is also incremented here, and then sums_of_rewards[ad] += reward. So what happens with this? Basically, every time, at this particular ad's position, the reward gets updated, OK. Then total_reward += reward, so the total reward accumulates and becomes the final reward that is calculated.
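Putting the whole loop together, here is a self-contained sketch of the UCB logic just described. The real tutorial reads the rewards from a CSV file of ad clicks; since that file is not available here, I simulate the dataset with random 0/1 rewards and deliberately make ad index 4 (the "fifth ad") the best, which is purely my assumption so that the example runs on its own.

```python
import math
import random

# --- simulated stand-in for the tutorial's CSV of ad clicks (assumption) ---
random.seed(0)
d, N = 10, 10000
click_prob = [0.1] * d
click_prob[4] = 0.3  # ad index 4 clicks most often in this simulation
dataset = [[1 if random.random() < click_prob[i] else 0 for i in range(d)]
           for _ in range(N)]

# --- UCB loop as described in the narration ---
numbers_of_selections = [0] * d
sums_of_rewards = [0] * d
ads_selected = []
total_reward = 0

for n in range(0, N):
    max_upper_bound = 0
    ad = 0
    for i in range(0, d):
        if numbers_of_selections[i] > 0:
            # exploitation term: average reward of ad i so far
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            # exploration term: shrinks as ad i gets selected more often
            delta_i = math.sqrt(3 / 2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            upper_bound = 1e400  # very large value: try every ad at least once
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            ad = i
    ads_selected.append(ad)
    numbers_of_selections[ad] += 1
    reward = dataset[n][ad]
    sums_of_rewards[ad] += reward
    total_reward += reward

# the most-selected ad should end up being the best one
best_ad = numbers_of_selections.index(max(numbers_of_selections))
```

Over 10,000 rounds, the exploration bonus of the weaker ads shrinks too slowly to keep them competitive, so the loop settles on the best ad while still sampling the others occasionally.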
So let me execute this, and let us see what the total reward is: the total reward is 2,178… And if we see the same thing graphically here, this basically shows the result of the Upper Confidence Bound. As I already told you, out of the existing ads we see which one is performing better, and this one is performing better: the fifth one, the fifth advertisement. That is performing best.
So this is the Upper Confidence Bound. Out of the many choices, we have to balance two things. At first you know nothing, so you explore; with time you come to know that this is an option, that is also an option, you get to know multiple options. Once you know which option pays off, you exploit it, meaning you keep choosing it. So the Upper Confidence Bound works through a combination of both exploitation and exploration. So, let's conclude this video here, and in the next one we will see what Thompson Sampling is. So remain motivated, keep watching, thank you.
If you have any queries or comments, click the discussion button below the video and post them there. That way, you will be able to connect with fellow learners and discuss the course. Our team will also try to resolve your queries.
Share a personalized message with your friends.