Hello, I am (name) from LearnVern. Continuing from the previous tutorial on machine learning, we will move ahead in this tutorial today. So, let's watch it. Today we are talking about Thompson Sampling in reinforcement learning.

We have already seen one more technique, which we call Upper Confidence Bound. How does that former technique work? From whatever you have explored so far, meaning you explored solution 1 and explored solution 2, you take the average of the rewards observed for each, choose the one with the maximum, and start exploiting that best one. So that was one approach.

Now, in Thompson Sampling, the difference in approach is that it works on a probability basis. It does not sample in the same manner every time; it varies the choice based on the rewards that have been received so far, and keeps adjusting accordingly.

So here you can see that we have imported numpy, pandas, and matplotlib, the last one to visualize which ad ends up getting the most chances, and we also import random. Now here we have ds = pd.read_csv(...); this is the same file we used in the previous program, and we will upload the same file that was in use in the previous session. The same file is present here also. So I have loaded it here: the notebook is connecting, the file has loaded, and now let us execute this. You can see that it is the same data of ad 1, ad 2, and so on, the very same data. So let's execute it, and you can see that all the steps are the same.
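For readers following along without the video, here is a minimal sketch of the setup described above. The file name Ads_CTR_Optimisation.csv is an assumption based on the standard ads click-through dataset such sessions typically use; substitute whatever CSV was loaded in the previous UCB session.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

# Assumed file name -- use the same CSV as in the previous (UCB) session.
# Each column is one ad, each row is one round, and a cell value of 1
# means that ad was clicked (rewarded) in that round.
ds = pd.read_csv('Ads_CTR_Optimisation.csv')
print(ds.head())
```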
We set up how many ads there are, the number of times ad i got reward 1, and the number of times ad i got reward 0. Now, as I told you, this works in a probabilistic way. Until now we were noticing how much reward had been accumulated; this one will not look at that. It only looks at whether a reward is being given or not: if a reward is given, count a 1, and if not, count a 0. That is the approach it works on. So we track the number of times ad i got reward 1 and the number of times ad i got reward 0, alright. The total number of rounds is the same. We have the ads selected, the total reward starting at 0, and the numbers of selections initialized as [0] * d, where d is 10, for the rewards we have.

Then, for n in range(0, N): max_random is initially 0, and ad is also 0. And here you can see where this probabilistic approach comes from: it comes from betavariate, which takes two parameters, alpha and beta, and we draw a sample from it. What happens here is that the expected value works out to alpha divided by (alpha + beta); alpha means the number of successes, so this is the number of successes divided by the total number of trials, successes and failures together. That is where the probability comes from, OK.

So let us execute it. Here it has executed; let us go to the next cell. The total reward attained is 2608, and here you can see that the maximum has been attained by the fifth ad. In the numbers of selections all those values are there, and you can see this one is about 8000, which is the largest. In this way, by doing Thompson Sampling, we help in reinforcement learning. So implement it and try it on your own dataset, or by making some dummy dataset.

So friends, let's conclude here today; we end today's session here, and the remaining parts we will see in the next session. So keep learning, remain motivated, thank you.
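As a reference for this walkthrough, here is a minimal sketch of the Thompson Sampling loop the narration describes, assuming N = 10000 rounds and d = 10 ads as in the dataset. The variable names are reconstructions of what appears on screen, not verbatim code from the session, and the CSV file name is the same assumption as in the earlier sketch.

```python
import random
import pandas as pd
import matplotlib.pyplot as plt

ds = pd.read_csv('Ads_CTR_Optimisation.csv')  # assumed file name

N = 10000   # total number of rounds
d = 10      # number of ads
ads_selected = []
numbers_of_rewards_1 = [0] * d  # times ad i returned reward 1
numbers_of_rewards_0 = [0] * d  # times ad i returned reward 0
total_reward = 0

for n in range(0, N):
    max_random = 0
    ad = 0
    for i in range(0, d):
        # Draw from Beta(successes + 1, failures + 1). Its mean is
        # alpha / (alpha + beta), i.e. successes over total trials.
        random_beta = random.betavariate(numbers_of_rewards_1[i] + 1,
                                         numbers_of_rewards_0[i] + 1)
        if random_beta > max_random:
            max_random = random_beta
            ad = i
    ads_selected.append(ad)
    reward = ds.values[n, ad]
    if reward == 1:
        numbers_of_rewards_1[ad] += 1
    else:
        numbers_of_rewards_0[ad] += 1
    total_reward += reward

print(total_reward)  # the session reports roughly 2608 on its data

plt.hist(ads_selected)
plt.title('Histogram of ad selections')
plt.xlabel('Ad index')
plt.ylabel('Number of times selected')
plt.show()
```

The key design choice is the Beta draw: an ad that has succeeded often gets a distribution concentrated near a high value, while an ad with few trials keeps a wide distribution, so it still occasionally wins the draw and gets explored instead of being dropped outright.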
If you have any queries or comments, click the discussion button below the video and post them there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to solve your queries.