Namaskar, I am (name) from LearnVern. (6 seconds pause)
This tutorial is a continuation of the previous tutorial on machine learning, so let's continue. In today's machine learning tutorial we will look at sampling methods. First, let us understand what sampling actually is.
Although we have already done sampling in previous algorithms, here we will learn some techniques and methods so that we understand it fully. So, what is sampling? Sampling is a process in which we take selected data out of the data we have, and train and test the algorithm with that.
This means the complete data is the population, and selecting some data out of it is called sampling. The same thing is written here: a predetermined number of observations are taken from a larger population. For example, if there are 1000 students in a college and the college sends 100 students at a time on a trip, we would say that it makes a sample of students and sends it on the trip.
The first technique we will see is random sampling, also called Simple Random Sampling. From the name itself you can understand that it is simple and random. In simple random sampling, we pick some data from the dataset at random and then use it to train our algorithm.
Here you can see the complete population, and from that population we pick some items: the third one, then the second, then the tenth, then the fourteenth, with no logic behind the choice at all. This is called random sampling.
To implement this, we import pandas as pd and import numpy as np, and then create a dataframe: df = pd.DataFrame(np.random.rand(5000, 4), columns=['A', 'B', 'C', 'D']), so we pass a list of column names. The shape of this dataframe is 5000 by 4. Then we take a sample, sample_df = df.sample(100), and if we look at sample_df.head(), A B C D are the column names and the index values are 4719, 4219, 645 and so on, so you can see that random record numbers have been picked. If instead I print df.head(), the indexing is 0 1 2 3 in sequence, so nothing has been picked from that sequence; the sample has been picked randomly. So df.sample(100) is how you use simple random sampling.
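A cleaned-up sketch of the snippet narrated above (the random data is purely illustrative, and the name sample_df follows the transcript):

```python
import numpy as np
import pandas as pd

# Population: 5000 rows and 4 columns named A, B, C, D
df = pd.DataFrame(np.random.rand(5000, 4), columns=["A", "B", "C", "D"])

# Simple random sampling: pick 100 rows uniformly at random
sample_df = df.sample(100)

print(df.shape)           # (5000, 4)
print(sample_df.shape)    # (100, 4)
print(sample_df.head())   # index labels are scattered, not 0 1 2 3
```

Notice that sample_df keeps the original index labels, which is how you can see the rows were drawn at random rather than in sequence.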
Now we will see the next method, stratified sampling. In stratified sampling, sub-groups are created, and these sub-groups are called strata. This involves the division of a population into smaller sub-groups: for example, the students of a college are divided into first year, second year and third year, or in school into first, second and third standard. These sub-groups are the strata.
The strata are formed based on members' shared attributes or characteristics, such as income or educational attainment. In the educational example I gave, those in first year are one stratum and those in second year are another stratum, and so on. Sampling then takes place on these groups: after bifurcating, we pick some items randomly from each group. This fixes a problem with plain random sampling: if we pick data purely at random, many first-year students might come in and there might be very little contribution from the second, third and fourth years. It is better to form strata and pick some samples from every stratum, so this is a better technique. As another example, suppose Town A has 1 million factory workers, Town B has 2 million and Town C has 3 million; if I want some representation from every one of them, that is possible through strata. We have already done this in machine learning, as you must have seen: we do from sklearn.model_selection import train_test_split and get X_train, X_test, y_train, y_test, passing the input data X and Y into train_test_split with stratify=Y. For stratification we need a categorical variable, and Y is that categorical variable, so the split is stratified according to it. We also specify test_size as 25 percent, so 25 percent of the data goes into the test set and the rest goes into training. This is how we do strata-based, or stratified, sampling.
OK, here X is not defined, so I will have to take some values for X and Y. We do from sklearn.datasets import load_iris, and then X, y = load_iris(return_X_y=True). By default return_X_y is False, so we set it to True so that we get X and y back. Now if you want to see the data: the full iris dataset has 150 records, and after the split you can check how many went into training by running X_train.shape. It shows 112 records, which is 75 percent of 150.
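The stratified split described above can be sketched as follows (random_state is my addition, just to make the split reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris: 150 samples, 4 features, 3 balanced classes (50 each)
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y the same
# in both the train and the test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape)  # (112, 4) -> 75% of 150
print(X_test.shape)   # (38, 4)  -> 25% of 150
```

Because of stratify=y, each of the three iris classes contributes almost exactly a quarter of its samples to the test set, instead of whatever an unconstrained random split happens to produce.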
The next sampling technique is reservoir sampling. Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from n items, where n should be equal to or greater than k; that is why it is termed a reservoir, like a reservoir into which water comes from many places. See how we have done it: import random, then def generator(max): number = 1, and while number < max we do number += 1 and yield number, so the number keeps incrementing. From this generator a stream of 10,000 items is generated, and now we do reservoir sampling from that stream. We set k = 5 and reservoir as an empty list initially. Then for i, element in enumerate(stream): enumerate gives the indices 0, 1, 2, 3 and so on, so if i + 1 <= k, we do reservoir.append(element), feeding the first k items in sequentially. Otherwise, when the i + 1 condition is false, it goes to the else part, and that is where the randomness comes from: the probability is k / (i + 1), and if random.random() is less than that probability, we select the item from the stream and remove one of the k items already selected, using random.choice(range(0, k)) to get the index at which the new value is put. In this way we keep bringing in random data from the stream, and if we execute it, we get values picked randomly from the whole dataset. This is just a small piece of code; you can use any algorithm from this family, or make an algorithm of your choice.
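The steps narrated above can be put together as a runnable sketch (the generator reproduces the transcript's version, so it yields 2 through max):

```python
import random

def generator(maximum):
    # Stream source: starts at 1, increments, then yields,
    # so it produces 2, 3, ..., maximum one item at a time
    number = 1
    while number < maximum:
        number += 1
        yield number

stream = generator(10000)

k = 5
reservoir = []
for i, element in enumerate(stream):
    if i + 1 <= k:
        # Fill the reservoir with the first k items of the stream
        reservoir.append(element)
    else:
        # Keep each later item with probability k / (i + 1);
        # if kept, it evicts a random item already in the reservoir
        probability = k / (i + 1)
        if random.random() < probability:
            reservoir[random.choice(range(0, k))] = element

print(reservoir)  # 5 items drawn uniformly from the whole stream
```

The decreasing probability k / (i + 1) is what makes every item in the stream end up in the reservoir with the same overall probability, even though we never store more than k items at once.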
Now, at times we need undersampling or oversampling. What does undersampling mean? (6 seconds pause) We have this original dataset, and you can see these two colors: the orange class has few records and the blue class has many records. So that the blue class does not become dominant, we have to remove some of it, and for that we can do undersampling. In the same manner, vice versa, we must ensure the minority class does not get suppressed, and for that we do oversampling. So undersampling and oversampling are two more techniques; let's see how we do them. We do from sklearn.datasets import make_classification, and then X, y = make_classification(...), with n_classes=2, class_sep=1.5, and the weights defined; we could take default values for all of this, but we have specified values here. Then into X we also include y as a target column. Let me execute and show you what data we have: this is our X, and in the last column you see the classes, the target, which is 0, 1, 0, 1 and so on. This is the data that has been created. Now let's move ahead to random oversampling and undersampling; first, num_0 = len(...) counts the records of one class.
See what we are doing with num_0 and num_1: we print them and observe that num_0, the count of rows where the target of X is 0, is 90, and num_1, where the target is 1, is 10. Now you see that the class with 10 has a problem: it is too small. So for the class with 10 we should do oversampling, and for the class with 90 we should do undersampling; the class that is smaller gets oversampling and the one that is larger gets undersampling. First we do random undersampling: undersampled_data = pd.concat(...), where from the 90 rows with target 0 we randomly sample only num_1 = 10 rows, and concatenate them with the 10 rows where the target is 1. So 10 records plus 10 records gives 20; the total undersampled data is 20, and you can see here that it is 20. Similarly, for oversampling we do the opposite: we pd.concat, but this time we sample num_0 = 90 rows from the minority class, and we write replace=True because we are drawing more rows than exist, so 90 plus 90 becomes 180. So in undersampling the result has double the minority class count, and in oversampling it has double the majority class count. This is how we do undersampling and oversampling.
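Here is a self-contained sketch of the under/oversampling walk-through above (weights=[0.9, 0.1], flip_y=0 and random_state are my assumptions, chosen so the counts come out exactly 90 and 10 as in the transcript):

```python
import pandas as pd
from sklearn.datasets import make_classification

# Imbalanced dataset: 90 rows of class 0, 10 rows of class 1
X, y = make_classification(
    n_samples=100, n_classes=2, class_sep=1.5,
    weights=[0.9, 0.1], flip_y=0, random_state=42
)
df = pd.DataFrame(X)
df["target"] = y  # include y in X as a target column

num_0 = len(df[df["target"] == 0])   # majority count: 90
num_1 = len(df[df["target"] == 1])   # minority count: 10

# Random undersampling: shrink the majority class down to num_1 rows
undersampled_data = pd.concat([
    df[df["target"] == 0].sample(num_1),
    df[df["target"] == 1],
])
print(len(undersampled_data))  # 20  (10 + 10)

# Random oversampling: grow the minority class up to num_0 rows;
# replace=True because we draw more rows than the class contains
oversampled_data = pd.concat([
    df[df["target"] == 0],
    df[df["target"] == 1].sample(num_0, replace=True),
])
print(len(oversampled_data))   # 180 (90 + 90)
```

Note that replace=True is only needed on the oversampling side; undersampling draws fewer rows than exist, so sampling without replacement works there.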
We also have the concept of imbalanced data, and when the data is imbalanced we still have to work with it. To work on imbalanced data we have some methods, as you can see here: one is Tomek links, and the other is SMOTE, the synthetic minority oversampling technique. In imbalanced data one class dominates and the other is quite small, and by using these techniques we can balance the data. These come from the imbalanced-learn (imblearn) package, which builds on scikit-learn. For the demo, you can see: from imblearn.under_sampling import TomekLinks, and in a similar manner to how you used train_test_split, we make an object, tl = TomekLinks(), and after making the object we call tl.fit_resample(X, y), passing X and y; this will balance the data and return it to us.
We will see the same thing with imblearn.over_sampling, from which you can import the SMOTE package, and similarly resample for the minority. Here we encountered an unexpected keyword, so let's check: the old ratio='minority' argument is not accepted. Looking at the parameters, sm = SMOTE(...) accepts sampling_strategy, which can be a float or a string, and when the string is 'minority' it resamples only the minority class; so I specify sampling_strategy='minority'. Next it says there is no attribute fit_sample, so let's check the package: it shows fit_resample, so we will use fit_resample and pass X and y, getting X_sm and y_sm, the resampled X and Y. We got two warnings, but let's check the shape of X_sm once to confirm: its shape is 180 by 21, and the shape of X is 100 by 21. Resampling has been done for the minority class, so the total has become 180. This is how we do resampling.
So I hope you got this point and understood it. Now you should apply this on your own dataset, and these methods will be helpful to you, because many times you do not get clean data. When we are practicing, we practice with very simple kinds of datasets, and that is fine for practice, but in real projects you will observe that data is very messy and complex, and there these sampling methods will be helpful. So before you put your data into a machine learning model, you should apply sampling to the data and then give it as input to the model for training. Today's session will end here, and the parts ahead will be covered in the next session. Keep learning, remain motivated, thank you.
If you have any queries or comments, click the discussion button below the video and post them there. This way you will be able to connect with fellow learners and discuss the course, and our team will also try to solve your query.