Namaskar, I am (name) from LearnVern. (pause 6 seconds; music)
Continuing from the last session of machine learning, in today's tutorial we will study cross validation. So what is cross validation, actually? Let me explain it to you in very simple language. Watch here: let us assume that we have data, and this square box represents the complete data; it has all the data. In cross validation, what we do is take out samples from this data. A sample will be either the complete data as it is or a small portion of it. So let us assume these are its samples. Here we take out N samples, and we use these N samples with our algorithm. Out of these samples we save one for testing purposes: I give the rest of the samples to my algorithm for learning, keep this one sample aside, pass it as input afterwards, and the algorithm gives us predictions on it. So this is how cross validation works. Now let us see the steps involved in it. The first step is: reserve some portion of the data. (typing) And what is this portion for? For testing purposes. The second step: using the rest of the data, train the model, that is, the machine learning algorithm. And the third step: test the model using the reserved data. So this is how we do cross validation. Now let us see exactly what happens during cross validation.
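The three steps above can be sketched with scikit-learn. This is only a minimal illustration; the iris dataset, the linear SVC, and the 20% reserve size are stand-in choices, not something fixed by the method:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: reserve some portion of the data for test purposes (here 20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: using the rest of the data, train the model.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Step 3: test the model using the reserved data.
print(clf.score(X_test, y_test))
```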
So in cross validation the first part is validation. But what does this validation actually mean? In validation, we use 50 percent of the given dataset for training, and the remaining 50 percent for testing. So this is basic validation, but it has a problem, a drawback. The drawback is that when you perform training on only 50 percent of the dataset, the other 50 percent may contain useful information, and our model never gets that useful information during training. In other words, the 50 percent of test data may have some useful information that the training data does not have. (4 seconds pause; typing) Since the model does not get that particular information, this is a drawback.
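To see this drawback concretely, here is a small sketch (iris and a linear SVC are assumed purely for illustration): the score moves around depending on which half of the data the model happened to train on, because the reserved half may carry information the training half lacks:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # 50% for training, the other 50% reserved for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    clf = SVC(kernel="linear").fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# Each split gives a different score: the reserved half may hold
# useful information the model never saw during training.
print(scores)
```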
So this method is not very useful, and we can go ahead and work with some other method. The first one was validation; the next method is called "leave one out": leave-one-out cross validation, or LOOCV for short. In the last example we had a 50-50 split; now, in this one, take the complete dataset. See, this is my dataset, and as we did earlier, we make samples of it: sample 1, 2, 3, and so on; say we have made 4 samples of it. Out of the 4 we leave one out, and the rest we use for training. Now there is one benefit in this: you keep iterating, first leaving this one out, then this one, then this one, so every sample eventually gets used. That was the benefit, but at the same time the problem is that it takes a lot of time: you iterate over everything leaving this one out, then iterate again leaving the next one out, and so on, and that may consume a lot of time. So that is the drawback in this case. Apart from this, we have the K-Fold technique, and this is the technique that we use the most. (6 seconds pause; typing)
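Leave-one-out is available directly in scikit-learn. This sketch (iris and a linear SVC assumed for illustration) shows both the idea and its cost: one model is fitted per sample, so 150 fits for the 150 iris rows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each iteration trains on all samples except one and tests on that
# single left-out sample: 150 fits in total, which is exactly why
# LOOCV consumes a lot of time on large datasets.
loo = LeaveOneOut()
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=loo)

print(len(scores))    # one score per left-out sample
print(scores.mean())
```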
6:16
So what happens in K-Fold cross validation? In this, we divide the data into K subsets, and these K subsets are what we call folds. Then we train on K minus 1 subsets, and on the remaining one we do testing. This is how we use it. It is similar to leave-one-out, but here we have a choice: we can decide the number of subsets K, such as 10-fold, 20-fold, or 50-fold, depending on the data and the sample size we choose to keep. So let me demonstrate how this works. First, let us talk about its advantages, and after that we will implement it and see. The first advantage is that it is fast: it runs faster than leave-one-out cross validation. The second benefit is that it is simpler to implement; we will see just how simple it is using the library. It has a third benefit as well: the accuracy estimate will be better in this case. We call it the out-of-sample accuracy, and this estimate will be more accurate. And it is more efficient. Why? Because here every observation is used in cross validation. So let me show you how it is implemented. To implement this, let me take you to Colab, and here we will see how it is done.
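The K-fold idea described above, K subsets, train on K minus 1, test on the remaining one, can be sketched like this (5 folds on iris with a linear SVC, all assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# K = 5 subsets (folds); shuffling first so each fold mixes all classes.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, test on the one remaining fold.
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

# Every observation was used for testing exactly once.
print(fold_scores)
```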
So, to implement it I will write "from sklearn import cross_validation". (10 seconds gap) And what value do you want to give to K? Let us say we want to give K the value 10, so K will be assigned 10. Note that an underscore had to come here: cross_validation. OK, "from sklearn import cross_validation"... this gives an error, so let us check on Stack Overflow what the issue is. It says "cannot import", and when we check, we find that cross_validation is deprecated. Since it has been deprecated, what we have to do is basically use train_test_split from sklearn, or else "from sklearn.model_selection import cross_validate". So let us check this as well. Whenever something is deprecated, you can immediately check on Stack Overflow and then make the corrections. I am leaving the old line as it is, so you can see that it cannot be used. So here we check "from sklearn import cross_validate"; this package name is also not present. The documentation says "from sklearn.model_selection", so we have to do it with model_selection: "from sklearn.model_selection import cross_validate". Let us check, and here cross_validate has been imported successfully.
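To summarize the fix found above: the old `sklearn.cross_validation` module no longer exists (it was deprecated and later removed), and the working imports in current scikit-learn live under `model_selection`:

```python
# Replaces the old, removed "from sklearn import cross_validation".
from sklearn.model_selection import cross_validate, train_test_split

K = 10  # the fold count we wanted to assign
print(callable(cross_validate), callable(train_test_split), K)
```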
So now, data is equal to cross_validate, and putting a bracket here, let us study its parameters. See here: estimator, which is the algorithm it will use, then X and y, then groups, scoring, and cv, the cross-validator. So when we have an algorithm to cross-validate, that is when we pass the cross-validator in cv; but to pass a cross-validator we first have to create one. So here we look for n_folds; we cannot see n_folds here, which means we will use that method afterwards with train_test_split. So here it is better to go to the sklearn cross-validation documentation and take it from the original source. Just watch: there is cross_validate also, that is nice, and this one, KFold, is what we will use.
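Putting the two pieces together: a cross-validator such as `KFold` is created first and then passed as the `cv` argument to `cross_validate`. Iris and a linear SVC are again just placeholder choices for the sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Create the cross-validator first, then hand it to cross_validate
# through the cv parameter.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
data = cross_validate(SVC(kernel="linear"), X, y, cv=cv)

print(data["test_score"])  # one test score per fold
```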
12:38
Parameters, cross validation, best parameter... OK, so here we can see train_test_split and load_iris used in this way. Up to this point cross validation has not been used; then in this example there is cross_val_score, with clf and a random state. And here is ShuffleSplit: "it is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance" in this way; so ShuffleSplit from model_selection has been used here. This is how you should research: when you are exploring something, research it first and then implement it. So here I will show you the demo from the beginning. Here we have: import numpy as np; from sklearn.model_selection import train_test_split; from sklearn import datasets; from sklearn import svm. We have already seen that train_test_split is used to split the dataset. You can see that in X_train, X_test, y_train, and y_test, with the help of train_test_split, we have specified test_size as 0.4, so we have kept a 40% test size, and the data has been divided accordingly. After being divided, the shapes of the splits are displayed here. Then we have clf, meaning classifier, equal to SVC with a linear kernel, set C equal to one, made the pipeline there itself, applied fit, and tested the score as well.
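The demo walked through above follows the train/test split example from the scikit-learn documentation; a runnable version looks roughly like this:

```python
import numpy as np  # imported as in the docs demo
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)

# 40% of the rows are reserved for testing, as in the demo.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# 150 rows split into 90 for training and 60 for testing.
print(X_train.shape, X_test.shape)

clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```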
14:24
After testing the score, let us move ahead: "A model is trained using K-1 of the folds as training data." So what happened here? The model has been trained on X_train, and we have tested it with X_test. One portion we have found and kept aside. Now if we look at the next part: see, this is all the data; out of that, this is training data and this is test data. From the training data also we should have split 1, split 2, split 3, and so on. How will that be done? "from sklearn.model_selection import cross_val_score". Here clf is equal to svm.SVC, with kernel equal to linear, the C parameter kept at one, and random_state specified as 42. Then in cross_val_score we have passed clf, given X and y as input, and for the cross validation we have given cv equal to five. If you give 5, you will see here 1, 2, 3, 4, 5: it has split the data five times and displayed the cross validation scores. The same thing we can print: see the mean score; we have taken the mean of all the scores and also the standard deviation. The mean score is 0.98 and the standard deviation is 0.02, so the deviation is very low and the mean score is quite good: 0.98, which means 98 percent. So this one is quite good. Now let us implement this and see: "from sklearn.model_selection import"... here we will import train_test_split: train, underscore, test, underscore, split. Also, let us correct the spelling of "from" here. So here we have imported train_test_split.
Now, "from sklearn import datasets"; we will import datasets also. Then data is equal to... the data taken there, the same data we will take and work on. We see that the dataset taken is the iris dataset, so here I will take iris and import svm as well. So we have imported svm too and initialized our data as X and y. Next, I have to split this data, and where should I split it? Into training and testing: X_train, X_test, y_train, y_test. So let me do that; just watch here: with the help of train_test_split we have split it. We will just adjust it a little this way, and let me execute it. Now you can see what our score is: 0.96, that is, 96 percent is the score, with the test size given as 40 percent. How did we do this?
This complete thing we have done by evaluating; after evaluating we will proceed further and go into splitting. For the splitting we will use cross_val_score, and after splitting we will see what accuracy is achieved and how. So here: "from sklearn.model_selection import cross_val_score"; clf is equal to svm.SVC, so the support vector classifier has been used, with kernel equal to linear, C equal to one, and random_state equal to 42. Now this clf has been passed to cross_val_score, X and y have been passed as the input data, and cv is specified as 5, since we have to make 5 splits: it will be 5 folds. Now we execute it, and see what we get in scores: we get different scores, one per fold. For these scores we print the mean. One bracket is missing, so let me put one more bracket; we find the mean, and 0.98 is the mean. Now the standard deviation: how much is it deviating? We find the standard deviation of the scores, and you know it should be low. The output for the standard deviation is 0.016, which is very low indeed. So this is the way we apply cross validation, so that we can check the data with different samples. So friends, we will end today's session here and continue in the next session. Keep watching, thank you very much for watching, and let's meet in the next video.
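The cross_val_score run described above can be reproduced as below; the figures quoted in the session (mean around 0.98, standard deviation around 0.016) come from this documentation example, though the exact numbers may shift slightly across scikit-learn versions:

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = svm.SVC(kernel="linear", C=1, random_state=42)

# cv=5 makes 5 folds: 5 fits, and 5 held-out scores.
scores = cross_val_score(clf, X, y, cv=5)

print(scores)          # one score per fold
print(scores.mean())   # about 0.98 in the session
print(scores.std())    # about 0.016 in the session
```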
If you have any questions or comments related to this course, you can post them beneath this video using the discussion button, and in this way you can discuss the course with other learners too.