Hello,
I am (name) from LearnVern.
In our previous Machine Learning tutorial, we saw the decision tree regressor.
Today, we will see how we can select or choose a regressor model.
In Machine Learning, when we apply a model to a dataset, this selection becomes really important.
So, let's see the methods or measures with which we can decide which model we should select.
First, we should know why this matters.
It is important because we want goodness of fit. Now, what is this goodness of fit? I will explain it to you in this paint window.
Assume that I have some data points. For these data points, we need to make predictions, so a regression line will be fitted through them; the regression line will fit something like this.
It is this regression line that will predict the new, future values.
This is our x axis and this is the y axis, so for whatever new inputs we get later, the corresponding point on this line will be the prediction, that is, the output or y value.
So, our objective is to fit this line in the most optimal, best way.
So, in model selection we have to select the model with the best goodness of fit, the one that gives us the best fitting line.
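As a small illustration, here is a minimal sketch of fitting a regression line in Python with scikit-learn; the data points and numbers are made up only to show the fit and the prediction for a new input.

import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data points (x values and their y values)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

# fit the regression line through the points
model = LinearRegression()
model.fit(X, y)

# the fitted line predicts the y value for a new x input
print(model.predict(np.array([[6]])))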
So, let's see what techniques are available.
First is the probabilistic measure.
Next is the resampling method.
Apart from these, we will discuss two more techniques, one is AIC, and the other is BIC.
Now, first is the probabilistic method, where we deal with in-sample error and complexity. What does in-sample error mean? Suppose you have a complete dataset, something like this, and from it we take some parts as samples and display them in a diagram format like this; from it I have created 5 samples.
So, in-sample means that the sample is created from within the data itself.
So, the probabilistic measure works on the basis of this in-sample data: we run our algorithm and then pass the in-sample data itself as the testing data.
On this we see how much error or complexity we get.
So, this is the first approach that is the probabilistic approach.
Second is the resampling method.
Here we have out-of-sample error, which means the algorithm is tested on a new input that comes from outside the data presently available with us. So this approach is called "choose a model via estimated out-of-sample error", that is, error outside the sample.
So, these are the two techniques to begin with.
The next technique is AIC, the Akaike Information Criterion. This works on frequentist probability; we will see ahead what that means.
Next is BIC, which is the Bayesian Information Criterion.
So, this works upon Bayesian Probability.
So, these two give us a certain score.
Now, we will see the AIC method. Here, when scoring is done, the formula used is: AIC = (-2 / N) * LL + 2 * (k / N), where LL is the log likelihood, N is the number of examples, and k is the number of model parameters.
So, this is the formula to find AIC.
Now, if we look after this at what happens in the Bayesian Information Criterion, here the formula is:
BIC = -2 * LL + log(N) * k.
So, both these formulas basically give us a score.
This score should be low; the lower the score, the better the performance of that particular model.
After scoring we have to perform selection, and what do we want here? We want a lower score. We will now see practically how this is done.
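To make this scoring concrete, here is a minimal Python sketch of the two formulas above, assuming we already have the log likelihood LL, the number of examples N, and the number of parameters k; the numbers at the end are made up purely for illustration.

from math import log

def aic(n, log_likelihood, k):
    # AIC = -2/N * LL + 2 * k / N
    return -2.0 / n * log_likelihood + 2.0 * k / n

def bic(n, log_likelihood, k):
    # BIC = -2 * LL + log(N) * k
    return -2.0 * log_likelihood + log(n) * k

# hypothetical values just to show the calculation
n, ll, k = 100, -150.0, 3
print(aic(n, ll, k), bic(n, ll, k))  # the lower the score, the better the model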
Now, after this we will discuss the resampling method. In resampling, the first option is a random train test split (with a random state), meaning we take the entire data and randomly select from it to divide it into train and test data; a small sketch of this follows below.
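Here is a minimal sketch of such a random train test split using scikit-learn; the tiny dataset, the 80/20 split, and the random_state value are assumptions chosen just to show the call.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up examples with 2 features
y = np.arange(10)                  # made-up target values

# 80% train, 20% test, picked randomly; random_state fixes the random selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)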
Next we have cross validation, namely K fold, and the bootstrap.
Now, we will understand these as well.
Now, what happens in K fold? Here we split the data that we have into K folds, for instance 5 folds, and then we take each of them as a different sample.
As I was showing you in the paint window before: 1, 2, 3, 4, 5. So this K fold is 5 folds in this way.
And, "where each example appears in a test set only once".
So, here one set at a time will be sent for testing.
First we will send the first one, then the second, the third, the fourth, up to the fifth.
So, this is K Fold cross validation technique.
Now, after this what is the procedure?
First, we will shuffle the data, as we don't want it to be selected in the sequence in which it is arranged in the original dataset.
For example, if we have thousands of records and we just pick the first 50, we don't want that.
So, we will completely shuffle this.
You must have played cards at some time, so you know that we have to shuffle the cards; in the same way we shuffle the data.
Then we divide the data into K groups, as I divided them into 5 groups. Each group in turn is treated as the test data and sent for testing, while the rest of the data is used for training.
So, this is what happens in K Fold cross validation.
After that we apply the algorithm on it, take the evaluation score, and on the basis of this score we decide whether to keep or discard the model.
So, in this way K Fold cross validation works.
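As a rough sketch of these steps, here is how K-fold cross validation could be run in Python with scikit-learn, assuming a decision tree regressor as the candidate model and made-up data; the scores here mean nothing by themselves and only show the mechanics.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X = np.random.rand(100, 3)   # made-up features
y = np.random.rand(100)      # made-up targets

# shuffle the data, then split it into K = 5 groups
kf = KFold(n_splits=5, shuffle=True, random_state=1)
model = DecisionTreeRegressor(random_state=1)

# each group is used once as the test set; the rest is used for training
scores = cross_val_score(model, X, y, cv=kf)
print(scores, scores.mean())  # evaluation scores used to keep or discard the model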
Next is the bootstrapping technique, which is a resampling technique. As I showed with the samples here, assume that I have chosen the samples all separately: one from here, the second from here, the third from here, and so on.
But in resampling with replacement, what happens is this: suppose this is your entire dataset; from it you select one sample randomly, and then you take another sample, also selected randomly. Now there is a possibility that some records in these samples match, that is, they are repeated, and it is also possible that nothing is repeated.
There is no strict rule here that a record selected earlier will not be taken again; this is what resampling with replacement means.
So, with replacement, data from the earlier samples can also appear in the new ones; but if we sample without replacement, then each sample is selected separately and no records are repeated across them.
So, in the bootstrapping technique we resample, taking samples again and again with replacement, and then we perform testing.
So, this is bootstrapping technique.
Now, we will understand its process in steps.
First, choose the number of bootstrap samples; for instance, if you want to take 5, then select 5 accordingly.
Next, "for each bootstrap, draw a sample with replacement": take one sample, and then the second time the earlier records are also available for selection again, so we can choose from them too; in this way you select each sample.
Now, the next step is to calculate the statistics, so apply the algorithm and calculate the statistics.
Then calculate the mean of the calculated statistics.
And on the basis of this you will decide whether you want to choose the model or not.
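Here is a rough sketch of these bootstrap steps in Python, assuming scikit-learn's resample utility and taking the mean of the data as the statistic; the data and the choice of statistic are made up only for illustration.

import numpy as np
from sklearn.utils import resample

y = np.random.rand(100)   # made-up data

n_bootstraps = 5          # step 1: choose the number of bootstrap samples
stats = []
for i in range(n_bootstraps):
    # step 2: draw a sample with replacement (earlier records can be picked again)
    sample = resample(y, replace=True, n_samples=len(y), random_state=i)
    # step 3: calculate the statistic on this sample
    stats.append(sample.mean())

# step 4: take the mean of the calculated statistics and decide on the model
print(np.mean(stats))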
So, these were the techniques that we saw, with which model selection can be done.
Now, we will move to our next topic, which is evaluating regression model performance.
So, keep watching and remain motivated.
Thank you.
If you have any queries or comments, click the discussion button below the video and post them there. This way, you will be able to connect with fellow learners and discuss the course. Also, our team will try to solve your query.