Hello,
I am (name) from LearnVern.
And this tutorial is in continuation of our previous session.
So, we are learning about Evaluating Classification Models Performance.
So, we have understood the steps to implement Machine Learning algorithms, and by following these steps we are also able to make predictions.
But an important part of these steps is checking whether or not our algorithm is performing accurately, making correct predictions and giving proper output.
So, let's understand how we can evaluate this.
Now, you can see for classification we have many metrics that can be used.
Here, I have listed 5 types of Metrics that you can use.
Classification Accuracy,
Logarithmic Loss,
Area under ROC curve,
Confusion Matrix, and,
Classification Report.
So, first let us talk about Classification Accuracy.
So, classification accuracy is “the number of correct predictions made as a ratio of all predictions made.”
So, this means, for instance, that we have 10 predictions in total, and we ask how many of them were correct; suppose 8 were correct.
So, 8/10, that means 80 percent, so this is the proportion of correct predictions out of all predictions made.
So, this is very simple to understand.
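To make the 8 out of 10 example concrete, here is a minimal sketch in plain Python. The function name and the two label lists are made up for illustration; they are not part of the session's code.

```python
def classification_accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels: 10 predictions, of which the last two are wrong.
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]

accuracy = classification_accuracy(actual, predicted)
print(accuracy)  # 0.8
```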
Now, here
You can see Cross Validation Classification Accuracy, so here the cross validation method is being used.
Let's look at the code once,
Import pandas,
From S K learn import model selection
From S K learn dot linear model import logistic regression.
So, everything is normal up to here: we imported pandas, from S K learn we imported model selection, and from S K learn dot linear model we imported logistic regression.
So, we already know that logistic regression does classification.
And, here in the URL we picked the Pima Indians diabetes dataset.
Here, we took column names such as pregnancy, plas, pres, skin, test, mass, pedi, age and class.
And with these names, we have data frame equal to pandas dot read CSV, where we passed the U R L, and in names we passed the column names.
And then I will print and show you, as to what exactly is in the data frame.
And rest, I will keep a comment for now.
So, we are loading the dataset; let it execute, and as it is running for the first time, it takes a moment to connect.
Ok! So it is connected now,
So, you can see our data that is loaded here.
Now, let's move ahead,
After this, we will uncomment this.
Now, array is equal to dataframe dot values, so this will convert the values stored inside the dataframe into an array.
Now, here we have picked X and Y.
So, here we have 1, 2, 3, 4, 5, 6, 7, 8, 9, a total of 9 columns, that means columns 0 to 8, so columns 0 to 7 go in X and column 8, the class, goes in Y.
And the seed value we have kept as 7.
Now here, with the help of model selection, we will use the K fold technique.
In this K fold technique we will pass N splits is equal to 10, that means divide our data into 10 splits or folds; for instance, if we have 1000 records then it makes 10 folds of 100 each, and each fold is used once for testing while the algorithm learns from the other nine.
Next, model is equal to logistic regression, so we created a model, an object of logistic regression.
Then for scoring, the metric we want to evaluate the model with, we kept accuracy.
In results, we call model selection dot cross val score; here we passed the model, the input data X and Y, for cross validation we gave the KFold object, and for scoring we passed accuracy.
So, let's execute this and see
This part I will make it as a comment, and execute it, and let's see the result.
So, it is showing us some warnings; for now we will not discuss the warnings and go directly to the output… but it is not showing here, so let me make some corrections.
So, the warning says to increase the number of iterations or to scale the data.
So, here… max iterations defaults to 100, so let me change this: I am giving max iterations as 500, and I am executing this again.
So, I have executed this with the increased number of iterations.
So, here you will see it is 0.77, that means our accuracy is 77 percent, which is not very high.
But this is the way we can find out the accuracy.
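The steps just described can be sketched roughly as follows. This is only a sketch under assumptions: the dataset URL is not repeated here, so a synthetic dataset from `make_classification` (768 rows, 8 input columns) stands in for the diabetes data, and `shuffle=True` is added because recent scikit-learn versions require it when a seed is passed to `KFold`. The printed score will therefore differ from the 0.77 seen in the session.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (assumption): 8 inputs, 2 classes.
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)

# 10 folds; each fold is used once for testing, the other nine for training.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# max_iter raised from the default of 100 to avoid the convergence warning.
model = LogisticRegression(max_iter=500)

# One accuracy value per fold.
results = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
print("Accuracy: %.3f (std %.3f)" % (results.mean(), results.std()))
```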
Now, the next thing is Logarithmic Loss, so what is logarithmic loss? Logarithmic loss basically works with probabilities.
For instance, for any data point the model gives the probability of its membership in a class.
Suppose we have day and night; the model will give the probability that a record belongs to the day class.
So, logarithmic loss is a performance metric for evaluating the predictions of probabilities of membership to a given class.
So, the predicted probabilities lie between zero and one and reflect the model's confidence in the class membership; the log loss itself is zero for perfect predictions and grows without an upper limit as confident predictions turn out wrong.
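As a small illustration of how this metric behaves, the sketch below computes binary log loss by hand. The function and the probability lists are hypothetical, and the clipping value `eps` is an assumption added only to avoid taking the log of zero.

```python
import math

def log_loss(actual, predicted_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (1 or 0)."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)

actual = [1, 1, 0, 0]                   # 1 = day, 0 = night
confident_right = [0.9, 0.8, 0.1, 0.2]  # high probability on the true class
confident_wrong = [0.1, 0.2, 0.9, 0.8]  # high probability on the wrong class

loss_good = log_loss(actual, confident_right)
loss_bad = log_loss(actual, confident_wrong)
print(round(loss_good, 3), round(loss_bad, 3))
```

Confident correct predictions give a loss close to zero, while confident wrong predictions give a loss above 1.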
Now, we will go through the codes.
Here also we are using cross validation technique,
So, first we have imported pandas,
Then from S K learn we have imported model selection, because we are using K Fold technique,
Then we are using Logistic Regression as well, with the same Pima Indians dataset.
So, up to here everything is the same.
And only here in scoring we are using negative log loss; we use the negative because normally a smaller log loss is better.
So, generally, when we get a performance value, the convention is that a bigger value is better, so the log loss is negated to fit that convention.
Then, in model selection dot cross val score, in this we passed the model , and then passed X and Y, then passed Kfold in cross validation, and in scoring we passed negative log loss.
Now, let us execute this,
Here also we will have to increase the iterations, so in our model here I will pass max iterations as 500, and then we will execute this and see.
So, here a smaller log loss is better, with zero representing a perfect log loss, and we have got the value as minus 0.48; it is negative because we have used negative log loss.
So, as the score is inverted, values closer to zero are better, and that is the reason we are getting this negative value.
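A rough sketch of the same cross validation with negative log loss as the scoring is given below. As an assumption, synthetic data stands in for the diabetes dataset, so the value will not be the minus 0.48 from the session; flipping the sign recovers the ordinary log loss.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data (assumption).
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=500)

# scikit-learn reports the negated log loss so that bigger is better.
results = cross_val_score(model, X, Y, cv=kfold, scoring="neg_log_loss")
mean_log_loss = -results.mean()

print("neg_log_loss: %.3f" % results.mean())
print("log loss:     %.3f" % mean_log_loss)
```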
Now, our next technique is area under ROC curve,
This concept is derived from the confusion matrix that we have already learned about, but we will see once more.
So, confusion matrix is something of this sort,
Here we have our original data, which is our observed data, and which we consider as correct.
So, here you can see the columns as predicted and the rows as actual; predicted is what our algorithm has predicted, and actual is the observed data.
So, here you will see that if the actual and predicted both match on the positive class then we call it a true positive; this assumes we have two classes, one positive and the other negative, for instance day and night, and if there were 3 classes then we would have to enlarge this matrix.
So, that is the reason we are working with two classes only.
So, if our class 1 is positive, and our algorithm also considers it as positive, we call it a true positive.
And if our algorithm says this is negative but in actual it is positive, then we call it a false negative, because the algorithm has predicted it wrong.
So, you can remember this chart as: when the algorithm and the observed data both consider it as positive then true positive, and when they both consider it as negative then true negative.
Similarly, when the algorithm considers it negative but it is positive then False Negative, and when the algorithm considers it positive but it is negative then False Positive.
So, you can remember it in this way.
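The four cases can also be counted with a few lines of plain Python. This is only an illustrative sketch; the function name and the small day and night lists are made up, with day treated as the positive class.

```python
def confusion_counts(actual, predicted, positive="day"):
    """Count TP, FP, FN and TN for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # actually positive, predicted positive
        elif a != positive and p == positive:
            fp += 1   # actually negative, predicted positive
        elif a == positive and p != positive:
            fn += 1   # actually positive, predicted negative
        else:
            tn += 1   # actually negative, predicted negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical observations and predictions.
actual    = ["day", "day",   "night", "night", "day", "night"]
predicted = ["day", "night", "night", "day",   "day", "night"]

counts = confusion_counts(actual, predicted)
print(counts)
```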
So, when we run logistic regression, there are many challenges that we come across,
So, I will try to explain these challenges,
Now, we will take an example to understand this.
So, with the help of graphical representation I will try to explain this.
If we assume this axis as the Y axis, at this upper level I have a class, Day,
And here below I have another class, night.
So, consider that we have a record, suppose a record related to brightness, which I will represent through a circle; I will make it a little smaller.
So, when the brightness value of a circle is low, then it is night, and as its value increases it shifts from night to day.
So, as the value increases, it gets classified as day. Right?
Clear?
So, this can be understood easily as it is a simple logic.
But what is the problem over here, you will understand,
In this case there are also situations when, during the daytime, brightness is not that high, and it feels like night.
So, if such a situation comes, what can we do to handle it?
So, what logistic regression does in such a case is fit an S shaped curve.
So I will try to draw a curve, just bear with the drawing,
So, on this curve we will have to identify a threshold, and I am setting this threshold at 50 percent.
So, any data point that goes above this 50 percent we will call day.
And those that remain below it we will call night.
So, this data point is coming here, above the threshold, so we will call it day.
And this data point is here, below 50 percent, so we will call it night.
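The S shaped curve and the 50 percent threshold can be sketched as below. The weight and bias values are invented purely for illustration (a real logistic regression would learn them from data), and the brightness scale is hypothetical.

```python
import math

def sigmoid(z):
    """The S shaped curve: maps any number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(brightness, weight=1.0, bias=-5.0, threshold=0.5):
    """Return (label, probability of day) for one brightness value."""
    probability_day = sigmoid(weight * brightness + bias)
    label = "day" if probability_day >= threshold else "night"
    return label, probability_day

# Low brightness falls below the threshold, high brightness above it.
for brightness in [1, 4, 6, 9]:
    label, prob = classify(brightness)
    print(brightness, label, round(prob, 3))
```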
So, here if I try to create confusion matrix,
Then how will my confusion matrix be ?
I will show it to you.
So, here my confusion matrix will be something of this sort.
Now, you can see,
So, here this column is predicted day, and this is predicted night,
We will consider night as F and day as T,
And this row is actual day, we will assume this as T,
And this is actual night, we will assume it as F,
So, here if we count we have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, a total of 11 points,
So, out of these, my prediction is such that it is predicting this one day point as night: in reality it is day but the prediction says Night.
In the same way, this one is Night but predicted as Day, so we will have 1 over here.
So, this one over here is a False Negative, meaning the observation is positive but predicted as negative.
And this one here was actually negative but predicted as positive, so it is a False Positive.
So, we have these two.
Now the remaining ones are true positives and true negatives.
So, these day points over here are true positives, let us count them, so they are 1, 2, 3, 4, okay 4.
And the remaining 1, 2, 3, 4, 5… these 5 are true negatives and will come here.
So, in this way a confusion matrix is created.
Next, we will talk about the ROC Curve, but first, after we have made a confusion matrix in this way, we can calculate some metrics from it.
And here for accuracy we will put TP plus TN divided by TP plus TN plus FP plus FN.
So, this is called classification accuracy.
That means the correctly classified points divided by the total number of classifications, or total number of predictions, as you can call them.
Then, we will compute the recall value, that is TP divided by TP plus FN.
Then we will take precision, TP divided by TP plus FP.
Then we will compute the F measure, by combining both of them.
So, this is the way we calculate it
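Using the counts from the day and night example (4 true positives, 5 true negatives, 1 false positive, 1 false negative), these formulas work out as in the short sketch below. The F measure is written here as the usual harmonic mean of precision and recall.

```python
# Counts taken from the day/night confusion matrix example.
tp, tn, fp, fn = 4, 5, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all predictions
recall = tp / (tp + fn)                     # found positives / actual positives
precision = tp / (tp + fp)                  # correct positives / predicted positives
f_measure = 2 * precision * recall / (precision + recall)

print("accuracy :", round(accuracy, 3))
print("recall   :", round(recall, 3))
print("precision:", round(precision, 3))
print("F measure:", round(f_measure, 3))
```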
But, right now we are discussing the Confusion Matrix, so let us first execute that part, then we will move ahead for the ROC curve.
So, let us implement this here.
So, in the confusion matrix we saw that it identifies true positive, true negative and false positive and false negative.
So, here all the codes are the same such as pandas, model selection, logistic regression.
The only new thing is that from S K learn metrics we have imported a confusion matrix.
So, everything is the same here, except that we make predictions on a test set; for the same test set we also have Y test, so Y test is the actual one, and predicted is the algorithm's prediction.
So, we have predicted as well as Y test, which is the actual.
So, inside the confusion matrix we have passed both Y test and predicted.
Then we printed this matrix to see how the confusion matrix is formed.
So, you can see the confusion matrix over here.
So, Y test, the actual values, is on this side, and these are the predicted values.
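So, this 142 and this 58 are the correct predictions; keep in mind that scikit-learn orders the labels as 0 then 1, so if class 1 is taken as positive, the top-left cell holds the true negatives and the bottom-right cell the true positives.
And the other two cells are the false positives and the false negatives.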
So, this is how the confusion matrix is derived.
So, you can see here that 20 plus 34 are the wrong predictions, that means incorrectly classified.
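A sketch of this confusion matrix step is shown below. It rests on assumptions: synthetic data replaces the diabetes dataset, and the split uses `test_size=0.33`, which is a guess that happens to give 254 test rows out of 768, the same total as the four counts above. The printed numbers will not match the session's matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)

# Hold out part of the data; Y_test holds the actual labels.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7)

model = LogisticRegression(max_iter=500)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)

# Rows are actual classes, columns are predicted classes, labels sorted 0 then 1.
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

# For two classes, ravel() returns the cells in this order.
tn, fp, fn, tp = matrix.ravel()
print("wrong predictions:", fp + fn)
```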
Now, suppose we want to reduce this error.
We were just talking about this 50 percent threshold.
I can shift my threshold up or down, that means increase or decrease it.
So, if I lower this threshold line to, let's say, 10 percent instead of 50 percent, then this will change the classes of some data points: the black points below the line are still classified correctly, and all of these above it get classified as day.
And if we go further down, this point might get classified as day, and this one too.
So, as you keep changing your threshold value, the classification of your data points also changes, and we have to search for the threshold that gives us the fewest false positives and false negatives.
So, if you are able to find such a point, which gives the minimum error, then that will give us our best model.
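The manual search over thresholds can be sketched as follows. The probabilities and labels are invented for illustration, and the list of candidate thresholds is arbitrary; when two thresholds tie, this sketch keeps the first one it saw.

```python
# Hypothetical predicted probabilities of "day" and true labels (1 = day, 0 = night).
probs  = [0.05, 0.20, 0.35, 0.45, 0.55, 0.40, 0.60, 0.75, 0.90, 0.95]
actual = [0,    0,    0,    0,    0,    1,    1,    1,    1,    1]

def errors_at(threshold):
    """Return (false positives, false negatives) at the given threshold."""
    fp = sum(1 for p, y in zip(probs, actual) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, actual) if p < threshold and y == 1)
    return fp, fn

best_threshold, best_errors = None, None
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    fp, fn = errors_at(threshold)
    print(threshold, "FP =", fp, "FN =", fn)
    # Keep the threshold with the fewest total errors seen so far.
    if best_errors is None or fp + fn < best_errors:
        best_threshold, best_errors = threshold, fp + fn

print("best threshold:", best_threshold, "errors:", best_errors)
```

A low threshold produces many false positives and a high one many false negatives, which is why the search is needed.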
However, to do such a task manually to find the point, will be really very tedious.
So, instead we can take a graph, and on it we will plot the true positive rate and the false positive rate.
What will you take? True positive rate and false positive rate.
So, we will use formulas to calculate these two rates.
So, TPR is equal to Sensitivity, which is equal to “TP divided by TP plus FN”, meaning true positives out of all that are actually positive.
So, this becomes our sensitivity.
Our second is FPR, that is False Positive Rate; you will calculate this FPR as 1 minus specificity, so to compute it we will have to use the formula.
Let me just check the false positive rate formula once.
So, this is the formula:
FP divided by FP plus TN, so here we divide by the total number of negatives.
FP is the number of false positives, TN is the number of true negatives.
A false positive means it is not actually positive; in the actual scenario it is negative.
So the denominator, FP plus TN, is the total number of actual negatives.
Hence FP divided by FP plus TN gives us the False Positive Rate.
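As a quick check of these two formulas, the sketch below applies them to the counts from the day and night example (TP 4, TN 5, FP 1, FN 1). The function names are made up for illustration.

```python
def true_positive_rate(tp, fn):
    """Sensitivity: true positives out of all actual positives."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """False positives out of all actual negatives."""
    return fp / (fp + tn)

tp, tn, fp, fn = 4, 5, 1, 1

tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
specificity = tn / (tn + fp)  # true negatives out of all actual negatives

print("TPR:", round(tpr, 3))
print("FPR:", round(fpr, 3))
print("1 - specificity:", round(1 - specificity, 3))  # same value as FPR
```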
Now, we will try to make a graph,
So every time you work with a new threshold, you will plot the TPR and FPR values that you get from that calculation on this graph.
After you have plotted them,
Then you will again take a new threshold, calculate another pair of TPR and FPR values, and plot again.
So, in this way you will have to calculate all the values by changing the threshold, and keep plotting them on this graph.
In this way, let us assume that our values will be plotted at these places on the graph.
So, one point over here, another here, then the next over here, another one here, the next one here, and one more here.
This is how, by changing the threshold, you calculate TPR and FPR each time and plot them.
Now, after you have done the plotting,
This is how the plot will look.
So, if I draw a line through these plotted points, I get a curve, and this is the ROC curve.
After this I will have to decide which threshold to choose.
So, each point on this curve corresponds to one threshold, and you can see the curve is drawn and fitted something like this,
Now, after we have drawn the line, one important thing to observe is how much area the curve is covering.
So with this curve, this much area is covered.
But if the curve rises higher, closer to the top left corner, then you will see that it covers a much bigger area.
So, obviously this area is bigger than the previous one.
So, we prefer the curve with the bigger area, and we call this area the area under the ROC curve, or AUC.
So, normally you will choose the maximum area, and on the curve a threshold towards the top left, although this can vary from scenario to scenario.
So, this is in general what we do.
So, basically the ROC curve helps us decide which threshold should be chosen, and the AUC tells us how well the model separates the classes across all thresholds.
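The whole procedure, one (FPR, TPR) point per threshold and then the area under the resulting curve, can be sketched in plain Python. The probabilities and labels are invented, and the area is computed with the trapezoid rule; this is an illustration of the idea, not the library's implementation.

```python
# Hypothetical predicted probabilities of "day" and true labels (1 = day, 0 = night).
probs  = [0.05, 0.20, 0.35, 0.45, 0.55, 0.40, 0.60, 0.75, 0.90, 0.95]
actual = [0,    0,    0,    0,    0,    1,    1,    1,    1,    1]

def roc_points(actual, probs, thresholds):
    """Return sorted (FPR, TPR) points, one per threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(actual, probs) if p >= t and y == 1)
        fn = sum(1 for y, p in zip(actual, probs) if p < t and y == 1)
        fp = sum(1 for y, p in zip(actual, probs) if p >= t and y == 0)
        tn = sum(1 for y, p in zip(actual, probs) if p < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(points)

def area_under_curve(points):
    """Trapezoid rule over consecutive (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Every distinct probability is a threshold, plus one above 1 for the (0, 0) point.
thresholds = sorted(set(probs)) + [1.1]
points = roc_points(actual, probs, thresholds)
auc = area_under_curve(points)
print("AUC:", round(auc, 2))
```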
So, now we will implement this .
So, here also I will have to increase maximum iterations to 500. Enter.
So, the value that we have got is 0.82, which is greater than 0.5 and close to 1.
So, as the AUC is close to 1, we can say the model is good.
Now, here we have logistic regression as the model, so we can change this algorithm and see,
Suppose here I copy this line, and comment out this particular line,
And from here we will import one more, so “from S K learn dot neighbors import K Neighbors Classifier.”
So, we will use this K Neighbors Classifier here.
So, in place of logistic regression we will use the K Neighbors Classifier.
This is how we will use the K Neighbors Classifier… previously, we had 0.82 something, let's see how much we get this time… This time we got 0.75.
Now, this has performed worse than logistic regression, as its score has decreased.
Whose area was more? Logistic regression's.
So, in this way you can decide which algorithm you want to use.
You will use logistic regression because it was better than this one.
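The comparison of the two algorithms by area under the ROC curve could look roughly like this. As an assumption, synthetic data is used in place of the diabetes dataset, so the scores, and possibly even the winner, will differ from the 0.82 and 0.75 seen in the session.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption).
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = {
    "LogisticRegression": LogisticRegression(max_iter=500),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

# Mean cross-validated AUC for each algorithm.
auc_scores = {}
for name, model in models.items():
    results = cross_val_score(model, X, Y, cv=kfold, scoring="roc_auc")
    auc_scores[name] = results.mean()
    print("%s AUC: %.3f" % (name, results.mean()))

# Pick the algorithm with the larger area.
best_model = max(auc_scores, key=auc_scores.get)
print("best:", best_model)
```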
Now, lastly we have a classification Report.
So, here I had already shown you that we can find accuracy, and then recall, which is TP divided by TP plus FN.
True positives are the ones that were positive and whose prediction was also positive; false negatives were predicted negative but are actually positive.
So recall is true positives divided by total actual positives.
Then we can find precision, which is TP divided by TP plus FP, so where we had used FN we now use FP.
So, this is precision.
Then the F measure combines both of them: 2 into precision into recall, divided by precision plus recall.
So, these are the details which this classification report will give us.
Ok! So you can see the report over here.
So, it is giving us precision, recall, F1 score as well as support, both for zero and one class.
So, this is the classification report, which has been imported from S K learn dot metrics, from here.
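A sketch of producing such a report is given below. It again uses synthetic data and an assumed `test_size=0.33` split in place of the session's dataset, so the numbers in the report will differ from those on screen.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption).
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=0, random_state=7)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7)

model = LogisticRegression(max_iter=500)
model.fit(X_train, Y_train)
predicted = model.predict(X_test)

# Text table with precision, recall, f1-score and support for each class.
report = classification_report(Y_test, predicted)
print(report)
```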
So, these were some metrics such as Accuracy, logarithmic Loss, and area under ROC curve.
Along with that we also saw the confusion matrix and the classification report, with which we can also understand the model's performance.
Now, practice this with different algorithms, evaluate them, and see how they perform.
So, friends, we will stop today's session here, and we will continue in the next session.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Also, our team will try to solve your query.