Namaskar, I am Kushal from LearnVern.
Last time, in the loan prediction project, we imported the libraries and the dataset and carried out preprocessing tasks on the data, such as treating missing values and scaling. We used standard scaling so that the values of the different features fall in a comparable range. After all of this, our main objective is prediction.
So you have now understood that a lot of work comes before prediction, because preparing the data is essential. Those were all preprocessing activities, and now we move ahead to classification. For classification we have different algorithms, and we can start with logistic regression. We know that logistic regression is a classification algorithm; it works on the basis of the sigmoid function, which gives values that lie between zero and one.
So here we use a threshold: if the value is below the threshold the sample belongs to one class, and if it is above the threshold it belongs to the other class. Besides logistic regression, we can also use the nearest neighbors algorithm. How does it work? It searches for the data points nearest to your data point; you can specify that you want to look at, say, the five nearest neighbors or the four nearest neighbors, and it classifies on that basis. Or you can use SVM, support vector machines, where the support vectors define the boundary that separates the classes.
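To make the threshold idea concrete, here is a minimal sketch in plain Python. The function names and the threshold of 0.5 are assumptions chosen for illustration; they are not taken from the course notebook.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else class 0.

    The 0.5 threshold is an assumed, commonly used default.
    """
    return 1 if sigmoid(z) >= threshold else 0


# Negative scores fall below the threshold, positive scores above it.
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

In logistic regression the input z would be a weighted sum of the features; here it is just a number passed in by hand.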
In this way we can try different algorithms and see which one performs better on our data set. Let us start with logistic regression. It comes from the linear model module, so we write from sklearn.linear_model import LogisticRegression. After importing it we create the classifier, lrclf, by calling LogisticRegression. If you want to give parameters here, you can: for the number of iterations there is max_iter, and because this is a demo I will keep it low, at only twenty. We can also fix the randomness by setting random_state to zero. So here I have created a logistic regression model, and now we will pass data to this model and train it.
So on lrclf we call fit directly and pass X_train, the data we split off for training. Because this is a classification algorithm, you also have to pass y_train; you pass the training input and the training output as a pair, and the model learns from them. After learning, the next step is to predict: call the predict function and pass X_test, and it gives you predictions for the test data. We store these in a variable, y_pred = lrclf.predict(X_test), so that we can compare them with y_test, which is the observed data. Now compare: where both are one, the prediction is correct, but where the prediction is one and the observed value is zero, it is wrong. So the model is making some mistakes. In fact it has given one for every sample, while the actual data is one, zero, one, zero. It may be that with so few iterations it could not learn properly, so I will increase max_iter to fifty and check again. In this case too the predictions are not right, even though we did pass the training data for both arguments of fit. So I am changing it again, to one hundred, which is the default, and checking once more. So far we are only judging this visually.
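The fit and predict steps described above can be sketched as follows. The loan data itself is not reproduced here, so the sketch uses a synthetic two-feature data set from make_classification as a stand-in; that data, the sample count and the split settings are assumptions, while lrclf, y_pred and the parameter names follow the walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed loan data (assumption, not the course data set).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# max_iter=100 is the scikit-learn default; the demo started lower and raised it.
lrclf = LogisticRegression(max_iter=100, random_state=0)
lrclf.fit(X_train, y_train)        # learn from the training input/output pairs
y_pred = lrclf.predict(X_test)     # predictions for the held-out test inputs

# Compare observed and predicted labels side by side for the first few samples.
for actual, predicted in list(zip(y_test, y_pred))[:10]:
    print(actual, predicted, "ok" if actual == predicted else "wrong")
```

On the real loan data the printed comparison would show the pattern discussed in the video, with every prediction equal to one; on this synthetic data the result will differ.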
Now we can measure this through the accuracy score, or we can generate a classification report. I will import what we need so that we can see the accuracy directly: from sklearn import metrics. From inside metrics we get the accuracy with metrics.accuracy_score. We have to pass two things to accuracy_score: first the actual values, y_test, and then the predictions, y_pred. Having executed it, you can see that the accuracy is 73 percent, which is quite low. So we are not very happy with this logistic regression model, and we will move further and try more.
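As a small illustration of what accuracy_score computes, the sketch below uses a short hand-made list of labels (hypothetical values, not the loan data) and a model that predicts one for everything, then checks the library result against a manual count.

```python
from sklearn import metrics

# Hypothetical observed labels and predictions, made up for illustration.
y_test = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]  # a model that predicts 1 for every sample

# Library call: actual values first, predictions second.
accuracy = metrics.accuracy_score(y_test, y_pred)

# The same number by hand: matching positions divided by total samples.
manual = sum(a == p for a, p in zip(y_test, y_pred)) / len(y_test)

print(accuracy, manual)
```

Accuracy alone does not reveal that such a model never predicts zero, which is why the confusion matrix is examined next.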
Let us also look at a confusion matrix here. It also comes from metrics, so we write cm = metrics.confusion_matrix and pass the same two things, y_test and y_pred. Now cm holds the confusion matrix, so let us look at its values: you can see zero and thirty-three in the first row, and zero and ninety in the second.
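The sketch below rebuilds a confusion matrix from the counts read out in the video (thirty-three actual zeros, ninety actual ones, every prediction equal to one). The label arrays are reconstructed from those counts only, as an assumption; they are not the real test set.

```python
import numpy as np
from sklearn import metrics

# Labels reconstructed from the counts mentioned in the video (assumption).
y_test = np.array([0] * 33 + [1] * 90)   # observed classes
y_pred = np.ones_like(y_test)            # the model predicted 1 for every sample

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(round(float(accuracy), 4))
```

Reading the first row: none of the actual zeros were predicted as zero, and all thirty-three were predicted as one, which accounts for every error.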
So recall the confusion matrix as we discussed it. The thirty-three here are all wrong predictions: the samples whose actual class is one are predicted correctly, but the thirty-three samples whose actual class is zero have also been predicted as one. The reason is that our model is underfit; it has not learned these features properly and gives one as the output for everything. So logistic regression has not performed very well in this case.
Before moving ahead we can look at this visually. We imported matplotlib above, so we can use plt.scatter and pass X and Y, once with y_test and once with y_pred. For X we take only one column of X_test. Calling X_test.info does not work, because X_test is not a data frame but a NumPy array, so we index it with a colon and one, X_test[:, 1], to access a single column. So X is that column and Y is y_test, and we get a scatter plot with the points on two levels, zero and one. We also set the colors: scatter has the parameters c and cmap, and c is a list of colors, so we put c = y_test. Since it has only two values, two colors are displayed. This is the plot of the observed data.
In the same manner we make a plot of the predicted data below: plt.scatter with the same X, this time with y_pred for Y and c = y_pred, so the colors follow y_pred. Here you can see that all the points are at one; it has classified the zeros as one too. So logistic regression has not performed as well as we wanted, and we can move forward and check the neighbors algorithm and see its accuracy as well.
For KNN we write from sklearn.neighbors import KNeighborsClassifier and create an object named knnclf. In the brackets of KNeighborsClassifier we mention the number of neighbors; generally an odd number is taken, and five is a good number to go with. The metric is Minkowski by default, and I am leaving that and the rest of the parameters unchanged. Now we train with knnclf.fit, passing X_train and y_train, and then call knnclf.predict with X_test. You can see the predictions it has displayed, and I am happy that this time they are at least a mix of both classes, so I expect better accuracy than before. Let us store them, y_pred = knnclf.predict(X_test), and reuse the plots: plt.scatter with X as before, y_test for Y and c = y_test gives the observed plot, which is unchanged, and plt.scatter with y_pred for Y and c = y_pred, not y_test, gives the predicted plot. Comparing the two and counting the points, you can see that in the middle it has got something wrong, so there are some mistakes, but it looks better than the previous model.
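A minimal sketch of the KNN steps just described, again on a synthetic stand-in data set because the loan data is not available here. The data, sample count and split settings are assumptions; knnclf, n_neighbors=5 and the default Minkowski metric follow the walkthrough.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed loan data (assumption).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Five neighbors; metric is left at its default, "minkowski" (p=2, i.e. Euclidean).
knnclf = KNeighborsClassifier(n_neighbors=5)
knnclf.fit(X_train, y_train)
y_pred = knnclf.predict(X_test)

# Accuracy and confusion matrix for the test predictions.
accuracy = metrics.accuracy_score(y_test, y_pred)
cm = metrics.confusion_matrix(y_test, y_pred)
print(accuracy)
print(cm)
```

An odd number of neighbors avoids tied votes in a two-class problem, which is one reason five is a common starting point.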
Now let us take out the metrics here too. With metrics.accuracy_score, passing y_test and y_pred, we get a score of sixty-two percent. So although it did distribute the samples across the two classes, zero and one, its accuracy is even lower, because it must have classified some of them wrongly. How do we find that out? From the confusion matrix: metrics.confusion_matrix with y_test and y_pred. At first I get an error that the name is not defined, because the spelling is a bit wrong, so we correct it to metrics. Now you can see in cm that the first value, fourteen, is correct and sixty-three is correct, while twenty-seven and nineteen are wrong. That is more mistakes than logistic regression made: there only thirty-three were wrong, but here twenty-seven plus nineteen, that is forty-six, are wrong. It did identify zero as zero for fourteen samples, which is better, but the overall accuracy is worse, so this algorithm is not very beneficial for us either. It may be that we have to do something about our data; at times the data is imbalanced or does not contain enough samples. But for the amount of data we have, we will for now go with whichever algorithm gives the best accuracy on this particular data. So let us apply one more algorithm and then take the final decision. Applying three algorithms is good enough, though you can try more, and you can use ensembles here as well.
The next one I had listed was the support vector machine. SVM comes from sklearn.svm, which is a complete package, and since we are doing classification we need SVC: from sklearn.svm import SVC. Then we create an object of it. Do you need to specify anything in SVC? The defaults would work, but it is worth knowing what they are: by default the kernel function is RBF, the degree is three, and random_state is None. I am setting random_state to zero and keeping the rest at their defaults; we can change them later and see. Let us call this model svcrbf. The other model we can take is linear, svclin: SVC with a linear kernel, and random_state again set to zero. So we have made two support vector classifier models, one with the RBF kernel and one with the linear kernel, and now we can train both. First svcrbf.fit with X_train and y_train, and likewise svclin.fit with X_train and y_train. Both are now trained.
Next, y_predr, with an r after pred for RBF, is svcrbf.predict(X_test), and in the same way y_predl, with an l for linear, is svclin.predict(X_test). So we have predictions from both, and now let us check the accuracy of each. First the RBF SVM: metrics.accuracy_score with y_test and then the predictions, y_predr. The accuracy of the RBF model is seventy-three percent, the same as logistic regression. For the linear SVM, again metrics.accuracy_score with y_test, the observed values, and y_predl; its accuracy is seventy-three point one percent, exactly the same. So whether you use RBF or linear, you get the same accuracy. The highest accuracy so far is therefore seventy-three percent. Checking logistic regression once more, its accuracy is seventy-three point one seven percent. So whether you use logistic regression or SVM, both have the same accuracy, and we can use either.
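The two SVM models can be sketched side by side as below. The synthetic data is an assumption standing in for the loan data; svcrbf, svclin, y_predr and y_predl follow the names used in the walkthrough.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed loan data (assumption).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Two classifiers that differ only in the kernel function.
svcrbf = SVC(kernel="rbf", random_state=0)     # "rbf" is also the default kernel
svclin = SVC(kernel="linear", random_state=0)

svcrbf.fit(X_train, y_train)
svclin.fit(X_train, y_train)

y_predr = svcrbf.predict(X_test)
y_predl = svclin.predict(X_test)

acc_rbf = metrics.accuracy_score(y_test, y_predr)
acc_lin = metrics.accuracy_score(y_test, y_predl)
print("rbf:", acc_rbf)
print("linear:", acc_lin)
```

On the loan data both kernels reached the same accuracy in the video; on other data the two can differ, so it is worth checking both.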
In this manner we apply the different algorithms we have, obtain predictions from each, and check their accuracies; we also check graphically, by visualizing how each one is predicting. Having done this, you can now answer the problem statement of whom to give a loan and whom not to: about seventy-three percent of the time you will be correct, and the rest of the time mistakes may occur, which is natural. If we collect more data we will be able to train the algorithm better, and we might get a better accuracy.
Before concluding, as an experiment, I will show you what more you can do. We did the train test split in the beginning; we will change that setup a little and observe the result once. This gives us a better idea of how the splitting, the scaling and the PCA affect things. After scaling we had computed PCA, and it may have lost some information there. So instead of two components we can take three; I am taking three components in PCA, in case too much information is being lost with two. I am changing one more thing: in the train test split we had kept the test size at twenty percent, and I will now keep it at only fifteen percent. Then I do Restart and Run All, so that everything executes again from scratch. Do we see any change from these two modifications? We had said to change the hyperparameters and observe; if the changes you observe are positive, the modification was a good one. So we can also see what accuracy this kind of tuning gives us. Watch the output from the beginning: the encoder runs, then PCA, where you can see that I now have three components, and then logistic regression, where the accuracy is seventy percent. It has decreased, meaning the previous setting was better. Never mind; moving further, the accuracy of KNN is sixty-two percent, which is again quite low, and the accuracy of the support vector machine has also reduced, to sixty-nine percent. So with three principal components the accuracy has gone down, and when tweaking you should stop at the setting that gives you the maximum. From what we have seen so far, the earlier setting was good: twenty percent of the data kept for testing, the remaining eighty percent for training, and two principal components.
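The experiment above, changing the number of PCA components and the test size and then comparing accuracies, can be sketched as a small loop. Everything about the data is an assumption (a synthetic eight-feature set), and fitting the scaler and PCA inside a pipeline on the training split is a choice made for this sketch, not necessarily the order used in the course notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with more features than PCA components (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=2, random_state=0
)

results = {}
for test_size in (0.20, 0.15):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    for n_components in (2, 3):
        # Scale, reduce to n_components, then classify.
        model = make_pipeline(
            StandardScaler(),
            PCA(n_components=n_components),
            LogisticRegression(random_state=0),
        )
        model.fit(X_train, y_train)
        # score() returns the accuracy on the held-out test split.
        results[(test_size, n_components)] = model.score(X_test, y_test)

for setting, acc in sorted(results.items()):
    print(setting, round(acc, 3))

# Keep the setting that gives the highest test accuracy.
best = max(results, key=results.get)
print("best setting:", best)
```

With a single split the differences between settings can be small and noisy, so a result like this is a hint rather than proof that one setting is better.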
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to solve your query.