Hello friends, I am (name) from LearnVern ( 6 seconds pause, music)
In the continuation of the last session of machine learning we will move further in this session.
In today's practical we will see a KNN example. This example is of social media ads, which we previously saw with logistic regression also.
So, first I will import the dataset, and then we will talk more about it.
This dataset is of social media ads. In this dataset we will see if a person buys a car after watching an advertisement.
So, let me introduce you to the dataset. First, we import the “pandas” library.
import pandas as pd
Here, dataset is equal to pd.read_csv.
Inside the brackets we mention the path of our file.
Let's see what our dataset looks like.
So, this is our dataset. You all must be able to see it on the screen. Counting one, two, three, four, five, we have a total of five columns: user id, gender, age, estimated salary, and purchase.
Along with this we have 400 records. As you can see, serial no. 0 is user id 15624510, a male aged 19 years. He has an estimated salary of 19000 and he has not purchased a new car.
Similarly, if we see serial no. 395, user id 15691863 is a female aged 46 years. Her estimated salary is 41000 and she has purchased a car. So, this is our data.
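The loading step described above can be sketched as follows. This is only a minimal sketch: the real file path is not given in the session, so the example reads a two-row stand-in from memory, built from the two records just described. The column names are assumptions based on the spoken description, not confirmed names from the file.

```python
import io

import pandas as pd

# Stand-in for the CSV file; the real path is not given in the session.
# Column names are assumed from the spoken description.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
15624510,Male,19,19000,0
15691863,Female,46,41000,1
"""

# With a real file this would be: dataset = pd.read_csv("<your path>")
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)  # (rows, columns)
print(dataset.head())
```

With the full file, the shape printed here would show 400 rows and five columns.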
Now, what is our objective?
Our objective is to find out if a person will buy a new car or not (10 seconds pause)
As you already know, buying a new car is a big thing for a common man, and there are various factors behind it. But here we are not considering all the factors. Here we are taking only the age and the estimated salary.
To extend it, we can consider more factors also, e.g., the number of family members, whether they have a partner, whether their partner is working, whether they already own a car. So, there are many factors behind the decision to buy a car.
So, you can add various other features and enlarge your dataset with multiple features.
Now, let's move forward with these features and this dataset. Moving ahead, we first work on its pre-processing.
Now, we require some other libraries, so we import numpy as np, and then we import matplotlib.pyplot as plt. We have already imported pandas.
Let's now divide the data into x and y. So, x is equal to dataset.iloc; here we take all rows from beginning to end, and after that we only need the columns at positions 2 and 3 (counting from zero), which are age and estimated salary. Then .values converts it into a plain NumPy array. So in this way we created x.
Now, similarly we will extract y also from the dataset. So, y is equal to dataset.iloc again; inside the brackets we use a colon to fetch all records, and we need only the column at position 4, the purchase column. In the end we use .values to take it in the form of a NumPy array.
So, in this way our x and y are ready.
If you want to have a look, you can display x and see that it has values in two columns. You can also type x.shape and see that it has 400 rows and two columns. So, the number of rows exactly matches the original dataset.
Now, you can see y; this is our y, and I will also show you y.shape, which is 400. It is a one-dimensional array, which you can think of as one column with 400 rows or, if you prefer, one row with 400 entries.
Simply, it has 400 values.
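The extraction of x and y can be sketched like this. The dataframe below is a random stand-in with the same five-column layout, because the real file is not available here; the column names and the random values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same five-column layout (names assumed).
rng = np.random.default_rng(0)
n = 400
dataset = pd.DataFrame({
    "User ID": np.arange(n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 61, size=n),
    "EstimatedSalary": rng.integers(15000, 150001, size=n),
    "Purchased": rng.integers(0, 2, size=n),
})

# Columns at positions 2 and 3 (age, estimated salary) are the features.
x = dataset.iloc[:, [2, 3]].values
# The column at position 4 (purchase) is the target.
y = dataset.iloc[:, 4].values

print(x.shape)  # (400, 2)
print(y.shape)  # (400,)
```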
Now, we will move further, and our next step is to split the dataset.
So, from sklearn.model_selection we import train_test_split.
Now with the help of the same we will divide this dataset.
Firstly, we will have x_train, x_test, and then y_train, y_test. So, we have these four sets. As you know, x_train and y_train are one pair of input and output, and similarly x_test and y_test are the second pair of input and output.
So, we have a total of two pairs of input and output.
We use x_train and y_train during the training of the model. That is, when the model is learning, we train it with this data. That's why we have named it x_train: it is used for training purposes.
If you see x_test and y_test, they are named like this because we use them for testing the model, i.e., to know the performance of the model and what output it gives. So, we use them for the test.
So, here we call train_test_split and pass the original x and y. Apart from that, we tell it the size of the test data through test_size. This is given as a fraction. If we write 0.25, it means 25% of the data should be test data. So, if there are 100 records, 25 of them go to the test set; similarly, if we have 400 records, 25% comes out to be 100.
So, it will separate 100 records for testing and it will not be used in training. Ok!
So, I will take 25%. Next is random_state. We take random_state as 0 and execute it. As you can see here, this is our x_train.
Let's have a look at x_train.
Now we will see the shape with x_train.shape. The shape tells you how many records it holds.
It should have the remaining 75% of the data, which is 300 here. The same thing you can check with x_test.shape. It gives only 100 records, and why only 100?
Because 100 is 25% of 400. As you can see here, the output is 100 comma 2.
So, now our data is divided. We can see y_test and y_train also. This is our y_test and this is our y_train. They are in the form of 0, 1, 0, 1. One good thing about this data is that there is no need for encoders, because the columns we use are already in numeric form. This saves us part of the pre-processing.
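The split described above can be sketched as follows, using random stand-in arrays of the same size as the real x and y (the values themselves are made up). It shows how 400 records become 300 training and 100 test records with a test size of 0.25.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the real features and labels (values are made up).
rng = np.random.default_rng(0)
x = np.column_stack([
    rng.integers(18, 61, size=400),         # age
    rng.integers(15000, 150001, size=400),  # estimated salary
])
y = rng.integers(0, 2, size=400)

# 25% of the rows go to the test set; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)  # (300, 2) (100, 2)
print(y_train.shape, y_test.shape)  # (300,) (100,)
```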
Next we will scale the data and pass it to the algorithm. For scaling we will use the standard scaler, so from sklearn.preprocessing we import StandardScaler.
So, we imported the standard scaler from here. After importing it, we create an object of StandardScaler with any name of our choice; for example, here sc is equal to StandardScaler().
So, we created an object here. What is the use of a standard scaler? It rescales each column to a mean of zero and a standard deviation of one, so widely spread values are brought into a limited range. Now, you will see, we will get values in a limited range, roughly like minus one to one.
Not exactly between minus one and one, but we will get values in a limited range.
Here, I want to know the maximum and minimum values in x_train, so I type x_train.summary. But x_train is a NumPy array, and a NumPy array has no summary method. (7 seconds pause)
So, before we transform our data with the standard scaler object, I will show the original dataset instead. I type dataset.describe().
With dataset.describe() you can see what kind of values we have. For age, the minimum is 18 and the maximum is 60. For estimated salary, the minimum is 15000 and the maximum, counting tens, hundreds, thousands, ten thousands, goes into lakhs. So the two columns are on very different scales, one in tens and the other from thousands to lakhs. So, we shrink them with a scaler and bring them to a comparable range.
We use sc.fit_transform here, and put our data in it. For the purpose of explanation, I put the complete dataset in it and execute it.
There is a problem with this. As you can see, there are Male and Female values in the gender column, which are text, so it will not work on the whole dataset. So, we will use iloc. We have to leave out the user id and gender columns and apply it only on age and estimated salary.
So, here we use iloc: we require all rows but only the columns at positions 2 and 3.
As you can see, it converted and changed the data like this. If I take the same iloc selection outside, on its own, and execute it, what data will we get?
As we can see, we get the data in the form of a dataframe, with values like 19 and 35. See the range; it changes completely after scaling. We store the scaled result in dtemp. Now, first I will show you of what type dtemp is.
Its type is ndarray, so it is numpy.ndarray. Ok!
So, I first convert it into a dataframe with pd.DataFrame(dtemp).
After converting it into a dataframe, we try to understand it. As you can see, the 0th row now has the values -1.7 and -1.4, right?
Now, I add .describe() here. What will happen with this?
As you can see now, the minimum value is very low and the maximum value is also low. So, the range has shrunk: it now goes from about -1.8 to 2.3. This is the purpose of the standard scaler: to bring all features onto a comparable scale. KNN works with distances between points, so without scaling, the salary column with its much larger numbers would dominate the distance. So, we use a standard scaler.
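What the standard scaler does to each column can be checked by hand. The sketch below uses made-up age and salary values (an assumption, not the real data) and compares StandardScaler with the manual formula: subtract the column mean, then divide by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and salary columns on very different scales.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.integers(18, 61, size=400),
    rng.integers(15000, 150001, size=400),
]).astype(float)

sc = StandardScaler()
scaled = sc.fit_transform(features)

# Manual z-score: (value - column mean) / column standard deviation.
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))               # the two agree
print(scaled.mean(axis=0), scaled.std(axis=0))   # about 0 and exactly-ish 1
print(scaled.min(axis=0), scaled.max(axis=0))    # small range, not fixed to -1..1
```

After scaling, both columns have mean zero and standard deviation one, so neither dominates a distance calculation.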
I showed it with the dataset for a demo. Now, I apply the same to x_train and x_test. Ok!
So, x_train is equal to sc.fit_transform(x_train).
Similarly, x_test is equal to sc.transform(x_test). Here we only transform, so that the test data is scaled with the mean and standard deviation learned from the training data.
So, in this way, our x_train and x_test are both scaled. As you can see here, this is our x_train and this is our x_test.
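A minimal sketch of scaling the two splits. It fits the scaler on the training rows and then calls transform (not fit_transform) on the test rows, which is the usual practice so that the test data is scaled with the training mean and standard deviation. The data is a random stand-in, and the names x_train_scaled and x_test_scaled are chosen here only to keep the unscaled arrays visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for age and estimated salary (values are made up).
rng = np.random.default_rng(0)
x = np.column_stack([
    rng.integers(18, 61, size=400),
    rng.integers(15000, 150001, size=400),
]).astype(float)
y = rng.integers(0, 2, size=400)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learns mean and std from training rows
x_test_scaled = sc.transform(x_test)        # reuses the training mean and std

print(x_train_scaled[:3])
print(x_test_scaled[:3])
```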
So we are done with scaling, now what do we have to do?
Now, we need to apply the KNN model. How will we use it? See, here we type from sklearn dot... and what comes next? From where will the KNN come? From neighbors, so from sklearn.neighbors we import KNeighborsClassifier. Note the American spelling, neighbors, in the library.
Now, we will create an object of KNeighborsClassifier. So, we give it a short name: knn is equal to KNeighborsClassifier, and here in brackets we need to mention how many neighbors there should be. Normally we go with five neighbors. So, I type n_neighbors is equal to 5. The rest of the parameters I will keep as they are. So, knn becomes an object, or model, of our algorithm.
Now, we need to train our model. What do we have to do for training? We use fit, right? So, here knn.fit(x_train, y_train). We need to pass both x_train and y_train. Now our model is trained and we can do prediction.
So, here I call knn.predict, pass the x_test data, and store the result in y_pred. So, the predictions from KNN will be stored in y_pred.
Let's see what y_pred is. So, this is our y_pred.
We will compare the values in y_pred with the values in y_test. You can check what values are in y_pred and y_test.
See these values; you can check manually as well. As you can see, here 1, 1, 1 are the same, and in the beginning also 0, 0, 0. It has predicted well.
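The training and prediction steps can be sketched end to end as below. The data is synthetic, and the rule used to create the labels is made up purely so that there is something to learn; it is not the real purchase data, so the predictions will not match those shown in the session.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; the labelling rule below is invented for illustration.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=400)
salary = rng.integers(15000, 150001, size=400)
x = np.column_stack([age, salary]).astype(float)
y = ((age > 40) | (salary > 90000)).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Five neighbours, all other parameters left at their defaults.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Compare the first few predictions with the actual labels by eye.
print(y_pred[:10])
print(y_test[:10])
```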
So, we find its accuracy and also visualise it. For accuracy, I type from sklearn.metrics import accuracy_score, so we are importing the accuracy score. We can import the confusion matrix also, but I will start with the accuracy score here. We display it with print: first the text Accuracy =, then a comma, then accuracy_score, and here we pass both the actual values y_test and the predicted values y_pred.
We need to pass both of these.
Accuracy comes out to be 93%. If you remember, last time we used logistic regression for this and got 90% accuracy. Here we got 93%, so on this test split KNN is the better model and we would prefer it.
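Accuracy is simply the share of test records where the prediction equals the actual value. The small sketch below uses ten made-up labels (not the session's data) to show that accuracy_score gives the same number as counting matches by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Ten made-up actual and predicted labels; they differ at two positions.
y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# By hand: fraction of positions where the two arrays agree.
manual = (y_test == y_pred).mean()

acc = accuracy_score(y_test, y_pred)
print("Accuracy =", acc)  # 8 of 10 correct
```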
Now, let's do the confusion matrix also. So, from sklearn.metrics import confusion_matrix.
So, this is the confusion matrix function. Now we can print its result or store it separately, so cm is equal to confusion_matrix, and here I pass y_test and y_pred; on the basis of these the confusion matrix is formed.
Now, we can see the confusion matrix. We have 64 and 29 as correct predictions, and 3 and 4 as wrong predictions, which is fewer than with the logistic regression we did in the previous sessions. OK!
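The link between these four counts and the 93% accuracy can be checked with a small sketch. The label arrays below are constructed by hand to reproduce the counts 64, 29, 3 and 4. Which of the two wrong counts is a false positive and which is a false negative is an assumption here, since the session only says that 3 and 4 are wrong predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-built labels: 64 correct zeros, 4 zeros predicted as one,
# 3 ones predicted as zero, 29 correct ones (the 4/3 placement is assumed).
y_test = np.array([0] * 64 + [0] * 4 + [1] * 3 + [1] * 29)
y_pred = np.array([0] * 64 + [1] * 4 + [0] * 3 + [1] * 29)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Correct predictions sit on the diagonal: (64 + 29) / 100.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.93
```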
Now, we will try to visualise it. I will show you a demo of a better visualisation. Here we have already imported matplotlib. Since we need different colours, we import them from matplotlib.colors. So, from matplotlib.colors import ListedColormap. This is done.
Now we will build x1 and x2, which are grids over the two variables we have. Here, we will use np.meshgrid, and how will we utilise it?
We will first use np.arange to lay out the range of values over which we want to present our data. So, in np.arange we give a start value. From where will we start? Let's type it: start is equal to X_set, and here we select all rows using a colon and select the 0th column.
We will make it start from the minimum.
So, I type start is equal to X_set[:, 0].min() - 1; we start one less than its minimum so that everything is clearly visible. Alright!
After the start, we will mention the stop. (10 seconds pause)
So, I type stop, and be careful about the brackets here: stop is equal to X_set[:, 0].max() + 1. So we add one to the max. OK!
This is done.
Last, we type step is equal to 0.01, which is the interval between neighbouring grid values. This is done now.
In continuation of this we give a comma and press enter.
Now, the second argument is again np.arange. We write the same as the line above, but this time we select the 1st column instead of the 0th column.
Not much differs, so I copy the full line.
So instead of zero, here we take one, in both the start and the stop. The minus one, the plus one and the rest of the things remain the same.
So, this much is done. Now, we first need to define X_set here. What is X_set? I define X_set and, similarly, y_set; they are actually x_train and y_train. Let me just cross-check these brackets. I remove this extra bracket and re-execute it.
Since X_set is in capital letters, I write it in capital letters in the rest of the places also. OK.
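The grid construction can be sketched as follows. X_set here is a random stand-in for the scaled training features (an assumption, since the real arrays are not available), and grid_points is an extra name introduced in this sketch to show how the grid could be flattened into rows that a model could score.

```python
import numpy as np

# Stand-in for the scaled training features (two columns).
rng = np.random.default_rng(0)
X_set = rng.normal(size=(300, 2))

# One axis per feature, padded by 1 on each side, with a step of 0.01.
x1, x2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01),
)

print(x1.shape, x2.shape)  # both grids have the same shape

# Each grid cell as one (feature 0, feature 1) row.
grid_points = np.column_stack([x1.ravel(), x2.ravel()])
print(grid_points.shape)
```

x1 holds the first feature's value at every grid cell and x2 holds the second feature's value, so together they cover the whole plotting area.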
So, our x1 and x2 are initialised and now we can perform visualisation. We will start with the training data, and after that we will visualise the test data. (7 seconds pause) After doing both of these we will be able to understand.
Now, here, we first take x and y, which means our x-coordinates and y-coordinates. This is for the training data, so for x I use x_train and select all rows and the 0th column. Along with that, our y will be x_train with all rows and the 1st column. So, these are our x and y.
Now, I display it using plt.scatter, so plt.scatter(x, y). This is our plot of x and y. Here, I will do one more thing: we will select colours. On what basis? I am doing it for the training data, so c is equal to y_train, and inside scatter I pass c is equal to c. In this way we get colours according to y_train. So, let's execute it again and see.
As you can see, this is our distribution on the basis of the training dataset. Now, after the training dataset, we will see the test dataset: how the model did when we performed testing.
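The training scatter plot can be sketched as below. The arrays are random stand-ins for the scaled x_train and for y_train, with a made-up labelling rule, and the axis labels are assumptions about what the two columns hold.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the scaled training features and labels (made up).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(300, 2))
y_train = (x_train.sum(axis=1) > 0).astype(int)

x = x_train[:, 0]  # all rows, 0th column
y = x_train[:, 1]  # all rows, 1st column
c = y_train        # colour each point by its actual class

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=c)
ax.set_xlabel("age (scaled)")
ax.set_ylabel("estimated salary (scaled)")
ax.set_title("Training data coloured by y_train")
# plt.show()  # uncomment in an interactive session
```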
Here also, our x and y are built the same way, OK, but with x_test in both places. So, x is equal to x_test with all rows and the 0th column, and similarly y is x_test with all rows and the 1st column.
And what will the colours c be in this case? The colours will be y_pred.
As you can see, what is the difference between these two? In the earlier plot I did the colouring on the basis of the y_train output, but here we are colouring on the basis of the predicted output.
After this, I can plot it. So, plt.scatter with x and y, and c is equal to c. And we can see here, this is our plot. As you can see, it is very clean and clear. We can see two distinct groups of points, but these are not clusters; rather, they are two classes, because this is a supervised approach. Since the two classes are clearly separated, the classification has been done nicely. And we already saw that its accuracy was 93%.
So, we take them side by side. I will execute this one also here so that we can interpret them together. That plot was coloured by y_pred. Now I make another change: I set c is equal to y_test and again use plt.scatter with x and y and c is equal to c. In this way we can understand the difference. Previously we compared training and test data, but for the test data we saw only the predictions. Now, y_test holds the actual observations for the test data, so we plot on that basis also, and we can compare the two visually.
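One way to put the two test plots next to each other is sketched below. The data is synthetic with some label noise (an assumption, so that a few mistakes can occur), and circling the misclassified points in red is an addition of this sketch, not something done in the session.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, already scaled-looking features with a noisy made-up label rule.
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 2))
noise = rng.normal(scale=0.5, size=400)
y = (x[:, 0] + x[:, 1] + noise > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
y_pred = knn.predict(x_test)

fig, (ax_pred, ax_true) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
ax_pred.scatter(x_test[:, 0], x_test[:, 1], c=y_pred)
ax_pred.set_title("Test data coloured by y_pred")
ax_true.scatter(x_test[:, 0], x_test[:, 1], c=y_test)
ax_true.set_title("Test data coloured by y_test")

# Circle the points where prediction and actual value disagree.
wrong = y_pred != y_test
ax_true.scatter(x_test[wrong, 0], x_test[wrong, 1],
                facecolors="none", edgecolors="red", s=120)
print("misclassified test points:", int(wrong.sum()))
# plt.show()  # uncomment in an interactive session
```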
As you can see, there are minor errors in the predictions, but they are reasonable since the model is generalised. This one was predicted yellow but it is blue in our actual observations. Actually, this actual observation looks like an outlier, so we could remove it. And for these two, we can say that they are ambiguous, because they clearly overlap and lie on the margin.
So yes, it is acceptable that the model makes a mistake there.
This one is an outlier, and this one is also a bit more on the other side, so predicting them wrongly is part of being generalised. If the model tried to predict these accurately too, we would basically say that the model is overfit, so overfit that it cannot generalise. When a model starts overfitting, it accurately predicts even outliers like these. But when you give it new, real-world input, it needs to predict data it has not seen, and challenges arise in that case. So, in this way we can interpret by visualising it.
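The overfitting point can be explored with a small experiment, sketched here under assumptions: the data is synthetic with noisy labels, and comparing one neighbour against five is a choice made for this sketch, not something done in the session. With a single neighbour, every training point is its own nearest neighbour, so the model reproduces even the noisy training labels exactly; how it then does on the test split is what tells you whether it generalises.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features with a noisy made-up label rule, so some points
# behave like the outliers discussed above.
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 2))
noise = rng.normal(scale=0.5, size=400)
y = (x[:, 0] + x[:, 1] + noise > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

scores = {}
for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # (training accuracy, test accuracy) for this value of k
    scores[k] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"k={k}: train={scores[k][0]:.2f}, test={scores[k][1]:.2f}")
```

A large gap between training and test accuracy is the usual sign of overfitting.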
You can watch this video again, it is not a long video. Look back and try to interpret. Whatever dataset you are working with you must work on interpretation. It is the most important part.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect to fellow learners and discuss the course. Also, Our Team will try to solve your query.
So, friends, we will stop this session now and will continue in the next session.