Hello, I am Kushal from LearnVern.
Today we will be studying in continuation with the previous session of machine learning.
In today's session we will see the decision tree - ID3 algorithm practically.
Here, we calculate the impurity using entropy.
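As a reminder of what calculating impurity with entropy means, here is a minimal sketch in plain Python. The function names `entropy` and `information_gain` and the tiny label lists are made up for illustration; they are not part of the session's notebook.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: 0 = not purchased, 1 = purchased.
mixed = [0, 0, 1, 1]   # evenly mixed node, the most impure case for two classes
pure = [1, 1, 1, 1]    # only one class present, no impurity

print(entropy(mixed))
print(entropy(pure))
# A split that separates the two classes perfectly removes all the impurity.
print(information_gain(mixed, [[0, 0], [1, 1]]))
```

A tree that uses entropy prefers, at each node, the split with the highest information gain.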
For practical purposes the data that we are going to use will be a social network ads dataset.
Firstly I will add this data set and along with that I will copy the path of this data set.
So in the beginning we will check this data set through importing pandas.
We write import pandas as pd.
So our data will become a data frame: dataset is equal to pd dot read underscore csv, and inside it we paste the path that we copied.
So we took the help of read underscore csv, and here we load the data set.
Now you can see this is our data set.
So in this data set there are 400 rows and 5 columns.
In this we have user ID, gender, age, estimated salary and purchase.
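The loading step can be sketched as below. The real file path is not given in the session, so this sketch reads a few made-up rows from an in-memory string instead; the exact header names (`User ID`, `Gender`, `Age`, `EstimatedSalary`, `Purchased`) are an assumption about the CSV.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; the header names are assumed.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1001,Male,19,19000,0
1002,Female,35,20000,0
1003,Female,26,43000,0
1004,Male,47,25000,1
"""

# In the notebook you would pass the copied file path to pd.read_csv instead.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)    # (rows, columns)
print(dataset.head())   # first few records
```

With the real data set, `dataset.shape` should report the 400 rows and 5 columns mentioned above.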
So there is a new feature over here, which I will show you: if you click on it, you will get an interactive table.
Now you can see that we have got an interactive table over here.
Firstly we will understand this data set.
You can see that it is written to show 25 records over here.
I will make it show 10 records for my convenience.
So you can see over here that we have index value, user ID, gender, age, estimated salary and purchase as columns.
You can see we have these many columns.
So we know here that 0 means that it is not purchased and one means that it has been purchased.
This data is basically giving us information on whether a consumer has purchased the car or not by viewing an ad on social media.
So we saw that wherever there is 0 the person has not purchased the car and wherever there is one the person has purchased the car.
So, if we observe carefully, we will notice that those with zero also have a much lower salary than those with one, who have a high salary.
So maybe because of the salary they are not able to purchase a car.
Also, if you look at the age column, it is possible that age also affects the car purchase, as the female candidate here is older than the others.
So in this way it is possible that we can bring out an estimation combining age and salary which will help in predicting if the individual will purchase the car or not.
So this is our example and we will solve this here.
As I have already told you, this is an interactive table, so we can sort it by index, from big to small or small to big. Similarly, you can sort by user ID.
The same goes for the other columns, such as gender.
Similarly, if you want to see only the candidates who have purchased, you can sort on the purchase column.
You can click on two to see further records, and so on.
So, let us move ahead and work upon this data set.
So as you all know that as soon as we have loaded the data thereafter we have to prepare the data through pre-processing and make it ready for the algorithm.
So our data will be in two parts one will be X and the other will be y.
So what should we do in the x part?
In X, we will put those columns that help us get the output. If you look at the first column, user ID, it is not going to help us know who will purchase. It is evident that user ID is not a parameter on the basis of which a user decides whether to buy the car or not, right?
So the relevant parameters are basically age and estimated salary. We can consider gender as well, but it does not have much to do with the purchase. So gender is not that important, but if you want to add it, you can; it's up to you.
Because after a certain age it is possible that anybody can want to switch from a two wheeler to a more comfortable option of a four wheeler.
This can be because of a status in society and for the comfortability requirement due to age as well.
So age can be one factor and second can be salary.
Similarly with the salary: only if a person has the budget will he or she be able to buy a car, with cash or on loan.
The salary should be enough to repay the loan; only then will you buy a car.
So we will take both age and salary.
So here X is equal to dataset dot iloc. We put a colon first, as we want all the records from 0 to 399, and for the columns we pick only the ones at index 2 and 3.
Counting 0, 1, 2, 3, 4, we pick only two and three, because these are the only relevant columns.
Next, for y, we again use dataset dot iloc. Here we put a colon, then a comma, and pick only column 4 as the output.
In this way, we have this as input and this as output.
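The input and output selection described above might look like this. The small data frame is made up so that the sketch runs on its own, and its column names are an assumption; only the `iloc` positions (2 and 3 for the input, 4 for the output) come from the session.

```python
import pandas as pd

# Made-up stand-in for the loaded data set (column names are assumed).
dataset = pd.DataFrame({
    "User ID": [1001, 1002, 1003, 1004],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# All rows, columns at positions 2 and 3 (age and estimated salary) as input.
X = dataset.iloc[:, [2, 3]]

# All rows, column at position 4 (purchased) as output.
y = dataset.iloc[:, 4]

print(X.shape, y.shape)
```

Note that `iloc` selects by position, counting from zero, so user ID is column 0 and gender is column 1.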
I hope you are able to understand me.
So in this way we have at least divided our data between input and output and now we will move ahead.
So there are some more steps for preprocessing. We will perform that.
So firstly, we have the entire data in X and y as input and output respectively, so we need to split this for testing and training.
We have 400 records. Of these, we will keep 25% for testing and use the rest for training.
For that, we will split the data into train and test.
So, we will write from sklearn dot model underscore selection import train underscore test underscore split.
So here, we have X underscore train, then X underscore test, then y underscore train and then y underscore test.
So we will receive a total of 4 data sets, that is, two pairs of input and output data.
We call train underscore test underscore split and pass in X and y. Then we give the test size as 0.25, so the test set will be 25%, and at last we give random underscore state is equal to 0.
Now you can see with X underscore train dot shape that our training data has 300 rows and 2 columns.
Similarly, with X underscore test dot shape, our test data should have 100 rows, because 25% of 400 is 100 and the remaining 300 go to training.
So we have X train and Y train as one set and X test and y test as the other set.
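Here is a small sketch of the split, assuming 400 records with two input columns. The random numbers only stand in for the real age and salary values; the `test_size=0.25` and `random_state=0` arguments are the ones used in the session.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the session's X and y.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 61, size=400),          # hypothetical ages
    rng.integers(15000, 150001, size=400),   # hypothetical salaries
])
y = rng.integers(0, 2, size=400)             # hypothetical 0/1 labels

# 25% of the records go to the test set, the remaining 75% to training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same records land in the test set on every run.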
Moving ahead, we will scale this. As we have discussed earlier, scaling helps in easier computation.
So from sklearn dot preprocessing we will import StandardScaler.
And for this we will create an object.
sc is equal to StandardScaler, so this is an object.
With the help of this object, we write X underscore train is equal to sc dot fit underscore transform, and here again we pass X underscore train.
Now if you will see your x train should be scaled down.
So you can see that it has been scaled down.
Similarly we will do it for the X test.
So X underscore test is equal to sc dot fit underscore transform, and here we pass X underscore test.
Here, also we will display X test and see.
So, these values have also been minimised or basically scaled down over here.
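A minimal sketch of the scaling step follows, with made-up numbers. One deliberate difference from the session: the session calls `fit_transform` on the test data as well, while this sketch calls `transform`, which is the more common practice because it reuses the mean and standard deviation learnt from the training data. The names `X_train_scaled` and `X_test_scaled` are chosen here for clarity.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and salary values standing in for the real split.
X_train = np.array([[19, 19000], [35, 20000], [26, 43000], [47, 25000]], dtype=float)
X_test = np.array([[30, 30000], [52, 90000]], dtype=float)

sc = StandardScaler()

# Learn each column's mean and standard deviation from the training data, then scale it.
X_train_scaled = sc.fit_transform(X_train)

# Scale the test data with the statistics learnt from the training data.
X_test_scaled = sc.transform(X_test)

print(X_train_scaled.mean(axis=0))   # close to 0 for each column
print(X_train_scaled.std(axis=0))    # close to 1 for each column
print(X_test_scaled)
```

After scaling, age and salary are on a comparable range even though the raw salary values are thousands of times larger.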
So our data is almost ready here, now we will move ahead for model creation.
So now is the time for model creation and in this we will be using entropy, as decided.
How we are going to do that, let's see.
From sklearn dot tree we will import DecisionTreeClassifier; the D, T and C are capital letters.
This is done. Next,
as this is ID3, we will name the model dtid3 and create it from DecisionTreeClassifier.
Now, here the default criterion is gini, but we will change it to entropy this time.
Why entropy? Because we want to use ID3 this time.
So here, criterion is equal to entropy.
So we set entropy here. You can keep the rest of the settings as they are; this is the primary one that makes the difference.
After this, here also we will keep random underscore state as 0.
So we have changed only a little, which you can verify from the previous video, where we used the CART algorithm with Gini. So where we use Gini, we can call it CART. On the other hand, where we use entropy or information gain, we call it ID3.
So let's move ahead.
dtid3: here we will pass the training data to the dot fit method.
So we train through dot fit. As you all know, this is supervised classification, so we will pass both X underscore train and y underscore train.
So now our model is trained; it has learnt from the data.
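The model creation and training steps can be sketched as follows. The training data here is synthetic, with a made-up rule deciding the label, so the numbers will not match the session; the `criterion="entropy"` and `random_state=0` settings are the ones described above. Setting the criterion to entropy makes the classifier choose splits by information gain.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data: the labelling rule below is invented for illustration.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=300)
salary = rng.integers(15000, 150001, size=300)
X_train = np.column_stack([age, salary])
y_train = ((age > 40) | (salary > 90000)).astype(int)

# Same settings as in the session: entropy as the impurity measure, fixed random state.
dtid3 = DecisionTreeClassifier(criterion="entropy", random_state=0)

# Supervised classification, so both the inputs and the labels are passed to fit.
dtid3.fit(X_train, y_train)

print(dtid3.get_depth(), dtid3.get_n_leaves())
```

Since no depth limit is set, the tree keeps splitting until its leaves are pure, which is one reason a tree like this can over-fit the training data.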
So after learning, what else do we have to do? Now we have to move ahead to prediction.
So let's do predictions also.
I will store the prediction in one variable, y underscore pred, which is equal to dtid3 dot predict. Here we use the predict method, and since we have to test, we pass X underscore test.
So now we will see what we have got in y pred.
So this is what is stored in y pred.
And we also have the actual y test, our observation data.
So this is our observation data.
Here, we will take dot values, so this is our actual observation data.
Now after prediction we will find out its accuracy.
We cannot do it manually by counting, as that is a challenging and difficult task.
Still, have a look here: here is 1-0-1 and here is 1-0-0.
So one mismatch is caught.
So let's check its accuracy. For that, we have the classification report or the confusion matrix.
So we will measure the performance on the test data.
From sklearn dot metrics we will import accuracy underscore score first, and second we will take confusion underscore matrix.
So these can give us the two measures easily.
Now here, we will print our accuracy, and see how much accuracy we achieved.
So, accuracy underscore score, and here we pass our observed values, that is y underscore test, and our predicted values, that is y underscore pred. We have to pass both of these to get the accuracy.
So, 90 percent is the accuracy, which is good but not that great.
Now, the confusion matrix: cf is equal to confusion underscore matrix, and here also we pass y underscore test and y underscore pred.
Now we will view cf. We have 61 and 29 as true predictions for the two classes,
and 3 and 7 as false predictions for the two classes.
So we passed 100 records out of which 10 are false predictions.
This is the reason we have got 90 percent accuracy.
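To make the link between the confusion matrix and the accuracy concrete, here is a small sketch with ten made-up labels. The `y_test` and `y_pred` lists are hypothetical, not the session's values; the last lines repeat the arithmetic with the session's counts (61 and 29 correct, 3 and 7 wrong).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical observed and predicted labels (not the session's data).
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy_score(y_test, y_pred)
cf = confusion_matrix(y_test, y_pred)   # rows are actual classes, columns are predicted

print(acc)
print(cf)

# Accuracy is the diagonal (correct predictions) divided by all predictions.
print(cf.trace() / cf.sum())

# The same arithmetic with the counts reported in the session.
session_accuracy = (61 + 29) / (61 + 29 + 3 + 7)
print(session_accuracy)
```

The diagonal of the matrix holds the correct predictions for each class, and the off-diagonal cells hold the mistakes.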
So, same thing we can visualise and see.
Here we will use plt dot scatter.
We will first plot the training data, and then we will come to the testing data.
So let’s check for training data.
So, x is equal to X underscore train; here we take all the records, and for the column we take zero.
And y also we take from X underscore train, and here we take column one.
So, we have picked both the columns.
The colour, c, we take from y underscore train, which holds the actual output labels for the training data.
So here, plt dot scatter, and in that x, y and c is equal to c. We pass all three because we created them above.
Here, we should first have imported matplotlib dot pyplot as plt.
I hope you are able to understand me.
Now, we will display this here.
You can see the formation of the training data.
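The training scatter plot can be sketched as below. The points are random stand-ins for the scaled training data, and the non-interactive `Agg` backend and the axis labels are choices made here so the sketch runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the scaled training data and its labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

x = X_train[:, 0]   # all records, column 0 (age)
y = X_train[:, 1]   # all records, column 1 (estimated salary)
c = y_train         # colour each point by its actual class

points = plt.scatter(x, y, c=c)
plt.xlabel("Age (scaled)")
plt.ylabel("Estimated salary (scaled)")
# In a notebook the figure appears below the cell; elsewhere call plt.show() or plt.savefig(...).
```

Points of one colour that sit deep inside the other colour's region are the ones described as outliers here.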
There are many outliers as you can see moving here.
These are outliers.
So, in training data we also have outliers.
Now, if the model over-fits to these outliers, it will not be that good.
So, removing these outliers is a better option.
Now, let's move ahead.
We will see the test data, so y underscore test which is a true value and see it from this.
Now, what will be our x? It will be X underscore test, where we want all the records but only column zero.
Similarly, for y we take X underscore test again, with all the records but only column one.
So, colour we will keep y underscore test this time.
After this for prediction we will just change the colour.
So, here it will be pred and for colour next time we will make it as y pred.
Now, we will do its plotting.
So, plt dot scatter, and here x and y, then c is equal to c.
This is its output, as you can see.
And this is the actual, true data. So if the model treats this point as yellow, then that's fine. Either you remove the outlier, or the model treats it as yellow only.
So, let's see what happens here.
Here also, plt dot scatter with x and y, and then c is equal to c.
So, here it is treated as yellow only.
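To compare the actual and predicted colours more easily, one option is to draw the two scatter plots side by side and circle the points where they disagree. This is a variation on the session, which draws two separate plots; the data and the ten flipped predictions below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Random stand-in for the scaled test data and its true labels.
rng = np.random.default_rng(1)
X_test = rng.normal(size=(100, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Hypothetical predictions: copy the truth and flip the first ten labels.
y_pred = y_test.copy()
y_pred[:10] = 1 - y_pred[:10]

x, y = X_test[:, 0], X_test[:, 1]
fig, (ax_true, ax_pred) = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)

ax_true.scatter(x, y, c=y_test)
ax_true.set_title("Actual (y_test)")

ax_pred.scatter(x, y, c=y_pred)
ax_pred.set_title("Predicted (y_pred)")

# Circle the points where the prediction differs from the actual label.
wrong = y_test != y_pred
ax_pred.scatter(x[wrong], y[wrong], facecolors="none", edgecolors="red", s=120)

print(wrong.sum(), "misclassified points")
```

The circled points are the ones that end up in the off-diagonal cells of the confusion matrix.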
This is the reason its accuracy has decreased.
And here there is some overlap, so there can be some problems over here.
So, it is important to understand and interpret the dataset, we can not solely run behind accuracy, so we will have to understand this point over here is an outlier.
So, I think everyone understood it well.
Now, we will continue in our next session.
So keep Learning and remain motivated.
If you have any questions or comments related to this course, then you can click on the discussion button below this video and post them there.
In this way, you can discuss this course with many other learners like you.