Hello,
I am (name) from LearnVern.
So, we will be continuing our Machine Learning tutorial from our previous session.
In today's machine learning session, we are exploring the practical part of Decision Trees.
Here, we are going to use a dataset of a social media advertisement as an example.
As of now, you might be very much familiar with this Dataset.
So, quickly let us upload the data of social network ads.
So, here you can see we got the dataset now.
So, this dataset we have already explored, so I will just quickly show it to you as to what is there in the dataset.
We will take the help of pandas, so import pandas as pd.
Now we will read the dataset.
So, dataset is equal to pd dot read underscore csv, and here we will paste the path of the dataset.
So, from here I will copy the path and then paste it here.
Now, we have got the dataset.
You can see this is the dataset; in it we have user id, gender, age, estimated salary and purchased.
We have these many things in this dataset.
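The loading step can be sketched as follows. The real file path is not given in the session, so this sketch reads a few made-up rows from an in-memory string instead; the exact column spellings are also an assumption based on the column names mentioned above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: the real path is not given in the
# session, and these rows and column spellings are made up for illustration.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1,Male,19,19000,0
2,Female,35,20000,0
3,Female,26,43000,0
4,Male,47,25000,1
5,Female,46,41000,1
"""

# pd.read_csv accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset.head())
print(dataset.columns.tolist())
```

With the actual file, the path string would go in place of the `io.StringIO(...)` object.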
Now, let us move ahead and apply the decision tree, and in that we will be applying the CART algorithm.
So, the CART algorithm uses the Gini index.
We have discussed it earlier also, that Gini is basically an impurity index.
Now, what is this impurity?
Ok! I will explain it to you once more over here.
Suppose, here in this dataset, purchased is a column which is an output column.
So, here I am writing, 'purchased' and what is this purchased?
This is an output.
I hope you are able to understand me.
What is the meaning of it?
It means whether a person bought the car or did not buy the car: 'buy' or 'not buy'.
What did the person end up doing?
So, here 'buy' means 1.
And 'not buy' means 0.
So, did he buy the car or not?
So, here you can see the first record is 0, second is also 0, third is also a zero and this last one is 1, second last is zero, third last is 1, fourth last is 1 and fifth last is also 1.
So, can I say that the entire data purely belongs to a single class?
Before that, first understand what are the classes that we have?
So, we have two classes, one is zero, and the other one is 1.
So, does all the data belong to one class only? The first record belongs to zero, the second belongs to zero, the 3rd, 4th and 5th also to zero, while 395, 396 and 397 belong to class 1.
So, here there are some records belonging to class 0 and some records belonging to class 1.
So, we can understand that we have two different classes here, and we can see that the data also belongs to two different classes.
What does it mean?
So, this means we have impurity here.
Now, to understand what impurity is, take an example: suppose we go outside India, and we belong to various regions of India. If somebody outside our country asks us which country we belong to, then all of us will reply that we are from India.
So, this means we purely belong to one class.
Now, on the other hand, let's consider a function where people have come from countries other than India as well, like Sri Lanka, Nepal and Bhutan, our neighbouring countries.
You may notice that the people are from different countries but they look alike. If somebody asks which country you all have come from, then everyone will have different answers: some would say I am from India, others would say I am from Nepal, and some other individuals will say I am from Bhutan, I am from Sri Lanka.
So, you can see Everyone is giving different answers.
This means that it doesn't belong to one single class.
We consider such situations as having impurity.
So, in short, what is the word impurity being used for here?
Here impurity basically means if the data belongs to one class then it is pure and if the data belongs to different classes then it is impure.
So, we usually try to measure this impurity, and when we measure it, the value lies between 0 and 1.
Ok, don't confuse this 0 to 1 with the class labels 0 and 1.
Here 0 to 1 is the range, where 0 means pure, meaning only one class.
So, at 0 the data is pure and has only one class, and values towards 1 mean the records or data points, every record is called a data point, belong to multiple classes. With only two classes, as we have here, the highest value the Gini index can reach is 0.5.
So, this is basically the definition.
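To make the definition concrete, here is a minimal sketch of the Gini impurity calculation, one minus the sum of the squared class proportions. The function name `gini_impurity` and the sample label lists are made up for illustration; this is not scikit-learn's internal code.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

print(gini_impurity([0, 0, 0, 0]))  # only one class, so pure: 0.0
print(gini_impurity([0, 0, 1, 1]))  # two classes, evenly mixed: 0.5
```

The first list is like everyone answering "India"; the second is an even mix of two answers, which is the most impure a two-class set can be.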
Ok let’s move further now.
We understood this conceptually and now we will move ahead .
We have got the data, so what is the next thing that we need to do?
From sklearn dot tree, that is, from the tree package, we will get the decision tree, so import DecisionTreeClassifier.
So, we have imported a decision tree classifier.
Now, here we will create an object of this.
So, now we can write dtclr is equal to DecisionTreeClassifier, and in the bracket we add the criterion that we want. Our criterion here is gini, which is already the default; I am writing it down again for you all to see.
So even if you don't add gini here, the criterion stays gini by default.
If you want a different criterion, then you will have to change it. Right?
Next is the splitter.
The splitter has a strategy: one is the best strategy, the other is random.
So, for now I will take random: splitter is equal to random.
So, whenever you have any doubt while filling this in, scroll down and cross check it through the suggestions; here it is given 'best' or 'random', so we have two options.
Now, next is max features, that is, how many features are considered when looking for a split.
Then there is random state, which is None by default.
So, we will set random underscore state is equal to zero.
Other than that we have max leaf nodes, so how many maximum leaf nodes there should be, min impurity decrease, so how much the minimum impurity decrease should be, and class weight, so we can set all these parameters here.
So, now we have created the decision tree classifier over here.
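The classifier described above could be created like this. The variable name `dtclr` follows the name used later in the session; the three arguments are the ones discussed, and everything else is left at scikit-learn's defaults.

```python
from sklearn.tree import DecisionTreeClassifier

# criterion="gini" is already the default; it is written out only for clarity
dtclr = DecisionTreeClassifier(criterion="gini", splitter="random", random_state=0)

# get_params() lists every parameter, including max_features, max_leaf_nodes,
# min_impurity_decrease and class_weight, together with the current values
print(dtclr.get_params())
```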
Ok!
Now, what will we do with this classifier? The classifier is prepared, and we have our initial data, on which we need to do preprocessing.
Ok! I will do the preprocessing once here.
So, I will divide it into x and y: x is equal to dataset dot iloc, and here I want all the records but only columns 2 and 3, in the form of values.
So this is our x.
Similarly, I will take my y: y is equal to dataset dot iloc again; here I will put colon, comma and then 4, so this becomes the last column, and then take dot values of it.
So, here both x and y are initialised.
This is the entire x and this is the entire y.
So x and y are initialised.
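As a small sketch of this step, assuming a dataset with the same column order as described (the rows below are made up):

```python
import pandas as pd

# Made-up rows in the column order described in the session
dataset = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# All rows, columns 2 and 3 (age and estimated salary), as a NumPy array
x = dataset.iloc[:, [2, 3]].values
# All rows, column 4 (purchased), as a NumPy array
y = dataset.iloc[:, 4].values

print(x.shape, y.shape)
```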
You might remember that we had done the pre-processing in this x and y earlier also.
You can surely apply all those things.
What you can do is, you will split these.
So, you can split between train and test.
So, let's execute that also, here .
Let's split into train and test.
For that, from sklearn dot model underscore selection we will import train underscore test underscore split.
With the help of a train test split we will divide the data.
So, let's divide the data with the help of a train test split.
Make it a yes here.
Now, here we will have x underscore train, second will be x underscore test, third y underscore train and fourth y underscore test.
This way!
Now, is equal to, and next we call the function, so train underscore test underscore split.
And in this function we will give the original data x and y, and along with that we tell it how to divide the original data.
So, we have to give the size of the test data; until now we have been giving it 0.25, which is working fine for us, so we will continue with that size.
Lastly, we can give a random state also, and then run it.
So, our data is divided over here.
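The split can be sketched like this. The arrays are made-up placeholders, and `random_state=0` is an assumption, since the session does not say which value was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 40 records with 2 features each, and alternating labels
x = np.arange(80).reshape(40, 2)
y = np.array([0, 1] * 20)

# test_size=0.25 keeps a quarter of the records for testing;
# random_state=0 is an assumed value that makes the shuffle repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)  # (30, 2) (10, 2)
```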
Ok!
Now, let's scale this. From sklearn dot preprocessing we will import StandardScaler.
So we will import the standard scaler.
And with the help of the standard scaler we will scale the data.
So, to do that, first we want an object of it, so we will make an object by the name sc: sc is equal to StandardScaler, ok!
Here, x underscore train is equal to sc dot fit underscore transform, and in brackets the same data, x underscore train.
Similarly, x underscore test is equal to sc dot fit underscore transform, in brackets x underscore test, and enter.
So, now our x train and x test are also scaled.
With scaling, we have seen that it reduces the range of the values and the calculation becomes more efficient.
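A sketch of the scaling step on made-up numbers follows. Note one deliberate difference from the session: the session calls fit underscore transform on the test data too, while this sketch uses `transform` for the test set, a commonly recommended variant that reuses the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and salary values
x_train = np.array([[19, 19000], [35, 20000], [26, 43000], [47, 25000]], dtype=float)
x_test = np.array([[30, 30000], [40, 60000]], dtype=float)

sc = StandardScaler()
# Learn mean and standard deviation from the training data, then scale it
x_train = sc.fit_transform(x_train)
# Variant: scale the test data with the statistics learned from training
x_test = sc.transform(x_test)

print(x_train.mean(axis=0))  # close to 0 for each column
print(x_train.std(axis=0))   # close to 1 for each column
```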
Now, we have made our model also.
Now, in this dtclr model, we will fit our training data, so x underscore train along with y underscore train, so that our model learns from it.
Now, it is learning over here.
After that comes y underscore pred, for prediction; as this is classification, we will get the classes zero and one.
So, dtclr dot predict: here we will use the predict function and in predict we will just put x underscore test.
We had set this data aside precisely so that we could do testing.
Now, here you see this is our y, here you will get to see the value of y pred.
This is our entire y pred that you can see.
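The fit and predict steps can be sketched as below. The tiny arrays are made up for illustration, so the predictions printed here say nothing about the results in the session.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up, already scaled training and test data
x_train = np.array([[-1.0, -1.0], [-0.5, -0.8], [0.8, 0.9], [1.2, 1.1]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[-0.9, -0.7], [1.0, 1.0]])

dtclr = DecisionTreeClassifier(criterion="gini", splitter="random", random_state=0)
dtclr.fit(x_train, y_train)      # the model learns from the training data
y_pred = dtclr.predict(x_test)   # one predicted class (0 or 1) per test record

print(y_pred)
```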
Along with that, also print y test once to see it.
I hope you are able to understand me.
Sorry, Little mistake occurred.
So, here you can see y test….
Ok!
Now, we have to see the accuracy between them.
So accuracy, from SKlearn dot metrics import (10 seconds gap) accuracy score, confusion matrix.
Now, we will print this. (12 seconds gap)
So, accuracy score also needs both the prediction and the actual output that you have received.
So, it needs y underscore test and y underscore pred.
Now, this will tell us the accuracy score, which is 0.86.
This is bad; with both these algorithms the accuracy has come out really low.
So, here again we will print and see the confusion matrix.
So, cf is equal to confusion underscore matrix, and here again we will put y underscore test and y underscore pred.
So, this confusion matrix will display the number of errors that were made.
So, here you can see 63 and 23 are classified correctly but 9 and 5 are classified wrongly.
So, with this we come to know that there have been problems in this algorithm also.
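Here is a sketch of both metric calls on made-up labels, followed by a check of the numbers quoted above. The session does not say which of the two wrong counts sits in which off-diagonal cell, so their placement in `reported` is an assumption; the accuracy does not depend on it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels
y_test = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

print(accuracy_score(y_test, y_pred))  # 3 of 5 correct
cf = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cf)

# Counts quoted in the session: 63 and 23 correct, 9 and 5 wrong.
# The placement of 5 and 9 is assumed for illustration.
reported = np.array([[63, 5],
                     [9, 23]])
reported_accuracy = np.trace(reported) / reported.sum()
print(reported_accuracy)  # (63 + 23) / 100 = 0.86
```

The correct predictions lie on the diagonal of the matrix, so accuracy is the diagonal sum divided by the total number of test records.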
Ok! Let it be, we will move ahead.
Now, let's display its graph.
How will the graph be displayed?
Graph will be displayed in this way.
So, we will see this same thing graphically.
First, we will view training data.
So, for training data what will be our x and y.
So, x is equal to x underscore train. What do we want here? Here we want all the records, but only the zeroth column.
So, this has come perfectly for x.
And for y, what do we want?
For y also we want x underscore train, and here we want all the records but only the values of the 1st column.
That's it.
So, we got the entire x and y.
So it is displayed together and from here we have y.
Now, in this you can set one more thing.
That is c, so c is equal to y underscore train.
So, this way, you set y train.
Ok!
I Hope you are able to understand me
Now, plt dot scatter: here we pass x and y, and c is equal to c, that's enough.
So this is going to help us.
Remove this 9 and make it a bracket and enter.
And here, first we need to import matplotlib, and after that we will execute this scatter. Ok?
So, now we have x, y and c also
Now, to plot this: from matplotlib import pyplot as plt; with the help of this we will do our further execution.
Now, plt dot scatter and pass x and y and c is equal to c so that we can identify.
Now, you can see we have the graph over here.
And from the colouring we can make out the data points of different classes.
And it is overlapping also.
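The training plot can be sketched as follows, with made-up scaled values. The `Agg` backend line is an addition so that the sketch also runs where no display is available; in a notebook it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed without a display
from matplotlib import pyplot as plt
import numpy as np

# Made-up scaled training data and labels
x_train = np.array([[-1.0, -1.0], [-0.5, -0.8], [0.8, 0.9], [1.2, 1.1]])
y_train = np.array([0, 0, 1, 1])

x = x_train[:, 0]  # all records, zeroth column
y = x_train[:, 1]  # all records, 1st column
c = y_train        # the class of each point decides its colour

points = plt.scatter(x, y, c=c)
plt.show()
```

For the testing data, the same code applies with x underscore test and y underscore test in place of the training arrays.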
Now, we did this for training data, and will now do the same thing for testing data.
So, for the testing data, X will be the same, only difference would be x train and y train will become x test and y test.
So, here this will become test.
And this will also become test.
And here we need a value for c, and we will take that as y underscore test.
Now, everything is done!
Now, plt dot scatter, and here we pass x and y, and c is equal to c, so we can check the colours also.
Here, we have done the plotting with x and y.
So, see here.
This is our testing.
Along with that we also want the prediction, so we make a change over here to check the prediction alongside the test.
So we will get to understand how prediction is done.
Ok!
Now we are doing this for prediction.
Now, you can compare both.
This is a prediction and this is actual.
In this way, we can see here it is yellowish and here these two are blue, so this has come from prediction and it has made some wrong predictions.
Ok because of some reason, prediction has been done wrong, this one is blue and here it is yellow, wrong prediction.
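Besides comparing the colours by eye, the wrong predictions can also be located directly. This is a small sketch with made-up labels, not the values from the session.

```python
import numpy as np

# Made-up actual and predicted labels
y_test = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

wrong = y_test != y_pred      # True wherever the prediction differs from the actual class
print(np.where(wrong)[0])     # positions of the wrong predictions: [1 3]
print(wrong.sum())            # number of wrong predictions: 2
```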
I hope you are able to understand me.
So, there are many wrong predictions, and this is the reason its accuracy was only 86 percent.
Ok! So this was our cart algorithm example with the same dataset.
But its accuracy has come out quite low so far.
So, let's see our next algorithm. How will that Perform?
So, friends, we will stop our session here, and we will continue in the next session.
If you have any questions or comments related to this course, then you can click on the discussion button below this video and post them there.
So, in this way you can discuss this course with many other learners of your kind.