Hello,
I am Mohit from LearnVern.
So, today we will be studying in continuation of our Machine Learning course.
So, in today's video we will look at one example of logistic regression, using social media data.
First, I will introduce you to the dataset, and along with that I will explain the problem statement.
I will upload the dataset, and then, by understanding it, we will see what we need to find out from it.
So, from here I am uploading my dataset, which is the social network dataset. It is uploaded now.
You can see the uploaded dataset on the left side over here.
I am copying the path of this dataset and closing this window, as you all know how to do.
Now, to view this dataset we need the pandas library, so first we will write:
import pandas as pd.
Next, we write data equals pd.read_csv; with pd.read_csv we will be able to read the data.
And here, we have read in the data.
Now, let's look at the data once using data.head().
Here you can see that we have user id, gender, age, estimated salary and purchased.
This is our data,
So, like this, we have 400 records in this particular dataset.
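The loading step described above can be sketched as follows. This is only an illustration: the real file path is not given in the video, so the sketch reads a few made-up rows from an in-memory string, and the column spellings (User ID, Gender, Age, EstimatedSalary, Purchased) are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded CSV file. The rows and the exact
# column spellings are assumptions, not taken from the real dataset.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1001,Male,19,19000,0
1002,Female,35,20000,0
1003,Female,26,43000,0
1004,Male,47,25000,1
1005,Male,32,150000,1
1006,Female,27,58000,0
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())  # first five rows
print(len(data))    # number of records
```

With the real file, the copied path would be passed to pd.read_csv in place of the in-memory string.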
Now, here we are going to use logistic regression.
By applying it we will try to figure out whether a particular person will buy a new car or not.
So our intention is to know whether a person will buy a new car or not.
Let me note that here: person will buy a new car or not.
So, this is what we have to find out.
When does this generally happen?
It happens when a company launches a new car model: they have to do marketing, and within that marketing we are focusing particularly on social media.
So, basically we are covering individuals reached through social media marketing specifically, and whether they bought the car. We are focusing on that particular data only.
Okay guys !
We have shown you the dataset.
In the last column we mostly see zero, but at some places we also have one.
Let me show you the data again; here you can see we have both zero and one.
Where zero is written in purchased, it means the person has not bought the car.
And where one is written in purchased, it means the person has bought the car.
We will note that here.
So, let us write here: zero means car not purchased, and one means car purchased.
Ok!
So, these are its meanings.
Now, let's move ahead and see what other features the data contains.
So, it has user id, gender, age and estimated salary, which you can also get through data.columns.
So with data.columns we see user id, gender, age, estimated salary and purchased.
So purchased is the output and the rest are our input parameters.
Right! So these are called inputs, also known as the feature set.
So, this is our feature set, and the output is purchased.
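As a small sketch of the split between inputs and output described above, the column list can be partitioned in code. The tiny DataFrame and its column spellings are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the social network dataset.
data = pd.DataFrame(
    {
        "User ID": [1001, 1002],
        "Gender": ["Male", "Female"],
        "Age": [19, 35],
        "EstimatedSalary": [19000, 20000],
        "Purchased": [0, 1],
    }
)

print(data.columns)  # every column name, in positional order

# The output (target) column, and everything else as candidate inputs.
target = "Purchased"
features = [col for col in data.columns if col != target]
print(features)
```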
Now, how can we solve this?
To solve this, we first need to do data preprocessing.
Once the data is prepared for learning, the second step will be logistic regression.
So we will create a model for logistic regression, and with its help we can make predictions.
Then our third step will be evaluation, so we have to evaluate the model.
So these are all the steps for execution.
So, let's begin and proceed with execution.
Now to start with execution, we have already loaded the data.
After loading data, what is the next step ?
We should import all the other libraries that are required.
So, import numpy as np; this is one library.
Then, import matplotlib.pyplot as plt.
And lastly, we have already imported pandas.
So, this is the way.
Here we need to add 'as'.
Now, you can see both libraries are imported, and we had already imported pandas earlier.
Ok!
Now, we have the dataset stored in the data variable, which is a DataFrame.
Now, what will we do next?
Next, we will divide the data into X and y, that is, into input and output.
Okay!
Let's begin with X.
Before that, I will write data.iloc here; in iloc, 'loc' is basically location and 'i' refers to the integer index.
So, it has to do with indexing by position. In iloc I will put a colon over here, and then run it.
So you can see, it has given me all 400 rows and all 5 columns as output.
Now, after this colon I will put a comma, and after the comma, in square brackets, I will give 2 comma 3 and execute it; here you will see that age and estimated salary are extracted.
So, before this you had seen that all the columns, 0, 1, 2, 3 and 4, were there.
So, let me show it to you all once again; for that I will remove this from here and execute it.
Now, you can see user id at the zeroth location, gender at the first, age at the second, estimated salary at the third and purchased at the fourth.
So, the colon in the first position denotes the rows, from beginning to end. For instance, if I write colon 3 over here, what should the output be? It should bring only 3 records; you can see 0, 1 and 2.
So, this is how it brings output over here.
So, we will keep this first part as it is, because we want all the rows, and for columns we need only the second and third; we don't want zero and one, so in the brackets we put 2 and 3.
And exactly that happened: I got age and estimated salary.
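The iloc experiments above can be reproduced on a small table. This is a sketch on a hypothetical five-row DataFrame; the column spellings are assumptions, but the column positions (0 to 4) follow the order described in the video.

```python
import pandas as pd

# Hypothetical stand-in with the same column order as the lecture's dataset.
data = pd.DataFrame(
    {
        "User ID": [1001, 1002, 1003, 1004, 1005],
        "Gender": ["Male", "Female", "Female", "Male", "Male"],
        "Age": [19, 35, 26, 47, 32],
        "EstimatedSalary": [19000, 20000, 43000, 25000, 150000],
        "Purchased": [0, 0, 0, 1, 1],
    }
)

everything = data.iloc[:]           # a lone colon: all rows, all columns
first_three = data.iloc[:3]         # rows at positions 0, 1 and 2
age_salary = data.iloc[:, [2, 3]]   # all rows, columns at positions 2 and 3

print(first_three)
print(age_salary)
```

The part before the comma always selects rows and the part after it selects columns, both by integer position.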
Now, why have I done that?
Because these are the relevant parameters.
We will work with these, so we store them in X.
X equals data.iloc, and in the brackets we want all the rows, and in columns we want only the second and third, so we write 2 and 3 in brackets.
And one more thing: let's check what happens if I add .values.
Here you can see that with .values the data is returned in array format.
See, all the data is shown in array format.
In this way, we can get the data in array format.
So, here we will add .values and execute it, and we have our X values.
Now, we will do the same for y, so y equals data.iloc; now what are we supposed to do for y, as it is the output?
We will extract y from the original data.
A colon means we want all the records.
You might wonder, all the records?
Yes, in y we want all the rows, not all the columns.
So, here we can mention the column that we want by putting a comma before it.
For the column, you can again see it from here: this is zero, one, two, three and four.
For predictions we have to give the purchased column for y, because that is what we have to predict.
It is at index 4, so here we will write 4.
Clear so far?
Now, what do we have to do next?
Here also we will add .values.
With .values, it gets converted into NumPy array format.
And this is better for computation purposes.
So, we have data in both X and y.
So, let me show you X and y.
X was written in capital, so now you can see this is the output.
And after that we will also execute y and see; this is also in array format.
So, in this way, we are preparing the data through preprocessing, so that we can further use it for training and testing.
So, this task falls under preprocessing. Ok!
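Here is a minimal sketch of building X and y as described above, again on a hypothetical five-row table with assumed column spellings. It shows that .values hands back NumPy arrays rather than pandas objects.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same column order as the lecture's dataset.
data = pd.DataFrame(
    {
        "User ID": [1001, 1002, 1003, 1004, 1005],
        "Gender": ["Male", "Female", "Female", "Male", "Male"],
        "Age": [19, 35, 26, 47, 32],
        "EstimatedSalary": [19000, 20000, 43000, 25000, 150000],
        "Purchased": [0, 0, 0, 1, 1],
    }
)

X = data.iloc[:, [2, 3]].values  # inputs: age and estimated salary, 2-D array
y = data.iloc[:, 4].values       # output: purchased, 1-D array

print(type(X), X.shape)
print(type(y), y.shape)
```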
Now, next we will divide the data into training and testing.
For that we also have a function, known as train_test_split.
We will import it from sklearn.model_selection.
So, here we have written: from sklearn.model_selection import train_test_split.
Now, we will have X_train, X_test, y_train and y_test.
This will help us divide the data and store it in different variables.
These are equal to train_test_split.
We have to pass the original X and y, and after that we also have to mention the test size.
We will keep that as 0.25, that is 25 percent, for now.
Along with that we have to enter a random state; the random state can be any value, so I am giving zero here.
Now, we will execute this.
Ok!
So, this is executed.
Now, what have we received in X_train? So this is X_train here.
If we look at its shape, it is 300 by 2.
Similarly, if we look at the shape of X_test, we get 100 by 2.
So, this has set aside 25 percent of the data for testing.
The 400 records are split as 100, 100, 100 and 100: three parts for training and one for testing.
So, we have got a pair of X and y for training and a pair for testing.
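The split above can be checked on synthetic data of the same size. This is a sketch: the 400 rows are randomly generated stand-ins, not the real dataset, while test_size=0.25 and random_state=0 are the values used in the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows of (age, salary) with random 0/1 labels.
rng = np.random.default_rng(0)
X = np.column_stack(
    [rng.integers(18, 61, size=400), rng.integers(15000, 150001, size=400)]
)
y = rng.integers(0, 2, size=400)

# Hold out 25 percent of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (300, 2) (100, 2)
print(y_train.shape, y_test.shape)  # (300,) (100,)
```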
Now, let's scale these values, as they are large in magnitude.
'Scale' basically means to bring the values into a limited range.
That is, to bring them into a range such as 0 to 1 or minus 1 to 1.
So, their range becomes limited and similar.
Because right now the values are difficult to manage, as they range from thousands to lakhs or crores.
So, it becomes difficult.
So, it is good that we constrain the values and bring them into a certain range.
For that we will be using the preprocessing module, sklearn.preprocessing.
Here I have typed 'pre' twice, so I will remove it.
From here, we will import StandardScaler.
StandardScaler's job is to scale the values, so that each column has mean zero and unit variance.
So, let's scale the values with it.
sc equals StandardScaler(), and this is its object.
Now we have to fit X_train with the sc object.
X_train equals sc.fit_transform.
Let me check if this is giving us suggestions.
So yes, it is giving the suggestion, you can see here.
What are we using here? fit_transform.
In that, we will pass X_train, and run it.
After that we will also transform X_test: X_test equals sc.fit_transform, and here we pass X_test.
So, this is the way we transformed it.
Now, after transforming, let's move ahead.
We will just check whether it is visible or not.
So, the X_train values have been scaled down to small values around zero, mostly near minus 1, 0 and 1.
Is there much distance between these values?
No!
The difference has been reduced.
So this is what happens when we apply a standard scaler.
So, now our data has been scaled down.
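To see what the scaler actually does, here is a small sketch on made-up numbers. One deliberate difference from the walkthrough: the video calls fit_transform on both splits, while this sketch fits on the training split only and reuses those statistics for the test split with transform, which is a common alternative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows standing in for the real splits.
X_train = np.array(
    [[19, 19000], [35, 20000], [47, 25000], [32, 150000]], dtype=float
)
X_test = np.array([[26, 43000], [27, 58000]], dtype=float)

sc = StandardScaler()

# Learn each column's mean and standard deviation from the training split,
# then standardise the training values with them.
X_train_scaled = sc.fit_transform(X_train)

# Reuse the training statistics for the test split (assumption: this differs
# from the video, which calls fit_transform here as well).
X_test_scaled = sc.transform(X_test)

print(X_train_scaled)
print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

The scaled values are not confined to a fixed interval such as minus 1 to 1; they are centred on zero with unit spread, so most of them end up as small numbers.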
Understood guys !
Now, we can move ahead and apply logistic regression.
Now, how can we apply logistic regression?
Logistic regression comes under sklearn: from sklearn.linear_model.
In sklearn there are different modules, and logistic regression comes under linear_model.
So we will import LogisticRegression from here.
Now we will create an object, or a model, for this logistic regression that we have imported.
Here, the algorithm is known as a model, which we train later.
So, here we write log_reg equals LogisticRegression().
Here, if we want to give any inputs we can, such as penalty, dual, etc.
But if we are just doing it for practice, we can leave the defaults and move ahead.
So, here I am giving a random state equal to zero, and right now I am not giving anything else.
So we have created an object this way.
After this, I have to pass the X and y inputs for training, so log_reg.fit, and then put the data in it: X_test, and with that y_test.
So, what we have done here is that the algorithm is now learning.
What does this fit function do?
The fit function basically trains the model on the data that we give it.
So on the basis of X_test and y_test, the model has been trained.
Now our model has learned from the data.
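The model-creation and fit step can be sketched like this. The data is synthetic and already on a standardised scale, and the labelling rule is made up for illustration. Note also that this sketch fits on a training split, which is the usual convention, whereas the walkthrough names X_test and y_test in the fit call.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for 300 scaled (age, salary) training rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))

# Made-up rule: label 1 when the two features together are above zero.
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Default settings apart from random_state, as in the video.
log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train, y_train)

print(log_reg.coef_)       # one learned weight per feature
print(log_reg.intercept_)  # learned bias term
```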
Now, after this, we can move ahead to predictions; normally we store the prediction in the variable y_pred.
So we write log_reg.predict. Ok!
Here you pass the inputs for which you want the output.
For example, I want the output for X_test, and that output will go into y_pred.
So, let's see what is in y_pred.
It has zero, zero, zero.
Now, if you just want to see visually whether it is right or wrong, how can you do that?
You write print here, and inside it put y_test.
So, this will display the actual y_test.
Clear guys!
Now, you can compare the two to check whether it has predicted properly.
So, at the end, this one is zero, zero, zero, and this one is one, one, one.
Now here, at the third from last, it has given 0 but its actual value was 1.
So, in this way it has given us the output.
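Comparing predictions with actual labels by eye gets tedious, so here is a sketch that does the same comparison in code. The data and the noisy labelling rule are synthetic, and the 300/100 slicing is just a simple stand-in for the earlier split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 400 scaled rows with noisy labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Simple 300/100 split by position, for illustration only.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

log_reg = LogisticRegression(random_state=0)
log_reg.fit(X_train, y_train)

# Predict a 0/1 label for every test row.
y_pred = log_reg.predict(X_test)

print(y_pred[-10:])  # last ten predictions
print(y_test[-10:])  # last ten actual labels

# Positions where the prediction disagrees with the actual label.
mismatches = np.flatnonzero(y_pred != y_test)
print(mismatches)
```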
Now, we can use this model for prediction on this particular dataset with these features.
If we want to understand this in more depth, we can show it through plotting.
How can you plot?
One option is to take one column from the original table and plot on the basis of that column.
Once again we will go through the dataset and see what we have; on that basis we can plan.
So, this is our dataset, having user id, gender, age and estimated salary.
Now, among them, on the basis of which column should we plot?
Let's do it on the basis of age, through graphical representation.
Here I have to extract the age column.
So, here we have data, and in the brackets, age.
Okay, after execution we can see that it did not give any output, so let me just correct it.
What do we want?
I will use data.iloc, and through this I will pick the age column: I will put a colon, and after the comma I will write the index of the age column.
How many columns do we have? We saw that recently, through data.columns.
From here, we can see our column numbers. Okay.
We have 0, 1, and in the 2nd position we have age.
Here I will put 2, and it should give me the data for age.
Here we have the age data.
And now, on the basis of this age, we can plot both the charts.
So, by using the same output that we received here, I will try to visualise the prediction.
So, here I will write plt.scatter, and for x we will put this age expression, and for y, y_test.
Ok! So this is done.
Similarly, plt.scatter, then the same expression, and for y I will give the prediction, y_pred.
So, here we get an error: x and y must be the same size.
To represent it in graphical format, instead of data we will use X_test, so that our x and y have equal length.
So, now we can visualise this output to understand the difference between them.
How do we visualise it? We will take the help of X_test.
We will visualise it on the basis of one particular column of X_test.
Here we already have two columns, so we will pick any one, and visualise on the basis of its entire data.
So, if I pick the first column, we will put colon, comma, and take the zeroth column.
So, in this way. Ok!
So, here through plt.scatter, on that basis, we will enter the data.
So, plt.scatter, and here this becomes my x; after this we put y_test as it is for y, and now we will execute this.
Now, you can see the values in this scatter plot in this way.
But one more thing can be done here: by putting a comma we can set the color, with cmap, and configure this too. We can already tell what is 1 and what is 0, but only if we add color will we understand how our prediction has classified the points.
So, I will take y_pred and put it in as the color.
So, you can see here.
Now, by putting the color in cmap, we cannot see much of a difference.
So, let me just check if I can make this y_test.
Making this y_test.
But still we are not getting the color. Ok!
So, here we will put c equals y_pred.
Now, we can clearly understand that y_test is our actual output and y_pred is our prediction.
So, we can view the mistakes that the model has made.
Where have the mistakes been made?
So, here you can see at 1.0, here and also here.
So, we can see the yellow dots are for one, but here you can see this one has also become yellow.
So, these are the mistakes; that means the model has not predicted accurately there, and the plot represents those mistakes.
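The final plot can be sketched as below. The points are synthetic stand-ins for the scaled age column of X_test, and the labels and predictions are made up so that a few of them disagree. The key call is the same one the video settles on: scatter with the actual labels on the y axis and c set to the predictions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins: one scaled feature, actual labels, and predictions.
rng = np.random.default_rng(0)
age_scaled = rng.normal(size=100)  # plays the role of X_test[:, 0]
y_test = (age_scaled + rng.normal(scale=0.5, size=100) > 0).astype(int)
y_pred = (age_scaled > 0).astype(int)

fig, ax = plt.subplots()

# Vertical position shows the actual label; colour shows the predicted label.
# A point whose colour does not match the rest of its row is a mistake.
points = ax.scatter(age_scaled, y_test, c=y_pred)
ax.set_xlabel("age (scaled)")
ax.set_ylabel("actual purchased label")

# In a notebook or interactive session, plt.show() would display the figure.
```

The parameter that maps values to colours is c; cmap only chooses which colour map those values are drawn with.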
Now, we can also calculate its accuracy, because just by looking at the plot we cannot tell how accurate this is.
To find out the accuracy, what can be done?
We can print the accuracy score and the confusion matrix.
For that, let me just import them.
From sklearn.metrics, the metrics module that sklearn has, we will import accuracy_score and along with that confusion_matrix, so we import these two.
And on this basis, we will find out.
Now, here we will find out the accuracy score.
So, print accuracy_score, and in the brackets we put y_test and y_pred; it comes out as 0.9, that is 90 percent.
In the same way, we will print the confusion matrix.
So cf equals confusion_matrix, and in this also we pass y_test, and then comma y_pred.
So, here you will see our confusion matrix is formed.
cf is the confusion matrix.
This is our confusion matrix that is displayed.
In this, you can see these are our true predictions, this one and this one, and these two, 8 and 2, are the mistakes.
With the confusion matrix we can understand that this is the true count for the first class, and this is the true count for the second. Ok!
And these are the false predictions for the two classes.
So, because of this 8 and 2, our accuracy decreased.
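Here is a small worked example of the two metrics, using ten hand-made labels rather than the lecture's results. It shows how the accuracy follows from the confusion matrix: correct predictions sit on the diagonal, mistakes sit off it.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration (not the lecture's data).
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # actual
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # predicted

acc = accuracy_score(y_test, y_pred)
cf = confusion_matrix(y_test, y_pred)

print(acc)  # 0.7
print(cf)
# [[5 1]
#  [2 2]]
# Rows are actual classes, columns are predicted classes.
# Diagonal (5 and 2): correct predictions. Off-diagonal (1 and 2): mistakes.

# Accuracy is the diagonal total divided by the number of samples.
print(cf.trace() / cf.sum())  # 0.7
```

Applied to the lecture's numbers, 8 and 2 mistakes out of 100 test rows leave 90 correct, which is the 0.9 accuracy reported.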
So, this was our second example of logistic regression.
Hope you understood.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect to fellow learners and discuss the course. Also, Our Team will try to solve your query.