Hello I am (name) from LearnVern.
Today we will continue from our previous tutorial in this machine learning series.
So in today's tutorial we are going to see about multiple regression.
Multiple, meaning here we will have multiple inputs.
So if I talk about inputs, here I can have from X1, X2, X3, up to Xn.
And I will have an output as y.
So let's understand this with an example,
If I want to know how much I will score in my upcoming examination,
I can estimate this with the help of various inputs, such as:
first, how many hours I study,
second, which teacher I study from,
third, which books I refer to.
These are three features, and with the help of these, or many more features, we would be able to decide our output: an estimate of how much I will score in my upcoming examination.
So, in this way you will see our score is a continuous variable, meaning it is not categorical, such as good or bad.
So, the score would be like 67 percent, 88 percent or 97 percent, in this way it would be a continuous value.
So, let's see how we can apply multiple regression in our startup dataset.
First I will upload this file; this is the famous 50 Startups dataset.
So, here I uploaded this and copied its path,
So, first I will explore this dataset for you all.
So, first import pandas as pd; with the help of the pandas library we will explore the dataset.
So, dataset = pd.read_csv(), and here inside the brackets we have to put the path.
And, here we will explore the dataset once.
So, dataset.head().
So, here you have in front of you, all the columns of the dataset.
And we have 5 sample records.
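Put together, this loading step can be sketched as follows. The real CSV path is specific to the presenter's machine, so a tiny stand-in DataFrame with the same five columns is built inline here; the numbers are illustrative, not read from the actual file.

```python
import pandas as pd

# Stand-in for dataset = pd.read_csv("<path to 50_Startups.csv>").
# The rows below are illustrative values with the dataset's column layout.
dataset = pd.DataFrame({
    "R&D Spend":       [165349.2, 162597.7, 153441.5, 144372.4, 142107.3],
    "Administration":  [136897.8, 151377.6, 101145.6, 118671.9, 91391.8],
    "Marketing Spend": [471784.1, 443898.5, 407934.5, 383199.6, 366168.4],
    "State":           ["New York", "California", "Florida", "New York", "Florida"],
    "Profit":          [192261.8, 191792.1, 191050.4, 182902.0, 166187.9],
})
print(dataset.head())  # the first five sample records
```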
So, here you can see these are our features.
First, R&D Spend, which is a decimal value, or a floating point value.
Next, how much is spent on administration, this is also a floating point value.
Then, marketing spend is also a floating point value.
Thereafter, we have state,
And lastly the output that we have is profit, which is again a decimal or floating point value.
So, here for input we will have these as parameters.
And for the output, we will have profit.
So, before exploring it furthermore, let's import other relevant libraries also.
So, we will need numpy: import numpy as np.
Second, as we have already imported pandas, we will next import matplotlib.pyplot as plt.
Next, we will import a metric from sklearn, that is, mean squared error.
So, from sklearn.metrics import mean_squared_error.
So, this way we imported the relevant libraries also.
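As a quick recap, the import preamble used so far can be written as:

```python
# Libraries used throughout this tutorial.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
```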
So, now let's divide the data into input and output format, that means in X and Y format.
So, here our input will be X = dataset.iloc[:, :-1].values; we want all the records and all the columns except one, which, if we index from the end, is -1.
Now, you can see how the data is loaded in X.
Meaning in total there are 4 columns; with zero-based indexing, they run from 0 to 3.
So, these are the columns in X.
And for y, that means the output, we will take y = dataset.iloc[:, 4].values; here we take all the records but only one column, the 4th column, as output, and execute this.
Now, you can see, in y only the last column is getting displayed here.
So, the last column of profit is our output.
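The input/output split above can be sketched like this, using a tiny stand-in frame with the same column layout (the values are made up):

```python
import pandas as pd

# Tiny stand-in with the startup dataset's column layout.
dataset = pd.DataFrame({
    "R&D Spend": [1.0, 2.0], "Administration": [3.0, 4.0],
    "Marketing Spend": [5.0, 6.0], "State": ["New York", "Florida"],
    "Profit": [7.0, 8.0],
})

X = dataset.iloc[:, :-1].values  # all rows, all columns except the last
y = dataset.iloc[:, 4].values    # all rows, only the 4th column (Profit)
print(X.shape, y.shape)  # (2, 4) (2,)
```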
So, let's move ahead and go onto the further execution part.
Now, in data preprocessing we will have to work on our one input column which is categorical, so we will have to encode it. Also, if we want, we can one-hot encode the data with the help of OneHotEncoder, and thereafter we can drop some data to protect against the dummy variable trap.
Now, let's move ahead,
And firstly let us import encoders,
So, from sklearn.preprocessing we want to import LabelEncoder; along with that, since we can also run one-hot encoding on our data, OneHotEncoder.
So, let's move ahead,
Now we will create an object by the name le, so le = LabelEncoder().
Now, with the help of le we will transform our input.
Now, in X we will take all the rows but only the 3rd column, for which we have to do the encoding: X[:, 3] = le.fit_transform(X[:, 3]).
So, we will execute with this.
And here, X[:, 3], all the rows and only column 3.
So, in this way you can see that the third column is now transformed.
So, this we have done with the help of label encoder.
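The label-encoding step can be sketched as below; the three rows and their values are illustrative, not the real file's contents:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative input matrix: three numeric columns plus the State column.
X = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.6, 443898.5, "California"],
    [153441.5, 101145.6, 407934.5, "Florida"],
], dtype=object)

le = LabelEncoder()
X[:, 3] = le.fit_transform(X[:, 3])  # state names -> integer codes
print(X[:, 3])  # codes follow alphabetical order of the state names
```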
Now, our next task will be that with the help of one hot encoder, we will encode our complete input.
So, onehotencoder = OneHotEncoder().
So, for the one-hot encoder, X = onehotencoder.fit_transform(X).toarray(); here also we use the fit_transform method, pass X, and store the result as an array.
Now, you can see how our data is now stored in X.
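In the video the whole matrix is passed to the encoder; in practice one usually one-hot encodes only the categorical column, for example with a ColumnTransformer. A minimal sketch of that pattern, with made-up values:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Illustrative input: three numeric columns plus the State column (index 3).
X = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.6, 443898.5, "California"],
    [153441.5, 101145.6, 407934.5, "Florida"],
], dtype=object)

# One-hot encode only column 3; pass the numeric columns through unchanged.
ct = ColumnTransformer(
    [("state", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_enc = ct.fit_transform(X)
print(X_enc.shape)  # 3 dummy columns for 3 states + 3 numeric columns
```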
Now, here we will apply one more thing, that we will try to avoid dummy variable traps.
Now, what is this dummy variable trap? That also I will tell you.
So, as I was telling you that now we will try to remove or avoid dummy variable trap,
Now, What happens in this,
So basically, as we have converted this completely into array format, these variables form a collinearity with each other,
which we also call multicollinearity, meaning that the independent variables X1, X2, X3 are themselves dependent upon each other.
So, because of this, what we will do is that we will tweak the data a little bit.
So, X = X[:, 1:]; here we will take all the records but only columns from 1 onward, so we will drop a little data for now.
So, in this way we have prepared our input data that is X.
Now, we would be safe from multicollinearity.
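The dummy-variable drop can be sketched as below; the dummy layout shown is an assumption for illustration. (Current scikit-learn can also do this directly with OneHotEncoder(drop="first").)

```python
import numpy as np

# Assume the first three columns are the one-hot State dummies
# and the remaining column is a numeric feature (illustrative values).
X = np.array([
    [1.0, 0.0, 0.0, 165349.2],
    [0.0, 1.0, 0.0, 162597.7],
    [0.0, 0.0, 1.0, 153441.5],
])

# The three dummies always sum to 1, so one of them is linearly
# redundant; dropping the first column avoids the dummy variable trap.
X = X[:, 1:]
print(X.shape)  # (3, 3)
```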
Now, we will move ahead,
And it's the time to split the data between training and testing.
So, here for training and testing, from sklearn.model_selection import train_test_split.
So with the help of this we will split our data into training and testing sets.
So, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0); here I have to pass both my X and y, along with the test size, in this case we will keep 20 percent of the data, and random_state as zero.
So, in this way our data is divided.
Now, I will show you: X_train.shape.
So, we have 40 records here.
In the same way, X_test.shape: here only 20 percent, that means we have 10 records.
So out of 50, 10 records have gone under test and 40 for training.
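The split can be sketched with stand-in arrays of the same size, 50 records:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 50 rows, like the startup dataset.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 20 percent of the rows go to the test set; random_state=0
# makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```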
Now, let's move ahead and we will perform feature scaling.
So, up till now we have been preparing the data so that we can give our algorithm a better input.
So, now for feature scaling we will use StandardScaler.
So, from sklearn.preprocessing import StandardScaler.
So, with the help of standard scaler we will execute this.
sc = StandardScaler(); so we have created an object for StandardScaler.
Now, with the help of this object we will scale the data.
So, X_train = sc.fit_transform(X_train).
Now, we will see what we got in X train.
So, here we can see that we have scaled this with the help of standard scaler.
In the same way we will do it for X_test, but here we use transform instead of fit_transform, so that the test data is scaled with the statistics learned from the training data: X_test = sc.transform(X_test).
Now, we will see this.
So, this is our x underscore test.
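A minimal sketch of the scaling step, with made-up numbers; note that the scaler is fitted on the training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training and test matrices (two features each).
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.0, 250.0]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/std on training data
X_test = sc.transform(X_test)        # reuse those statistics on test data
print(X_train.mean(axis=0))          # each scaled column has mean ~0
```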
Now, we have our complete data ready.
Now, we will apply multiple regression to this.
So, to apply multiple regression we will create a model.
For model creation, from sklearn we will now take the linear model; in simple linear regression we had also chosen the linear model, and here we have to use the same thing: from sklearn.linear_model import LinearRegression.
Now, we form an object for linear regression.
So, lr = LinearRegression().
Now, after creating the object, we will train it.
For that we will use fit method,
So, lr.fit(), and we will insert our input data, that is X_train, and here y_train.
So, this is our data through which we have trained our LR that is the model.
Now, with the help of this model itself we will do prediction.
So, in y_pred I will store the predicted output.
So, y_pred = lr.predict(X_test); in this I pass X_test.
So, after prediction you can see the values of y_pred in this way.
So, if I want to see the same thing as y_test, then we can see the values something like this.
So, this is our value which is observed, and this is the prediction of the algorithm.
Now, we will evaluate this once and see.
So, to evaluate we have imported mean squared error in the beginning itself.
Now, we will print the mean squared error. We will cross-check once what we had imported: it was mean_squared_error. So, print(mean_squared_error(y_test, y_pred)); in this we have to pass y_true, which is our y_test, along with y_pred.
So, we have to pass these two things,
So, here you can see this error is very high, but ideally the error should be near 0; as we know, the smaller the error, the better our model will be.
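The training, prediction, and evaluation steps can be sketched end-to-end on toy data; here y is constructed as an exact linear function of the inputs, so the fitted model should give a mean squared error near zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy data: y = 2*x1 + 3*x2 + 5 exactly, so a linear model fits perfectly.
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0]])
y_train = 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1] + 5.0
X_test = np.array([[5.0, 2.0]])
y_test = 2.0 * X_test[:, 0] + 3.0 * X_test[:, 1] + 5.0

lr = LinearRegression()
lr.fit(X_train, y_train)          # train the model
y_pred = lr.predict(X_test)       # predict on unseen data
print(mean_squared_error(y_test, y_pred))  # near 0 on this exact data
```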
So, in this we performed many preprocessing steps; actually we do not need all of them. After performing label encoding and one-hot encoding we were not required to perform scaling, and the model would have worked fine without it, but for demonstration purposes I showed you many of the activities that are done during preprocessing.
So, if we want to plot a graph for this, we can do that on this dataset.
So we have multiple columns in X,
So ,if we want to plot the graph, then we can take some data from the original dataset and plot the graph.
So, dataset.iloc[:, 1].values: here we take all the records and the 1st column, and the values look something like this.
So, we will consider this as x,
And now our y will be y test and y pred,
Now, plt.plot: this is my original data, so here I pass x, and for y I will take my output.
We will call it y1, so y1 = dataset.iloc[:, 4].values, all the rows but only the 4th column, since our 4th column is the output.
So, here we pass x and y1, and after that we will use 'rx', the red cross symbol, to plot.
So, here you can see this is our data, and this data is such that it is very difficult to fit a regression line on it.
So, let's replace this first column with the zeroth column.
So, on the basis of the zeroth column you can see that some correlation between the variables is visible.
So, the first column was not looking that much relevant for the output.
So, the administration expenditure was not having that much impact on the output.
So, that is why now we have a better graph, where regression line can be drawn.
So, in this way we can also choose which variables to remove from the analysis if they have the least relevance and no correlation.
So, here you can see it's a better correlation, and we can also fit a line in between here.
So, here to fit a regression line I will put plt.plot, with the same x and y1, and here I will put the color 'black'.
So, here we can plot a line in this way, and you can see the regression line that has been plotted here.
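The plotting steps can be sketched like this; the column values and the output filename are illustrative, and the figure is rendered off-screen rather than in a window:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (the video shows it inline)
import matplotlib.pyplot as plt
import numpy as np

# Illustrative values standing in for dataset columns.
x = np.array([165349.2, 162597.7, 153441.5, 144372.4])   # e.g. R&D Spend
y1 = np.array([192261.8, 191792.1, 191050.4, 182902.0])  # Profit

plt.plot(x, y1, "rx")           # data points as red crosses
plt.plot(x, y1, color="black")  # a line through the points
plt.xlabel("R&D Spend")
plt.ylabel("Profit")
plt.savefig("profit_vs_rnd.png")  # hypothetical output filename
```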
So, this is my original data on which I showed.
And since I had conducted one-hot encoding, the relevance of plotting on the transformed data is reduced.
So, I will conclude the plotting here.
And I hope you must have enjoyed the video.
We will learn the further parts in the upcoming videos.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect to fellow learners and discuss the course. Also, Our Team will try to solve your query.
Share a personalized message with your friends.