In this module we will see what are the different assumptions if we start the linear regression and before that.
Next we'll also cover in it that how we can perform this simple linear regression in Excel.
So first of all, we will start with what all are the assumptions? First assumption is that, there is a linear relationship between X and Y.
If you wish to start any linear regression.
So, you will have to assume that between x and y there is some type of linear relationship, otherwise whichever best fit lines you wish to find, those you won't find linear like you can see in this particular curve that on the left I've got a linear pattern between x and y.
So, in that easily I can find my best fit line with red line.
But wherever I have got non-linear pattern like anything haphazard, there instead of getting a linear curve, I am getting a nonlinear curve.
Second assumption is residuals or which we also call as error terms, which means error terms are normally distributed with mean zero.
Here we are talking about the error terms.
I am not talking about x and y in this assumption.
So, it is assumed that whichever is my error term is normally distributed and its mean would be zero Plus in normally distributed error terms there will not be any correlations in them.
We call this process as autocorrelation.
Third, which is an important assumption is that error terms are independent of each other which means, if you will see on the left where my error terms are scattered everywhere we are not able to see any visible pattern and no error term is dependent on each other.
But if you see on the right, this error term is following some of the other pattern in which we can see the curve in the form of pattern on the upper and lower side.
So, in this way, the third important assumption is that that I should not have any dependency on error terms.
Fourth and the most important assumption is that error terms have constant variance.
We also call it as homoscedasticity, which means homogeneity of a variance, which means whichever is my variance should not increase or decrease as my error term values increase or they change, like if you will see on the left side curve.
This is my homoscedasticity curve where I can see the constant variance in every term terms.
But on the right side, the variance that I have is forming a certain pattern which means we can see a change in the variance.
This is also called as heteroscedasticity.
So, assumptions of linear regression are very important and it is very much asked in the interviews.
You have to importantly remember that the assumptions that we are making are based on the error terms.
Error terms mean should be zero.
Error term should have constant variance, error terms should be independent to each other.
My error term should be normally distributed along with this the very important thing is that there should be a linear relationship between both x and y.
So, we have covered the assumptions.
Now we will see once, if we have any particular data set and how do we draw liner regression curves in the Excel with that data set or how do we calculate residual square, how do we calculate sum of squares? All that we will see in this particular Excel example.
So, let's assume that we have two particular variable’s data in which my x is ad spending in lakhs.
Which means if I have spent 127.4 lakhs then what is the revenue that I will get? I have that particular data available for some data points.
So, our main goal is to find that is it forming a linear relationship or how strong is the correlation between these two variables? So, our end goal is that we should find r square for this particular data set.
So first of all, when you have x and y value given, so what you will do is for that particular value, you will draw one curve.
You can normally draw it in Excel where on the x axis I have my ad spendings and on the y axis I have my revenue.
So, when you have drawn this curve, here you have got one equation, which is linear regression’s equation.
Y is equal to 0.0528x + 3.3525.
So, this means that our intercept that happened, B's value becomes 3.3525 and the slope we will see in this value is 0.0528.
So, with this we've got the intercept slope.
Normally when we drew the graph.
The very first thing that we have to find for r square.
So, what will be our approach? We want to find RSS, to find the RSS I will have to first calculate the value of Y pred.
What is Y pred basically? Beta knot plus beta one x, which means if we define this particular value in this way, so this means that we have simply apply beta knot plus beta 1x in the formula, then we will get one Y pred.
If I take intercept value which is beta knot, which means by taking b21column I will plus it, this value multiplied by this value, whichever value it will give us that will be Y pred.
In the same way by using this formula I can calculate rest of the Y corresponding, I can calculate Y pred.
Next step is I have to find these values residual square.
Which means I have to calculate y-y pred whole square, which are calling as residual square as well.
If we simply see the formula of residual square, If I minus C2 form B2.
Which means, normally from this value I will minus this value and I square it, then I will get residue square value over there.
I have calculated it for all the values over here.
So, what will be my RSS value.
If I simply sum up the variable from d2 to d18 variable.
Whatever will be the sum of all these that will be called as my RSS.
Now, we will find the TSS’s value.
Now to find the TSS first of all I have to calculate Y’s average value or mean value.
So, my mean value comes here 15.564.
Normally, whatever average I have taken of this particular column that will be my Y average.
Now, if I want to calculate TSS, I can use the simple
formula, where from the b2, I can subtract the average mean from it and I can
calculate its square.
I have applied the normal formula and whichever corresponding y value that has come.
I've squared them all.
So, this my different values of sum of squares.
Now, if I want to calculate the TSS, I can simply give one formula that sum of from E2 till E18 Whichever is my value that will be called as my TSS.
So, we have got RSS, we have lot TSS.
So, what would be my R square, one minus RSS upon TSS.
If I simply do 1 minus B 22 upon D 20.
So, I will get r square value.
R square here is 0.90.
So, this signifies that whichever is my linear regression it forms 90% strong correlation, which means if you spend money on the ad, then you have a 90% chance that it will affect the revenue.
So, this is showing a positive relationship.
How positive relationship? Because our slope is also positive, 0.0528.
So, in this way if we have been given any value, we can use the Excel in it and we calculate our Y pred residual square and we find RSS from it, then we take average of Y and we calculate a sum of squares from which TSS is found.
Simply one minus RSS upon TSS formula we will put the values and calculate R square.
So, in this way we saw how linear regression is a very important tool if you want to find a relationship between two variables.
If you have any comments or questions related to this course then you can click on the discussion button below this video and you can post them over there.
In this way, you can connect with other learners like you and you can discuss with them.
This course is really nice, just have one question in empirical rule explanation , SD deviation example trainer is saying mean however mean (20+30+40+50+60+70/6) value is different kindly confirm than