So, we have now covered what a correlation coefficient is. But if the relationship is not linear, or if I have categorical rather than continuous variables, or if I have different types of distributions, then which type of correlation coefficient should we use?
We will be seeing it in this chapter.
We have different types of correlation coefficients, like Pearson's r coefficient, Spearman's rho coefficient, and Kendall's tau coefficient.
In this module we will especially learn about Pearson's r and Spearman's rho.
Why? Because these are the most widely used correlation coefficients.
So, what does Pearson's r coefficient do? It tells us about the linear relationship between any two numerical variables, which means any two variables that are quantitative, that is, expressed in numbers.
If we want to find out about their relationship, we use Pearson's r coefficient.
Before using Pearson's r, we have to keep some of its assumptions in mind. First, both variables should be continuous, which means they should be numerical variables.
Second, the data for both variables should follow a normal distribution.
Third, there should be no outliers in the data. Fourth, whichever sample we are using should be representative of the complete data.
Fifth, there should be a linear relationship between the two variables. If all five of these assumptions hold, we can use Pearson's r coefficient to measure the linear relationship between any two variables; a rough sketch of how a couple of these assumptions might be checked in code is shown below. After that, let's look at its formula.
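Here is that sketch, a minimal illustration that is not part of the lecture: the arrays x and y and the 3-standard-deviation cutoff are purely hypothetical, and SciPy's Shapiro-Wilk test is used here only as one possible normality check.

```python
# Minimal sketch (illustrative only): rough checks of two Pearson's r
# assumptions -- normality and absence of extreme outliers.
import numpy as np
from scipy import stats

# Hypothetical sample data for the two variables.
x = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 17.0, 16.0, 13.0, 11.0])
y = np.array([24.0, 30.0, 28.0, 21.0, 35.0, 33.0, 31.0, 26.0, 23.0])

for name, values in [("x", x), ("y", y)]:
    # Shapiro-Wilk normality test: a very small p-value suggests non-normal data.
    _, p_value = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk p-value = {p_value:.3f}")

    # Simple outlier screen: flag points more than 3 sample standard deviations from the mean.
    z_scores = np.abs((values - values.mean()) / values.std(ddof=1))
    print(f"{name}: indices of possible outliers = {np.where(z_scores > 3)[0].tolist()}")
```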
Looking at the formula, we may find it very difficult at first.
What are these x's and y's, why are there squares, and why are there multiplications as well? There is nothing to worry about.
As we saw in the last module, if I divide the covariance by the product of the standard deviations, it simply gives me the correlation coefficient.
So, in this formula, r with the subscript xy is the strength of the correlation between the two variables x and y.
Small n would be my sample size.
The summation sign is nothing but the sum we take over all the values.
x means each value of the x variable, and y means each value of the y variable.
xy means we multiply the corresponding scores of the x variable and the y variable.
Putting all of these together, we get one particular formula.
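Based on the terms just described (the sample size n, the summations, the products xy, and the squared terms), the formula on the slide is presumably the standard computational form of Pearson's r, written out here for reference:

```latex
r_{xy} = \frac{n\sum xy \;-\; \left(\sum x\right)\left(\sum y\right)}
              {\sqrt{\,n\sum x^{2} - \left(\sum x\right)^{2}\,}\;\sqrt{\,n\sum y^{2} - \left(\sum y\right)^{2}\,}}
```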
We write Pearson's r coefficient differently depending on whether we are working with a sample or a population. If I calculate the sample Pearson's r, we denote it by a small r, with a small x and y in the subscript.
Its simple formula is the covariance of x and y divided by s of x multiplied by s of y.
What are s of x and s of y? They are the standard deviations of x and y.
So simply, if we calculate the covariance of x and y and divide it by the product of their standard deviations, we get Pearson's r coefficient for any sample.
To calculate the population Pearson's r, we denote it with a Greek letter, rho, where we keep capital X and capital Y in the subscript.
What is its formula? Instead of s of x and s of y, I simply use sigma of x and sigma of y, which means the population standard deviations in place of the sample standard deviations. Then I get my population Pearson's r.
So, in this way, by putting in the different values and finding the covariance, we can find Pearson's r.
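As a minimal sketch (not from the lecture), here is how the sample Pearson's r can be computed as the covariance divided by the product of the standard deviations, and cross-checked against SciPy's built-in function; the x and y arrays are hypothetical.

```python
# Minimal sketch: sample Pearson's r = cov(x, y) / (s_x * s_y),
# cross-checked with scipy.stats.pearsonr. Data are hypothetical.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([3.0, 5.0, 6.0, 8.0, 11.0])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                      # sample covariance of x and y
r_xy = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # divide by s_x * s_y

r_scipy, _ = stats.pearsonr(x, y)
print(round(r_xy, 4), round(r_scipy, 4))  # both values should match
```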
Next, we will see what the Spearman's rho coefficient is. We also call Spearman's rho coefficient the Spearman's rank correlation coefficient.
We mainly use it when we are not able to use Pearson's r coefficient, which means that if any of the assumptions of Pearson's r fails, we use Spearman's rho.
For example, instead of continuous or numerical values, it is possible that we have a categorical column, in which, say, male or female is recorded, or success or failure is recorded.
If I have these kinds of categorical columns, or if my distribution does not follow a normal distribution,
in all these situations where we cannot use Pearson's coefficient, we use Spearman's coefficient.
This is basically a rank correlation coefficient.
Why? Because we assign a rank to every value, from highest to lowest, and after that we compute the coefficient.
How exactly do we do that? We will see it shortly.
Let's see a small difference between Pearson’s and Spearman’s.
What was Pearson's coefficient doing? It was telling us the linearity of a relationship, which means: if one variable changes in one direction, will the other variable also change in the same direction, and will it change at the same rate? But what does Spearman's coefficient do? It handles non-linear relationships; it tells us the monotonicity of the relationship.
What is monotonicity? The values change in the same direction, but not necessarily at the same rate, and that is what Spearman's rho coefficient tells us. If we look at the two curves, the linear relationship is where a straight line is drawn through x and y, which simply tells me that the x and y variables are linearly related and change at the same rate.
But if you look at the monotonic relationship in the curve on the right, my curve is curved rather than straight.
In this case, if my x value changes, the direction of y's value is the same: it changes in the positive direction. But is its rate the same? No.
As x's value gets larger, the rate at which y increases reduces.
Because of this, if we want to measure linearity we use Pearson's, and if we have to capture a non-linear but monotonic relationship we use Spearman's.
Just as we saw with Pearson's how we define positive and negative linear relationships, in the same way we will see monotonicity here.
If you look at the positive monotonic relationship curve, it means that if one value increases, the other value also increases.
A negative monotonic relationship means that if one value increases, the other value decreases, as we can see from that curve.
A non-monotonic curve means that the two variables do not follow a completely increasing or completely decreasing pattern.
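To make this difference concrete, here is a small sketch (not part of the lecture) comparing the two coefficients on a constructed monotonic but non-linear relationship, y = x cubed:

```python
# Minimal sketch: Pearson's r vs Spearman's rho on a constructed
# monotonic but non-linear relationship (y = x**3).
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = x ** 3  # increases with x, but not at a constant rate

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson's r    = {r:.3f}")    # high, but below 1: the relation is not a straight line
print(f"Spearman's rho = {rho:.3f}")  # exactly 1: the relation is perfectly monotonic
```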
So, let's see how we use the Spearman's correlation coefficient, where it is mainly used, and in what way we assign the ranks. There is a simple logic: if I have to find the Spearman's rank correlation coefficient between any two variables, we will first start by looking at the formula of the Spearman's rank correlation coefficient.
If you look at its formula, it seems a little strange at first: one minus six times the summation of d i squared, divided by n cubed minus n.
There are quite a few terms in it.
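Written out, the formula just described is:

```latex
\rho = 1 - \frac{6\sum_{i} d_i^{2}}{n^{3} - n}
     = 1 - \frac{6\sum_{i} d_i^{2}}{n\,(n^{2} - 1)}
```

where d_i is the difference between the two ranks of the i-th observation and n is the number of observations.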
Let's see this formula through an example, so that we can understand all these terms easily.
Suppose I have the physics and maths data of 9 students, which means I have a data set of how many marks each student scored in physics and in maths.
I have to compute the rank of the students in both subjects and then find the Spearman's rank correlation.
That is, I have to find out what the relationship is between the ranks in these two subjects.
So, we will do this in a series of steps.
In the first step, I have noted both subjects' marks in a table and assigned a rank next to each mark. For example, if you look at the physics marks, 47 was my greatest value, so I have given it rank one.
After that, 43 is my second-highest value, so I have given it rank two. In the same way, I have ranked all the values from one to nine in maths as well.
In the second step, we created a column named d. d is basically the difference between the ranks, which means we take the difference between the rank assigned for physics and the rank assigned for maths. For example, in the first row it is rank three for physics and rank five for maths.
So, if I subtract three from five, my value comes out as two.
We take its absolute value; keep this in mind, we drop the negative sign.
In the same way, 5 minus 3 is 2, for ranks 1 and 2 we have taken 1, and we have calculated all the values of d.
What did we do in the third step? We squared all the d values and placed them in a new column, d squared.
This means that if the value of d was two, its square is 4, and if the value of d is one, its square is 1.
In the same way, we have filled in the d squared column.
Now what I did is sum up the entire d squared column.
The sum of the d squared values that I got is 12.
What will we do in the final step? We will put all the values into our formula.
At this point we already know what the sum of d squared is and what n is, so I simply substitute these values into the formula.
One minus 6 multiplied by 12, divided by 9 times (9 squared minus 1). 9 was my sample size,
which means I was working with 9 students' data, so n is 9.
9 squared is 81, minus 1 gives 80, and 9 times 80 makes the denominator 720.
When I put in all these values, I got a Spearman's rank correlation of 0.9.
So, this is a positive correlation between the two sets of ranks.
From this we can easily tell that the ranks in the two subjects are positively correlated.
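As a quick check, here is a minimal sketch that plugs the numbers from this example (sum of d squared = 12, n = 9) into the formula:

```python
# Minimal sketch: Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# using the values from the worked example above.
sum_d_squared = 12  # sum of the squared rank differences from the table
n = 9               # number of students

rho = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))
print(rho)  # 0.9
```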
So, simply put, Spearman's rank correlation values lie between minus one and plus one.
If the value is plus one, there is a perfect positive correlation between the ranks, and if it is minus one, there is a perfect negative correlation between the ranks.
If my Spearman's correlation value comes out to be 0, it means there is no correlation between the ranks.
So, in this way we saw how, through Spearman's rank correlation, we can correlate the ranks of any two variables.
So, we have covered two types of correlation coefficients: Pearson's r and Spearman's rho.
If I want to measure a linear relationship, I will use Pearson's r; if I want to know about a non-linear but monotonic relationship, I will use Spearman's rho.
If the data are continuous and numerical, we use Pearson's r; if there are categorical variables, we use Spearman's rho.
Simply put, if the data follow a normal distribution, use Pearson's, and if they follow some other distribution, we can use Spearman's rho.
So, in this way we have seen what the different types of correlation coefficients are, and how and when we use them.
Congratulations, everyone. We have completed the fourth module, on hypothesis testing.
In this module, we learned many new concepts, many terms that we encountered for the first time.
In the first unit we saw what hypothesis testing is and how it is useful in data science.
In the second unit we covered the first thing we have to do while performing hypothesis testing, that is, framing the null and alternate hypotheses.
We saw, through a lot of examples, how to create them for different situations.
In the third unit we saw how, once you have the null and alternate hypotheses, you make the decision in a test.
That is, whether we reject the null hypothesis or fail to reject the null hypothesis, and how we arrive at both of these decisions on the basis of the critical and acceptance regions.
We covered all of these things in detail.
In the fourth unit we saw the different types of errors, in which we learned about type one and type two errors.
With examples like the criminal trial and the covid vaccine, we understood properly how a type one error or a type two error can be reduced.
In the fifth and sixth units we covered what covariance is, what a correlation coefficient is, what the relationship between the two is, and what the different types of correlation coefficients are.
We learned about Pearson's r and Spearman's rho in detail, including when to use which coefficient.
In the next module we will be covering the different types of hypothesis tests.
Thank you.
If you have any comments or questions related to this course, you can click on the discussion button below this video and post them there.
In this way, you can connect with other learners like you and discuss with them.
You can also share a personalized message with your friends.