In the last module, we covered the t test, in which we used two samples to perform our analysis.
But let's assume that I have more than two groups or more than two samples.
Which test do we use in that case? We use a test called the F test, or ANOVA test, which stands for analysis of variance.
It is the test we perform when we have two or more groups and we want to find out whether their means are significantly different from each other or not.
So, we call this kind of test the F test or ANOVA test.
If you have just two samples, you can use either the t test or the ANOVA test and the conclusion will be the same. But if there are more than two groups, running the t test on the groups pair by pair, one after the other, becomes a tedious activity.
Plus, in that case, our alpha, the chance of a type one error, also increases a lot, because every extra t test is another chance of a false positive.
So, in this situation, we simply use an ANOVA test.
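To see how quickly the type one error grows with repeated t tests, here is a small sketch. It assumes the pairwise tests are independent and that each one uses an alpha of 0.05; both are simplifying assumptions, and the function name is only illustrative.

```python
from math import comb


def familywise_error_rate(num_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise t tests.

    Assumes the tests are independent, which is a simplification.
    """
    num_tests = comb(num_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** num_tests


for k in (2, 3, 4, 5):
    rate = familywise_error_rate(k)
    print(f"{k} groups -> {comb(k, 2)} tests, error rate {rate:.4f}")
```

With three groups there are already three pairwise tests, and under these assumptions the chance of at least one false positive is about 14% instead of 5%.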
To perform an ANOVA test, there are certain terms that we should know beforehand.
So, let's understand them with an example.
What are the different terminologies if I have three samples, or three different groups, involved? Let's assume that we have three different groups, in which the first group is of those people who drink water.
The second group is of those people who drink juice.
The third group is of those people who drink coffee.
I have to note the reaction time in all three groups.
That is, when a person drinks water, juice or coffee, does the reaction time depend on the person or does it depend on the drink? In this situation, we will use the F test.
What will be the null hypothesis in this case? Let's consider the means of all three groups.
The null hypothesis is that their means are the same.
The alternate hypothesis would be that the means of the three groups are not all the same.
Now we will see the terminology.
When three groups are involved, it is possible that the variation that I have, which means the differences that are there, lies within each group.
That is the variation within each group.
We are able to notice that every person has a different reaction time.
Now, there is one more possibility. I have looked within the groups, but it is possible that the variation lies between the groups as well.
For example, if you look at the first group, its values are 29, 30, 31, 31, 29.
They are different from the data of the people who drink juice or from the data of the people who drink coffee.
So here we have two different terminologies.
One is the variation within each group and the second is the variation between the groups.
Now let’s take one situation.
Suppose the variation in every group is large. If you look at the data of the people who drink water, the people who drink juice and the people who drink coffee, every group contains different types of people.
Some people are faster and some are slower: if you look at the reaction time, some are taking 10 seconds and some are taking 12 seconds.
There are even a few people who drink coffee and have a reaction time of 37 seconds.
In this way,
a few people are faster and a few are slower.
But if you compare the groups as a whole, they look almost similar. In this situation, the within-group variation is large but the between-group variation is small.
So, what is the conclusion? In this situation we can say that the variation is because of the people and not because of the drink.
Why? Because the drink is not showing any effect on the reaction time.
Every person's reaction time is different.
And it is not getting affected by the drink.
In this case the conclusion is that we fail to reject our null hypothesis.
Let's take another situation, in which there is not much variation within each group.
Look at the first group, the people who drink water.
If you note their reaction times,
then within the group they look almost the same to me:
29, 30, 31. In the same way, for the people who drink juice, the reaction times are almost the same: 17, 18, 19, 20.
The people who drink coffee also lie in a narrow range: 10, 11, 12, 13.
But if you look at the variation between the groups, you will see that all three groups are completely different from each other.
So, here we can conclude that the drinks matter more than the people: the variation is happening because of the drinks.
So, the drink has an effect on the reaction time.
And thus, we reject our null hypothesis.
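The second scenario can be checked numerically. The sketch below takes the reaction times quoted above for water, juice and coffee and measures the two kinds of variation as sums of squared differences. The variable names are illustrative, and the grand mean here is taken over all thirteen values.

```python
from statistics import mean

# Reaction times quoted in the lecture for the second scenario.
groups = {
    "water": [29, 30, 31, 31, 29],
    "juice": [17, 18, 19, 20],
    "coffee": [10, 11, 12, 13],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Within-group variation: squared distance of each value from its own group mean.
ss_within = sum(
    (x - mean(values)) ** 2 for values in groups.values() for x in values
)

# Between-group variation: squared distance of each group mean from the grand
# mean, weighted by the size of the group.
ss_between = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)

print("within: ", round(ss_within, 2))
print("between:", round(ss_between, 2))
```

The between-group figure is far larger than the within-group figure, which is the pattern that leads us to reject the null hypothesis.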
So, in this way we saw two different terminologies: the within-group variation and the between-group variation.
So far we have applied ANOVA to two or more groups.
We can also apply it to more than one variable.
How? For that, the ANOVA test comes in two different variants.
One is one-way ANOVA and the second is two-way ANOVA.
So first of all, we will see how we perform a one-way ANOVA test.
The requirement of the one-way ANOVA test is that the data you have collected should contain one categorical variable
and one quantitative variable.
What do categorical and quantitative mean? One variable should be numerical data.
And the other should hold categorical values, which means labels such as success and failure, or zero and one.
In general, since we are performing an ANOVA test, the categorical variable should split the data into two or more groups, and if there are more groups than that, we can still perform the ANOVA test.
But since we are performing the one-way ANOVA test, we should ideally have at least three different groups.
What is the null hypothesis here? That there is no difference between the group means. The alternate hypothesis is that at least one group mean differs significantly from the others.
Now we'll take an example of how to perform a one-way ANOVA test, using the concept of within the groups and between the groups that we have seen.
How does it fit into our test? A survey was conducted on three different e-commerce platforms: Amazon, Flipkart and Snapdeal.
We took feedback on all three from people,
and whatever the feedback was, we recorded it.
The question we want to answer with one-way ANOVA is whether all these platforms are equally popular or not: based on the feedback that we have taken, I want to find out whether Amazon, Flipkart and Snapdeal are equally popular.
So, my null hypothesis would be simply that all the platforms are equally popular.
Which means the means of all three groups are equal.
The alternate hypothesis would be that at least one platform's mean is different, which means its popularity differs from the others. To test it, we have to pay attention to certain formulas: what is the variation between the groups, how do we calculate the variation within the groups, and how do we find the variation across all observations? The formulas you see on the right-hand side are the sum of squares formulas, in which SSB is the sum of squares between the groups,
SSW is the sum of squares within the groups and TSS is the total sum of squares.
So, these are the formulas that we will use.
I will first define the terminologies, how we read these values, and then we will put them into an example.
i denotes an observation within a group.
In this example we have 7.5, 8.5, 6, and the Flipkart values 7, 9.5.
All these are my observations, and i runs over the observations in a group.
j denotes one particular group.
If I'm talking about Amazon, that would be my group one.
If I talk about Flipkart, that would be my group two, and Snapdeal would be my group three.
n of j is the total number of observations in group j. x of ij is the i-th observation in the j-th group; if I go over every row and every column of the table, I get all my observations.
x j bar denotes the mean of a particular group.
x bar is my grand mean, which means you calculate the mean of each of the three groups and then average those three means together.
That will be my grand mean.
p means I have p groups with me.
There are 3 groups in total, so the value of p is three here.
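For reference, the standard sum of squares formulas can be written in this notation as follows. The slide itself is not part of the transcript, so this is a reconstruction from the definitions of i, j, n of j, x of ij, x j bar, x bar and p given above.

```latex
\begin{align*}
\mathrm{SSB} &= \sum_{j=1}^{p} n_j \left(\bar{x}_j - \bar{x}\right)^2 \\
\mathrm{SSW} &= \sum_{j=1}^{p} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
\mathrm{TSS} &= \sum_{j=1}^{p} \sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2
\end{align*}
```

When x bar is the mean of all observations, the three quantities satisfy TSS = SSB + SSW.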
Now, the formula used to calculate SSB looks difficult.
But since it is a sum of squares, we will simply put our values in and square them.
So, in the first step we have created our null and alternate hypotheses.
For one-way ANOVA we have to calculate two variations: one is the variation between the samples and the second is the variation within the samples.
So first of all, how do we calculate the variation between the samples? We have divided it into three small steps and laid out our values in the form of a table.
So, the first task is to calculate every sample's respective mean.
What does this mean? We had three different categories: Amazon, Flipkart and Snapdeal.
The mean for Amazon we denoted by x bar A.
Its value comes out as 8.04545.
In the same way we have denoted the Flipkart mean by x bar F.
We have kept the F for convenience, so that we know that this is the Flipkart data; its value comes to 8.2272.
We have denoted x bar S for Snapdeal; its value is 6.
This is a simple mean, which you find by summing up all the values and dividing by the number of observations.
My second step is to calculate the grand mean, which means we take the three sample means that we have just calculated, add them and divide by 3, and I get the grand mean, whose value is about 7.4242.
So far we have calculated two things: the mean of each sample in the first step and the grand mean in the second step. The third step is very important.
Here we take the difference between the values that we calculated in the first step and in the second step, and we square it.
So how do we do it? In the first case, x bar A is my sample mean for Amazon and x bar is my grand mean.
So, we have taken their difference, squared it in the next column, and in the same way we have built the columns for Flipkart and Snapdeal.
What do we have to do in the final step? We have to sum all the squared values that we have calculated. If I take the column of x bar A minus x bar, whole squared,
then the sum comes to about 4.2449.
In the same way there is one value for Flipkart and one value for Snapdeal.
How will we calculate SSB? We sum all three of these values, and the final value that we get is SSB, the sum of squares between the samples.
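The between-samples calculation can be reproduced from the summary figures alone. This is a sketch under stated assumptions: the raw feedback table is not in the transcript, so it works from the quoted sample means (with the Flipkart mean taken as 8.22727) and assumes group sizes of 11, 11 and 7, which add up to the 29 observations quoted in the lecture.

```python
# Sample means as quoted in the lecture (rounded). The group sizes 11, 11 and 7
# are an assumption chosen to match the 29 observations quoted in the lecture.
group_means = {"Amazon": 8.04545, "Flipkart": 8.22727, "Snapdeal": 6.0}
group_sizes = {"Amazon": 11, "Flipkart": 11, "Snapdeal": 7}

# Grand mean as computed in the lecture: the plain average of the three means.
grand_mean = sum(group_means.values()) / len(group_means)

# Every row of a group has the same (group mean - grand mean)^2, so a group
# of size n contributes n times that squared difference.
contributions = {
    name: group_sizes[name] * (group_means[name] - grand_mean) ** 2
    for name in group_means
}
ssb = sum(contributions.values())

print("grand mean:", round(grand_mean, 4))
for name, value in contributions.items():
    print(name, round(value, 4))
print("SSB:", round(ssb, 4))
```

Under these assumptions the Amazon column sums to about 4.2449 and SSB to about 25.54. Note that when the group sizes differ, the grand mean is more commonly defined as the mean of all observations rather than the average of the group means, which would give a slightly different SSB.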
So, my first step was to create null and alternate hypothesis.
In the second step we calculated the value of SSB, which measures the variation between the samples.
In the third step we will calculate the variation within the samples.
For this, the first thing is that we have already calculated each sample's mean.
So, when we calculate the within-sample value,
we simply subtract the sample mean of the respective group from each observation.
When we were calculating the between-sample variation, we had calculated x bar A minus x bar.
Here, if an Amazon value is 7.5, then we subtract its sample mean, which is 8.04545, from it.
So, in this way we calculate A minus x bar A, and in the next column we calculate its square.
In the same way we have calculated the values for Flipkart and Snapdeal.
In the final step, we sum the squared values of all three groups.
When we sum them, we get one value, 66.4090.
So, this is the value of SSW.
So, in this way, in the first, second and third steps,
we framed the null and alternate hypotheses.
After that we calculated the value of SSB, and then we found SSW.
Now, before we can calculate the test statistic, it is very important that we find the degrees of freedom.
We have two sources of variation, one within the groups and one between the groups, so how are the degrees of freedom found for each?
We'll see that.
If you want to calculate the degrees of freedom within the groups, then from the total number of observations in your table we subtract the total number of groups. What is the total number of observations here? Counting 7.5, 8.5 and the rest of the Amazon values, there are 11 values. How many are there for Flipkart? 11 again.
In Snapdeal I have got 7 values.
So, after totalling them, the number of observations in my table comes to 29.
What would be the number of groups? Three: Amazon, Flipkart and Snapdeal.
So, the degrees of freedom within the groups come to 29 minus 3, which is 26. The degrees of freedom between the groups are very simple to calculate.
We subtract one from the number of groups, so three minus one is equal to two.
Now, you must be wondering why we are calculating the degrees of freedom and the sums of squares. What we ultimately want from the ANOVA test is one F ratio, or F statistic. To calculate it, we need the sum of squares and the degrees of freedom between the groups and within the groups, and after that there is one more term, which we call the mean square.
So, what is the mean square? It is the sum of squares divided by its degrees of freedom.
First of all, we will fill in this table.
I have calculated the sum of squares between the groups, which comes out as 25.5376.
I've calculated the sum of squares within the groups, which comes out as 66.40909.
My degrees of freedom within the groups are 26 and between the groups they are two.
Now, to calculate the mean square, we take the values in the second and third columns.
We divide the sum of squares by the degrees of freedom. The between-groups value comes out as 12.768821 and the within-groups value comes out as 2.554195769.
In this way, we have calculated the mean squares.
Now comes the most important statistic that we have to find, which we call the F ratio or F statistic.
We take the mean squares that we found between the groups and within the groups, and dividing the between-groups mean square by the within-groups mean square gives my F ratio.
So, we have seen how the whole calculation leads to the F ratio.
For that we have to follow a lot of steps.
First, we have to find SSB, then we have to find SSW.
After that we have to find the degrees of freedom.
Once we have the sums of squares and the degrees of freedom,
we calculate one mean square value for each source of variation and then we find our F ratio.
So, my F ratio in this case comes out as 4.999155176, which is a fairly large value.
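The steps just listed can be strung together in a few lines. This sketch plugs in the sums of squares and counts quoted in the lecture; the variable names are illustrative.

```python
# Sums of squares and counts quoted in the lecture.
ssb, ssw = 25.5376, 66.40909
num_observations, num_groups = 29, 3

df_between = num_groups - 1                 # 3 - 1 = 2
df_within = num_observations - num_groups   # 29 - 3 = 26

ms_between = ssb / df_between   # mean square between the groups
ms_within = ssw / df_within     # mean square within the groups
f_ratio = ms_between / ms_within

# Print the ANOVA table row by row.
print(f"{'source':<10}{'SS':>10}{'df':>4}{'MS':>10}")
print(f"{'between':<10}{ssb:>10.4f}{df_between:>4}{ms_between:>10.4f}")
print(f"{'within':<10}{ssw:>10.4f}{df_within:>4}{ms_within:>10.4f}")
print("F ratio:", round(f_ratio, 4))
```

In practice a library routine such as `scipy.stats.f_oneway` computes the same F statistic directly from the raw group data.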
If we round it off, we can take approximately 4.99 as my value. So, we have found F calculated, but we also need to find the F critical value.
For that we have an F distribution table, which gives a value for each pair of degrees of freedom, between the groups and within the groups.
We choose an alpha and find the F critical value corresponding to it.
In this example my degrees of freedom were 2 and 26.
Corresponding to both of those, for an alpha of 0.05, my value comes out as 3.3690.
Now we have to make the decision, so we simply compare F critical with F calculated. In this case F critical is smaller than F calculated, which means F calculated lies in the rejection region, so we reject our null hypothesis.
So, our way of making the decision is the same in all these tests.
Whenever the critical value is smaller than the calculated value, we simply reject the null hypothesis.
And we can conclude in this particular case that Amazon, Flipkart and Snapdeal are not equally popular.
Why? Because our null hypothesis was that all the platforms are equally popular, and we have rejected the null hypothesis.
So, our final conclusion is that these three platforms are not equally popular.
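The table lookup and the decision can also be sketched in code. The shortcut below is not the method used in the lecture: it relies on the fact that with exactly 2 degrees of freedom between the groups, the F distribution's tail probability has a closed form, so the critical value can be solved for directly. The alpha of 0.05 is an assumption, and the function name is illustrative.

```python
def f_critical_df1_equals_2(df_within: int, alpha: float = 0.05) -> float:
    """Upper-tail critical value of the F distribution with 2 numerator df.

    For 2 numerator degrees of freedom the tail probability is
    P(F > f) = (1 + 2 * f / df_within) ** (-df_within / 2),
    which can be solved for f. This does not apply to other numerator df.
    """
    return (df_within / 2) * (alpha ** (-2 / df_within) - 1)


f_calculated = 4.999155176  # F ratio from the lecture
f_critical = f_critical_df1_equals_2(df_within=26, alpha=0.05)

print("F critical:", round(f_critical, 4))
if f_calculated > f_critical:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

For other degrees of freedom, an F table or a statistics library is the usual route.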
In this way in different scenarios, we can perform a one-way ANOVA test.
The ANOVA test is used a lot in industry, because we often have more than two groups to compare, but there are a lot of calculations involved in it.
So we prefer the tools that are used in industry, like Excel, Python or R; with such programming or statistics tools, we can find our values directly.
So, here we have covered one way ANOVA test.
One-way ANOVA test is similar to the way we perform the t test.
The only difference is that we can use three or more samples or groups in place of two samples.
Now we will see how the two-way ANOVA test is performed. We just saw that if we have two, three or more groups, we use the ANOVA test. But let's say there is more than one variable: we have the data on the people who drink water, the people who drink juice and the people who drink coffee.
But apart from that we have one more piece of data, which is whether the drink was taken in the morning or in the evening, and we want to know when these drinks are more effective.
So basically, two-way ANOVA is used when we don't just have two or more groups from one variable, but two categorical variables.
How is the ANOVA test performed in this case? First, we see whether the drink has a significant effect on the reaction time or not.
Second, does the time of day have an effect on the reaction time? That is, does it make a difference whether people are drinking in the morning or in the evening? Third, we can also analyse the two variables together, which means the drink people are having and the time at which they are drinking it.
Is there an interaction between them or not? For example, is coffee more effective in the morning or in the evening? So, in this way, the two-way ANOVA test is done.
Its significance is that if you have two such variables, you can use the two-way ANOVA test.
Generally, the two-way ANOVA test involves a lot of calculations.
So, we prefer industry tools like Python, R or Excel: we feed in the data set, run the two-way ANOVA test and simply read the result from there.
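To make the three questions concrete, here is a minimal sketch of a two-way ANOVA worked by hand in Python. The reaction times are hypothetical, invented only for illustration, and the code assumes a balanced design, meaning the same number of people in every drink and time-of-day cell.

```python
from statistics import mean

# Hypothetical reaction times (not from the lecture): two people per cell.
data = {
    ("water", "morning"): [30, 32], ("water", "evening"): [31, 33],
    ("juice", "morning"): [18, 20], ("juice", "evening"): [19, 21],
    ("coffee", "morning"): [10, 12], ("coffee", "evening"): [14, 16],
}
drinks = sorted({drink for drink, _ in data})
times = sorted({time for _, time in data})
reps = 2  # observations per cell; the design must be balanced

grand = mean(x for values in data.values() for x in values)
drink_mean = {d: mean(x for t in times for x in data[(d, t)]) for d in drinks}
time_mean = {t: mean(x for d in drinks for x in data[(d, t)]) for t in times}
cell_mean = {cell: mean(values) for cell, values in data.items()}

# Split the total variation into drink, time of day, interaction and error.
ss_drink = len(times) * reps * sum((m - grand) ** 2 for m in drink_mean.values())
ss_time = len(drinks) * reps * sum((m - grand) ** 2 for m in time_mean.values())
ss_interaction = reps * sum(
    (cell_mean[(d, t)] - drink_mean[d] - time_mean[t] + grand) ** 2
    for d in drinks for t in times
)
ss_error = sum(
    (x - cell_mean[cell]) ** 2 for cell, values in data.items() for x in values
)
ss_total = sum((x - grand) ** 2 for values in data.values() for x in values)

df_drink, df_time = len(drinks) - 1, len(times) - 1
df_interaction = df_drink * df_time
df_error = len(drinks) * len(times) * (reps - 1)
ms_error = ss_error / df_error

# One F ratio per question: drink effect, time-of-day effect, interaction.
f_drink = (ss_drink / df_drink) / ms_error
f_time = (ss_time / df_time) / ms_error
f_interaction = (ss_interaction / df_interaction) / ms_error

print("F drink:      ", round(f_drink, 3))
print("F time of day:", round(f_time, 3))
print("F interaction:", round(f_interaction, 3))
```

Each F ratio would then be compared with its own F critical value, in the same way as in the one-way test.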
So, in this way, we have covered different types of tests: what the Z test is, what the t test is,
what the difference between the two is, when the ANOVA test is performed and when the chi-square test is performed. Here we complete the module on the different types of tests.
Congratulations, friends! We have covered the fifth module,
in which we saw in detail the different types of hypothesis tests.
In the first unit we saw which hypothesis tests we have: we prepared a chart, and we saw the Z test with the critical value method and the p value method.
We perform the t test when we are comparing samples,
and we performed different types of t tests.
After that, we also saw how we can use the chi-square test and the ANOVA test.
So, in this way we saw how we can perform the different tests.
And it is very important for us to know which hypothesis test we must use in which situation.
If you have any comments or questions related to this course then you can click on the discussion button below this video and you can post them over there.
In this way, you can connect with other learners like you and you can discuss with them.