In the previous chapter, we covered the two-sample mean test for paired samples.
In most scenarios, however, the samples we collect do not depend on each other; in other words, the observations in the two samples are independent.
In that situation, we perform a two-sample mean test for unpaired samples.
Here, no subject in the analysis is measured twice.
In the last example, we ran a before-and-after test on the same patient. Now suppose the researcher still wants to test his new drug, but instead of testing before and after on one patient, he compares the new drug with a standard drug.
So, in this case we have two different samples: one for the new drug that we want to test, and one for the standard drug that already exists for diabetes and serves as the comparison.
We have to compare the two samples and see which drug is more effective.
In that scenario, we perform a two-sample mean test for unpaired samples.
Let's see one more example. My question is whether Kohli performed better than Sachin in the second innings.
In this case, the two samples are separate, one for Kohli and one for Sachin, and we take data from the second innings only.
We record how much Kohli scored in the second innings and how much Sachin scored in the second innings, which gives us two samples that are independent of each other.
So, in this situation as well, we perform this test.
Now, let's look more closely at unpaired samples, that is, independent samples.
There are situations in which the variances of the two samples are equal, and other situations in which they are not.
For these two cases we have separate formulas, which involve the means and the standard deviations (or, equivalently, the variances) of both samples.
So, if you look at the equal-variance formula for the t-value, it seems a bit difficult, but the numerator is very easy: we subtract mean 2 from mean 1. Why are we doing that?
In the previous case, with paired samples, the hypothesized difference between the two means was zero.
In this situation, we want to find how much the two sample means differ from each other.
That is why the numerator is mean 1 - mean 2, the difference of the sample means.
In the denominator, you can see a lot of values, such as n1, n2, variance 1 and variance 2.
Wherever the subscript is 1, the value refers to the first sample; wherever it is 2, it refers to the second sample.
n1 is the first sample's size and n2 is the second sample's size.
Variance 1 is the first sample's variance and variance 2 is the second sample's variance.
The degrees of freedom here are (n1 - 1) plus (n2 - 1), or simply n1 + n2 - 2.
This is the formula you will normally see for the t-test. If you look at the unequal-variance formula, its denominator is very simple, but its degrees-of-freedom formula looks very difficult.
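For reference, the unequal-variance version mentioned here is usually written as Welch's t-test. The block below is the standard textbook form rather than something copied from the lecture slides, with s_1^2 and s_2^2 standing for the two sample variances:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```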
So, in such situations, when we are doing the analysis by hand calculation, this is what we generally do.
We follow a simple approach that gives us the answer whether the two sample variances come out equal or different.
Its formula goes this way.
t is equal to x1 bar minus x2 bar, the difference between the means of sample one and sample two, divided by the square root of (Sp squared upon n1 plus Sp squared upon n2).
Sp squared is generally called the pooled variance of the samples; its square root, Sp, is the pooled standard deviation.
It is a single combined value for both samples together, which we calculate with the following formula.
Sp squared is equal to (SS1 + SS2) divided by (df1 + df2).
Keep in mind that you can apply this formula whether the two sample standard deviations are equal or not.
Even if the two samples have different standard deviations, the same Sp squared is used in both terms of the denominator; the different values enter through SS1 and SS2. Strictly speaking, pooling assumes that the two populations share a common variance, even when the sample values differ.
In this formula, SS1 is the sum of squares for the first sample: we take the first sample's variance, s1 squared, and multiply it by df1, the degrees of freedom for the first sample.
SS2 is calculated in the same way: it is the sum of squares for the second sample, s2 squared multiplied by df2.
We find df1 and df2 in the usual way, as n1 - 1 and n2 - 1 respectively.
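Written out in symbols, the formulas described above look like this. It is a direct transcription of the spoken formulas, using s_p^2 for Sp squared:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}},
\qquad
s_p^2 = \frac{SS_1 + SS_2}{df_1 + df_2},
\qquad
SS_i = s_i^2 \, df_i,
\qquad
df_i = n_i - 1
```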
We will use this same formula for the two-sample mean test whenever we have unpaired samples.
So, let's see how we will use it.
One more important point: df1 and df2, which you have just calculated, belong to the individual samples.
When we need the degrees of freedom for the entire test, we add them, which gives the same formula as before.
n1 plus n2 minus 2 is the degrees of freedom for the test.
So, let's work through an example and see how we reach our conclusion.
A statistics teacher teaches two classes, Class A and Class B. He gave a test in both classes and wants to find out which class performed better.
He has the following data.
In Class A there are 25 students, whose average score is 70 with a standard deviation of 15.
This means that n1 is 25, x1 bar is 70, and s1 is 15.
Class B's values are given in the same way: there are 20 students, the average score is 74, and the standard deviation is 25.
So n2 is 20, x2 bar is 74, and s2 is 25.
We have to perform our analysis and tell whether the two classes performed differently or the same.
In this situation we have two different samples, one from Class A and one from Class B.
So the first thing that should come to mind is an unpaired two-sample mean test.
Before starting any test, we first state the null and alternate hypotheses.
My null hypothesis is that Class A and Class B performed the same, which means mu of Class A is equal to mu of Class B.
The alternate hypothesis is the opposite: mu of Class A is not equal to mu of Class B.
Since the alternate hypothesis says "not equal to" and does not specify a direction, this is a two-tailed test.
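In symbols, the two hypotheses for the class example can be written as follows (the subscripts A and B are just labels for the two classes):

```latex
H_0 : \mu_A = \mu_B
\qquad \text{versus} \qquad
H_1 : \mu_A \neq \mu_B
```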
In the second step, we find the t-statistic, that is, the calculated value we get by putting our numbers into the formula.
We have already seen how the formula is set up.
If I put n1 as 25 and n2 as 20, I get the degrees of freedom: df1 is 24 and df2 is 19.
With those values, we can calculate SS1 and SS2 as in the formula we have just seen.
How is SS1 calculated? I multiply the square of the first sample's standard deviation by df1, which is 15 squared into 24, or 5400.
s1 is given in the question.
In the same way, we know from the question that s2 is 25, so SS2 is 25 squared into 19, or 11875.
I put df1, df2, SS1 and SS2 into the formula for Sp squared, and then put the result into the t-statistic formula.
Sp squared comes out as 401.74, and that is the value I place in the formula.
Now, in the formula we also have to put sample 1's mean and sample 2's mean, that is, 70 - 74.
This becomes the numerator; in the denominator, we put Sp squared in both terms, divided by n1 and n2 respectively.
From there, the t-statistic, which means the t-calculated value, comes out as -0.67.
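The hand calculation can be double-checked with a few lines of Python. This is only a sketch using the numbers from the class example; the variable names are chosen here for illustration.

```python
import math

# Summary statistics from the class example
n1, mean1, sd1 = 25, 70.0, 15.0   # Class A
n2, mean2, sd2 = 20, 74.0, 25.0   # Class B

# Degrees of freedom for each sample
df1, df2 = n1 - 1, n2 - 1

# Sum of squares = sample variance * degrees of freedom
ss1 = sd1**2 * df1
ss2 = sd2**2 * df2

# Pooled variance
sp2 = (ss1 + ss2) / (df1 + df2)

# t-statistic for the equal-variance (pooled) test
t_stat = (mean1 - mean2) / math.sqrt(sp2 / n1 + sp2 / n2)

print(round(sp2, 2), round(t_stat, 2))  # 401.74 -0.67
```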
So, in the second step, we calculated the t-statistic.
The third step is to make the final decision.
For that, we first need the t-tabular (critical) value.
To look it up, we need the degrees of freedom for the entire test, which is n1 + n2 - 2.
That is 25 + 20 - 2, which equals 43.
With alpha at 0.05 for a two-tailed test, we look up t-tabular in the t-distribution table, and its value comes to 2.017.
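If SciPy is installed, the table lookup can be cross-checked in code. The sketch below assumes `scipy.stats` is available and uses `t.ppf`, the inverse CDF of the t-distribution; the result should agree with the table value to three decimals.

```python
from scipy.stats import t

alpha = 0.05
df = 25 + 20 - 2  # degrees of freedom for the whole test

# Two-tailed test: alpha is split equally between the two tails,
# so we need the point with 1 - alpha/2 of the area to its left.
t_critical = t.ppf(1 - alpha / 2, df)

print(round(t_critical, 3))
```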
The final decision depends on which is greater: t-tabular or the absolute value of t-calculated.
Here, t-tabular (2.017) is clearly greater than the absolute value of t-calculated (0.67).
So our value lies in the acceptance region, and whenever the value lies in the acceptance region, we say that we fail to reject the null hypothesis.
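The decision rule for a two-tailed test can be captured in a tiny helper. The function name `two_tailed_decision` is made up for this sketch; the important detail is that the comparison uses the absolute value of the t-statistic.

```python
def two_tailed_decision(t_stat, t_critical):
    """Compare |t| with the critical value and return the decision as text."""
    if abs(t_stat) > t_critical:
        return "reject H0"
    return "fail to reject H0"

# Class example: t-calculated = -0.67, t-tabular = 2.017
print(two_tailed_decision(-0.67, 2.017))  # fail to reject H0
```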
So, in this way, we reach the solution by putting values into the formula, as we did for the earlier tests.
The formula itself was not that difficult, but in this case there are a lot of calculations.
So we prefer to get the solution in Excel with one click and conclude the result of our test.
In Python, there is a simple one-line function for this, provided by the SciPy library.
It is maintained by people in the Python community.
You import ttest_ind from SciPy's stats module.
"ind" is the short form of "independent".
You pass in both columns of values, and you get back the test statistic and the p-value.
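A minimal sketch of the call is shown below. The scores are invented purely to demonstrate the function and are not data from the lecture; `equal_var` is the `ttest_ind` parameter that switches between the pooled test and the unequal-variance (Welch) test.

```python
from scipy.stats import ttest_ind

# Hypothetical scores, made up only to show the call
class_a = [62, 75, 81, 58, 70, 66, 79, 73]
class_b = [68, 90, 55, 82, 77, 61, 95]

# equal_var=True runs the pooled (equal-variance) test;
# equal_var=False runs Welch's unequal-variance test.
pooled = ttest_ind(class_a, class_b, equal_var=True)
welch = ttest_ind(class_a, class_b, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

By default the test is two-tailed, so the p-value can be compared directly with alpha: a p-value below 0.05 corresponds to rejecting the null hypothesis.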
Now, let's see how we can do this in Excel.
As we saw earlier, we go to the Data tab and use the Data Analysis section to get our answer.
This is the data set I have: Kohli's scores in the second innings of the 32 matches he played, along with Sachin's scores in the second innings.
I have both sets of values.
For these two sets of values, my null hypothesis is that the two means are equal, which means both samples have the same mean.
The alternate hypothesis is that the means are not equal, which means either Kohli performed better than Sachin or Sachin performed better than Kohli. So the alternate hypothesis is that mu of Kohli's score is not equal to mu of Sachin's score.
Once we have the null and alternate hypotheses, the next step is to go to the Data Analysis tool, where, as we saw, there are both equal-variance and unequal-variance options for this test.
Generally, we pick the unequal-variance option, because in most cases we do not know, without doing the calculations, whether the variances are equal or not.
Even if the variances turn out to be equal, the unequal-variance test remains valid and gives nearly the same result.
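To get a feel for how close the two versions are, the sketch below computes both statistics side by side. Since the Kohli and Sachin scores are not reproduced here, it reuses the Class A and Class B summary numbers from the earlier example; the Welch formulas are the standard textbook ones.

```python
import math

# Summary statistics from the class example (not the cricket data)
n1, mean1, sd1 = 25, 70.0, 15.0
n2, mean2, sd2 = 20, 74.0, 25.0

# Variance of each sample mean: s^2 / n
v1, v2 = sd1**2 / n1, sd2**2 / n2

# Pooled (equal-variance) statistic and degrees of freedom
sp2 = (sd1**2 * (n1 - 1) + sd2**2 * (n2 - 1)) / (n1 + n2 - 2)
t_pooled = (mean1 - mean2) / math.sqrt(sp2 / n1 + sp2 / n2)
df_pooled = n1 + n2 - 2

# Welch (unequal-variance) statistic and Welch-Satterthwaite df
t_welch = (mean1 - mean2) / math.sqrt(v1 + v2)
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
```

Both statistics are well below 2 in absolute value, so for this data the choice of test does not change the decision.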
Here also, we have to give the ranges for variable 1 and variable 2.
We set the hypothesized mean difference to 0.
Labels work as we saw before.
If the ranges for variable 1 and variable 2 include the header cells B1 and C1, we tick the Labels box; otherwise, we select only the numerical values.
We have chosen alpha as 0.05.
I place the output range on the same sheet.
I have already calculated this result.
This is the result of the two-sample mean test for unpaired samples with unequal variances.
The values come out like this.
Here you can see the mean, variance, observations and so on, the same as before; we just have to interpret them.
Because the variances differ, the formulas are a little more complicated, and because we have two different samples, a few more things enter the formula: each sample's variance and each sample's size.
So, when we perform the test in Excel, you can see that we get the answer with just a click.
Now let's see the final decision with respect to the null and alternate hypotheses.
My t-statistic, the value we calculate by putting the numbers in the formula, has come to -2.34.
For the two-tailed test, the t-critical value, which we would otherwise look up in the table, has come to 2.003.
For a two-tailed test we compare the absolute value of the t-statistic with t-critical, and here 2.34 is greater than 2.003.
This means the value lies in the rejection region, so we reject the null hypothesis and conclude that Kohli's and Sachin's mean scores in the second innings are significantly different at the 0.05 level.
So, in this way we perform different analyses and different tests for different scenarios and find our results.
If you have any comments or questions related to this course then you can click on the discussion button below this video and you can post them over there.
In this way, you can connect with other learners like you and you can discuss with them.