In this chapter, we will learn what the measures of central tendency are.
So, first we should know what central tendency is. Central tendency is information about the centre or middle part of a group of numbers.
So, suppose I have a particular group of numbers, and I want a single value that will represent my entire data. That single value is basically my measure of central tendency.
It is described in three ways: mean, median and mode.
We are going to learn about these three now.
Basically, we have studied a lot about them in the primary classes, but how are they used in data science, and how important are these three measures? We will see that today.
Let's come first to the mean.
The mean is basically the arithmetic mean, which is nothing but the ratio of the sum of all the observations in the data to the total number of observations.
Suppose I have data from a particular factory about the salaries received by its staff, and I have the data for eight staff members.
Suppose the first staff member gets a salary of 15,000 rupees, the second gets 18,000, the third gets 16,000, the fourth gets 14,000 and the fifth gets 15,000.
In the same way, the other three members also get a salary.
If I wish to calculate the mean, I sum up all the salaries and divide by 8.
Why 8? Because the total number of staff members that I have is eight. With this, my mean comes to 15.25 thousand.
What does that mean? It means that the average salary I have got is 15,250 rupees, which is basically representing the complete data set.
So, the mean is a number around which my entire data is spread out, and that's the reason why we call it a measure of central tendency.
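The calculation just described can be sketched in a few lines of Python. The lecture names only five of the eight salaries, so the last three values here are hypothetical, chosen only so that the result matches the stated mean of 15,250 rupees.

```python
# Salaries of eight staff members, in rupees.
# The first five values come from the lecture; the last three are
# hypothetical placeholders chosen so the mean matches 15,250.
salaries = [15000, 18000, 16000, 14000, 15000, 12000, 15000, 17000]

total = sum(salaries)        # sum of all observations
count = len(salaries)        # number of observations (8)
mean_salary = total / count  # arithmetic mean

print(mean_salary)  # 15250.0
```

With different values for the three unnamed salaries the result would of course differ; the point is only the sum-then-divide step.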
Now, we can have a population mean, and we can have a sample mean.
Take the population mean first. Suppose you have an entire data set that represents the population, with capital N elements.
So, your mean will be represented by mu (μ), which is nothing but the summation of all the data points divided by capital N.
In the same way, what is the sample mean? Suppose you have small n elements in a particular sample.
So, your x bar (x̄), which is the sample mean, will be the sum of all the elements divided by the total number of elements in the sample, which is small n.
Now, you will notice one thing in these formulas: between sample and population there is not much of a difference.
The only difference is the number of elements, which is small n in the case of the sample mean and capital N in the case of the population mean.
So, keep in mind that we generally use the sample mean, because we don't have population data available all the time.
So, the average in daily life is what we call a sample mean.
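Written out, the two formulas described in words above look like this. The symbol x_i for the i-th data point is an assumption for illustration; the lecture only names μ, x̄, N and n.

```latex
% Population mean: sum over all N elements of the population
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i

% Sample mean: sum over the n elements of the sample
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```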
Next, before going ahead, I would like to talk about distributions. You must have read a lot about distribution curves.
You must have read about the normal distribution, which is also called the Gaussian distribution.
We will see them in more detail in future chapters: how the curves are formed and what the different types of distributions are. But to understand mean, median and mode, it is very important for us to understand what exactly is meant by a symmetrical distribution and what an asymmetrical distribution is.
So, basically, a symmetrical distribution is one whose mean, median and mode all lie in the centre and are equal.
Around that centre, my data on the left and my data on the right are symmetrical.
But what can an asymmetrical curve be? It can be skewed either way: it can have positive skewness or negative skewness.
We will see this particular case going ahead in measures of shape, but for now it is important for us to understand that skewed, asymmetrical data means I don't have a similar shape to my left and to my right, and my data is more on one of the sides.
So, there is one very important thing that is used in statistics and in machine learning, and it has a lot of practical applications.
That is the outlier.
Now, what is an outlier? An outlier is basically an extreme or unusual value which, instead of representing our data set, is completely different from the rest. It can be much smaller or much larger.
Okay, take the old salary data set. Its actual mean was about 15.3K (15,250 rupees).
But suppose I add to that data set the salaries of two staff members who earn 90,000 and 95,000.
That means my mean salary for these 10 employees is now roughly 30,000 (about 30,700, to be precise) instead of about 15,000.
So, you can see that by just adding the salaries of two people, my mean roughly doubled to about 30,000.
This means that 90,000 and 95,000 were extreme values.
The main range of values in my distribution was between 12,000 and 18,000.
So, the values that I added, which were the extreme values, are what we call outliers.
Okay, detecting them in any distribution is very important.
Otherwise, the entire result of our analysis changes.
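The effect of the two extreme salaries on the mean can be checked with a small sketch. As an assumption, three of the eight base salaries are hypothetical placeholders (the lecture names only five), chosen so that the base mean is 15,250 rupees.

```python
# Base salaries in rupees: first five from the lecture, last three
# hypothetical, chosen so that the mean is 15,250.
base_salaries = [15000, 18000, 16000, 14000, 15000, 12000, 15000, 17000]
outliers = [90000, 95000]  # the two extreme salaries from the lecture


def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


mean_without = mean(base_salaries)          # 15250.0
mean_with = mean(base_salaries + outliers)  # 30700.0

print(mean_without, mean_with)
```

Two values out of ten are enough to roughly double the mean, even though eight of the ten salaries lie between 12,000 and 18,000.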
Now this means that, in this particular scenario, my mean is skewed.
Skewed means that, as we saw, our distribution became asymmetrical: because of two values, the data is no longer evenly distributed around the centre of the curve.
Because of these effects, we use another measure of central tendency, which is the median. Going ahead, we will see how the median largely nullifies the effect of outliers, whereas the mean is highly affected by outliers, the way we just saw.
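As a preview of that comparison, here is a minimal sketch using Python's standard `statistics` module. The three unnamed salaries are again hypothetical placeholders, so the exact numbers are illustrative only.

```python
import statistics

# First five salaries from the lecture; last three are hypothetical.
salaries = [15000, 18000, 16000, 14000, 15000, 12000, 15000, 17000]
with_outliers = salaries + [90000, 95000]

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outliers)

median_before = statistics.median(salaries)
median_after = statistics.median(with_outliers)

# The mean moves from 15,250 to 30,700, while the median only moves
# from 15,000 to 15,500.
print(mean_before, mean_after)
print(median_before, median_after)
```

The median is the middle value of the sorted data, so two extreme values at one end shift it by only one position.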
Now, we will see what the advantages of the mean are.
We just saw that we were using the mean on numerical data, correct? So, we can use the mean on both forms of numeric data: it can be continuous or it can be discrete.
What is the disadvantage? The disadvantage is that we cannot use it for categorical data.
Why? Because we cannot sum categories in any meaningful way.
Suppose I have two categories, women and men.
I will not be able to take an average of these two, because they are not in numerical form.
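A short sketch makes the point concrete. The list of categories is hypothetical; trying to sum it fails, while counting how often each category occurs works fine.

```python
from collections import Counter

# Hypothetical categorical data, for illustration only.
genders = ["women", "men", "women", "women", "men"]

try:
    # sum() starts from 0 and cannot add a string to a number.
    average = sum(genders) / len(genders)
except TypeError:
    average = None  # a mean is not defined for categories

# Counting occurrences is still meaningful for categorical data.
counts = Counter(genders)
print(average, counts)
```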
Also, the other major disadvantage that we saw was the effect of outliers, right? Because the mean considers every value in the distribution.
This means that if there are outliers in it, those will also be counted, and that is not good for our analysis.
The mean is also influenced by a skewed distribution, which simply means that if there are extreme values my data will be highly skewed, and the mean will not give a representative result.