In the chapters so far, we have seen what outliers are and how they affect our distribution.
So, in this chapter we will look at the different methods we can use to detect outliers.
Suppose we have a small data set, one small enough that we can count its values.
Here is one sample with values such as 15, 101, 18, 7 and so on.
Just by looking at this distribution, we can find the outlier, that is, the extreme value which is very different from the rest of the data set.
In this data set, 101 is the outlier.
Now suppose we describe this data, which means we calculate its mean, median, mode, variance and standard deviation.
We find all these parameters twice: once with outliers and once without outliers.
If there is an outlier in the sample, the mean is highly affected: its value increases from 12.7 to 20.08.
The median, on the other hand, is not affected very much: without outliers it is 13, and with outliers it becomes 14.
The mode stays the same.
Variance and standard deviation are also highly affected if there are outliers.
So this tells us that in a small data set, the mean is extremely sensitive to outliers.
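To make this comparison concrete, here is a minimal Python sketch using only the standard `statistics` module. The sample is hypothetical: only 15, 101, 18 and 7 come from the example above, and the remaining values are made up for illustration, so the numbers it prints will not match 12.7 and 20.08 exactly.

```python
import statistics

def describe(values):
    """Return the summary statistics discussed above for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }

# Hypothetical sample: 15, 101, 18, 7 are from the lecture, the rest are assumed.
sample_with_outlier = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 14]
sample_without_outlier = [v for v in sample_with_outlier if v != 101]

with_outlier = describe(sample_with_outlier)
without_outlier = describe(sample_without_outlier)

# Print both versions side by side for each statistic.
for name in with_outlier:
    print(f"{name:>8}: without={without_outlier[name]:.2f}  with={with_outlier[name]:.2f}")
```

In this made-up sample, the mean, variance and standard deviation jump once 101 is included, while the median moves only slightly and the mode does not change, which is the same pattern described above.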
In a small data set, we can detect the outlier just by looking at it, but when we work on data science projects or use a machine learning algorithm, the data set is huge.
It has a huge number of rows and columns.
So, in that situation, we have to use different techniques to detect the outliers.
There are three important methods for detecting outliers in a big data set.
Let's look at them: the first technique is visualising the data.
What does visualising mean? Whenever we want to explore a data set, clean it, or detect outliers in it, we visualise that numeric data.
That means we look at the data in a few graphs or plots, which makes it easy to understand where the outliers lie, or whether there are any at all.
There are different plots through which we can detect outliers: first, the box and whisker plot, which we learned as the boxplot; second, the histogram; third, the distribution plot; and fourth, the scatter plot.
So, let's see how we can detect outliers on the basis of each of these plots.
What is a boxplot? It is a graphical display of the distribution of the data.
For example, if we draw our data on a box plot, we can easily tell which values are extreme and lie outside the whiskers of the box plot.
We can implement box plots very easily in Python by using libraries like matplotlib and seaborn.
If you want to study Python, you can refer to the Python modules available on our platform which are free of cost.
So, if I write Python code that implements the boxplot, my box plot will look something like this.
If you look beyond the extreme ends of the box plot, you can see some values lying outside them.
Those are the outlier values.
In this way, we can easily visualise the data and tell where the outliers lie.
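As a rough sketch of what such code could look like, the example below computes the values a box plot would draw beyond its whiskers and, optionally, draws the plot with matplotlib. The `ages` list and the function names are made up for illustration. The cut-off of 1.5 times the interquartile range is matplotlib's default whisker rule, not something stated in this lecture.

```python
import statistics

def boxplot_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which a box plot marks outliers.

    k=1.5 is matplotlib's default whisker length, measured in units of the IQR.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def boxplot_outliers(values, k=1.5):
    """Return the values that fall outside the box plot fences."""
    low, high = boxplot_fences(values, k)
    return [v for v in values if v < low or v > high]

def draw_boxplot(values):
    """Draw the box plot; matplotlib is imported here so the rest runs without it."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.boxplot(values)  # points beyond the whiskers are drawn as separate markers
    ax.set_ylabel("age")
    plt.show()

# Hypothetical age values with one extreme value.
ages = [22, 25, 27, 29, 30, 31, 33, 35, 38, 40, 80]
print("fences:", boxplot_fences(ages))
print("outliers:", boxplot_outliers(ages))
```

Calling `draw_boxplot(ages)` should show the value 80 as a separate point above the upper whisker, which is the kind of point we read off the plot as an outlier.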
We mainly use boxplots, histograms and distribution plots when we want to visualise one variable.
Suppose I have an age variable.
In one data set, I have different age values.
If I want to visualise that variable, I can plot a histogram of it, draw a distribution plot, or draw a box plot.
Take the histogram plot, for example.
What are we seeing in this graph? On the x axis there are different ranges, from 0 to 80.
For each range, one bin is made.
Each bin shows how many values lie between 0 and 10, how many lie between 10 and 20, and so on.
If you look at this curve, it is a skewed histogram: the tail extends further towards the right.
So, this easily tells us that in this distribution, the outliers lie in the tail.
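The binning idea can be sketched in plain Python as follows. This is only an illustration: the `ages` values and the helper name `bin_counts` are hypothetical, and a text histogram stands in for the plotted one from the lecture.

```python
from collections import Counter

def bin_counts(values, width=10):
    """Count how many values fall in each range [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages: most lie between 20 and 40, with one value far to the right.
ages = [4, 9, 18, 21, 22, 24, 25, 27, 28, 29,
        30, 31, 33, 35, 36, 38, 41, 45, 52, 78]

# Print a simple text histogram, one row per non-empty bin.
for start, count in bin_counts(ages).items():
    print(f"{start:>3}-{start + 10:<3} {'#' * count}")
```

The rows for the higher ranges hold very few values, and there is an empty gap before the last one. A long, thin tail like that is the visual hint that extreme values may be present.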
Third is the distribution plot.
A distribution plot is nothing but a histogram with a smooth curve drawn along its outline, like the curves we used to draw earlier.
When we draw that curve, we can easily see that the values from 100 to 300 are very few.
They are few because they could be the extreme values.
And these can affect my data set, which means they can affect my analysis.
These were the three plots we use when we want to visualise one variable.
Now suppose I have two variables: one is the age variable, and against it I have taken a fare variable on the y axis.
Between these two variables we can draw a scatter plot of their relationship, and when we plot it you'll notice that around 500 there are a few values which are completely different from the rest of the distribution.
So, we can look at the graph and tell where our outliers lie.
The next method that we use to detect outliers is the Z score method.
We have learned the Z score already. If I subtract the mean from a data point and divide the result by the standard deviation, I get a Z value, that is, z = (x - mean) / standard deviation. We have seen this on the normal distribution curve as the first, second and third standard deviations.
So, how can we use the Z score here? If I have a data set, we calculate the Z score of every data point in it.
Then we define a threshold value, which we generally set at the third standard deviation.
This means that if the Z score of any value is above +3 or below -3, that value is called an outlier.
So, simply, we calculate the Z score of all the data points.
We have defined our threshold value as three, so if any Z score is more than +3 or less than -3, we call that value an outlier.
This is my normal distribution curve, where the mean lies at Z=0 and on either side of it I have the values Z=1, 2 and 3, plus and minus.
Normally, if the values lie between Z=2 and 3, we still consider them a part of the distribution.
But as soon as a value falls to the right of Z=3 or to the left of Z=-3, we consider it an outlier.
In this way, we can detect the outliers.
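The steps just described can be sketched in Python like this. Treat it as one possible implementation under stated assumptions: the data is made up, the function names are hypothetical, and the population standard deviation (`statistics.pstdev`) is used because the lecture does not say which form of standard deviation to take.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / standard deviation for every data point."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]

def z_score_outliers(values, threshold=3.0):
    """Return the values whose Z score is above +threshold or below -threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical data: nineteen values between 11 and 17, plus one extreme value.
data = [12, 14, 15, 13, 16, 11, 17, 14, 15, 13,
        12, 16, 14, 15, 13, 14, 16, 12, 15, 101]

print("outliers:", z_score_outliers(data))
```

One thing to keep in mind with this sketch: the outlier itself inflates the mean and the standard deviation, so in a very small sample no value can reach a Z score of 3. That is why the example uses twenty points rather than a handful.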
If you have any comments or questions related to this course then you can click on the discussion button below this video and you can post them over there.
In this way, you can connect with other learners like you and you can discuss with them.