In this module, we will look at the methods we can use to treat outliers.
Outliers distort the mean and the standard deviation, so an analysis based on them can give incorrect results.
Many machine learning algorithms are strongly affected for the same reason.
That is why we need techniques for treating outliers.
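To make that point concrete, here is a minimal sketch using made-up numbers (they are not from the course data sets). It adds one extreme value to a small sample and compares the mean, standard deviation and median before and after.

```python
import numpy as np

# Made-up values for illustration only (not taken from the course data sets).
values = np.array([10, 12, 11, 13, 12])
with_outlier = np.append(values, 100)  # add one extreme value

# The mean and standard deviation jump, while the median does not move here.
print("mean:  ", values.mean(), "->", with_outlier.mean())
print("std:   ", values.std(), "->", with_outlier.std())
print("median:", np.median(values), "->", np.median(with_outlier))
```

In this small sample a single extreme value more than doubles the mean, which is the kind of distortion the treatment methods are meant to reduce.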
We will go through three methods for treating outliers in a Jupyter notebook, using Python libraries such as pandas, NumPy and matplotlib; these are third-party libraries that are installed on top of Python.
Jupyter Notebook is an interactive environment, which means a platform where we write and run the code.
We will see everything there, in the notebook.
Why are we doing this? So that we know how an outlier shows up in a box plot or in other plots, and how we can treat it.
So, let's see what the “Cost of living 2018.csv” file looks like.
This file has several columns: a city column with the names of the cities, a cost of living index column with the cost of living values, a rent index column, and so on.
These are the columns available in this data set.
We also use another data set, which is called “Titanic_train.csv”.
Let's see what its columns look like.
There is a passenger ID column, which identifies each passenger; Survived, which records whether the passenger survived; and Pclass, the passenger class, so a Pclass of 3 means the passenger travelled in third class.
There is also a column with the passengers' names.
Several other columns are available as well.
They include Sex, Age, Parch, Ticket and Fare.
Using these columns, we will see how box plot analysis is done.
So, in this module we will see three methods of treating outliers.
The first is deleting observations: if a data set contains outliers,
we delete them directly.
The second is transforming the values, so we will also learn about transformation methods.
The third is imputation, which means replacing a value with another value.
So how do we do this? Let's see.
First of all, we import warnings in Python,
so that any extra warnings raised by the modules we use
are suppressed.
Second, we import the Python libraries we need, starting with pandas.
Why? Because our data is in a CSV file, which looks like a spreadsheet, and we want to load it into a data frame.
You don't need to worry if you have not learned Python from scratch.
LearnVern already has free Python courses that start from the very basics and cover libraries such as pandas, NumPy, seaborn and matplotlib.
Here I have imported all four libraries.
This is very easy.
For example, I import pandas as pd and numpy as np; pd and np are the short forms that we use throughout our code.
Seaborn and matplotlib are the two libraries we use for visualisation: whenever we have to draw a graph or a curve, we use seaborn or matplotlib.
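The setup described here can be sketched roughly as follows. This is an illustration, not the exact notebook cell: the seaborn and matplotlib imports are shown only as comments so the sketch runs even where those libraries are not installed, and the small check at the end is my own addition.

```python
import warnings

import numpy as np   # numerical helpers such as np.percentile and np.log
import pandas as pd  # data frames and CSV loading

# seaborn and matplotlib are usually imported the same way, if installed:
#   import seaborn as sns
#   import matplotlib.pyplot as plt

# Suppress extra warnings so they do not clutter the notebook output.
warnings.filterwarnings("ignore")

# Added check (not part of the course notebook): a warning raised now is silenced.
with warnings.catch_warnings(record=True) as caught:
    warnings.warn("this message is ignored")
print(len(caught))
```

Suppressing every warning keeps the output tidy for a demo; in real work it can be safer to silence only specific warning categories.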
So, let's see how deleting observations removes or treats the outliers.
We have a data set, and that data set is a CSV file with different values in it.
It is the cost-of-living data for 2018.
We read that file with pd.read_csv and store it in a data frame,
which is called train.
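Loading the file can be sketched like this. Since the real CSV is not bundled with this transcript, the sketch reads a few made-up rows from an in-memory string; the exact column names are assumptions based on the columns described earlier.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows in CSV form.
# The column names are assumptions, not copied from the actual data set.
csv_text = """City,Cost of Living Index,Rent Index
City A,72.5,30.1
City B,85.0,45.3
City C,131.2,98.7
"""

# With the real file this would be: train = pd.read_csv("Cost of living 2018.csv")
train = pd.read_csv(io.StringIO(csv_text))

print(train.head())   # first rows of the data frame
print(train.dtypes)   # column types inferred by pandas
```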
After that, I used the seaborn library to draw a box plot of the data while the outliers were still in it.
Look at this box plot before removing the outliers.
In my data set you can see that between 120 and 140
there are several values that lie outside the whiskers of the box.
These are my outliers.
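A box plot like the one described can be sketched as below. Note the swap: the lecture uses seaborn, while this sketch uses matplotlib directly, and the values are made up for illustration. It also reads back the points that matplotlib draws beyond the whiskers, which are the outliers.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; remove this line inside a notebook
import matplotlib.pyplot as plt

# Made-up cost-of-living values for illustration only.
values = [60, 65, 70, 72, 75, 78, 80, 85, 90, 130, 140]

fig, ax = plt.subplots()
box = ax.boxplot(values)  # whiskers extend up to 1.5 * IQR by default
ax.set_title("Boxplot before removing outliers")

# The points drawn beyond the whiskers are the outliers ("fliers").
fliers = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(fliers)
```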
Now, how do we remove them? For that I have written a simple function, “drop outliers”, which uses the IQR, the interquartile range that we studied earlier.
We have used that method in the Python code to remove our outliers.
So, let's see what we have done.
To calculate the IQR, we need the 75th percentile, which is Q3, and the 25th percentile, which is Q1.
We can get both directly from NumPy with np.percentile.
The lower fence is Q1 minus 1.5 multiplied by the IQR, and the upper fence is Q3 plus 1.5 multiplied by the IQR.
We pass np.percentile the field we are interested in and the value 75 to get the 75th percentile.
In the same way, to get the 25th percentile, we pass 25 to np.percentile.
How do we drop the values? If a value lies below the lower fence or above the upper fence, we drop it.
We use pandas directly to drop the values that are above the upper fence.
The values that are below the lower fence
are dropped as well.
When I ran the code and drew the plot again, the box plot after removing outliers showed that all the outliers above 120 had been removed.
In the same way, we can use the box plot and the IQR in other scenarios to remove outliers.
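The steps just described can be sketched roughly as follows. This is not the notebook's exact function: the name drop_outliers, the column name and the sample values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def drop_outliers(df, column):
    """Return a copy of df without rows outside the 1.5 * IQR fences."""
    q1 = np.percentile(df[column], 25)   # 25th percentile (Q1)
    q3 = np.percentile(df[column], 75)   # 75th percentile (Q3)
    iqr = q3 - q1                        # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    is_outlier = (df[column] < lower_fence) | (df[column] > upper_fence)
    return df.drop(df[is_outlier].index)

# Made-up values standing in for the cost-of-living column.
train = pd.DataFrame(
    {"Cost of Living Index": [60, 65, 70, 72, 75, 78, 80, 85, 90, 130, 140]}
)
cleaned = drop_outliers(train, "Cost of Living Index")

print(len(train), "rows before,", len(cleaned), "rows after")
```

The function returns a new data frame instead of modifying train, so the original data stays available for comparing the two box plots.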
Let's see the second method, which is transforming values.
Transforming values means we pull in the extreme values using a transformation such as the log transformation or the Box-Cox transformation.
There is also the cube root transformation,
but we do not use it as often.
Why do we generally prefer this technique? Because no data is lost: every value in the data set we already have is kept.
We first look at the distribution; if I plot the distribution before the log transformation, I get this kind of curve.
This is not a normal distribution curve.
If you look to the right, between 120 and 140,
the curve has a long, flat tail, which means those are my outliers.
We can use the distribution curve in this way, as we covered in the last module on finding outliers: we looked at the distribution curve and found that outliers are present between 120 and 140.
Now, in this method we have used NumPy's log function to transform the values into their logarithms.
On the y axis of the curve before the log transformation, the scale ran from 0.00 to 0.025. After the log transformation the scale changed to 0 to 1.6, the curve still lies on the positive side, and if you look at the values on the right, the curve is no longer so flat.
In other words, the values that were at the extreme have been log-transformed.
The main advantage of this method is that there is no data loss.
Why? Because the outliers are converted to a different scale,
where they become smaller values.
We generally use this method when we have skewed data
and we want to bring it closer to a normal distribution.
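As a small illustration of the log transformation, here is a sketch on made-up, right-skewed values (not the course data). It applies np.log and compares the skewness before and after using the pandas skew method.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values for illustration only; all must be positive.
values = pd.Series([60, 65, 70, 72, 75, 78, 80, 85, 90, 130, 140], dtype=float)

logged = np.log(values)  # natural logarithm of every value

# No rows are lost, and the extreme values are pulled closer to the rest.
print("rows:", len(values), "->", len(logged))
print("skew:", values.skew(), "->", logged.skew())
```

The logarithm is only defined for positive numbers, so a column containing zeros or negative values would need a different approach.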
In the same way, we have the Box-Cox transformation; for it we use SciPy, which is another Python library.
First we look at the box plot without removing the outliers; it is drawn the same way as before.
Here we have taken a different variable, named “Rent index”, which is in our CSV file.
That variable has many outliers between 70 and 120.
So we work on that particular variable, the rent index.
When we apply SciPy's Box-Cox transformation to it, we get a box plot in which most of the outliers are gone.
Notice also that before the transformation the scale ran from 0 to 120.
After the Box-Cox transformation the scale was smaller: the values lay between 2 and 9, and of the many outliers in the earlier plot,
very few, or just one, were left.
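A minimal sketch of this step, assuming made-up rent index values, could look like the following. It uses scipy.stats.boxcox, which fits the transformation parameter (lambda) itself when none is given and returns it together with the transformed values.

```python
import numpy as np
from scipy import stats

# Made-up rent index values for illustration only; Box-Cox needs positive data.
rent = np.array([10, 12, 15, 18, 20, 22, 25, 30, 35, 70, 95, 120], dtype=float)

# With no lambda given, boxcox estimates it and returns it with the result.
transformed, fitted_lambda = stats.boxcox(rent)

print("fitted lambda:", fitted_lambda)
print("range before:", rent.min(), "to", rent.max())
print("range after: ", transformed.min(), "to", transformed.max())
```

As with the log transformation, every observation is kept and the order of the values is preserved; only the scale changes.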
So, we have now seen two methods.
One is deleting the outliers.
The second is using transformation methods to treat our outliers.
The third important method is imputation. What do we do in this method? If the data set has outliers, we replace those outlier values with a measure of central tendency: we can replace them with the mean or with the median, or, if we want, we can simply put zero in their place.
We use this method so that we do not suffer data loss.
Why? Because if you remove the outliers, you lose those rows of data.
If we instead put the mean or the median in place of the extreme values, we do not lose any rows.
So, let's see how we use it.
We have taken a different data set for this example, which is the Titanic CSV file.
It has a variable named Age, and I drew a box plot of it before imputation.
Here you can see a few outliers on the right side.
See how easy it has become.
You have seen it yourself.
Using the box plot, we can tell directly where the outliers lie and how many there are.
This is why the box plot is used so much in statistics.
Now, suppose I want to replace the outlier values with the mean.
We calculate Q3 and Q1 the same way as we did for the IQR.
I simply take those values
and compute the interquartile range from them.
Then I calculate a lower fence and an upper fence, which are Q1 minus 1.5 multiplied by the IQR and Q3 plus 1.5 multiplied by the IQR.
So in this method we use the IQR simply to detect where the outliers lie.
As soon as I know which values are outliers, I replace them with the mean value.
In this particular line, the variable i holds each value that has been identified as an outlier.
I replace it with np.mean, which is NumPy's function for calculating the mean.
Using a pandas function, we replace the outlier with the mean.
When I drew the plot again, it looked something like this: most of the extreme values that were on the right have been removed.
That is mean imputation.
The box plot before median imputation looks the same as before; if I replace the outliers with the median instead, you will see that it is a better method,
because with the median I have no outliers left on the right.
With the mean, using the same method,
I was still left with two outliers.
In the same way, what does zero imputation mean? I find the outliers, hold each one in i, and replace it with zero.
When I replaced them with zero, the plot I got
was again a cleaner one, in which my outliers had been removed.
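The three kinds of imputation can be sketched with one helper. This is a vectorised sketch, not the loop over i described in the lecture; the function name impute_outliers and the age values are assumptions, and the sample has no missing values (np.percentile returns NaN when the data contains NaN, so missing ages would need handling first).

```python
import numpy as np
import pandas as pd

def impute_outliers(series, fill_value):
    """Replace values outside the 1.5 * IQR fences with fill_value."""
    q1 = np.percentile(series, 25)
    q3 = np.percentile(series, 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    is_outlier = (series < lower_fence) | (series > upper_fence)
    # Keep the values that are not outliers; put fill_value everywhere else.
    return series.where(~is_outlier, fill_value), is_outlier

# Made-up ages for illustration only (not the Titanic data).
age = pd.Series([4, 16, 19, 22, 25, 28, 30, 33, 36, 40, 45, 50, 78, 80], dtype=float)

mean_imputed, is_outlier = impute_outliers(age, np.mean(age))
median_imputed, _ = impute_outliers(age, np.median(age))
zero_imputed, _ = impute_outliers(age, 0)

print("outliers found:", int(is_outlier.sum()))
print("mean-imputed max:  ", mean_imputed.max())
print("median-imputed max:", median_imputed.max())
```

Because the replacement changes the quartiles, a box plot drawn after imputation can flag new points as outliers, which may be why some outliers were still visible after mean imputation. Whether zero is a sensible replacement depends on the variable.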
In this way, we can use different methods to remove or treat outliers.
If you have any comments or questions related to this course, you can click on the discussion button below this video and post them there.
That way, you can connect with other learners like you and discuss with them.