In the modules so far, we have covered the ways in which we can clean our data using different data pre-processing methods.
That is, how we can remove outliers and how we can make our data cleaner.
Let's take an example.
Suppose I have a data set in which the variables have different units.
That is, there is one feature which is only in kilograms, and there is another feature which is only in grams.
There's another feature which is entirely in litres.
And in the same way, all the features are in different units.
In this particular case, if I asked you whether a kilogram is bigger or a gram is bigger, you could easily say that a kilogram is bigger.
But a machine learning algorithm has a problem here: suppose it is asked which value is greater, 100 grams or two kilograms; it will not see the units.
It will only see the numbers, so it will say that 100 grams is greater than two kilograms, or that 3000 metres is greater than five kilometres.
A machine learning algorithm only sees the numerical values and ranks them as smaller or larger, which is a problem when the features have different scales or units.
In that situation we use a feature scaling method.
This is an extensively used data pre-processing step for any data where the features are on different scales.
If you want to bring those features to the same order of magnitude, that is when you use feature scaling.
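To make the units problem concrete, here is a minimal sketch in Python. The two weights are the ones from the example above; the variable names are only for illustration.

```python
# The same kind of quantity recorded in two different units.
weight_in_grams = 100      # 100 g
weight_in_kilograms = 2    # 2 kg

# Comparing the bare numbers ignores the units, which is all an algorithm sees.
raw_comparison = weight_in_grams > weight_in_kilograms

# Converting to a common unit (grams) restores the real ordering.
GRAMS_PER_KILOGRAM = 1000
common_unit_comparison = weight_in_grams > weight_in_kilograms * GRAMS_PER_KILOGRAM

print(raw_comparison)          # True: 100 > 2 as plain numbers
print(common_unit_comparison)  # False: 100 g is less than 2000 g
```

Feature scaling does not convert units like this. It rescales each feature separately so that the raw magnitudes stop dominating the comparison.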
Now, you must be wondering why we have suddenly come to feature scaling.
We have just learned about the normal distribution.
There we learned about the Z score.
So, what is feature scaling, basically? It has two types, normalisation and standardisation, and standardisation uses the normal distribution's Z score formula.
Simply put, feature scaling is a technique applied to our independent variables, that is, the features that go into our machine learning algorithm and on which we base our analysis.
Those independent variables may be on different scales.
Feature scaling is used to bring those different scales to the same magnitude, so that their numerical values become comparable.
Its most important benefit shows up when all the values are in a similar range, say when all my features take values between 1 and 100.
Then the calculations in the algorithm become much easier; we use feature scaling in different machine learning algorithms such as linear regression or k-means clustering.
You can learn all these algorithms in the machine learning module.
So, feature scaling is basically done in two ways.
One is normalisation and the second is standardisation.
First of all, we will see what normalisation is and how it is done.
Normalisation is a scaling technique in which we bring all the values between zero and one. Let's say we have a data set of the heights of the children in a class.
The height can be any number.
It can be 164 centimetres; it can be 155 centimetres.
We bring all those values into the range of zero to one, and that is what we call normalisation.
We also call it min-max scaling.
Why? Because if you look at its formula, it is a simple one: x bar = (x - x minimum) / (x maximum - x minimum).
There is nothing to worry about here: x minimum and x maximum are simply the minimum and maximum values of the feature we are talking about.
Suppose we are looking at height: the height of the tallest child in the class goes into x max, the height of the shortest child goes into x min, and whichever value we are scaling becomes x.
So, it's very simple maths.
If x is the minimum, that is, I put x minimum in place of x, then my numerator is 0 and so x bar is 0; if x is x maximum, then my numerator equals my denominator and x bar is 1.
In this way we scale all the values between 0 and 1.
And if x lies between the minimum and the maximum, then x bar simply lies between zero and one.
Normalisation is a widely used method.
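The min-max formula can be sketched in a few lines of Python. This is only an illustration: the function name `min_max_normalise` and the list of heights are made up, and returning zeros when all values are identical is one possible convention, not something the formula itself defines.

```python
def min_max_normalise(values):
    """Rescale values to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min = min(values)
    x_max = max(values)
    value_range = x_max - x_min
    if value_range == 0:
        # All values identical: the formula would divide by zero.
        # Returning zeros is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(x - x_min) / value_range for x in values]


# Hypothetical heights in centimetres for a small class.
heights_cm = [150, 155, 164, 170]
scaled_heights = min_max_normalise(heights_cm)
print(scaled_heights)  # [0.0, 0.25, 0.7, 1.0]
```

The shortest child maps to 0, the tallest to 1, and everyone else lands in between.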
Let's see the second method, which we call standardisation.
Standardisation is a very familiar term for you.
Why? Because of the Z score formula that we saw.
What was the Z score? x minus mu upon sigma, that is, z = (x - mu) / sigma, where mu was the mean and sigma was the standard deviation.
So, if we take mu as the mean of the feature we are scaling and sigma as that feature's standard deviation, the rescaled values will have mean 0 and standard deviation 1.
That is called standardisation.
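Here is a minimal sketch of standardisation in Python using the standard library's `statistics` module. The function name `standardise` and the heights are made up for illustration, and using the population standard deviation (`pstdev`) is an assumption; the sample version (`stdev`) is an equally common choice and gives slightly different numbers.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale values with the Z score formula (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    if sigma == 0:
        # All values identical: avoid dividing by zero (convention for this sketch).
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical heights in centimetres for a small class.
heights_cm = [150, 155, 164, 170]
z_scores = standardise(heights_cm)

# After rescaling, the mean is (numerically) 0 and the standard deviation is 1.
print(z_scores)
print(mean(z_scores), pstdev(z_scores))
```

The scaled values can be negative or greater than one, because nothing ties them to a fixed range.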
In standardisation, there is no bounding range, unlike in normalisation, where we keep the values between 0 and 1.
Instead, we fix the mean and the standard deviation rather than the end points of the range.
So, what follows from this? Since there are no bounding values, outliers in your data distort standardisation less than they distort normalisation, although they still shift the mean and the standard deviation.
So, we get a simple rule of thumb.
If your data has outliers, standardisation is usually the safer way to scale it.
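To see how an outlier behaves under the two methods, here is a small sketch in Python. The values are made up, with 200 playing the role of the outlier. Keep in mind that both results are influenced by the outlier, because the minimum, maximum, mean and standard deviation all depend on it; the difference is that standardised scores are not forced into a fixed range.

```python
from statistics import mean, pstdev

# Hypothetical feature with one extreme value (200).
values = [10, 12, 14, 16, 18, 200]

# Min-max normalisation: (x - x_min) / (x_max - x_min)
x_min, x_max = min(values), max(values)
normalised = [(x - x_min) / (x_max - x_min) for x in values]

# Standardisation: (x - mu) / sigma, using the population standard deviation.
mu, sigma = mean(values), pstdev(values)
standardised = [(x - mu) / sigma for x in values]

# Normalisation: the outlier is exactly 1.0 and the other values sit close to 0.
print([round(v, 3) for v in normalised])
# Standardisation: no fixed bounds; the outlier gets a large positive score
# and the other values get negative scores.
print([round(v, 3) for v in standardised])
```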
Let's look at one very simple comparison. On the left I have unscaled data, drawn as box plots of four different features; that is roughly how my original data looks.
If I convert it into normalised or standardised data, the features become easy to compare, because we can easily tell from where to where the values lie.
In the normalised data, all the values lie between zero and one on the y axis.
In the standardised plot, each feature has a mean of zero and a standard deviation of one.
So, in this way we can use different feature scaling techniques to scale our data and get it ready for better analysis.
One very important question.
Is normalisation better, or standardisation? And what is the difference between the two? Let's go through it once.
So, what was normalisation? When we use the minimum and maximum values for scaling, it is normalisation.
When we use the mean and standard deviation, it is standardisation.
Normalisation is done when the features have different scales.
And we can do this scaling between zero and one, or between minus one and one.
Whereas in standardisation we make the mean zero and the standard deviation one.
Plus, there is no bounded range in it.
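Scaling into minus one to one uses the same min-max idea, just stretched to a different target range. Here is a minimal sketch in Python; the function name `scale_to_range` and its parameters are made up for illustration.

```python
def scale_to_range(values, new_min=-1.0, new_max=1.0):
    """Min-max scale values into the interval [new_min, new_max]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("all values are identical; the range is zero")
    return [
        new_min + (x - x_min) * (new_max - new_min) / (x_max - x_min)
        for x in values
    ]


# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170]
print(scale_to_range(heights_cm))            # [-1.0, 0.0, 1.0]
print(scale_to_range(heights_cm, 0.0, 1.0))  # [0.0, 0.5, 1.0]
```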
Normalisation gets affected by outliers.
Why? Because its range is fixed by the minimum and maximum values.
If an outlier becomes the minimum or the maximum, it takes one end of the range and the remaining values get squeezed together.
Standardisation is much less affected by outliers.
Normalisation is more useful when we do not know anything about the distribution, whereas standardisation is useful when the feature follows a normal distribution.
Simply, if you know that a feature follows a normal distribution, then we can convert it into the standard normal distribution, that is, to mean zero and standard deviation one.
Normalisation is also called scaling normalisation.
Why? Because we scale the values in it.
Standardisation is also called Z score normalisation.
In this chapter, we saw normalisation and standardisation, and how standardisation is one application of our Z score.
In the next unit we will see what the central limit theorem is.
If you have any comments or questions related to this course then you can click on the discussion button below this video and you can post them over there.
In this way, you can connect with other learners like you and you can discuss with them.