Hello,
I am (name) from LearnVern.
In our previous session of Machine Learning, we saw Evaluating Regression Models' Performance,
in which we saw the different evaluation metrics that are available, using which we can evaluate the performance, or the accuracy, of the algorithm that we are using.
Now, we are going ahead to see our next topic that is,
Clustering
So, we are going to first start with a primary understanding of clustering, which is distance-based learning, also known as distance-based metrics.
Because the basic meaning of clustering is grouping.
For example, suppose you see a lot of objects lying around, but you don't know their names.
Then your first instinct will be to start identifying them by their shape and pattern.
As in, what is their colour, what is their shape, what is their size?
So with that you will start identifying them, and as you proceed, at some point you will start finding some things similar to each other, while others you will find different.
So, because of the similarity and difference that you find, your brain will automatically start grouping the similar things into one cluster, while it groups the other things separately.
So this is known as distance based learning.
Now you might be wondering, where is the distance in this?
It is there: the similarity and the difference that you found is itself known as distance, because if you compute this mathematically, you will get a distance from it. So let's understand this.
So clustering is an unsupervised approach, because we have the input, that is X1, X2, X3, which means we have the features, but there are no labels in it.
That is what you are seeing in the second point: when we have to work upon an unlabeled data set, we have to identify the patterns, and that is the work which our algorithm performs.
The definition of the clustering is "a way of grouping the data points into different clusters, consisting of similar data points".
This definition is just precisely stating the explanation that I gave you.
Which means that if there are different types of data points, then it needs to group them based on similarity: club the similar data points together, and club those that are not similar separately into other groups.
If you look at the types of clustering, then we have connectivity-based clustering, which we also call hierarchical clustering.
The second type is centroid-based clustering, which we perform through the partitioning method.
Then we also have distribution-based clustering.
Next is density-based clustering, which is identified with model-based methods.
Then we also have fuzzy clustering and constraint-based clustering, which we call supervised clustering.
So we have multiple types of clustering.
So, to achieve this, that is, to create these clusters, let's now see the primary distance metrics that we need.
The Euclidean metric is the first metric; it finds the shortest distance between two points.
Assuming you create such a graph, how will you find the shortest distance between A and B in this particular graph?
By drawing a straight line between them, you get the shortest distance.
So this is known as the Euclidean distance.
And for whatever points you have, if you want to get the distance, then we can use this formula: the square root of the summation, from i equal to 1 to n, of (qi minus pi) squared.
That means you will find the difference at each and every point, then square it, sum these squares up, and finally take the square root of the value.
So in this way, the Euclidean distance basically finds the shortest distance.
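The formula above can be sketched in a few lines of Python. This is just an illustration of the calculation, not the course's own code, and the two example points are hypothetical values chosen for demonstration:

```python
import math

# Euclidean distance: take the difference at each coordinate,
# square it, sum the squares, then take the square root.
def euclidean_distance(p, q):
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical example points, forming a 3-4-5 right triangle.
a = (1, 2)
b = (4, 6)
print(euclidean_distance(a, b))  # 5.0
```

Because the differences here are 3 and 4, the shortest (straight-line) distance comes out to 5.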
Now, next is Manhattan distance; which is basically "the sum of absolute difference between the points across all dimensions".
For instance, suppose there are two points; between these points some negative values may come up, so we will not keep the negative sign but only take the value itself and leave the sign aside.
That is the "sum of absolute differences between the points across all dimensions".
So here, P1 minus Q1, P2 minus Q2: we take out the differences by subtracting, then take their absolute values, and then take their sum.
If you see it in a generalised way, it is the summation of the absolute value of pi minus qi, from i equal to 1 to n.
So in this way we take out the Manhattan distance.
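As a small sketch of the same steps in Python (again with hypothetical example points, not from the lesson itself):

```python
# Manhattan distance: sum of absolute differences across all dimensions;
# abs() drops any negative sign, as described above.
def manhattan_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical example points.
a = (1, 2)
b = (4, 6)
print(manhattan_distance(a, b))  # |1-4| + |2-6| = 3 + 4 = 7
```

Notice that for the same two points the Manhattan distance (7) is larger than the Euclidean one (5), since it moves along the axes instead of the straight line.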
Now the third is Minkowski distance, which is a generalised form of the Euclidean distance and the Manhattan distance.
So if you see the formula, it is very similar:
the absolute value of xi minus yi, raised to p, and outside it is raised to 1 by p. This means you basically need to mention the order in which you want to find the Minkowski distance: if you want to find it with order 3, then you put raised to 3 inside, and instead of raised to 1 by p you put raised to 1 by 3.
So you only have to decide which order you want to find it in.
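A quick sketch of this generalisation in Python; the example points are hypothetical, and the point is to see that order 1 gives back the Manhattan distance while order 2 gives back the Euclidean distance:

```python
# Minkowski distance: (sum of |xi - yi|^p) ** (1/p),
# where "order" is the p you choose.
def minkowski_distance(x, y, order):
    return sum(abs(xi - yi) ** order for xi, yi in zip(x, y)) ** (1 / order)

# Hypothetical example points.
a = (1, 2)
b = (4, 6)
print(minkowski_distance(a, b, 1))  # order 1 -> Manhattan distance: 7.0
print(minkowski_distance(a, b, 2))  # order 2 -> Euclidean distance: 5.0
```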
So, these were the distance metrics, and here I have mentioned the algorithms that are inspired by them:
K-means clustering,
Hierarchical clustering,
and apart from these there are others as well.
So, these algorithms basically do the grouping and clustering.
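To see how distance drives the grouping, here is a minimal sketch of the core assignment step: each point joins the group of its nearest centre, measured with Euclidean distance. The centres and points are hypothetical values for illustration; this is only the assignment idea, not the full K-means algorithm, which we will see in the next session:

```python
import math

# Assign a point to the index of its nearest centre (Euclidean distance).
def nearest_centre(point, centres):
    return min(range(len(centres)),
               key=lambda i: math.dist(point, centres[i]))

# Two hypothetical group centres and some hypothetical points.
centres = [(0.0, 0.0), (10.0, 10.0)]
points = [(1, 1), (9, 9), (0, 2), (11, 8)]

groups = [nearest_centre(p, centres) for p in points]
print(groups)  # [0, 1, 0, 1] -> points near (0,0) form one group, the rest the other
```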
Now, the next topic that we will see is K-means clustering, and we will see it both conceptually and practically.
Till then, remain motivated and keep learning.
Thank you.
If you have any queries or comments, click the discussion button below the video and post them there. That way, you will be able to connect with fellow learners and discuss the course, and our team will also try to solve your query.