Hello,
I am (name) from LearnVern. (7 seconds pause, music)
We will continue from our previous session of Machine Learning,
So, come, let us look today at the DBSCAN algorithm,
DBSCAN, meaning Density-Based Spatial Clustering of Applications with Noise.
So, this algorithm works on the basis of distance, though as you know, most clustering algorithms work on the basis of distance.
So, we will see its parameters,
So, this algorithm over here will find core samples in areas of high density and expand the clusters from them.
How is it going to do that?
Here, we have a parameter called eps: this is the maximum distance between two samples for one to be considered as in the neighbourhood of the other.
So, if I have 3 data points, then I will take one data point as the reference and find its distance to the other two samples.
Then we check each of those distances against this maximum distance, eps. Right?
So, a point whose distance is more than eps will not be a neighbour,
and a point whose distance is within eps will be a neighbour. Right?
So, here we can modify this parameter by ourselves, so this is the flexibility that we have,
So, now we have 0.5 as the default value of this parameter,
So, we will see this later at the time of implementation; when we perform hyperparameter tuning, we will change this value if needed.
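To make this concrete, here is a minimal sketch of the neighbourhood idea; the three point coordinates below are made up purely for illustration:

    import numpy as np
    from sklearn.metrics import pairwise_distances

    # three made-up 2-D points, purely to illustrate the eps idea
    points = np.array([[0.0, 0.0], [0.3, 0.4], [2.0, 2.0]])
    dist = pairwise_distances(points)  # Euclidean distances by default
    print(dist <= 0.5)                 # True where one point lies within eps of the other

Here the first two points lie within eps of each other, so they count as neighbours, while the third point is too far from both.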
So, the next parameter that we can change is min_samples, that is, the minimum number of samples we have to take in a neighbourhood for a point to be considered as a core point.
Meaning, for instance, if I have a total of 5 points, how many of them should I consider? For example, if I take 3 points, then these points will find the distance among themselves, and after finding the distances, the core point is identified.
So, here this is also changeable, and by default its given value here is 5, which we can increase or decrease as per our need.
So, with the help of these two parameters we will work with the distances, and tweak the clustering.
Here, to calculate the distance we also have a metric parameter, with which the pairwise distances between points are calculated; by default this is the Euclidean distance.
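Putting these three parameters together, a minimal sketch of the estimator, shown here with scikit-learn's default values, would be:

    from sklearn.cluster import DBSCAN

    # eps, min_samples and metric written out explicitly at their defaults
    db = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')

We will tweak eps and min_samples later, at the time of hyperparameter tuning.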
So, let's implement this and see how this DBSCAN algorithm works.
So, first we will import all the libraries that are needed, such as:
import numpy as np,
import pandas as pd.
So, numpy, pandas, matplotlib; then from sklearn dot cluster we have taken DBSCAN. Then we will have to perform preprocessing, which is why we have imported StandardScaler, and we have also taken normalize; both of these, StandardScaler and normalize, basically work to limit the range of values that have large differences between them.
Along with that, from sklearn dot decomposition we have also imported PCA, principal component analysis.
So, these are things that you will learn ahead in the course; for now, you just have to know that StandardScaler and normalize basically help in reducing the large differences between values, and PCA holds the capability of reducing the number of features, say if we have 30 features, it has the capacity to reduce them to just 2 or 3, so this is also a preprocessing technique in Machine Learning.
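So, the import cell, as just described, would look something like this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler, normalize
    from sklearn.decomposition import PCA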
Now, we have the data, which is credit card related data, and I will try to upload it; so this is my data and I will upload this.
So, till then, I will execute step one.
After that, X is equal to pd dot read_csv; so here I am reading the file, and I will also display it, so that we can see the data.
So, this way I displayed the data also.
So, you can see this is the data that we have.
Customer ID, Balance, ok!
So, here I will put data dot info directly, so that we can see.
So, here you can see.
So, here I had written info without the parentheses, not as a method call, so I will write it in method form.
So, we have Customer ID, Balance, Balance Frequency, Purchases, and like this we have 17 features in total.
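For reference, the loading step would be along these lines; the file name here is only a placeholder, since it depends on the file you upload:

    X = pd.read_csv('credit_card_data.csv')  # placeholder file name
    X.head()   # display the first few rows
    X.info()   # column names, non-null counts and data types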
So, as this is a clustering algorithm, it will work on the basis of these 17 features; now 17 is a lot of features, so moving ahead we can reduce them to maybe 2, 3 or 4 features with the help of PCA, that is Principal Component Analysis.
So, with the help of this we will work on this.
So, here we have already performed two or three things.
So, if you have already seen the EDA part, you must have seen that here we have dropped Customer ID, and with ffill, that is forward fill, I have also filled in the missing values, so that there are no errors or issues in our later processes.
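In code, those two EDA steps would look roughly like this; the column name CUST_ID is an assumption based on the customer ID column shown above:

    X = X.drop('CUST_ID', axis=1)  # assumed name of the customer ID column
    X = X.ffill()                  # forward fill the missing values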
So, let's do preprocessing in step 3.
So, here we will use scaler is equal to StandardScaler,
in X_scaled I have put scaler dot fit_transform and passed in all the data,
then I have normalized the scaled data that we get,
and finally I have converted this into a DataFrame.
So, in this way we have scaled our data, then normalized it, and again converted it into a pandas DataFrame.
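A minimal sketch of this step 3, following exactly the sequence just described:

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)         # scale the data
    X_normalized = normalize(X_scaled)         # then normalize it
    X_normalized = pd.DataFrame(X_normalized)  # back into a pandas DataFrame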
Now, moving ahead, I want to reduce the dimensions; we had 17 features, from which we removed one, so out of the remaining 16, only 2 components or features should remain: pca is equal to PCA with n_components equal to 2, because we have to bring it down to only 2 dimensions.
So, next, with the help of pca dot fit_transform, we passed in the normalized X.
Then again, we converted the result, X_principal, into a DataFrame.
Then, here you can see now we just have 2 components, P1 and P2.
You can see from here.
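The PCA step would be on these lines; the column names P1 and P2 match what is shown on screen:

    pca = PCA(n_components=2)  # reduce to 2 components
    X_principal = pca.fit_transform(X_normalized)
    X_principal = pd.DataFrame(X_principal, columns=['P1', 'P2'])
    X_principal.head()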
Now, if you look at these values just like this, you will not understand anything, but this is how the algorithm works, which you will also learn: it basically took all the relevant, important features, computed them using a certain formula, and brought something like this in front of us.
And these are the most relevant values.
Now, we will start building the model,
So, to create a model, db_default is equal to DBSCAN, and as I had told you, we can tweak eps and min_samples.
Here, you can see, I have tweaked them.
So, after tweaking, our model is now ready, and it has been trained as well, because I have run the fit method right here.
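As a sketch, the model cell would be along these lines; the exact values used on screen are not fully stated, so the eps and min_samples below are illustrative, with eps taken from the 0.03 mentioned during tuning later:

    db_default = DBSCAN(eps=0.03, min_samples=3).fit(X_principal)  # illustrative values
    labels = db_default.labels_  # one cluster label per point; -1 means noise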
Now, we will visualise and see how clusters zero, one and two, along with the noise points, each form a separate group.
So, let us see the visualisation; here you can see this is our entire data, and on it we have label 0, label 1, label 2 and label -1, where -1 is the label DBSCAN gives to noise points.
So, in this way this Clustering is done, that we can see here.
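A minimal sketch of such a plot, colouring each point by its cluster label:

    plt.scatter(X_principal['P1'], X_principal['P2'], c=labels, s=10)
    plt.xlabel('P1')
    plt.ylabel('P2')
    plt.title('DBSCAN clusters (label -1 is noise)')
    plt.show()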
Now, basically if we want to tweak something, make some changes or edit something in the clustering, we can do that. How can we do that?
By performing, as I told you earlier, hyperparameter tuning; so here you can see we have eps, and here I have increased the samples, so we increased min_samples to 50; and we can increase eps also, for instance from 0.03 to 0.04, so I have increased this as well.
So, I have changed both of these.
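With the tuned values just mentioned, the cell would become roughly:

    db_tuned = DBSCAN(eps=0.04, min_samples=50).fit(X_principal)  # eps raised from 0.03, min_samples to 50
    labels_tuned = db_tuned.labels_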
Now, we will execute this and again visualise it,
So, we will see that there will be some changes visible over here.
In both the clusters, the one above and the one below, you will be able to see some changes; you need to look carefully.
So, in this way you can tweak.
So, in this way, wherever in your clustering you get a better, cleaner result, you can stop the tuning there; this is how hyperparameter tuning is done.
Now, we just saw how through DBSCAN we can implement spatial clustering.
So, friends, we will stop this session here, and its further parts we will cover in the next session.
Thank you very much.
If you have any questions or comments related to this course,
then you can click on the discussion button below this video and post them there.
So, in this way you can discuss this course with many other learners like you.
Share a personalized message with your friends.