Hello,
I am (name) from LearnVern.
In the previous Machine Learning topic, we saw how to normalise data.
Today we will learn how to find datasets, because without data we can never run a machine learning algorithm.
So, let's proceed and see how we can find datasets or collect them.
The first method is conducting a survey.
Create a questionnaire for whatever domain you are working in and distribute it as a survey; the participants fill in the details, and their responses become your data.
The next method is to use websites that host datasets; there are a lot of them, and we will go through them one by one in our lectures.
Third option could be government datasets. The government also freely provides us with a lot of data.
And the last option is real-time data: if you are working on a project that has sensors, logs, or software, you can get real-time data from them.
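To make the real-time option a little more concrete, here is a minimal sketch of turning raw log lines into records you could later feed to an algorithm. The log format (timestamp, sensor id, reading) and the sample lines are made up for illustration; a real sensor or application log will look different, so treat the pattern as an assumption.

```python
import re
from datetime import datetime

# Hypothetical log format, assumed only for this sketch:
#   "<ISO timestamp> <sensor id> <numeric reading>"
LOG_PATTERN = re.compile(r"^(\S+)\s+(\S+)\s+(-?\d+(?:\.\d+)?)$")


def parse_log_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, sensor_id, reading = match.groups()
    return {
        "timestamp": datetime.fromisoformat(timestamp),
        "sensor_id": sensor_id,
        "reading": float(reading),
    }


# Made-up sample lines standing in for a real sensor log.
sample_lines = [
    "2024-01-01T10:00:00 temp-1 21.5",
    "2024-01-01T10:00:05 temp-1 21.7",
    "not a valid line",
]

# Keep only the lines that matched the assumed format.
records = [r for r in map(parse_log_line, sample_lines) if r is not None]
print(len(records), "records parsed")
```

Lines that do not match are skipped rather than raising an error, which is usually what you want when a log mixes data with other messages.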
So, let us see them one by one.
For instance, there are Kaggle, government portals, Amazon, Google Dataset Search, and the UCI Machine Learning Repository.
We will explore each of them one by one.
So, first I will take you through kaggle.com. It is “redirecting us to a youtube website”, so let us wait for a minute.
And here the website is open.
Here, we have different sections like Competitions, Datasets, Code, Discussions, and many other options as well.
We have to go to Datasets; from here you can also add your own new datasets, as many people contribute theirs.
Here, you can see different datasets, like trending datasets, popular datasets, datasets on movies and TV shows, datasets on clothing and accessories, and economics.
You can download whichever dataset you want by clicking on it and then choosing the download option.
When you click on the download button, it will ask you to sign in; after signing in, the dataset will be downloaded.
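Once the download finishes, the next step is usually to look inside the file. Here is a minimal sketch, assuming the download arrives as a zip archive containing one or more CSV files. The archive is built in memory with placeholder content so the example runs on its own; the file names `movies.csv` and `README.txt` are hypothetical.

```python
import io
import zipfile


def list_csv_members(zip_source):
    """Return the names of the .csv files inside a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return [n for n in archive.namelist() if n.lower().endswith(".csv")]


def read_member_text(zip_source, member_name):
    """Read one file from the archive and decode it as UTF-8 text."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name).decode("utf-8")


# Build a tiny archive in memory so the sketch is self-contained.
# The contents are placeholders, not real dataset files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("movies.csv", "title,year\nExample Movie,2020\n")
    archive.writestr("README.txt", "placeholder")

csv_names = list_csv_members(buffer)
first_csv_text = read_member_text(buffer, csv_names[0])
print(csv_names)
```

With a real download, you would pass the path of the downloaded file to `list_csv_members` instead of the in-memory buffer.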
Now, moving on to the second website, that is, government datasets.
The government also provides us with many datasets that are freely available.
Here, you can see OGD, Open Government Data.
When you scroll down, you can see some policies and documents that explain how we can use these datasets.
Then you have visualisations; here the visualisation has already been done, and you can download the data in CSV format.
Thereafter you can apply machine learning, data analytics, and the EDA that we are going to learn ahead, using your own tools.
So, let's click on one of them.
For instance, here we have the GDP growth rate of India at constant prices from 2001-2002 to 2013-2014.
Now, you can see it has loaded the visualisation. You can view it in the visualisation tool, which opens the visualisation engine, and from there you can download it in different formats: png and jpeg, which are image formats, and pdf.
But I don't want these because I want the datasets only.
So, I will scroll up. Here I have different tabs, and I will go to the first tab, that is, OGD instance data.
Here the table shows 10 entries; change that setting to see 100 entries, and then download it in CSV format.
Now the data has been downloaded.
So, this is the second method: you can use a government website to download data that is publicly available, and then start working on it.
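To "start working on it", the first thing is simply to read the CSV file. Here is a minimal sketch using Python's built-in csv module. The column names `year` and `growth_rate` and all the values are placeholders invented for this example; they are not taken from the file downloaded in the video.

```python
import csv
import io


def load_rows(csv_file):
    """Read an open CSV file into a list of dicts keyed by the header row."""
    return list(csv.DictReader(csv_file))


# Placeholder content: column names and values are invented for this sketch.
# With a real download you would use:
#     with open("your_file.csv", newline="") as f:
#         rows = load_rows(f)
sample_csv = io.StringIO(
    "year,growth_rate\n"
    "Y1,1.0\n"
    "Y2,2.0\n"
    "Y3,3.0\n"
)

rows = load_rows(sample_csv)

# CSV values arrive as strings, so convert before doing arithmetic.
rates = [float(row["growth_rate"]) for row in rows]
mean_rate = sum(rates) / len(rates)
print(len(rows), "rows loaded")
```

Every value read from a CSV file is a string at first, which is why the conversion to float is done explicitly.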
Now, let's see our next option, which is Amazon datasets.
Amazon has also made some data publicly available; you will get all the links in the presentation.
You can see that different datasets are available here.
If you want to search it then you can do it from here.
For instance, if I want some data related to health, I will type the keyword health over here and press enter, expecting it to give me data related to health. It does not, so this search may not be working.
So we will use this dataset over here, Machine Learning Pan Cancer. I will click on it, and it is redirecting us to another website.
We will wait for a while; until then, we will go to the previous website again, and here you can see the Registry of Open Data on AWS GitHub repository.
From here we have directly reached the AWS Labs page, where you can click on these datasets and download them directly.
Here you have different datasets in different formats.
So, you can download it from here also.
Now, if we go back here, we see “Machine Learning Detects Pan Cancer Ras Pathway Activation”.
So, here we will accept the cookies.
There's a description here.
Let’s wait some more.
Here you will get a free paper that you can read while you work on your datasets.
You can download the standard pdf, and here the paper is open in front of you.
They have also made available different works by different authors on machine learning, which you can freely refer to.
I will close this and move ahead.
We explored Amazon’s datasets.
Next we have Google Dataset Search. However, it does not hold only Google's own data; rather, it has created a listing of different datasets.
For example, if I search for something related to health, you will find that it provides data from different sources; for example, it is giving data from Kaggle and from other websites also.
So, it does not mean that it will provide google's own data only.
So, over here, it is giving health data from an open definition.
This is given from Algeria.
It is basically searching from different websites and showing you results.
This is also a good thing, because when we search normally, many websites may remain unsearched.
So, if you want to do a specific data search, you can go to its website, datasetsearch.research.google.com, and do it from there.
So, these were some of the options using which we can find datasets.
I will show you one more option that is the UCI Machine Learning repository.
We will explore this as well. It is a fantastic repository. Those who practice machine learning love this website, as it has a lot of datasets.
Here, you can see it is showing a record of 588 datasets, and it also gives a breakdown by task: 482 for classification, 137 for regression, 117 for clustering, and 56 for other. These add up to more than 588, presumably because a dataset can be listed under more than one task.
In attributes we have categorical, numerical, mixed.
In Data we have multivariate, univariate, sequential, time-series, text.
So, there are so many domains which are given in detail.
You can click on any data over here,
For instance, there is this dataset on car evaluation; we will click on it.
After this, you will find it has given an entire description also.
Then, in the Data Folder option, you will get the file in whatever format it is available in, be it zip, Excel, or any other, and you can download it from here.
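One thing to watch for with files from this repository is that the data file may not contain a header row, so you supply the column names yourself from the description page. Here is a minimal sketch for the car evaluation data, assuming a comma-separated file with seven columns. The column names below are my recollection of the dataset description and should be checked against the page, and the three sample rows are illustrative rows in the assumed layout, not a verified excerpt of the file.

```python
import csv
import io
from collections import Counter

# Column names as I recall them from the dataset description (an assumption;
# verify against the description page before relying on them).
COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]


def load_headerless(data_file, columns):
    """Read a CSV file that has no header row, using the given column names."""
    return list(csv.DictReader(data_file, fieldnames=columns))


# Illustrative rows in the assumed layout, standing in for the real file.
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,more,big,high,vgood\n"
    "med,low,4,4,med,high,acc\n"
)

cars = load_headerless(sample, COLUMNS)

# Count how many rows fall into each target class.
class_counts = Counter(row["class"] for row in cars)
print(class_counts)
```

Counting the target classes like this is a quick first check on whether a classification dataset is balanced.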
So, this was one more way to download the datasets.
Now, let us download our datasets.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Also, our team will try to solve your query.
In our next topic we will explore these datasets and use them in Statistics and EDA, meaning Exploratory Data Analysis, where we will look at univariate, bivariate and multivariate analysis.
Thank you very much for watching.
Remain Motivated.
Thank you