Hello,
I am Kushal from LearnVern.
Welcome to our Machine Learning course.
This session is a continuation of the previous tutorial.
So, let's move ahead.
In this session we will look at
splitting the data into training and test sets, practically.
Here I have written the steps,
So, first, I will have to upload the dataset.
Then we will see how we can do manual splitting.
And then, we will see how we can use the train_test_split function of sklearn.model_selection for splitting.
So, first I will click on this folder, then click on the upload icon, and upload the iris dataset from here.
So, once this data is uploaded, I will run the rest of my program on it.
So, I will copy its path.
And I will close this and go to a new cell.
So, first I want a library, which is pandas.
So I write import pandas as pd; this is a favourite library for handling data, especially when the data is structured.
So, now we have imported the library.
After this,
we will write data equal to pd.read_csv; this function will help us read the iris data.
So, read_csv, and in the brackets we will give the path to the file.
And in this data, we know that there is no header row,
so I will have to mention here header=None.
With header=None, pandas will not treat the first row as column names.
Along with this, I should mention the column names, so I will pass them as a list through the names argument: sepal length for the first column, then sepal width, third is petal length, fourth is petal width, and the last remaining column is labels.
So, in this way I am loading the data.
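The loading step described above can be sketched roughly as follows. This is only an illustration: the file path is not given in the session, so an in-memory buffer with three made-up rows stands in for the uploaded file, and the exact spellings of the column names are assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for the uploaded file: a few headerless rows in the
# same five-column layout as the iris CSV. In the notebook you would pass the
# copied file path to pd.read_csv instead of this in-memory buffer.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Assumed spellings of the column names that are read out in the session.
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "labels"]

# header=None tells pandas that the first line is data, not column titles;
# names= supplies the titles ourselves.
data = pd.read_csv(sample_csv, header=None, names=column_names)

print(data.shape)
print(data.head())
```

Without header=None, pandas would use the first data row as the column titles and that record would be lost.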
Here I will write data, execute this, and check that we have 152 records.
So, here I can see some null values.
So, without doing much treatment, I will simply use dropna and drop those rows for now.
So, to drop them I will write data.dropna with inplace=True, and now I will execute this.
And let's see how many of the 152 records remain.
There are 138 records remaining without null values, and we can work further on these.
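A small sketch of the same dropna step, using a made-up four-row frame rather than the real iris file (the values and column names here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with two incomplete rows, standing in for the iris data.
data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8],
    "petal_length": [1.4, 4.7, 6.0, np.nan],
    "labels": ["setosa", "versicolor", "virginica", "virginica"],
})

print(data.isna().sum())  # number of null values in each column
rows_before = len(data)

# Drop every row that has at least one null value, modifying data in place.
data.dropna(inplace=True)

rows_after = len(data)
print(rows_before, rows_after)
```

Note that dropna keeps the original row labels of the surviving rows, so after dropping, the index can contain gaps and values larger than the number of rows.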
First we will see the manual method.
For the manual method, you should know that in this data there is an input and an output.
The input is called X,
and the output is called y.
So, I will first separate this dataset into X and y.
x is equal to data.iloc; iloc is an indexer through which I say that I want all the rows, but in columns I need positions 0 to 3. Since the end of the range is excluded, to go up to 3 I have to write 4 here, that is 0:4.
So, I extract x from here, and you can see x has been extracted with sepal length, sepal width, petal length and petal width.
So, now I want to extract y as well.
For this, y is equal to data.iloc, and here I pass that I want all the rows, but in columns I want only the one at position 4, the labels column, and no other columns.
Here you will see that y also has 138 records. Now let's move forward.
So, in this way the data is divided into x and y.
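The two iloc selections can be sketched like this, on a made-up three-row frame with assumed column names in the same layout as the iris data:

```python
import pandas as pd

# Made-up rows in the same five-column layout as the iris data.
data = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "labels": ["setosa", "versicolor", "virginica"],
})

# All rows, columns at positions 0, 1, 2 and 3 (the stop value 4 is excluded).
x = data.iloc[:, 0:4]

# All rows, only the column at position 4 (the labels). This gives a Series.
y = data.iloc[:, 4]

print(x.columns.tolist())
print(y.name)
```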
Now, moving ahead, we have to divide each of them into training and testing.
So, first we will divide x
into training and testing.
So, we will begin with x.
So, x_train is equal to x.iloc, which we just used (you could also simply slice x directly); here we want the first hundred records in x_train, so we write :100.
And in x_test we want the rest of the records, from 100 onwards.
So, here x.iloc and 100:, so in the first slice position 100 is not included and in the second one it is included.
So, we have executed this also.
Now when we check the data, we can see x_train has 100 records, and similarly when we check x_test, we get all the remaining ones.
So, as we have done x, we are required to do y as well.
So for y what are we going to do?
So, y_train: we are giving the first 100 of the 138 records, roughly 72 percent of the data, for training,
so here we directly give :100.
So we will give the first 100 records to y_train.
And to y_test we will give y from 100 onwards, that is, whatever remains after the first 100.
So, here you can see the output in this way.
So, we will look at y_test also; its index starts here from 114 and goes up to 151.
So, this is how we can manually split the data into training and testing.
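The manual split above can be sketched as follows. The data here is made up (138 rows, to mirror the record count in the session), and iloc is used for y as well so that the slice is always by position:

```python
import pandas as pd

# Made-up data with 138 rows, mirroring the record count after dropna.
n_rows = 138
x = pd.DataFrame({"feature": range(n_rows)})
y = pd.Series(range(n_rows), name="labels")

# First 100 rows for training (position 100 is excluded from this slice).
x_train = x.iloc[:100]
y_train = y.iloc[:100]

# Everything from position 100 onwards for testing.
x_test = x.iloc[100:]
y_test = y.iloc[100:]

print(len(x_train), len(x_test))
print(len(y_train), len(y_test))
```

Every row lands in exactly one of the two sets, and x and y are cut at the same position so inputs and labels stay aligned.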
Now, if you want to automate this, we will need to use a library function.
But before that, there was a problem here which I would like to discuss with you all.
When we divided the data, it was done sequentially, so the test records are all virginica.
So this type of sequential division is not good for algorithms,
because in this way the data received for training becomes completely different from the data received for testing.
So this is not good.
An algorithm should always get a balanced division of the data.
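To see the problem concretely, here is a small sketch with made-up labels sorted by class (50 per class is an assumption for illustration, not the exact counts from the session):

```python
import pandas as pd

# Made-up labels sorted by class, as in an ordered iris file.
y = pd.Series(["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50)

# Sequential split: first 100 rows for training, the rest for testing.
y_train = y.iloc[:100]
y_test = y.iloc[100:]

train_classes = sorted(y_train.unique())
test_classes = sorted(y_test.unique())

print(train_classes)  # classes the model would see during training
print(test_classes)   # classes the model would be tested on
```

In this sketch the training labels never include virginica, while the test labels contain nothing else, so a model would be tested only on a class it has never seen.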
So, what do we do in such cases?
The second method is sklearn.
This library will help us.
So, let's see it with this as well.
So, from sklearn.model_selection we import train_test_split.
So, with the help of this we are going to do all the work.
So, here I will write all four variables together,
meaning x_train, x_test, y_train, y_test on the left of the equals sign.
So x_train, x_test, y_train, y_test is equal to train_test_split.
And inside that we will pass x and y, along with the test size that we want, so test_size is equal to, let's take, 30 percent, that is 0.3, and then random_state.
You can give whatever random seed value you want.
So let me put random_state equal to 1.
Now, enter.
Let it get executed.
Now this has been executed.
Now look at x_train: how many records does it have?
So, here you can see 108, meaning about 70 percent of the records have come here.
Then check x_test; it should have around 30 percent of the records.
Similarly, we will look at y_train, and the output looks good.
And in the same way, for y_test we get the output for testing.
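A minimal sketch of the train_test_split call described above, run on made-up data (the frame, its column names and the class counts are assumptions for illustration; only test_size=0.3 and random_state=1 come from the session):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up inputs and class-sorted labels standing in for the iris x and y.
n_rows = 100
x = pd.DataFrame({
    "feature_a": range(n_rows),
    "feature_b": range(n_rows, 2 * n_rows),
})
y = pd.Series(
    ["setosa"] * 34 + ["versicolor"] * 33 + ["virginica"] * 33,
    name="labels",
)

# test_size=0.3 keeps about 30 percent of the rows for testing;
# random_state=1 fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1
)

print(len(x_train), len(x_test))
print(y_test.value_counts())
```

train_test_split shuffles the rows by default before cutting, which is why the test set is no longer made up of only the last class in the file.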
So, we will use this training and testing data ahead with the machine learning algorithms.
So, friends, let's conclude here for today.
Today's session will end here.
And we will cover its further parts in the next session.
So keep learning and remain motivated.
Thank you.
If you have any questions or comments related to this course,
then you can click on the discussion button below this video and post them there.
So, in this way you can discuss this course with many other learners like you.