Hello,
I am Kushal from LearnVern.
This session is a continuation of the previous session.
So, let's see ahead.
So, here you can see, I have opened colab.research.google.com.
Now, here I will click on 'new notebook' and open a new notebook.
In this, we will see how to identify the missing values in a dataset and handle them.
So, first of all I will click on 'connect', over here,
So that my notebook is connected and I am allocated a CPU, along with the RAM and all the other resources that are required.
Here, you can see my cell, where I will type my code.
And to its left, you can see the Files option. If I click on it, this section opens.
Here, if you click on sample data, then you will get to see all the preloaded sample data.
So, here you can see, so many datasets are already available over here for training and testing.
I will minimise this as of now.
Here, we have the option to upload.
As I have already downloaded the iris data, I will upload that now.
So, I will click here, and go to the Machine Learning folder.
Here, I have the iris file available
By clicking on 'open' I will upload this file.
Now, here you can see a reminder, which says that the uploaded files will get deleted when the runtime is recycled.
When you close this runtime and open it again later, these uploaded files will be lost, because the allocated CPU gets shut down and destroyed during the recycle.
And you will be allocated a new virtual CPU.
So, here I have got my dataset.
So, let's begin with the first step .
So, What will be my first step?
You already know this first step.
That is, to import the libraries. Here I will need the pandas library, so I will write import pandas as pd.
So, this is how I can import it.
So, step two will be loading the dataset.
So, I will load the dataset using a function of the pandas library.
Now, here you can see, I have created a variable called data, and here I type pd.read to see the read functions.
So, there are a lot of functions to read, like read_clipboard, read_csv, read_excel and read_feather.
So I need the read_csv function here, because iris.data has comma-separated values.
So, here, I will put brackets and mention the path in single quotes.
So, by clicking on these 3 dots I copied the path.
Here, I put single quotes and then pasted it.
Now I will execute this.
Now, I will try to display the data, so here I typed data and shift + enter .
So, you can see here our data is displayed here.
Clear, okay!
So, here you might see that it has used the first row as the header, so let us handle this.
You will go here, where you have loaded your data, add a comma, and write header=None. After writing None, no row of our data will be treated as a header.
And without a header row, our data will be available with all its records present.
You can see here, there are 152 rows and 5 columns.
So our data is successfully loaded now.
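A minimal sketch of this loading step. The three rows below are made-up stand-ins in the same comma-separated layout as iris.data (an assumption, not the real file), read from an in-memory string so the example runs without uploading anything:

```python
import io
import pandas as pd

# Made-up sample rows (assumption); the second row has an empty sepal width.
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,,1.4,0.2,Iris-setosa\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Default behaviour: the first row is consumed as the header.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data, columns are numbered 0..4.
data = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)   # (2, 5)
print(data.shape)          # (3, 5)
print(list(data.columns))  # [0, 1, 2, 3, 4]
```

With the real file, the in-memory string would be replaced by the path copied from the Files panel.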
So, here you might see that
the column names are yet to be given properly.
Along with that, here you can see that there are some NaN values at places.
So, these are null values, or 'not a number', also known as missing values.
They are known by different names,
like NA, not a number, missing values.
We will identify these values and see the functions for them in a moment.
But before that, let us understand in brief what exactly this dataset is.
So, for that we will go over here and search for iris flower.
So iris is a plant which has these flowers.
So if you see, this flower has different species.
Just as roses come in different varieties like red, yellow, pink, white, small and big.
So, similarly, we are handling 3 of its species here.
Which are these 3 species?
One is named setosa.
The second species is versicolor.
And the third species is virginica.
So, we are handling these 3 species.
So, we have around 50 records for each species here, maybe one or two more or less for some.
So, first we will give them column names, because you will get confused with just the numbers 1, 2, 3 and 4.
So, the first column is sepal length.
So, for that I will write data.columns.
So with data.columns I can see the names of all the columns.
So, I will assign to data.columns itself, and here I will write sepal_length, and,
similarly, I will create all the other columns like sepal width, petal length and the others.
So, sepal_width.
Next I have petal underscore, sorry, control Z,
petal_length.
Next, petal_width,
and the last one is labels, or let's keep it singular as label.
So, this has changed.
And, now I am checking by typing data and shift+enter,
So, here I can see sepal length, sepal width, petal length and petal width.
So here it should have been G T H in length, so let me make this small correction.
So, here I did a small correction.
So, now you can see my column names properly.
So my column names are done.
Clear, guys! Perfect!
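The renaming step can be sketched like this. The two rows are made-up stand-ins for the loaded data (an assumption); the column names are the ones chosen in the session:

```python
import numpy as np
import pandas as pd

# Made-up rows with default numeric column names, as after header=None.
data = pd.DataFrame([
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, np.nan, 1.4, 0.2, "Iris-setosa"],
])

# Assigning a list to data.columns renames all columns at once;
# the list must have exactly one name per column.
data.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "label"]

print(data.columns.tolist())
```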
Now, let us handle the missing values.
So, first let us identify it.
So, for that here I have two functions.
isna is the first function.
And the other function that I have is isnull.
Both the functions work exactly the same, so don't get confused.
These are the same functions.
So, now here I can do data.isna() and then execute it.
So, here you can see that most values are False, but I have some that are True as well.
So, what does this True mean? It means that an NA, that is a null value, is present there.
So, I am doing the same thing with isnull,
data.isnull().
So, with this function also you can see you will get the same output of 152 by 5.
In the third you can see it's True, in the fourth also it's True.
So, you get exactly the same answers.
Both the functions work the same.
So, suppose you want to know how many null values are present column-wise, in sepal length, in sepal width and so on.
So, for that you can do data.isna(), chain the sum function onto it, and then execute it and see.
So, you can see that in sepal length there are 4,
in sepal width there are 8 null values, petal length has 6 null values, and petal width has 5.
In fact, label also has 5 null values.
So, this is the summary that we have received.
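A small sketch of the identification step, on a made-up four-row frame (an assumption, so the counts here differ from the ones in the session):

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing entries.
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 4.6],
    "sepal_width": [3.5, np.nan, np.nan, 3.1],
    "label": ["Iris-setosa", "Iris-setosa", None, "Iris-setosa"],
})

mask = data.isna()                # True wherever a value is missing
same = mask.equals(data.isnull()) # isnull is an alias of isna
counts = data.isna().sum()        # True counts as 1, so this is nulls per column

print(mask)
print(same)
print(counts)
```

Here counts comes out as 1 for sepal_length, 2 for sepal_width and 1 for label.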
Now, we understand that these are null values.
Because the first step should be to identify it only.
Now, the next step is to handle it.
So, the easiest way to handle them is to drop them.
As I had explained, you can drop them only if the missing values are very few.
So, here there are 152 records, in which the missing values are really few, so they can be dropped.
So, to drop we have the dropna function.
So we will use this function and see, so data.dropna().
So, you can see there were 152 records, and after all the NA values are dropped, there are 138 records remaining.
So, all the NA values have been dropped.
So, here you must have observed that rows 2 and 3 have been dropped, and rows 6 and 7 as well; these records are removed because they had null or missing values in them.
So this is one simple way.
Okay, got it !
So, you should notice that, after this, if I type data again here, it will still show 152 records; there will be no changes there.
So, this function doesn't make permanent changes.
If you want permanent changes, here you will have to write inplace=True.
So, I am not going to execute this for now, because if I do that it will change the entire data.
This I have written only for your knowledge.
If you want, you can experiment and see.
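If you want to experiment safely, here is a sketch on a made-up three-row frame (an assumption) that shows the difference between the default call and inplace=True. The in-place call is made on a copy so that the original stays available:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "label": ["Iris-setosa", "Iris-setosa", None],
})

# Default: dropna returns a new DataFrame and leaves data untouched.
cleaned = data.dropna()
print(len(cleaned), len(data))  # 1 3

# inplace=True modifies the object it is called on and returns None.
trial = data.copy()
result = trial.dropna(inplace=True)
print(len(trial), result)       # 1 None
```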
Now, we will see the capabilities of the function data.dropna.
In this, you will first of all see that it is written over here, axis=0, where axis decides whether you want to go row-wise or column-wise, 0 meaning row-wise and 1 meaning column-wise.
So, by default it is going row-wise.
So, the next option is how, which by default is 'any', meaning drop the row if any value in it is null.
Or, here you can type 'all', meaning drop the row only if all of its values are null.
Next, there is the option thresh, the threshold, which determines how many non-null values must be present in a row for it not to be deleted.
Next is subset, which determines which columns are checked for null values.
Next is inplace, meaning whether you want to make the changes in the original dataset or not.
So, here we have 1, 2, 3, 4, a total of 5 parameters on which we can work.
You can scroll down and read the description to know the details.
Like if axis is 0 it will drop the row,
1 meaning it will drop the column.
If how is 'any', then it will drop the row if any null value is present.
If how is 'all', then it will drop the row only when all its values are null.
With thresh, it checks the number of non-null values required to keep the row.
Below, you can see some more examples which you can explore by using it.
Now, here we will explore on our own, as to how it works.
So, here I will first work on thresh. Here I have 1, 2, 3, 4 and 5, a total of 5 columns.
So here, I say thresh=5, meaning a row should have all 5 values non-null to be kept, and the rest can be dropped.
And then enter.
Originally, I had 152 records.
So, now I have 138 records,
So, here it has deleted all the rows which did not have non-null values in all the columns.
Now, I will reduce this from 5 to 4,
meaning that if any one of the 5 columns has a null value, it is okay not to drop the row.
So, a row should have at least 4 non-null values to be kept, and the rest can be deleted.
So, I will run this.
After running this I have 146 records.
So, counting 147, 148, 149, 150, 151, 152, there were 6 records that did not have at least 4 non-null values out of 5 columns, and they got deleted.
For example, here there are 1, 2, 3 and 4, and even in this row 1, 2, 3 and 4.
So, these rows have 4 non-null values and only 1 null value, so they are kept.
So, in this way you can drop those rows which have many null values in them.
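The effect of thresh can be sketched on a made-up five-row frame (an assumption; the row counts differ from the 152-record data in the session):

```python
import numpy as np
import pandas as pd

# Non-null values per row: 5, 4, 3, 0, 4.
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, np.nan, 6.3],
    "sepal_width": [3.5, np.nan, np.nan, np.nan, 3.3],
    "petal_length": [1.4, 1.4, 1.3, np.nan, 6.0],
    "petal_width": [0.2, 0.2, 0.2, np.nan, 2.5],
    "label": ["Iris-setosa", "Iris-setosa", "Iris-setosa", None, None],
})

# thresh=5: keep only rows with at least 5 non-null values.
strict = data.dropna(thresh=5)

# thresh=4: a row may have one null value and still be kept.
relaxed = data.dropna(thresh=4)

print(len(strict))   # 1
print(len(relaxed))  # 3
```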
Correct!
Now, we will move ahead.
And go to the next parameter: data.dropna and then put brackets.
And as soon as you put brackets, you get all the suggestions.
Here, you write how. By default how is 'any', so you write 'all' instead.
Meaning, only if all the values in a particular row are null will that row be deleted.
So here, by default in our original datasets, we had 152 records.
Now, we will see how many records will remain with us after executing how is equal to all.
We have 150 records.
So, this means there are only 2 records which have completely null values in them.
All the values in those records are NA.
So, here only two records got deleted.
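A sketch of how='all' next to the default how='any', on a made-up four-row frame (an assumption) where only one row is completely empty:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, np.nan, 6.3],
    "sepal_width": [3.5, 3.0, np.nan, 3.3],
    "label": ["Iris-setosa", "Iris-setosa", None, None],
})

# how='any' (the default): drop a row if at least one value is null.
any_dropped = data.dropna(how="any")

# how='all': drop a row only if every value in it is null.
all_dropped = data.dropna(how="all")

print(len(any_dropped))  # 1
print(len(all_dropped))  # 3
```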
So, next we will write data.dropna, put brackets, and try using the subset parameter.
So, many times what happens is that we want to drop rows based only on a particular column, like I only want to drop the rows where the label column is null.
So, for that I will write subset, and what am I going to pass to it?
I will write label.
So, here you will see that only the rows with NA values in the label column have been deleted.
Rows with NA here, and here, were not removed, meaning only rows with a null label have been dropped.
And, 147 records are remaining here.
So, in this way you can drop NA values from one particular column, from here.
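A sketch of subset on a made-up four-row frame (an assumption). Note that subset takes a list of column names, and that null values in the other columns are left alone:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 6.3],
    "sepal_width": [3.5, 3.0, np.nan, 3.3],
    "label": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", None],
})

# Only the label column is checked; rows 1 and 2 keep their NaN values.
labelled = data.dropna(subset=["label"])

print(len(labelled))                       # 3
print(int(labelled["label"].isna().sum())) # 0
print(labelled)
```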
So, in these ways we saw that there are different options that we have, which we can use, when we want to drop the missing values.
But, sometimes we have to substitute also.
So what are we going to do at the time of substitution? Let us see!
So, first I will drop the NA values from the label column and then substitute a specific value in all columns.
So, here I will write inplace=True,
so the change is made in the original dataset.
So, here we have 150,
no, let me check data.
So, we have 147 records available, and all rows with NA values in the label column have been dropped.
So, now we will go ahead with imputation, meaning to impute or substitute.
So, we will come to substitution.
For that I will use the data.fillna method.
Here, in the brackets I will pass the value I want to fill.
So, here I have passed 3.7, meaning fill 3.7 in all null records, and then press enter.
So, now you can see wherever there was null, they are all filled by 3.7.
You can cross check this with the previous record, from up here.
Where, counting 0, 1, 2, 3, sepal width showed NaN.
Now it has changed to 3.7 over here.
So, in this way, wherever it was NA it has become 3.7 now.
So, this is the way, you can impute or substitute the missing values.
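The two steps together, dropping rows with a null label and then filling the rest with 3.7, can be sketched on a made-up four-row frame (an assumption):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 6.0],
    "sepal_width": [3.5, 3.0, np.nan, 2.9],
    "label": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", None],
})

# Step 1: remove rows whose label is missing, changing data itself.
data.dropna(subset=["label"], inplace=True)

# Step 2: fillna returns a new frame with every remaining null replaced by 3.7.
filled = data.fillna(3.7)

print(len(data))                     # 3
print(int(filled.isna().sum().sum()))  # 0
print(filled)
```

Dropping the null labels first matters here, because fillna(3.7) fills every column and would otherwise put the number 3.7 into the label column as well.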
So, that looked a little complicated, but it was easy!
So, today we saw
how we can identify the missing values.
Next, how we can drop the records with missing values.
And lastly, we saw how to impute or substitute them.
As we go ahead the complexity will increase, but we will simplify it for you to learn.
So, friends, let's conclude here today.
We will stop today's session over here.
The further parts we will cover in the next session.
So keep learning and remain motivated.
Thank you.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Also, our team will try to solve your query.