I am Kushal from LearnVern. (6 seconds pause; music)
In our previous topic, we saw
how we can install libraries and import them, and, after importing the libraries, how we can import datasets by using those same libraries.
We even went on to see
how some libraries, like scikit-learn (sklearn), are built so that they can also be used to load built-in datasets.
So, now let's move ahead with our next topic.
It is known as Handling Missing Data,
or, as we also call it, Handling Missing Values.
So, we will understand this topic in depth, and we will also learn to handle these missing values practically.
So, let's see: what are missing values, or missing data?
As you can see in the diagram,
all the parts in white are filled, but one portion is in black, which is not filled.
That means that portion is missing.
In the same way, when we are capturing data, or transferring it from its source to our destination, many times some data is left behind or missed.
It can happen that the data of a particular column goes missing, or that some cells are left empty here and there across the rows.
Either way, missing values are missing values, right?
So, missing values in the data are blank values.
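To make "blank values" concrete, here is a minimal sketch, assuming we use pandas and a small made-up table (the column names and numbers are hypothetical, not from the course). It only shows how blank cells can be located and counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: None / np.nan stand in for the blank cells.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "temperature": [31.5, np.nan, 29.0, 33.2],
})

# isna() marks every blank cell with True.
missing_mask = df.isna()

# Summing the mask column by column counts the blanks per column.
missing_per_column = missing_mask.sum()
print(missing_per_column)
```

Here each column has exactly one blank cell, so the count per column is 1.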
So, let's move ahead and understand.
Why do we have to face these missing values?
As I was just telling you,
the first and foremost issue here is
Data Corruption, where your data has been stored in the system for a very long time but, due to some malware or a virus entering your system, it got corrupted.
In that case, the entire data may have been corrupted so that it cannot be reused, or
it can also happen that a small part of it is still usable and the rest has been deleted, or
maybe it is not recognised by the system when it is loaded.
So, this is Data Corruption.
Right! Okay.
Now, the next issue could be
Failure to Record Data,
meaning something went wrong while we were capturing the data.
As we all know, sensors are everywhere today; our mobile phones have them too: face recognition, the camera and the touch screen all rely on sensors.
So, sometimes you must have seen that when we touch our mobile, it is unable to recognise the touch, maybe because of some loose connection, or maybe because the sensor malfunctioned.
The reason can be anything, but this is a failure to record data.
Your action is itself data, so when we moved our finger from one place to another, the phone failed to capture that data.
Or we can take another example, measuring temperature, where sensors are also used.
Here, at times the sensor responds slowly, or it captures a reading at one moment and misses it the next.
So, in that case also, there is a failure to record data, and that too is missing data.
Third is Incomplete Extraction.
We all do window shopping!
So, suppose there is an online selling company, XYZ, where a lot of data gets generated through the browser.
When we browse online for window shopping, that browsing data gets captured through the browser.
Now suppose some function or API of the programme that captures this browsing data gets affected and stops working.
It can happen, because software keeps getting upgraded, and in that process one function may not have been upgraded properly, so the little data that was supposed to be captured through that API was not captured, while the rest was captured normally.
So, at that moment, when we extract the data, it will be extracted incompletely.
Not all of the data will be there, because one API was broken and the data associated with it was never captured.
Last is No Response,
which is related to surveys.
So, suppose there are 10 questions in a survey, and the user is fine with 9 of them.
But he finds the last question, the 10th, too personal.
So he doesn't want to answer that 10th question.
This happens very often.
When users skip the questions they don't want to respond to and leave them as NA or blank,
this is known as No Response.
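As a rough illustration of how such skipped answers show up once the survey is loaded, here is a small sketch assuming the responses arrive as CSV text and are read with pandas. The column names (respondent, q9, q10) and the answers are made up for this example. By default, `pd.read_csv` treats both an empty field and the text NA as missing.

```python
import io
import pandas as pd

# Hypothetical survey export: respondent 2 wrote NA for question 10,
# respondent 3 left it blank.
csv_text = """respondent,q9,q10
1,4,5
2,3,NA
3,5,
4,2,3
"""

survey = pd.read_csv(io.StringIO(csv_text))

# True for every respondent who gave no response to question 10.
no_response = survey["q10"].isna()
print(survey[no_response])
```

Both styles of "no response" end up as the same missing-value marker, so they can be handled together later.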
Understood, guys? Perfect.
So, these are the 4 reasons I have discussed here due to which we face missing values, or missing data.
So, let us move ahead and understand:
if some missing data does come in, what challenges will we have to face?
Yes, we will face challenges!
The first challenge is that it reduces statistical power.
That means, if you have 1000 records, and around 200 of those 1000 records have missing data,
then any statistical method you apply has only about 800 complete records to work with.
So, you will have to do something about it, or try to fill in that data.
In this way, missing data weakens the statistical power of the analysis.
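The 1000-records example can be sketched in code. This is only an illustration under assumptions: the values are randomly generated, exactly 200 of them are blanked out at random, and the standard error of the mean is used as a simple stand-in for "statistical power".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1000 made-up measurements, then 200 of them blanked out at random.
values = pd.Series(rng.normal(loc=50.0, scale=10.0, size=1000))
missing_idx = rng.choice(1000, size=200, replace=False)
values.iloc[missing_idx] = np.nan

n_total = len(values)
n_available = int(values.count())  # count() ignores missing values

# Standard error of the mean = sample std / sqrt(number of usable records).
se_with_missing = values.std() / np.sqrt(n_available)

# For the same spread, having only 800 instead of 1000 usable records
# inflates the standard error by this factor.
inflation = np.sqrt(n_total / n_available)
print(n_available, round(float(inflation), 3))
```

With fewer usable records, every estimate becomes less certain, which is what "weaker statistical power" means in practice.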
The second challenge is bias in the estimation of parameters.
Normally, we do analysis through data.
We run machine learning algorithms on datasets.
Here, every single attribute, or column, of the data is really important for us.
We have to check whether a particular parameter is important for the output or not.
And if there are missing values, it becomes difficult to judge that, and our estimates can end up biased.
The last is that it reduces the representativeness of the samples.
So, what is a sample? You must have studied in statistics that the entire data together is called a 'population', and a small set of data extracted from this large population is called a sample.
Samples are really very useful,
because often the data is so large that it becomes difficult to process with the computation resources or algorithms we have access to.
Computation and memory are limited, so there are often constraints.
So, when we are working on a sample, "representativeness of the samples" means that
the sample represents the population.
For example, take any survey that is done.
It doesn't survey all 100 million people in the population; it considers only 100 or 1000 people from different states and regions and treats them as representative of the entire population.
This is how it happens!
With missing data, the representativeness of the sample reduces.
So, this is a challenge,
and we will have to solve this challenge.
Right! Okay.
So, let's see how we can solve it.
There are different techniques to handle missing values.
Here, I am explaining them to you verbally,
and soon after this we will do it practically.
The first method is deleting the rows with missing values.
If some rows in your records have missing values, you can delete those rows.
Yes, you can do that!
But here you have to take care: if you have 100 records and 1, 2 or up to 5 of them have missing values, then it's okay to delete them.
But if you have to delete more than 5 percent of the data, it will affect the entire dataset.
So remember to delete rows with missing values only when they are very few, 5 percent or less.
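The 5 percent rule of thumb can be sketched like this, assuming pandas. The helper name `drop_rows_if_few_missing` and its `max_fraction` parameter are hypothetical, written only for this illustration; the actual deletion is done by pandas' `dropna()`.

```python
import numpy as np
import pandas as pd

def drop_rows_if_few_missing(df, max_fraction=0.05):
    """Drop rows with missing values only if they are at most max_fraction of all rows."""
    rows_with_missing = df.isna().any(axis=1)
    fraction = rows_with_missing.mean()  # share of rows that have a blank
    if fraction <= max_fraction:
        return df.dropna()
    # Too many rows would be lost, so leave the data untouched.
    return df

# 100 made-up records, 3 of them with a missing price.
df = pd.DataFrame({"price": [float(i) for i in range(100)]})
df.loc[[3, 40, 77], "price"] = np.nan

cleaned = drop_rows_if_few_missing(df)
print(len(df), len(cleaned))
```

Here 3 percent of the rows have a blank, so they are dropped. If more than 5 percent were affected, the function would return the data unchanged, and one of the other methods would be a better choice.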
The next method is to impute.
The next two points you will see are on imputation.
In imputation, we deal with continuous or categorical values.
Here, you can see continuous and categorical.
Continuous means values like 1, 2, 1.1 (one point one), 2.5 (two point five).
So, this is a numerical value.
For example, if a particular house is priced at some value, then what are the prices of 10 houses?
These are numerical values.
Or, say I am running in a marathon and 10 other people are running along with me.
What is our speed? This is also a continuous numerical value.
So, some values are continuous, and
the others are categorical.
Categorical values mean
things like day and night, or red, yellow, blue, that is, some fixed number of choices,
like first, second and third.
Here, the meaning of "impute" is substitute.
So if we have a missing value, we can substitute it by applying some method.
For example, I have 100 products, and the price of every product is 99.
I forgot the rate of 1 product.
So that rate will also be put down as 99, by taking the average of all the others.
So, this could be one way.
We will see different methods ahead, but imputing the missing value is one way.
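The product-price example can be sketched as follows, assuming pandas. The continuous column is filled with the mean, as described above. The colour column and the choice of filling a categorical column with its most frequent value (the mode) are assumptions added for illustration, since the lecture only names the two kinds of values here.

```python
import numpy as np
import pandas as pd

# 100 products: 99 known prices of 99, and 1 forgotten price.
prices = [99.0] * 99 + [np.nan]
# Hypothetical categorical column with one blank as well.
colours = ["red"] * 60 + ["blue"] * 39 + [None]
df = pd.DataFrame({"price": prices, "colour": colours})

# Continuous value: substitute the average of the known prices.
df["price"] = df["price"].fillna(df["price"].mean())

# Categorical value: substitute the most frequent category.
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

print(df.tail(1))
```

The forgotten price becomes 99, the average of all the others, and the blank colour becomes red, the most common choice.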
The next method is 'using algorithms that support missing values'.
You can directly choose an algorithm that is built to support missing values.
So, this is also a good way.
Next is 'prediction of missing values'.
Just as we spoke about imputation of values before, here we use an algorithm that predicts the missing values, and in that way too our missing values get filled in.
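A minimal sketch of this idea, under assumptions: the table of house areas and prices is made up, and a straight-line fit with NumPy's `polyfit` stands in for "an algorithm that predicts". The line is fitted on the rows where the price is known and then used to fill in the row where it is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the price of the last house is missing.
df = pd.DataFrame({
    "area": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price": [30.0, 50.0, 70.0, 90.0, np.nan],
})

known = df[df["price"].notna()]
unknown = df["price"].isna()

# Fit price = slope * area + intercept on the rows with a known price.
slope, intercept = np.polyfit(known["area"], known["price"], deg=1)

# Predict the missing price from the area and fill it in.
df.loc[unknown, "price"] = slope * df.loc[unknown, "area"] + intercept
print(df)
```

Unlike the average, the predicted value depends on the other columns of the same row, so two rows with missing prices can receive different values.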
So, lastly, moving ahead with these methods, we can impute values through "deep learning", which is a specialised part of machine learning itself.
As it is named here, we can do
'imputation of data using a deep learning library, datawig.'
So, these were some of the methods through which we understood how we can handle missing data.
After seeing this practically, we will go ahead to encoding categorical data.
Till then, keep watching and remain motivated.
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Also, our team will try to solve your query.