Hello,
I am Kushal from LearnVern. (6-second pause; music)
So far in our Machine Learning course, we have seen how the environment can be set up,
in which we covered
the Anaconda environment setup and
the Google Colab setup.
So, today we are going to learn about Data Wrangling, which you may also call Data Preprocessing.
We are going to cover what is involved in Data Preprocessing, or Data Wrangling, in great detail and in a very easy manner.
If we look up the dictionary meaning of wrangling, it is basically a dispute, or an argument that has existed for a very long time.
In that kind of wrangling, we also try to identify facts to understand what is true and what is not.
In the same way, in "Data Wrangling" we research the data to learn what exactly it is telling us and what facts are hidden in it.
This is what is also known as Data Cleansing and Data Munging.
Okay!
So this means that we do some processing on the raw data, apply some functions to it, and reformat it so that it takes a better, more usable form.
That way, it becomes more useful for our further analysis.
(01/37)
Now, let's see what the entire process of Data Wrangling looks like.
First, we identify what the data is like and where it is coming from.
So, in Discovering, we identify the data.
Now, once we have the data,
the second step is Cleaning, wherein we check whether there are missing values in any columns.
We substitute those values, or if a row is not that important, we can consider deleting it entirely.
In cleaning, we handle missing values, and if some values are very large or very small, we try to normalise them.
These are some of the activities of data cleaning.
So, after we are done cleaning,
comes the step of Data Validation,
in which we establish the trustworthiness of the data and also make some corrections.
Now, when data validation is also done,
comes the last step of Data Wrangling, that is, Structuring,
where the data is formatted in such a way that future processes will not face any difficulty in handling it.
Next, we will see all the steps and processes used to carry out Data Wrangling.
Clear up till now?
Now, let's look at the next slide.
EDA means Exploratory Data Analysis.
Now, just look at the first word itself:
Exploratory.
If I tell you to go to Kashmir and explore,
you will immediately feel happy just thinking about it: there will be snow, beautiful trees, and cold weather.
What are you doing there? You are exploring!
Similarly, when we have the data, what are we going to do? We will have to explore it.
Here, at first, some curiosity develops in our mind.
Questions arise in our mind,
such as:
What is the data trying to tell us?
What are the things I can remove from it?
What insights can I draw from it?
So that means our first step in EDA is that
we need to formulate questions.
Write down your curiosity on paper or in a notepad: what do you want to identify from the data, and what insights do you want to draw from it?
When you are done with this step, then go to the next step.
(04/04)
Next comes Searching for Answers.
You have many tools and techniques for this, which we will see in our upcoming sessions.
There is a lot of software and many libraries with which we will find answers to our queries.
After searching for the answers, we will go on to the next step.
Initially, our questions were based on very little understanding of the data, so they may be immature or incomplete; there is also the possibility that, while searching for answers, we come across many new aspects and dimensions.
So our next step is to refine our questions and even generate new ones.
In this way, the cycle is repeated again.
Clear? Okay!
In this way, we complete the entire process of Exploratory Data Analysis and acquire complete information from the data.
For example, suppose you are handling the sales and marketing analysis of a company,
for which you collect all the data.
The first question that will come to your mind is "How many sales has the company made?", and the answer can be found in the data.
From that, you understand that this year the sales were quite high.
Then the second question that comes to your mind is "What is the reason behind this growth in sales?"
So, this is a cycle which we execute multiple times.
And how many times the cycle is repeated is decided by "us", that is, by the machine learning engineers or data scientists themselves.
Or it can also be decided by SMEs (subject matter experts).
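To make that question-answer-refine loop concrete, here is a minimal sketch in pandas; the file name sales.csv, the year, and the column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("sales.csv")  # assumed columns: region, year, amount

# Question 1: how many sales has the company made this year?
total = df.loc[df["year"] == 2023, "amount"].sum()
print("Total sales in 2023:", total)

# The answer raises a refined question: where did the growth come from?
by_region = df.groupby(["year", "region"])["amount"].sum()
print(by_region)
```

Each answer we print here would typically lead us back to the top of the cycle with a sharper question.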
Now, let's go to our next slide and try to understand a new concept.
That is, ETL and ELT.
Here, ETL means Extract, Transform, and Load.
Whenever we pull data from a source, we call that Extract.
Then we Transform the data: as we just saw in data preprocessing and in EDA, we perform some operations on the data so that it becomes ready for further analysis.
We transform it, or restructure it.
And then the next step is Load, where we load it into the storage system connected to the analytical system in which we are going to proceed with our analysis of the data.
(06/30)
But sometimes the data is so unstructured, or semi-structured, that transforming it first is very difficult and time consuming; in that case, instead of following ETL, we use ELT.
That means we Extract the data,
after which we load it directly into our analytical system, and only then do we do the transformation and further analysis.
So, we have two mechanisms: one for structured data, that is, ETL,
and the other for semi-structured or unstructured data, that is, ELT.
We follow both of them.
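As a rough illustration of ETL in Python, here is a minimal sketch; the file orders.csv, its columns, and the SQLite target are all hypothetical.

```python
import sqlite3

import pandas as pd

# Extract: pull raw data from the source (hypothetical file name).
raw = pd.read_csv("orders.csv")

# Transform: clean and reshape so the data is ready for analysis.
raw = raw.dropna(subset=["order_id"])        # drop rows missing the key
raw["amount"] = raw["amount"].astype(float)  # enforce a numeric type

# Load: write the cleaned table into the analytical store.
with sqlite3.connect("analytics.db") as conn:
    raw.to_sql("orders", conn, if_exists="replace", index=False)
```

For ELT, only the order changes: load the raw extract into the analytical system as-is, and run the transformations there afterwards.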
Now, moving ahead, let's see:
what exactly is the handling of missing values?
Here, you can see in the diagram that the white portions of the picture are missing.
These missing values we will have to impute from somewhere and fill in.
When the data has a lot of missing values, we have to impute and substitute them.
But when only one value out of a hundred records is missing, we can even delete that record instead of substituting it.
So handling missing values is a part of Data Wrangling, or so-called Data Preprocessing.
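As a minimal sketch of both options in pandas (the tiny DataFrame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 31, 29],
                   "salary": [50000, 52000, np.nan, 58000]})

# Option 1: impute, substituting each missing value with its column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Option 2: if only a handful of records are affected, drop those rows.
dropped = df.dropna()

print(imputed)
print(dropped)
```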
Understood, guys? Great!
Our next topic in this is Feature Engineering.
Suppose we have three or four features, and some of them are repeated, that is, duplicated.
Like feature 3 over here, which is duplicated.
So, we can consider removing it.
And we do not consider this feature for data analysis.
In this case, I have limited the features and reduced them for my further analysis.
Because the more features we have, the more computation power is required, the more memory is needed, and the more the complexity of the algorithms increases.
So in Feature Engineering, we have to identify which features to select that will be useful in further processing.
Let's move ahead and understand this with an example.
Here, we have four features that are helping me to score high:
Discipline,
Hard Work,
Smart Work, and,
Have Milk.
Here, I see that the first three, Discipline, Hard Work, and Smart Work, are right.
But having milk or not having milk does not matter, because someone who doesn't drink milk can also score well, and those who drink milk do not necessarily always score high.
So I see that this feature is not that important, and I can remove it.
Now, I have just three features that can affect my score.
So, this is a part of Feature Engineering.
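As a rough sketch of that idea, here is how one might check and drop an uninformative feature in pandas; the column names and values are made up to mirror the example.

```python
import pandas as pd

# Toy data mirroring the example; all values are made up.
df = pd.DataFrame({
    "discipline": [9, 9, 5, 5],
    "hard_work":  [8, 9, 4, 5],
    "smart_work": [9, 8, 6, 5],
    "has_milk":   [1, 0, 1, 0],   # the feature we suspect is irrelevant
    "score":      [85, 86, 60, 59],
})

# Check how strongly each feature relates to the target.
print(df.corr()["score"])

# has_milk shows no useful relationship with score, so drop it.
reduced = df.drop(columns=["has_milk"])
print(reduced.columns.tolist())
```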
(09/43)
Now, the next part we will look at is
Data Normalisation.
For Data Normalisation, we have two diagrams.
On one side the data is 1, 2, 100, and on the other side the data is in decimals, where all the values lie between 0 and 1.
Now, tell me, what is the distance between 1 and 2?
The distance is 1!
But the difference between 2 and 100 is 98.
The difference has increased so much!
So here we say that the data is not normalised.
But I converted this same data; how I converted it, I will explain ahead.
So 1 got transformed into 0.0097,
2 got transformed into 0.0194, and
100 got transformed into 0.9708.
Alright!
Now, you can check the difference between the numbers; every value is between 0 and 1.
So, this is called Data Normalisation.
Now the calculations will work properly without consuming much computation power and memory.
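The slide's numbers match one simple scheme, dividing each value by the column total; min-max scaling is another common choice that also maps values into the 0 to 1 range. A minimal NumPy sketch of both:

```python
import numpy as np

x = np.array([1.0, 2.0, 100.0])

# Sum normalisation: divide by the total, matching the slide's numbers.
sum_norm = x / x.sum()
print(sum_norm)   # ≈ [0.0097, 0.0194, 0.9709]

# Min-max scaling: another common way to squeeze values into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())
print(min_max)    # ≈ [0.0, 0.0101, 1.0]
```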
Okay, now let us move ahead
and understand the complete lifecycle of Data Analysis.
First of all,
discover the questions.
Write down the questions for the analysis you want to do, right at the start.
Next, acquire the data, i.e., grab or extract it; we saw extraction when we discussed ETL.
Then, clean the data, i.e., preprocessing. So clean it.
Now, explore the data.
Then apply Feature Engineering.
Only after that is predictive modelling done, which is the machine learning part.
And then data visualisation, which is a part of Data Analytics.
So, does it stop here?
No, this is a continuous process; it's a cycle.
Every time, new data will come, and new aspects will be added to it.
So, in this way, our cycle keeps on moving.
So, this entire presentation that we saw was on Data Preprocessing and the data life cycle.
I hope you understood it well.
Be excited!
Next, we will see how to practically implement all of this.
And the next topics that we will cover are
Importing the Libraries and Importing the Datasets.
So, keep learning and keep watching.
Thank you very much.
If you have any queries or comments, click the discussion button below the video and post them there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to solve your queries.