Hello,
I am Kushal from LearnVern. (6-second pause; music)
So far in our Machine Learning course, we have seen how the environment can be set up,
in which we covered
the Anaconda environment setup and
the Google Colab setup.
So, today we are going to learn about Data Wrangling, which you may also call Data Preprocessing.
We are going to cover what is involved in Data Preprocessing, or Data Wrangling, in great detail and in a very easy manner.
If we look up the dictionary meaning of wrangling, it is basically a dispute, or an argument that has existed for a very long time.
In that kind of wrangling, we also try to identify facts to understand what is true and what is not.
In the same way, in "Data Wrangling" we research the data to learn what exactly it is telling us and what facts are hidden in it.
This is what is also known as Data Cleansing and Data Munging.
Okay!
So this means that we do some processing on the raw data, apply some functions to it, and reformat it so that it takes a better, more usable form.
That way, it becomes more useful for our further analysis.
(01/37)
Now, let's see what the entire process of Data Wrangling looks like.
First, we identify what the data is like and where it is coming from.
So, in Discovering, we identify the data.
Now, once we have the data,
the second step is Cleaning, wherein we check whether there are missing values in any columns.
We substitute those values, or if a row is not that important, we can consider deleting it entirely.
In cleaning, we handle missing values, and if some values are very large or very small, we try to normalise them.
These are some of the activities of data cleaning.
So, after we are done cleaning,
comes the step of Data Validation,
in which we establish the trustworthiness of the data and also make some corrections.
Now, when data validation is also done,
comes the last step of Data Wrangling, that is, Structuring,
where the data is formatted in such a way that future processes will not face any difficulty in handling it.
Next, we will see all the steps and processes used to carry out Data Wrangling.
Clear up till now?
Now, let's look at the next slide.
EDA means Exploratory Data Analysis.
Now, just look at the first word itself:
Exploratory.
If I tell you to go to Kashmir and explore,
you will immediately feel happy just thinking about it: there will be snow, beautiful trees, and cold weather.
What are you doing there? You are exploring!
Similarly, when we have the data, what are we going to do? We will have to explore it.
Here, at first, some curiosity develops in our mind.
Questions arise in our mind,
such as:
What is the data trying to tell us?
What are the things I can remove from it?
What insights can I draw from it?
So that means our first step in EDA is that
we need to formulate questions.
Write down your curiosity on paper or in a notepad: what do you want to identify from the data, and what insights do you want to draw from it?
When you are done with this step, then go to the next step.
(04/04)
Next comes Searching for Answers.
You have many tools and techniques for this, which we will see in our upcoming sessions.
There is a lot of software and many libraries with which we will find answers to our queries.
After searching for the answers, we will go on to the next step.
Initially, our questions were based on very little understanding of the data, so they may be immature or incomplete; there is also the possibility that, while searching for answers, we come across many new aspects and dimensions.
So our next step is to refine our questions and even generate new ones.
In this way, the cycle is repeated again.
Clear? Okay!
In this way, we complete the entire process of Exploratory Data Analysis and acquire complete information from the data.
For example, suppose you are handling the sales and marketing analysis of a company,
for which you collect all the data.
The first question that will come to your mind is "How many sales has the company made?", and the answer can be found in the data.
From that, you understand that this year the sales were quite high.
Then the second question that comes to your mind is "What is the reason behind this growth in sales?"
So, this is a cycle which we execute multiple times.
And how many times the cycle is repeated is decided by "us", that is, by the machine learning engineers or data scientists themselves.
Or it can also be decided by SMEs (subject matter experts).
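To make that question-answer-refine loop concrete, here is a minimal sketch in pandas; the file name sales.csv, the year, and the column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("sales.csv")  # assumed columns: region, year, amount

# Question 1: how many sales has the company made this year?
total = df.loc[df["year"] == 2023, "amount"].sum()
print("Total sales in 2023:", total)

# The answer raises a refined question: where did the growth come from?
by_region = df.groupby(["year", "region"])["amount"].sum()
print(by_region)
```

Each answer we print here would typically lead us back to the top of the cycle with a sharper question.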
Now, let's go to our next slide and try to understand a new concept.
That is, ETL and ELT.
Here, ETL means Extract, Transform, and Load.
Whenever we pull data from a source, we call that Extract.
Then we Transform the data: as we just saw in data preprocessing and in EDA, we perform some operations on the data so that it becomes ready for further analysis.
We transform it, or restructure it.
And then the next step is Load, where we load it into the storage system connected to the analytical system in which we are going to proceed with our analysis of the data.
(06/30)
But sometimes the data is so unstructured, or semi-structured, that transforming it first is very difficult and time consuming; in that case, instead of following ETL, we use ELT.
That means we Extract the data,
after which we load it directly into our analytical system, and only then do we do the transformation and further analysis.
So, we have two mechanisms: one for structured data, that is, ETL,
and the other for semi-structured or unstructured data, that is, ELT.
We follow both of them.
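As a rough illustration of ETL in Python, here is a minimal sketch; the file orders.csv, its columns, and the SQLite target are all hypothetical.

```python
import sqlite3

import pandas as pd

# Extract: pull raw data from the source (hypothetical file name).
raw = pd.read_csv("orders.csv")

# Transform: clean and reshape so the data is ready for analysis.
raw = raw.dropna(subset=["order_id"])        # drop rows missing the key
raw["amount"] = raw["amount"].astype(float)  # enforce a numeric type

# Load: write the cleaned table into the analytical store.
with sqlite3.connect("analytics.db") as conn:
    raw.to_sql("orders", conn, if_exists="replace", index=False)
```

For ELT, only the order changes: load the raw extract into the analytical system as-is, and run the transformations there afterwards.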
Now, moving ahead, let's see:
what exactly is the handling of missing values?
Here, you can see in the diagram that the white portions of the picture are missing.
These missing values we will have to impute from somewhere and fill in.
When the data has a lot of missing values, we have to impute and substitute them.
But when only one value out of a hundred records is missing, we can even delete that record instead of substituting it.
So handling missing values is a part of Data Wrangling, or so-called Data Preprocessing.
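As a minimal sketch of both options in pandas (the tiny DataFrame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, np.nan, 31, 29],
                   "salary": [50000, 52000, np.nan, 58000]})

# Option 1: impute, substituting each missing value with its column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Option 2: if only a handful of records are affected, drop those rows.
dropped = df.dropna()

print(imputed)
print(dropped)
```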
Understood, guys? Great!
Our next topic in this is Feature Engineering.
Suppose we have three or four features, and some of them are repeated, that is, duplicated.
Like feature 3 over here, which is duplicated.
So, we can consider removing it.
And we do not consider this feature for data analysis.
In this case, I have limited the features and reduced them for my further analysis.
Because the more features we have, the more computation power is required, the more memory is needed, and the more the complexity of the algorithms increases.
So in Feature Engineering, we have to identify which features to select that will be useful in further processing.
Let's move ahead and understand this with an example.
Here, we have four features that are helping me to score high:
Discipline,
Hard Work,
Smart Work, and,
Have Milk.
Here, I see that the first three, Discipline, Hard Work, and Smart Work, are right.
But having milk or not having milk does not matter, because someone who doesn't drink milk can also score well, and those who drink milk do not necessarily always score high.
So I see that this feature is not that important, and I can remove it.
Now, I have just three features that can affect my score.
So, this is a part of Feature Engineering.
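As a rough sketch of that idea, here is how one might check and drop an uninformative feature in pandas; the column names and values are made up to mirror the example.

```python
import pandas as pd

# Toy data mirroring the example; all values are made up.
df = pd.DataFrame({
    "discipline": [9, 9, 5, 5],
    "hard_work":  [8, 9, 4, 5],
    "smart_work": [9, 8, 6, 5],
    "has_milk":   [1, 0, 1, 0],   # the feature we suspect is irrelevant
    "score":      [85, 86, 60, 59],
})

# Check how strongly each feature relates to the target.
print(df.corr()["score"])

# has_milk shows no useful relationship with score, so drop it.
reduced = df.drop(columns=["has_milk"])
print(reduced.columns.tolist())
```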
(09/43)
Now, the next part we will look at is
Data Normalisation.
For Data Normalisation, we have two diagrams.
On one side the data is 1, 2, 100, and on the other side the data is in decimals, where all the values lie between 0 and 1.
Now, tell me, what is the distance between 1 and 2?
The distance is 1!
But the difference between 2 and 100 is 98.
The difference has increased so much!
So here we say that the data is not normalised.
But I converted this same data; how I converted it, I will explain ahead.
So 1 got transformed into 0.0097,
2 got transformed into 0.0194, and
100 got transformed into 0.9708.
Alright!
Now, you can check the difference between the numbers; every value is between 0 and 1.
So, this is called Data Normalisation.
Now the calculations will work properly without consuming much computation power and memory.
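The slide's numbers match one simple scheme, dividing each value by the column total; min-max scaling is another common choice that also maps values into the 0 to 1 range. A minimal NumPy sketch of both:

```python
import numpy as np

x = np.array([1.0, 2.0, 100.0])

# Sum normalisation: divide by the total, matching the slide's numbers.
sum_norm = x / x.sum()
print(sum_norm)   # ≈ [0.0097, 0.0194, 0.9709]

# Min-max scaling: another common way to squeeze values into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())
print(min_max)    # ≈ [0.0, 0.0101, 1.0]
```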
Okay, now let us move ahead
and understand the complete lifecycle of Data Analysis.
First of all,
discover the questions.
Write down the questions for the analysis you want to do, right at the start.
Next, acquire the data, i.e., grab or extract it; we saw extraction when we discussed ETL.
Then, clean the data, i.e., preprocessing. So clean it.
Now, explore the data.
Then apply Feature Engineering.
Only after that is predictive modelling done, which is the machine learning part.
And then data visualisation, which is a part of Data Analytics.
So, does it stop here?
No, this is a continuous process; it's a cycle.
Every time, new data will come, and new aspects will be added to it.
So, in this way, our cycle keeps on moving.
So, this entire presentation that we saw was on Data Preprocessing and the data life cycle.
I hope you understood it well.
Be excited!
Next, we will see how to practically implement all of this.
And the next topics that we will cover are
Importing the Libraries and Importing the Datasets.
So, keep learning and keep watching.
Thank you very much.
If you have any queries or comments, click the discussion button below the video and post them there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to solve your queries.