Hello,
I am (name) from LearnVern.
In our previous machine learning tutorial, we studied how to find ML datasets, meaning how we can search for datasets for machine learning algorithms on different websites and internet sources, or even manually.
So, today we are going to learn how we can use these datasets to perform Exploratory Data Analysis, also known in short as EDA.
So, let's begin by understanding: what is EDA?
From the word 'exploratory' we understand that it has something to do with exploring a thing. So what are we exploring? We are talking about data.
Meaning, to explore the data.
Let's take an example for it.
For instance, when you go to a new home, you first explore the house by visiting its hall, the kitchen, then the bedrooms, the balconies or any gallery. You may even go up to the terrace and check whether there is a garden. In this way you explore what types of rooms the house has and which spacious areas are present, everything, each section of that new house.
In the same way, when we receive data, we explore that data. Now, there are different ways to explore. When you visit a house, you can physically go from one room to another, from one place to another. But as we saw during the lockdown period, everything became virtual, and advancing technologies helped us do almost everything virtually. For example, using AR/VR everything can be seen virtually, so you can visit virtually as well.
In the same way when we talk about data, here we have different methods:
For instance, we can use statistics: we can find the mean, median and mode, we can find the max and min, or we can also create groupings. This is the statistical approach.
The next approach is visualisation, where data can be visualised through graphs or in the form of pie charts. So there are different ways to visualise the data.
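To make the statistical approach concrete, here is a minimal sketch in Python using only the standard library. The sample rows are made up for illustration (they are not taken from a real dataset), and the names `rows`, `summary` and `group_means` are assumptions for this example only.

```python
import statistics
from collections import defaultdict

# Made-up sample rows (species, petal length in cm), for illustration only.
rows = [
    ("setosa", 1.4),
    ("setosa", 1.5),
    ("setosa", 1.4),
    ("versicolor", 4.5),
    ("versicolor", 4.7),
    ("virginica", 6.0),
]

lengths = [length for _, length in rows]

# Basic statistics on a single column.
summary = {
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "mode": statistics.mode(lengths),
    "min": min(lengths),
    "max": max(lengths),
}

# Grouping: collect the values of each species, then average per group.
groups = defaultdict(list)
for species, length in rows:
    groups[species].append(length)
group_means = {name: statistics.mean(vals) for name, vals in groups.items()}

print(summary)
print(group_means)
```

The grouping step is done by hand here to keep the sketch dependency-free; a data analysis library would usually offer this as a single call.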
So, EDA just means to analyse, investigate or drill down into the data and explore it: what is the data trying to say? What is hidden in the data? And then to identify it.
So, this happens at a very primitive stage, that is, at the very beginning. Thereafter, as we proceed, we come across grouping, clustering, recommendation, prediction, mining algorithms and machine learning algorithms, which happen in later steps.
So, this primitive understanding of the data is known as EDA, that is, Exploratory Data Analysis.
So, let's move ahead and understand it in more detail.
So, why is it important to perform Exploratory Data Analysis?
When we perform EDA on the data, it increases our understanding of the 'variables'.
By now you must have understood what variables are, as we discussed in the previous topic when we worked on the iris dataset and also treated missing values. There we had different columns, like sepal length, sepal width, petal length and petal width, and the last column was the species, the target variable or label. These columns can be called attributes, fields, features or variables. These are all synonyms and they all mean the same thing, so don't get confused about it.
So when we apply EDA, it helps us understand these variables, and that helps us understand certain things about the data, such as what the median is, or what the average score was. This is the type of thing we get to know when we perform EDA.
For this we use the average (mean), minimum, maximum, median and mode, or we get the range or the quartiles.
So we do all these things through EDA.
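As a rough sketch of the range and the quartiles, here is one way to compute them with Python's built-in `statistics` module. The scores are made-up sample values. Note that `statistics.quantiles` is used with its default "exclusive" method, so other tools may report slightly different quartile values for the same data.

```python
import statistics

# Made-up exam scores, for illustration only.
scores = [35, 48, 52, 60, 61, 67, 70, 75, 82, 90, 95]

average = statistics.mean(scores)
value_range = max(scores) - min(scores)

# n=4 splits the data into four parts and returns the three cut points:
# first quartile, second quartile (the median) and third quartile.
q1, q2, q3 = statistics.quantiles(scores, n=4)

print("average:", round(average, 2))
print("range:", value_range)
print("quartiles:", q1, q2, q3)
```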
The next point is that many times there are errors in the data. We have already discussed missing values, but sometimes there are other errors too, which we need to find.
Also, there are outliers; this image shows outliers. They are like the 'odd man out' game that we play: values that look separate from the rest.
So, this information is acquired through EDA, and it is a prerequisite for the other analyses that need to be performed later.
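One common way to flag such "odd man out" values is the interquartile range (IQR) rule. This rule is not named in the session, so treat it as an assumption: a value is flagged if it lies more than 1.5 times the IQR below the first quartile or above the third quartile. The function name `find_outliers` and the sample values are hypothetical.

```python
import statistics

def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up values; 150 is deliberately far away from the rest.
values = [52, 55, 58, 60, 61, 63, 64, 66, 68, 70, 150]
outliers = find_outliers(values)
print(outliers)
```

A flagged value is only a candidate: it may be a data entry error or a genuine extreme observation, and deciding which is part of the exploration.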
Next part: as I said, at the primitive stage itself we identify certain patterns and trends, or visualise the data through graphs such as box plots, scatter plots, histograms, pie charts, area plots and trend lines on scatter plots.
Here, you can see that many diagrams have been drawn.
So, there are different types of formats that help us visualise, and that in turn increases our understanding of the data: what type of data it is and what its distribution pattern is. It also smoothens and simplifies the further analysis that needs to be performed.
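To give a feel for what a histogram shows, here is a small text-only sketch that counts how many values fall into each bin and prints one bar per bin. In practice a plotting library would draw this as a chart; the plain-text version is a stand-in so that the example needs nothing beyond the standard library. The function name `text_histogram` and the scores are made up for illustration.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count integer values per bin and return one text bar per bin.

    Bins that contain no values are skipped.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:>7} | {'#' * counts[start]}")
    return lines

# Made-up exam scores, for illustration only.
scores = [35, 48, 52, 60, 61, 67, 70, 75, 82, 90, 95]
lines = text_histogram(scores, bin_width=20)
for line in lines:
    print(line)
```

The longest bar marks the bin where most of the values sit, which is the distribution pattern mentioned above.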
Let me take you to the next slide and explain what can be done here.
The very first point is Univariate Analysis.
Here, 'uni' means 'one', which is why I have written 'one' here, and 'variate' means variable. As we have just understood, variables are the same as features or attributes.
So, it is basically about working on a single variable, i.e. one field or column.
Next, it is also about 'one dependent variable'. For instance, if you walk fast, you will reach the destination early.
So, reaching 5 min, 10 min or 15 min early depends on how fast you walk.
So, here what is depending?
Time.
On what?
On a variable.
Dependent variable means the output or target.
So what do we need to find?
How much time was taken?
So this is the target.
I will give you another example.
How many marks did you score?
It is dependent. On what?
Study hours, the number of lectures you attend, how attentive you are, and your prior knowledge of the subject.
So there are so many factors on which your score is dependent.
So in univariate analysis, there is one dependent variable, on which we do the analysis.
Next is Bivariate. Here 'bi' means two and 'variate' means variable.
So, when we have two variables, we check the cause and effect.
Cause and effect.
In the earlier example, we just had the scores, and what produced that score did not matter. But here we are interested in both things.
For instance, one student studied for 10 hrs and so got 80 percent, and another studied for only 2 hrs and so got X marks.
So, there are two variables.
Something is happening because of something else.
He drove the car fast and therefore met with an accident.
So rash driving resulted in an accident.
So, there is a cause, and so there is an effect.
He invested cleverly in stocks and therefore was able to make a profit.
Why? There is a cause.
OK, so it is about the cause and the relationship between the two variables.
OK, this is basically what bivariate analysis helps us to know.
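A simple way to measure how strongly two variables move together is the Pearson correlation coefficient. This measure is not named in the session, so take it as one possible illustration. The study hours and scores below are made-up values, and `pearson` is a hypothetical helper written out by hand so the sketch stays self-contained.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of products of deviations from the means; the 1/n factors
    # cancel in the final ratio, so they are left out.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: hours studied and the score obtained.
hours = [2, 4, 6, 8, 10]
scores = [35, 50, 62, 71, 80]
r = pearson(hours, scores)
print(round(r, 3))
```

A value close to 1 means the score rises as the study hours rise, and a value close to -1 means one falls as the other rises. A strong correlation shows a relationship between the two variables, but on its own it does not prove which one is the cause.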
After this, we also have multivariate analysis.
Here we have multiple variables on which we have to perform analysis.
So, over here we have cluster analysis, factor analysis, multiple regression analysis and principal component analysis.
These are further topics which will come next.
Here you saw uni, which means one.
Working on a single variable.
'Bi' means working on two variables.
'Multi' means many variables.
Working on more than two variables.
So we will do this analysis practically as well.
So friends, let's conclude here today.
We are stopping today’s session here itself.
We will see further parts in the next session.
So keep learning and remain motivated.
Thank you.