Namaskar, I am Kushal from LearnVern.
In today's session on machine learning we will see part three of interview questions and answers. This session is based on scenario-based interview questions and answers.
Previously we saw some direct questions. Generally, direct questions will always be asked, whether you are a fresher or an experienced candidate, but if you are experienced then you must focus more on scenario-based questions.
So today we will discuss five scenario-based questions which will make your understanding even better. Our first question is a scenario, so listen carefully: you are given a training dataset having one thousand columns and one million rows. The dataset is based on a classification problem.
Your manager has asked you to reduce the dimensionality of this data so that model computation can be reduced. Your machine has memory constraints, so what would you do?
So let us first understand the question. We have a training dataset, meaning a dataset with which we are going to train our algorithm or model. It has one thousand columns and one million rows, so we have a significantly large number of columns, that is, a lot of features. The data is based on a classification problem; classification is a supervised approach, which means the data is labeled and we will have some categorical output. Your manager has asked you to reduce the dimensionality of this data so that computation time can be reduced, and your machine has memory constraints. So the task given to you is to reduce the dimensionality in order to cut model computation time, and the challenge, which we can see ourselves, is that one thousand columns is a very big number of features.
OK, so it should immediately strike you that with one thousand features there might be a problem of overfitting, and on top of that your machine has memory constraints. So you have to first understand all aspects of the question and then give an answer that covers all of them. Let us answer it step by step.

First, our machine has a RAM constraint, so we should stop all unnecessary applications and processes, whether it is a browser or some other software that is running and not required. Closing them frees up some memory, and this is the first step we should take as a best practice.

Second, since the data is so large, you should randomly sample it. Sending so much data for processing at one time would take a lot of time, and random sampling lets you avoid that; you can sample in batches of, say, a thousand rows and send data as per capacity.

Next comes reducing the dimensionality. Look at the numerical and categorical variables in your data, and look out for correlated variables. Many times our input variables X1, X2, X3 have correlation with each other; if you find such correlation, keep just one variable from each correlated group, and that will suffice. For categorical variables you can use a chi-square test. With this you reduce the number of variables, so the dimensions are reduced. After doing this you should apply PCA, Principal Component Analysis, which will pick the components where the maximum variance can be seen; this is for reducing the dimensionality.

You can also use stochastic gradient descent if you are building a linear model. And if you are a subject matter expert and experienced, you will have your own understanding of the predictors, that is, which inputs affect the output more. That is also a method to choose features, but it is intuitive, what we feel from inside, and if there is a mistake in that approach the loss will be very significant.
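As a minimal sketch of the last two steps (with made-up data and an assumed 0.95 correlation cutoff), dropping one variable from each highly correlated pair and then applying PCA could look like this:

```python
import numpy as np

# Hypothetical example data: 200 rows, 5 features, where column 1 is
# almost a copy of column 0 (i.e. the two are highly correlated).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)

# Step 1: keep only one variable from each highly correlated pair.
corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):
        keep.append(j)
X_reduced = X[:, keep]           # column 1 is dropped, 4 features remain

# Step 2: apply PCA on the remaining features, keeping the components
# that explain the most variance.
Xc = X_reduced - X_reduced.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance ratio of each component
X_pca = Xc @ Vt[:2].T            # project onto the top 2 components
```

Here the correlation threshold of 0.95 and the choice of two components are arbitrary assumptions for illustration; in practice you would pick them based on the data.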
So these were some ways in which this question can be answered. Now let us move to the next question. Question number two: you have been given a dataset on cancer detection. You have built a classification model and achieved an accuracy of ninety-six percent.
Why should you not be happy with your model performance, and what can you do about it? So it is a cancer detection dataset, and after building the classification model you got an excellent accuracy of ninety-six percent, but the question is why you should not be happy and what you should do. Let us understand. First, we are working on cancer data, so before going for an interview we should always explore the popular datasets and find out whether the labeled classification datasets are balanced or imbalanced; these are things you experience while working with data. This particular dataset is imbalanced. In an imbalanced dataset the minority class, the one with fewer records, is in a way penalized; in short, it is not treated well. Here the patients who have cancer are the minority class and they are very few. When the minority class is so small, we should not keep accuracy as the main performance parameter. Understand it once more: if one class has ninety or ninety-five percent of the records and another class has only five, ten or fifteen percent, then instead of accuracy we should prefer sensitivity, meaning the true positive rate, and specificity, meaning the true negative rate, that is, how correctly those who have cancer and those who do not have cancer are identified. Taking these together gives us the F-measure. So the true positive rate, the true negative rate and the F-measure are more important for us than accuracy, and we should concentrate on these things.
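A tiny sketch with made-up numbers shows why accuracy is misleading here: a lazy model that predicts "no cancer" for everyone still scores 95% accuracy, while sensitivity immediately exposes the failure:

```python
import numpy as np

# Hypothetical imbalanced labels: 950 healthy (0) and 50 cancer (1).
y_true = np.array([0] * 950 + [1] * 50)
# A lazy model that predicts "no cancer" for every patient:
y_pred = np.zeros(1000, dtype=int)

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)   # looks great on paper
sensitivity = tp / (tp + fn)         # true positive rate: zero cancers caught
specificity = tn / (tn + fp)         # true negative rate
```

With these numbers accuracy comes out at 0.95 even though sensitivity is 0.0, which is exactly why the true positive rate, true negative rate and F-measure matter more on imbalanced data.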
Moving further, for the minority class you can do a few things. You can use undersampling and oversampling: undersample the classes that have more records and oversample the classes that have fewer, for example by applying SMOTE, so that the data gets balanced. Or you can change the prediction threshold; there is a probability calculation method, and you can find an optimum threshold using the AUC-ROC curve. Another approach is to apply an additional weight to the minority class, so that it gets a larger value and is treated better. We can also use anomaly detection. So these are the steps you can perform on this particular dataset. Let us now move to the next question, question number three. You are assigned a new project which involves helping a food delivery company save more money.
The problem is that the company's delivery team is not able to deliver food on time; as a result their customers get unhappy, and to keep them happy the company ends up delivering the food for free. Which machine learning algorithm can save them? See, we all order food online, and if our food is not delivered on time we also get disturbed and always tell them to deliver food in time. So this is the company's problem: the delivery boys are not able to deliver in time. What will you do?
In this question, if you immediately start saying I will use this algorithm or that algorithm, decision tree, K-means, KNN, that is not what you should do. Here you should first try to find out whether this is a machine learning problem at all, or a routing problem, that is, whether we should do something so that deliveries follow the correct route. So is finding an optimized route the problem, or is it a machine learning problem? To identify a machine learning problem we should always check three things: there should be some pattern in the data; the problem should not be solvable with a normal mathematical equation or formula, because if a simple formula solves it then it is not a machine learning problem; and we should have data related to it. These three things we should take care of whenever we see a problem, to find out whether it is related to machine learning or not. Let us now move to the next question, question number four. You came to know that your model is suffering from low bias and high variance. Which algorithm should we use to tackle it, and why?
First you have to understand what low bias and high variance mean. Low bias is when the predictions the model gives match the actual values almost exactly. The problem is that the model starts to mimic the training data, it does a complete mimicry, and there is nothing to be very happy about in this, because you may get very good accuracy on the training data but the model does not generalize: if you give it any new data, it will not predict well. That is the problem with low bias and high variance; the model is good only for the training data and nothing else, and that is the challenge here.
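To make the symptom concrete, here is a small sketch with made-up data of a low-bias, high-variance model: a 1-nearest-neighbour classifier memorizes noisy training labels perfectly, so training accuracy is a perfect 1.0, yet on new data from the same source it does noticeably worse:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Made-up 1-D data: label is 1 when x > 0, with 20% of labels flipped.
    x = rng.normal(size=(n, 1))
    y = (x[:, 0] > 0).astype(int)
    flip = rng.random(n) < 0.2
    y[flip] = 1 - y[flip]
    return x, y

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

def predict_1nn(x):
    # 1-nearest-neighbour: pure memorization of the training set
    # (low bias, high variance).
    idx = np.argmin(np.abs(x_train[:, 0][None, :] - x[:, 0][:, None]), axis=1)
    return y_train[idx]

train_acc = np.mean(predict_1nn(x_train) == y_train)  # exactly 1.0: mimicry
test_acc = np.mean(predict_1nn(x_test) == y_test)     # clearly lower
```

The gap between the two accuracies is the high variance the question is about.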
So here you can use an ensemble technique; bagging is an ensemble technique, and random forest can be used, so the problem of high variance can be solved with it. What happens in bagging is that it divides your data into subsets, into samples, and these are repeated randomized samples, meaning record one can appear four times: the first sample has record one along with some records, the second sample with some different ones, the third with others again, so a record can be repeated, but randomly. These samples are used for training models, and after training we do voting or take out the average, and from that voting or average we get the final output. With this the high variance problem gets solved and our model starts generalizing. For high variance we also have regularization techniques, so you can use regularization, or you can take the top features from a variable importance chart; this technique too can reduce high variance and optimize the model. So let us now move to our next question, which is question number five.
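The bagging-and-voting procedure described for question four can be sketched as follows; this is a minimal illustration with made-up data and a deliberately weak "decision stump" learner, just to show the bootstrap-sample-then-vote mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up training data: label 1 when the sum of features is positive,
# with some label noise that a single memorizing model would latch onto.
X = rng.normal(size=(300, 5))
y = (X.sum(axis=1) > 0).astype(int)
noise = rng.random(300) < 0.1
y[noise] = 1 - y[noise]

def fit_stump(Xs, ys):
    # A very weak learner: threshold at zero on the single best feature.
    best, best_acc = (0, 1), 0.0
    for j in range(Xs.shape[1]):
        for sign in (1, -1):
            acc = np.mean((sign * Xs[:, j] > 0).astype(int) == ys)
            if acc > best_acc:
                best_acc, best = acc, (j, sign)
    return best

def predict_stump(model, Xq):
    j, sign = model
    return (sign * Xq[:, j] > 0).astype(int)

# Bagging: repeated randomized samples (drawn with replacement, so one
# record can appear several times), one model per sample, then voting.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    models.append(fit_stump(X[idx], y[idx]))

votes = np.mean([predict_stump(m, X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)        # majority vote
```

A real random forest does the same thing with full decision trees plus random feature selection; the stump here only stands in to keep the sketch short.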
You are given a dataset. The dataset contains many variables, some of which you know are highly correlated. Your manager has asked you to run PCA on it. Would you remove the correlated variables first, and why? So we have a dataset with a lot of variables, as in question number one. You know there are correlated variables, and you have to run PCA for dimensionality reduction. So will you remove the correlated variables first? The answer should be yes, and there is a reason for it. We know that PCA reduces the dimensionality, but correlated variables pose a challenge for PCA. What is the challenge? Assume we have three variables and two of them are correlated; if you run PCA on this, then the variance captured by the first principal component will be roughly twice what it would be for uncorrelated variables. Why is it affected? Because the variables are correlated. So it is better to first remove the correlated ones, so that the variables left with us are independent, and then run PCA; that is going to give us more benefit. So these were five scenario-based questions.
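The effect can be checked numerically. In this made-up sketch, two of three variables are nearly identical, and the first principal component's share of the explained variance roughly doubles compared with the uncorrelated case:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three made-up variables: x0 and x1 are almost identical (highly
# correlated), while x2 is independent of both.
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.05, size=500)
x2 = rng.normal(size=500)
X = np.column_stack([x0, x1, x2])

def explained_variance_ratio(M):
    # Fraction of total variance captured by each principal component.
    Mc = M - M.mean(axis=0)
    s = np.linalg.svd(Mc, compute_uv=False)
    return s**2 / np.sum(s**2)

with_corr = explained_variance_ratio(X)                # PC1 dominated by x0/x1
without_corr = explained_variance_ratio(X[:, [0, 2]])  # after dropping x1
```

With the correlated pair present, the first component's share is around two thirds; after dropping one of the pair, the two remaining independent variables split the variance roughly evenly, which is the behaviour the answer above describes.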
If you have any queries or comments, click the discussion button below the video and post them there. This way you will be able to connect with fellow learners and discuss the course, and our team will also try to resolve your queries.
Share a personalized message with your friends.