Namaskar, I am Kushal from LearnVern.
In today's session on machine learning we will see part three of interview questions and answers. This session is based on scenario-based interview questions and answers.
Previously we saw some direct questions. Generally, direct questions will always be asked, whether you are a fresher or an experienced candidate, but if you are experienced then you must focus more on scenario-based questions.
So today we will discuss five scenario-based questions which will make your understanding even better. Our first question is a scenario, so listen carefully: you are given a training dataset having one thousand columns and one million rows. The dataset is based on a classification problem.
Your manager has asked you to reduce the dimensionality of this data so that model computation can be reduced. Your machine has memory constraints, so what would you do?
So let us first understand the question. We have a training dataset, meaning a dataset with which we are going to train our algorithm or model. It has one thousand columns and one million rows, so we have a significantly large number of columns, that is, a lot of features. The data is based on a classification problem; classification is a supervised approach, which means the data is labeled and we will have some categorical output. Your manager has asked you to reduce the dimensionality of this data so that computation time can be reduced, and your machine has memory constraints. So the task given to you is to reduce the dimensionality in order to cut model computation time, and the challenge, which we can see ourselves, is that one thousand columns is a very big number of features.
OK, so it should immediately strike you that with one thousand features there might be a problem of overfitting, and on top of that your machine has memory constraints. So you have to first understand all aspects of the question and then give an answer that covers all of them. Let us answer it step by step.

First, our machine has a RAM constraint, so we should stop all unnecessary applications and processes, whether it is a browser or some other software that is running and not required. Closing them frees up some memory, and this is the first step we should take as a best practice.

Second, since the data is so large, you should randomly sample it. Sending so much data for processing at one time would take a lot of time, and random sampling lets you avoid that; you can sample in batches of, say, a thousand rows and send data as per capacity.

Next comes reducing the dimensionality. Look at the numerical and categorical variables in your data, and look out for correlated variables. Many times our input variables X1, X2, X3 have correlation with each other; if you find such correlation, keep just one variable from each correlated group, and that will suffice. For categorical variables you can use a chi-square test. With this you reduce the number of variables, so the dimensions are reduced. After doing this you should apply PCA, Principal Component Analysis, which will pick the components where the maximum variance can be seen; this is for reducing the dimensionality.

You can also use stochastic gradient descent if you are building a linear model. And if you are a subject matter expert and experienced, you will have your own understanding of the predictors, that is, which inputs affect the output more. That is also a method to choose features, but it is intuitive, what we feel from inside, and if there is a mistake in that approach the loss will be very significant.
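As a minimal sketch of the last two steps (with made-up data and an assumed 0.95 correlation cutoff), dropping one variable from each highly correlated pair and then applying PCA could look like this:

```python
import numpy as np

# Hypothetical example data: 200 rows, 5 features, where column 1 is
# almost a copy of column 0 (i.e. the two are highly correlated).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)

# Step 1: keep only one variable from each highly correlated pair.
corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):
        keep.append(j)
X_reduced = X[:, keep]           # column 1 is dropped, 4 features remain

# Step 2: apply PCA on the remaining features, keeping the components
# that explain the most variance.
Xc = X_reduced - X_reduced.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance ratio of each component
X_pca = Xc @ Vt[:2].T            # project onto the top 2 components
```

Here the correlation threshold of 0.95 and the choice of two components are arbitrary assumptions for illustration; in practice you would pick them based on the data.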
So these were some ways in which this question can be answered. Now let us move to the next question. Question number two: you have been given a dataset on cancer detection. You have built a classification model and achieved an accuracy of ninety-six percent.
Why should you not be happy with your model performance, and what can you do about it? So it is a cancer detection dataset, and after building the classification model you got an excellent accuracy of ninety-six percent, but the question is why you should not be happy and what you should do. Let us understand. First, we are working on cancer data, so before going for an interview we should always explore the popular datasets and find out whether the labeled classification datasets are balanced or imbalanced; these are things you experience while working with data. This particular dataset is imbalanced. In an imbalanced dataset the minority class, the one with fewer records, is in a way penalized; in short, it is not treated well. Here the patients who have cancer are the minority class and they are very few. When the minority class is so small, we should not keep accuracy as the main performance parameter. Understand it once more: if one class has ninety or ninety-five percent of the records and another class has only five, ten or fifteen percent, then instead of accuracy we should prefer sensitivity, meaning the true positive rate, and specificity, meaning the true negative rate, that is, how correctly those who have cancer and those who do not have cancer are identified. Taking these together gives us the F-measure. So the true positive rate, the true negative rate and the F-measure are more important for us than accuracy, and we should concentrate on these things.
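A tiny sketch with made-up numbers shows why accuracy is misleading here: a lazy model that predicts "no cancer" for everyone still scores 95% accuracy, while sensitivity immediately exposes the failure:

```python
import numpy as np

# Hypothetical imbalanced labels: 950 healthy (0) and 50 cancer (1).
y_true = np.array([0] * 950 + [1] * 50)
# A lazy model that predicts "no cancer" for every patient:
y_pred = np.zeros(1000, dtype=int)

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / len(y_true)   # looks great on paper
sensitivity = tp / (tp + fn)         # true positive rate: zero cancers caught
specificity = tn / (tn + fp)         # true negative rate
```

With these numbers accuracy comes out at 0.95 even though sensitivity is 0.0, which is exactly why the true positive rate, true negative rate and F-measure matter more on imbalanced data.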
Moving further, for the minority class you can do a few things. You can use undersampling and oversampling: undersample the classes that have more records and oversample the classes that have fewer, for example by applying SMOTE, so that the data gets balanced. Or you can change the prediction threshold; there is a probability calculation method, and you can find an optimum threshold using the AUC-ROC curve. Another approach is to apply an additional weight to the minority class, so that it gets a larger value and is treated better. We can also use anomaly detection. So these are the steps you can perform on this particular dataset. Let us now move to the next question, question number three. You are assigned a new project which involves helping a food delivery company save more money.
The problem is that the company's delivery team is not able to deliver food on time; as a result their customers get unhappy, and to keep them happy the company ends up delivering the food for free. Which machine learning algorithm can save them? See, we all order food online, and if our food is not delivered on time we also get disturbed and always tell them to deliver food in time. So this is the company's problem: the delivery boys are not able to deliver in time. What will you do?
In this question, if you immediately start saying I will use this algorithm or that algorithm, decision tree, K-means, KNN, that is not what you should do. Here you should first try to find out whether this is a machine learning problem at all, or a routing problem, that is, whether we should do something so that deliveries follow the correct route. So is finding an optimized route the problem, or is it a machine learning problem? To identify a machine learning problem we should always check three things: there should be some pattern in the data; the problem should not be solvable with a normal mathematical equation or formula, because if a simple formula solves it then it is not a machine learning problem; and we should have data related to it. These three things we should take care of whenever we see a problem, to find out whether it is related to machine learning or not. Let us now move to the next question, question number four. You came to know that your model is suffering from low bias and high variance. Which algorithm should we use to tackle it, and why?
First you have to understand what low bias and high variance mean. Low bias is when the predictions the model gives match the actual values almost exactly. The problem is that the model starts to mimic the training data, it does a complete mimicry, and there is nothing to be very happy about in this, because you may get very good accuracy on the training data but the model does not generalize: if you give it any new data, it will not predict well. That is the problem with low bias and high variance; the model is good only for the training data and nothing else, and that is the challenge here.
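To make the symptom concrete, here is a small sketch with made-up data of a low-bias, high-variance model: a 1-nearest-neighbour classifier memorizes noisy training labels perfectly, so training accuracy is a perfect 1.0, yet on new data from the same source it does noticeably worse:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    # Made-up 1-D data: label is 1 when x > 0, with 20% of labels flipped.
    x = rng.normal(size=(n, 1))
    y = (x[:, 0] > 0).astype(int)
    flip = rng.random(n) < 0.2
    y[flip] = 1 - y[flip]
    return x, y

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

def predict_1nn(x):
    # 1-nearest-neighbour: pure memorization of the training set
    # (low bias, high variance).
    idx = np.argmin(np.abs(x_train[:, 0][None, :] - x[:, 0][:, None]), axis=1)
    return y_train[idx]

train_acc = np.mean(predict_1nn(x_train) == y_train)  # exactly 1.0: mimicry
test_acc = np.mean(predict_1nn(x_test) == y_test)     # clearly lower
```

The gap between the two accuracies is the high variance the question is about.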
So here you can use an ensemble technique; bagging is an ensemble technique, and random forest can be used, so the problem of high variance can be solved with it. What happens in bagging is that it divides your data into subsets, into samples, and these are repeated randomized samples, meaning record one can appear four times: the first sample has record one along with some records, the second sample with some different ones, the third with others again, so a record can be repeated, but randomly. These samples are used for training models, and after training we do voting or take out the average, and from that voting or average we get the final output. With this the high variance problem gets solved and our model starts generalizing. For high variance we also have regularization techniques, so you can use regularization, or you can take the top features from a variable importance chart; this technique too can reduce high variance and optimize the model. So let us now move to our next question, which is question number five.
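The bagging-and-voting procedure described for question four can be sketched as follows; this is a minimal illustration with made-up data and a deliberately weak "decision stump" learner, just to show the bootstrap-sample-then-vote mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up training data: label 1 when the sum of features is positive,
# with some label noise that a single memorizing model would latch onto.
X = rng.normal(size=(300, 5))
y = (X.sum(axis=1) > 0).astype(int)
noise = rng.random(300) < 0.1
y[noise] = 1 - y[noise]

def fit_stump(Xs, ys):
    # A very weak learner: threshold at zero on the single best feature.
    best, best_acc = (0, 1), 0.0
    for j in range(Xs.shape[1]):
        for sign in (1, -1):
            acc = np.mean((sign * Xs[:, j] > 0).astype(int) == ys)
            if acc > best_acc:
                best_acc, best = acc, (j, sign)
    return best

def predict_stump(model, Xq):
    j, sign = model
    return (sign * Xq[:, j] > 0).astype(int)

# Bagging: repeated randomized samples (drawn with replacement, so one
# record can appear several times), one model per sample, then voting.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    models.append(fit_stump(X[idx], y[idx]))

votes = np.mean([predict_stump(m, X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)        # majority vote
```

A real random forest does the same thing with full decision trees plus random feature selection; the stump here only stands in to keep the sketch short.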
You are given a dataset. The dataset contains many variables, some of which you know are highly correlated. Your manager has asked you to run PCA on it. Would you remove the correlated variables first, and why? So we have a dataset with a lot of variables, as in question number one. You know there are correlated variables, and you have to run PCA for dimensionality reduction. So will you remove the correlated variables first? The answer should be yes, and there is a reason for it. We know that PCA reduces the dimensionality, but correlated variables pose a challenge for PCA. What is the challenge? Assume we have three variables and two of them are correlated; if you run PCA on this, then the variance captured by the first principal component will be roughly twice what it would be for uncorrelated variables. Why is it affected? Because the variables are correlated. So it is better to first remove the correlated ones, so that the variables left with us are independent, and then run PCA; that is going to give us more benefit. So these were five scenario-based questions.
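The effect can be checked numerically. In this made-up sketch, two of three variables are nearly identical, and the first principal component's share of the explained variance roughly doubles compared with the uncorrelated case:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three made-up variables: x0 and x1 are almost identical (highly
# correlated), while x2 is independent of both.
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.05, size=500)
x2 = rng.normal(size=500)
X = np.column_stack([x0, x1, x2])

def explained_variance_ratio(M):
    # Fraction of total variance captured by each principal component.
    Mc = M - M.mean(axis=0)
    s = np.linalg.svd(Mc, compute_uv=False)
    return s**2 / np.sum(s**2)

with_corr = explained_variance_ratio(X)                # PC1 dominated by x0/x1
without_corr = explained_variance_ratio(X[:, [0, 2]])  # after dropping x1
```

With the correlated pair present, the first component's share is around two thirds; after dropping one of the pair, the two remaining independent variables split the variance roughly evenly, which is the behaviour the answer above describes.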
If you have any queries or comments, click the discussion button below the video and post them there. This way you will be able to connect with fellow learners and discuss the course, and our team will also try to resolve your queries.
Share a personalized message with your friends.