Namaskar, I am Kushal from LearnVern.
This tutorial continues from the last one, so let us carry on. In today's machine learning session we are discussing the loan prediction project use case. We will begin with the dataset, understand the problem statement, and then move towards implementation.
First, the dataset. This is the loan prediction dataset, which will be made available for you to download. The columns are as follows. Loan ID is a unique number. Gender is categorical data, male or female. Married tells whether the applicant is married, yes or no. Dependents is the number of dependents. Education tells whether the applicant is a graduate or an undergraduate. Self employed tells whether the applicant is self employed or not. Applicant income is the applicant's income, and co-applicant income is the income of a co-applicant, who is generally a wife, husband or brother. Loan amount is given in thousands. Loan amount term is the number of months in which the loan has to be repaid. Credit history tells whether the credit history meets the guidelines or not. Property area is urban, semi-urban or rural. The last column is loan status, whether the loan has been approved or not, given as yes or no.
Now let us see what the problem statement actually is. Dream House Finance Company is a finance company that deals in home loans, meaning it helps people get home loans. It has a presence across urban, semi-urban and rural areas. A customer applies for a home loan, and after that the company validates whether the customer is eligible for the loan or not.
Now suppose I sit down and do this work myself. I can manage it with two or three more people, but if customers keep coming in thousands and keep applying, I will need a big team, because one person cannot process forms for a thousand applicants every day. A big team has a cost. If instead I handle this with only two to four people, or automate the whole process with machine learning, I will save a lot of cost.
So the problem statement is this: the company wants to automate the loan eligibility process. After automation, whatever details are required from the customer, such as gender, marital status, education, number of dependents, income, loan amount and credit history, should be taken at the beginning. On the basis of those, our algorithm should classify the customer and tell us in the output whether the customer is eligible or not. We have old data, and on the basis of that data we have to solve this problem. So let us begin and implement it.
We will implement this step by step. I have already made a notebook and kept it here. First I have to load the data; you can download it from the website and upload it into your Jupyter notebook or Colab. We have two datasets here, train and test. I will work with train for now, because it is my learning and training dataset.
Here you can see train.csv. Now I will import the libraries I need initially: import pandas as pd, then import numpy as np, and for visualization import matplotlib.pyplot as plt. Just for the diagrams we also add matplotlib inline, so that the plots are drawn inside the notebook. So we have imported these three libraries, and the dataset is already uploaded on the system. Now I will load this data as a data frame: data = pd.read_csv(...), where you either copy the path or type it directly; here I have copied and pasted it. Let us execute it, and then use data.head() to see how the data has been loaded. Here you can see loan id, gender, married, dependents, all the columns we were discussing, and the last column is loan status, which holds Y or N, meaning whether the loan was given or not. Because the output is already given, this is labeled data, so this is a supervised problem and we can solve it.
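The loading step described above can be sketched as follows. This is a minimal illustration, not the notebook itself: the three rows are made up, the exact column spellings are assumed from the column list given earlier, and an in-memory string stands in for the real train.csv path.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv: a few invented rows that follow the
# column layout described in this session. In the notebook you would pass
# the real file path to pd.read_csv instead of a StringIO object.
SAMPLE_CSV = """Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP000001,Male,No,0,Graduate,No,5000,0.0,,360.0,1.0,Urban,Y
LP000002,Male,Yes,1,Graduate,No,4500,1500.0,128.0,360.0,1.0,Rural,N
LP000003,Female,Yes,0,Not Graduate,Yes,3000,0.0,66.0,360.0,,Urban,Y
"""

data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() shows the first rows so we can confirm the columns loaded as expected.
print(data.head())

# shape gives (number of rows, number of columns).
print(data.shape)
```

The last column, Loan_Status, is the label, which is what makes this a supervised problem.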
Now that we have the data, we should get to know it a little better. We can run data.info() to understand the features, or columns, of the data. Here you can see loan id, gender, married and so on, and against each of them object is written, up to self employed. Above you can see that the values in those columns are text or alphanumeric, which is why object is written for them. After that you will see int, then float for the numeric columns that follow, and the last two are again object. So the types match the values: this one is an integer, these hold float values, and the last two hold strings, which is why they are shown as object.
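A small sketch of this inspection step, using an invented four-column frame rather than the real dataset (the column names are assumptions). It shows how info() reports the type of each column and how the text columns can be listed separately from the numeric ones.

```python
import pandas as pd

# Hypothetical miniature frame with one text, one integer, one float and
# one label column, mirroring the kinds of columns described above.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4500],
    "LoanAmount": [128.0, 66.0, None],
    "Loan_Status": ["Y", "N", "Y"],
})

# info() prints each column with its non-null count and dtype.
data.info()

# Columns that are not numeric are the ones that will need encoding later.
text_cols = data.select_dtypes(exclude="number").columns.tolist()
numeric_cols = data.select_dtypes(include="number").columns.tolist()
print(text_cols)
print(numeric_cols)
```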
So we have understood the data. The columns are indexed from zero to twelve, so we have thirteen in total: twelve are input parameters, up to property area, and the thirteenth, which is index twelve counting from zero, is loan status, our output parameter. That is our understanding of the data so far. Moving further, the first thing we have to do is preprocess the data. Preparing data for analysis is the foremost thing; if you do not prepare it now, you will face difficulties later and will have to do it then anyway. So it is better to look at the data, consider everything that needs preprocessing and perform it now, so that we do not face problems when we finally use the data for predictions.
Here you will see how I do the data preprocessing. First I call data.isna() to find which values are missing, and I put .sum() on it so that I can easily see how many missing values each column has. Loan id has zero, gender has thirteen, married has three, dependents has fifteen, self employed has 32, loan amount has twenty two and credit history has fifty. So we can see how many missing values we have, and we will have to treat them. We have already discussed this, both conceptually and by way of implementation. We discussed those things module-wise, but in a project they all come together, and each column has a different demand.
In gender, for example, I cannot put a mean or a median, because it is not numeric. I have to decide: I could drop the rows if there were very few of them, but thirteen is a fair number of records, so I will not drop them. Instead I will give a fixed value. First I look at data['Gender'].value_counts(), and we can see that males are four eighty nine and females are one hundred and twelve. Males are more, so we substitute with male: data.Gender = data.Gender.fillna('Male'). After this you will not get null values in the gender column; we will check that at the end.
The next one is married. data['Married'].value_counts(), after correcting the brackets, shows that Yes is three ninety eight and No is two hundred thirteen. Yes is more, so here also we fill Yes: data.Married = data.Married.fillna('Yes'), with a capital Y. So this is done for married as well.
Similarly, next is dependents. data['Dependents'].value_counts() shows the values zero, one, two and three, and zero occurs the most. We are using simple logic and choosing the most frequent value, so data.Dependents = data.Dependents.fillna('0'). Why did I give the zero in single quotes? Because the data type of this column was shown as object, so its values are stored as text.
Next is self employed. data['Self_Employed'].value_counts() shows that five hundred are No and eighty two are Yes, so obviously we choose No because it is more. The column name is Self_Employed, with an underscore. Here we apply fillna with 'No', capital N. Now let us move ahead. After self employed come the loan amount columns, and here we look at Loan_Amount_Term.
For Loan_Amount_Term let us see the value counts. Here three sixty is the most frequent, so we pick three sixty and fill with it: data.Loan_Amount_Term = data.Loan_Amount_Term.fillna(360). So we have filled up to loan amount term. Next is credit history. Let us see data['Credit_History'].value_counts(): for 1.0 the count is four seven five, and for 0.0 it is less. So here also I will use 1.0 in fillna: data.Credit_History = data.Credit_History.fillna(1.0). So here we have filled 1.0.
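The pattern used for each column so far (count the missing values, look at the value counts, fill with the most frequent value) can be sketched on a tiny invented frame. The column names and values here are assumptions for illustration; the session types the fill value by hand after reading the counts, while this sketch picks it programmatically.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
data = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Male"],
    "Credit_History": [1.0, 0.0, 1.0, np.nan, 1.0],
})

# Number of missing values per column.
print(data.isna().sum())

for col in ["Gender", "Credit_History"]:
    # value_counts() ignores missing values by default, so idxmax()
    # returns the most frequent observed value in the column.
    most_frequent = data[col].value_counts().idxmax()
    # fillna returns a new Series; assign it back or the column is unchanged.
    data[col] = data[col].fillna(most_frequent)

# After filling, every column should report zero missing values.
print(data.isna().sum())
```

Note the assignment back to data[col]: calling fillna without assigning the result leaves the original column untouched.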
Now we check data.isna().sum() again. See, one is still remaining: loan amount. Let us see data['LoanAmount'].value_counts() first; one twenty is the most frequent, so we substitute one twenty: data.LoanAmount = data.LoanAmount.fillna(120.0). Now data.isna().sum() once more; data.head() is not required here. It still shows missing values for self employed. We had treated self employed, but we had not assigned the result back to the column, so we write data.Self_Employed = data.Self_Employed.fillna('No') and execute it. Now you can see that all the columns are done. This is how we handle missing data, which is part of preprocessing. Moving further, the complete data needs to be split. Training and testing files are already there, so first we split the data into input and output, meaning in the form of X and y.
For this we take X = data.iloc[...].values, where iloc is used for indexing. We want all records, so I put a colon for the rows, and I want columns one to eleven, so I have to write one to twelve for the columns, because the end index is not included. This goes into X. Then y is data.iloc with a colon for all records and only column twelve, since the rest are not required. Let me show you X, in which the last column is not included. X is a numpy array, so we do not write head here; I will show X directly. Here you can see X displayed. In the same manner I show y, in the form of yes and no values. So in this manner we have separated the data into training and testing. No, that we still have to do; so far we have only split the data into input and output, and now we will split it into training and testing. You already have the train and test files, and you can use that test file later for testing again. So just watch this: from sklearn.preprocessing import train_test_split. After correcting the spelling, with a small p in preprocessing, the line seems entirely OK, but train_test_split is not in preprocessing.
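The input/output split with iloc described above can be sketched like this. The four-column frame is invented; in the full dataset the output column sits at index twelve, so the slice for X is 1:12 as in the narration.

```python
import pandas as pd

# Hypothetical miniature frame: an ID column, two inputs and the output.
data = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3"],
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4500],
    "Loan_Status": ["Y", "N", "Y"],
})

# Index of the output column (this is 12 in the thirteen-column dataset).
last = data.shape[1] - 1

# All rows; columns from 1 up to but not including the output column.
# Starting at 1 skips the ID column, and the end index is excluded.
X = data.iloc[:, 1:last].values

# All rows; only the output column.
y = data.iloc[:, last].values

# .values returns numpy arrays, so there is no head() method; print directly.
print(X)
print(y)
```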
train_test_split is in model_selection, so I correct the import to sklearn.model_selection, and now you can see that it has come from model_selection. How do we use it? We write X_train, X_test, y_train, y_test = train_test_split(...). Inside train_test_split we pass X and y, and with them the test size we want, test_size = 0.20. You can do one more thing: you can also initialize random_state here with a number. So our X_train and X_test have been created. Just see, X_train looks like this. In the same manner data has been split into X_test, and the outputs have gone into y_train and y_test respectively. Now the next step of preprocessing is encoding: give zero to male and one to female, or vice versa. Encoding is necessary because columns with text like Male, Female, Yes or No are easy for us to read but difficult for the computer to process; if we give the values numbers like 0, 1, 2, 3, it becomes easy. For this, from sklearn.preprocessing we import LabelEncoder.
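The train/test split from this step can be sketched as below. The arrays are made-up placeholders for X and y, and the value 0 for random_state is an arbitrary choice; any fixed number makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical inputs: 10 records with 2 features each, and 10 labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.20 holds out 20 percent of the records for testing.
# random_state fixes the shuffle so the same split is produced every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```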
Besides LabelEncoder, if you want some other encoder you can take that, such as OneHotEncoder, and there are several others you can check. Label encoding is very commonly used; pandas also has dummies, and those are used as well. So, from sklearn.preprocessing import LabelEncoder. Now we create an object of LabelEncoder and keep it, and we use this object for encoding. We will encode both X_train and X_test.
How do we encode? We write for i in range(0, 5) and work on X_train first: on the left we take X_train[:, i], all the records of column i, so that i moves through the columns one at a time. On the right we call the object we made, with fit_transform, and pass the same data, X_train[:, i], all rows of column i. After that there is one more column to encode, so we take all rows of column ten and again assign the result of fit_transform on X_train[:, 10]. Here an i had been missed in the for loop; with that corrected, X_train is transformed.
In the same manner you can transform y as well: labelencoder_y = LabelEncoder() creates the object, and then y_train = labelencoder_y.fit_transform(y_train). So both X_train and y_train have been encoded. Now you do the same for X_test. We can take the same code and put _xt in the name, since we are doing this for the test data; everything else remains the same, but in place of X_train it becomes X_test everywhere. That is done for the test inputs, and in the same way, after y_train we do y_test.
So that was encoding. Now see how y_test looks: it no longer shows yes and no, it shows 0 and 1. In the same way, if you look at X_train, with a capital X, you will see that all the values have been converted into numerical format. This is easy for our algorithm and the computer, which can see and process them as numbers; processing text is a little cumbersome and needs a different approach, so it is necessary that we encode. After encoding, we should scale these values, because somewhere the value is three sixty and somewhere it is one or two. Scaling is also part of preprocessing. To do scaling: from sklearn.preprocessing import StandardScaler.
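The label-encoding loop described above can be sketched on a tiny invented array. The variable names labelencoder_X and the two-column range are assumptions for this example; in the session the loop covers columns 0 to 4 and then column 10.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training inputs: two text columns followed by one numeric one.
X_train = np.array([
    ["Male", "Yes", 5000],
    ["Female", "No", 3000],
    ["Male", "No", 4500],
], dtype=object)
y_train = np.array(["Y", "N", "Y"])

labelencoder_X = LabelEncoder()
for i in range(0, 2):
    # Encode all rows of column i. Classes are numbered in sorted order,
    # so Female -> 0, Male -> 1 and No -> 0, Yes -> 1.
    # fit_transform re-fits on every call, so this single object only
    # remembers the mapping of the last column it was fitted on.
    X_train[:, i] = labelencoder_X.fit_transform(X_train[:, i])

# A separate encoder for the output column: N -> 0, Y -> 1.
labelencoder_y = LabelEncoder()
y_train = labelencoder_y.fit_transform(y_train)

print(X_train)
print(y_train)
```

Keeping a separate encoder object for y makes it possible to map predictions back to the original labels later with inverse_transform.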
StandardScaler will help us scale the data. We write sc = StandardScaler(), making an object, and then scale the data through it. Now X_train, with a capital X: X_train = sc.fit_transform(X_train). So X_train has been transformed. Let us look at X_train once. Here you can see that all the values have been standardized. The data has come into a small range around zero, not strictly zero to one, since some values have gone negative; every column has been converted to small values on a similar scale. Now X_test is remaining, so let us do that too: X_test = sc.fit_transform(X_test). Let us check X_test once as well, again with a capital X. So here also the data is now scaled.
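A minimal sketch of the scaling step on invented numbers. One deliberate difference from the narration is labeled here as an assumption: the sketch fits the scaler on the training data only and then calls transform on the test data, which is a common practice so that both sets share the same scale; the session calls fit_transform on both.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: a loan term column and a credit history column.
X_train = np.array([[360.0, 1.0], [180.0, 0.0], [360.0, 1.0], [120.0, 1.0]])
X_test = np.array([[240.0, 0.0]])

sc = StandardScaler()

# fit_transform learns each column's mean and standard deviation from the
# training data and rescales it to mean 0 and standard deviation 1.
X_train_scaled = sc.fit_transform(X_train)

# transform reuses the training mean and standard deviation on the test data.
X_test_scaled = sc.transform(X_test)

print(X_train_scaled)
print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_scaled.std(axis=0))   # approximately 1 for every column
```

Because the result is centered on zero, values below the column mean come out negative, which is why the scaled data is not confined to the range zero to one.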
Now there is one more preprocessing step we can do, and that is PCA, principal component analysis. We have many columns, and that many can be heavy for computation, so we will do PCA. PCA works on explained variance: the more variance the retained components explain, the better, and it should be above sixty percent. So let us apply PCA. We will not get it in sklearn.preprocessing; we get it in decomposition, so from sklearn.decomposition import PCA. After importing, we write pca = PCA(n_components=...), where n_components specifies how many components you want. Do we want ten? No, that is not correct at all; we want fewer, so we take two. What are principal components? They are a few new features derived from the many original ones; if we have six or seven features, we bring them down to two. So this is my pca object, and I now use it on X_train: X_train = pca.fit_transform(X_train). In a similar way we transform X_test as well. Now if you look at X_train, with a capital X, you will see only two components. We had multiple features, but PCA reduces the number of features and dimensions to the number specified. So in this manner our dataset is now ready for processing. For today we will see this much only, and in the next video we will see the implementation of the algorithms themselves.
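The PCA step from this session can be sketched as follows. The random arrays are placeholders for the scaled X_train and X_test, and using transform (rather than a second fit) on the test data is an assumption of this sketch. It also prints the explained variance ratio, which is how the share of variance kept by the two components can be checked.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical scaled inputs with 6 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 6))
X_test = rng.normal(size=(10, 6))

# Keep only two principal components.
pca = PCA(n_components=2)

# Learn the components from the training data and project it onto them.
X_train_2d = pca.fit_transform(X_train)

# Project the test data onto the same components.
X_test_2d = pca.transform(X_test)

print(X_train_2d.shape)  # two columns remain
print(X_test_2d.shape)

# Fraction of the total variance explained by each kept component,
# and the combined fraction for both.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

On purely random data like this the two components explain only a modest share of the variance; on real, correlated columns the share is typically higher.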
If you have any queries or comments, click the discussion button below the video and post there. This way, you will be able to connect with fellow learners and discuss the course. Our team will also try to solve your query.