credit risk scoring model final

CREDIT RISK PREDICTION MODEL

Submitted by: Rituparna Sarkar

Outline

1. Project Objective
2. Process Approach
3. Data Source and Variables
4. Data Analysis
5. Data Pre-processing
6. Exploratory Analysis
7. Model Development
   i. Training the model
   ii. Validation
8. Conclusion & Limitations

Project Objective

To develop a prediction model to assess credit risk to borrowers

• Do all borrowers have an equal probability to default?

• Is there a way to determine risk of defaulting before processing a credit request?

• Can we classify customers into two groups, i.e. Risky and Non-Risky, based on the nature of their financial data?

• Which are the key factors to be considered to assess risk of lending to an individual based on historic data?

Process Approach

Business Understanding
1. Develop a predictive model to assess the credit risk to borrowers
2. Develop a business understanding of the data, the relationships between variables and the data sources to be used

Data Understanding
1. Get data from relevant data sources
2. Explore data for missing values, outliers and invalid data through descriptive statistics and visualization techniques
3. Understand the business relevance of outliers, missing values and invalid data and formulate the approach to treat them accordingly

Data Preparation
1. Data splitting for training and test
2. Data clean-up for missing values, outliers and invalid data
3. Data binning and imputation for outlier treatment
4. Binning independent variables as per business needs
5. Data exploration for patterns and collinearity test

Modeling
1. Develop a logistic regression model to classify customers into two groups based on credit risk probability
2. Train the model using 80% of the training data

Evaluation
1. Validate the trained model using the remaining 20% of the training data
2. If satisfied with the accuracy percentages, proceed to testing using the test dataset; otherwise go back to the previous step (modeling) and train the model again

Deployment
When satisfied with the test results, deploy the model to help the business take decisions based on the predictions given by the model

* Software Used – Excel & SPSS

Data Source and Variables

• The data source is a dataset with 2,50,000 records taken from the Kaggle website. The dataset was split into two parts: 1,50,000 cases for training and validation and the remaining 1,00,000 cases for testing the model.

• Data dictionary for variables in the dataset:

Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage
age | Age of borrower in years | integer
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years | integer
DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due | integer
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years | integer
NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer

DATA ANALYSIS

Descriptive Statistics

• There are 1,50,000 cases in the training dataset.
• Out of the 11 variables available, SeriousDlqin2yrs is the binary dependent variable for which the model has to be developed.
• MonthlyIncome has a large number of missing values. NumberOfDependents also has some missing values.
• RevolvingUtilizationOfUnsecuredLines, DebtRatio and MonthlyIncome have a high number of extreme values (outliers), as indicated by their high standard deviations.

Missing Value Analysis

• NumberOfDependents: about 2.6% of values are missing (less than 5%), hence these cases could be removed.
• MonthlyIncome: around 20% of values are missing, which is quite high, so they need to be imputed.
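The decision rule above (remove cases when under 5% of values are missing, impute otherwise) can be sketched in a few lines. This is only an illustration in Python rather than the Excel/SPSS workflow used in the project; the function names, the `drop_threshold` parameter and the toy rows are assumptions made for the example, not part of the original analysis.

```python
def missing_share(rows, column):
    """Fraction of rows in which `column` is missing (stored as None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def suggested_treatment(share, drop_threshold=0.05):
    """Apply the slide's rule of thumb: remove cases below 5% missing, else impute."""
    if share == 0:
        return "none needed"
    return "remove cases" if share < drop_threshold else "impute"


# Toy rows (hypothetical): 40 cases, MonthlyIncome missing in 8, NumberOfDependents in 1.
toy_rows = [
    {
        "MonthlyIncome": None if i % 5 == 0 else 4000.0,
        "NumberOfDependents": None if i == 0 else 1,
    }
    for i in range(40)
]

for column in ("MonthlyIncome", "NumberOfDependents"):
    share = missing_share(toy_rows, column)
    print(f"{column}: {share:.1%} missing -> {suggested_treatment(share)}")
```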

DATA PRE-PROCESSING

Data Cleaning Steps

Invalid data identified below to be removed in the Excel sheet:

• Age variable: one case showing 0.
• Variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse contain cases with values 96 and 98, which indicate ‘Don’t know’ and ‘Refused to Say’. These cases are very few in number and are common to all three variables.
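The cleaning rules above amount to a simple row filter. The sketch below expresses them in Python for illustration; the project itself did this step in Excel. The helper name `is_valid_case` and the toy rows are hypothetical; the column names follow the data dictionary.

```python
# Codes that the slides describe as 'Don't know' (96) and 'Refused to Say' (98).
SENTINEL_CODES = {96, 98}

PAST_DUE_COLUMNS = (
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
)


def is_valid_case(row):
    """Keep a case only if age is positive and no past-due count is a sentinel code."""
    if row["age"] <= 0:
        return False
    return all(row[column] not in SENTINEL_CODES for column in PAST_DUE_COLUMNS)


# Toy rows (hypothetical values, for illustration only).
toy_rows = [
    {"age": 45, "NumberOfTime30-59DaysPastDueNotWorse": 0,
     "NumberOfTime60-89DaysPastDueNotWorse": 0, "NumberOfTimes90DaysLate": 1},
    {"age": 0, "NumberOfTime30-59DaysPastDueNotWorse": 0,
     "NumberOfTime60-89DaysPastDueNotWorse": 0, "NumberOfTimes90DaysLate": 0},
    {"age": 33, "NumberOfTime30-59DaysPastDueNotWorse": 98,
     "NumberOfTime60-89DaysPastDueNotWorse": 98, "NumberOfTimes90DaysLate": 98},
]

cleaned = [row for row in toy_rows if is_valid_case(row)]
print(f"kept {len(cleaned)} of {len(toy_rows)} cases")
```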

Data Formatting in Excel

Variables RevolvingUtilizationOfUnsecuredLines and DebtRatio to be changed from General to Number format.

Imputation in SPSS:

• Imputation for missing values in MonthlyIncome.
• 5 imputations were done using all independent variables, and the results of the 5th imputation were taken for training.

Descriptive Statistics After Data Cleaning

• After data cleaning, the total number of cases is down to 145837.
• Outliers in variables DebtRatio, MonthlyIncome and RevolvingUtilizationOfUnsecuredLines to be treated through binning.

Variable Binning

Binning was done for the following variables:

• Age: variable Age_Binning containing bins for age groups:

Age Group | Bin
21-30 | 1
31-40 | 2
41-50 | 3
51-60 | 4
>60 | 5

• DebtRatio & RevolvingUtilizationOfUnsecuredLines: created variables DebtRatio_Binning and RevolvingUtilizationOfUnsecuredLines_Binning with the following cut-off values:

Group | Bin | Remark
<=0.25 | 1 | Good
0.25 - 0.50 | 2 | Low Risk
> 0.50 | 3 | High Risk

• MonthlyIncome: variable MonthlyIncome_Binning with 5 bins of roughly equal frequency (about 20% of cases each)
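The cut-offs above can be written as small lookup functions. This is a sketch in Python, not the SPSS binning used in the project. The function names are hypothetical, a value exactly on a boundary is assumed to fall in the lower bin, and rejecting ages below 21 is an assumption, since the slides do not say how such cases were handled.

```python
from bisect import bisect_left

AGE_UPPER_BOUNDS = [30, 40, 50, 60]      # bins 1-4; anything above 60 is bin 5
RATIO_UPPER_BOUNDS = [0.25, 0.50]        # bins 1-2; anything above 0.50 is bin 3
RATIO_REMARKS = {1: "Good", 2: "Low Risk", 3: "High Risk"}


def age_bin(age):
    """Map an age in years to bins 1..5 (21-30, 31-40, 41-50, 51-60, >60)."""
    if age < 21:
        # Assumption: ages below the first bin are rejected rather than binned.
        raise ValueError(f"age {age} is below the first bin (21-30)")
    return bisect_left(AGE_UPPER_BOUNDS, age) + 1


def ratio_bin(value):
    """Map DebtRatio or RevolvingUtilizationOfUnsecuredLines to (bin, remark)."""
    bin_number = bisect_left(RATIO_UPPER_BOUNDS, value) + 1
    return bin_number, RATIO_REMARKS[bin_number]


print(age_bin(45), ratio_bin(0.10), ratio_bin(0.40), ratio_bin(0.90))
```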

EXPLORATORY ANALYSIS

Exploratory Analysis (Using SPSS)

Delinquency over different categories

a) Age

Age_Binning * SeriousDlqin2yrs Crosstabulation

Age_Binning | Count: 0 | Count: 1 | Count: Total | %: 0 | %: 1 | %: Total
21 - 30 | 7374 | 940 | 8314 | 5.42% | 9.68% | 5.70%
31 - 40 | 20562 | 2285 | 22847 | 15.11% | 23.53% | 15.67%
41 - 50 | 31130 | 2828 | 33958 | 22.87% | 29.12% | 23.28%
51 - 60 | 32334 | 2213 | 34547 | 23.75% | 22.79% | 23.69%
60 + | 44725 | 1446 | 46171 | 32.86% | 14.89% | 31.66%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%

• The dependent variable is heavily imbalanced (9712 of 145837 cases have SeriousDlqin2yrs = 1). Sampling of the training dataset is required to remove bias in model development.
• The largest number of customers is in the 60+ age group.
• The 41-50 age group accounts for the largest share of delinquent cases (29.12%) and the 21-30 group for the smallest (9.68%). Within a group, however, the delinquency rate is highest for 21-30 (940 of 8314, about 11.3%) and lowest for 60+ (1446 of 46171, about 3.1%).
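The crosstab counts make it possible to separate two different readings of "delinquency by age": the rate within each age group, and each group's share of all delinquent cases. The short Python sketch below recomputes both from the counts in the Age crosstab; the function names are hypothetical.

```python
# Counts copied from the Age_Binning * SeriousDlqin2yrs crosstab:
# age group -> (cases with SeriousDlqin2yrs = 0, cases with SeriousDlqin2yrs = 1)
AGE_CROSSTAB = {
    "21-30": (7374, 940),
    "31-40": (20562, 2285),
    "41-50": (31130, 2828),
    "51-60": (32334, 2213),
    "60+": (44725, 1446),
}


def delinquency_rates(crosstab):
    """Within each group, the fraction of cases with SeriousDlqin2yrs = 1."""
    return {group: bad / (good + bad) for group, (good, bad) in crosstab.items()}


def delinquent_shares(crosstab):
    """Each group's fraction of all cases with SeriousDlqin2yrs = 1."""
    total_bad = sum(bad for _, bad in crosstab.values())
    return {group: bad / total_bad for group, (_, bad) in crosstab.items()}


rates = delinquency_rates(AGE_CROSSTAB)
shares = delinquent_shares(AGE_CROSSTAB)
for group in AGE_CROSSTAB:
    print(f"{group}: rate {rates[group]:.2%}, share of delinquent cases {shares[group]:.2%}")
```

The two measures rank the groups differently, which matters when deciding which age bands carry the most risk per customer as opposed to the most delinquent cases in total.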

Exploratory Analysis (Contd.)

b) Number of Dependents

• Around 60% of the data have 0 dependents; the delinquency count and the share of delinquent cases are also highest for this group.
• Cases with more than 3 dependents make up only about 2.6% of the data.

NumberOfDependents * SeriousDlqin2yrs Crosstabulation

NumberOfDependents | Count: 0 | Count: 1 | Count: Total | %: 0 | %: 1 | %: Total
0 | 81722 | 4992 | 86714 | 60.03% | 51.40% | 59.46%
1 | 24372 | 1921 | 26293 | 17.90% | 19.78% | 18.03%
2 | 17930 | 1571 | 19501 | 13.17% | 16.18% | 13.37%
3 | 8646 | 833 | 9479 | 6.35% | 8.58% | 6.50%
4 | 2564 | 296 | 2860 | 1.88% | 3.05% | 1.96%
5 | 677 | 68 | 745 | 0.50% | 0.70% | 0.51%
6 | 134 | 24 | 158 | 0.10% | 0.25% | 0.11%
7 | 46 | 5 | 51 | 0.03% | 0.05% | 0.03%
8 | 22 | 2 | 24 | 0.02% | 0.02% | 0.02%
9 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
10 | 5 | 0 | 5 | 0.00% | 0.00% | 0.00%
13 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
20 | 1 | 0 | 1 | 0.00% | 0.00% | 0.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%

Exploratory Analysis (Contd.)

c) Debt Ratio

DebtRatio (Binned) * SeriousDlqin2yrs Crosstabulation

DebtRatio (Binned) | Count: 0 | Count: 1 | Count: Total | %: 0 | %: 1 | %: Total
<= 0.25 | 24825 | 1472 | 26297 | 36.47% | 30.31% | 36.06%
0.26 - 0.50 | 19181 | 1256 | 20437 | 28.18% | 25.86% | 28.03%
0.51+ | 24057 | 2128 | 26185 | 35.35% | 43.82% | 35.91%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%

• Around 44% of delinquent cases come from the group with DebtRatio > 0.5.

d) RevolvingUtilizationOfUnsecuredLines

RevolvingUtilizationOfUnsecuredLines (Binned) * SeriousDlqin2yrs Crosstabulation

RevolvingUtilizationOfUnsecuredLines (Binned) | Count: 0 | Count: 1 | Count: Total | %: 0 | %: 1 | %: Total
<= 0.25 | 41954 | 912 | 42866 | 61.64% | 18.78% | 58.79%
0.26 - 0.50 | 9680 | 573 | 10253 | 14.22% | 11.80% | 14.06%
0.51+ | 16429 | 3371 | 19800 | 24.14% | 69.42% | 27.15%
Total | 68063 | 4856 | 72919 | 100.00% | 100.00% | 100.00%

• Around 69% of delinquent cases come from the group with RevolvingUtilizationOfUnsecuredLines > 0.5.


e) Monthly Income

MonthlyIncome (Binned) * SeriousDlqin2yrs Crosstabulation

MonthlyIncome (Binned) | Count: 0 | Count: 1 | Count: Total | %: 0 | %: 1 | %: Total
<= 3100.00 | 26699 | 2494 | 29193 | 19.61% | 25.68% | 20.02%
3100.01 - 5000.00 | 29083 | 2518 | 31601 | 21.36% | 25.93% | 21.67%
5000.01 - 7083.00 | 25214 | 1766 | 26980 | 18.52% | 18.18% | 18.50%
7083.01 - 10823.00 | 27435 | 1461 | 28896 | 20.15% | 15.04% | 19.81%
10823.01+ | 27694 | 1473 | 29167 | 20.34% | 15.17% | 20.00%
Total | 136125 | 9712 | 145837 | 100.00% | 100.00% | 100.00%

• More than 50% of defaulters are accounted for by the lower 40% of the income range.
• The other 3 groups have more or less the same percentage of defaulters.

Exploratory Analysis

Monthly Income vs. Other Financial Variables

All parameters below show a similar pattern, with the low income range associated with high values of the debt indicators:
i) RevolvingUtilizationOfUnsecuredLines
ii) DebtRatio
iii) NumberOfTime30-59DaysPastDueNotWorse
iv) NumberOfTimes90DaysLate
v) NumberOfTime60-89DaysPastDueNotWorse
vi) NumberOfOpenCreditLinesAndLoans
vii) NumberRealEstateLoansOrLines

Collinearity Diagnostics

• Collinearity diagnostics were run for Age against the other 9 independent variables as a sample.
• Similar diagnostics were performed for each of the 10 variables against the other variables.
• The Condition Index was always less than 15, indicating that no collinearity exists between the independent variables.
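For readers who want to see what a condition index measures, the sketch below computes it with NumPy from the singular values of a design matrix whose columns (including an intercept) are scaled to unit length. This is a common textbook formulation and is assumed here; it is not a reproduction of the SPSS output used in the project, and the function name and the synthetic data are hypothetical.

```python
import numpy as np


def condition_indices(features):
    """Condition indices of a design matrix built from `features` plus an intercept.

    Each column is scaled to unit length; the index for each dimension is the
    largest singular value divided by that dimension's singular value.
    """
    design = np.column_stack([np.ones(len(features)), features])
    design = design / np.linalg.norm(design, axis=0)
    singular_values = np.linalg.svd(design, compute_uv=False)
    return singular_values.max() / singular_values


# Synthetic data (hypothetical): three unrelated predictors, then a fourth
# that is almost an exact copy of the first.
rng = np.random.default_rng(0)
independent = rng.normal(size=(500, 3))
near_copy = independent[:, 0] + 1e-3 * rng.normal(size=500)
collinear = np.column_stack([independent, near_copy])

print("unrelated predictors, max index:", condition_indices(independent).max())
print("with a near-duplicate, max index:", condition_indices(collinear).max())
```

Low maximum indices, as reported above (below 15), are what the unrelated case looks like; a near-duplicate predictor drives the largest index far higher.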

MODEL DEVELOPMENT

Logistic Regression Model

The model is developed to classify the SeriousDlqin2yrs variable as 1 or 0:

• 1 indicates risk of defaulting
• 0 indicates no risk

As the proportion of cases with SeriousDlqin2yrs = 1 is just 6.7 % of the total, a 50:50 strata sampling approach is followed to come up with the model

Pre-processed training dataset is used to draw samples for training and validation of the model

80% random samples drawn from training dataset with equal proportion of SeriousDlqin2yrs equal to 0 and 1 and used for developing and training the model

20% random samples drawn from the same dataset with equal proportion of SeriousDlqin2yrs equal to 0 and 1 and used for validation

Final model tested using test data set given
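One way to read the sampling scheme above is: draw a 50:50 sample of the two classes, then split it 80/20 while keeping the classes balanced in both parts. The Python sketch below follows that reading; it is an illustration rather than the SPSS procedure actually used, and the function names, seeds and toy data are assumptions.

```python
import random


def balanced_sample(rows, label_key, n_per_class, seed=0):
    """Draw n_per_class rows with label 0 and n_per_class rows with label 1."""
    rng = random.Random(seed)
    sample = []
    for label in (0, 1):
        group = [row for row in rows if row[label_key] == label]
        sample.extend(rng.sample(group, n_per_class))
    rng.shuffle(sample)
    return sample


def stratified_split(rows, label_key, train_fraction=0.8, seed=0):
    """Split rows into training and validation parts, class by class."""
    rng = random.Random(seed)
    train, validation = [], []
    for label in (0, 1):
        group = [row for row in rows if row[label_key] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        validation.extend(group[cut:])
    return train, validation


# Toy data (hypothetical): 1000 cases, 7% of them with SeriousDlqin2yrs = 1.
toy_rows = [{"id": i, "SeriousDlqin2yrs": 1 if i < 70 else 0} for i in range(1000)]
balanced = balanced_sample(toy_rows, "SeriousDlqin2yrs", n_per_class=50)
train, validation = stratified_split(balanced, "SeriousDlqin2yrs")
print(len(train), "training cases,", len(validation), "validation cases")
```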

Logistic regression models were developed and compared using three different approaches:

• With binned variables (Model 1)
• Binned as in Model 1, but with missing data binned into a separate category instead of clean-up/imputation, wherever applicable (Model 2)
• A model without binning, using the variables directly (Model 3)

MODEL 1 – WITH BINNING

• The model has been developed considering business needs, and the bins have therefore been created using business cut-offs.

• In the current model, missing values for NoOfDependents, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse have been removed, as they formed 2% of the data, and missing values in MonthlyIncome have been imputed.

• RevolvingUtilizationOfUnsecuredLines and DebtRatio are percentages, so bins have been created for them. Bins have been created for the Age variable as well.

• Dummy variables were created for the categories in the binned variables, clubbing insignificant bins together for better control of the model.

• The training dataset comprised a stratified sample of 9000 records (4500 with SeriousDlqin2yrs = 1 and 4500 with SeriousDlqin2yrs = 0).

• The model comprises 10 variables, including 4 dummy variables.


Model 1 – Output

• The logit function equation for the model is:

logit = -0.595
  + 0.597 * NumberOfTime3059DaysPastDueNotWorse
  + 1.029 * NumberOfTimes90DaysLate
  + 0.072 * NumberRealEstateLoansOrLines
  + 0.862 * NumberOfTime6089DaysPastDueNotWorse
  + 0.030 * NumberOfOpenCreditLinesAndLoans
  - 0.025 * Age
  + 0.825 * RU_0_.25(1)
  + 0.689 * RU_0(1)
  - 0.783 * RU_GT_.5(1)
  + 0.129 * DebtRatio_GT0.25_0.5(1)

• A cut-off value of 0.5 gave optimal results.
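To show how the logit equation turns into a Risky / Non-Risky decision, the sketch below applies the standard logistic transformation p = 1 / (1 + exp(-logit)) and the 0.5 cut-off. The coefficients are copied from the Model 1 equation; the function names and the example customer are hypothetical. The slides do not state which category each SPSS `(1)` term stands for, so the dummy inputs must be supplied as 0/1 values coded the same way as in the fitted model.

```python
import math

INTERCEPT = -0.595
COEFFICIENTS = {
    "NumberOfTime3059DaysPastDueNotWorse": 0.597,
    "NumberOfTimes90DaysLate": 1.029,
    "NumberRealEstateLoansOrLines": 0.072,
    "NumberOfTime6089DaysPastDueNotWorse": 0.862,
    "NumberOfOpenCreditLinesAndLoans": 0.030,
    "Age": -0.025,
    # SPSS "(1)" contrast terms; their 0/1 coding is an assumption left to the caller.
    "RU_0_.25(1)": 0.825,
    "RU_0(1)": 0.689,
    "RU_GT_.5(1)": -0.783,
    "DebtRatio_GT0.25_0.5(1)": 0.129,
}


def logit_score(features):
    """Linear predictor of Model 1 for one customer (all inputs required)."""
    return INTERCEPT + sum(coef * features[name] for name, coef in COEFFICIENTS.items())


def default_probability(features):
    """Logistic transformation of the logit score."""
    return 1.0 / (1.0 + math.exp(-logit_score(features)))


def classify(features, cutoff=0.5):
    """Return 1 (Risky) if the predicted probability reaches the cut-off, else 0."""
    return 1 if default_probability(features) >= cutoff else 0


# Hypothetical customer: 40 years old, every other input set to 0.
customer = {name: 0 for name in COEFFICIENTS}
customer["Age"] = 40
print(logit_score(customer), default_probability(customer), classify(customer))
```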

Model 1 - Variables Used

• Age
• NumberOfTime3059DaysPastDueNotWorse
• NumberOfTime6089DaysPastDueNotWorse
• NumberOfTimes90DaysLate
• NumberOfOpenCreditLinesAndLoans
• NumberRealEstateLoansOrLines
• DebtRatio: dummy variable used for the range DebtRatio >= 0.25 & < 0.5
• RevolvingUtilizationOfUnsecuredLines: 3 dummy variables used: RU_0 (where RU = 0), RU_0_.25 (where RU > 0 but < 0.25) and RU_GT_.5 (where RU >= 0.5)

Observations

• MonthlyIncome was a significant variable but had a Beta coefficient of 0 and was therefore dropped from the model.
• MonthlyIncome and DebtRatio were affecting each other.
• RevolvingUtilizationOfUnsecuredLines and DebtRatio seem to be correlated.
• Although bins were created for the Age variable, all the bins contributed equally to the model, so the Age variable was used as such.
• NoOfDependents was initially thought to be a significant variable but turned out to be insignificant. Bins were created for the NoOfDependents variable, but the bins too were insignificant.

Model 1 - Validation

• The developed model was validated on a non-stratified random sample of 40% of the data (29168 records).
• Overall accuracy: 78.62%; misclassification rate: 21.38%
• Prediction accuracy for Risky (= 1) is 75.9%
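The three validation figures reported above are all derived from a 2x2 classification table. The slides do not include that table, so the sketch below uses purely hypothetical counts to show how overall accuracy, misclassification rate and the prediction accuracy for the Risky class relate to one another; the function name is also an assumption.

```python
def classification_summary(tp, fp, tn, fn):
    """Summary measures from a 2x2 classification table.

    tp: Risky predicted as Risky          fn: Risky predicted as Non-Risky
    tn: Non-Risky predicted as Non-Risky  fp: Non-Risky predicted as Risky
    """
    total = tp + fp + tn + fn
    return {
        "overall_accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "risky_accuracy": tp / (tp + fn),
    }


# Hypothetical counts, for illustration only (not the project's results).
summary = classification_summary(tp=30, fp=10, tn=50, fn=10)
for name, value in summary.items():
    print(f"{name}: {value:.2%}")
```

Overall accuracy and misclassification rate always sum to 100%, while the Risky-class accuracy can differ markedly from the overall figure on an imbalanced validation sample, which is why it is reported separately.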

Model 1 – Pros and Cons

• Missing values in about 17% of the data have been imputed and only 2% of the data has been removed, so data loss is minimal.
• The model has been developed taking into consideration widely used business cut-offs and significant parameters.
• Since the model has been built on data where missing values were treated, the accuracy of the model may drop on data where missing values are present.

Analyzing Top 10% (customers who are prone to default)

• 67.4% of defaulters are in the age group 30-50.
• 67% of defaulters had Revolving Utilization and Debt Ratio less than 0.5.
• 70.6%, 78.7% and 74% of the defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
• 70% of the defaulters had Monthly Income less than or equal to 7466 USD, and 73.3% of the defaulters did not have any dependents.

Analyzing Bottom 10% (customers who are safe)

• 80% of non-defaulters are more than 40 years of age.
• 61% of non-defaulters had Revolving Utilization and Debt Ratio less than 0.5.
• 85%, 96.9% and 97.5% of the non-defaulters made payments on time and did not go past 30 days, 60 days and 90 days respectively.
• 70% of the non-defaulters had Monthly Income less than or equal to 8366 USD, and 50.4% of the non-defaulters did not have any dependents.

MODEL 2 – CONSIDERING MISSING VALUES

• Missing values have not been imputed here; instead, an extra category has been added to the binned variables so that a missing value is treated as another category (example: NoOfDependents_Binned).

• Selection of variables has been based on the B, Exp(B) and Sig values.

• Optimal Binning has been used, based on the SeriousDlqin2yrs variable.


Model 2 – Output

• Final model:

logit = 1.311 * Age_1
  + 1.107 * Age_2
  + 0.898 * Age_3
  + 0.479 * Age_4
  + 1.802 * NoOf30_1
  + 2.971 * NoOf30_2
  + 3.445 * NoOf30_3
  + 3.858 * NoOf30_4
  + 4.001 * NoOf30_5
  - 1.784 * NoOf60_1
  - 0.362 * NoOf60_2
  - 3.125 * NoOf90_1
  - 1.311 * NoOf90_2
  - 0.549 * NoOf90_3
  + 1.442

• Training set: stratified sample of 4000 records with SeriousDlqin2yrs = 1 and another 4000 with SeriousDlqin2yrs = 0

• A cut-off value of 0.4 gave optimal results.

Model 2 - Variables Used

• Age_OptimalBin
• NumberOfTime3059DaysPastDueNotWorse_OptimalBin
• NumberOfTime6089DaysPastDueNotWorse_OptimalBin
• NumberOfTimes90DaysLate_OptimalBin

Possible reasons why the other variables are not significant:

• Age has a non-linear relationship with MonthlyIncome.
• The other 3 variables in the equation are indicators of the number of defaults committed by the customer, which is related to NumberOfOpenCreditLinesAndLoans and RevolvingUtilizationOfUnsecuredLines.
• MonthlyIncome will affect the DebtRatio.

Model 2 - Validation

• Multiple test runs were performed on different sample sizes.

• The validation results below are for a random sample of 90000 records.

• Overall accuracy: 72.62%; misclassification: 27.37%

• Risky (= 1) prediction accuracy of 75.1%

Model 2 – Pros and Cons

• Capable of handling missing values (including the codes 98 and 96)
• Intermediate processing required is minimal (only binning required)
• The model uses only 4 variables
• Optimal binning is used, not the industry standard binning

Other insights

• Analyzing top 10% (most risky customer segment):
  84% of the customers are below 56 years of age.
  72% have 1 or more past 30 days defaults.

• Analyzing bottom 10% (safest customer segment):
  All of them are 64 years of age or above.
  Almost all of them have 0 defaults under any case.

MODEL 3 – USING VARIABLES DIRECTLY

• The final model has the following equation:

logit = 0.754
  + 0.031 * Age
  + 0.766 * NumberOfTime3059DaysPastDueNotWorse
  + 1.179 * NumberOfTime6089DaysPastDueNotWorse
  + 1.417 * NumberOfTimes90DaysLate

• This model is the simplest, but business considerations were not accounted for, hence robustness on deployment cannot be assured.

• It cannot handle missing values.


CONCLUSION & LIMITATIONS

• Model 1 and Model 2 give similar accuracy levels. Model 3 is not recommended. The choice of final model is left to the business, based on the pros and cons mentioned.

• These models are to be further validated for scalability and robustness.

• The test dataset given did not have delinquency values; hence, after validation with 20% random samples from the training dataset, no further accuracy check could be performed on a totally new set of data using the test dataset.

• Assumptions made in binning the financial variables could change the significance of different variables in the final model. This aspect is to be validated with the business.

THANK YOU
