Kaggle "Give Me Some Credit" challenge overview
DESCRIPTION
Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/
TRANSCRIPT
Predicting delinquency on debt
What is the problem?
• X Store has a retail credit card available to customers
• There can be a number of sources of loss from this product, but one is customers defaulting on their debt
• This prevents the store from collecting payment for products and services rendered
Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
• If only 5% of their carried debt was on the store credit card, this is potentially an:
• Average loss of $8.12 per customer
• Potential overall loss of $1.2 million
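The figures above can be sanity-checked with a few lines of Python (the 6.6% rate, 5% share, and $8.12 average are the deck's numbers, not independently derived):

```python
customers = 150_000
delinquency_rate = 0.066           # 6.6% seriously delinquent in the last two years
avg_loss_per_customer = 8.12       # deck's estimate, assuming ~5% of carried debt is on the store card

delinquent_customers = delinquency_rate * customers
total_loss = customers * avg_loss_per_customer

print(f"{delinquent_customers:,.0f} delinquent customers")  # 9,900
print(f"${total_loss:,.0f} potential overall loss")         # $1,218,000, i.e. ~$1.2 million
```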
What can be done?
• There are numerous models that can be used to predict which customers will default
• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
• Or better screen which customers are approved for the card
How will I do this?
• This is a basic classification problem with important business implications
• We’ll examine a few simplistic models to get an idea of performance
• Then explore decision tree methods to achieve better performance
What will the models use to predict delinquency?
Each customer has a number of attributes:
John Smith (Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4)
Mary Rasmussen (Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2)
...
We will use the customer attributes to predict whether they were delinquent
How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:
Train: 150,000 customers, delinquency included in the dataset
Test: 101,000 customers, delinquency not included in the dataset
None of the customers in the test dataset are used to train the model
Internally we validate our model performance with cross-fold validation
Using only the train dataset we can get a sense of how well our model performs without externally validating it
(Figure: the train dataset is split into folds Train 1, Train 2, and Train 3; the algorithm trains on Train 1 and Train 2, then is tested on the held-out Train 3)
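The fold scheme above can be sketched with scikit-learn (the data here is a synthetic stand-in for the customer attributes and delinquency labels, not the Kaggle set):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                          # stand-in customer attributes
y = (X[:, 0] + rng.normal(size=300) > 1).astype(int)   # stand-in delinquency labels

# cv=3: each fold is held out once for testing while the other two train the model
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=3)
print(scores.mean())
```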
What matters is how well we can predict the test dataset
We judge this using the accuracy, which is the number of correct predictions out of the total number of predictions made
So with 100,000 customers and an 80% accuracy we will have correctly predicted whether 80,000 customers will default or not in the next two years
Putting accuracy in context
We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their account to prevent it
The potential loss is minimized by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
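The ~$8,000-per-point figure follows from the earlier per-customer loss estimate:

```python
customers = 100_000
avg_loss_per_customer = 8.12   # from the earlier slide

# One percentage point of accuracy = 1,000 more customers classified correctly
saving_per_point = 0.01 * customers * avg_loss_per_customer
print(saving_per_point)        # ~8120, i.e. ~$8,000 per percentage point
```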
Looking at the actual data
(Figures: distributions of the customer attributes in the training data; missing values are replaced with assumed values of $2,500 and 0)
There is a continuum of algorithmic choices to tackle the problem, from simpler and quicker to more complex and slower:
Random Chance: 50%
Simple Classification
For simple classification we pick a single attribute and find the best split in the customers
(Figure: histogram of number of customers versus times past due; candidate split points 1, 2, ... divide customers into true positives, true negatives, false positives, and false negatives)
We evaluate possible splits using accuracy, precision, and sensitivity
Acc = number correct / total number
Prec = true positives / number of people predicted delinquent
Sens = true positives / number of people actually delinquent
(Figure: accuracy, precision, and sensitivity versus the split point on number of times 30-59 days past due, 0 to 100)
The best split achieves 0.61 KGI on the test set
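Scoring one candidate split can be sketched in a few lines of Python (the counts and threshold below are illustrative, not the real data):

```python
import numpy as np

past_due = np.array([0, 1, 3, 0, 2, 0, 5, 1, 0, 4])    # times 30-59 days past due
delinquent = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 1])  # ground-truth labels

threshold = 2                                # predict delinquent if past_due >= threshold
pred = (past_due >= threshold).astype(int)

tp = np.sum((pred == 1) & (delinquent == 1))
accuracy = np.mean(pred == delinquent)       # number correct / total number
precision = tp / pred.sum()                  # TP / predicted delinquent
sensitivity = tp / delinquent.sum()          # TP / actually delinquent
print(accuracy, precision, sensitivity)      # 0.8 0.75 0.75
```

Sweeping `threshold` over the observed values and recording these three metrics reproduces the curves in the figure.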
However, not all fields are as informative
Using the number of times past due 60-89 days we achieve a KGI of 0.5
The approach is naive and could be improved but our time is better spent on different algorithms
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests
A random forest starts from a decision tree
Starting from the customer data, find the best split in a set of randomly chosen attributes
Example: Is age < 30? No: 75,000 customers age 30 or over; Yes: 25,000 customers under 30
The process repeats on each branch (...)
A random forest is composed of many decision trees
(Figure: many trees, each repeatedly splitting the customer data at its best split into "Yes" and "No" subsets)
Class assignment of a customer is based on how many of the decision trees "vote" for each class
We use a large number of trees to avoid over-fitting to the training data
The Random Forest algorithm is easily implemented
In Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
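As a sketch of the Python route, using scikit-learn's RandomForestClassifier on synthetic stand-in data (the feature construction is illustrative, not the Kaggle dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))   # stand-in customer attributes
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of customers and a random
# subset of attributes at each split; classes are decided by vote
model = RandomForestClassifier(n_estimators=150, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```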
A random forest performs well on the test set
Random Forest: 10 trees: 0.779 KGI; 150 trees: 0.843 KGI; 1000 trees: 0.850 KGI
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, and random forests)
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting
Boosting Trees is similar to a Random Forest
Each tree splits the customer data (e.g. Is age < 30? Yes / No), but it does an exhaustive search for the best split rather than searching only a randomly chosen subset of attributes
How Gradient Boosting Trees differs from Random Forest
The first tree is optimized to minimize a loss function describing the data
The next tree is then optimized to fit whatever variability the first tree didn't fit
This is a sequential process, in comparison to the random forest
We also run the risk of over-fitting to the data, hence the learning rate, which shrinks each tree's contribution
Implementing Gradient Boosted Trees
In Python or R it is easy for initial testing and validation
There are implementations that use Hadoop but it’s more complicated to achieve the best performance
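A sketch of the Python route, again on synthetic stand-in data (scikit-learn's GradientBoostingClassifier; the `learning_rate` parameter is the shrinkage discussed above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))   # stand-in customer attributes
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are fit sequentially, each to the residual error of the ones
# before it; learning_rate damps each tree's contribution
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```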
Gradient Boosting Trees performs well on the dataset
100 trees, 0.1 learning rate: 0.865022 KGI
1000 trees, 0.1 learning rate: 0.865248 KGI
(Figure: KGI, 0.75 to 0.85, versus learning rate, 0 to 0.8)
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, random forests, and boosting trees)
Moving one step further in complexity, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method
Or more accurately, an ensemble of ensemble methods
Algorithm progression: Random Forest, Extremely Random Forest, Gradient Tree Boosting
Each algorithm produces a column of train-data probabilities (e.g. 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...)
Train Data Probabilities
0.10.50.010.80.7...
0.150.60.00.750.68
.
.
.
Combine all of the model information
Train Data Probabilities
0.10.50.010.80.7...
0.150.60.00.750.68
.
.
.
Optimize the set of train probabilities to the known delinquencies
Combine all of the model information
Each model contributes a column of train-data probabilities
Optimize the set of train probabilities against the known delinquencies
Apply the same weighting scheme to the set of test-data probabilities
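A minimal sketch of this blending on synthetic stand-in data: out-of-fold train probabilities from each ensemble become features for a blender fit to the known labels. (Using logistic regression as the weighting scheme is an assumption for illustration; the deck does not specify the optimizer.)

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 5))    # stand-in customer attributes
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=600) > 0.7).astype(int)

models = [RandomForestClassifier(n_estimators=50, random_state=0),
          ExtraTreesClassifier(n_estimators=50, random_state=0),
          GradientBoostingClassifier(n_estimators=50, random_state=0)]

# Out-of-fold probabilities: each customer's probability comes from a model
# that never saw that customer during training
train_probs = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1] for m in models])

# Optimize weights against the known delinquencies; the same weights
# would then be applied to the test-data probabilities
blender = LogisticRegression().fit(train_probs, y)
print(blender.score(train_probs, y))
```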
Implementation can be done in a number of ways
Testing in Python or R is slower, due to the sequential nature of applying the algorithms
It could be made faster by parallelizing, running each algorithm separately and combining the results
Assessing model performance
Blending performance, 100 trees: 0.864394 KGI
But this performance, and the possibility of additional gains, comes at a distinct time cost
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, random forests, boosting trees, and the blended method)
Examining the continuum of choices, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: 0.864
What would be best to implement?
There is a large amount of optimization of the blended method that could still be done
However, this algorithm takes the longest to run, and this constraint will also apply in testing and validation
Random Forests returns a reasonably good result; it is quick and easily parallelized
Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized
Increases in predictive performance have real business value
Using any of the more complex algorithms, we achieve an increase of 35 percentage points over random chance
Potential decrease of ~$420k in losses by identifying customers likely to default in the training set alone
Thank you for your time