Kaggle "Give Me Some Credit" challenge overview
DESCRIPTION
Full description of the work associated with this project can be found at: http://www.npcompleteheart.com/project/kaggle-give-me-some-credit/
TRANSCRIPT
Predicting delinquency on debt
What is the problem?
• X Store has a retail credit card available to customers
• There can be a number of sources of loss from this product, but one is customers defaulting on their debt
• This prevents the store from collecting payment for products and services rendered
Is this problem big enough to matter?
• Examining a slice of the customer database (150,000 customers), we find that 6.6% of customers were seriously delinquent in payment in the last two years
• If only 5% of their carried debt was on the store credit card, this is potentially an:
• Average loss of $8.12 per customer
• Potential overall loss of $1.2 million
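The figures above can be sanity-checked with a few lines of Python (the 6.6% rate, 5% share, and $8.12 average are the deck's numbers, not independently derived):

```python
customers = 150_000
delinquency_rate = 0.066           # 6.6% seriously delinquent in the last two years
avg_loss_per_customer = 8.12       # deck's estimate, assuming ~5% of carried debt is on the store card

delinquent_customers = delinquency_rate * customers
total_loss = customers * avg_loss_per_customer

print(f"{delinquent_customers:,.0f} delinquent customers")  # 9,900
print(f"${total_loss:,.0f} potential overall loss")         # $1,218,000, i.e. ~$1.2 million
```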
What can be done?
• There are numerous models that can be used to predict which customers will default
• This could be used to decrease credit limits or cancel credit lines for current risky customers to minimize potential loss
• Or better screen which customers are approved for the card
How will I do this?
• This is a basic classification problem with important business implications
• We’ll examine a few simplistic models to get an idea of performance
• Then explore decision tree methods to achieve better performance
What will the models use to predict delinquency?
Each customer has a number of attributes:
John Smith (Delinquent: Yes, Age: 23, Income: $1,600, Number of Lines: 4)
Mary Rasmussen (Delinquent: No, Age: 73, Income: $2,200, Number of Lines: 2)
...
We will use the customer attributes to predict whether they were delinquent
How do we make sure that our solution actually has predictive power?
We have two slices of the customer dataset:
Train: 150,000 customers, delinquency included in the dataset
Test: 101,000 customers, delinquency not included in the dataset
None of the customers in the test dataset are used to train the model
Internally we validate our model performance with cross-fold validation
Using only the train dataset we can get a sense of how well our model performs without externally validating it
(Figure: the train dataset is split into folds Train 1, Train 2, and Train 3; the algorithm trains on Train 1 and Train 2, then is tested on the held-out Train 3)
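The fold scheme above can be sketched with scikit-learn (the data here is a synthetic stand-in for the customer attributes and delinquency labels, not the Kaggle set):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                          # stand-in customer attributes
y = (X[:, 0] + rng.normal(size=300) > 1).astype(int)   # stand-in delinquency labels

# cv=3: each fold is held out once for testing while the other two train the model
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=3)
print(scores.mean())
```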
What matters is how well we can predict the test dataset
We judge this using the accuracy, which is the number of correct predictions out of the total number of predictions made
So with 100,000 customers and an 80% accuracy we will have correctly predicted whether 80,000 customers will default or not in the next two years
Putting accuracy in context
We could save $600,000 over two years if we correctly predicted 50% of the customers that would default and changed their account to prevent it
The potential loss is minimized by ~$8,000 for every 100,000 customers with each percentage point increase in accuracy
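The ~$8,000-per-point figure follows from the earlier per-customer loss estimate:

```python
customers = 100_000
avg_loss_per_customer = 8.12   # from the earlier slide

# One percentage point of accuracy = 1,000 more customers classified correctly
saving_per_point = 0.01 * customers * avg_loss_per_customer
print(saving_per_point)        # ~8120, i.e. ~$8,000 per percentage point
```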
Looking at the actual data
(Figures: distributions of the customer attributes in the training data; missing values are replaced with assumed values of $2,500 and 0)
There is a continuum of algorithmic choices to tackle the problem, from simpler and quicker to more complex and slower:
Random Chance: 50%
Simple Classification
For simple classification we pick a single attribute and find the best split in the customers
(Figure: histogram of number of customers versus times past due; candidate split points 1, 2, ... divide customers into true positives, true negatives, false positives, and false negatives)
We evaluate possible splits using accuracy, precision, and sensitivity
Acc = number correct / total number
Prec = true positives / number of people predicted delinquent
Sens = true positives / number of people actually delinquent
(Figure: accuracy, precision, and sensitivity versus the split point on number of times 30-59 days past due, 0 to 100)
The best split achieves 0.61 KGI on the test set
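Scoring one candidate split can be sketched in a few lines of Python (the counts and threshold below are illustrative, not the real data):

```python
import numpy as np

past_due = np.array([0, 1, 3, 0, 2, 0, 5, 1, 0, 4])    # times 30-59 days past due
delinquent = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 1])  # ground-truth labels

threshold = 2                                # predict delinquent if past_due >= threshold
pred = (past_due >= threshold).astype(int)

tp = np.sum((pred == 1) & (delinquent == 1))
accuracy = np.mean(pred == delinquent)       # number correct / total number
precision = tp / pred.sum()                  # TP / predicted delinquent
sensitivity = tp / delinquent.sum()          # TP / actually delinquent
print(accuracy, precision, sensitivity)      # 0.8 0.75 0.75
```

Sweeping `threshold` over the observed values and recording these three metrics reproduces the curves in the figure.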
However, not all fields are as informative
Using the number of times past due 60-89 days we achieve a KGI of 0.5
The approach is naive and could be improved but our time is better spent on different algorithms
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests
A random forest starts from a decision tree
Starting from the customer data, find the best split in a set of randomly chosen attributes
Example: Is age < 30? No: 75,000 customers age 30 or over; Yes: 25,000 customers under 30
The process repeats on each branch (...)
A random forest is composed of many decision trees
(Figure: many trees, each repeatedly splitting the customer data at its best split into "Yes" and "No" subsets)
Class assignment of a customer is based on how many of the decision trees "vote" for each class
We use a large number of trees to avoid over-fitting to the training data
The Random Forest algorithm is easily implemented
In Python or R for initial testing and validation
It can also be parallelized with Mahout and Hadoop, since there is no dependence from one tree to the next
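As a sketch of the Python route, using scikit-learn's RandomForestClassifier on synthetic stand-in data (the feature construction is illustrative, not the Kaggle dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))   # stand-in customer attributes
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of customers and a random
# subset of attributes at each split; classes are decided by vote
model = RandomForestClassifier(n_estimators=150, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```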
A random forest performs well on the test set
Random Forest: 10 trees: 0.779 KGI; 150 trees: 0.843 KGI; 1000 trees: 0.850 KGI
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, and random forests)
Exploring algorithmic choices further, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting
Boosting Trees is similar to a Random Forest
Each tree splits the customer data (e.g. Is age < 30? Yes / No), but it does an exhaustive search for the best split rather than searching only a randomly chosen subset of attributes
How Gradient Boosting Trees differs from Random Forest
The first tree is optimized to minimize a loss function describing the data
The next tree is then optimized to fit whatever variability the first tree didn't fit
This is a sequential process, in comparison to the random forest
We also run the risk of over-fitting to the data, hence the learning rate, which shrinks each tree's contribution
Implementing Gradient Boosted Trees
In Python or R it is easy for initial testing and validation
There are implementations that use Hadoop but it’s more complicated to achieve the best performance
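A sketch of the Python route, again on synthetic stand-in data (scikit-learn's GradientBoostingClassifier; the `learning_rate` parameter is the shrinkage discussed above):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))   # stand-in customer attributes
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are fit sequentially, each to the residual error of the ones
# before it; learning_rate damps each tree's contribution
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```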
Gradient Boosting Trees performs well on the dataset
100 trees, 0.1 learning rate: 0.865022 KGI
1000 trees, 0.1 learning rate: 0.865248 KGI
(Figure: KGI, 0.75 to 0.85, versus learning rate, 0 to 0.8)
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, random forests, and boosting trees)
Moving one step further in complexity, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method
Or more accurately, an ensemble of ensemble methods
Algorithm progression: Random Forest, Extremely Random Forest, Gradient Tree Boosting
Each algorithm produces a column of train-data probabilities (e.g. 0.1, 0.5, 0.01, 0.8, 0.7, ... and 0.15, 0.6, 0.0, 0.75, 0.68, ...)
Train Data Probabilities
0.10.50.010.80.7...
0.150.60.00.750.68
.
.
.
Combine all of the model information
Train Data Probabilities
0.10.50.010.80.7...
0.150.60.00.750.68
.
.
.
Optimize the set of train probabilities to the known delinquencies
Combine all of the model information
Each model contributes a column of train-data probabilities
Optimize the set of train probabilities against the known delinquencies
Apply the same weighting scheme to the set of test-data probabilities
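A minimal sketch of this blending on synthetic stand-in data: out-of-fold train probabilities from each ensemble become features for a blender fit to the known labels. (Using logistic regression as the weighting scheme is an assumption for illustration; the deck does not specify the optimizer.)

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 5))    # stand-in customer attributes
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=600) > 0.7).astype(int)

models = [RandomForestClassifier(n_estimators=50, random_state=0),
          ExtraTreesClassifier(n_estimators=50, random_state=0),
          GradientBoostingClassifier(n_estimators=50, random_state=0)]

# Out-of-fold probabilities: each customer's probability comes from a model
# that never saw that customer during training
train_probs = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1] for m in models])

# Optimize weights against the known delinquencies; the same weights
# would then be applied to the test-data probabilities
blender = LogisticRegression().fit(train_probs, y)
print(blender.score(train_probs, y))
```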
Implementation can be done in a number of ways
Testing in Python or R is slower, due to the sequential nature of applying the algorithms
It could be made faster by parallelizing, running each algorithm separately and combining the results
Assessing model performance
Blending performance, 100 trees: 0.864394 KGI
But this performance, and the possibility of additional gains, comes at a distinct time cost
(Figure: bar chart of accuracy, 0.4 to 0.9, for random chance, simple classification, random forests, boosting trees, and the blended method)
Examining the continuum of choices, from simpler and quicker to more complex and slower:
Random Chance: 0.50
Simple Classification: 0.50-0.61
Random Forests: 0.78-0.85
Gradient Tree Boosting: 0.71-0.8659
Blended Method: 0.864
What would be best to implement?
There is a large amount of optimization of the blended method that could still be done
However, this algorithm takes the longest to run, and this constraint will also apply in testing and validation
Random Forests returns a reasonably good result; it is quick and easily parallelized
Gradient Tree Boosting returns the best result and runs reasonably fast, though it is not as easily parallelized
Increases in predictive performance have real business value
Using any of the more complex algorithms, we achieve an increase of 35 percentage points over random chance
Potential decrease of ~$420k in losses by identifying customers likely to default in the training set alone
Thank you for your time