wharton department of statistics profiting from data mining bob stine department of statistics the...
Post on 19-Dec-2015
217 views
TRANSCRIPT
WhartonDepartment of Statistics
Profiting from Data MiningProfiting from Data Mining
Bob Stine
Department of Statistics
The Wharton School, Univ of Pennsylvania
April 5, 2002
www-stat.wharton.upenn.edu/~bob
WhartonDepartment of Statistics
2
OverviewOverview Critical stages of data mining process
- Choosing the right data, people, and problems
- Modeling
- Validation
Automated modeling- Feature creation and selection
- Exploiting expert knowledge, “insights”
Applications- Little detail – Biomedical: finding predictive risk factors
- More detail – Financial: predicting returns on the market
- Lots of detail – Credit: anticipating the onset of bankruptcy
WhartonDepartment of Statistics
3
Predicting Health RiskPredicting Health Risk
Who is at risk for a disease?- Example: detect osteoporosis without expense of x-ray
Goals- Improving public health- Savings on medical care- Confirm an informal model with data mining
Many types of features, interested groups- Clinical observations of doctors- Laboratory measurements, “genetic”- Self-reported behavior
Missing data
WhartonDepartment of Statistics
4
Predicting the Stock MarketPredicting the Stock Market Small, “hands-on” example
Goals- Better retirement savings?
- Money for that special vacation? College?
- Trade-offs: risk vs return
Lots of “free” data- Access to accurate historical time trends, macro factors
- Recent data more useful than older data
“Simple” modeling technique
Validation
WhartonDepartment of Statistics
5
Predicting the Market: SpecificsPredicting the Market: Specifics
Build a regression model - Response is return on the value-weighted S&P
- Use standard forward/backward stepwise
- Battery of 12 predictors with interactions
Train the model during 1992-1996 (training data)- Model captures most of variation in 5 years of returns
- Retain only the most significant features (Bonferroni)
Predict returns in 1997 (validation data)
Another version in Foster, Stine & Waterman
WhartonDepartment of Statistics
6
Historical patterns?Historical patterns?
vwReturn
-0.06
-0.04
-0.02
0.00
0.02
0.04
0.06
0.08
92 93 94 95 96 97 98
Year
?
WhartonDepartment of Statistics
7
Fitted model predicts...Fitted model predicts...
-0.05
-0.00
0.05
0.10
0.15
92 93 94 95 96 97 98
Year
Exceptional Feb return?
WhartonDepartment of Statistics
8
What happened?What happened?
Pred Error
-0.15
-0.10
-0.05
-0.00
0.05
0.10
92 93 94 95 96 97 98
Year
Training Period
WhartonDepartment of Statistics
9
Claimed versus Actual ErrorClaimed versus Actual Error
20
40
60
80
100
120
0 10 20 30 40 50 60 70 80 90 100
Complexity of Model
Actual
Claimed
SquaredPrediction
Error
WhartonDepartment of Statistics
10
Over-confidence?Over-confidence?
Over-fitting- Model fits the training data too well –
better than it can predict the future.
- Greedy fitting procedure “Optimization capitalizes on chance”
Some intuition - Coincidences
• Cancer clusters, the “birthday problem”
- Illustration with an auction
• What is the value of the coins in this jar?
WhartonDepartment of Statistics
11
Auctions and Over-fittingAuctions and Over-fitting
What is the value of these coins?
WhartonDepartment of Statistics
12
Auctions and Over-fitting Auctions and Over-fitting
Auction jar of coins to a class of MBA students
Histogram shows the bids of 30 students
Most were suspicious, but a few were not!
Actual value is $3.85
Known as “Winner’s Curse”
Similar to over-fitting:best model like high bidder
1
2
3
4
5
6
7
8
9
WhartonDepartment of Statistics
13
Profiting from data mining? Profiting from data mining?
Where’s the profit in this?- “Mining the miners” vs getting value from your data
- Lost opportunities
Importance of domain knowledge
Validation as a measure of success- Prediction provides an explicit check
- Does your application predict something?
WhartonDepartment of Statistics
14
Pitfalls and Role of ManagementPitfalls and Role of Management
Over-fitting is dominated by other issues…
Management support-Life in silos
-Coordination across domains
Responsibility and reward-Accountability
-Who gets the credit when it succeeds?Who suffers if the project is not successful?
WhartonDepartment of Statistics
15
Specific PotholesSpecific Potholes
Moving targets-“Let’s try this with something else.”
Irrational expectations-“I could have done better than that.”
Not with my data-“It’s our data. You can’t use it.”-“You did not use our data properly.”
WhartonDepartment of Statistics
16
Back to a real application…Back to a real application…
Emphasis on the statistical issues…
WhartonDepartment of Statistics
17
Predicting BankruptcyPredicting Bankruptcy
Goal- Reduce losses stemming from personal bankruptcy
Possible strategies- If can identify those with highest risk of bankruptcy…
Take some action
• Call them for a “friendly chat” about circumstances
• Unilaterally reduce credit limit
Trade-off- Good customers borrow lots of money
- Bad customers also borrow lots of money
WhartonDepartment of Statistics
18
Predicting BankruptcyPredicting Bankruptcy
“Needle in a haystack”- 3,000,000 months of credit-card activity
- 2244 bankruptcies
- Simple predictor that all are OK looks pretty good.
What factors anticipate bankruptcy?- Spending patterns? Payment history?
- Demographics? Missing data?
- Combinations of factors?
• Cash Advance + Las Vegas = Problem
We consider more than 100,000 predictors!
WhartonDepartment of Statistics
19
Modeling: Predictive ModelsModeling: Predictive Models
Build the modelIdentify patterns in training data that predict future observations.- Which features are real? Coincidental?
Evaluate the modelHow do you know that it works?- During the model construction phase
• Only incorporate meaningful features
- After the model is built
• Validate by predicting new observations
WhartonDepartment of Statistics
20
Are all prediction errors the same?Are all prediction errors the same?
Symmetry- Is over-predicting as costly as under-predicting?
- Managing inventories and sales
- Visible costs versus hidden costs
Does a false positive = a false negative?- Classification in data mining
- Credit modeling, flagging “risky” customers
- False positive: call a good customer “bad”
- False negative: fail to identify a “bad”
- Differential costs for different types of errors
WhartonDepartment of Statistics
21
Building a Predictive ModelBuilding a Predictive Model
So many choices… Structure: What type of model?
• Neural net• CART, classification tree• Additive model or regression spline
Identification: Which features to use?• Time lags, “natural” transformations• Combinations of other features
Search: How does one find these features?• Brute force has become cheap.
WhartonDepartment of Statistics
22
Our ChoicesOur Choices Structure
- Linear regression with nonlinearity via interactions
- All 2-way and some 3-way, 4-way interactions
- Missing data handled with indicators
Identification- Conservative standard error
- Comparison of conservative t-ratio to adaptive threshold
Search- Forward stepwise regression
- Coming: Dynamically changing list of features
• Good choice affects where you search next.
WhartonDepartment of Statistics
23
Identifying Predictive FeaturesIdentifying Predictive Features
Classical problem of “variable selection”
Thresholding methods (compare t-ratio to threshold)- Akaike information criterion (AIC)
- Bayes information criterion (BIC)
- Hard thresholding and Bonferroni
Arguments for adaptive thresholds- Empirical Bayes
- Information theory
- Step-up/step-down tests
WhartonDepartment of Statistics
24
Adaptive ThresholdingAdaptive Thresholding
Threshold changes to conform to attributes of data- Easier to add features as more are found.
Threshold for first predictor- Compare conservative t-ratio to Bonferroni.
- Bonferroni is about Sqrt(2 log p)
- If something significant is found, continue.
Threshold for second predictor- Compare t-ratio to reduced threshold
- New threshold is about Sqrt(2 log p/2)
WhartonDepartment of Statistics
25
Adaptive Thresholding: BenefitsAdaptive Thresholding: Benefits
EasyAs easy and fast as implementing the standard criterion that is used in stepwise regression.
TheoryResulting model provably as good as best Bayes model for the problem at hand.
Real worldIt works! Finds models with real signal, and stops when the signal runs out.
WhartonDepartment of Statistics
26
Bankruptcy Model: ConstructionBankruptcy Model: Construction
Data: reserve 80% for validation- Training data
• 600,000 months
• 458 bankruptcies
- Validation data
• 2,400,000 months
• 1786 bankruptcies
Selection via adaptive thresholding - Compare sequence of t-statistics to Sqrt(2 log p/q)
- Dynamic expansion of feature space
WhartonDepartment of Statistics
27
Bankruptcy Model: PreviewBankruptcy Model: Preview
Predictors- Initial search identifies 39
• Validation SS monotonically falls to 1650
• Linear fit can do no better than 1735
- Expanded search of higher interactions finds a bit more
• Nature of predictors comprising the interactions
• Validation SS drops 10 more
Validation: Lift chart- Top 1000 candidates have 351 bankrupt
More validation: Calibration- Close to actual Pr(bankrupt) for most groups.
WhartonDepartment of Statistics
28
Bankruptcy Model: FittingBankruptcy Model: Fitting
Where should the fitting process be stopped?
Residual Sum of Squares
400
410420
430
440
450460
470
0 50 100 150
Number of Predictors
SS
WhartonDepartment of Statistics
29
Bankruptcy Model: FittingBankruptcy Model: Fitting
Our adaptive selection procedure stops at a model with 39 predictors.
Residual Sum of Squares
400
410
420
430
440
450
460
470
0 50 100 150
Number of Predictors
SS
WhartonDepartment of Statistics
30
Bankruptcy Model: ValidationBankruptcy Model: Validation The validation indicates that the fit gets better while
the model expands. Avoids over-fitting.
Validation Sum of Squares
1640
1680
1720
1760
0 50 100 150
Number of Predictors
SS
WhartonDepartment of Statistics
31
Bankruptcy Model: Linear?Bankruptcy Model: Linear? Choosing from linear predictors (no interactions) does
not match the performance of the full search.
Validation Sum of Squares
1640
1680
1720
1760
0 50 100 150
Number of Predictors
SS
Linear Quadratic
WhartonDepartment of Statistics
32
Bankruptcy Model: More?Bankruptcy Model: More? Searching higher-order interactions offers modest
improvement.
Validation Sum of Squares
1640
1680
0 20 40 60
Number of Predictors
SS
Quadratic Cubic
WhartonDepartment of Statistics
33
Lift ChartLift Chart
Measures how well model classifies sought-for group
Depends on rule used to label customers - Very high threshold
Lots of lift, but few bankrupt customers are found.
- Lower thresholdLift drops, but finds more bankrupt customers.
data allin bankrupt %
selectionDMinbankrupt %=Lift
WhartonDepartment of Statistics
34
Generic Lift ChartGeneric Lift Chart
%Responders
0.0
0.2
0.4
0.6
0.8
1.0
0 10 20 30 40 50 60 70 80 90 100
% Chosen
Model
Random
WhartonDepartment of Statistics
35
Bankruptcy Model: LiftBankruptcy Model: Lift Much better than diagonal!
0
25
50
75
100
0 25 50 75 100
% Contacted
% Found
WhartonDepartment of Statistics
36
CalibrationCalibration
Classifier assigns Prob(“BR”)rating to a customer.
Weather forecast
Among those classified as 2/10 chance of “BR”, how many are BR?
Closer to diagonal is better.
Actual
0
25
50
75
100
10 20 30 40 50 60 70 80 90
WhartonDepartment of Statistics
37
Bankruptcy Model: CalibrationBankruptcy Model: Calibration Over-predicts risk above claimed probability 0.4
Calibration Chart
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8
Claim
Actual
WhartonDepartment of Statistics
38
Summary of Bankruptcy ModelSummary of Bankruptcy Model
Automatic, adaptive selection- Finds patterns that predict new observations
- Predictive, but not easy to explain
Dynamic feature set- Current research
- Information theory allows changing search space
- Finds more structure than direct search could find
Validation- Essential only for judging fit.
- Better than “hand-made models” that take years to create.
WhartonDepartment of Statistics
39
So, where’s the profit in DM?So, where’s the profit in DM?
Automated modeling has become very powerful, avoiding problems of over-fitting.
Role for expert judgment remains- What data to use?
- Which features to try first?
- What are the economics of the prediction errors?
Collaboration- Data sources
- Data analysis
- Strategic decisions