wharton department of statistics profiting from data mining bob stine department of statistics the...

39
Wharton Department of Statistics Profiting from Data Profiting from Data Mining Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 www-stat.wharton.upenn.edu/~bob

Post on 19-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

Profiting from Data MiningProfiting from Data Mining

Bob Stine

Department of Statistics

The Wharton School, Univ of Pennsylvania

April 5, 2002

www-stat.wharton.upenn.edu/~bob

Page 2: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

2

OverviewOverview Critical stages of data mining process

- Choosing the right data, people, and problems

- Modeling

- Validation

Automated modeling- Feature creation and selection

- Exploiting expert knowledge, “insights”

Applications- Little detail – Biomedical: finding predictive risk factors

- More detail – Financial: predicting returns on the market

- Lots of detail – Credit: anticipating the onset of bankruptcy

Page 3: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

3

Predicting Health RiskPredicting Health Risk

Who is at risk for a disease?- Example: detect osteoporosis without expense of x-ray

Goals- Improving public health- Savings on medical care- Confirm an informal model with data mining

Many types of features, interested groups- Clinical observations of doctors- Laboratory measurements, “genetic”- Self-reported behavior

Missing data

Page 4: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

4

Predicting the Stock MarketPredicting the Stock Market Small, “hands-on” example

Goals- Better retirement savings?

- Money for that special vacation? College?

- Trade-offs: risk vs return

Lots of “free” data- Access to accurate historical time trends, macro factors

- Recent data more useful than older data

“Simple” modeling technique

Validation

Page 5: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

5

Predicting the Market: SpecificsPredicting the Market: Specifics

Build a regression model - Response is return on the value-weighted S&P

- Use standard forward/backward stepwise

- Battery of 12 predictors with interactions

Train the model during 1992-1996 (training data)- Model captures most of variation in 5 years of returns

- Retain only the most significant features (Bonferroni)

Predict returns in 1997 (validation data)

Another version in Foster, Stine & Waterman

Page 6: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

6

Historical patterns?Historical patterns?

vwReturn

-0.06

-0.04

-0.02

0.00

0.02

0.04

0.06

0.08

92 93 94 95 96 97 98

Year

?

Page 7: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

7

Fitted model predicts...Fitted model predicts...

-0.05

-0.00

0.05

0.10

0.15

92 93 94 95 96 97 98

Year

Exceptional Feb return?

Page 8: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

8

What happened?What happened?

Pred Error

-0.15

-0.10

-0.05

-0.00

0.05

0.10

92 93 94 95 96 97 98

Year

Training Period

Page 9: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

9

Claimed versus Actual ErrorClaimed versus Actual Error

20

40

60

80

100

120

0 10 20 30 40 50 60 70 80 90 100

Complexity of Model

Actual

Claimed

SquaredPrediction

Error

Page 10: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

10

Over-confidence?Over-confidence?

Over-fitting- Model fits the training data too well –

better than it can predict the future.

- Greedy fitting procedure “Optimization capitalizes on chance”

Some intuition - Coincidences

• Cancer clusters, the “birthday problem”

- Illustration with an auction

• What is the value of the coins in this jar?

Page 11: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

11

Auctions and Over-fittingAuctions and Over-fitting

What is the value of these coins?

Page 12: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

12

Auctions and Over-fitting Auctions and Over-fitting

Auction jar of coins to a class of MBA students

Histogram shows the bids of 30 students

Most were suspicious, but a few were not!

Actual value is $3.85

Known as “Winner’s Curse”

Similar to over-fitting:best model like high bidder

1

2

3

4

5

6

7

8

9

Page 13: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

13

Profiting from data mining? Profiting from data mining?

Where’s the profit in this?- “Mining the miners” vs getting value from your data

- Lost opportunities

Importance of domain knowledge

Validation as a measure of success- Prediction provides an explicit check

- Does your application predict something?

Page 14: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

14

Pitfalls and Role of ManagementPitfalls and Role of Management

Over-fitting is dominated by other issues…

Management support-Life in silos

-Coordination across domains

Responsibility and reward-Accountability

-Who gets the credit when it succeeds?Who suffers if the project is not successful?

Page 15: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

15

Specific PotholesSpecific Potholes

Moving targets-“Let’s try this with something else.”

Irrational expectations-“I could have done better than that.”

Not with my data-“It’s our data. You can’t use it.”-“You did not use our data properly.”

Page 16: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

16

Back to a real application…Back to a real application…

Emphasis on the statistical issues…

Page 17: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

17

Predicting BankruptcyPredicting Bankruptcy

Goal- Reduce losses stemming from personal bankruptcy

Possible strategies- If can identify those with highest risk of bankruptcy…

Take some action

• Call them for a “friendly chat” about circumstances

• Unilaterally reduce credit limit

Trade-off- Good customers borrow lots of money

- Bad customers also borrow lots of money

Page 18: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

18

Predicting BankruptcyPredicting Bankruptcy

“Needle in a haystack”- 3,000,000 months of credit-card activity

- 2244 bankruptcies

- Simple predictor that all are OK looks pretty good.

What factors anticipate bankruptcy?- Spending patterns? Payment history?

- Demographics? Missing data?

- Combinations of factors?

• Cash Advance + Las Vegas = Problem

We consider more than 100,000 predictors!

Page 19: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

19

Modeling: Predictive ModelsModeling: Predictive Models

Build the modelIdentify patterns in training data that predict future observations.- Which features are real? Coincidental?

Evaluate the modelHow do you know that it works?- During the model construction phase

• Only incorporate meaningful features

- After the model is built

• Validate by predicting new observations

Page 20: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

20

Are all prediction errors the same?Are all prediction errors the same?

Symmetry- Is over-predicting as costly as under-predicting?

- Managing inventories and sales

- Visible costs versus hidden costs

Does a false positive = a false negative?- Classification in data mining

- Credit modeling, flagging “risky” customers

- False positive: call a good customer “bad”

- False negative: fail to identify a “bad”

- Differential costs for different types of errors

Page 21: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

21

Building a Predictive ModelBuilding a Predictive Model

So many choices… Structure: What type of model?

• Neural net• CART, classification tree• Additive model or regression spline

Identification: Which features to use?• Time lags, “natural” transformations• Combinations of other features

Search: How does one find these features?• Brute force has become cheap.

Page 22: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

22

Our ChoicesOur Choices Structure

- Linear regression with nonlinearity via interactions

- All 2-way and some 3-way, 4-way interactions

- Missing data handled with indicators

Identification- Conservative standard error

- Comparison of conservative t-ratio to adaptive threshold

Search- Forward stepwise regression

- Coming: Dynamically changing list of features

• Good choice affects where you search next.

Page 23: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

23

Identifying Predictive FeaturesIdentifying Predictive Features

Classical problem of “variable selection”

Thresholding methods (compare t-ratio to threshold)- Akaike information criterion (AIC)

- Bayes information criterion (BIC)

- Hard thresholding and Bonferroni

Arguments for adaptive thresholds- Empirical Bayes

- Information theory

- Step-up/step-down tests

Page 24: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

24

Adaptive ThresholdingAdaptive Thresholding

Threshold changes to conform to attributes of data- Easier to add features as more are found.

Threshold for first predictor- Compare conservative t-ratio to Bonferroni.

- Bonferroni is about Sqrt(2 log p)

- If something significant is found, continue.

Threshold for second predictor- Compare t-ratio to reduced threshold

- New threshold is about Sqrt(2 log p/2)

Page 25: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

25

Adaptive Thresholding: BenefitsAdaptive Thresholding: Benefits

EasyAs easy and fast as implementing the standard criterion that is used in stepwise regression.

TheoryResulting model provably as good as best Bayes model for the problem at hand.

Real worldIt works! Finds models with real signal, and stops when the signal runs out.

Page 26: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

26

Bankruptcy Model: ConstructionBankruptcy Model: Construction

Data: reserve 80% for validation- Training data

• 600,000 months

• 458 bankruptcies

- Validation data

• 2,400,000 months

• 1786 bankruptcies

Selection via adaptive thresholding - Compare sequence of t-statistics to Sqrt(2 log p/q)

- Dynamic expansion of feature space

Page 27: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

27

Bankruptcy Model: PreviewBankruptcy Model: Preview

Predictors- Initial search identifies 39

• Validation SS monotonically falls to 1650

• Linear fit can do no better than 1735

- Expanded search of higher interactions finds a bit more

• Nature of predictors comprising the interactions

• Validation SS drops 10 more

Validation: Lift chart- Top 1000 candidates have 351 bankrupt

More validation: Calibration- Close to actual Pr(bankrupt) for most groups.

Page 28: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

28

Bankruptcy Model: FittingBankruptcy Model: Fitting

Where should the fitting process be stopped?

Residual Sum of Squares

400

410420

430

440

450460

470

0 50 100 150

Number of Predictors

SS

Page 29: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

29

Bankruptcy Model: FittingBankruptcy Model: Fitting

Our adaptive selection procedure stops at a model with 39 predictors.

Residual Sum of Squares

400

410

420

430

440

450

460

470

0 50 100 150

Number of Predictors

SS

Page 30: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

30

Bankruptcy Model: ValidationBankruptcy Model: Validation The validation indicates that the fit gets better while

the model expands. Avoids over-fitting.

Validation Sum of Squares

1640

1680

1720

1760

0 50 100 150

Number of Predictors

SS

Page 31: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

31

Bankruptcy Model: Linear?Bankruptcy Model: Linear? Choosing from linear predictors (no interactions) does

not match the performance of the full search.

Validation Sum of Squares

1640

1680

1720

1760

0 50 100 150

Number of Predictors

SS

Linear Quadratic

Page 32: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

32

Bankruptcy Model: More?Bankruptcy Model: More? Searching higher-order interactions offers modest

improvement.

Validation Sum of Squares

1640

1680

0 20 40 60

Number of Predictors

SS

Quadratic Cubic

Page 33: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

33

Lift ChartLift Chart

Measures how well model classifies sought-for group

Depends on rule used to label customers - Very high threshold

Lots of lift, but few bankrupt customers are found.

- Lower thresholdLift drops, but finds more bankrupt customers.

data allin bankrupt %

selectionDMinbankrupt %=Lift

Page 34: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

34

Generic Lift ChartGeneric Lift Chart

%Responders

0.0

0.2

0.4

0.6

0.8

1.0

0 10 20 30 40 50 60 70 80 90 100

% Chosen

Model

Random

Page 35: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

35

Bankruptcy Model: LiftBankruptcy Model: Lift Much better than diagonal!

0

25

50

75

100

0 25 50 75 100

% Contacted

% Found

Page 36: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

36

CalibrationCalibration

Classifier assigns Prob(“BR”)rating to a customer.

Weather forecast

Among those classified as 2/10 chance of “BR”, how many are BR?

Closer to diagonal is better.

Actual

0

25

50

75

100

10 20 30 40 50 60 70 80 90

Page 37: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

37

Bankruptcy Model: CalibrationBankruptcy Model: Calibration Over-predicts risk above claimed probability 0.4

Calibration Chart

0

0.2

0.4

0.6

0.8

1

1.2

0 0.2 0.4 0.6 0.8

Claim

Actual

Page 38: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

38

Summary of Bankruptcy ModelSummary of Bankruptcy Model

Automatic, adaptive selection- Finds patterns that predict new observations

- Predictive, but not easy to explain

Dynamic feature set- Current research

- Information theory allows changing search space

- Finds more structure than direct search could find

Validation- Essential only for judging fit.

- Better than “hand-made models” that take years to create.

Page 39: Wharton Department of Statistics Profiting from Data Mining Bob Stine Department of Statistics The Wharton School, Univ of Pennsylvania April 5, 2002 bob

WhartonDepartment of Statistics

39

So, where’s the profit in DM?So, where’s the profit in DM?

Automated modeling has become very powerful, avoiding problems of over-fitting.

Role for expert judgment remains- What data to use?

- Which features to try first?

- What are the economics of the prediction errors?

Collaboration- Data sources

- Data analysis

- Strategic decisions