
Page 1:

© Deloitte Consulting, 2005

CART from A to B

James Guszcza, FCAS, MAAA
CAS Predictive Modeling Seminar

Chicago, September 2005

Page 2:

© Deloitte Consulting, 2005

Contents

An Insurance Example
Some Basic Theory
Suggested Uses of CART
Case Study: comparing CART with other methods

Page 3:

What is CART?

Classification And Regression Trees
Developed by Breiman, Friedman, Olshen, and Stone in the early 80's
Introduced tree-based modeling into the statistical mainstream
A rigorous approach involving cross-validation to select the optimal tree
One of many tree-based modeling techniques:
CART -- the classic
CHAID
C5.0
Software package variants (SAS, S-Plus, R…)
Note: the "rpart" package in R is freely available

Page 4:

Philosophy

“Our philosophy in data analysis is to look at the data from a number of different viewpoints. Tree structured regression offers an interesting alternative for looking at regression type problems. It has sometimes given clues to data structure not apparent from a linear regression analysis. Like any tool, its greatest benefit lies in its intelligent and sensible application.” --Breiman, Friedman, Olshen, Stone

Page 5:

The Key Idea

Recursive Partitioning
Take all of your data.
Consider all possible values of all variables.
Select the variable/value (X = t1) that produces the greatest "separation" in the target. (X = t1) is called a "split".
If X < t1, send the data point to the "left"; otherwise, send it to the "right".
Now repeat the same process on these two "nodes".
You get a "tree".
Note: CART uses only binary splits.

Page 6:

© Deloitte Consulting, 2005

An Insurance Example

Page 7:

Let's Get Rolling

Suppose you have 3 variables:
# vehicles: {1, 2, 3, …, 10+}
Age category: {1, 2, 3, …, 6}
Liability-only: {0, 1}

At each iteration, CART tests all 15 candidate splits:
(#veh < 2), (#veh < 3), …, (#veh < 10)
(age < 2), …, (age < 6)
(lia < 1)

Select the split resulting in the greatest increase in purity (a sketch of this search follows below).
Perfect purity: each split has either all claims or all no-claims.
Perfect impurity: each split has the same proportion of claims as the overall population.
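To make the split search concrete, here is a minimal R sketch (not the CART implementation itself) of the exhaustive search just described, scoring each candidate split by the decrease in Gini impurity, the splitting rule used on the next slide. The data frame and column names (auto, num_veh, age_cat, liab_only, claim) are hypothetical stand-ins for the commercial auto data.

```r
gini <- function(y) {                      # node impurity: 2*p*(1 - p) for a binary 0/1 target
  p <- mean(y)
  2 * p * (1 - p)
}

best_split <- function(data, target, predictors) {
  y <- data[[target]]
  parent_impurity <- gini(y)
  results <- data.frame()
  for (v in predictors) {
    x <- data[[v]]
    for (t in sort(unique(x))[-1]) {       # candidate thresholds between observed values
      left  <- y[x <  t]
      right <- y[x >= t]
      # weighted child impurity; the decrease vs. the parent is the "gain"
      child <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
      results <- rbind(results,
                       data.frame(variable = v, threshold = t,
                                  gain = parent_impurity - child))
    }
  }
  results[order(-results$gain), ]
}

## Hypothetical usage:
## splits <- best_split(auto, target = "claim",
##                      predictors = c("num_veh", "age_cat", "liab_only"))
## head(splits, 1)   # the single best split, e.g. num_veh < 5
```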

Page 8:

Classification Tree Example: predict likelihood of a claim

Commercial Auto Dataset
57,000 policies
34% claim frequency

Classification Tree using Gini splitting rule

First split: policies with ≥5 vehicles have a 58% claim frequency; else 20%. A big increase in purity.

[First-split diagram:]
Node 1 (split on NUM_VEH), N = 57203: class 0: 37891 (66.2%); class 1: 19312 (33.8%)
Terminal Node 1 (NUM_VEH <= 4.5), N = 36359: class 0: 29083 (80.0%); class 1: 7276 (20.0%)
Terminal Node 2 (NUM_VEH > 4.5), N = 20844: class 0: 8808 (42.3%); class 1: 12036 (57.7%)

Page 9:

Growing the Tree

[Tree diagram:]
Node 1 (split on NUM_VEH), N = 57203
  NUM_VEH <= 4.5: Node 2 (split on LIAB_ONLY), N = 36359
    LIAB_ONLY <= 0.5: Node 3 (split on FREQ1_F_RPT), N = 28489
      FREQ1_F_RPT <= 0.5: Terminal Node 1, Class = 0, N = 24122 (0: 18984, 78.7%; 1: 5138, 21.3%)
      FREQ1_F_RPT > 0.5: Terminal Node 2, Class = 1, N = 4367 (0: 2508, 57.4%; 1: 1859, 42.6%)
    LIAB_ONLY > 0.5: Terminal Node 3, Class = 0, N = 7870 (0: 7591, 96.5%; 1: 279, 3.5%)
  NUM_VEH > 4.5: Node 4 (split on NUM_VEH), N = 20844
    NUM_VEH <= 10.5: Node 5 (split on AVGAGE_CAT), N = 11707
      AVGAGE_CAT <= 8.5: Terminal Node 4, Class = 1, N = 8998 (0: 4327, 48.1%; 1: 4671, 51.9%)
      AVGAGE_CAT > 8.5: Terminal Node 5, Class = 0, N = 2709 (0: 2072, 76.5%; 1: 637, 23.5%)
    NUM_VEH > 10.5: Terminal Node 6, Class = 1, N = 9137 (0: 2409, 26.4%; 1: 6728, 73.6%)

Page 10:

Observations (Shaking the Tree)

The first split (# vehicles) is rather obvious: more exposure → more claims.
But it confirms that CART is doing something reasonable.
Also: the choice of splitting value 5 (not 4 or 6) is non-obvious.
This suggests a way of optimally "binning" continuous variables into a small number of groups.

[Same first-split diagram as Page 8]

Page 11:

CART and Linear Structure

[Same tree diagram as Page 9]

Notice the right-hand side of the tree: CART is struggling to capture a linear relationship.

Weakness of CART: the best CART can do is a step-function approximation of a linear relationship.

Page 12:

Interactions and Rules

This tree is obviously not the best way to model this dataset.
But notice Terminal Node #3: liability-only policies with fewer than 5 vehicles have a very low claim frequency in this data.
This could be used as an underwriting rule, or as an interaction term in a GLM.

[Same tree diagram as Page 9]

Page 13:

High-Dimensional Predictors

Categorical predictors: CART considers every possible subset of categories.
A nice feature: a very handy way to group massively categorical predictors into a small # of groups.
Left (fewer claims): dump, farm, no truck
Right (more claims): contractor, hauling, food delivery, special delivery, waste, other

= ("dump",...)

TerminalNode 1

N = 11641

= ("hauling")

TerminalNode 2N = 652

= ("specDel")

TerminalNode 3N = 249

= ("hauling",...)

Node 3LINE_IND$

N = 901

= ("contr",...)

TerminalNode 4

N = 25758

= ("contr",...)

Node 2LINE_IND$

N = 26659

Node 1LINE_IND$N = 38300

Page 14:

Gains Chart: Measuring Success

From left to right:
Node 6: 16% of policies, 35% of claims.
Node 4: an additional 16% of policies, 24% of claims.
Node 2: an additional 8% of policies, 10% of claims.
…etc.

The steeper the gains chart, the stronger the model.
Analogous to a lift curve.
It is desirable to use out-of-sample data.
(A small sketch of the computation follows below.)
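A minimal R sketch (an assumption, not the presenter's code) of how a gains chart like this can be computed: sort records by the model's predicted claim propensity, then plot the cumulative share of actual claims against the cumulative share of policies. The names `score` and `claim`, and the commented usage with `fit` and `holdout`, are hypothetical.

```r
gains <- function(score, claim) {
  ord <- order(score, decreasing = TRUE)           # best-scoring policies first
  data.frame(
    pct_policies = seq_along(ord) / length(ord),   # cumulative % of policies
    pct_claims   = cumsum(claim[ord]) / sum(claim) # cumulative % of claims captured
  )
}

## Hypothetical usage, with a fitted classification tree and a holdout set:
## g <- gains(predict(fit, holdout)[, "1"], holdout$claim)
## plot(g$pct_policies, g$pct_claims, type = "l",
##      xlab = "Proportion of policies", ylab = "Proportion of claims")
## abline(0, 1, lty = 2)   # the "random" baseline
```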

Page 15:

© Deloitte Consulting, 2005

A Little Theory

Page 16:

Splitting Rules

Select the variable/value (X = t1) that produces the greatest "separation" in the target variable.
"Separation" can be defined in many ways:
Regression Trees (continuous target): use the sum of squared errors.
Classification Trees (categorical target): choice of entropy, the Gini measure, or the "twoing" splitting rule.

Page 17:

Regression Trees

Tree-based modeling for a continuous target variable
The most intuitively appropriate method for loss ratio analysis
Find the split that produces the greatest separation in Σ[y − E(y)]²
i.e., find nodes with minimal within-node variance, and therefore greatest between-node variance (as in credibility theory)
Every record in a node is assigned the same ŷ → the model is a step function
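Written out, the split search on this slide amounts to the following standard formulation of the regression-tree criterion (stated here for completeness): choose the variable j and threshold t that minimize the pooled within-node sum of squares,

```latex
\min_{j,\,t}\;\left[\sum_{i:\,x_{ij}\le t}\bigl(y_i-\bar y_{L}\bigr)^{2}
\;+\;\sum_{i:\,x_{ij}> t}\bigl(y_i-\bar y_{R}\bigr)^{2}\right],
\qquad \bar y_{L},\ \bar y_{R}\ \text{the mean of } y \text{ in the left/right node.}
```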

Page 18:

Classification Trees

Tree-based modeling for a discrete target variable
In contrast with regression trees, various measures of purity are used
Common measures of purity: Gini, entropy, "twoing"
Intuition: an ideal retention model would produce nodes that contain either defectors only or non-defectors only, i.e., completely pure nodes

Page 19:

More on Splitting Criteria

Gini index of a node: p(1 − p), where p = the relative frequency of defectors
Entropy of a node: −Σ p·log(p) = −[p·log(p) + (1 − p)·log(1 − p)] in the two-class case
Entropy/Gini are maximized when p = 0.5 and minimized when p = 0 or 1
Gini might produce small but pure nodes
The "twoing" rule strikes a balance between purity and creating roughly equal-sized nodes
(A small R sketch of the Gini and entropy measures follows below.)
Note: "twoing" is available in Salford Systems' CART but not in the "rpart" package in R.
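A minimal R sketch (not from the presentation) of the two node-impurity measures defined above, for a binary 0/1 target vector y:

```r
gini_node <- function(y) {
  p <- mean(y)                      # relative frequency of the "1" class
  p * (1 - p)                       # maximized at p = 0.5, zero at p = 0 or 1
}

entropy_node <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)   # a perfectly pure node has zero entropy
  -(p * log(p) + (1 - p) * log(1 - p))
}

## e.g. gini_node(c(0, 0, 1, 1)) is 0.25; entropy_node(c(0, 0, 1, 1)) is log(2)
```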

Page 20:

Classification Trees vs. Regression Trees

Classification trees:
Splitting criteria: Gini, entropy, twoing
Goodness-of-fit measure: misclassification rate
Prior probabilities and misclassification costs available as model "tuning parameters"

Regression trees:
Splitting criterion: sum of squared errors
Goodness of fit: the same measure, sum of squared errors
No priors or misclassification costs… just let it run

Page 21:

How CART Selects the Optimal Tree

Use cross-validation (CV) to select the optimal decision tree.
Built into the CART algorithm; essential to the method, not an add-on.
Basic idea: "grow the tree" out as far as you can, then "prune back".
CV tells you when to stop pruning.

Page 22:

Growing & Pruning

One approach: stop growing the tree early. But how do you know when to stop?
CART: just grow the tree all the way out, then prune back.
Sequentially collapse the nodes that result in the smallest change in purity: "weakest link" pruning.

[Tree diagrams: the fully grown tree and the pruned tree]

Page 23:

Finding the Right Tree

"Inside every big tree is a small, perfect tree waiting to come out."
--Dan Steinberg, 2004 CAS P.M. Seminar

The optimal tradeoff of bias and variance. But how to find it?

Page 24:

Cost-Complexity Pruning

Definition: the cost-complexity criterion is Rα = MC + α·L, where
MC = misclassification rate, relative to the # of misclassifications in the root node
L = # of leaves (terminal nodes)
You get credit for a lower MC, but you also get a penalty for more leaves.

Let T0 be the biggest tree.
Find the sub-tree Tα of T0 that minimizes Rα: the optimal trade-off of accuracy and complexity.

Page 25:

Weakest-Link Pruning

Sequentially collapse the nodes that result in the smallest change in purity.
This gives a nested sequence of trees that are all sub-trees of T0:
T0 ⊃ T1 ⊃ T2 ⊃ T3 ⊃ … ⊃ Tk ⊃ …
Theorem: the sub-tree Tα of T0 that minimizes Rα is in this sequence!
This gives a simple strategy for finding the best tree: find the tree in the above sequence that minimizes the CV misclassification rate.

Page 26:

What is the Optimal Size?

Note that α is a free parameter in Rα = MC + α·L.
There is a 1:1 correspondence between α and the size of the tree.
What value of α should we choose?
α = 0 → the maximal tree T0 is best.
α large → you never get past the root node.
The truth lies in the middle.
Use cross-validation to select the optimal α (tree size).

Page 27:

Finding α

Fit 10 trees, one on each of the 10 "train" subsets shown below.
Test each on the corresponding held-out "test" subset.
Keep track of misclassification rates for different values of α.
Then go back to the full dataset and choose the α-tree.

model P1 P2 P3 P4 P5 P6 P7 P8 P9 P10

1 train train train train train train train train train test

2 train train train train train train train train test train

3 train train train train train train train test train train

4 train train train train train train test train train train

5 train train train train train test train train train train

6 train train train train test train train train train train

7 train train train test train train train train train train

8 train train test train train train train train train train

9 train test train train train train train train train train

10 test train train train train train train train train train

Page 28:

How to Cross-Validate

Grow the tree on all the data: T0.
Now break the data into 10 equal-size pieces.
10 times: grow a tree on 90% of the data, then drop the remaining 10% (the test data) down the nested trees corresponding to each value of α.
For each α, add up the errors across all 10 test data sets.
Keep track of the α corresponding to the lowest test error.
This corresponds to one of the nested trees Tk ⊂ T0.
(A sketch of this workflow in rpart follows below.)
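A minimal R sketch (an assumption about the workflow, using the actual rpart interface) of grow-then-prune with cross-validation. The data frame `auto` and the formula are hypothetical, carried over from the earlier insurance example.

```r
library(rpart)

set.seed(1)
fit <- rpart(claim ~ num_veh + age_cat + liab_only,
             data    = auto,
             method  = "class",
             control = rpart.control(cp = 0, minsplit = 20, xval = 10))

printcp(fit)    # cp table: one row per nested sub-tree, with 10-fold CV error ("xerror")
plotcp(fit)     # plot of CV relative error vs. cp / tree size (as on the next slide)

# choose the cp (rpart's rescaled version of alpha) with the lowest CV error,
# then prune back to the corresponding sub-tree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```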

Page 29:

Just Right

Relative error: the proportion of CV-test cases misclassified.
According to CV, the 15-node tree is nearly optimal.
In summary: grow the tree all the way out, then weakest-link prune back to the 15-node tree.

[Plot: cross-validated relative error vs. cp, for tree sizes from 1 to 21 nodes]

Page 30:

© Deloitte Consulting, 2005

CART in Practice

Page 31:

CART advantages

Nonparametric (no probabilistic assumptions)
Automatically performs variable selection
Uses any combination of continuous/discrete variables
Very nice feature: the ability to automatically bin massively categorical variables (zip code, business class, make/model…) into a few categories
Discovers "interactions" among variables:
Good for "rules" search
Good for hybrid GLM-CART models

Page 32:

CART advantages

CART handles missing values automatically, using "surrogate splits"
Invariant to monotonic transformations of predictor variables
Not sensitive to outliers in predictor variables (unlike regression)
A great way to explore and visualize data

Page 33:

CART Disadvantages

The model is a step function, not a continuous score
So if a tree has 10 nodes, yhat can take on only 10 possible values (MARS improves on this)
Might take a large tree to get good lift, but a large tree is hard to interpret
Data gets chopped thinner at each split
Instability of model structure: with correlated variables, random data fluctuations can result in entirely different trees
CART does a poor job of modeling linear structure

Page 34:

Uses of CART

Building predictive models
An alternative to GLMs, neural nets, etc.
Exploratory Data Analysis
Breiman et al: a different view of the data.
You can build a tree on nearly any dataset with minimal data preparation.
Which variables are selected first?
Interactions among variables
Take note of cases where CART keeps re-splitting the same variable (suggests a linear relationship)
Variable Selection
CART can rank variables
An alternative to stepwise regression

Page 35:

© Deloitte Consulting, 2005

Case Study: Spam E-mail Detection

Compare CART with:
Neural Nets
MARS
Logistic Regression
Ordinary Least Squares

Page 36:

The Data

Goal: build a model to predict whether an incoming email is spam.
Analogous to insurance fraud detection.
About 21,000 data points, each representing an email message sent to an HP scientist.
Binary target variable:
1 = the message was spam: 8%
0 = the message was not spam: 92%
Predictive variables created based on the frequencies of various words & characters.

Page 37:

The Predictive Variables

57 variables created:
Frequency of "George" (the scientist's first name)
Frequency of "!", "$", etc.
Frequency of long strings of capital letters
Frequency of "receive", "free", "credit", …
Etc.

Variable creation required insight that (as yet) can't be automated.
Analogous to the insurance variables an insightful actuary or underwriter can create.

Page 38:

Sample Data Points

Page 39:

Methodology

Divide the data 60%-40% into train and test sets.
Use multiple techniques to fit models on the train data.
Apply the models to the test data.
Compare their power using gains charts.

Page 40:

Software

R statistical computing environment
Classification or regression trees can be fit using the "rpart" package written by Therneau and Atkinson.
rpart is designed to follow the Breiman et al. approach closely.
http://www.r-project.org/
(One possible set of model-fitting calls is sketched below.)
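A minimal R sketch (an assumption, not the presenter's actual code) of the 60/40 split from Page 39 and the rpart fit. The data frame `spam` and its columns, including the 0/1 target `spam_flag`, are hypothetical.

```r
library(rpart)

set.seed(2005)
idx   <- sample(nrow(spam), size = 0.6 * nrow(spam))   # 60% train / 40% test
train <- spam[idx, ]
test  <- spam[-idx, ]

# grow a large classification tree with the Gini splitting rule;
# a small cp lets the tree grow nearly all the way out (the "un-pruned tree" on the next slide)
big_tree <- rpart(spam_flag ~ ., data = train, method = "class",
                  parms   = list(split = "gini"),
                  control = rpart.control(cp = 1e-4, xval = 10))
```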

Page 41:

Un-pruned Tree

Just let CART keep splitting as long as it can.
Too big, messy.
More importantly: this tree over-fits the data.
Use cross-validation (on the train data) to prune back, selecting the optimal sub-tree.

[Plot of the un-pruned tree]

Page 42:

Pruning Back

Plot the cross-validated error rate vs. the size of the tree.
Note: the error can actually increase if the tree is too big (over-fit).
It looks like the approximately optimal tree has 52 nodes, so prune the tree back to 52 nodes.

[Plot: cross-validated relative error vs. cp, for tree sizes from 1 to 83 nodes]

Page 43:

Pruned Tree #1

The pruned tree is still pretty big.
Can we get away with pruning the tree back even further?
Let's be radical and prune way back to a tree we actually wouldn't mind looking at.

[Same cross-validation error plot as Page 42]

Page 44:

Pruned Tree #2

[Pruned tree #2 diagram. Splits, from the root down: freq_DOLLARSIGN < 0.0555, freq_remove < 0.065, freq_EXCL < 0.5235, tot.CAPS < 83.5, freq_free < 0.77, freq_george >= 0.14, freq_hp >= 0.16, freq_EXCL < 0.3765, avg.CAPS < 2.92, freq_remove < 0.025; terminal nodes are labeled with the predicted class and the 0/1 case counts]

Suggests a rule: many "$" signs, caps, and "!", and few instances of the company name ("HP") → spam!

Page 45:

CART Gains Chart

How do the three trees compare? Use a gains chart on the test data.
Outer black line: the best one could do.
45° line: a monkey throwing darts.
The bigger trees are about equally good at catching 80% of the spam.
We do lose something with the simpler tree.

[Gains chart, "Spam Email Detection - Gains Charts": Perc.Spam vs. Perc.Total.Pop for the perfect model, un-pruned tree, pruned tree #1, and pruned tree #2]

Page 46:

Other Models

Fit a purely additive MARS model to the data (no interactions among basis functions).
Fit a neural network with 3 hidden nodes.
Fit a logistic regression (GLM), using the 20 strongest variables.
Fit an ordinary multiple regression.
(A statistical sin: the target is binary, not normal.)
(One possible set of R calls is sketched below.)
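A minimal sketch of the four competing models; the package choices and settings here are assumptions of mine, not the presenter's, and it continues with the hypothetical `train` data frame and 0/1 target `spam_flag` from the earlier sketch.

```r
library(mda)     # mars()
library(nnet)    # nnet()

predictors <- setdiff(names(train), "spam_flag")
x          <- as.matrix(train[, predictors])
top20_vars <- head(predictors, 20)   # stand-in for the 20 strongest variables

fit_mars <- mars(x, train$spam_flag, degree = 1)           # additive MARS: no interactions
fit_nnet <- nnet(spam_flag ~ ., data = train, size = 3,    # 3 hidden nodes
                 decay = 0.1, maxit = 500)
fit_glm  <- glm(spam_flag ~ ., family = binomial,
                data = train[, c("spam_flag", top20_vars)])
fit_ols  <- lm(spam_flag ~ ., data = train)                # the "statistical sin"
```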

Page 47:

GLM model

Logistic regression run on 20 of the most powerful predictive variables

Page 48:

Neural Net Weights

Page 49:

Comparison of Techniques

All techniques add value.
MARS/NNET beats GLM. But note: we used all variables for MARS/NNET and only 20 for GLM.
GLM beats CART.
In real life we'd probably use the GLM model but refer to the tree for "rules" and intuition.

[Gains chart, "Spam Email Detection - Gains Charts": Perc.Spam vs. Perc.Total.Pop for the perfect model, MARS, neural net, pruned tree #1, GLM, and OLS regression]

Page 50:

Parting Shot: Hybrid GLM model

We can use the simple decision tree (#3) to motivate the creation of two "interaction" terms:
"Goodnode": (freq_$ < .0565) & (freq_remove < .065) & (freq_! < .524)
"Badnode": (freq_$ > .0565) & (freq_hp < .16) & (freq_! > .375)

We read these off tree (#3).
Code them as {0,1} dummy variables and include them in the GLM model.
At the same time, remove terms that are no longer significant.
(A sketch of the dummy-variable coding follows below.)
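A minimal sketch (an assumption, not the presenter's code) of coding the Goodnode/Badnode indicators read off the small tree and adding them to the logistic GLM. The column names freq_dollar, freq_remove, freq_excl, and freq_hp are hypothetical stand-ins for freq_$, freq_remove, freq_!, and freq_hp, and `top20_vars` carries over from the earlier sketch.

```r
train$goodnode <- as.numeric(train$freq_dollar < 0.0565 &
                             train$freq_remove < 0.065  &
                             train$freq_excl   < 0.524)

train$badnode  <- as.numeric(train$freq_dollar > 0.0565 &
                             train$freq_hp     < 0.16   &
                             train$freq_excl   > 0.375)

# refit the logistic regression with the two tree-derived dummy variables;
# terms that are no longer significant would then be dropped
fit_hybrid <- glm(spam_flag ~ .,
                  family = binomial,
                  data   = train[, c("spam_flag", top20_vars, "goodnode", "badnode")])
```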

Page 51:

Hybrid GLM model

The Goodnode and Badnode indicators are highly significant.
Note that we also removed 5 variables that were in the original GLM.

Page 52:

Hybrid Model Result

Slight improvement over the original GLM (see the gains chart and the confusion matrix).
The improvement is not huge in this particular model… but it proves the concept.

[Gains chart, "Spam Email Detection - Gains Charts": Perc.Spam vs. Perc.Total.Pop for the perfect model, neural net, decision tree #2, GLM, and hybrid GLM]

Page 53:

Concluding Thoughts

In many cases, CART will likely under-perform tried-and-true techniques like GLM:
Poor at handling linear structure
Data gets chopped thinner at each split

BUT: CART is highly intuitive and a great way to:
Get a feel for your data
Select variables
Search for interactions
Search for "rules"
Bin variables

Page 54:

More Philosophy

“Binary Trees give an interesting and often illuminating way of looking at the data in classification or regression problems. They should not be used to the exclusion of other methods. We do not claim that they are always better. They do add a flexible nonparametric tool to the data analyst’s arsenal.” --Breiman, Friedman, Olshen, Stone