Transcript

DataLabUSA

Leveraging Feature Selection Within TreeNet


Overview

• Introduction

• The Case For Feature Selection

• Methodologies

• Case Study – DMA Analytics Challenge 2007

• Comparison of Approaches

• Advanced Algorithms

• Conclusion – Questions & Answers


The DataLab Environment

• DataLab USA

• Industries Served

• The Data Environment

• Analytical Framework


When more is not necessarily better

• TreeNet models are naturally more robust than traditional algorithms.

• Without any limitations a TreeNet Model in a typical DM environment can incorporate hundreds of independent variables.

• How many of these variables actually provide true informational gain?


Not all variables are created equal

• Certain types of variables can degrade TN model performance.

• High-order categoricals (e.g. state, cluster)

• Composite variables (e.g. risk score, cluster, family composition)


Why Not Specialize?

• A lower number of variables allows for tighter parameters:

• Increased number of terminal nodes

• Decreased number of observations in minchild

• Allowance for more variable interactions (ICL)


You want me to build how many models?

• Brute Force = 2^N − 1

• 60 Variables = 1,152,921,504,606,846,975 Models

• Processing Time = 730,693,161,740 years

• Age of the Universe ≈ 13,730,000,000 years

• 1/2 will include top variable

• 1/4 will include top two variables

• 1/1024 will include top ten variables
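The arithmetic above can be checked directly; a quick sketch in Python (not part of the original workflow):

```python
# Brute-force subset counting: with N candidate variables there are
# 2**N - 1 non-empty subsets, i.e. candidate models to build.
N = 60
total_models = 2**N - 1
print(total_models)  # 1,152,921,504,606,846,975 models

# A given variable appears in half of all subsets, so the top k
# variables appear together in 1 / 2**k of them.
for k in (1, 2, 10):
    print(f"1/{2**k} of subsets include the top {k} variables")
```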


Feature Selection

• Feature Selection Goal – Efficiently identify the subset of independent variables that maximize model discrimination.

• Basic Feature Selection = N × (N+1)/2

• 60 Variables = 60 + 59 + 58 + … + 1 = 1,830 Models
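As a one-line sanity check on the count above:

```python
# Basic feature selection builds at most one model per remaining variable
# at each of N steps: N + (N-1) + ... + 1 = N * (N+1) / 2 models.
N = 60
models = N * (N + 1) // 2
print(models)  # 1830
```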


Feature Selection - Framework

• The programmatic development and evaluation of TN batches is a necessity

• Performance of initial models dictates the composition of later models.

• There are too many decision points to rely on human interaction.

• SAS/C#


Variable Shaving

• Stepwise removal of variables from model based on variable importance.

• Typically starts with an unrestricted model and removes variables until stop condition is met or there are no more variables to remove.

• At each step variable with lowest importance is removed.

• Very low cost – requires only N models in total, since only one model is built per step.
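The shaving loop can be sketched as follows; `fit_and_score` is a hypothetical stand-in for training one TreeNet model and reading back its test score and variable importances (a toy additive scorer is used so the sketch runs):

```python
def fit_and_score(variables):
    # Toy stand-in for one TreeNet run: score is additive in made-up weights,
    # importance is each weight's magnitude. "noise" hurts the test score.
    weights = {"age": 5.0, "income": 4.0, "region": 0.5, "noise": -0.3}
    score = sum(weights[v] for v in variables)
    importances = {v: abs(weights[v]) for v in variables}
    return score, importances

def shave(variables, min_vars=1):
    """One model per step: drop the least-important variable each time,
    keeping track of the best subset seen so far."""
    current = list(variables)
    best_vars, best_score = list(current), float("-inf")
    while True:
        score, imp = fit_and_score(current)
        if score > best_score:
            best_vars, best_score = list(current), score
        if len(current) <= min_vars:
            break  # stop condition: nothing left to remove
        current.remove(min(imp, key=imp.get))  # lowest-importance variable
    return best_vars, best_score

print(shave(["age", "income", "region", "noise"]))
# (['age', 'income', 'region'], 9.5)
```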


Forward Selection

• Stepwise addition of variables to model based on performance criteria.

• Typically starts with 0 variables and grows until available variables are exhausted or a stop condition is met.

• Each step has the following substeps, repeated for up to N iterations:

1. Model Testing

2. Evaluation

3. Variable Selection
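The three substeps can be sketched as a greedy loop; `score_fn` stands in for building one TreeNet model and reading its test ROC (here a toy additive scorer, purely illustrative):

```python
def forward_select(candidates, score_fn, max_vars=None):
    """Greedy forward selection over the three substeps above."""
    selected, remaining, best_score = [], list(candidates), float("-inf")
    while remaining and (max_vars is None or len(selected) < max_vars):
        # 1. Model testing: one model per candidate addition
        trials = {v: score_fn(selected + [v]) for v in remaining}
        # 2. Evaluation: pick the best-performing trial
        best_var = max(trials, key=trials.get)
        if trials[best_var] <= best_score:
            break  # stop condition: no candidate improves the score
        # 3. Variable selection: commit the winning variable
        selected.append(best_var)
        best_score = trials[best_var]
        remaining.remove(best_var)
    return selected, best_score

# Toy scorer: each variable contributes a fixed amount to the "test ROC".
toy = lambda vs: sum({"a": 3, "b": 2, "c": 1, "d": -1}[v] for v in vs)
print(forward_select(["a", "b", "c", "d"], toy))
# (['a', 'b', 'c'], 6)
```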


Forward Selection – Process


Backward Selection

• Stepwise removal of variables from model based on decision criteria.

• Typically starts with an unrestricted model and restricts variables until stop condition is met or there are no more variables to remove.

• Substeps are similar to forward selection:

1. Model Testing – candidate variables are removed from models.

2. Evaluation - identify model with highest performance

3. Variable Removal – remove variable from model
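Mirroring the forward sketch, backward selection can be outlined as follows; `score_fn` is again a hypothetical stand-in for one TreeNet run's test ROC, with the same toy scorer:

```python
def backward_select(variables, score_fn, min_vars=1):
    """Greedy backward elimination over the three substeps above."""
    current = list(variables)
    best_vars, best_score = list(current), score_fn(current)
    while len(current) > min_vars:
        # 1. Model testing: one model per candidate removal
        trials = {v: score_fn([x for x in current if x != v]) for v in current}
        # 2. Evaluation: identify the highest-performing restricted model
        drop = max(trials, key=trials.get)
        # 3. Variable removal: commit it
        current.remove(drop)
        if trials[drop] > best_score:
            best_vars, best_score = list(current), trials[drop]
    return best_vars, best_score

# Toy scorer: each variable contributes a fixed amount to the "test ROC".
toy = lambda vs: sum({"a": 3, "b": 2, "c": 1, "d": -1}[v] for v in vs)
print(backward_select(["a", "b", "c", "d"], toy))
# (['a', 'b', 'c'], 6)
```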


Backward Selection – Process


Case Study – Overview

• 2007 DMA Analytics Challenge

• Dependent Variable: Response

• Independent Variables: 228 variables in total

• Household demographics

• Area/household level lifestyles and interests

• Geo demographics

• Socio-economic census

• Domain: 40k random mailpieces generating 20k responders


Case Study – Overview

• Model Parameters – TN 2.0

• Type: Logistic Binary – ROC stopping condition

• Nodes: 6

• Trees: 200

• Minchild: 200

• LR: 0.1

• SubSample: .5

• Validation Type: 50% internal test

• Performance (No Variable Restrictions):

• ROC (Learn/Test): .764/.736

• KS (Learn/Test): .392/.351


Case Study – Variable Shaving

• Decision Metric: Importance (TN 2.0)

• Resample: Changing of seed values

• Peak performance attained after 72 variables.

• 157 models required to identify best 72 out of 228 variables.

• ROC (Learn/Test): .763/.741


Case Study – Variable Shaving

[Chart: Learn ROC and Test ROC (0.730–0.790) vs. # Variables (0–220), with 5-period moving averages of each series]


Case Study – Forward Selection

• Decision Metric: ROC (Test)

• Resample: Rows of input file are physically shuffled after each batch.

• Peak performance attained after 25 variables.

• 6,400 models required to identify 25 out of 228 variables.

• ROC (Learn/Test): .768/.758


Case Study – Forward Selection

[Chart: Learn ROC and Test ROC (0.730–0.790) vs. # Variables (0–220), with 5-period moving averages of each series]


Case Study – Backward Selection

• Decision Metric: ROC (Test)

• Resample: Rows of input file are physically shuffled after each batch.

• Peak performance attained after 71 variables.

• 23,600 models required to identify best 71 out of 228 variables.

• ROC (Learn/Test): .761/.760


Case Study – Backward Selection

[Chart: Learn ROC and Test ROC (0.730–0.790) vs. # Variables (0–220), with 5-period moving averages of each series]


Case Study – Method Comparison

[Chart: Test ROC (0.730–0.760) vs. # Variables (0–220) for the Shaving, Forward, and Backward methods]


Comparison

                                                 Forward   Backward   Shaving
Sensitivity to unstable/heavily used variables      +         +          -
Sensitivity to heavily interactive variables        -         +          +
Guard against composite variables                   -         +          -
Processing efficiency                               +         -          +
Suitability for large number of variables           +         -          +


Devising more advanced algorithms

• Combination of the two procedures

• Controlling for differences in parameters over variable space.

• Decision metric augmentation

• Variable sampling

• Internal re-sampling of learn vs. test

• ICL


Conclusions

• Key component of TN model optimization

– Performance

– Interpretability

• Backward/Forward selection important building blocks for more sophisticated methods

• Questions?

