Leveraging Feature Selection Within TreeNet

Posted on 13-Jul-2015

TRANSCRIPT

Slide 1: Leveraging Feature Selection Within TreeNet
DataLab USA

Overview
- Introduction
- The case for feature selection
- Methodologies
- Case study: DMA Analytics Challenge 2007
- Comparison of approaches
- Advanced algorithms
- Conclusion
- Questions & answers

The DataLab Environment
- DataLab USA
- Industries served
- The data environment
- Analytical framework

When More Is Not Necessarily Better
- TreeNet models are naturally more robust than more traditional algorithms.
- Without any limitations, a TreeNet model in a typical direct-marketing environment can incorporate hundreds of independent variables.
- How many of those variables actually provide true informational gain?

Not All Variables Are Created Equal
- Certain types of variables can degrade TreeNet model performance:
  - High-order categoricals (e.g., state, cluster)
  - Composite variables (e.g., risk score, cluster, family composition)

Why Not Specialize?
- A smaller number of variables allows for tighter parameters:
  - An increased number of terminal nodes
  - A decreased number of observations in minchild
  - Allowance for more variable interactions (ICL)

You Want Me to Build How Many Models?
- Brute force = 2^N - 1 models
- 60 variables = 1,152,921,504,606,846,975 models
- Processing time ≈ 730,693,161,740 years (age of the universe: ~13,730,000,000 years)
- 1/2 of those models include the top variable, 1/4 include the top two, and only 1/1,024 include the top ten.

Feature Selection
- Goal: efficiently identify the subset of independent variables that maximizes model discrimination.
- Basic feature selection = N x (N+1)/2 models
- 60 variables = 60 + 59 + 58 + ... + 1 = 1,830 models

Feature Selection: Framework
- Programmatic development and evaluation of TreeNet batches is a necessity.
- The performance of initial models dictates the composition of later models.
- There are too many decision points for human interaction to be practical.
- Implemented in SAS / C#.

Variable Shaving
- Stepwise removal of variables from the model based on variable importance.
- Typically starts with an unrestricted model and removes variables until a stop condition is met or no variables remain.
- At each step, the variable with the lowest importance is removed.
- Very low cost: only N total models are required, since each step fits a single model.

Forward Selection
- Stepwise addition of variables to the model based on performance criteria.
- Typically starts with zero variables and grows until the available variables are exhausted or a stop condition is met.
- Each step repeats the following substeps, for up to N iterations:
  - Model testing
  - Evaluation
  - Variable selection

Forward Selection Process
- (Walkthrough slides.) Sample scenario: 7 available independent variables.

Backward Selection
- Stepwise removal of variables from the model based on decision criteria.
- Typically starts with an unrestricted model and restricts variables until a stop condition is met or no variables remain.
- Substeps are similar to forward selection:
  - Model testing: candidate variables are removed from models.
  - Evaluation: identify the model with the highest performance.
  - Variable removal: remove that variable from the model.

Backward Selection Process
- (Walkthrough slides.)

Case Study: Overview
- 2007 DMA Analytics Challenge
- Dependent variable: response
- Independent variables: 228 in total
  - Household demographics
  - Area/household-level lifestyles and interests
  - Geo-demographics
  - Socio-economic census
- Domain: 40k random mailpieces generating 20k responders

Case Study: Model Parameters (TN 2.0)
- Type: logistic binary, ROC stopping condition
- Nodes: 6
- Trees: 200
- Minchild: 200
- Learning rate: 0.1
- Subsample: 0.5
- Validation: 50% internal test
- Performance (no variable restrictions): ROC (learn/test) .764/.736; KS (learn/test) .392/.351

Case Study: Variable Shaving
- Decision metric: importance (TN 2.0)
- Resample: changing of seed values
- Peak performance attained after 72 variables.
- 157 models required to identify the best 72 of 228 variables.
- ROC (learn/test): .763/.741

Case Study: Forward Selection
- Decision metric: ROC (test)
- Resample: rows of the input file are physically shuffled after each batch.
- Peak performance attained after 25 variables.
- 6,400 models required to identify 25 of 228 variables.
- ROC (learn/test): .768/.758

Case Study: Backward Selection
- Decision metric: ROC (test)
- Resample: rows of the input file are physically shuffled after each batch.
- Peak performance attained after 71 variables.
- 23,600 models required to identify the best 71 of 228 variables.
- ROC (learn/test): .761/.760

Case Study: Method Comparison

                                                  Forward  Backward  Shaving
  Sensitivity to unstable/heavily used variables     +        +         -
  Sensitivity to heavily interactive variables       -        +         +
  Guards against composite variables                 -        +         -
  Processing efficiency                              +        -         +
  Suitability for large numbers of variables         +        -         +

Devising More Advanced Algorithms
- Combining the two procedures
- Controlling for differences in parameters over the variable space
- Decision-metric augmentation
- Variable sampling
- Internal resampling of learn vs. test
- ICL

Conclusions
- Feature selection is a key component of TreeNet model optimization, for both performance and interpretability.
- Backward and forward selection are important building blocks for more sophisticated methods.

Questions?
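The variable-shaving procedure described in the deck was driven by DataLab's SAS/C# batch framework, which is not shown. Purely as an illustration, the loop can be sketched in Python: the `fit_model` callback, its returned (score, importances) pair, and the toy stand-in below are all hypothetical, not part of the original presentation.

```python
def shave(variables, fit_model, min_vars=1):
    """Stepwise variable shaving: start from the full (unrestricted) variable
    set and, at each step, drop the variable with the lowest importance.
    Only one model is fit per step, so N variables cost at most N models."""
    history = []
    current = list(variables)
    while len(current) >= min_vars:
        score, importance = fit_model(current)  # one model per step
        history.append((list(current), score))
        if len(current) == min_vars:
            break
        weakest = min(current, key=lambda v: importance[v])
        current.remove(weakest)
    # return the subset with the best held-out score seen along the path
    return max(history, key=lambda h: h[1])

# Hypothetical stand-in for a TreeNet fit: the score grows with the number
# of informative variables and shrinks as noise variables are added.
GOOD = {"income", "age", "tenure"}

def toy_fit(vars_):
    importance = {v: (1.0 if v in GOOD else 0.01) for v in vars_}
    score = sum(importance.values()) / (1 + 0.1 * len(vars_))
    return score, importance

best_vars, best_score = shave(
    ["income", "age", "tenure", "noise1", "noise2"], toy_fit)
print(sorted(best_vars))  # → ['age', 'income', 'tenure']
```

Note how the cost profile matches the deck's claim: shaving 228 variables down to 72 took only 157 models, because each step fits exactly one model.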
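Forward selection's N x (N+1)/2 model count follows from fitting one model per remaining candidate at every step. A minimal Python sketch of the model-testing / evaluation / variable-selection loop, again with a hypothetical `fit_model` scoring callback rather than the deck's actual SAS/C# framework:

```python
def forward_select(candidates, fit_model, patience=2):
    """Greedy forward selection: at each step, fit one model per remaining
    candidate and keep the addition with the best held-out score.
    For N candidates this costs N + (N-1) + ... + 1 = N(N+1)/2 models.
    Stops after `patience` consecutive steps without improvement."""
    selected, pool, history = [], list(candidates), []
    best_score, stall = float("-inf"), 0
    while pool and stall < patience:
        # model testing + evaluation: one fit per candidate variable
        score, trial = max((fit_model(selected + [v]), v) for v in pool)
        # variable selection: commit the best-scoring addition
        selected.append(trial)
        pool.remove(trial)
        history.append((list(selected), score))
        stall = 0 if score > best_score else stall + 1
        best_score = max(best_score, score)
    return max(history, key=lambda h: h[1])

# Same hypothetical scoring function as before: informative variables help,
# noise variables dilute the score.
GOOD = {"income", "age", "tenure"}

def toy_score(vars_):
    return sum(1.0 if v in GOOD else 0.01 for v in vars_) / (1 + 0.1 * len(vars_))

best_vars, best_score = forward_select(
    ["noise1", "income", "age", "noise2", "tenure"], toy_score)
print(sorted(best_vars))  # → ['age', 'income', 'tenure']
```

The `patience` stop condition is one illustrative choice; the case study instead evaluated the ROC (test) curve over the whole path and took its peak (25 of 228 variables).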