
Page 1:

Decision Trees

Aarti Singh

Machine Learning 10-701/15-781, Mar 6, 2014

Page 2:

Representation

• What does a decision tree represent?

Page 3:

Decision Tree for Tax Fraud Detection

[Tree:
Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO]

• Each internal node: test one feature Xi

• Each branch from a node: selects one value for Xi

• Each leaf node: prediction for Y

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
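This tree is small enough to write out directly. A minimal Python sketch of it (the function name and the numeric encoding of 80K are illustrative assumptions, not from the slides):

    def predict_cheat(refund, marital_status, taxable_income):
        """Classify one record with the tax-fraud tree above."""
        if refund == "Yes":               # root: test Refund
            return "NO"
        if marital_status == "Married":   # next: test MarSt
            return "NO"
        if taxable_income < 80_000:       # Single/Divorced branch: test TaxInc
            return "NO"
        return "YES"

    # The query data above: Refund = No, Married, 80K
    print(predict_cheat("No", "Married", 80_000))  # -> NO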

Page 4:

Prediction

• Given a decision tree, how do we assign a label to a test point?

Page 5:

Decision Tree for Tax Fraud Detection

[Tree:
Refund?
├─ Yes → NO
└─ No → MarSt?
    ├─ Single, Divorced → TaxInc?
    │   ├─ < 80K → NO
    │   └─ > 80K → YES
    └─ Married → NO]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Page 6:

Decision Tree for Tax Fraud Detection

(Tree and query repeated; the traversal starts at the root node, Refund.)

Page 7:

Decision Tree for Tax Fraud Detection

(Tree and query repeated; Refund = No, so follow the "No" branch to MarSt.)

Page 8:

Decision Tree for Tax Fraud Detection

(Tree and query repeated; the "No" branch leads to the MarSt test.)

Page 9:

Decision Tree for Tax Fraud Detection

(Tree and query repeated; Marital Status = Married, so follow the "Married" branch.)

Page 10:

Decision Tree for Tax Fraud Detection

(Tree and query repeated; the "Married" branch ends in a leaf labeled NO.)

Assign Cheat to "No"

Page 11:

Decision Tree more generally…

[Figure: axis-parallel partition of the feature space into rectangular cells, each cell labeled 0 or 1]

• Features can be discrete, continuous, or categorical

• Each internal node: test some set of features {Xi}

• Each branch from a node: selects a set of values for {Xi}

• Each leaf node: prediction for Y

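A sketch of this more general form as a nested dict with a recursive classifier (the structure, the names, and the collapsed TaxInc test are illustrative assumptions):

    def classify(node, x):
        """Follow branches matching x's feature values until a leaf."""
        while "leaf" not in node:
            node = node["branches"][x[node["feature"]]]
        return node["leaf"]

    tree = {
        "feature": "Refund",
        "branches": {
            "Yes": {"leaf": "NO"},
            "No": {
                "feature": "MarSt",
                "branches": {
                    "Married": {"leaf": "NO"},
                    "Single": {"leaf": "YES"},    # illustrative leaf labels
                    "Divorced": {"leaf": "YES"},  # (the real tree tests TaxInc here)
                },
            },
        },
    }

    print(classify(tree, {"Refund": "No", "MarSt": "Married"}))  # -> NO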

Page 12:

So far…

• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Now …

• How do we learn a decision tree from training data?

Page 13:

How to learn a decision tree

• Top-down induction [ID3, C4.5, CART, …]

[The tax-fraud tree from page 3, repeated as the running example]

6. Prune back the tree to reduce overfitting; assign the majority label to the leaf node

Page 14:

How to learn a decision tree

• Top-down induction [ID3, C4.5, CART, …]

[The tax-fraud tree from page 3, repeated as the running example]

ID3

(Repeat steps 1-5 after removing the current attribute.)
6. When all attributes are exhausted, assign the majority label to the leaf node
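A compact sketch of this top-down recursion in Python, assuming discrete features and the information-gain criterion defined on the following pages (all names are illustrative; rows are dicts, features is a set of attribute names):

    import math
    from collections import Counter

    def entropy(labels):
        """H(Y) in bits for a list of class labels."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, f):
        """IG(f) = H(Y) - sum_v P(f = v) * H(Y | f = v)."""
        gain = entropy(labels)
        for v in set(r[f] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[f] == v]
            gain -= len(sub) / len(rows) * entropy(sub)
        return gain

    def id3(rows, labels, features):
        # Leaf: labels are pure, or attributes exhausted -> majority label
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        best = max(features, key=lambda f: info_gain(rows, labels, f))
        subtree = {}
        for v in set(r[best] for r in rows):
            keep = [i for i, r in enumerate(rows) if r[best] == v]
            # Recurse after removing the current attribute
            subtree[v] = id3([rows[i] for i in keep],
                             [labels[i] for i in keep],
                             features - {best})
        return (best, subtree)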

Page 15:

Which feature is best to split?

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1:  X1 = T → Y: 4 Ts, 0 Fs;  X1 = F → Y: 1 T, 3 Fs
Split on X2:  X2 = T → Y: 3 Ts, 1 F;  X2 = F → Y: 2 Ts, 2 Fs

Good split if we are more certain about classification after the split – a uniform distribution of labels is bad.

(X1 = T: absolutely sure; X1 = F: kind of sure; X2 = T: kind of sure; X2 = F: absolutely unsure)

Page 16:

Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

H(Y) – entropy of Y;  H(Y|Xi) – conditional entropy of Y given Xi
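Written out in standard notation (the criterion the slide names):

    X^\star = \arg\max_i \, \mathrm{IG}(X_i), \qquad \mathrm{IG}(X_i) = H(Y) - H(Y \mid X_i)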

Page 17:

Entropy

• Entropy of a random variable Y

More uncertainty, more entropy!  Y ~ Bernoulli(p)

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

[Plot: entropy H(Y) versus p for Y ~ Bernoulli(p); uniform (p = 0.5) gives max entropy, deterministic (p = 0 or 1) gives zero entropy]
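The standard definition behind the plot, in bits:

    H(Y) = -\sum_y P(Y = y) \log_2 P(Y = y)

and for Y \sim \mathrm{Bernoulli}(p):

    H(Y) = -p \log_2 p - (1 - p) \log_2 (1 - p),

which equals 1 bit at p = 1/2 and 0 at p \in \{0, 1\}.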

Page 18:

Andrew Moore's Entropy in a Nutshell

Low Entropy: the values (locations of soup) sampled entirely from within the soup bowl

High Entropy: the values (locations of soup) unpredictable… almost uniformly sampled throughout our dining room

Page 19:

Information Gain

• Advantage of an attribute = decrease in uncertainty

– Entropy of Y before the split

– Entropy of Y after splitting based on Xi
  • Weight by the probability of following each branch

• Information gain is the difference

Max information gain = min conditional entropy
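Concretely, with the standard definitions:

    H(Y \mid X_i) = \sum_v P(X_i = v) \, H(Y \mid X_i = v)

    \mathrm{IG}(X_i) = H(Y) - H(Y \mid X_i)

Since H(Y) does not depend on the feature, maximizing \mathrm{IG}(X_i) over i is the same as minimizing H(Y \mid X_i).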

Page 20:

Information Gain

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1:  X1 = T → Y: 4 Ts, 0 Fs;  X1 = F → Y: 1 T, 3 Fs
Split on X2:  X2 = T → Y: 3 Ts, 1 F;  X2 = F → Y: 2 Ts, 2 Fs

Both splits have information gain > 0, but IG(X1) > IG(X2), so X1 is the better split.
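A quick numeric check in Python (dataset transcribed from the table above; the printed values follow from the formulas on the previous pages):

    import math
    from collections import Counter

    # The eight (X1, X2, Y) rows from the table above
    data = [("T","T","T"), ("T","F","T"), ("T","T","T"), ("T","F","T"),
            ("F","T","T"), ("F","F","F"), ("F","T","F"), ("F","F","F")]

    def H(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(col):
        y = [row[2] for row in data]
        gain = H(y)
        for v in ("T", "F"):
            sub = [row[2] for row in data if row[col] == v]
            gain -= len(sub) / len(data) * H(sub)
        return gain

    print(round(info_gain(0), 3))  # IG(X1) = 0.549
    print(round(info_gain(1), 3))  # IG(X2) = 0.049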

Page 21:

Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

H(Y) – entropy of Y;  H(Y|Xi) – conditional entropy of Y given Xi

The feature which yields the maximum reduction in entropy provides the maximum information about Y.

Page 22:

Expressiveness of Decision Trees

• Decision trees can express any function of the input features.
• E.g., for Boolean functions, truth table row → path to leaf:

• There is a decision tree which perfectly classifies a training set, with one path to a leaf for each example – overfitting

• But it won't generalize well to new examples – prefer to find more compact decision trees
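For instance, XOR of two Boolean features (an illustrative function, not from the slide) needs one root-to-leaf path per truth-table row:

    # XOR as a depth-2 tree: every truth-table row gets its own leaf.
    xor_tree = {
        "feature": "X1",
        "branches": {
            True:  {"feature": "X2", "branches": {True: False, False: True}},
            False: {"feature": "X2", "branches": {True: True,  False: False}},
        },
    }

    def classify(node, x):
        while isinstance(node, dict):
            node = node["branches"][x[node["feature"]]]
        return node

    assert classify(xor_tree, {"X1": True, "X2": False}) is True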

Page 23:

Bias-Variance Tradeoff

[Figure: a fine partition gives small bias but large variance; a coarse partition gives large bias but small variance. Panels show the ideal classifier, classifiers based on different training data, and the average classifier.]

Page 24:

When to Stop?

• Many strategies for picking simpler trees:

– Pre-pruning
  • Fixed depth
  • Fixed number of leaves

– Post-pruning
  • Chi-square test
    – Convert the decision tree to a set of rules
    – Eliminate variable values in rules which are independent of the label (using the chi-square test for independence)
    – Simplify the rule set by eliminating unnecessary rules

– Information Criteria: MDL (Minimum Description Length)

(A pre-pruning sketch in code follows the pruned tree below.)

[Pruned tree: Refund? (Yes → NO; No → MarSt, with branches Married / Single, Divorced)]
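For concreteness, pre-pruning as exposed by scikit-learn's DecisionTreeClassifier (assuming sklearn is installed; the dataset and parameter values are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Pre-pruning: cap the depth and the number of leaves while growing
    clf = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8, random_state=0)
    clf.fit(X, y)
    print(clf.get_depth(), clf.get_n_leaves())  # depth <= 3, at most 8 leaves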

Page 25:

Information Criteria

• Penalize complex models by introducing a cost: maximize (log likelihood – cost)

– the log-likelihood term measures fit (different forms for regression and classification)
– the cost term penalizes trees with more leaves (as in CART)
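One standard instantiation is CART's cost-complexity criterion (an assumption consistent with the keywords on this slide, not a formula taken from it): choose the tree T minimizing

    C_\alpha(T) = \sum_i \big(y_i - \hat{f}_T(x_i)\big)^2 + \alpha\,|T| \quad \text{(regression)}

    C_\alpha(T) = \text{misclassification error of } T + \alpha\,|T| \quad \text{(classification)}

where |T| is the number of leaves and \alpha trades data fit against tree size.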

Page 26:

How to assign a label to each leaf

Classification – majority vote

Regression – ?

Page 27:

How to assign a label to each leaf

Classification – majority vote

Regression – constant / linear / polynomial fit

Page 28:

Regression trees

Average (fit a constant) using the training data at the leaves

[Figure: regression tree with one split, Num Children? (≥ 2 / < 2), and a constant fit at each leaf]
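A minimal sketch of that leaf rule, with made-up training pairs (num_children, y):

    # Fit a constant (the mean of the training targets) at each leaf
    train = [(0, 41.0), (1, 38.5), (2, 27.0), (3, 24.5), (4, 26.0)]

    left  = [y for n, y in train if n < 2]    # leaf: Num Children < 2
    right = [y for n, y in train if n >= 2]   # leaf: Num Children >= 2
    leaf_mean = {"lt2": sum(left) / len(left), "ge2": sum(right) / len(right)}

    def predict(num_children):
        return leaf_mean["ge2"] if num_children >= 2 else leaf_mean["lt2"]

    print(predict(3))  # (27.0 + 24.5 + 26.0) / 3 ≈ 25.83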

Page 29:

Connection between histogram classifiers and decision trees

Page 30:

Local prediction

Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression

[Figure: histogram classifier – a fixed grid over the feature space with cell side Δ]
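A sketch of a histogram classifier on [0, 1]^2 (cell side delta, majority vote per cell; the data points here are illustrative):

    from collections import Counter, defaultdict

    def fit_histogram(points, labels, delta=0.25):
        """Majority-vote label for each grid cell of side delta."""
        votes = defaultdict(Counter)
        for (x1, x2), y in zip(points, labels):
            votes[(int(x1 / delta), int(x2 / delta))][y] += 1
        return {cell: c.most_common(1)[0][0] for cell, c in votes.items()}

    def predict(model, x, delta=0.25, default=None):
        return model.get((int(x[0] / delta), int(x[1] / delta)), default)

    model = fit_histogram([(0.1, 0.1), (0.15, 0.2), (0.8, 0.9)], ["A", "A", "B"])
    print(predict(model, (0.05, 0.2)))  # -> A (same cell as the first two points)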

Page 31:

Local Adaptive prediction

Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance)

[Figure: decision tree classifier – majority vote at each leaf; the cell size Δx adapts to the data]

Page 32:

Histogram Classifier vs Decision Trees

[Figure: ideal classifier vs decision tree vs histogram; 256 cells in each partition]

Page 33:

Application to Image Coding

[Figure: 1024 cells in each partition]

Page 34:

Application to Image Coding

[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs JPEG 2000 at 0.125 bpp (adaptive partitioning)]

Page 35:

What you should know

• Decision trees are one of the most popular data mining tools
  • Simplicity of design
  • Interpretability
  • Ease of implementation
  • Good performance in practice (for small dimensions)

• Information gain to select attributes (ID3, C4.5, …)

• Decision trees will overfit!!!
  – Must use tricks to find "simple trees", e.g.,
    • Pre-pruning: fixed depth / fixed number of leaves
    • Post-pruning: chi-square test of independence
    • Complexity-penalized / MDL model selection

• Can be used for classification, regression, and density estimation too