Decision Trees
Aarti Singh
Machine Learning 10-701/15-781, Mar 6, 2014
Representation
• What does a decision tree represent?
Decision Tree for Tax Fraud Detection
Refund?
 – Yes → NO
 – No → MarSt?
  – Married → NO
  – Single, Divorced → TaxInc?
   – TaxInc < 80K → NO
   – TaxInc > 80K → YES
• Each internal node: test one feature Xi
• Each branch from a node: selects one value of Xi
• Each leaf node: prediction for Y
Query Data:

| Refund | Marital Status | Taxable Income | Cheat |
|--------|----------------|----------------|-------|
| No     | Married        | 80K            | ?     |
Prediction
• Given a decision tree, how do we assign a label to a test point?
Decision Tree for Tax Fraud Detection

Using the same tree and query data as above (Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?), walk the query point down from the root:

1. The root tests Refund. The query has Refund = No, so follow the No branch to MarSt.
2. MarSt tests marital status. The query has Marital Status = Married, so follow the Married branch.
3. The Married branch ends at a leaf labeled NO.

Assign Cheat to “No”.
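To make the traversal concrete, here is a minimal sketch of tree prediction in Python. This is my own illustration, not code from the lecture; the Node encoding and the handling of the boundary at exactly 80K are assumptions.

```python
# Minimal sketch of decision-tree prediction (illustrative; the Node
# encoding and the tie-handling at exactly 80K are assumptions).

class Node:
    def __init__(self, feature=None, children=None, threshold=None, label=None):
        self.feature = feature      # feature tested at this internal node
        self.children = children    # dict: branch value -> child Node
        self.threshold = threshold  # numeric splits: compare feature value to this
        self.label = label          # prediction stored at a leaf

def predict(node, x):
    """Walk query point x (a dict: feature -> value) down to a leaf."""
    while node.label is None:                       # internal node: keep testing
        v = x[node.feature]
        if node.threshold is not None:              # numeric test (TaxInc)
            node = node.children["<" if v < node.threshold else ">="]
        else:                                       # categorical test
            node = node.children[v]
    return node.label

def taxinc_node():
    # Boundary at exactly 80K routed to the upper branch here (an assumption).
    return Node("TaxInc", threshold=80_000,
                children={"<": Node(label="No"), ">=": Node(label="Yes")})

# The tax-fraud tree from the slides:
tree = Node("Refund", {
    "Yes": Node(label="No"),
    "No": Node("MarSt", {
        "Married": Node(label="No"),
        "Single": taxinc_node(),
        "Divorced": taxinc_node(),
    }),
})

query = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print(predict(tree, query))  # -> No  (Refund=No -> MarSt=Married -> leaf NO)
```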
Decision Trees more generally…
• Features can be discrete, continuous or categorical
• Each internal node: test some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: prediction for Y
So far…
• What does a decision tree represent?
• Given a decision tree, how do we assign a label to a test point?

Now…
• How do we learn a decision tree from training data?
How to learn a decision tree
• Top-down induction [ID3, C4.5, CART, …]: pick the best attribute to test, split the training data on its values, and recurse on each subset with the current attribute removed (steps 1–5; the split criterion is developed on the next slides).
6. When all attributes are exhausted, assign the majority label to the leaf node (ID3). Alternatively, prune back the grown tree to reduce overfitting, assigning the majority label to each leaf.
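As a rough sketch of the recursion: the following is a paraphrase of the standard ID3-style procedure, not the lecture's exact pseudocode. It assumes an `information_gain` helper (the split criterion developed on the following slides, sketched there).

```python
from collections import Counter

def id3(examples, attributes):
    """Top-down induction, ID3-style sketch. `examples` is a list of
    (x, y) pairs with x a dict of categorical attribute -> value.
    Returns either a label (leaf) or a nested dict {attr: {value: subtree}}."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # pure node: stop and predict that label
        return labels[0]
    if not attributes:                   # step 6: attributes exhausted ->
        return Counter(labels).most_common(1)[0][0]   # majority label
    # Pick the best attribute by information gain (defined on later slides).
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]   # remove current attribute
    return {best: {
        v: id3([(x, y) for x, y in examples if x[best] == v], remaining)
        for v in {x[best] for x, _ in examples}        # one branch per value
    }}
```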
Which feature is best to split?
| X1 | X2 | Y |
|----|----|---|
| T  | T  | T |
| T  | F  | T |
| T  | T  | T |
| T  | F  | T |
| F  | T  | T |
| F  | F  | F |
| F  | T  | F |
| F  | F  | F |

Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure)
Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure)

A split is good if we are more certain about the classification after the split – a uniform distribution of labels is bad.
Which feature is best to split?
Pick the attribute/feature which yields maximum information gain:

$$\arg\max_i \, IG(X_i) = \arg\max_i \big[ H(Y) - H(Y \mid X_i) \big]$$

H(Y) – entropy of Y; H(Y|Xi) – conditional entropy of Y given Xi
Entropy
• Entropy of a random variable Y:

$$H(Y) = -\sum_{k} P(Y = k) \log_2 P(Y = k)$$

More uncertainty, more entropy! For Y ~ Bernoulli(p),

$$H(Y) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$

[Plot: H(Y) against p – zero entropy when Y is deterministic (p = 0 or 1), maximum entropy when Y is uniform (p = 1/2)]

Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
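As a quick numeric check of the curve above, a small helper of my own (not lecture code):

```python
import math

def entropy(probs):
    """H(Y) = -sum_k p_k log2 p_k; zero-probability outcomes contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0   -- uniform Bernoulli: max entropy
print(entropy([1.0, 0.0]))  # 0.0   -- deterministic: zero entropy
print(entropy([0.9, 0.1]))  # ~0.47 -- in between
```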
Andrew Moore’s Entropy in a Nutshell
Low Entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
High Entropy: the values (locations of soup) are unpredictable – almost uniformly sampled throughout our dining room.
Information Gain
• Advantage of an attribute = decrease in uncertainty
 – Entropy of Y before the split: H(Y)
 – Entropy of Y after splitting based on Xi, weighting each branch by the probability of following it:

$$H(Y \mid X_i) = \sum_{v} P(X_i = v) \, H(Y \mid X_i = v)$$

 – Information gain is the difference:

$$IG(X_i) = H(Y) - H(Y \mid X_i)$$

Max information gain = min conditional entropy
Information Gain

For the X1/X2 data above:
• Split on X1: X1 = T → Y: 4 Ts, 0 Fs; X1 = F → Y: 1 T, 3 Fs
• Split on X2: X2 = T → Y: 3 Ts, 1 F; X2 = F → Y: 2 Ts, 2 Fs

Both splits have IG(Xi) = H(Y) − H(Y|Xi) > 0, and the gain is larger for X1, so X1 is the better feature to split on.
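The two gains can be checked numerically. A sketch under the definitions above (the helper names are my own; this also supplies the `information_gain` used in the ID3 sketch earlier):

```python
from collections import Counter
import math

def entropy_of(labels):
    """Empirical entropy of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """IG(attr) = H(Y) - sum_v P(attr = v) * H(Y | attr = v)."""
    ys = [y for _, y in examples]
    n = len(examples)
    h_cond = 0.0
    for v in {x[attr] for x, _ in examples}:
        branch = [y for x, y in examples if x[attr] == v]
        h_cond += len(branch) / n * entropy_of(branch)   # weight by P(attr = v)
    return entropy_of(ys) - h_cond

data = [({"X1": a, "X2": b}, y) for a, b, y in [
    ("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"), ("T", "F", "T"),
    ("F", "T", "T"), ("F", "F", "F"), ("F", "T", "F"), ("F", "F", "F")]]
print(information_gain(data, "X1"))  # ~0.55 -> larger gain: split on X1
print(information_gain(data, "X2"))  # ~0.05
```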
Which feature is best to split?
Pick the attribute/feature which yields maximum information gain, $\arg\max_i \big[ H(Y) - H(Y \mid X_i) \big]$: the feature which yields the maximum reduction in entropy provides the maximum information about Y.
Expressiveness of Decision Trees
• Decision trees can express any function of the input features. E.g., for Boolean functions, truth table row → path to leaf (a toy sketch follows).
• There is a decision tree which perfectly classifies any given training set, with one path to a leaf for each example – overfitting.
• But it won't generalize well to new examples – prefer to find more compact decision trees.
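A toy sketch of the row → path correspondence (my own illustration): build a complete tree with one root-to-leaf path per truth-table row.

```python
from itertools import product

def truth_table_tree(f, n):
    """Encode a Boolean function f of n inputs as a complete decision tree:
    each truth-table row becomes one root-to-leaf path (2**n leaves)."""
    def build(assignment):
        if len(assignment) == n:
            return f(*assignment)        # leaf: the table entry for this row
        return {bit: build(assignment + (bit,)) for bit in (False, True)}
    return build(())

xor_tree = truth_table_tree(lambda a, b: a != b, 2)
print(xor_tree)  # {False: {False: False, True: True}, True: {False: True, True: False}}

# Sanity check against the truth table itself:
for row in product((False, True), repeat=2):
    node = xor_tree
    for bit in row:
        node = node[bit]                 # follow the path for this row
    assert node == (row[0] != row[1])
```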
Bias-Variance Tradeoff

• Fine partition: small bias, large variance
• Coarse partition: large bias, small variance

[Figure: ideal classifier, average classifier, and classifiers based on different training data]
When to Stop?
• Many strategies for picking simpler trees:
 – Pre-pruning
  • Fixed depth
  • Fixed number of leaves
 – Post-pruning
  • Chi-square test (a sketch of the test statistic follows this list)
   – Convert the decision tree to a set of rules
   – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
   – Simplify the rule set by eliminating unnecessary rules
 – Information criteria: MDL (Minimum Description Length)
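For the chi-square test, one compares the branch/label contingency table at a split against what independence would predict. A pure-Python sketch of the test statistic (an illustration; the significance-threshold lookup against the chi-square distribution is omitted):

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for a contingency table
    (rows: branch taken at the split, columns: class label).
    Small values suggest the split is independent of the label,
    i.e., a candidate for pruning."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Branches of a split vs. labels (Ts, Fs): a strong split scores high.
print(chi_square_stat([[4, 0], [1, 3]]))  # X1 split: ~4.8, informative
print(chi_square_stat([[3, 1], [2, 2]]))  # X2 split: ~0.53, near independence
```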
[Figure: the tax-fraud tree pruned back – the TaxInc test removed, leaving only the Refund and MarSt tests with NO leaves]
Information Criteria
• Penalize complex models by introducing cost:

$$\text{cost}(T) = -\log \text{likelihood}(\text{data} \mid T) + \lambda \cdot |\text{leaves}(T)|$$

 – The log-likelihood term measures fit: squared error for regression, misclassification/log loss for classification
 – The penalty term penalizes trees with more leaves (as in CART's cost-complexity criterion)
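A hedged sketch of such a penalized score for a classification tree; the λ default and the per-leaf plug-in likelihood are my assumptions, not the lecture's exact definition:

```python
import math

def penalized_cost(leaf_class_counts, lam=1.0):
    """cost(T) = -log-likelihood + lam * (number of leaves).
    `leaf_class_counts`: per-leaf class counts, e.g. [(4, 0), (1, 3)]."""
    nll = 0.0
    for counts in leaf_class_counts:
        n = sum(counts)
        for c in counts:
            if c > 0:
                nll -= c * math.log(c / n)   # plug-in estimate p = c/n per leaf
    return nll + lam * len(leaf_class_counts)

# A deeper tree must buy its extra leaves with a better fit:
print(penalized_cost([(5, 3)]))          # one leaf, poorer fit
print(penalized_cost([(4, 0), (1, 3)]))  # two leaves, better fit: lower cost
```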
How to assign a label to each leaf?
• Classification – majority vote
• Regression – constant / linear / polynomial fit
Regression trees
Average (fit a constant) using the training data at the leaves.

[Figure: example regression tree splitting on Num Children? (< 2 vs ≥ 2), with a constant fit at each leaf]
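A minimal sketch of the leaf rule (illustrative; function names are my own): the constant that minimizes squared error at a leaf is the mean of the training targets reaching it.

```python
def leaf_constant(ys):
    """Constant fit at a leaf: the mean of the training targets that reach
    it, which minimizes sum((y - c)**2) over constants c."""
    return sum(ys) / len(ys)

def leaf_sse(ys):
    """Squared error of the constant fit; regression trees choose splits
    that reduce this, playing the role entropy plays for classification."""
    c = leaf_constant(ys)
    return sum((y - c) ** 2 for y in ys)

# Hypothetical targets reaching one leaf of the Num Children? split:
print(leaf_constant([10.0, 12.0, 11.0]))  # leaf prediction: 11.0
print(leaf_sse([10.0, 12.0, 11.0]))       # 2.0
```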
Connection between histogram classifiers and decision trees
Local prediction
• Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression

[Figure: histogram classifier – the space partitioned into fixed cells of side Δ]
Local Adaptive prediction
Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance).

[Figure: decision tree classifier – majority vote at each leaf, with cell size Δx adapting across the space]
Histogram Classifier vs Decision Trees
[Figure: ideal classifier vs histogram classifier vs decision tree – 256 cells in each partition]
Application to Image Coding

[Figure: 1024 cells in each partition]
Application to Image Coding

[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs JPEG 2000 at 0.125 bpp (adaptive partitioning)]
What you should know
• Decision trees are one of the most popular data mining tools
 – Simplicity of design
 – Interpretability
 – Ease of implementation
 – Good performance in practice (for small dimensions)
• Information gain to select attributes (ID3, C4.5, …)
• Decision trees will overfit!!!
 – Must use tricks to find “simple trees”, e.g.,
  • Pre-pruning: fixed depth / fixed number of leaves
  • Post-pruning: chi-square test of independence
  • Complexity-penalized / MDL model selection
• Can be used for classification, regression and density estimation too