CS B351: DECISION TREES
AGENDA
• Decision trees
• Learning curves
• Combatting overfitting
CLASSIFICATION TASKS
• Supervised learning setting
• The target function f(x) takes on values True and False
• An example is positive if f is True, else it is negative
• The set X of all possible examples is the example set
• The training set is a subset of X (a small one!)
LOGICAL CLASSIFICATION DATASET
Here, examples (x, f(x)) take on discrete values.
LOGICAL CLASSIFICATION DATASET
Here, examples (x, f(x)) take on discrete values; CONCEPT is the target concept to be learned.
Note that the training set does not say whether an observable predicate is pertinent or not.
LOGICAL CLASSIFICATION TASK
Find a representation of CONCEPT in the form:
CONCEPT(x) ⇔ S(A, B, …)
where S(A, B, …) is a sentence built with the observable attributes, e.g.:
CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x))
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (B(x) ∨ C(x)) can be represented by the following decision tree:

A?
├─ True → B?
│   ├─ True → True
│   └─ False → C?
│       ├─ True → True
│       └─ False → False
└─ False → False

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted.
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
PREDICATE AS A DECISION TREE
The same predicate and decision tree, with the mushroom example extended by two additional observable attributes:
• D = FUNNEL-CAP
• E = BULKY
TRAINING SET

Ex. #  A      B      C      D      E      CONCEPT
1      False  False  True   False  True   False
2      False  True   False  False  False  False
3      False  True   True   True   True   False
4      False  False  True   False  False  False
5      False  False  False  True   True   False
6      True   False  True   False  False  True
7      True   False  False  True   False  True
8      True   False  True   False  True   True
9      True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True
POSSIBLE DECISION TREE
One decision tree consistent with the training set above:

D?
├─ True → E?
│   ├─ True → A?
│   │   ├─ True → True
│   │   └─ False → False
│   └─ False → True
└─ False → C?
    ├─ True → B?
    │   ├─ True → True
    │   └─ False → E?
    │       ├─ True → A? (True → True, False → False)
    │       └─ False → A? (True → True, False → False)
    └─ False → False
POSSIBLE DECISION TREE
The large tree above computes:

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

Compare with the much smaller original tree:

A?
├─ True → B?
│   ├─ True → True
│   └─ False → C?
│       ├─ True → True
│       └─ False → False
└─ False → False

CONCEPT ⇔ A ∧ (B ∨ C)
POSSIBLE DECISION TREE
The large tree computes CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A)))))), while the original tree computes CONCEPT ⇔ A ∧ (B ∨ C).

KIS bias: build the smallest decision tree.
Finding the smallest tree is a computationally intractable problem, hence a greedy algorithm.
GETTING STARTED: TOP-DOWN INDUCTION OF DECISION TREE
Ex. # A B C D E CONCEPT
1 False False True False True False
2 False True False False False False
3 False True True True True False
4 False False True False False False
5 False False False True True False
6 True False True False False True
7 True False False True False True
8 True False True False True True
9 True True True False True True
10 True True True True True True
11 True True False False False False
12 True True False False True False
13 True False True True True True
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12
GETTING STARTED: TOP-DOWN INDUCTION OF DECISION TREE
The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12
Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.
Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? → Greedy algorithm.
ASSUME IT'S A

A = True:  True: 6, 7, 8, 9, 10, 13   False: 11, 12
A = False: True: (none)               False: 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.
The number of misclassified examples from the training set is 2.
ASSUME IT'S B

B = True:  True: 9, 10          False: 2, 3, 11, 12
B = False: True: 6, 7, 8, 13    False: 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise.
The number of misclassified examples from the training set is 5.
ASSUME IT'S C

C = True:  True: 6, 8, 9, 10, 13   False: 1, 3, 4
C = False: True: 7                 False: 2, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise.
The number of misclassified examples from the training set is 4.
ASSUME IT'S D

D = True:  True: 7, 10, 13   False: 3, 5
D = False: True: 6, 8, 9     False: 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise.
The number of misclassified examples from the training set is 5.
ASSUME IT'S E

E = True:  True: 8, 9, 10, 13   False: 1, 3, 5, 12
E = False: True: 6, 7           False: 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome.
The number of misclassified examples from the training set is 6.
ASSUME IT'S E
(same tally as the previous slide)
So, the best predicate to test is A.
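The tallies on the preceding slides can be reproduced mechanically. A minimal sketch (function and variable names are mine, not from the slides): for each predicate, split the training set on its value and count the examples the per-branch majority rule misclassifies.

```python
# Rows are (A, B, C, D, E, CONCEPT) for examples 1..13, as 0/1.
data = [
    (0,0,1,0,1,0), (0,1,0,0,0,0), (0,1,1,1,1,0), (0,0,1,0,0,0),
    (0,0,0,1,1,0), (1,0,1,0,0,1), (1,0,0,1,0,1), (1,0,1,0,1,1),
    (1,1,1,0,1,1), (1,1,1,1,1,1), (1,1,0,0,0,0), (1,1,0,0,1,0),
    (1,0,1,1,1,1),
]

def errors_if_split(attr):
    """# misclassified when testing only `attr`, majority rule per branch."""
    err = 0
    for v in (0, 1):
        labels = [row[5] for row in data if row[attr] == v]
        if labels:
            err += min(labels.count(0), labels.count(1))
    return err

counts = {name: errors_if_split(i) for i, name in enumerate("ABCDE")}
# Matches the slides: A -> 2, B -> 5, C -> 4, D -> 5, E -> 6
```

The greedy algorithm therefore picks A, the predicate with the fewest misclassifications.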
CHOICE OF SECOND PREDICATE
Having tested A, we now choose the predicate for the A = True branch:

A?
├─ True → C?
│   ├─ True → True: 6, 8, 9, 10, 13
│   └─ False → True: 7   False: 11, 12 (report False)
└─ False → False

The number of misclassified examples from the training set is 1.
CHOICE OF THIRD PREDICATE
For the A = True, C = False branch we test B:

A?
├─ True → C?
│   ├─ True → True
│   └─ False → B?
│       ├─ True → False: 11, 12
│       └─ False → True: 7
└─ False → False
FINAL TREE

Learned tree:

A?
├─ True → C?
│   ├─ True → True
│   └─ False → B?
│       ├─ True → False
│       └─ False → True
└─ False → False

CONCEPT ⇔ A ∧ (C ∨ ¬B)

Original tree:

A?
├─ True → B?
│   ├─ True → True
│   └─ False → C?
│       ├─ True → True
│       └─ False → False
└─ False → False

CONCEPT ⇔ A ∧ (B ∨ C)
TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates − A),
   - right branch is DTL(D−A, Predicates − A)
where D+A is the subset of examples that satisfy A and D−A the subset that do not.
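The DTL procedure above can be sketched as follows. This is an illustrative reimplementation, not the slides' own code: the data representation and names are mine, and step 3 already uses the majority-rule fallback that the slides introduce for noisy data.

```python
# An example is (assignment dict, label); a tree is True, False, or
# (predicate, subtree_if_true, subtree_if_false).

def branch_errors(examples):
    """Misclassifications under majority rule for one branch."""
    ys = [y for _, y in examples]
    return min(ys.count(True), ys.count(False))

def dtl(examples, predicates):
    if not examples:
        return False                       # empty branch: default answer
    ys = [y for _, y in examples]
    if all(ys):
        return True                        # step 1
    if not any(ys):
        return False                       # step 2
    if not predicates:
        # step 3, amended: majority rule instead of failure
        return ys.count(True) >= ys.count(False)
    # step 4: greedy, error-minimizing predicate
    def split_errors(p):
        return (branch_errors([(x, y) for x, y in examples if x[p]]) +
                branch_errors([(x, y) for x, y in examples if not x[p]]))
    a = min(predicates, key=split_errors)
    rest = [p for p in predicates if p != a]
    return (a,
            dtl([(x, y) for x, y in examples if x[a]], rest),      # D+A
            dtl([(x, y) for x, y in examples if not x[a]], rest))  # D-A

# The slides' 13-example training set (A, B, C, D, E, CONCEPT):
rows = [
    (0,0,1,0,1,0), (0,1,0,0,0,0), (0,1,1,1,1,0), (0,0,1,0,0,0),
    (0,0,0,1,1,0), (1,0,1,0,0,1), (1,0,0,1,0,1), (1,0,1,0,1,1),
    (1,1,1,0,1,1), (1,1,1,1,1,1), (1,1,0,0,0,0), (1,1,0,0,1,0),
    (1,0,1,1,1,1),
]
examples = [({n: bool(v) for n, v in zip("ABCDE", r[:5])}, bool(r[5]))
            for r in rows]
tree = dtl(examples, list("ABCDE"))
```

On this training set the sketch reproduces the slides' final tree: root A, then C, then B.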
TOP-DOWN INDUCTION OF A DT
(same DTL procedure as above)
Noise in the training set! May return majority rule instead of failure.
TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates − A),
   - right branch is DTL(D−A, Predicates − A)
COMMENTS
• Widely used algorithm
• Easy to extend to k-class classification
• Greedy
• Robust to noise (incorrect examples)
• Not incremental
HUMAN-READABILITY
DTs also have the advantage of being easily understood by humans. This is a legal requirement in many areas:
• Loans & mortgages
• Health insurance
• Welfare
LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly in DTs:
• Parity(x) = X1 xor X2 xor … xor Xn
• Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
These need trees of exponential size in the # of attributes, and an exponential # of examples to learn exactly.
The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.
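The parity claim can be checked directly: on the full example set, no single-attribute split reduces the training error at all, which is exactly what defeats the greedy chooser. A small sketch for n = 4 (the enumeration and names are mine):

```python
from itertools import product

n = 4
# All 2^n bit vectors, labeled by their parity.
examples = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n)]

split_errors = []
for i in range(n):
    err = 0
    for v in (0, 1):
        ys = [y for x, y in examples if x[i] == v]
        err += min(ys.count(0), ys.count(1))  # majority rule per branch
    split_errors.append(err)
# Every branch is an exact 50/50 mix, so each split still misclassifies
# 8 of the 16 examples, no better than making no split at all.
```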
PERFORMANCE ISSUES
Assessing performance:
• Training set and test set
• Learning curve
[Figure: typical learning curve, plotting % correct on test set (up to 100) against size of training set]
PERFORMANCE ISSUES
(typical learning curve as above)
Some concepts are unrealizable within a machine's capacity.
PERFORMANCE ISSUES
Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
[Figure: typical learning curve]
PERFORMANCE ISSUES
Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
• Tree pruning: terminate recursion when the # of errors (or the information gain) is small
PERFORMANCE ISSUES
(as above)
The resulting decision tree + majority rule may not classify correctly all examples in the training set.
STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE
• There may be few training examples that match the path leading to a deep node in the decision tree
• More susceptible to choosing irrelevant/incorrect attributes when the sample is small
• Idea: make a statistical estimate of predictive power (which increases with larger samples), and prune branches with low predictive power
• Chi-squared pruning
TOP-DOWN DT PRUNING
• Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
• At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk
• Chi-squared statistical significance test:
  • Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)
  • Alternate hypothesis: examples not randomly chosen (X is relevant)
• Prune X if testing X is not statistically significant
CHI-SQUARED TEST
Let Z = Σi [(pi − pi′)² / pi′ + (ni − ni′)² / ni′]
where pi′ = p(pi + ni)/(p + n) and ni′ = n(pi + ni)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds.
Z is a statistic that is approximately drawn from the chi-squared distribution with k degrees of freedom.
Look up the p-value of Z in a table, and prune if p-value > α for some α (usually ~0.05).
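The statistic takes only a few lines to compute. A sketch with a hypothetical helper (not a library routine); the 3.84 critical value mentioned in the comment assumes one degree of freedom at α = 0.05:

```python
def chi2_statistic(leaves):
    """Z for a candidate split; `leaves` is a list of (p_i, n_i) counts."""
    p = sum(pi for pi, _ in leaves)
    n = sum(ni for _, ni in leaves)
    z = 0.0
    for pi, ni in leaves:
        pe = p * (pi + ni) / (p + n)   # expected positives under the null
        ne = n * (pi + ni) / (p + n)   # expected negatives under the null
        z += (pi - pe) ** 2 / pe + (ni - ne) ** 2 / ne
    return z

# A perfectly informative binary split of 6 positives and 6 negatives:
z_good = chi2_statistic([(6, 0), (0, 6)])   # well above ~3.84: keep the split
# A split whose branches mirror the node's own 50/50 distribution:
z_bad = chi2_statistic([(3, 3), (3, 3)])    # Z = 0: prune
```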
PERFORMANCE ISSUES
Assessing performance:
• Training set and test set
• Learning curve
• Overfitting: tree pruning
• Incorrect examples
• Missing data
• Multi-valued and continuous attributes
MULTI-VALUED ATTRIBUTES
• Simple change: consider splits on all values A can take on
• Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
  • More values ⇒ the dataset is split into smaller example sets when picking attributes
  • Smaller example sets ⇒ more likely to fit well to spurious noise
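The caveat is easy to demonstrate in the extreme case: an attribute with one distinct value per example (say, a record ID) drives the training error to zero on arbitrary labels. A sketch with made-up data:

```python
import random

random.seed(0)
labels = [random.randint(0, 1) for _ in range(20)]   # pure noise
ids = list(range(20))                                # 20 distinct values

err = 0
for v in set(ids):
    ys = [labels[i] for i, x in enumerate(ids) if x == v]
    err += min(ys.count(0), ys.count(1))             # majority rule per branch
# err == 0: each branch holds a single example, so the many-valued split
# "explains" random labels perfectly despite being irrelevant.
```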
CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds: X becomes X < a.
When considering splitting on X, pick the threshold a to minimize the # of errors / entropy.
[Figure: examples sorted along X, with the # of errors for each candidate threshold: 7 7 6 5 6 5 4 5 4 3 4 5 4 5 6 7]
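Threshold selection as described can be sketched like this (illustrative data, not the slide's; placing candidate thresholds at midpoints between consecutive sorted values is one common convention):

```python
def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return (a, errors)."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no decision boundary between equal values
        a = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < a]
        right = [y for x, y in pairs if x >= a]
        # majority rule on each side of the threshold
        err = (min(left.count(0), left.count(1)) +
               min(right.count(0), right.count(1)))
        if err < best[1]:
            best = (a, err)
    return best

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1, 0, 1]
a, err = best_threshold(xs, ys)   # a = 3.5 leaves a single error
```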
DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: a split x1 ≥ 20 partitions the (x1, x2) plane into True and False regions]
DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: a second split, x2 ≥ 10, further partitions the plane]
DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: a third split, x2 ≥ 15, further refines the axis-aligned boundary]
DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples.
[Figure: the resulting decision boundary]
EXERCISE
With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting thresholds and maximum depth 1? 2? 3?
Describe the appearance and the complexity of these decision boundaries.
READING
Next class: Neural networks & function learning (R&N 18.6-7)