CS B551: DECISION TREES
AGENDA
• Decision trees
• Complexity
• Learning curves
• Combating overfitting
• Boosting
RECAP
Still in supervised setting with logical attributes
Find a representation of CONCEPT in the form:
CONCEPT(x) ⇔ S(A,B,…)
where S(A,B,…) is a sentence built with the observable attributes, e.g.:
CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:
[Decision tree: A? (False → False); True → B? (False → True); True → C? (True → True, False → False)]
Example: a mushroom is poisonous iff it is yellow and small, or yellow, big, and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
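To make the correspondence concrete, here is a minimal Python sketch (not from the slides; the function names are mine) writing the mushroom concept both as the logical formula and as the equivalent nested-conditional decision tree, and checking that the two agree:

from itertools import product

def concept(yellow: bool, big: bool, spotted: bool) -> bool:
    # CONCEPT(x) <=> A(x) and (not B(x) or C(x))
    return yellow and (not big or spotted)

def tree(yellow: bool, big: bool, spotted: bool) -> bool:
    # The same predicate written as the decision tree: test A, then B, then C.
    if not yellow:       # A?
        return False
    if not big:          # B?
        return True
    return spotted       # C?

# The two representations agree on all 8 attribute combinations.
assert all(concept(*x) == tree(*x) for x in product([False, True], repeat=3))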
Two more attributes of a mushroom are also observable:
• D = FUNNEL-CAP
• E = BULKY
TRAINING SET
Ex. # A B C D E CONCEPT
1 False False True False True False
2 False True False False False False
3 False True True True True False
4 False False True False False False
5 False False False True True False
6 True False True False False True
7 True False False True False True
8 True False True False True True
9 True True True False True True
10 True True True True True True
11 True True False False False False
12 True True False False True False
13 True False True True True True
POSSIBLE DECISION TREE
[Decision tree diagram: root D?; the True branch tests E, then A; the False branch tests C, then B, then E, then A]
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))
Compare with the smaller tree shown earlier, which computes CONCEPT ⇔ A ∧ (¬B ∨ C).
KIS bias: build the smallest decision tree.
Finding the smallest tree is a computationally intractable problem ⇒ use a greedy algorithm.
TOP-DOWN INDUCTION OF A DT
DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates - A),
   - right branch is DTL(D-A, Predicates - A)
(D+A / D-A denote the examples in D for which A is True / False)
[Induced decision tree: A? (False → False); True → C? (True → True); False → B? (True → False, False → True)]
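A runnable Python sketch of the DTL procedure above, under my own assumptions about the data format: an example is a pair (attribute-to-Boolean dict, label), and a tree is either a Boolean leaf or a triple (attribute, true branch, false branch).

def dtl(examples, predicates, default=False):
    if not examples:
        return default                                  # empty branch: fall back on parent majority
    labels = [lbl for _, lbl in examples]
    if all(labels):
        return True                                     # step 1
    if not any(labels):
        return False                                    # step 2
    majority = labels.count(True) >= labels.count(False)
    if not predicates:
        return majority                                 # step 3: majority rule
    def errors(a):                                      # mistakes left after splitting on a
        err = 0
        for branch in (True, False):
            sub = [lbl for x, lbl in examples if x[a] == branch]
            err += min(sub.count(True), sub.count(False))
        return err
    a = min(predicates, key=errors)                     # step 4: error-minimizing predicate
    rest = [p for p in predicates if p != a]
    pos = [(x, lbl) for x, lbl in examples if x[a]]
    neg = [(x, lbl) for x, lbl in examples if not x[a]]
    return (a, dtl(pos, rest, majority), dtl(neg, rest, majority))   # step 5

Run on the 13-example training set above with predicates A, B, C, D, E, it reproduces the induced tree sketched above (testing A, then C, then B).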
COMMENTS
• Widely used algorithm
• Greedy
• Robust to noise (incorrect examples)
• Not incremental
LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly in DTs:
• Parity(x) = X1 xor X2 xor … xor Xn
• Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
These need trees of exponential size in the # of attributes, and an exponential # of examples to learn them exactly.
The ease of learning depends on shrewdly (or luckily) chosen attributes that correlate with CONCEPT.
MISCELLANEOUS ISSUES
Assessing performance:
• Training set and test set
• Learning curve
[Figure: typical learning curve, % correct on test set vs. size of training set]
Some concepts are unrealizable within a machine’s capacity
Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
Tree pruning: terminate the recursion when the number of errors (or the information gain) is small
The resulting decision tree + majority rule may then not classify all examples in the training set correctly
Other issues:
• Incorrect examples
• Missing data
• Multi-valued and continuous attributes
USING INFORMATION THEORY
Rather than minimizing the probability of error, minimize the expected number of questions needed to decide if an object x satisfies CONCEPT.
Use the information-theoretic quantity known as information gain.
Split on the variable with the highest information gain.
ENTROPY / INFORMATION GAIN
Entropy encodes the quantity of uncertainty in a random variable:
H(X) = -Σ_{x∈Val(X)} P(x) log P(x)
Properties:
• H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
• H(X) > 0 if X is not known with certainty
• H(X) is maximal if P(X) is the uniform distribution
Information gain measures the reduction in uncertainty in X given knowledge of Y:
I(X,Y) = H(X) - E_y[H(X|Y=y)] = Σ_y P(y) Σ_x P(x|y) log P(x|y) - Σ_x P(x) log P(x)
Properties:
• Always nonnegative
• = 0 if X and Y are independent
• If Y is a choice, maximizing the information gain ⇔ minimizing E_y[H(X|Y)]
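As an illustration, a small Python sketch of these quantities for Boolean variables (base-2 logarithms; the data format, a list of (attribute dict, label) pairs, is my own assumption):

from math import log2

def entropy(p_true):
    # H(X) for a Boolean X with P(X = True) = p_true.
    return -sum(p * log2(p) for p in (p_true, 1 - p_true) if p > 0)

def information_gain(examples, attribute):
    # I(CONCEPT, attribute) estimated from a list of (attribute_dict, label) pairs.
    n = len(examples)
    h_prior = entropy(sum(lbl for _, lbl in examples) / n)
    h_cond = 0.0
    for branch in (True, False):
        sub = [lbl for x, lbl in examples if x[attribute] == branch]
        if sub:
            h_cond += len(sub) / n * entropy(sum(sub) / len(sub))
    return h_prior - h_cond    # nonnegative; 0 iff the split is uninformative on this sample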
MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES
E_y[H(X|Y)] = -Σ_y P(y) Σ_x P(x|y) log P(x|y)
• Let n be the # of examples
• Let n+, n- be the # of examples on the True/False branches of Y
• Let p+, p- be the accuracy on the True/False branches of Y
• P(correct) = (p+ n+ + p- n-)/n, P(correct|Y) = p+, P(correct|¬Y) = p-
Then E_y[H(X|Y)] ∝ -( n+ [p+ log p+ + (1-p+) log (1-p+)] + n- [p- log p- + (1-p-) log (1-p-)] )
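The same split score written directly in terms of n+, n-, p+, p-, as a short sketch (log 0 is treated as 0; by the property above, choosing the split that minimizes this value is equivalent to maximizing information gain):

from math import log2

def plogp(p):
    return p * log2(p) if p > 0 else 0.0

def expected_conditional_entropy(n_pos, p_pos, n_neg, p_neg):
    # E_y[H(X|Y)] up to the constant factor 1/n: lower means a better split.
    h_pos = -(plogp(p_pos) + plogp(1 - p_pos))   # entropy on the True branch
    h_neg = -(plogp(p_neg) + plogp(1 - p_neg))   # entropy on the False branch
    return n_pos * h_pos + n_neg * h_neg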
STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE
• There may be few training examples that match the path leading to a deep node in the decision tree
• More susceptible to choosing irrelevant/incorrect attributes when the sample is small
Idea:
• Make a statistical estimate of predictive power (which increases with larger samples)
• Prune branches with low predictive power
Chi-squared pruning
TOP-DOWN DT PRUNING
Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly.
At its k leaf nodes, the numbers of correct/incorrect examples are p1/n1, …, pk/nk.
Chi-squared test:
• Null hypothesis: example labels are randomly chosen with distribution p/(p+n) (X is irrelevant)
• Alternate hypothesis: examples are not randomly chosen (X is relevant)
Let Z = Σ_i [ (pi - pi′)²/pi′ + (ni - ni′)²/ni′ ]
where pi′ = p(pi+ni)/(p+n) and ni′ = n(pi+ni)/(p+n) are the expected numbers of correct/incorrect examples at leaf node i if the null hypothesis holds.
Z is a statistic that is approximately drawn from the chi-squared distribution with k - 1 degrees of freedom.
Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05).
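A hedged sketch of this test (the function name and data layout are mine; it uses scipy.stats.chi2 for the tail probability rather than a table):

from scipy.stats import chi2

def should_prune(p, n, leaf_counts, alpha=0.05):
    # p, n: examples the node itself classifies correctly/incorrectly (majority rule).
    # leaf_counts: list of (p_i, n_i) correct/incorrect counts at its k leaf nodes.
    # Prune if we cannot reject the null hypothesis that the node's test is irrelevant.
    z = 0.0
    for pi, ni in leaf_counts:
        pi_exp = p * (pi + ni) / (p + n)     # expected correct under the null
        ni_exp = n * (pi + ni) / (p + n)     # expected incorrect under the null
        z += (pi - pi_exp) ** 2 / pi_exp + (ni - ni_exp) ** 2 / ni_exp
    p_value = chi2.sf(z, df=len(leaf_counts) - 1)   # upper-tail probability
    return p_value > alpha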
CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds, e.g. X ⇒ (X < a).
When considering splitting on X, pick the threshold a that minimizes the # of errors.
[Figure: candidate thresholds along the sorted values of X, with the # of errors each one yields]
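A small sketch of threshold selection by error minimization (names are mine; for simplicity it scores only the predicate X < a, not its complement):

def best_threshold(values, labels):
    # values: continuous attribute values; labels: Boolean concept labels.
    # Returns the threshold a minimizing the errors of predicting CONCEPT iff X < a.
    pts = sorted(zip(values, labels))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(x1 + x2) / 2 for (x1, _), (x2, _) in zip(pts, pts[1:]) if x1 != x2]
    if not candidates:                        # all values identical: nothing to split on
        return pts[0][0]
    def errors(a):
        return sum((v < a) != lbl for v, lbl in pts)
    return min(candidates, key=errors)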
APPLICATIONS OF DECISION TREES
• Medical diagnosis / drug design
• Evaluation of geological systems for assessing gas and oil basins
• Early detection of problems (e.g., jamming) during oil drilling operations
• Automatic generation of rules in expert systems
HUMAN-READABILITY
DTs also have the advantage of being easily understood by humans.
This is a legal requirement in many areas:
• Loans & mortgages
• Health insurance
• Welfare
ENSEMBLE LEARNING (BOOSTING)
IDEA
It may be difficult to search for a single hypothesis that explains the data
Construct multiple hypotheses (ensemble), and combine their predictions
“Can a set of weak learners construct a single strong learner?” – Michael Kearns, 1988
MOTIVATION
• 5 classifiers, each with 60% accuracy
• On a new example, run them all, and pick the prediction using majority voting
• If errors are independent, the combined classifier is correct whenever at least 3 of the 5 are: about 68% accuracy with 60%-accurate classifiers, and about 94% with 80%-accurate ones
• (In reality errors will not be independent, but we hope they will be mostly uncorrelated)
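The arithmetic behind these numbers, as a tiny Python check using the binomial distribution:

from math import comb

def majority_accuracy(p, n=5):
    # Probability that more than half of n independent classifiers,
    # each correct with probability p, are correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

print(majority_accuracy(0.6))   # ~0.68
print(majority_accuracy(0.8))   # ~0.94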
BOOSTING
Weighted training set
Ex. # Weight A B C D E CONCEPT
1 w1 False False True False True False
2 w2 False True False False False False
3 w3 False True True True True False
4 w4 False False True False False False
5 w5 False False False True True False
6 w6 True False True False False True
7 w7 True False False True False True
8 w8 True False True False True True
9 w9 True True True False True True
10 w10 True True True True True True
11 w11 True True False False False False
12 w12 True True False False True False
13 w13 True False True True True True
BOOSTING
1. Start with uniform weights wi = 1/N
2. Use learner 1 to generate hypothesis h1
3. Adjust weights to give higher importance to misclassified examples
4. Use learner 2 to generate hypothesis h2
5. …
6. Weight the hypotheses according to their performance, and return the weighted majority
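A minimal AdaBoost-style sketch of this loop using single-attribute decision stumps as the weak learners (variable names and data format are mine; the weight-update and vote-weight formulas are the standard AdaBoost ones referenced later in these slides):

from math import exp, log

def weighted_error(examples, weights, attr):
    # Stump hypothesis: predict CONCEPT = attr.
    return sum(w for (x, lbl), w in zip(examples, weights) if x[attr] != lbl)

def boost(examples, attributes, rounds):
    n = len(examples)
    weights = [1.0 / n] * n                         # start with uniform weights
    hypotheses = []                                 # list of (attribute, vote weight)
    for _ in range(rounds):
        attr = min(attributes, key=lambda a: weighted_error(examples, weights, a))
        err = weighted_error(examples, weights, attr)
        if err == 0.0 or err >= 0.5:                # perfect or useless weak hypothesis
            break
        alpha = 0.5 * log((1 - err) / err)          # hypothesis vote weight
        hypotheses.append((attr, alpha))
        # Raise weights of misclassified examples, lower the rest, renormalize.
        weights = [w * exp(alpha if x[attr] != lbl else -alpha)
                   for (x, lbl), w in zip(examples, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return hypotheses

def predict(hypotheses, x):
    vote = sum(alpha if x[attr] else -alpha for attr, alpha in hypotheses)
    return vote > 0                                 # weighted majority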
MUSHROOM EXAMPLE
“Decision stumps”: single-attribute DTs
Ex. # Weight A B C D E CONCEPT
1 1/13 False False True False True False
2 1/13 False True False False False False
3 1/13 False True True True True False
4 1/13 False False True False False False
5 1/13 False False False True True False
6 1/13 True False True False False True
7 1/13 True False False True False True
8 1/13 True False True False True True
9 1/13 True True True False True True
10 1/13 True True True True True True
11 1/13 True True False False False False
12 1/13 True True False False True False
13 1/13 True False True True True True
MUSHROOM EXAMPLE
Pick C first: learn CONCEPT = C
Ex. # Weight A B C D E CONCEPT
1 1/13 False False True False True False
2 1/13 False True False False False False
3 1/13 False True True True True False
4 1/13 False False True False False False
5 1/13 False False False True True False
6 1/13 True False True False False True
7 1/13 True False False True False True
8 1/13 True False True False True True
9 1/13 True True True False True True
10 1/13 True True True True True True
11 1/13 True True False False False False
12 1/13 True True False False True False
13 1/13 True False True True True True
MUSHROOM EXAMPLE
Update weights
Ex. # Weight A B C D E CONCEPT
1 .125 False False True False True False
2 .056 False True False False False False
3 .125 False True True True True False
4 .125 False False True False False False
5 .056 False False False True True False
6 .056 True False True False False True
7 .125 True False False True False True
8 .056 True False True False True True
9 .056 True True True False True True
10 .056 True True True True True True
11 .056 True True False False False False
12 .056 True True False False True False
13 .056 True False True True True True
MUSHROOM EXAMPLE
Next try A, learn CONCEPT=A
Ex. # Weight A B C D E CONCEPT
1 .125 False False True False True False
2 .056 False True False False False False
3 .125 False True True True True False
4 .125 False False True False False False
5 .056 False False False True True False
6 .056 True False True False False True
7 .125 True False False True False True
8 .056 True False True False True True
9 .056 True True True False True True
10 .056 True True True True True True
11 .056 True True False False False False
12 .056 True True False False True False
13 .056 True False True True True True
MUSHROOM EXAMPLE
Update weights
Ex. # Weight A B C D E CONCEPT
1 0.07 False False True False True False
2 0.03 False True False False False False
3 0.07 False True True True True False
4 0.07 False False True False False False
5 0.03 False False False True True False
6 0.03 True False True False False True
7 0.07 True False False True False True
8 0.03 True False True False True True
9 0.03 True True True False True True
10 0.03 True True True True True True
11 0.25 True True False False False False
12 0.25 True True False False True False
13 0.03 True False True True True True
MUSHROOM EXAMPLE
Next try E, learn CONCEPT=E
Ex. # Weight A B C D E CONCEPT
1 0.07 False False True False True False
2 0.03 False True False False False False
3 0.07 False True True True True False
4 0.07 False False True False False False
5 0.03 False False False True True False
6 0.03 True False True False False True
7 0.07 True False False True False True
8 0.03 True False True False True True
9 0.03 True True True False True True
10 0.03 True True True True True True
11 0.25 True True False False False False
12 0.25 True True False False True False
13 0.03 True False True True True True
MUSHROOM EXAMPLE
Update Weights…
Ex. # Weight A B C D E CONCEPT
1 0.07 False False True False True False
2 0.03 False True False False False False
3 0.07 False True True True True False
4 0.07 False False True False False False
5 0.03 False False False True True False
6 0.03 True False True False False True
7 0.07 True False False True False True
8 0.03 True False True False True True
9 0.03 True True True False True True
10 0.03 True True True True True True
11 0.25 True True False False False False
12 0.25 True True False False True False
13 0.03 True False True True True True
MUSHROOM EXAMPLE
• Final classifier, using the stumps in the order C, A, E, D, B
• Weights on the hypotheses are determined by their overall error
• Weighted majority weights: A = 2.1, B = 0.9, C = 0.8, D = 1.4, E = 0.09
• 100% accuracy on the training set
BOOSTING STRATEGIES
• The preceding weighting strategy is the popular AdaBoost algorithm (see R&N p. 667)
• Many other strategies exist
• Typically, as the number of hypotheses increases, accuracy increases as well
• Does this conflict with Occam's razor?
ANNOUNCEMENTS
• Next class: neural networks & function learning (R&N 18.6-7)
• HW3 graded, solutions online
• HW4 due today
• HW5 out today