CS B351: DECISION TREES


Page 1: CS B351: Decision Trees

CS B351: DECISION TREES

Page 2: CS B351: Decision Trees

AGENDA
Decision trees
Learning curves
Combatting overfitting

Page 3: CS B351: Decision Trees

CLASSIFICATION TASKS
Supervised learning setting
The target function f(x) takes on values True and False
An example is positive if f is True, else it is negative
The set X of all possible examples is the example set
The training set is a subset of X (a small one!)

Page 4: CS B351: Decision Trees

LOGICAL CLASSIFICATION DATASET
Here, examples (x, f(x)) take on discrete values

Page 5: CS B351: Decision Trees

LOGICAL CLASSIFICATION DATASET
Here, examples (x, f(x)) take on discrete values
Note that the training set does not say whether an observable predicate is pertinent or not

Page 6: CS B351: Decision Trees

LOGICAL CLASSIFICATION TASK
Find a representation of CONCEPT in the form:

CONCEPT(x) ⇔ S(A, B, …)

where S(A, B, …) is a sentence built with the observable attributes, e.g.:

CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

Page 7: CS B351: Decision Trees

PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree figure: test A? If False, return False; if True, test B? If False, return True; if True, test C? If True, return True; if False, return False]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
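In code, a tree like this is just a sequence of nested attribute tests. A minimal Python sketch of this particular tree, using the mushroom attributes defined above (the function name and signature are illustrative, not from the slides):

def poisonous(yellow, big, spotted):
    # Decision tree for CONCEPT <=> A and (not B or C),
    # with A = YELLOW, B = BIG, C = SPOTTED.
    if not yellow:       # A is False: not poisonous
        return False
    if not big:          # yellow and small: poisonous
        return True
    return spotted       # yellow and big: poisonous iff spotted

For instance, poisonous(True, True, False) returns False, which matches the A=True, B=True, C=False path of the tree.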

Page 8: CS B351: Decision Trees

PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree figure: the same tree as on the previous slide]

Example: a mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY

Page 9: CS B351: Decision Trees

TRAINING SET

Ex. #  A      B      C      D      E      CONCEPT
1      False  False  True   False  True   False
2      False  True   False  False  False  False
3      False  True   True   True   True   False
4      False  False  True   False  False  False
5      False  False  False  True   True   False
6      True   False  True   False  False  True
7      True   False  False  True   False  True
8      True   False  True   False  True   True
9      True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True

Page 10: CS B351: Decision Trees

[Training set table repeated from the previous slide]

POSSIBLE DECISION TREE

[Figure: a larger decision tree over the attributes D, E, A, C, B that classifies all 13 training examples correctly]

Page 11: CS B351: Decision Trees

POSSIBLE DECISION TREE

[Figure: the larger decision tree from the previous slide, shown next to the small tree from slide 7]

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

CONCEPT ⇔ A ∧ (¬B ∨ C)

Page 12: CS B351: Decision Trees

POSSIBLE DECISION TREE

[Figure: the two decision trees from the previous slide]

CONCEPT ⇔ A ∧ (¬B ∨ C)

KIS bias: build the smallest decision tree
Computationally intractable problem ⇒ greedy algorithm

CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

Page 13: CS B351: Decision Trees

GETTING STARTED: TOP-DOWN INDUCTION OF DECISION TREE

Ex. # A B C D E CONCEPT

1 False False True False True False

2 False True False False False False

3 False True True True True False

4 False False True False False False

5 False False False True True False

6 True False True False False True

7 True False False True False True

8 True False True False True True

9 True True True False True True

10 True True True True True True

11 True True False False False False

12 True True False False True False

13 True False True True True True

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Page 14: CS B351: Decision Trees

GETTING STARTED: TOP-DOWN INDUCTION OF DECISION TREE

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12

Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13.

Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error (i.e., the # of misclassified examples in the training set)? ⇒ Greedy algorithm

Page 15: CS B351: Decision Trees

ASSUME IT’S A

[Figure: split on A. A=True branch: True examples 6, 7, 8, 9, 10, 13; False examples 11, 12. A=False branch: no True examples; False examples 1, 2, 3, 4, 5]

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise.

The number of misclassified examples from the training set is 2.

Page 16: CS B351: Decision Trees

ASSUME IT’S B

[Figure: split on B. B=True branch: True examples 9, 10; False examples 2, 3, 11, 12. B=False branch: True examples 6, 7, 8, 13; False examples 1, 4, 5]

If we test only B, we will report that CONCEPT is False if B is True and True otherwise.

The number of misclassified examples from the training set is 5.

Page 17: CS B351: Decision Trees

ASSUME IT’S C

[Figure: split on C. C=True branch: True examples 6, 8, 9, 10, 13; False examples 1, 3, 4. C=False branch: True example 7; False examples 2, 5, 11, 12]

If we test only C, we will report that CONCEPT is True if C is True and False otherwise.

The number of misclassified examples from the training set is 4.

Page 18: CS B351: Decision Trees

ASSUME IT’S D

[Figure: split on D. D=True branch: True examples 7, 10, 13; False examples 3, 5. D=False branch: True examples 6, 8, 9; False examples 1, 2, 4, 11, 12]

If we test only D, we will report that CONCEPT is True if D is True and False otherwise.

The number of misclassified examples from the training set is 5.

Page 19: CS B351: Decision Trees

ASSUME IT’S E

[Figure: split on E. E=True branch: True examples 8, 9, 10, 13; False examples 1, 3, 5, 12. E=False branch: True examples 6, 7; False examples 2, 4, 11]

If we test only E, we will report that CONCEPT is False, independent of the outcome.

The number of misclassified examples from the training set is 6.

Page 20: CS B351: Decision Trees

ASSUME IT’S E

[Figure: the same split on E as on the previous slide]

If we test only E, we will report that CONCEPT is False, independent of the outcome.

The number of misclassified examples from the training set is 6.

So, the best predicate to test is A
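This comparison can be reproduced mechanically. A small Python sketch: the rows transcribe the training-set table from slide 9, and the helper counts, for each attribute, the misclassifications made by the majority rule on each branch (the helper itself is an illustrative addition, not part of the slides):

# Training set from slide 9: (A, B, C, D, E, CONCEPT) for examples 1-13.
T, F = True, False
DATA = [
    (F, F, T, F, T, F), (F, T, F, F, F, F), (F, T, T, T, T, F),   # ex. 1-3
    (F, F, T, F, F, F), (F, F, F, T, T, F), (T, F, T, F, F, T),   # ex. 4-6
    (T, F, F, T, F, T), (T, F, T, F, T, T), (T, T, T, F, T, T),   # ex. 7-9
    (T, T, T, T, T, T), (T, T, F, F, F, F), (T, T, F, F, T, F),   # ex. 10-12
    (T, F, T, T, T, T),                                           # ex. 13
]

def errors_if_split_on(attr_index):
    # Misclassifications if we test only this attribute and apply the
    # majority rule on each branch: the minority label on each branch is wrong.
    total = 0
    for branch_value in (True, False):
        labels = [label for *attrs, label in DATA if attrs[attr_index] == branch_value]
        positives = sum(labels)
        total += min(positives, len(labels) - positives)
    return total

for idx, name in enumerate("ABCDE"):
    print(name, errors_if_split_on(idx))
# Prints A 2, B 5, C 4, D 5, E 6, matching slides 15-20: A minimizes the training error.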

Page 21: CS B351: Decision Trees

CHOICE OF SECOND PREDICATE

[Figure: A at the root. The A=False branch is labeled False; the A=True branch tests C. C=True: True examples 6, 8, 9, 10, 13. C=False: True example 7; False examples 11, 12]

The number of misclassified examples from the training set is 1.

Page 22: CS B351: Decision Trees

CHOICE OF THIRD PREDICATE

[Figure: below the A=True, C=False branch, test B. B=True: False examples 11, 12 (classify False). B=False: True example 7 (classify True)]

Page 23: CS B351: Decision Trees

FINAL TREE

[Figure: the learned tree. Test A? If False, return False; if True, test C. C? If True, return True; if False, test B. B? If True, return False; if False, return True. Shown next to the original tree from slide 7]

CONCEPT ⇔ A ∧ (C ∨ ¬B)    CONCEPT ⇔ A ∧ (¬B ∨ C)

Page 24: CS B351: Decision Trees

TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates-A),
   - right branch is DTL(D-A, Predicates-A)

[Figure: the final tree; D+A denotes the subset of examples that satisfy A]

Page 25: CS B351: Decision Trees

TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return failure
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates-A),
   - right branch is DTL(D-A, Predicates-A)

[Figure: the final tree]

Noise in the training set! May return the majority rule instead of failure.

Page 26: CS B351: Decision Trees

TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates-A),
   - right branch is DTL(D-A, Predicates-A)

[Figure: the final tree]
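A minimal Python sketch of the DTL procedure above, with the majority-rule fallback from this slide. Examples are (attribute-dict, label) pairs and the returned tree is a nested (predicate, true-branch, false-branch) tuple; these representation choices are illustrative, not part of the slides:

def majority(examples):
    # Most common label among (attributes, label) pairs; ties default to True.
    labels = [label for _, label in examples]
    return sum(labels) >= len(labels) / 2

def split_errors(examples, predicate):
    # Misclassifications of the majority rule on each branch after testing the predicate.
    errors = 0
    for value in (True, False):
        branch = [ex for ex in examples if ex[0][predicate] == value]
        if branch:
            maj = majority(branch)
            errors += sum(1 for _, label in branch if label != maj)
    return errors

def dtl(examples, predicates):
    labels = {label for _, label in examples}
    if labels == {True}:
        return True
    if labels == {False}:
        return False
    if not predicates:
        return majority(examples)                  # majority rule instead of failure
    a = min(predicates, key=lambda p: split_errors(examples, p))  # error-minimizing predicate
    pos = [ex for ex in examples if ex[0][a]]      # D+A: examples that satisfy A
    neg = [ex for ex in examples if not ex[0][a]]  # D-A: examples that do not
    rest = [p for p in predicates if p != a]
    left = dtl(pos, rest) if pos else majority(examples)
    right = dtl(neg, rest) if neg else majority(examples)
    return (a, left, right)

Run on the 13-example training set of slide 9 (each row encoded as a dict over the predicates A..E), this returns ('A', ('C', True, ('B', False, True)), False), i.e. the final tree of slide 23.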

Page 27: CS B351: Decision Trees

COMMENTS
Widely used algorithm
Easy to extend to k-class classification
Greedy
Robust to noise (incorrect examples)
Not incremental

Page 28: CS B351: Decision Trees

HUMAN-READABILITY
DTs also have the advantage of being easily understood by humans
Legal requirement in many areas:
Loans & mortgages
Health insurance
Welfare

Page 29: CS B351: Decision Trees

LEARNABLE CONCEPTS
Some simple concepts cannot be represented compactly in DTs
Parity(x) = X1 xor X2 xor … xor Xn
Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
Exponential size in # of attributes
Need exponential # of examples to learn exactly
The ease of learning is dependent on shrewdly (or luckily) chosen attributes that correlate with CONCEPT

Page 30: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve

[Figure: typical learning curve; % correct on the test set (up to 100%) vs. size of the training set]
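One way to produce such a curve is to train on increasingly large prefixes of the training set and measure accuracy on a fixed held-out test set. A sketch using scikit-learn's decision-tree learner and one of its bundled datasets; both are stand-ins chosen for illustration, not the course's data:

# Illustrative learning curve: accuracy on a fixed test set vs. training-set size.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (10, 25, 50, 100, 200, len(X_train)):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    accuracy = clf.score(X_test, y_test)          # fraction correct on the test set
    print(f"training size {m:4d}: {100 * accuracy:.1f}% correct")

As the slide's curve suggests, accuracy typically rises quickly at first and then flattens as the training set grows.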

Page 31: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve

[Figure: typical learning curve]

Some concepts are unrealizable within a machine's capacity

Page 32: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve
Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set

[Figure: typical learning curve]

Page 33: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve
Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
Tree pruning: terminate recursion when the # of errors (or information gain) is small

Page 34: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve
Overfitting: risk of using irrelevant observable predicates to generate a hypothesis that agrees with all examples in the training set
Tree pruning: terminate recursion when the # of errors (or information gain) is small
The resulting decision tree + majority rule may not classify all examples in the training set correctly

Page 35: CS B351: Decision Trees

STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE
There may be few training examples that match the path leading to a deep node in the decision tree
More susceptible to choosing irrelevant/incorrect attributes when the sample is small
Idea:
Make a statistical estimate of predictive power (which increases with larger samples)
Prune branches with low predictive power
Chi-squared pruning

Page 36: CS B351: Decision Trees

TOP-DOWN DT PRUNING
Consider an inner node X that by itself (majority rule) predicts p examples correctly and n examples incorrectly
At the k leaf nodes below X, the numbers of correct/incorrect examples are p1/n1, …, pk/nk
Chi-squared statistical significance test:
Null hypothesis: example labels randomly chosen with distribution p/(p+n) (X is irrelevant)
Alternative hypothesis: examples not randomly chosen (X is relevant)
Prune X if testing X is not statistically significant

Page 37: CS B351: Decision Trees

CHI-SQUARED TEST
Let Z = Σi [ (pi - pi')² / pi' + (ni - ni')² / ni' ]

where pi' = p (pi + ni)/(p + n) and ni' = n (pi + ni)/(p + n) are the expected numbers of true/false examples at leaf node i if the null hypothesis holds

Z is a statistic that is approximately drawn from the chi-squared distribution with k degrees of freedom

Look up the p-value of Z from a table; prune if the p-value > α for some α (usually ≈ 0.05)
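A sketch of this test for a single inner node, assuming SciPy for the chi-squared distribution (the function name and the (pi, ni) bookkeeping are illustrative):

from scipy.stats import chi2

def should_prune(p, n, leaf_counts, alpha=0.05):
    # Chi-squared pruning test for an inner node X.
    #   p, n        : correctly / incorrectly predicted examples reaching X (both > 0)
    #   leaf_counts : list of (p_i, n_i) pairs, one per leaf below X
    # Returns True if the split at X is NOT statistically significant.
    z = 0.0
    for p_i, n_i in leaf_counts:
        # Expected counts at this leaf under the null hypothesis (X is irrelevant):
        p_exp = p * (p_i + n_i) / (p + n)
        n_exp = n * (p_i + n_i) / (p + n)
        z += (p_i - p_exp) ** 2 / p_exp + (n_i - n_exp) ** 2 / n_exp
    k = len(leaf_counts)
    p_value = chi2.sf(z, df=k)   # df = k as stated on the slide; some presentations use k - 1
    return p_value > alpha       # large p-value: deviation looks like noise, so prune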

Page 38: CS B351: Decision Trees

PERFORMANCE ISSUES
Assessing performance:
Training set and test set
Learning curve
Overfitting
Tree pruning
Incorrect examples
Missing data
Multi-valued and continuous attributes

Page 39: CS B351: Decision Trees

MULTI-VALUED ATTRIBUTES
Simple change: consider splits on all values A can take on
Caveat: the more values A can take on, the more important it may appear to be, even if it is irrelevant
More values => dataset split into smaller example sets when picking attributes
Smaller example sets => more likely to fit well to spurious noise

Page 40: CS B351: Decision Trees

CONTINUOUS ATTRIBUTES
Continuous attributes can be converted into logical ones via thresholds
X => X < a
When considering splitting on X, pick the threshold a to minimize the # of errors / entropy

[Figure: examples sorted by the value of X, annotated with the number of misclassifications for each candidate threshold]
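Picking a can be done by sorting the examples on X and scoring the midpoint between each pair of consecutive distinct values. A minimal sketch using the error count as the criterion (entropy would be a drop-in replacement); the names are illustrative:

def best_threshold(values, labels):
    # Pick a threshold a for the test X < a that minimizes training errors,
    # applying the majority rule on each side of the split.
    #   values : the continuous attribute X for each example
    #   labels : the corresponding True/False concept values
    # Returns (threshold, errors), or None if X has fewer than two distinct values.
    pairs = list(zip(values, labels))
    xs = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]  # midpoints between distinct values
    best = None
    for a in candidates:
        errors = 0
        for side in ([lab for x, lab in pairs if x < a],
                     [lab for x, lab in pairs if x >= a]):
            if side:
                positives = sum(side)
                errors += min(positives, len(side) - positives)  # minority label is misclassified
        if best is None or errors < best[1]:
            best = (a, errors)
    return best

For example, best_threshold([7, 3, 9, 1], [True, False, True, False]) returns (5.0, 0): the test X < 5.0 separates the two classes with no errors.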

Page 41: CS B351: Decision Trees

DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: the (x1, x2) plane split by the single test x1 >= 20 into a False region and a True region]

Page 42: CS B351: Decision Trees

DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: the same (x1, x2) plane split by the tests x1 >= 20 and x2 >= 10, giving axis-aligned True/False regions]

Page 43: CS B351: Decision Trees

DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

[Figure: a deeper tree with the tests x1 >= 20, x2 >= 10, and x2 >= 15, giving a finer partition of the plane into axis-aligned True/False regions]

Page 44: CS B351: Decision Trees

DECISION BOUNDARIES
With continuous attributes, a decision boundary is the surface in example space that splits positive from negative examples

Page 45: CS B351: Decision Trees

EXERCISE
With 2 attributes, what kinds of decision boundaries can be achieved by a decision tree with arbitrary splitting thresholds and maximum depth 1? 2? 3?
Describe the appearance and the complexity of these decision boundaries

Page 46: CS B351: Decision Trees

READING
Next class: neural networks & function learning
R&N 18.6-7