Decision Trees and Decision Tree Learning
Philipp Kärger


Page 1:

Decision Trees and Decision Tree Learning

Philipp Kärger

Page 2:

Page 3:

Outline:

1. Decision Trees
2. Decision Tree Learning
   1. ID3 Algorithm
   2. Which attribute to split on?
   3. Some examples
3. Overfitting
4. Where to use Decision Trees?

Page 4:

Page 5:

Decision tree representation for PlayTennis

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes

Page 6:

Decision tree representation for PlayTennis

[Same tree as above; the label "Attribute" points at the internal nodes, e.g. Outlook.]

Page 7:

Decision tree representation for PlayTennis

[Same tree; the label "Value" points at the branch labels, e.g. Sunny, Overcast, Rain.]

Page 8:

Decision tree representation for PlayTennis

[Same tree; the label "Classification" points at the leaves, i.e. the Yes/No decisions.]

Page 9:

PlayTennis: other representations

• Logical expression for PlayTennis=Yes:

  (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

• If-then rules (a small code sketch of these rules follows below):

  – IF Outlook=Sunny ∧ Humidity=Normal THEN PlayTennis=Yes
  – IF Outlook=Overcast THEN PlayTennis=Yes
  – IF Outlook=Rain ∧ Wind=Weak THEN PlayTennis=Yes
  – IF Outlook=Sunny ∧ Humidity=High THEN PlayTennis=No
  – IF Outlook=Rain ∧ Wind=Strong THEN PlayTennis=No
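As a minimal sketch (not part of the original slides), the five rules can be transcribed into a small Python function; the function name play_tennis and the literal string values are assumptions made only for this illustration:

```python
def play_tennis(outlook, humidity, wind):
    """Follow the PlayTennis tree: test Outlook first, then Humidity or Wind."""
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook}")

print(play_tennis("Rain", "High", "Weak"))   # -> "Yes" (Humidity is irrelevant under Rain)
```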

Page 10:

Decision Trees - Summary

• a model of a part of the world

• allows us to classify instances (by performing a sequence of tests)

• allows us to predict classes of (unseen) instances

• understandable by humans (unlike many other representations)

Page 11:

Decision Tree Learning

Page 12:

• Goal: Learn from known instances how to classify unseen instances

• by means of building and exploiting a Decision Tree

• supervised or unsupervised learning?

Page 13:

Classification Task

[Figure: a learning algorithm induces a model (a decision tree) from the training set; applying the model to the test set is the deduction step. A code sketch of this loop follows below.]

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
   1   Yes      Large    125K     No
   2   No       Medium   100K     No
   3   No       Small     70K     No
   4   Yes      Medium   120K     No
   5   No       Large     95K     Yes
   6   No       Medium    60K     No
   7   Yes      Large    220K     No
   8   No       Small     85K     Yes
   9   No       Medium    75K     No
  10   No       Small     90K     Yes

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small     55K     ?
  12   Yes      Medium    80K     ?
  13   Yes      Large    110K     ?
  14   No       Small     95K     ?
  15   No       Large     67K     ?

Application: classification of medical patients by their disease. The training set corresponds to already seen patients; the learned decision tree gives rules telling which attributes of a patient indicate a disease; the test set corresponds to unseen patients, whose attributes are checked against those rules.
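As an illustration of the induction/deduction loop on the table above, here is a hedged sketch assuming scikit-learn is available; the integer coding of Attrib1 and Attrib2 is an arbitrary choice made only to keep the example short:

```python
from sklearn.tree import DecisionTreeClassifier

# training set (Tid 1-10): Attrib1, Attrib2, Attrib3 (in K), Class
train = [("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
         ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
         ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
         ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
         ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes")]
# test set (Tid 11-15): class unknown
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]

A1 = {"No": 0, "Yes": 1}
A2 = {"Small": 0, "Medium": 1, "Large": 2}

def encode(a1, a2, a3):
    return [A1[a1], A2[a2], a3]

X = [encode(*row[:3]) for row in train]
y = [row[3] for row in train]

model = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # induction: learn the model
print(model.predict([encode(*row) for row in test]))           # deduction: apply the model
```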

Page 14:

Basic algorithm: ID3 (simplified)

ID3 = Iterative Dichotomiser 3

- given a goal class to build the tree for
- create a root node for the tree
- if all examples from the training set belong to the same goal class C, then label the root with C
- else
  – select the ‘most informative’ attribute A
  – split the training set according to the values v1 … vn of A
  – recursively build the resulting subtrees T1 … Tn
  – generate the decision tree T: a root node labeled A with branches v1 … vn leading to the subtrees T1 … Tn

Example data (see the sketch below):

  A1=weather  A2=day  happy
  sun         odd     yes
  rain        odd     no
  rain        even    no
  sun         even    yes
  rain        odd     no
  sun         even    yes

[Example subtree: Humidity with branches High → No and Low → Yes]
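To make the pseudo-code concrete, here is a minimal, self-contained Python sketch of the simplified ID3 recursion applied to the weather/day example above; the helper names (entropy, best_attribute, id3) are assumptions of the sketch, not part of the original slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Pick the 'most informative' attribute: the one with the smallest residual information."""
    def residual_info(attr):
        total = len(rows)
        return sum(
            len([r for r in rows if r[attr] == v]) / total
            * entropy([r[target] for r in rows if r[attr] == v])
            for v in set(r[attr] for r in rows))
    return min(attributes, key=residual_info)

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                        # all examples share one class -> leaf
        return labels[0]
    if not attributes:                               # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attributes, target)     # split on the most informative attribute
    return {a: {v: id3([r for r in rows if r[a] == v],
                       [x for x in attributes if x != a], target)
                for v in set(r[a] for r in rows)}}

data = [{"weather": w, "day": d, "happy": h} for w, d, h in [
    ("sun", "odd", "yes"), ("rain", "odd", "no"), ("rain", "even", "no"),
    ("sun", "even", "yes"), ("rain", "odd", "no"), ("sun", "even", "yes")]]

print(id3(data, ["weather", "day"], "happy"))
# e.g. {'weather': {'sun': 'yes', 'rain': 'no'}}
```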

Page 15:

• Lessons learned:
  – there is always more than one decision tree
  – finding the “best” one is NP-complete
  – all the known algorithms use heuristics

• finding the right attribute A to split on is tricky

Page 16:

Search heuristics in ID3

• Which attribute should we split on?
• Need a heuristic: some function that gives big numbers for “good” splits
• Want to get to “pure” sets
• How can we measure “pure”?

[Figure: the happy-days data split by weather (sunny vs. rain) and, alternatively, by day (odd vs. even)]

Page 17:

Measuring Information: Entropy

• The average amount of information I needed to classify an object is given by the entropy measure:

  I = - \sum_{c} p(c) \log_2 p(c)    (sum over all classes; p(c) = probability of class c)

• For a two-class problem:

  [Figure: entropy plotted as a function of p(c); it is 0 for p(c) = 0 or 1 and maximal, 1 bit, at p(c) = 0.5]

Page 18:

• What is the entropy of the set of happy/unhappy days? (see the quick check below)

  A1=weather  A2=day  happy
  sun         odd     yes
  rain        odd     no
  rain        even    no
  sun         even    yes
  rain        odd     no
  sun         even    yes

[Figure: the same data split by weather (sunny vs. rain) and by day (odd vs. even)]
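A quick check, not on the slide itself: with 3 happy and 3 unhappy days the two classes are equally likely, so the entropy is exactly 1 bit.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(entropy([3, 3]))   # 1.0 bit: a 50/50 split is maximally impure
```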

Page 19:

Residual Information

• After applying attribute A, S is partitioned into subsets according to the values v of A
• I_res represents the amount of information still needed to classify an instance
• I_res is equal to the weighted sum of the amounts of information for the subsets:

  I_res(A) = \sum_{v} p(v) \, I(v),   where   I(v) = - \sum_{c} p(c|v) \log_2 p(c|v)

  (p(c|v) = probability that an instance belongs to class c given that it belongs to the subset with A = v)

Page 20:

• What is I_res(A) if I split on “weather” and if I split on “day”?

  I_res(weather) = 0
  I_res(day) ≈ 0.92   (almost as high as the entropy of the whole set, 1 bit)

  A1=weather  A2=day  happy
  sun         odd     yes
  rain        odd     no
  rain        even    no
  sun         even    yes
  rain        odd     no
  sun         even    yes

[Figure: the data split by weather (sunny vs. rain) and by day (odd vs. even)]

Page 21:

Information Gain

• The amount of information I rule out by splitting on attribute A:

  Gain(A) = I − I_res(A)

  = the information in the current set minus the residual information after splitting

• The most ‘informative’ attribute is the one that minimizes I_res, i.e., maximizes the Gain (see the sketch below)
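The following self-contained sketch (helper names are assumptions) computes I_res and Gain for the weather/day example; it reproduces I_res(weather) = 0 and shows that I_res(day) is close to the original entropy of 1 bit, so splitting on the day gains almost nothing:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

# (weather, day, happy)
days = [("sun", "odd", "yes"), ("rain", "odd", "no"), ("rain", "even", "no"),
        ("sun", "even", "yes"), ("rain", "odd", "no"), ("sun", "even", "yes")]

def i_res(idx):
    """Weighted sum of the subset entropies after splitting on attribute number idx."""
    return sum(
        len([r for r in days if r[idx] == v]) / len(days)
        * entropy([r[-1] for r in days if r[idx] == v])
        for v in set(r[idx] for r in days))

i = entropy([r[-1] for r in days])   # 1 bit for 3 yes / 3 no
for name, idx in (("weather", 0), ("day", 1)):
    print(f"Ires({name}) = {i_res(idx):.3f}, Gain({name}) = {i - i_res(idx):.3f}")
# Ires(weather) = 0.000, Gain(weather) = 1.000
# Ires(day)     = 0.918, Gain(day)     = 0.082
```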

Page 22:

Triangles and Squares

Data set: a set of classified objects

  #   Color   Outline  Dot  Shape
  1   green   dashed   no   triangle
  2   green   dashed   yes  triangle
  3   yellow  dashed   no   square
  4   red     dashed   no   square
  5   red     solid    no   square
  6   red     solid    yes  triangle
  7   green   solid    no   square
  8   green   dashed   no   triangle
  9   yellow  solid    yes  square
  10  red     solid    no   square
  11  green   solid    yes  square
  12  yellow  dashed   yes  square
  13  yellow  solid    no   square
  14  red     dashed   yes  triangle

(Color, Outline and Dot are the attributes; Shape is the class.)

[Figure: the objects drawn as triangles and squares]

Page 23:

Entropy

• 5 triangles
• 9 squares
• class probabilities: p(triangle) = 5/14, p(square) = 9/14
• entropy of the data set:

  I = −(5/14) \log_2(5/14) − (9/14) \log_2(9/14) ≈ 0.940 bits

Page 24:

Entropy reduction by data set partitioning

[Figure: the data set partitioned by Color? into the red, yellow and green subsets]

Page 25:

Residual information

[Figure: the same partition by Color?; the entropy of each subset (red, yellow, green) is combined into the weighted sum I_res(Color)]

Page 26:

Information Gain

[Figure: the same partition by Color?; Gain(Color) = entropy of the whole set minus I_res(Color)]

Page 27:

Information Gain of the Attributes

• Attributes:
  – Gain(Color) = 0.246
  – Gain(Outline) = 0.151
  – Gain(Dot) = 0.048

• Heuristic: the attribute with the highest gain is chosen (see the sketch below)

• This heuristic is local (local minimization of impurity)
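A short sketch (again with assumed helper names) that recomputes these gains from the 14-object table; allowing for rounding, it should print roughly 0.247, 0.152 and 0.048:

```python
import math
from collections import Counter

rows = [  # (color, outline, dot, shape) for objects 1-14
    ("green", "dashed", "no", "triangle"),  ("green", "dashed", "yes", "triangle"),
    ("yellow", "dashed", "no", "square"),   ("red", "dashed", "no", "square"),
    ("red", "solid", "no", "square"),       ("red", "solid", "yes", "triangle"),
    ("green", "solid", "no", "square"),     ("green", "dashed", "no", "triangle"),
    ("yellow", "solid", "yes", "square"),   ("red", "solid", "no", "square"),
    ("green", "solid", "yes", "square"),    ("yellow", "dashed", "yes", "square"),
    ("yellow", "solid", "no", "square"),    ("red", "dashed", "yes", "triangle")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(idx):
    """Information gain of the attribute stored at tuple position idx."""
    i_res = sum(
        len([r for r in rows if r[idx] == v]) / len(rows)
        * entropy([r[-1] for r in rows if r[idx] == v])
        for v in set(r[idx] for r in rows))
    return entropy([r[-1] for r in rows]) - i_res

for name, idx in (("Color", 0), ("Outline", 1), ("Dot", 2)):
    print(f"Gain({name}) = {gain(idx):.3f}")
```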

Page 28:

[Figure: the subset of objects in the green branch of the Color? split]

Within that subset (entropy 0.971 bits), the candidate second splits give:

  Gain(Outline) = 0.971 – 0 = 0.971 bits
  Gain(Dot) = 0.971 – 0.951 = 0.020 bits

Page 29:

[Figure: the tree so far: Color? at the root, with the green branch already split on Outline? (dashed / solid); the red subset is considered next]

Within the red subset (entropy 0.971 bits):

  Gain(Outline) = 0.971 – 0.951 = 0.020 bits
  Gain(Dot) = 0.971 – 0 = 0.971 bits

Page 30:

[Figure: the tree with Color? at the root, Dot? (yes / no) under the red branch and Outline? (dashed / solid) under the green branch]

Page 31:

Decision Tree

Color
├─ red → Dot
│   ├─ yes → triangle
│   └─ no → square
├─ yellow → square
└─ green → Outline
    ├─ dashed → triangle
    └─ solid → square

(The tree is transcribed as a small function below.)
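For completeness, the learned tree written out as a plain Python function; the name classify is an assumption of the sketch:

```python
def classify(color, outline, dot):
    """Walk the learned tree: test Color first, then Dot (red) or Outline (green)."""
    if color == "yellow":
        return "square"
    if color == "red":
        return "triangle" if dot == "yes" else "square"
    if color == "green":
        return "triangle" if outline == "dashed" else "square"
    raise ValueError(f"unknown color: {color}")

print(classify("red", "dashed", "yes"))   # -> "triangle", matching object 14 in the table
```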

Page 32:

A Defect of I_res

• I_res favors attributes with many values
• Such an attribute splits S into many subsets, and if these are small, they will tend to be pure anyway
• One way to rectify this is through a corrected measure, the information gain ratio

  A1=weather  A2=day    happy
  sun         17.1.08   yes
  rain        18.1.08   no
  rain        19.1.08   no
  sun         20.1.08   yes
  sun         21.1.08   yes

Page 33:

Information Gain Ratio

• I(A) is the amount of information needed to determine the value of attribute A:

  I(A) = - \sum_{v} p(v) \log_2 p(v)

• Information gain ratio:

  GainRatio(A) = Gain(A) / I(A)

Page 34:

Information Gain Ratio

[Figure: the Color? split (red, yellow, green) used to illustrate the gain ratio computation]

Page 35:

Information Gain and Information Gain Ratio

  A        |v(A)|  Gain(A)  GainRatio(A)
  Color    3       0.247    0.156
  Outline  2       0.152    0.152
  Dot      2       0.048    0.049

(A small sketch reproducing these numbers follows below.)
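The sketch below (reusing the same 14-object table and assumed helper names as before) recomputes the table: the gain ratio simply divides each gain by I(A), the entropy of the attribute's own value distribution:

```python
import math
from collections import Counter

rows = [  # (color, outline, dot, shape) for objects 1-14
    ("green", "dashed", "no", "triangle"),  ("green", "dashed", "yes", "triangle"),
    ("yellow", "dashed", "no", "square"),   ("red", "dashed", "no", "square"),
    ("red", "solid", "no", "square"),       ("red", "solid", "yes", "triangle"),
    ("green", "solid", "no", "square"),     ("green", "dashed", "no", "triangle"),
    ("yellow", "solid", "yes", "square"),   ("red", "solid", "no", "square"),
    ("green", "solid", "yes", "square"),    ("yellow", "dashed", "yes", "square"),
    ("yellow", "solid", "no", "square"),    ("red", "dashed", "yes", "triangle")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(idx):
    i_res = sum(
        len([r for r in rows if r[idx] == v]) / len(rows)
        * entropy([r[-1] for r in rows if r[idx] == v])
        for v in set(r[idx] for r in rows))
    return entropy([r[-1] for r in rows]) - i_res

def split_info(idx):
    return entropy([r[idx] for r in rows])   # I(A): entropy of the attribute values themselves

for name, idx in (("Color", 0), ("Outline", 1), ("Dot", 2)):
    print(f"{name}: Gain = {gain(idx):.3f}, GainRatio = {gain(idx) / split_info(idx):.3f}")
# roughly: Color 0.247 / 0.156, Outline 0.152 / 0.152, Dot 0.048 / 0.049
```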

Page 36:

Overfitting (Example)

Page 37:

Overfitting

[Figure: training and test error as a function of model complexity]

Underfitting: when the model is too simple, both training and test errors are large

Overfitting: when the model is too complex, the training error keeps shrinking while the test error grows

Page 38:

Notes on Overfitting

• Overfitting results in decision trees that are more complex than necessary

• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records

Page 39:

How to Address Overfitting

Idea: prune the tree so that it is not too specific

Two possibilities:

Pre-Pruning

- prune while building the tree

Post-Pruning

- prune after building the tree

Page 40:

How to Address Overfitting

• Pre-Pruning (early stopping rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – More restrictive stopping conditions:
    • Stop if the number of instances is less than some user-specified threshold
    • Stop if expanding the current node does not improve impurity measures (e.g., information gain)
  – Not successful in practice

(A sketch of these stopping conditions as hyperparameters follows below.)
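For comparison only, a sketch of how such stopping rules appear as hyperparameters in scikit-learn (an assumed library choice, not mentioned in the slides): min_samples_split corresponds to the instance-count threshold and min_impurity_decrease to the "does not improve impurity" rule.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# pre-pruning: stop growing early instead of building the full tree
pre_pruned = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_split=10,         # stop if fewer than 10 instances reach a node
    min_impurity_decrease=0.01,   # stop if a split barely reduces impurity
    max_depth=4,                  # hard cap on the depth of the tree
).fit(X, y)

print("leaves in the pre-pruned tree:", pre_pruned.get_n_leaves())
```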

Page 41:

How to Address Overfitting…

• Post-Pruning
  – Grow the decision tree to its entirety
  – Trim the nodes of the decision tree in a bottom-up fashion
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree (a pruning sketch follows below)
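The slides describe error-based post-pruning; as a hedged illustration, the sketch below uses scikit-learn's related cost-complexity pruning instead (a substitute technique, not the slides' exact procedure): the tree is grown to its entirety and then the pruned variant that generalizes best on held-out data is kept.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# grow the tree to its entirety, then list candidate pruning strengths
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# keep the pruned tree that does best on the held-out data
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda tree: tree.score(X_val, y_val))

print("leaves:", best.get_n_leaves(), "validation accuracy:", round(best.score(X_val, y_val), 3))
```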

Page 42:

Occam’s Razor

• Given two models with similar generalization errors, one should prefer the simpler model over the more complex one
• For a complex model, there is a greater chance that it was fitted accidentally to errors in the data
• Therefore, one should in general prefer less complex models

Page 43:

When to use Decision Tree Learning?

Page 44:

Appropriate problems for decision tree learning

• Classification problems
• Characteristics:
  – instances described by attribute-value pairs
  – target function has discrete output values
  – training data may be noisy
  – training data may contain missing attribute values

Page 45:

Strengths

• can generate understandable rules
• perform classification without much computation
• can handle both continuous and categorical variables
• provide a clear indication of which fields are most important for prediction or classification

Page 46:

Weaknesses

• Not well suited for predicting continuous attributes
• Perform poorly with many classes and little data
• Computationally expensive to train:
  – At each node, each candidate splitting field must be sorted before its best split can be found
  – In some algorithms, combinations of fields are used and a search must be made for optimal combining weights
  – Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared