7 - Decision Trees Learning



    Decision Trees Learning


    Outline

    Decision tree representation

    ID3 learning algorithm

    Entropy, information gain

    Overfitting


    Decision Tree for PlayTennis

    Attributes and their values:

    Outlook: Sunny, Overcast, Rain

    Humidity: High, Normal

    Wind: Strong, Weak

    Temperature: Hot, Mild, Cool

    Target concept - Play Tennis: Yes, No

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast
     └─ Rain

    Each internal node tests an attribute

    Each branch corresponds to an attribute value

    Each leaf node assigns a classification

    Decision Tree for PlayTennis

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Example to classify:

    Outlook  Temperature  Humidity  Wind  PlayTennis
    Sunny    Hot          High      Weak  ?

    The tree sorts this example down the Sunny/High branch: PlayTennis = No.

    Decision Tree for Conjunction

    Outlook=Sunny ∧ Wind=Weak

    Outlook
     ├─ Sunny    → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     ├─ Overcast → No
     └─ Rain     → No

    Decision Tree for Disjunction

    Outlook=Sunny ∨ Wind=Weak

    Outlook
     ├─ Sunny    → Yes
     ├─ Overcast → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree for XOR

    Outlook=Sunny XOR Wind=Weak

    Outlook
     ├─ Sunny    → Wind
     │              ├─ Strong → Yes
     │              └─ Weak   → No
     ├─ Overcast → Wind
     │              ├─ Strong → No
     │              └─ Weak   → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision Tree

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    Decision trees represent disjunctions of conjunctions:

    (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

    When to consider Decision Trees

    Instances describable by attribute-value pairs
      e.g. Humidity: High, Normal

    Target function is discrete valued
      e.g. PlayTennis: Yes, No

    Disjunctive hypothesis may be required
      e.g. Outlook=Sunny ∨ Wind=Weak

    Possibly noisy training data

    Missing attribute values

    Examples:
      Medical diagnosis
      Credit risk analysis
      Object classification for robot manipulator (Tan 1993)

    Top-Down Induction of Decision Trees: ID3

    1. A ← the best decision attribute for the next node
    2. Assign A as the decision attribute for node
    3. For each value of A, create a new descendant of node
    4. Sort training examples to leaf nodes according to the attribute value of the branch
    5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
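    A minimal Python sketch of these steps, assuming examples are dicts of attribute values plus a "label" key, and that best_attribute implements the information-gain choice introduced on the following slides:

        from collections import Counter

        def id3(examples, attributes, best_attribute):
            """Build a decision tree as nested dicts: {attribute: {value: subtree-or-label}}."""
            labels = [ex["label"] for ex in examples]
            if len(set(labels)) == 1:                 # all examples perfectly classified -> leaf
                return labels[0]
            if not attributes:                        # no attributes left -> majority-class leaf
                return Counter(labels).most_common(1)[0][0]
            a = best_attribute(examples, attributes)  # step 1: pick the best decision attribute
            tree = {a: {}}                            # step 2: assign A to this node
            for v in {ex[a] for ex in examples}:      # step 3: one descendant per value of A
                subset = [ex for ex in examples if ex[a] == v]   # step 4: sort examples to branches
                tree[a][v] = id3(subset, [x for x in attributes if x != a], best_attribute)
            return tree                               # step 5: recurse over the new leaf nodes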

    Which Attribute is best?

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]
      False → [8+, 30-]

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]
      False → [11+, 2-]

    Entropy

    S is a sample of training examples

    p+ is the proportion of positive examples

    p- is the proportion of negative examples

    Entropy measures the impurity of S:

    Entropy(S) = -p+ log2 p+ - p- log2 p-

    Entropy

    Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

    Information theory: the optimal-length code assigns -log2 p bits to a message having probability p.

    So the expected number of bits to encode (+ or -) of a random member of S is:

    -p+ log2 p+ - p- log2 p-

    (Note: 0 log 0 = 0)
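    A small Python helper matching this definition for the two-class case (using the 0 log 0 = 0 convention):

        import math

        def entropy(pos, neg):
            """Entropy of a sample containing `pos` positive and `neg` negative examples."""
            total = pos + neg
            result = 0.0
            for count in (pos, neg):
                p = count / total
                if p > 0:                  # convention: 0 * log2(0) = 0
                    result -= p * math.log2(p)
            return result

        print(round(entropy(9, 5), 2))     # 0.94, the PlayTennis sample used later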

    Information Gain

    Gain(S,A): expected reduction in entropy due to sorting S on attribute A

    Gain(S,A) = Entropy(S) - Σ_{v ∈ values(A)} |Sv|/|S| · Entropy(Sv)

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]
      False → [8+, 30-]

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]
      False → [11+, 2-]

    Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64) = 0.99
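    The same formula as a Python sketch, reusing the entropy helper above; the split is passed as a list of (positive, negative) counts, one per attribute value:

        def gain(pos, neg, branches):
            """Information gain of splitting a (pos, neg) sample into `branches`,
            a list of (pos_v, neg_v) counts, one per value of the attribute."""
            total = pos + neg
            remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
            return entropy(pos, neg) - remainder

        print(round(gain(29, 35, [(21, 5), (8, 30)]), 2))   # A1: 0.27
        print(round(gain(29, 35, [(18, 33), (11, 2)]), 2))  # A2: 0.12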

    Information Gain

    A1=?  on S = [29+,35-]
      True  → [21+, 5-]    Entropy([21+,5-])  = 0.71
      False → [8+, 30-]    Entropy([8+,30-])  = 0.74

    Gain(S,A1) = Entropy(S) - (26/64)·Entropy([21+,5-]) - (38/64)·Entropy([8+,30-]) = 0.27

    A2=?  on S = [29+,35-]
      True  → [18+, 33-]   Entropy([18+,33-]) = 0.94
      False → [11+, 2-]    Entropy([11+,2-])  = 0.62

    Gain(S,A2) = Entropy(S) - (51/64)·Entropy([18+,33-]) - (13/64)·Entropy([11+,2-]) = 0.12

    Training Examples

    Day  Outlook   Temp.  Humidity  Wind    PlayTennis
    D1   Sunny     Hot    High      Weak    No
    D2   Sunny     Hot    High      Strong  No
    D3   Overcast  Hot    High      Weak    Yes
    D4   Rain      Mild   High      Weak    Yes
    D5   Rain      Cool   Normal    Weak    Yes
    D6   Rain      Cool   Normal    Strong  No
    D7   Overcast  Cool   Normal    Weak    Yes
    D8   Sunny     Mild   High      Weak    No
    D9   Sunny     Cool   Normal    Weak    Yes
    D10  Rain      Mild   Normal    Strong  Yes
    D11  Sunny     Mild   Normal    Strong  Yes
    D12  Overcast  Mild   High      Strong  Yes
    D13  Overcast  Hot    Normal    Weak    Yes
    D14  Rain      Mild   High      Strong  No

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Humidity
      High   → [3+, 4-]   E = 0.985
      Normal → [6+, 1-]   E = 0.592

    Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

    Wind
      Weak   → [6+, 2-]   E = 0.811
      Strong → [3+, 3-]   E = 1.0

    Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048

    Humidity provides greater information gain than Wind, w.r.t. the target classification.

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Outlook
      Sunny    → [2+, 3-]   E = 0.971
      Overcast → [4+, 0-]   E = 0.0
      Rain     → [3+, 2-]   E = 0.971

    Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

    Selecting the Next Attribute

    S = [9+,5-], E = 0.940

    Temperature
      Hot  → [2+, 2-]   E = 1.0
      Mild → [4+, 2-]   E = 0.918
      Cool → [3+, 1-]   E = 0.811

    Gain(S, Temperature) = 0.940 - (4/14)·1.0 - (6/14)·0.918 - (4/14)·0.811 = 0.029

    Selecting the Next Attribute

    The information gain values for the 4 attributes are:

    Gain(S, Outlook)     = 0.247
    Gain(S, Humidity)    = 0.151
    Gain(S, Wind)        = 0.048
    Gain(S, Temperature) = 0.029

    where S denotes the collection of training examples

    Note: 0 log2 0 = 0
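    These four values can be reproduced directly from the training table; a small self-contained check in Python (the 0.152 vs 0.151 for Humidity is just rounding):

        import math
        from collections import Counter

        data = [  # rows D1..D14: (Outlook, Temperature, Humidity, Wind, PlayTennis)
            ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
            ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
            ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
            ("Overcast","Cool","Normal","Weak","Yes"), ("Sunny","Mild","High","Weak","No"),
            ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Strong","Yes"),
            ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
            ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
        ]

        def H(labels):
            """Entropy of a list of class labels."""
            n = len(labels)
            return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

        target = [row[-1] for row in data]
        for name, col in {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}.items():
            remainder = 0.0
            for v in {row[col] for row in data}:
                part = [row[-1] for row in data if row[col] == v]
                remainder += len(part) / len(data) * H(part)
            print(name, round(H(target) - remainder, 3))
        # Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048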

    ID3 Algorithm

    S = [D1,D2,...,D14]   [9+,5-]

    Outlook
     ├─ Sunny    → Ssunny = [D1,D2,D8,D9,D11]   [2+,3-]   ?  (test needed for this node)
     ├─ Overcast → [D3,D7,D12,D13]   [4+,0-]   Yes
     └─ Rain     → [D4,D5,D6,D10,D14]   [3+,2-]   ?  (test needed for this node)

    Gain(Ssunny, Humidity)    = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
    Gain(Ssunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
    Gain(Ssunny, Wind)        = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019

    ID3 Algorithm

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No   [D1,D2,D8]
     │              └─ Normal → Yes  [D9,D11]
     ├─ Overcast → Yes  [D3,D7,D12,D13]
     └─ Rain     → Wind
                    ├─ Strong → No   [D6,D14]
                    └─ Weak   → Yes  [D4,D5,D10]

    Hypothesis Space Search ID3

    [Figure: ID3 searches the space of decision trees from simple to complex, starting from the empty tree, then single-attribute trees (A1, A2, ...), then progressively larger refinements (e.g. A2 extended with A3 or A4), guided by the +/- labels of the training examples.]


    Hypothesis Space Search ID3

    Hypothesis space is complete!

    Target function surely in there

    Outputs a single hypothesis

    No backtracking on selected attributes (greedy search)

    Local minima (suboptimal splits)

    Statistically-based search choices

    Robust to noisy data

    Inductive bias (search bias)

    Prefer shorter trees over longer ones

    Place high information gain attributes close to the root

    Converting a Tree to Rules

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    R1: If (Outlook=Sunny) ∧ (Humidity=High)   Then PlayTennis=No
    R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
    R3: If (Outlook=Overcast)                  Then PlayTennis=Yes
    R4: If (Outlook=Rain) ∧ (Wind=Strong)      Then PlayTennis=No
    R5: If (Outlook=Rain) ∧ (Wind=Weak)        Then PlayTennis=Yes

    Continuous Valued Attributes

    Create a discrete attribute to test a continuous one

    Temperature = 24.5°C

    (Temperature > 20.0°C) ∈ {true, false}

    Where to set the threshold?

    Temperature   15°C  18°C  19°C  22°C  24°C  27°C
    PlayTennis    No    No    Yes   Yes   Yes   No

    (see the paper by [Fayyad, Irani 1993])
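    One common heuristic (the analysis in [Fayyad, Irani 1993] is more involved) is to consider candidate thresholds only at midpoints between adjacent examples whose class differs, then keep the threshold with the highest information gain; a small sketch:

        temps  = [15, 18, 19, 22, 24, 27]                    # sorted attribute values
        labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

        # Candidate thresholds: midpoints where the class label changes.
        candidates = [(temps[i] + temps[i + 1]) / 2
                      for i in range(len(temps) - 1)
                      if labels[i] != labels[i + 1]]
        print(candidates)   # [18.5, 25.5]; evaluate Gain(S, Temperature > t) for each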

    Attributes with many Values

    Problem: if an attribute has many values, maximizing information gain will select it.

    E.g.: imagine using Date = 12.7.1996 as an attribute; it perfectly splits the data into subsets of size 1.

    Use GainRatio instead of information gain as the criterion:

    GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)

    SplitInformation(S,A) = -Σ_{i=1..c} |Si|/|S| log2(|Si|/|S|)

    where Si is the subset of S for which attribute A has the value vi
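    Both quantities as small Python helpers, assuming the branch sizes |Si| are given as a list of counts:

        import math

        def split_information(sizes):
            """SplitInformation(S,A) for branch sizes |S1|, ..., |Sc|."""
            total = sum(sizes)
            return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

        def gain_ratio(gain_value, sizes):
            """GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)."""
            return gain_value / split_information(sizes)

        print(round(split_information([1] * 14), 2))   # 3.81: a Date-like attribute, 14 singleton subsets
        print(round(split_information([5, 4, 5]), 2))  # 1.58: e.g. Outlook's three values over 14 examples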

    Attributes with Cost

    Consider:

    Medical diagnosis: a blood test costs 1000 SEK

    Robotics: width_from_one_feet has a cost of 23 secs.

    How to learn a consistent tree with low expected cost?

    Replace Gain by:

    Gain²(S,A) / Cost(A)                               [Tan, Schlimmer 1990]

    (2^Gain(S,A) - 1) / (Cost(A)+1)^w,  with w ∈ [0,1]   [Nunez 1988]
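    The two replacement criteria written out as small Python functions (w is the Nunez trade-off weight):

        def tan_schlimmer(gain, cost):
            """Gain^2(S,A) / Cost(A)                     [Tan, Schlimmer 1990]"""
            return gain ** 2 / cost

        def nunez(gain, cost, w=0.5):
            """(2^Gain(S,A) - 1) / (Cost(A) + 1)^w,  w in [0, 1]   [Nunez 1988]"""
            return (2 ** gain - 1) / (cost + 1) ** w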

    Unknown Attribute Values

    What if some examples are missing values of A?

    Use the training example anyway, and sort it through the tree:

    If node n tests A, assign the most common value of A among the other examples sorted to node n

    Or assign the most common value of A among the other examples with the same target value

    Or assign probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree

    Classify new examples in the same fashion
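    A sketch of the last option (fractional examples), assuming the value distribution is estimated from the other examples sorted to node n:

        from collections import Counter

        def fractional_values(example, attribute, examples_at_node):
            """(value, weight) pairs for an example missing `attribute`, using the
            value frequencies among the other examples sorted to this node."""
            counts = Counter(ex[attribute] for ex in examples_at_node
                             if ex.get(attribute) is not None)
            total = sum(counts.values())
            return [(v, c / total) for v, c in counts.items()]

        node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"}, {"Wind": "Strong"}]
        print(fractional_values({"Wind": None}, "Wind", node_examples))
        # [('Weak', 0.666...), ('Strong', 0.333...)]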

    Occam's Razor

    Prefer shorter hypotheses

    Why prefer short hypotheses?

    Argument in favor:

    There are fewer short hypotheses than long hypotheses

    A short hypothesis that fits the data is unlikely to be a coincidence

    A long hypothesis that fits the data might be a coincidence

    Argument opposed:

    There are many ways to define small sets of hypotheses

    What is so special about small sets based on the size of a hypothesis?

    Overfitting

    Consider the error of hypothesis h over

    Training data: error_train(h)

    Entire distribution D of data: error_D(h)

    Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

    error_train(h) < error_train(h')  and  error_D(h) > error_D(h')


    Overfitting in Decision Tree Learning


    Avoid Overfitting

    How can we avoid overfitting?

    Stop growing when data split not

    statistically significant

    Grow full tree then post-prune

    Reduced-Error Pruning

    Split data into training and validation sets

    Do until further pruning is harmful:

    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
    2. Greedily remove the one that most improves validation set accuracy

    Produces the smallest version of the most accurate subtree
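    A sketch of this loop over a nested-dict tree like the one built by the id3 sketch earlier; predict, accuracy, and the path bookkeeping are illustrative helpers (and the default label in predict is an arbitrary assumption), not a fixed API:

        from collections import Counter

        def predict(tree, example, default="Yes"):
            """Walk a nested-dict tree {attr: {value: subtree-or-label}} down to a leaf label."""
            while isinstance(tree, dict):
                attr = next(iter(tree))
                tree = tree[attr].get(example.get(attr), default)
            return tree

        def accuracy(tree, examples):
            return sum(predict(tree, ex) == ex["label"] for ex in examples) / len(examples)

        def internal_nodes(tree, path=()):
            """Yield the (attribute, value) path to every internal node of the tree."""
            if isinstance(tree, dict):
                attr = next(iter(tree))
                yield path
                for value, sub in tree[attr].items():
                    yield from internal_nodes(sub, path + ((attr, value),))

        def pruned_copy(tree, target, label, path=()):
            """Copy of `tree` with the internal node at `target` replaced by the leaf `label`."""
            if path == target:
                return label
            attr = next(iter(tree))
            return {attr: {v: (pruned_copy(sub, target, label, path + ((attr, v),))
                               if isinstance(sub, dict) else sub)
                           for v, sub in tree[attr].items()}}

        def reduced_error_prune(tree, training, validation):
            best = tree
            while True:
                base = accuracy(best, validation)
                candidates = []
                for path in internal_nodes(best):
                    # most common classification of the training instances that reach this node
                    reached = [ex for ex in training if all(ex.get(a) == v for a, v in path)]
                    label = Counter(ex["label"] for ex in (reached or training)).most_common(1)[0][0]
                    candidates.append(pruned_copy(best, path, label))
                if not candidates:
                    return best
                challenger = max(candidates, key=lambda t: accuracy(t, validation))
                if accuracy(challenger, validation) < base:
                    return best            # further pruning is harmful -> stop
                best = challenger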

    Reduced-Error Pruning

    Split data into training and validation sets.

    Pruning a decision node d consists of:

    1. removing the subtree rooted at d,
    2. making d a leaf node,
    3. assigning d the most common classification of the training instances associated with d.

    Do until further pruning is harmful:

    1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
    2. Greedily remove the one that most improves validation set accuracy.

    Outlook
     ├─ sunny    → Humidity
     │              ├─ high   → no
     │              └─ normal → yes
     ├─ overcast → yes
     └─ rainy    → Windy
                    ├─ false → yes
                    └─ true  → no

    Effect of Reduced-Error Pruning

    Rule Post-Pruning

    Infer the decision tree from the training set (allowing overfitting)

    Convert the tree into an equivalent set of rules

    Prune each rule by removing preconditions whose removal improves its estimated accuracy

    Sort the pruned rules by estimated accuracy and consider them in this order when classifying

    Outlook
     ├─ Sunny    → Humidity
     │              ├─ High   → No
     │              └─ Normal → Yes
     ├─ Overcast → Yes
     └─ Rain     → Wind
                    ├─ Strong → No
                    └─ Weak   → Yes

    If (Outlook = Sunny) ∧ (Humidity = High) Then (PlayTennis = No)

    Why convert the decision tree to rules before pruning?

    Allows distinguishing among the different contexts in which a decision node is used

    Removes the distinction between attribute tests near the root and those that occur near the leaves

    Enhances readability

    Evaluation

    Training accuracy
      How many training instances can be correctly classified based on the available data?
      It is high when the tree is deep/large, or when there is little conflict among the training instances.
      However, higher training accuracy does not mean good generalization.

    Testing accuracy
      Given a number of new instances, how many of them can we correctly classify?
      Cross validation


    Strengths

    can generate understandable rules

    perform classification without much computation

    can handle continuous and categorical variables

    provide a clear indication of which fields are most important

    for prediction or classification

    Weaknesses

    Not suitable for predicting continuous attributes.

    Perform poorly with many classes and small data sets.

    Computationally expensive to train:
      At each node, each candidate splitting field must be sorted before its best split can be found.
      In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.
      Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared.

    Do not handle non-rectangular regions well.


    Cross-Validation

    Estimate the accuracy of a hypothesis induced by

    a supervised learning algorithm

    Predict the accuracy of a hypothesis over future

    unseen instances

    Select the optimal hypothesis from a given set of

    alternative hypotheses

    Pruning decision trees

    Model selection

    Feature selection

    Combining multiple classifiers (boosting)

    Holdout Method

    Partition the data set D = {(v1,y1), ..., (vn,yn)} into a training set Dt and a validation set Dh = D \ Dt

    acc_h = 1/h · Σ_{(vi,yi) ∈ Dh} H(I(Dt,vi), yi)     (h = |Dh|)

    I(Dt,vi): output of the hypothesis induced by learner I trained on data Dt, for instance vi

    H(i,j) = 1 if i = j and 0 otherwise

    Problems:
      makes insufficient use of the data
      training and validation set are correlated
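    A sketch of the holdout estimate, assuming induce(train) returns the induced hypothesis as a callable and examples are (v, y) pairs:

        import random

        def holdout_accuracy(data, induce, frac=0.7, seed=0):
            """Estimate acc_h: train on a random fraction of the data, test on the rest."""
            shuffled = list(data)
            random.Random(seed).shuffle(shuffled)
            cut = int(len(shuffled) * frac)
            train, valid = shuffled[:cut], shuffled[cut:]
            h = induce(train)                                   # hypothesis I(Dt, .)
            return sum(h(v) == y for v, y in valid) / len(valid)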

    Cross-Validation

    k-fold cross-validation splits the data set D into k mutually exclusive subsets D1, D2, ..., Dk

    Train and test the learning algorithm k times; each time it is trained on D \ Di and tested on Di

    [Figure: the folds D1 D2 D3 D4, with a different fold Di held out for testing in each of the k runs]

    acc_cv = 1/n · Σ_{(vi,yi) ∈ D} H(I(D \ Di, vi), yi)
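    A matching sketch of k-fold cross-validation (folds taken by striding through a shuffled copy; induce as above):

        import random

        def cv_accuracy(data, induce, k=4, seed=0):
            """Estimate acc_cv with k-fold cross-validation."""
            shuffled = list(data)
            random.Random(seed).shuffle(shuffled)
            folds = [shuffled[i::k] for i in range(k)]          # k mutually exclusive subsets Di
            correct = 0
            for i in range(k):
                train = [ex for j in range(k) if j != i for ex in folds[j]]
                h = induce(train)                               # trained on D \ Di
                correct += sum(h(v) == y for v, y in folds[i])  # tested on Di
            return correct / len(shuffled)                      # averaged over all of D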

    Cross-Validation

    Uses all the data for training and testing

    Complete k-fold cross-validation splits the data set of size m in all (m choose m/k) possible ways (choosing m/k instances out of m)

    Leave-n-out cross-validation sets n instances aside for testing and uses the remaining ones for training (leave-one-out is equivalent to n-fold cross-validation)

    Leave-one-out is widely used

    In stratified cross-validation, the folds are stratified so that they contain approximately the same proportion of labels as the original data set

    Information Content (Self-Information)

    A source emits symbols {X1, X2, ..., Xn}, each with probability p(Xi)

    The information content of a compound event (Xi ∧ Xj ∧ ... ∧ Xk) is I(Xi ∧ Xj ∧ ... ∧ Xk)

    If p(X) = 1, then I(X) = 0;  if p(X) = 0, then I(X) = ∞

    If p(Xi) > p(Xj), then I(Xi) < I(Xj)

    If Xi, Xj, Xk are independent, then I(Xi ∧ Xj ∧ Xk) = I(Xi) + I(Xj) + I(Xk)

    I(X) = log2[ 1 / p(X) ] = -log2 p(X)   (bit)

    Example: X = (X1 X2 X3), with p(X1), p(X2), p(X3) given and the Xi independent; find I(X).

    I(X) = I(X1 X2 X3)
         = -log[ p(X1 X2 X3) ]
         = -log[ p(X1) p(X2) p(X3) ]          (Xi independent)
         = -log p(X1) - log p(X2) - log p(X3)
         = I(X1) + I(X2) + I(X3)

    Example: (X1 X2 X3 X4), with
    case 1:  p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8
    case 2:  p(X1) = 1/4, p(X2) = 1/4, p(X3) = 1/4, p(X4) = 1/4

    case 1:  I(X1 X2 X3 X4) = I(X1) + I(X2) + I(X3) + I(X4)
                            = -log(1/2) - log(1/4) - log(1/8) - log(1/8)
                            = 1 + 2 + 3 + 3 = 9 (bit)

    case 2:  I(X1 X2 X3 X4) = I(X1) + I(X2) + I(X3) + I(X4)
                            = -log(1/4) - log(1/4) - log(1/4) - log(1/4)
                            = 2 + 2 + 2 + 2 = 8 (bit)   (the minimum, Imin)

    Entropy H(X)

    H(X) = Σ_{i=1..n} p(Xi) I(Xi) = -Σ_{i=1..n} p(Xi) log2 p(Xi)

    Example: {X1, X2, X3, X4} with
    p(X1) = 1/2, p(X2) = 1/4, p(X3) = 1/8, p(X4) = 1/8

    H(X) = [ -(1/2)log(1/2) ] + [ -(1/4)log(1/4) ] + [ -(1/8)log(1/8) ] + [ -(1/8)log(1/8) ]
         = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3
         = 1/2 + 1/2 + 3/8 + 3/8
         = 1.75 (bit)

    (For comparison, Hmax = log2 4 = 2 bit, reached when all four symbols are equally likely.)
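    A quick Python check of these numbers:

        import math

        def I(p):                        # self-information in bits
            return -math.log2(p)

        def H(probs):                    # entropy in bits
            return sum(p * I(p) for p in probs)

        print(sum(I(p) for p in (1/2, 1/4, 1/8, 1/8)))   # 9.0  (case 1 above)
        print(sum(I(p) for p in (1/4, 1/4, 1/4, 1/4)))   # 8.0  (case 2 above)
        print(H((1/2, 1/4, 1/8, 1/8)))                   # 1.75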