Decision Trees: Definition, Mechanism, Splitting Functions, Issues in Decision-Tree Learning, Avoiding Overfitting through Pruning, Numeric and Missing Attributes


Page 1: Decision Trees

Decision Trees

• Definition

• Mechanism

• Splitting Functions

• Issues in Decision-Tree Learning

• Avoiding overfitting through pruning

• Numeric and Missing attributes

Page 2: Decision Trees

Illustration

Example: Learning to classify stars.

[Decision tree: the root tests Luminosity. The <= r1 branch leads to a node testing Mass (Mass <= r2: Type A; Mass > r2: Type B); the > r1 branch leads to the leaf Type C.]

Page 3: Decision Trees

Definition

A decision-tree learning algorithm approximates a target concept using a tree representation, where each internal node corresponds to an attribute and every terminal node corresponds to a class.

There are two types of nodes:

Internal node: splits into different branches according to the different values the corresponding attribute can take. Example: luminosity <= r1 or luminosity > r1.

Terminal node: decides the class assigned to the example.

Page 4: Decision Trees

Classifying Examples

[The same decision tree: Luminosity at the root (<= r1 leads to the Mass node; > r1 leads to Type C), with Mass <= r2 giving Type A and Mass > r2 giving Type B.]

X = (Luminosity <= r1, Mass > r2)

Assigned class: X follows the <= r1 branch at the root and the > r2 branch at the Mass node, so it is classified as Type B.
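A minimal Python sketch of how this classification proceeds (not from the slides; the thresholds R1 and R2 and the function classify_star are made-up illustrations):

    # Hand-built version of the star tree above.
    # R1 and R2 are hypothetical thresholds; a learner would pick them.
    R1, R2 = 10.0, 2.0

    def classify_star(luminosity, mass):
        if luminosity <= R1:        # internal node: test Luminosity
            if mass <= R2:          # internal node: test Mass
                return "Type A"     # terminal node
            return "Type B"         # terminal node
        return "Type C"             # terminal node

    # X = (Luminosity <= r1, Mass > r2):
    print(classify_star(luminosity=5.0, mass=3.0))  # -> Type B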

Page 5: Decision Trees

Appropriate Problems for Decision Trees

Attributes are both numeric and nominal.

The target function takes on a discrete number of values.

Data may have errors.

Some examples may have missing attribute values.

Page 6: Decision Trees

Decision Trees

• Definition

• Mechanism

• Splitting Functions

• Issues in Decision-Tree Learning

• Avoiding overfitting through pruning

• Numeric and Missing attributes

Page 7: Decision Trees

Historical Information

Ross Quinlan – Induction of Decision Trees. Machine Learning Journal 1: 81-106, 1986 (over 8 thousand citations).

Page 8: Decision Trees

Historical Information

Leo Breiman – CART (Classification and Regression Trees), 1984.

Page 9: Decision Trees

Mechanism

There are different ways to construct trees from data. We will concentrate on the top-down, greedy search approach.

Basic idea:

1. Choose the best attribute a* to place at the root of the tree.

2. Separate the training set D into subsets {D1, D2, ..., Dk}, where each subset Di contains examples having the same value for a*.

3. Recursively apply the algorithm to each new subset until the examples have the same class or there are few of them.

Page 10: Decision Trees

Illustration

Attributes: size and humidity.
Size has two values: > r1 or <= r1.
Humidity has three values: > r2, (> r3 and <= r2), <= r3.

[Scatter plot: examples plotted by size (split at r1) and humidity (split at r2 and r3), each labeled P or N.]

Class P: poisonous
Class N: not-poisonous

Page 11: Decision Trees

Illustration

[The same scatter plot, with the size split at r1 highlighted.]

Suppose we choose size as the best attribute:

[Partial tree: the root tests size, with branches > r1 and <= r1. The branch whose examples are all poisonous becomes a leaf labeled P; the mixed branch is still unresolved, marked "?".]

Class P: poisonous
Class N: not-poisonous

Page 12: Decision Trees

Illustration

[The same scatter plot, now partitioned by both size and humidity.]

Suppose we choose humidity as the next best attribute:

[Tree: the root tests size (branches > r1 and <= r1); the pure branch remains the leaf P. The mixed branch tests humidity, with branches > r2, (> r3 & <= r2), and <= r3, each ending in a leaf labeled P or NP.]

Page 13: Decision Trees

Formal Mechanism

• Create a root for the tree.

• If all examples are of the same class, or the number of examples is below a threshold, return that class.

• If no attributes are available, return the majority class.

• Let a* be the best attribute.

• For each possible value v of a*:

  • Add a branch below a* labeled "a* = v".

  • Let Sv be the subset of examples where attribute a* = v.

  • Recursively apply the algorithm to Sv.
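A minimal Python sketch of this procedure (assuming nominal attributes, examples given as (attribute-dict, class) pairs, and a caller-supplied best_attribute scoring function such as the information gain defined later; all names here are illustrative, not from the slides):

    from collections import Counter

    def majority_class(examples):
        # Most common class among (attributes, class) pairs.
        return Counter(c for _, c in examples).most_common(1)[0][0]

    def build_tree(examples, attributes, best_attribute, threshold=2):
        classes = {c for _, c in examples}
        # Same class, too few examples, or no attributes left: terminal node.
        if len(classes) == 1 or len(examples) <= threshold or not attributes:
            return majority_class(examples)
        a = best_attribute(examples, attributes)      # a* in the slides
        remaining = [b for b in attributes if b != a]
        node = {"attribute": a, "branches": {}}
        for v in {x[a] for x, _ in examples}:         # one branch per value of a*
            Sv = [(x, c) for x, c in examples if x[a] == v]
            node["branches"][v] = build_tree(Sv, remaining, best_attribute, threshold)
        return node

    # Toy run with a trivial "best attribute" choice (first available):
    data = [({"size": "big", "humidity": "high"}, "P"),
            ({"size": "big", "humidity": "low"}, "P"),
            ({"size": "small", "humidity": "high"}, "P"),
            ({"size": "small", "humidity": "low"}, "NP")]
    print(build_tree(data, ["size", "humidity"], lambda ex, attrs: attrs[0], threshold=0))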

Page 14: Decision Trees

What attribute is the best to split the data? Let us remember some definitions from information theory.

A measure of uncertainty or entropy associated with a random variable X is defined as

H(X) = - Σi pi log pi

where the logarithm is in base 2.

This is the "average amount of information or entropy of a finite complete probability scheme" (An Introduction to Information Theory, F. Reza).

Page 15: Decision Trees

There are two possible complete events A and B (example: flipping a biased coin).

P(A) = 1/256, P(B) = 255/256: H(X) = 0.0369 bits

P(A) = 1/2, P(B) = 1/2: H(X) = 1 bit

P(A) = 7/16, P(B) = 9/16: H(X) = 0.989 bits
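These values are easy to verify (a minimal sketch; the entropy helper is ours, not from the slides):

    from math import log2

    def entropy(probs):
        # H(X) = -sum pi log2 pi, with the convention 0 log 0 = 0.
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([1/256, 255/256]))  # ~0.0369 bits
    print(entropy([1/2, 1/2]))        # 1.0 bit
    print(entropy([7/16, 9/16]))      # ~0.989 bits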

Page 16: Decision Trees

Entropy is a function concave downward.

[Plot: binary entropy H(p) for p in [0, 1]; it is 0 at p = 0 and p = 1 and reaches its maximum of 1 bit at p = 0.5.]

Page 17: Decision Trees

Illustration

Attributes: size and humidity.
Size has two values: > r1 or <= r1.
Humidity has three values: > r2, (> r3 and <= r2), <= r3.

[The same scatter plot as before: examples plotted by size (split at r1) and humidity (split at r2 and r3).]

Class P: poisonous
Class N: not-poisonous

Page 18: Decision Trees

Splitting based on Entropy

[Scatter plot from the illustration, partitioned by size at r1 into S1 and S2.]

Size divides the sample in two:

S1 = {6P, 0NP}
S2 = {3P, 5NP}

H(S1) = 0
H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954

Page 19: Decision Trees

Splitting based on Entropy

[Scatter plot from the illustration, partitioned by humidity at r2 and r3 into S1, S2, and S3.]

Humidity divides the sample in three:

S1 = {2P, 2NP}
S2 = {5P, 0NP}
S3 = {2P, 3NP}

H(S1) = 1
H(S2) = 0
H(S3) = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971

Page 20: Decision Trees

Information Gain

IG(A) = H(S) - Σv (|Sv|/|S|) H(Sv)

H(S) is the entropy of all examples.

H(Sv) is the entropy of one subsample Sv, after partitioning S based on all possible values of attribute A.
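Applying this to the size and humidity splits from the previous slides (9 P and 5 NP examples overall) gives the following; the helper names are ours, and this is a sketch rather than slide code:

    from math import log2

    def subset_entropy(p, n):
        # Entropy of a subset with p poisonous and n not-poisonous examples.
        total = p + n
        return -sum(x / total * log2(x / total) for x in (p, n) if x > 0)

    def info_gain(parent, subsets):
        # IG(A) = H(S) - sum_v (|Sv|/|S|) H(Sv)
        total = sum(p + n for p, n in subsets)
        return subset_entropy(*parent) - sum(
            (p + n) / total * subset_entropy(p, n) for p, n in subsets)

    # size: S1 = {6P, 0NP}, S2 = {3P, 5NP}
    print(info_gain((9, 5), [(6, 0), (3, 5)]))          # ~0.395
    # humidity: S1 = {2P, 2NP}, S2 = {5P, 0NP}, S3 = {2P, 3NP}
    print(info_gain((9, 5), [(2, 2), (5, 0), (2, 3)]))  # ~0.308

Size has the higher information gain, which matches choosing it first in the illustration.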

Page 21: Decision Trees

Components of IG(A)

[Scatter plot from the illustration, partitioned by size at r1 into S1 and S2.]

H(S1) = 0
H(S2) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954

H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940

|S1|/|S| = 6/14
|S2|/|S| = 8/14

Putting these components together: IG(size) = 0.940 - (6/14)(0) - (8/14)(0.954) ≈ 0.395.


Page 23: Decision Trees

Gain Ratio

Let’s define the entropy of the attribute:

H(A) = - Σj pj log2 pj

where pj is the probability that attribute A takes value vj.

Then

GainRatio(A) = IG(A) / H(A)

Page 24: Decision Trees

Gain Ratio

[Scatter plot from the illustration, partitioned by size at r1 into S1 and S2.]

H(size) = -(6/14) log2(6/14) - (8/14) log2(8/14) ≈ 0.985

where |S1|/|S| = 6/14 and |S2|/|S| = 8/14.
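Putting the pieces together for size (a sketch; IG(size) ≈ 0.395 comes from the information-gain slides, and the helper name is ours):

    from math import log2

    def attribute_entropy(branch_sizes):
        # H(A) = -sum pj log2 pj over the branch proportions.
        total = sum(branch_sizes)
        return -sum(s / total * log2(s / total) for s in branch_sizes if s > 0)

    H_size = attribute_entropy([6, 8])   # ~0.985
    IG_size = 0.395                      # from the previous slides
    print(IG_size / H_size)              # GainRatio(size) ~0.401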