
Page 1: CC282 Decision trees Lecture 2 slides for CC282 Machine

CC282 Decision trees

Lecture 2 slides for CC282 Machine Learning, R. Palaniappan, 2008

Page 2: CC282 Decision trees Lecture 2 slides for CC282 Machine

Lecture 2 - Outline

• More ML principles:
  – Concept learning
  – Hypothesis space
  – Generalisation and overfitting
  – Model (hypothesis) evaluation

• Inductive learning
  – Inductive bias
  – Decision trees
  – ID3 algorithm (entropy, information gain)


Page 3: CC282 Decision trees Lecture 2 slides for CC282 Machine

Concept learning

• Concept, c, is the problem to be learned
  – Example: a classification problem faced by an optician
    • Concept: whether or not to fit contact lenses, based on the user's budget, eye condition, environment, etc.
    • Inputs, x: user's budget, user's eye condition, user's environment
    • Output, y: to fit or not to fit

• A learning model is needed to learn a concept
• The learning model should ideally
  – Capture the training data, <x, y> -> descriptive ability
  – Generalise to unseen test data, <x_new, ?> -> predictive ability
  – Provide a plausible explanation of the learned concept, c -> explanatory ability
  – But descriptive and predictive abilities are generally considered sufficient


[Diagram: x → c → y (inputs x are mapped by the concept c to outputs y)]

Page 4: CC282 Decision trees Lecture 2 slides for CC282 Machine

Learning a concept

• Concept learning
  – Given many examples <input, output> of what c does, find a function h that approximates c
  – The number of examples is usually a small subset of all possible <input, output> pairs
  – h is known as a hypothesis (i.e. the learning model)
  – There might be a number of h that are candidate solutions - we select h from a hypothesis space H
  – If the hypothesis matches the behaviour of the target concept for all training data, then it is a consistent hypothesis

• Occam's razor
  – The simpler hypothesis that fits c is preferred
  – A simpler h means a shorter, smaller h
  – A simpler h is unlikely to fit merely by coincidence

• Learning = search in H for an appropriate h
  – Realisable task: H contains an h that fits the concept
  – Unrealisable task: H does not contain an h that fits the concept


Page 5: CC282 Decision trees Lecture 2 slides for CC282 Machine

More terms - Generalisation, overfitting, induction, deduction

• Generalisation
  – The ability of the trained model to perform well on test data

• Overfitting
  – The model learns the training data well but performs poorly on the test data

• Inductive learning (induction)
  – Learning a hypothesis from examples, where the system tries to induce a general rule/model from a set of observed instances/samples

• Inductive bias
  – Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called bias
  – A hypothesis consistent with the training examples is assumed likely to generalise to unseen examples - the trick is to find the right bias

• An unbiased learner
  – Can never generalise, so it is not practically useful

• Deduction
  – ML gives an output (prediction, classification, etc.) based on the previously acquired learning


Page 6: CC282 Decision trees Lecture 2 slides for CC282 Machine

Generalisation and overfitting example

• Assume we have inputs x and corresponding outputs y, and we wish to find a concept c that maps x to y

• Examples of hypotheses:

• h1 will give good generalisation
• h2 is overfitted
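The h1/h2 figure is not reproduced in this transcript. As an illustrative stand-in (not the slide's actual hypotheses), fitting a low-degree and a very high-degree polynomial to the same noisy points shows the contrast; this sketch assumes NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.1, size=10)   # roughly linear data with noise
x_test, y_test = np.linspace(0, 1, 100), 2 * np.linspace(0, 1, 100)

h1 = np.poly1d(np.polyfit(x_train, y_train, deg=1))   # simple hypothesis (a line)
h2 = np.poly1d(np.polyfit(x_train, y_train, deg=9))   # very flexible hypothesis

for name, h in (("h1", h1), ("h2", h2)):
    train_mse = np.mean((h(x_train) - y_train) ** 2)
    test_mse = np.mean((h(x_test) - y_test) ** 2)
    print(name, train_mse, test_mse)
# Typically: h2 fits the training points almost exactly (tiny train MSE)
# but shows a much larger test MSE than h1 - it has overfitted the noise.
```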


Page 7: CC282 Decision trees Lecture 2 slides for CC282 Machine

Model (hypothesis) evaluation

• We need to have some performance measure to estimate how well the model h approximates c, i.e. how good is h?

• Possible evaluation methods
  – Explanatory: gives a qualitative evaluation
  – Descriptive: gives a quantitative (numerical) evaluation

• Explanatory evaluation
  – Does the model provide a plausible description of the learned concept?
  – Classification: does it base its classification on plausible rules?
  – Association: does it discover plausible relationships in the data?
  – Clustering: does it come up with plausible clusters?
  – The meaning of "plausible" is to be defined by the human expert
  – Hence, not popular in ML


Page 8: CC282 Decision trees Lecture 2 slides for CC282 Machine

Descriptive evaluation

• Example: bowel cancer classification problem

  – True positives (TP): diseased patients identified as having cancer
  – True negatives (TN): healthy subjects identified as healthy
  – False negatives (FN): the test identifies a cancer patient as healthy
  – False positives (FP): the test identifies a healthy subject as having cancer

• Precision
• Sensitivity (Recall)
• F measure (balanced F score)
• Simple classification accuracy

Blood-test results for 203 subjects (source: Wikipedia):

                          Patients with bowel cancer
                          True           False
  Blood test   Positive   TP = 2         FP = 18
               Negative   FN = 1         TN = 182

$\text{precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 18} = 10\%$

$\text{recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} = 66.67\%$

$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = 17.39\%$

$\text{accuracy} = \frac{TP + TN}{\text{total cases}} = \frac{184}{203} = 90.64\%$
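A minimal Python check of these four measures using the confusion-matrix counts above (variable names are mine, not from the slides):

```python
# Counts from the bowel-cancer blood-test table above
TP, FP, FN, TN = 2, 18, 1, 182

precision = TP / (TP + FP)                                # 2/20  -> 10%
recall = TP / (TP + FN)                                   # 2/3   -> 66.67% (sensitivity)
f_score = 2 * precision * recall / (precision + recall)   # -> 17.39%
accuracy = (TP + TN) / (TP + FP + FN + TN)                # 184/203 -> 90.64%

print(f"precision={precision:.2%} recall={recall:.2%} "
      f"F={f_score:.2%} accuracy={accuracy:.2%}")
```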

Page 9: CC282 Decision trees Lecture 2 slides for CC282 Machine

Descriptive evaluation (contd)

• For prediction problems, mean square error (MSE) is used

$MSE = \frac{1}{n} \sum_{i=1}^{n} (d_i - a_i)^2$

where
  – d_i is the desired output in the data set
  – a_i is the actual output from the model
  – n is the number of instances in the data set

• Example: if n = 2, d_1 = 1.0, a_1 = 0.5, d_2 = 0, a_2 = 1.0, then MSE = ((1.0 - 0.5)^2 + (0 - 1.0)^2) / 2 = 1.25 / 2 = 0.625

• Sometimes, the root mean square error is used instead: RMSE = sqrt(MSE)
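A short sketch of the MSE and RMSE calculation for the two-instance example above (standard library only; variable names are mine):

```python
import math

desired = [1.0, 0.0]    # d_i values from the example
actual = [0.5, 1.0]     # a_i values from the example
n = len(desired)

mse = sum((d - a) ** 2 for d, a in zip(desired, actual)) / n   # (0.25 + 1.0) / 2 = 0.625
rmse = math.sqrt(mse)                                          # about 0.79

print(mse, rmse)
```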

Page 10: CC282 Decision trees Lecture 2 slides for CC282 Machine

Decision trees (DT)

• A simple form of inductive learning
• Yet a successful learning algorithm
• Consider the example of playing tennis
• Attributes (features)
  – Outlook, Temp, Humidity, Wind
• Values
  – Description of features
  – E.g.: Outlook values - Sunny, Cloudy, Rainy
• Target
  – Play
  – Represents the output of the model
• Instances
  – Examples D1 to D14 of the dataset
• Concept
  – Learn to decide whether to play tennis, i.e. find h from the given data set


Day  Outlook  Temp  Humidity  Wind    Play
D1   Sunny    Hot   High      Weak    No
D2   Sunny    Hot   High      Strong  No
D3   Cloudy   Hot   High      Weak    Yes
D4   Rainy    Mild  High      Weak    Yes
D5   Rainy    Cool  Normal    Weak    Yes
D6   Rainy    Cool  Normal    Strong  No
D7   Cloudy   Cool  Normal    Strong  Yes
D8   Sunny    Mild  High      Weak    No
D9   Sunny    Cool  Normal    Weak    Yes
D10  Rainy    Mild  Normal    Weak    Yes
D11  Sunny    Mild  Normal    Strong  Yes
D12  Cloudy   Mild  High      Strong  Yes
D13  Cloudy   Hot   Normal    Weak    Yes
D14  Rainy    Mild  High      Strong  No

Adapted from Mitchell, 1997

Page 11: CC282 Decision trees Lecture 2 slides for CC282 Machine

Decision trees (DT)

• A decision tree takes a set of properties as input and provides a decision as output
  – Each row of the table corresponds to a path in the tree
  – A decision tree may form a more compact representation, especially if many attributes are irrelevant

• DT is a suitable learning method when
  – Instances are describable by attribute-value pairs
  – The target function is discrete valued (e.g. YES, NO)
  – The training data is possibly noisy

• It is not suitable (needs further adaptation)
  – When attribute values and/or the target are numerical
    • E.g. attribute values: Temp = 22 °C, Wind = 25 mph
    • E.g. target function: 70%, 30%
  – When the function requires an exponentially large decision tree
    • E.g. the parity function


Page 12: CC282 Decision trees Lecture 2 slides for CC282 Machine

Forming rules from DT

• Example of concept: 'Should I play tennis today?'
  – Takes inputs (a set of attributes)
  – Outputs a decision (say YES/NO)

• Each non-leaf node is an attribute
  – The first non-leaf node is the root node
• Each leaf node is either Yes or No
• Each link (branch) is labelled with the possible values of the associated attribute

• Rule formation
  – A decision tree can be expressed as a disjunction of conjunctions
  – PLAY tennis IF (Outlook = Sunny) ∧ (Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy) ∧ (Wind = Weak)
  – ∨ is the disjunction operator (OR)
  – ∧ is the conjunction operator (AND)


[Decision tree figure:]

Outlook?
  Sunny  -> Humidity?
              High   -> No
              Normal -> Yes
  Cloudy -> Yes
  Rainy  -> Wind?
              Strong -> No
              Weak   -> Yes
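As an illustration, the disjunction-of-conjunctions rule above can be written directly as a boolean expression; this is a sketch, and the function name and argument layout are my own:

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Rule read off the tree: play if (Sunny AND Normal humidity)
    OR Cloudy OR (Rainy AND Weak wind)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Cloudy"
            or (outlook == "Rainy" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Weak"))    # False - matches row D1 (Play = No)
print(play_tennis("Rainy", "High", "Weak"))    # True  - matches row D4 (Play = Yes)
```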

Page 13: CC282 Decision trees Lecture 2 slides for CC282 Machine

Another DT example

– Another example (from Lecture 1)

– Reading the tree on the right


If parents visiting = yes, then go to the cinema
or if parents visiting = no and weather = sunny, then play tennis
or if parents visiting = no and weather = windy and money = rich, then go shopping
or if parents visiting = no and weather = windy and money = poor, then go to the cinema
or if parents visiting = no and weather = rainy, then stay in.

Source: http://wwwhomes.doc.ic.ac.uk/~sgc/teaching/v231/lecture10.html

Page 14: CC282 Decision trees Lecture 2 slides for CC282 Machine


Obtaining DT through top-down induction

• How can we obtain a DT?
• Perform a top-down search through the space of possible decision trees
  – Determine the attribute that best classifies the training data
  – Use this attribute as the root of the tree
  – Repeat this process for each branch, from left to right
  – Proceed to the next level and determine the next best feature
  – Repeat until a leaf is reached
• How to choose the best attribute?
  – Choose the attribute that will yield the most information (i.e. the attribute with the highest information gain)


Page 15: CC282 Decision trees Lecture 2 slides for CC282 Machine

Information gain

• Information gain -> a reduction of entropy, E
• But what is entropy?
  – In thermodynamics, the amount of energy that cannot be used to do work
  – Here, a measure of disorder in a system (high entropy = disorder), measured in bits

$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

  – where:
  – S is the training data set
  – c is the number of target classes
  – p_i is the proportion of examples in S belonging to target class i

• Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x). For even better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
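A small Python version of this entropy measure applied to class counts (a sketch; the helper name is mine):

```python
import math

def entropy(counts):
    """Entropy in bits of a set described by its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)     # 0 * log2(0) treated as 0

# Play-tennis data set: 9 YES and 5 NO examples -> about 0.94 bits (see Page 18)
print(entropy([9, 5]))
```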

Page 16: CC282 Decision trees Lecture 2 slides for CC282 Machine

Entropy example

• A coin is flipped
  – If the coin is fair -> 50% chance of heads
  – Now let us rig the coin -> so that heads comes up 99% of the time

• Let's look at this in terms of entropy:
  – Two outcomes: head, tail
  – Probabilities: p_head, p_tail
  – E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  – E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit

• If the probability of heads = 1, then entropy = 0
  – E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit (taking 0 log2(0) = 0)
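The coin figures can be checked the same way from the probabilities directly (a one-off check using only the standard library):

```python
import math

def coin_entropy(p_head):
    """Entropy in bits of a coin with P(head) = p_head."""
    return -sum(p * math.log2(p) for p in (p_head, 1.0 - p_head) if p > 0.0)

print(coin_entropy(0.5))    # 1.0 bit  (fair coin)
print(coin_entropy(0.99))   # about 0.08 bit (rigged coin)
print(coin_entropy(1.0))    # 0.0 bit  (no uncertainty at all)
```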


Page 17: CC282 Decision trees Lecture 2 slides for CC282 Machine

Information Gain

• Information gain, G, is defined as:

$Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)$

• where
  – Values(A) is the set of all possible values of attribute A
  – S_v is the subset of S for which A has value v
  – |S| is the size of S and |S_v| is the size of S_v

• The information gain is the expected reduction in entropy caused by knowing the value of attribute A
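A sketch of this gain computation over a list of examples; the data layout (one dict per instance) and the helper names are my own choices, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, target):
    """Information gain of `attribute` for dict-like examples with a `target` key."""
    base = entropy([e[target] for e in examples])
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        base -= len(subset) / len(examples) * entropy(subset)
    return base
```

Applied to the play-tennis table on Page 10 with attribute "Wind" and target "Play", this reproduces the value 0.048 worked out on the following slides.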

Page 18: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – entropy calculation

• Compute the entropy of the play-tennis example using $E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
• We have two classes, YES and NO (i.e. number of classes c = 2)
• We have 14 instances, with 9 classified as YES and 5 as NO

• E_YES = -(9/14) log2(9/14) = 0.41
• E_NO = -(5/14) log2(5/14) = 0.53
• E(S) = E_YES + E_NO = 0.94

Page 19: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Compute the information gain for the attribute Wind in the play-tennis data set:
  – |S| = 14
  – Attribute Wind has two values: Weak and Strong
  – |S_weak| = 8
  – |S_strong| = 6



Page 20: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, let us determine E(S_weak)
  – Instances = 8, YES = 6, NO = 2, i.e. [6+, 2-]
  – E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81



Page 21: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, let us determine E(S_strong)
  – Instances = 6, YES = 3, NO = 3, i.e. [3+, 3-]
  – E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

• Note: do not waste time computing if p_YES = p_NO; the entropy is exactly 1.0



Page 22: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Going back to the information gain computation for the attribute Wind:

$Gain(S, \text{Wind}) = E(S) - \frac{|S_{weak}|}{|S|} E(S_{weak}) - \frac{|S_{strong}|}{|S|} E(S_{strong})$

= 0.94 - (8/14)(0.81) - (6/14)(1.00) = 0.048

Page 23: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, compute the information gain for the attribute Humidity in the play-tennis data set:
  – |S| = 14
  – Attribute Humidity has two values: High and Normal
  – |S_high| = 7
  – |S_normal| = 7
  – For value High -> [3+, 4-]
  – For value Normal -> [6+, 1-]



Page 24: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Computing the information gain for the attribute Humidity:

E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59

$Gain(S, \text{Humidity}) = E(S) - \frac{|S_{high}|}{|S|} E(S_{high}) - \frac{|S_{normal}|}{|S|} E(S_{normal})$

= 0.94 - (7/14)(0.98) - (7/14)(0.59) = 0.15

• So, Humidity provides GREATER information gain than Wind

Page 25: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set (the working follows the same pattern as for Wind and Humidity)

• Results:
  – Gain(S, Outlook) = 0.25
  – Gain(S, Temp) = 0.03
  – Gain(S, Humidity) = 0.15
  – Gain(S, Wind) = 0.048

• The attribute with the highest information gain is OUTLOOK, therefore use Outlook as the root node
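These four gains can be reproduced directly from the 14-example table on Page 10; the sketch below hard-codes that data set, and the helper names are mine:

```python
import math
from collections import Counter

# (Outlook, Temp, Humidity, Wind, Play) for D1..D14, copied from the table on Page 10
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),       ("Sunny", "Hot", "High", "Strong", "No"),
    ("Cloudy", "Hot", "High", "Weak", "Yes"),     ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),   ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Cloudy", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),   ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Cloudy", "Mild", "High", "Strong", "Yes"),
    ("Cloudy", "Hot", "Normal", "Weak", "Yes"),   ("Rainy", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of the attribute in column `col` with respect to Play."""
    g = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for name, col in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Wind", 3)]:
    print(f"Gain(S, {name}) = {gain(data, col):.3f}")
# Prints roughly 0.247, 0.029, 0.152 and 0.048, matching the slide's rounded values
```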



Page 26: CC282 Decision trees Lecture 2 slides for CC282 Machine

DT – next level

• After determining OUTLOOK as the root node, we need to expand the tree
• Consider the Sunny branch: 5 instances, [2+, 3-]
• E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97



Page 27: CC282 Decision trees Lecture 2 slides for CC282 Machine

DT – next level

• Gain(S_sunny, Humidity) = 0.97 - (3/5)(0.0) - (2/5)(0.0) = 0.97
• Gain(S_sunny, Wind) = 0.97 - (3/5)(0.918) - (2/5)(1.0) = 0.019
• Gain(S_sunny, Temperature) = 0.97 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57

• The highest information gain is for Humidity, so use this attribute
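As a check, the three branch gains can be recomputed from the [YES, NO] counts in the five Sunny rows; a small sketch (the helper H is my own shorthand):

```python
import math

def H(pos, neg):
    """Entropy in bits of a node with `pos` YES and `neg` NO examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

E_sunny = H(2, 3)                                                          # about 0.97
gain_humidity = E_sunny - (3/5) * H(0, 3) - (2/5) * H(2, 0)                # about 0.97
gain_wind = E_sunny - (3/5) * H(1, 2) - (2/5) * H(1, 1)                    # about 0.02
gain_temp = E_sunny - (2/5) * H(0, 2) - (2/5) * H(1, 1) - (1/5) * H(1, 0)  # about 0.57

print(gain_humidity, gain_wind, gain_temp)
```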


Page 28: CC282 Decision trees Lecture 2 slides for CC282 Machine

Continue ….. and Final DT

• Continue until all the examples are classified
  – Compute Gain(S_rainy, Wind), Gain(S_rainy, Humidity), Gain(S_rainy, Temp)
  – Gain(S_rainy, Wind) is the highest

• All leaf nodes are associated with training examples from the same class (entropy = 0)

• The attribute Temperature is not used

Page 29: CC282 Decision trees Lecture 2 slides for CC282 Machine

ID3 algorithm – pseudocode

• Sufficient for exam
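The pseudocode figure itself is not reproduced in this transcript. The following is a minimal recursive sketch of ID3 in Python, written from the description on the earlier slides; the function names and the nested-dict tree representation are my own choices:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    base = entropy([e[target] for e in examples])
    for value in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == value]
        base -= len(subset) / len(examples) * entropy(subset)
    return base

def id3(examples, attributes, target):
    """Return a decision tree as nested dicts, with class labels at the leaves."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:               # all examples in one class -> leaf
        return labels[0]
    if not attributes:                      # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        branch = [e for e in examples if e[best] == value]
        tree[best][value] = id3(branch, [a for a in attributes if a != best], target)
    return tree
```

Called on the play-tennis table (as a list of dicts with keys Outlook, Temp, Humidity, Wind and Play), this reproduces the tree on Page 12: Outlook at the root, Humidity under Sunny and Wind under Rainy, with Temp unused.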


Page 30: CC282 Decision trees Lecture 2 slides for CC282 Machine

ID3 algorithm – pseudocode (Mitchell)

• From Mitchell (1997) – not important for exam


Page 31: CC282 Decision trees Lecture 2 slides for CC282 Machine

Search strategy in ID3

• Complete hypothesis space: any finite discrete-valued function can be expressed

• Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data

• Single hypothesis: only one current hypothesis (the simplest one) is maintained

• No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution

• Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental


Page 32: CC282 Decision trees Lecture 2 slides for CC282 Machine

Lecture 2 summary

• From this lecture, you should be able to:
  – Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning & bias, realisable & unrealisable tasks, and Occam's razor in the context of ML
  – Differentiate between generalisation and overfitting
  – Define entropy & information gain and know how to calculate them for a given data set
  – Explain the ID3 algorithm, how it works, and describe it in pseudocode
  – Apply the ID3 algorithm to a given data set
