
Page 1: CC282 Decision trees Lecture 2 slides for CC282 Machine

CC282 Decision trees

Lecture 2 slides for CC282 Machine Learning, R. Palaniappan, 2008

Page 2: CC282 Decision trees Lecture 2 slides for CC282 Machine

Lecture 2 - Outline

• More ML principles:
  – Concept learning
  – Hypothesis space
  – Generalisation and overfitting
  – Model (hypothesis) evaluation

• Inductive learning
  – Inductive bias
  – Decision trees
  – ID3 algorithm (entropy, information gain)


Page 3: CC282 Decision trees Lecture 2 slides for CC282 Machine

Concept learning

• Concept, c, is the problem to be learned
  – Example: a classification problem faced by an optician
    • Concept: whether or not to fit contact lenses, based on the user's budget, eye condition, environment, etc.
    • Inputs, x: user's budget, user's eye condition, user's environment
    • Output, y: to fit or not to fit

• A learning model is needed to learn a concept
• The learning model should ideally
  – Capture the training data, <x, y> -> descriptive ability
  – Generalise to unseen test data, <x_new, ?> -> predictive ability
  – Provide a plausible explanation of the learned concept, c -> explanatory ability
  – But descriptive and predictive abilities are generally considered sufficient


[Diagram: x → c → y (inputs x are mapped by the concept c to outputs y)]

Page 4: CC282 Decision trees Lecture 2 slides for CC282 Machine

Learning a concept

• Concept learning
  – Given many examples <input, output> of what c does, find a function h that approximates c
  – The number of examples is usually a small subset of all possible <input, output> pairs
  – h is known as a hypothesis (i.e. the learning model)
  – There might be a number of h that are candidate solutions - we select h from a hypothesis space H
  – If the hypothesis matches the behaviour of the target concept for all training data, then it is a consistent hypothesis

• Occam's razor
  – The simpler hypothesis that fits c is preferred
  – A simpler h means a shorter, smaller h
  – A simpler h is unlikely to fit merely by coincidence

• Learning = search in H for an appropriate h
  – Realisable task: H contains an h that fits the concept
  – Unrealisable task: H does not contain an h that fits the concept


Page 5: CC282 Decision trees Lecture 2 slides for CC282 Machine

More terms - Generalisation, overfitting, induction, deduction

• Generalisation
  – The ability of the trained model to perform well on test data

• Overfitting
  – The model learns the training data well but performs poorly on the test data

• Inductive learning (induction)
  – Learning a hypothesis from examples, where the system tries to induce a general rule/model from a set of observed instances/samples

• Inductive bias
  – Since many choices of h exist in H, any preference for one hypothesis over another without prior knowledge is called bias
  – A hypothesis consistent with the training examples is assumed likely to generalise to unseen examples - the trick is to find the right bias

• An unbiased learner
  – Can never generalise, so it is not practically useful

• Deduction
  – ML gives an output (prediction, classification, etc.) based on the previously acquired learning


Page 6: CC282 Decision trees Lecture 2 slides for CC282 Machine

Generalisation and overfitting example

• Assume we have inputs x and corresponding outputs y, and we wish to find a concept c that maps x to y

• Examples of hypotheses:

• h1 will give good generalisation
• h2 is overfitted
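The h1/h2 figure is not reproduced in this transcript. As an illustrative stand-in (not the slide's actual hypotheses), fitting a low-degree and a very high-degree polynomial to the same noisy points shows the contrast; this sketch assumes NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.1, size=10)   # roughly linear data with noise
x_test, y_test = np.linspace(0, 1, 100), 2 * np.linspace(0, 1, 100)

h1 = np.poly1d(np.polyfit(x_train, y_train, deg=1))   # simple hypothesis (a line)
h2 = np.poly1d(np.polyfit(x_train, y_train, deg=9))   # very flexible hypothesis

for name, h in (("h1", h1), ("h2", h2)):
    train_mse = np.mean((h(x_train) - y_train) ** 2)
    test_mse = np.mean((h(x_test) - y_test) ** 2)
    print(name, train_mse, test_mse)
# Typically: h2 fits the training points almost exactly (tiny train MSE)
# but shows a much larger test MSE than h1 - it has overfitted the noise.
```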


Page 7: CC282 Decision trees Lecture 2 slides for CC282 Machine

Model (hypothesis) evaluation

• We need to have some performance measure to estimate how well the model h approximates c, i.e. how good is h?

• Possible evaluation methods
  – Explanatory: gives a qualitative evaluation
  – Descriptive: gives a quantitative (numerical) evaluation

• Explanatory evaluation
  – Does the model provide a plausible description of the learned concept?
  – Classification: does it base its classification on plausible rules?
  – Association: does it discover plausible relationships in the data?
  – Clustering: does it come up with plausible clusters?
  – The meaning of "plausible" is to be defined by the human expert
  – Hence, not popular in ML


Page 8: CC282 Decision trees Lecture 2 slides for CC282 Machine

Descriptive evaluation

• Example: bowel cancer classification problem

  – True positives (TP): diseased patients identified as having cancer
  – True negatives (TN): healthy subjects identified as healthy
  – False negatives (FN): the test identifies a cancer patient as healthy
  – False positives (FP): the test identifies a healthy subject as having cancer

• Precision
• Sensitivity (Recall)
• F measure (balanced F score)
• Simple classification accuracy

Blood-test results for 203 subjects (source: Wikipedia):

                          Patients with bowel cancer
                          True           False
  Blood test   Positive   TP = 2         FP = 18
               Negative   FN = 1         TN = 182

$\text{precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 18} = 10\%$

$\text{recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} = 66.67\%$

$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = 17.39\%$

$\text{accuracy} = \frac{TP + TN}{\text{total cases}} = \frac{184}{203} = 90.64\%$
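A minimal Python check of these four measures using the confusion-matrix counts above (variable names are mine, not from the slides):

```python
# Counts from the bowel-cancer blood-test table above
TP, FP, FN, TN = 2, 18, 1, 182

precision = TP / (TP + FP)                                # 2/20  -> 10%
recall = TP / (TP + FN)                                   # 2/3   -> 66.67% (sensitivity)
f_score = 2 * precision * recall / (precision + recall)   # -> 17.39%
accuracy = (TP + TN) / (TP + FP + FN + TN)                # 184/203 -> 90.64%

print(f"precision={precision:.2%} recall={recall:.2%} "
      f"F={f_score:.2%} accuracy={accuracy:.2%}")
```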

Page 9: CC282 Decision trees Lecture 2 slides for CC282 Machine

Descriptive evaluation (contd)

• For prediction problems, mean square error (MSE) is used

$MSE = \frac{1}{n} \sum_{i=1}^{n} (d_i - a_i)^2$

where
  – d_i is the desired output in the data set
  – a_i is the actual output from the model
  – n is the number of instances in the data set

• Example: if n = 2, d_1 = 1.0, a_1 = 0.5, d_2 = 0, a_2 = 1.0, then MSE = ((1.0 - 0.5)^2 + (0 - 1.0)^2) / 2 = 1.25 / 2 = 0.625

• Sometimes, the root mean square error is used instead: RMSE = sqrt(MSE)
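A short sketch of the MSE and RMSE calculation for the two-instance example above (standard library only; variable names are mine):

```python
import math

desired = [1.0, 0.0]    # d_i values from the example
actual = [0.5, 1.0]     # a_i values from the example
n = len(desired)

mse = sum((d - a) ** 2 for d, a in zip(desired, actual)) / n   # (0.25 + 1.0) / 2 = 0.625
rmse = math.sqrt(mse)                                          # about 0.79

print(mse, rmse)
```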

Page 10: CC282 Decision trees Lecture 2 slides for CC282 Machine

Decision trees (DT)

• A simple form of inductive learning
• Yet a successful learning algorithm
• Consider the example of playing tennis
• Attributes (features)
  – Outlook, Temp, Humidity, Wind
• Values
  – Description of features
  – E.g.: Outlook values - Sunny, Cloudy, Rainy
• Target
  – Play
  – Represents the output of the model
• Instances
  – Examples D1 to D14 of the dataset
• Concept
  – Learn to decide whether to play tennis, i.e. find h from the given data set


Day  Outlook  Temp  Humidity  Wind    Play
D1   Sunny    Hot   High      Weak    No
D2   Sunny    Hot   High      Strong  No
D3   Cloudy   Hot   High      Weak    Yes
D4   Rainy    Mild  High      Weak    Yes
D5   Rainy    Cool  Normal    Weak    Yes
D6   Rainy    Cool  Normal    Strong  No
D7   Cloudy   Cool  Normal    Strong  Yes
D8   Sunny    Mild  High      Weak    No
D9   Sunny    Cool  Normal    Weak    Yes
D10  Rainy    Mild  Normal    Weak    Yes
D11  Sunny    Mild  Normal    Strong  Yes
D12  Cloudy   Mild  High      Strong  Yes
D13  Cloudy   Hot   Normal    Weak    Yes
D14  Rainy    Mild  High      Strong  No

Adapted from Mitchell, 1997

Page 11: CC282 Decision trees Lecture 2 slides for CC282 Machine

Decision trees (DT)

• A decision tree takes a set of properties as input and provides a decision as output
  – Each row of the table corresponds to a path in the tree
  – A decision tree may form a more compact representation, especially if many attributes are irrelevant

• DT is a suitable learning method when
  – Instances are describable by attribute-value pairs
  – The target function is discrete valued (e.g. YES, NO)
  – The training data is possibly noisy

• It is not suitable (needs further adaptation)
  – When attribute values and/or the target are numerical
    • E.g. attribute values: Temp = 22 °C, Wind = 25 mph
    • E.g. target function: 70%, 30%
  – When the function requires an exponentially large decision tree
    • E.g. the parity function


Page 12: CC282 Decision trees Lecture 2 slides for CC282 Machine

Forming rules from DT

• Example of concept: 'Should I play tennis today?'
  – Takes inputs (a set of attributes)
  – Outputs a decision (say YES/NO)

• Each non-leaf node is an attribute
  – The first non-leaf node is the root node
• Each leaf node is either Yes or No
• Each link (branch) is labelled with the possible values of the associated attribute

• Rule formation
  – A decision tree can be expressed as a disjunction of conjunctions
  – PLAY tennis IF (Outlook = Sunny) ∧ (Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy) ∧ (Wind = Weak)
  – ∨ is the disjunction operator (OR)
  – ∧ is the conjunction operator (AND)


[Decision tree figure:]

Outlook?
  Sunny  -> Humidity?
              High   -> No
              Normal -> Yes
  Cloudy -> Yes
  Rainy  -> Wind?
              Strong -> No
              Weak   -> Yes
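As an illustration, the disjunction-of-conjunctions rule above can be written directly as a boolean expression; this is a sketch, and the function name and argument layout are my own:

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Rule read off the tree: play if (Sunny AND Normal humidity)
    OR Cloudy OR (Rainy AND Weak wind)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Cloudy"
            or (outlook == "Rainy" and wind == "Weak"))

print(play_tennis("Sunny", "High", "Weak"))    # False - matches row D1 (Play = No)
print(play_tennis("Rainy", "High", "Weak"))    # True  - matches row D4 (Play = Yes)
```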

Page 13: CC282 Decision trees Lecture 2 slides for CC282 Machine

Another DT example

– Another example (from Lecture 1)

– Reading the tree on the right


If parents visiting = yes, then go to the cinema
or if parents visiting = no and weather = sunny, then play tennis
or if parents visiting = no and weather = windy and money = rich, then go shopping
or if parents visiting = no and weather = windy and money = poor, then go to the cinema
or if parents visiting = no and weather = rainy, then stay in.

Source: http://wwwhomes.doc.ic.ac.uk/~sgc/teaching/v231/lecture10.html

Page 14: CC282 Decision trees Lecture 2 slides for CC282 Machine


Obtaining DT through top-down induction

• How can we obtain a DT?
• Perform a top-down search through the space of possible decision trees
  – Determine the attribute that best classifies the training data
  – Use this attribute as the root of the tree
  – Repeat this process for each branch, from left to right
  – Proceed to the next level and determine the next best feature
  – Repeat until a leaf is reached
• How to choose the best attribute?
  – Choose the attribute that will yield the most information (i.e. the attribute with the highest information gain)


Page 15: CC282 Decision trees Lecture 2 slides for CC282 Machine

Information gain

• Information gain -> a reduction of entropy, E
• But what is entropy?
  – In thermodynamics, the amount of energy that cannot be used to do work
  – Here, a measure of disorder in a system (high entropy = disorder), measured in bits

$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

  – where:
  – S is the training data set
  – c is the number of target classes
  – p_i is the proportion of examples in S belonging to target class i

• Note: if your calculator doesn't do log2, use log2(x) = 1.443 ln(x) or 3.322 log10(x). For even better accuracy, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
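A small Python version of this entropy measure applied to class counts (a sketch; the helper name is mine):

```python
import math

def entropy(counts):
    """Entropy in bits of a set described by its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)     # 0 * log2(0) treated as 0

# Play-tennis data set: 9 YES and 5 NO examples -> about 0.94 bits (see Page 18)
print(entropy([9, 5]))
```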

Page 16: CC282 Decision trees Lecture 2 slides for CC282 Machine

Entropy example

• A coin is flipped
  – If the coin is fair -> 50% chance of heads
  – Now let us rig the coin -> so that heads comes up 99% of the time

• Let's look at this in terms of entropy:
  – Two outcomes: head, tail
  – Probabilities: p_head, p_tail
  – E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
  – E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit

• If the probability of heads = 1, then entropy = 0
  – E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit (taking 0 log2(0) = 0)
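The coin figures can be checked the same way from the probabilities directly (a one-off check using only the standard library):

```python
import math

def coin_entropy(p_head):
    """Entropy in bits of a coin with P(head) = p_head."""
    return -sum(p * math.log2(p) for p in (p_head, 1.0 - p_head) if p > 0.0)

print(coin_entropy(0.5))    # 1.0 bit  (fair coin)
print(coin_entropy(0.99))   # about 0.08 bit (rigged coin)
print(coin_entropy(1.0))    # 0.0 bit  (no uncertainty at all)
```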


Page 17: CC282 Decision trees Lecture 2 slides for CC282 Machine

Information Gain

• Information gain, G, is defined as:

$Gain(S, A) = E(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} E(S_v)$

• where
  – Values(A) is the set of all possible values of attribute A
  – S_v is the subset of S for which A has value v
  – |S| is the size of S and |S_v| is the size of S_v

• The information gain is the expected reduction in entropy caused by knowing the value of attribute A
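A sketch of this gain computation over a list of examples; the data layout (one dict per instance) and the helper names are my own choices, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute, target):
    """Information gain of `attribute` for dict-like examples with a `target` key."""
    base = entropy([e[target] for e in examples])
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        base -= len(subset) / len(examples) * entropy(subset)
    return base
```

Applied to the play-tennis table on Page 10 with attribute "Wind" and target "Play", this reproduces the value 0.048 worked out on the following slides.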

Page 18: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – entropy calculation

• Compute the entropy of the play-tennis example using $E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
• We have two classes, YES and NO (i.e. number of classes c = 2)
• We have 14 instances, with 9 classified as YES and 5 as NO

• E_YES = -(9/14) log2(9/14) = 0.41
• E_NO = -(5/14) log2(5/14) = 0.53
• E(S) = E_YES + E_NO = 0.94

Page 19: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Compute the information gain for the attribute Wind in the play-tennis data set:
  – |S| = 14
  – Attribute Wind has two values: Weak and Strong
  – |S_weak| = 8
  – |S_strong| = 6



Page 20: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, let us determine E(S_weak)
  – Instances = 8, YES = 6, NO = 2, i.e. [6+, 2-]
  – E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81



Page 21: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, let us determine E(S_strong)
  – Instances = 6, YES = 3, NO = 3, i.e. [3+, 3-]
  – E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0

• Note: do not waste time computing if p_YES = p_NO; the entropy is exactly 1.0



Page 22: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Going back to the information gain computation for the attribute Wind:

$Gain(S, \text{Wind}) = E(S) - \frac{|S_{weak}|}{|S|} E(S_{weak}) - \frac{|S_{strong}|}{|S|} E(S_{strong})$

= 0.94 - (8/14)(0.81) - (6/14)(1.00) = 0.048

Page 23: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, compute the information gain for the attribute Humidity in the play-tennis data set:
  – |S| = 14
  – Attribute Humidity has two values: High and Normal
  – |S_high| = 7
  – |S_normal| = 7
  – For value High -> [3+, 4-]
  – For value Normal -> [6+, 1-]



Page 24: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Computing the information gain for the attribute Humidity:

E(S_high) = -(3/7) log2(3/7) - (4/7) log2(4/7) = 0.98
E(S_normal) = -(6/7) log2(6/7) - (1/7) log2(1/7) = 0.59

$Gain(S, \text{Humidity}) = E(S) - \frac{|S_{high}|}{|S|} E(S_{high}) - \frac{|S_{normal}|}{|S|} E(S_{normal})$

= 0.94 - (7/14)(0.98) - (7/14)(0.59) = 0.15

• So, Humidity provides GREATER information gain than Wind

Page 25: CC282 Decision trees Lecture 2 slides for CC282 Machine

Example – information gain calculation

• Now, compute the information gain for the attributes Outlook and Temperature in the play-tennis data set (the working follows the same pattern as for Wind and Humidity)

• Results:
  – Gain(S, Outlook) = 0.25
  – Gain(S, Temp) = 0.03
  – Gain(S, Humidity) = 0.15
  – Gain(S, Wind) = 0.048

• The attribute with the highest information gain is OUTLOOK, therefore use Outlook as the root node
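These four gains can be reproduced directly from the 14-example table on Page 10; the sketch below hard-codes that data set, and the helper names are mine:

```python
import math
from collections import Counter

# (Outlook, Temp, Humidity, Wind, Play) for D1..D14, copied from the table on Page 10
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),       ("Sunny", "Hot", "High", "Strong", "No"),
    ("Cloudy", "Hot", "High", "Weak", "Yes"),     ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),   ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Cloudy", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),   ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Cloudy", "Mild", "High", "Strong", "Yes"),
    ("Cloudy", "Hot", "Normal", "Weak", "Yes"),   ("Rainy", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of the attribute in column `col` with respect to Play."""
    g = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

for name, col in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Wind", 3)]:
    print(f"Gain(S, {name}) = {gain(data, col):.3f}")
# Prints roughly 0.247, 0.029, 0.152 and 0.048, matching the slide's rounded values
```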



Page 26: CC282 Decision trees Lecture 2 slides for CC282 Machine

DT – next level

• After determining OUTLOOK as the root node, we need to expand the tree
• Consider the Sunny branch: 5 instances, [2+, 3-]
• E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97



Page 27: CC282 Decision trees Lecture 2 slides for CC282 Machine

DT – next level

• Gain(S_sunny, Humidity) = 0.97 - (3/5)(0.0) - (2/5)(0.0) = 0.97
• Gain(S_sunny, Wind) = 0.97 - (3/5)(0.918) - (2/5)(1.0) = 0.019
• Gain(S_sunny, Temperature) = 0.97 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57

• The highest information gain is for Humidity, so use this attribute
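As a check, the three branch gains can be recomputed from the [YES, NO] counts in the five Sunny rows; a small sketch (the helper H is my own shorthand):

```python
import math

def H(pos, neg):
    """Entropy in bits of a node with `pos` YES and `neg` NO examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

E_sunny = H(2, 3)                                                          # about 0.97
gain_humidity = E_sunny - (3/5) * H(0, 3) - (2/5) * H(2, 0)                # about 0.97
gain_wind = E_sunny - (3/5) * H(1, 2) - (2/5) * H(1, 1)                    # about 0.02
gain_temp = E_sunny - (2/5) * H(0, 2) - (2/5) * H(1, 1) - (1/5) * H(1, 0)  # about 0.57

print(gain_humidity, gain_wind, gain_temp)
```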


Page 28: CC282 Decision trees Lecture 2 slides for CC282 Machine

Continue ….. and Final DT

• Continue until all the examples are classified
  – Compute Gain(S_rainy, Wind), Gain(S_rainy, Humidity), Gain(S_rainy, Temp)
  – Gain(S_rainy, Wind) is the highest

• All leaf nodes are associated with training examples from the same class (entropy = 0)

• The attribute Temperature is not used

Page 29: CC282 Decision trees Lecture 2 slides for CC282 Machine

ID3 algorithm – pseudocode

• Sufficient for exam
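The pseudocode figure itself is not reproduced in this transcript. The following is a minimal recursive sketch of ID3 in Python, written from the description on the earlier slides; the function names and the nested-dict tree representation are my own choices:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attr, target):
    base = entropy([e[target] for e in examples])
    for value in set(e[attr] for e in examples):
        subset = [e[target] for e in examples if e[attr] == value]
        base -= len(subset) / len(examples) * entropy(subset)
    return base

def id3(examples, attributes, target):
    """Return a decision tree as nested dicts, with class labels at the leaves."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:               # all examples in one class -> leaf
        return labels[0]
    if not attributes:                      # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        branch = [e for e in examples if e[best] == value]
        tree[best][value] = id3(branch, [a for a in attributes if a != best], target)
    return tree
```

Called on the play-tennis table (as a list of dicts with keys Outlook, Temp, Humidity, Wind and Play), this reproduces the tree on Page 12: Outlook at the root, Humidity under Sunny and Wind under Rainy, with Temp unused.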


Page 30: CC282 Decision trees Lecture 2 slides for CC282 Machine

ID3 algorithm – pseudocode (Mitchell)

• From Mitchell (1997) – not important for exam


Page 31: CC282 Decision trees Lecture 2 slides for CC282 Machine

Search strategy in ID3

• Complete hypothesis space: any finite discrete-valued function can be expressed

• Incomplete search: searches incompletely through the hypothesis space, stopping once the tree is consistent with the data

• Single hypothesis: only one current hypothesis (the simplest one) is maintained

• No backtracking: once an attribute is selected, this cannot be changed. Problem: the result might not be the (globally) optimal solution

• Full training set: attributes are selected by computing information gain on the full training set. Advantage: robustness to errors. Problem: non-incremental


Page 32: CC282 Decision trees Lecture 2 slides for CC282 Machine

Lecture 2 summary

• From this lecture, you should be able to:
  – Define concept, learning model, hypothesis, hypothesis space, consistent hypothesis, inductive learning & bias, realisable & unrealisable tasks, and Occam's razor in the context of ML
  – Differentiate between generalisation and overfitting
  – Define entropy & information gain and know how to calculate them for a given data set
  – Explain the ID3 algorithm, how it works, and describe it in pseudocode
  – Apply the ID3 algorithm to a given data set
