SEEM4630 2013-2014 Tutorial 2
Classification:
Decision tree, Naïve Bayes & k-NN
Wentao TIAN, wttian@se.cuhk.edu.hk
Classification: Definition

- Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes: decision tree, naïve Bayes, k-NN.
- Goal: previously unseen records should be assigned a class as accurately as possible.
Decision Tree

- Goal: construct a tree so that instances belonging to different classes are separated.
- Basic algorithm (a greedy algorithm):
  - The tree is constructed in a top-down, recursive manner.
  - At the start, all the training examples are at the root.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
  - Examples are partitioned recursively based on the selected attributes.
Attribute Selection Measure 1: Information Gain

- Let p_i be the probability that a tuple belongs to class C_i, estimated by |C_{i,D}|/|D|.
- Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

- Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j)$$

- Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
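The two formulas above reduce to a few lines of code. Below is a minimal Python sketch (the helper names `entropy` and `info_gain` are mine, not from the tutorial) that reproduces the numbers derived in the worked example later in this tutorial:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), for one attribute column paired with the class column."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Weighted entropy of the partitions induced by the attribute
    info_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - info_a

# Play-Tennis class column: 9 Yes, 5 No (same row order as the example table)
play = ["N","N","Y","Y","Y","N","Y","N","Y","Y","Y","Y","Y","N"]
print(round(entropy(play), 2))  # 0.94

outlook = ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
           "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"]
print(round(info_gain(outlook, play), 2))  # 0.25
```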
Attribute Selection Measure 2: Gain Ratio

- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

GainRatio(A) = Gain(A) / SplitInfo_A(D)

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
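A short sketch of the normalization, reusing the 5/4/5 Outlook partition from the example data (function names are mine, assuming the `Gain(Outlook) = 0.25` value computed later):

```python
from collections import Counter
from math import log2

def split_info(values):
    """SplitInfo_A(D): entropy of the partition sizes induced by attribute A."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(values)

# Outlook splits the 14 records 5/4/5 (Sunny/Overcast/Rain)
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5
print(round(split_info(outlook), 2))        # 1.58
print(round(gain_ratio(0.25, outlook), 2))  # 0.16
```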
Attribute Selection Measure 3: Gini Index

- If a data set D contains examples from n classes, the gini index gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where p_j is the relative frequency of class j in D.

- If D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

- Reduction in impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$
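These three formulas can be checked against the Play-Tennis data (9 Yes / 5 No) from the next slide. A minimal sketch, with function names of my own choosing:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is class j's relative frequency."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into subsets D1 and D2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

labels = ["Y"] * 9 + ["N"] * 5          # 9 Yes, 5 No
print(round(gini(labels), 3))           # 0.459

# Binary split on Humidity: High [3+,4-] vs Normal [6+,1-]
high, normal = ["Y"] * 3 + ["N"] * 4, ["Y"] * 6 + ["N"] * 1
print(round(gini(labels) - gini_split(high, normal), 3))  # 0.092 reduction in impurity
```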
Example

| Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|
| Sunny | >25 | High | Weak | No |
| Sunny | >25 | High | Strong | No |
| Overcast | >25 | High | Weak | Yes |
| Rain | 15-25 | High | Weak | Yes |
| Rain | <15 | Normal | Weak | Yes |
| Rain | <15 | Normal | Strong | No |
| Overcast | <15 | Normal | Strong | Yes |
| Sunny | 15-25 | High | Weak | No |
| Sunny | <15 | Normal | Weak | Yes |
| Rain | 15-25 | Normal | Weak | Yes |
| Sunny | 15-25 | Normal | Strong | Yes |
| Overcast | 15-25 | High | Strong | Yes |
| Overcast | >25 | Normal | Weak | Yes |
| Rain | 15-25 | High | Strong | No |
Tree induction example

Entropy of data S [9+, 5-]:
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94

Split data by attribute Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
Gain(Outlook) = 0.94 – 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))] – 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))] – 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))] = 0.94 – 0.69 = 0.25
Tree induction example

Split data by attribute Temperature: <15 [3+,1-], 15-25 [5+,1-], >25 [2+,2-]
Gain(Temperature) = 0.94 – 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))] – 6/14[-5/6(log2(5/6)) - 1/6(log2(1/6))] – 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))] = 0.94 – 0.80 = 0.14
Tree induction example

Split data by attribute Humidity: High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 – 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))] – 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))] = 0.94 – 0.79 = 0.15

Split data by attribute Wind: Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 – 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))] – 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))] = 0.94 – 0.89 = 0.05
Tree induction example

Gain(Outlook) = 0.25, Gain(Temperature) = 0.14, Gain(Humidity) = 0.15, Gain(Wind) = 0.05

Outlook yields the largest information gain, so it is chosen as the root:

Outlook
- Sunny → ??
- Overcast → Yes
- Rain → ??
Tree induction example

Entropy of branch Sunny [2+,3-]:
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97

Split Sunny branch by attribute Temperature: <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature) = 0.97 – 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))] – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))] = 0.97 – 0.4 = 0.57

Split Sunny branch by attribute Humidity: High [0+,3-], Normal [2+,0-]
Gain(Humidity) = 0.97 – 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))] – 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))] = 0.97 – 0 = 0.97

Split Sunny branch by attribute Wind: Weak [1+,2-], Strong [1+,1-]
Gain(Wind) = 0.97 – 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))] – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] = 0.97 – 0.95 = 0.02
Tree induction example

On the Sunny branch, Humidity yields the largest gain:

Outlook
- Sunny → Humidity
  - High → No
  - Normal → Yes
- Overcast → Yes
- Rain → ??
Tree induction example

Entropy of branch Rain [3+,2-]:
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97

Split Rain branch by attribute Temperature: <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))] – 0/5·0 = 0.97 – 0.95 = 0.02 (the empty >25 partition contributes nothing)

Split Rain branch by attribute Humidity: High [1+,1-], Normal [2+,1-]
Gain(Humidity) = 0.97 – 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))] – 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))] = 0.97 – 0.95 = 0.02

Split Rain branch by attribute Wind: Weak [3+,0-], Strong [0+,2-]
Gain(Wind) = 0.97 – 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))] – 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))] = 0.97 – 0 = 0.97
Tree induction example (final tree)

Outlook
- Sunny → Humidity
  - High → No
  - Normal → Yes
- Overcast → Yes
- Rain → Wind
  - Weak → Yes
  - Strong → No
Bayesian Classification

- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities P(C_i | x_1, x_2, ..., x_n), where x_j is the value of attribute A_j.
- Choose the class label that has the highest probability.
- Foundation: based on Bayes' theorem:

$$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C_i)\, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$

Here P(C_i | x_1, ..., x_n) is the posterior probability, P(x_1, ..., x_n | C_i) the likelihood, and P(C_i) the prior probability.

- Model: compute P(x_1, x_2, ..., x_n | C_i) from the data.
Naïve Bayes Classifier

- Problem: joint probabilities are difficult to estimate.
- Naïve Bayes assumption: attributes are conditionally independent given the class:

$$P(x_1, x_2, \ldots, x_n \mid C_i) = P(x_1 \mid C_i) \cdots P(x_n \mid C_i) = \prod_{j=1}^{n} P(x_j \mid C_i)$$

$$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{\prod_{j=1}^{n} P(x_j \mid C_i)\, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$
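Under the independence assumption, training reduces to counting. A minimal Python sketch (function names are mine; no smoothing is applied, so unseen attribute values get probability zero) run on the ten-record A/B/C table that follows:

```python
from collections import Counter, defaultdict

def nb_train(records, labels):
    """Estimate P(C) and each P(x_j | C) by frequency counts (no smoothing)."""
    n = len(labels)
    prior = Counter(labels)      # class -> count
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for rec, c in zip(records, labels):
        for j, v in enumerate(rec):
            cond[(j, c)][v] += 1

    def score(rec, c):
        # Proportional to P(C=c | rec); the evidence P(x1,...,xn) is omitted,
        # since it is the same for every class.
        p = prior[c] / n
        for j, v in enumerate(rec):
            p *= cond[(j, c)][v] / prior[c]
        return p

    return score

# The ten records (attributes A, B; class C) from the table below
records = [("m","b"),("m","s"),("g","q"),("h","s"),("g","q"),
           ("g","q"),("g","s"),("h","b"),("h","q"),("m","b")]
labels  = ["t","t","t","t","t","f","f","f","f","f"]
score = nb_train(records, labels)
print(round(score(("m","q"), "t"), 2), round(score(("m","q"), "f"), 2))  # 0.08 0.04
```

The two printed scores are 2/25 and 1/25, matching the worked example two slides below: the test record A=m, B=q is assigned class t.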
Example: Naïve Bayes Classifier

| A | B | C |
|---|---|---|
| m | b | t |
| m | s | t |
| g | q | t |
| h | s | t |
| g | q | t |
| g | q | f |
| g | s | f |
| h | b | f |
| h | q | f |
| m | b | f |

P(C=t) = 1/2, P(C=f) = 1/2
P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5

Test record: A=m, B=q, C=?
Example: Naïve Bayes Classifier

For C = t:
P(A=m|C=t) × P(B=q|C=t) × P(C=t) = 2/5 × 2/5 × 1/2 = 2/25
P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)

For C = f:
P(A=m|C=f) × P(B=q|C=f) × P(C=f) = 1/5 × 2/5 × 1/2 = 1/25
P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)

The posterior for C = t is higher, so conclude: A=m, B=q → C=t.
Nearest Neighbor Classification

- Input: a set of stored records; k, the number of nearest neighbors.
- Output: the class label of the unknown record, obtained in three steps:
  1. Compute the distance to each stored record, e.g., the Euclidean distance

$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$

  2. Identify the k nearest neighbors.
  3. Determine the class label of the unknown record from the class labels of its nearest neighbors (i.e., by majority vote).
Nearest Neighbor Classification: A Discrete Example

Input: 8 training instances, with k = 1 and k = 3
- P1 (4, 2) Orange
- P2 (0.5, 2.5) Orange
- P3 (2.5, 2.5) Orange
- P4 (3, 3.5) Orange
- P5 (5.5, 3.5) Orange
- P6 (2, 4) Black
- P7 (4, 5) Black
- P8 (2.5, 5.5) Black

New instance: Pn (4, 4), class?

Calculate the distances:
- d(P1, Pn) = √((4−4)² + (4−2)²) = 2
- d(P2, Pn) = 3.80
- d(P3, Pn) = 2.12
- d(P4, Pn) = 1.12
- d(P5, Pn) = 1.58
- d(P6, Pn) = 2
- d(P7, Pn) = 1
- d(P8, Pn) = 2.12
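The whole procedure fits in a few lines. A sketch using the points and labels above (`math.dist` computes the Euclidean distance; Python 3.8+):

```python
from collections import Counter
from math import dist

# The 8 stored records from the example: name -> (coordinates, class label)
points = {"P1": ((4, 2), "Orange"), "P2": ((0.5, 2.5), "Orange"),
          "P3": ((2.5, 2.5), "Orange"), "P4": ((3, 3.5), "Orange"),
          "P5": ((5.5, 3.5), "Orange"), "P6": ((2, 4), "Black"),
          "P7": ((4, 5), "Black"), "P8": ((2.5, 5.5), "Black")}

def knn(query, k):
    """Majority vote among the k stored records nearest to the query."""
    nearest = sorted(points.values(), key=lambda pc: dist(pc[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn((4, 4), 1))  # Black  (P7 at distance 1)
print(knn((4, 4), 3))  # Orange (P7, P4, P5: two Orange vs one Black)
```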
Nearest Neighbor Classification

With k = 1, the single nearest neighbor is P7 (Black), so Pn is classified Black. With k = 3, the nearest neighbors are P7 (Black), P4 (Orange), and P5 (Orange), so the majority vote classifies Pn as Orange.
Nearest Neighbor Classification…

- Scaling issues: attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes. Each attribute should be mapped to the same range, e.g., with min-max normalization.
- Example: two data records a = (1, 1000), b = (0.5, 1). dis(a, b) = ?
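The example can be worked out numerically. A sketch of min-max normalization (the helper name `min_max` is mine; with only two records each value maps to an endpoint of [0, 1], so this is purely illustrative):

```python
from math import dist

def min_max(column):
    """Min-max normalization: v' = (v - min) / (max - min), mapping onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

a, b = (1, 1000), (0.5, 1)
# Raw Euclidean distance is dominated entirely by the second attribute
print(round(dist(a, b), 1))  # 999.0

# Normalize each attribute (column) across the records; both attributes
# now contribute on the same [0, 1] scale
a_s, b_s = zip(*(min_max(col) for col in zip(a, b)))
print(round(dist(a_s, b_s), 2))  # 1.41
```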
Classification: Lazy & Eager Learning

Two types of learning methodologies:
- Lazy learning: instance-based learning, e.g., k-NN.
- Eager learning: decision tree and Bayesian classification; also ANN & SVM.
Differences Between Lazy & Eager Learning

Lazy learning:
a. Does not require model building.
b. Less time training but more time predicting.
c. Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function.

Eager learning:
a. Requires model building.
b. More time training but less time predicting.
c. Must commit to a single hypothesis that covers the entire instance space.
Thank you & Questions?