
Machine Learning

Lecture 5

Supervised Learning

RBFNN, SVM, k-nn and DT

Dr. Patrick Chan ([email protected])

South China University of Technology, China


Agenda

Radial Basis Function Neural Network

Support Vector Machine

K-Nearest Neighbor

Decision Tree


Recall…

Multi-Layer Perceptron (MLP)

Black-Box

No idea what happens inside the process

Complex parameter space

High training complexity

Sub-optimal solution

[Figure: MLP with inputs x1, x2, weighted connections (w11, w21, w31, w'11, w'21), hidden layers, and outputs g1, g2]

RBF Neural Network

Radial Basis Function (RBF) Neural Network is a special type of ANN

Only three layers

Input Layer

Hidden Layer

Activation function: RBF (Gaussian)

Output Layer

Linearly weighted Sum

Parameters are meaningful

[Figure: RBFNN architecture with input, hidden, and output layers; weights w11, w12, w13, ..., w1m, ..., wc1, wc2, wc3, ..., wcm connect the hidden neurons to the outputs]

RBF Neural Network

Recall the Gaussian function

The most common distribution function

Univariate function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Two parameters:

μ : mean

σ² : variance


RBF Neural Network

Each neuron of an RBFNN represents a Gaussian distribution with center μ and width σ. Its output measures the effect of the neuron on an input x.

If the output is large, x and the neuron are similar in nature.

[Figure: Gaussians with different centers (μ) and different widths (σ); x is more related to the neuron whose Gaussian gives the larger response at x]

RBF Neural Network

If the weight is larger, the neuron is more important, as it has a larger effect on the output.

[Figure: an output built up by adding neurons one at a time, with weights w1 = 1, w2 = 1, w3 = 0.5]

RBF Neural Network

The output of the RBFNN for class k is:

$$g_k(x) = \sum_{j=1}^{m} w_{jk}\, \phi_j(x), \qquad \phi_j(x) = \exp\!\left(-\frac{\|x-\mu_j\|^2}{2\sigma_j^2}\right)$$

w : the weight (importance of the neuron)

m : the number of neurons

[Figure: RBFNN with hidden neurons φ1 ... φm connected to the outputs g1 ... gc by weights w11 ... wmc]

RBF Neural Network

[Figure: example with three Gaussian neurons weighted by w1 = 1.5, w2 = 0.3, w3 = 2, giving Blue / Red / Red decision regions]

RBF Neural Network

Parameter Determination

Parameters (μ, σ and w) are meaningful

They can be determined not only by gradient descent but also by clustering techniques

Objective is to seek the “natural” clusters in the data

Clustering will be discussed later in detail


RBF Neural Network: Parameter Determination

Gradient Descent

Weights (w)

Centers (μ)

Widths (σ)

All three sets of parameters can be updated by gradient descent on the training error of

$$g_k(x) = \sum_{j=1}^{m} w_{jk} \exp\!\left(-\frac{1}{2\sigma_j^2}\|x-\mu_j\|^2\right)$$

The learning rates of the different parameter groups can be different.

RBF Neural Network: Parameter Determination

Clustering Technique

Given a dataset

Separate the samples of each class into groups by a clustering technique, e.g. K-means. The number of clusters per class may need to be determined in advance.

Center (μ): mean of the samples belonging to the same cluster

Width (σ²): variance of the samples belonging to the same cluster

[Figure: scatter plot of a dataset with the clusters found for each class]
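A minimal sketch of this idea, assuming scikit-learn's KMeans is available; the per-class cluster count is an arbitrary choice for illustration, not a value from the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_params_by_clustering(X, y, clusters_per_class=2):
    """Estimate RBF centers and widths by clustering each class separately."""
    centers, widths = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0).fit(Xc)
        for j in range(clusters_per_class):
            members = Xc[km.labels_ == j]
            centers.append(km.cluster_centers_[j])                      # center = cluster mean
            widths.append(np.sqrt(members.var(axis=0).mean()) + 1e-8)   # width from cluster variance
    return np.array(centers), np.array(widths)
```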

RBF Neural Network: Parameter Determination

Clustering Technique

φ is fixed once μ and σ are determined

RBFNN: $g_k(x) = \sum_{j} w_{jk}\, \phi_j(x)$

Since g is linear in w,

the weights (w) can be calculated by the

pseudoinverse (refer to the previous lecture)
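A one-line sketch of this pseudoinverse step, assuming the hidden-layer activations have already been collected in a design matrix Phi (one row per training sample) and the targets in Y; the names are mine, not from the slides.

```python
import numpy as np

def solve_output_weights(Phi, Y):
    """Least-squares output weights: W = pinv(Phi) @ Y, so that Phi @ W ~= Y.

    Phi : (n, m) hidden-layer activations for the n training samples
    Y   : (n, c) target outputs (e.g. one-hot or +/-1 class codes)
    """
    return np.linalg.pinv(Phi) @ Y
```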

RBF Neural Network: Parameter Determination

2-Phase Learning

Initialization of gradient descent affects the performance significantly

2-Phase Learning combines clustering and gradient descent

1. A clustering technique initializes the values of μ, σ and w

2. Gradient descent fine-tunes these values

RBF Network

How to determine the number of neurons?

Fewer neurons

Low complexity

Low training cost

More neurons

Can represent more complicated distributions

Special case: maximum m = number of training samples (n)

Mean: each training sample is a center

Covariance: controls the influence of a sample on unseen samples

[Figure: the same sample set with a larger covariance vs. a smaller covariance]

RBF Neural Network

XOR Example

Input space: x = (x1, x2) ∈ {(0,0), (0,1), (1,0), (1,1)}

Output space: y ∈ {0, 1}

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane, coloured by class]

RBF Neural Network

XOR Example

The feature space:

x        φ1(x)    φ2(x)
(1, 1)   1        0.1353
(0, 1)   0.3678   0.3678
(0, 0)   0.1353   1
(1, 0)   0.3678   0.3678

where

$$\phi_1(x) = e^{-\|x-\mu_1\|^2}, \quad \mu_1 = [1, 1]^T$$

$$\phi_2(x) = e^{-\|x-\mu_2\|^2}, \quad \mu_2 = [0, 0]^T$$

[Figure: left, the four XOR points in the input space (x1, x2); right, the same points mapped into the feature space (φ1, φ2), where they become linearly separable]
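A small sketch that reproduces this feature-space table; the centers μ1 = [1,1] and μ2 = [0,0] are taken from the slide.

```python
import numpy as np

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
mu1, mu2 = np.array([1, 1]), np.array([0, 0])

phi1 = np.exp(-np.sum((X - mu1) ** 2, axis=1))   # e^{-||x - mu1||^2}
phi2 = np.exp(-np.sum((X - mu2) ** 2, axis=1))   # e^{-||x - mu2||^2}

for x, p1, p2 in zip(X, phi1, phi2):
    print(x, round(p1, 4), round(p2, 4))
# (1,1) -> 1.0    0.1353
# (0,1) -> 0.3679 0.3679
# (0,0) -> 0.1353 1.0
# (1,0) -> 0.3679 0.3679
```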

RBF Neural Network

XOR Example

Weight Determination:

$$\begin{bmatrix} 1 & 0.1353 & 1 \\ 0.3678 & 0.3678 & 1 \\ 0.1353 & 1 & 1 \\ 0.3678 & 0.3678 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

where the columns are φ1(x), φ2(x) and the bias term, and the right-hand side holds the class outputs of the samples. Solving with the pseudoinverse gives

$$\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 2.284 \\ 2.284 \\ -1.692 \end{bmatrix}$$

[Figure: the resulting decision boundary in the feature space (φ1, φ2)]

RBF Network: Characteristic

Advantage

RBF Network trains faster than MLP

The hidden layer is easier to interpret than MLP

Disadvantage

During the test phase, the calculation of a neuron in an RBF network is slower than in an MLP

Architecture is fixed


Support Vector Machine (SVM)

Which one is the best linear separator?


Support Vector Machine (SVM)

A clever sheep dog was herding his sheep… It runs between the sheep and tries to separate the black sheep from the white sheep.

[Figure: the sheep dog running between the white sheep and the black sheep]

Support Vector Machine (SVM)

The sheep dog keeps running… The sheep start to grow wool… The dog feels that the gap between the black sheep and the white sheep is getting narrower…

Support Vector Machine (SVM)

The wool grows bigger and bigger… Finally, only one path is left…

Support Vector Machine (SVM)

(Figure from: Learning with Kernels, Schölkopf & Smola)

The sheep find out that the single path relies only on some of the sheep. These sheep are the "sheep vectors", i.e. the "support vectors" in SVM.

[Figure: the largest-margin path, with the sheep vectors lying on its edges]

Support Vector Machine (SVM)

Similar to the sheep, if each sample keeps growing…

this should be the best classifier

Support Vector Machine (SVM)

A classifier with the largest margin

The concept is the same


Support Vector Machine (SVM)

The Maximum Margin Classifier depends ONLY on a few samples, called Support Vectors.

[Figure: the maximum-margin boundary with the support vectors highlighted]

Support Vector Machine

Statistical Learning Theory

Recall, a classifier is trained by minimizing the training error (Remp, empirical risk)

Our ultimate objective is to minimize the error on unseen samples (R, true risk)

Training error can be reduced easily by using a more complicated classifier

but such a classifier overfits easily too

A simpler classifier is preferred (assume empirical risk is the same)


Support Vector Machine

Statistical Learning Theory

SLT was proposed by Prof. Vladimir Vapnik

It relates R, Remp and the classifier complexity

With probability 1 − η, the bound on the true risk (R) is:

$$R \le R_{emp} + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$$

n = number of training samples

h = VC (Vapnik-Chervonenkis) dimension

h measures the complexity of the classifier; the square-root term is the complexity (VC confidence) term.

Support Vector Machine

Statistical Learning Theory

With probability 1 − η, the bound on the true risk (R) is, as above:

$$\underbrace{R}_{\text{true error}} \le \underbrace{R_{emp}}_{\text{training error}} + \underbrace{\sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}}_{\text{complexity}}$$

Observation: when n → ∞, Remp → R

More samples, the smaller the complexity term

This bound is independent of the sample distribution

SVM: Statistical Learning Theory

VC Dimension

The VC dimension of a set of functions { f } is h if there exists a set of h points that can be shattered by { f }, but no set of h+1 points.

Shattered means that any labeling of the points can be classified correctly by a function from { f }

Larger h = more complicated classifier


SVM: SLT: VC Dimension

Linear Classifier Example

2 features

Linear Classifier is a straight line

h is the number of points

[Figure: sets of 1, 2 and 3 points in general position can all be shattered by a straight line (Yes! for h = 1, 2, 3)]

SVM: SLT: VC Dimension

Linear Classifier Example

The maximum value of h is 3

Therefore, the VC dimension of a linear classifier in 2D is 3

[Figure: a set of 4 points cannot be shattered by a straight line (No! for h = 4)]

Support Vector Machine (SVM)

SVM with a larger margin has a smaller VC dimension

Not only Remp but also the margin is optimized in SVM training

This implicitly optimizes R

How to find the boundary?

Separable case

Non-separable case

Slack variables

Kernel method


SVM: Linearly Separable

The hyperplane is:

wx + b = 0

Suppose the dotted lines (margin boundaries) are:

wx + b = ±ε

If we rescale w → λw, b → λb and ε → λε, the equations still describe the same lines.

We therefore choose ε = 1.

[Figure: the hyperplane wx + b = 0 with the two margin boundaries wx + b = 1 and wx + b = −1, and the margin width between them]

SVM: Linearly Separable

The distance from the origin to the hyperplane wx + b = 0 is |b| / ||w||

Similarly, the distances from the origin to the dotted lines wx + b = ±1 are |b − 1| / ||w|| and |b + 1| / ||w||

Therefore, the margin width is:

$$\frac{b+1}{\|w\|} - \frac{b-1}{\|w\|} = \frac{2}{\|w\|}$$

[Figure: the hyperplane wx + b = 0, the margin boundaries wx + b = ±1, and the margin width 2/||w||]

SVM: Linearly Separable

The problem can be formulated as a Quadratic Optimization Problem and solved for w and b:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad \text{subject to} \qquad y_i(w x_i + b) \ge 1, \quad i = 1, \dots, n$$

where the margin width is 2/||w||, so minimizing ||w|| maximizes the margin.

[Figure: the hyperplane y(x) = wx + b = 0 with the margin boundaries y(x) = 1 and y(x) = −1]

SVM: Linearly Separable

This optimization problem can be reformulated as its Dual Problem using the Lagrangian method:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \qquad \text{subject to} \qquad \alpha_i \ge 0 \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

The weight is then determined by:

$$w = \sum_{i=1}^{n}\alpha_i y_i x_i$$

SVM: Linearly Separable

Many αi are zero

xi with non-zero αi are support vectors (SVs)

The decision boundary is determined only by the SVs

[Figure: example with α1 = 0.6, α2 = 0.8, α3 = 1.4 for the support vectors; all other samples (non-SVs) have αi = 0]
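For intuition, a small scikit-learn sketch (assuming scikit-learn is installed; the toy data is made up) that fits a linear SVM and inspects which training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin

print("support vectors:\n", clf.support_vectors_)   # only these define the boundary
print("alpha_i * y_i:", clf.dual_coef_)             # non-zero only for support vectors
print("w:", clf.coef_, " b:", clf.intercept_)
```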

SVM: Linearly Separable

For testing with a new data point z:

$$f(z) = w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b$$

If f(z) > 0, class 1

If f(z) < 0, class 2

Note: w need not be formed explicitly

SVM: Non-Linearly Separable

How about the non-linearly separable case? The margin cannot be defined any more.

Two approaches:

Add slack variables

Use a kernel (non-linear SVM)

SVM: Non-Linearly Separable

Slack Variable

A slack variable (ξ) is added as a penalty to allow a sample to lie inside, or on the wrong side of, the margin.

Optimization:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to} \qquad y_i(w x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$

C : trade-off parameter between error and margin

[Figure: samples with ξ > 1 are misclassified (error), samples with 0 < ξ < 1 are inside the margin but still correct; the ξ of all other samples is 0]
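A quick sketch of the role of C, using scikit-learn's soft-margin SVC on made-up overlapping data; a small C tolerates more margin violations, a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])  # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, "
          f"train acc={clf.score(X, y):.2f}")
```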

SVM: Non-Linearly Separable

Slack Variable

The dual of this new constrained optimization problem is

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \qquad \text{subject to} \qquad 0 \le \alpha_i \le C \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

It is the same as the linearly separable case, except that there is an upper bound C on αi

w is calculated by $w = \sum_{i=1}^{n}\alpha_i y_i x_i$

[Figure: the support vectors (C ≥ αi > 0) are the samples on the margin and the samples inside or beyond the margin (ξ > 0); non-SVs have αi = 0]

SVM: Non-Linearly Separable

Kernel Method

Map samples to higher dimensional space

Input space: the original space of xi

Feature space: the space of φ(xi)

Linear operation in the feature space is equivalent to non-linear operation in input space

Classification can become easier with a proper transformation


SVM: Non-Linearly Separable

Kernel Method

A kernel is a function that implicitly maps the input into a high-dimensional feature space

A linear SVM is then constructed in the feature space

[Figure: non-linearly separable data in the input space becomes linearly separable in the feature space]

SVM: Non-Linearly Separable

Kernel Method: Trick

Recall the linearly separable case:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, \underbrace{x_i^T x_j}_{\text{inner product}} \qquad \text{subject to} \qquad \alpha_i \ge 0 \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

The data points appear only as inner products

As long as the inner product can be calculated, the mapping φ is not required explicitly

Define the kernel function K by $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

SVM: Non-Linearly Separable

Kernel Method: Trick

Suppose φ(·) is given, e.g. $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^T$

An inner product in the feature space is then $\phi(x)^T\phi(z) = (x^T z)^2$

The kernel function is defined as $K(x, z) = \phi(x)^T\phi(z) = (x^T z)^2$

There is no need to define φ(·) explicitly

This is known as the kernel trick
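A tiny numeric check of this identity, using the quadratic mapping above (an illustrative choice rather than the specific mapping on the original slide):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)       # inner product computed in the feature space
kernel = (x @ z) ** 2            # kernel trick: (x^T z)^2, no phi needed
print(explicit, kernel)          # both 16.0
```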

SVM: Non-Linearly Separable

Kernel Method: Example

Polynomial kernel with degree d: $K(x, z) = (x^T z + 1)^d$

Radial basis function kernel with width σ: $K(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$

Closely related to radial basis function neural networks

The feature space is infinite-dimensional

Sigmoid kernel with parameters κ and θ: $K(x, z) = \tanh(\kappa\, x^T z + \theta)$

It does not satisfy the Mercer condition for all κ and θ

SVM: Non-Linearly Separable

Kernel Method: Example

Similar to the linearly separable case, but all inner products are changed to kernel functions:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j) \qquad \text{subject to} \qquad 0 \le \alpha_i \le C \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

SVM: Non-Linearly Separable

Kernel Method: Example

For testing with a new data point z:

$$f(z) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, z) + b$$

If f(z) > 0, class 1

If f(z) < 0, class 2

Note: φ (and w) need not be formed explicitly

SVM: Example

Given:

5 one-dimensional training samples:

Polynomial kernel of degree 2 is used:

K(x, y) = (xy + 1)²

C is set to 100

x y

1 1

2 1

4 -1

5 -1

6 1


SVM: Example

By maximizing

$$\sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i x_j + 1)^2 \qquad \text{subject to} \qquad 0 \le \alpha_i \le 100 \ \text{ and } \ \sum_{i=1}^{5}\alpha_i y_i = 0$$

Result:

α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833

The support vectors are {x2 = 2, x4 = 5, x5 = 6}

The discriminant function is

$$f(z) = 2.5\,(1)(2z+1)^2 + 7.333\,(-1)(5z+1)^2 + 4.833\,(1)(6z+1)^2 + b = 0.6667\,z^2 - 5.333\,z + b$$

SVM: Example

b can be determined by using the points on the margin,

f (2) = 1, f (6) = 1 and f (5) = -1

Therefore, b = 9, and $f(z) = 0.6667\,z^2 - 5.333\,z + 9$
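A quick numeric check of this worked example, using the α values and kernel given above:

```python
import numpy as np

alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
b = 9.0

def f(z):
    # f(z) = sum_i alpha_i y_i K(x_i, z) + b, with K(x, z) = (x z + 1)^2
    return np.sum(alpha * y * (x * z + 1) ** 2) + b

for z in x:
    print(z, round(f(z), 3), "-> class", 1 if f(z) > 0 else 2)
# the sign of f matches the labels: +, +, -, -, +
```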

SVM: Example

[Figure: the five training points 1, 2, 4, 5, 6 on the real line; f(z) > 0 (class 1) on the outer regions and f(z) < 0 (class 2) in between; the support vectors are x = 2, 5, 6]

SVM: Multi-Class Problem

SVM can only handle 2-class problems

How to handle a multi-class problem?

g in an LDF can be formulated as an estimate of the posterior probability of a class

However, an SVM always considers two classes

It does not estimate the probability of a class

so the max method cannot be applied to SVM

SVM: Multi-Class Problem

How to handle a multi-class problem?

1-against-All

1-against-1

Discussed in the previous lecture

[Figure: 1-against-All and 1-against-1 decompositions of a 4-class problem (y1, y2, y3, y4)]

SVM: Characteristic

Advantages: Training is relatively easy

No local optima

It scales relatively well to high dimensional data (inner product)

Tradeoff between classifier complexity and error can be controlled explicitly

Disadvantages: Slow when the number of samples is large

Need to choose a “good” kernel function


K-Nearest Neighbor (K-NN)

A new pattern is classified by a majority vote of its k nearest neighbors (training samples)

n distances are calculated for each new sample (n: the number of training samples)

[Figure: an unseen sample with its k = 1, k = 3 and k = 5 nearest neighbors; the vote outcome (Triangle or Circle) depends on k]
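A minimal sketch of the vote itself, in plain NumPy with Euclidean distance; the toy data is made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # n distances per query
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(["circle", "circle", "circle", "triangle", "triangle", "triangle"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> "triangle"
```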

K-Nearest Neighbor (K-NN)

Target function for the entire space may be described as a combination of less complex local approximations


K-Nearest Neighbor (K-NN)

How to determine k?

Small k

Sensitive to noise

Large k

Neighbours may be too far away from the unseen sample

Less representative

[Figure: the same unseen sample is classified as Circle or Triangle depending on whether k = 1, 3, 5 or 19]

K-NN: Characteristic

Advantages: Very simple

No training is needed

All computations deferred until classification

Disadvantages: Difficult to determine k

Affected by noisy training data

Classification is time consuming

Need to calculate the distance between the unseen sample and each training sample


Decision Tree (DT)

Most classifiers are black-box

A DT provides an explanation of its decisions

One of the most widely used and practical methods for inductive inference

Approximates discrete-valued functions (including disjunctions)


DT: Example

Do we go to play tennis today?

If Outlook is Sunny AND Humidity is Normal → Yes

If Outlook is Overcast → Yes

If Outlook is Rain AND Wind is Weak → Yes

Other situations → No

DT: Classification

Internal nodes can be univariate (only one feature is used per node)

[Figure: a tree that first tests x1 > a and then x2 > b, and the corresponding axis-aligned decision regions in the (x1, x2) plane]

DT: Classification

Internal nodes can also be multivariate: more than one feature is used per node, and the shape of the decision region can be irregular.

[Figure: a node testing a·x1 + b·x2 + c > 0 and the resulting oblique decision boundary in the (x1, x2) plane]

DT: Learning Algorithm

LOOP:

1. Select the best feature (A)

2. For each value of A, create a new descendant of the node

3. Sort the training samples to the leaf nodes

STOP when the training samples are perfectly classified

#   x1  x2  x3
1   1   3   5
2   1   4   2
3   3   1   5
4   3   5   6
5   3   3   4
6   4   5   7

[Figure: a tree that first tests x1 > 2 and then x2 > 2, with STOP leaves]

DT: Learning Algorithm

Observation

Many different trees may encode a training set without any error

Finding the smallest tree is an NP-hard problem

A local search algorithm is used to find reasonable solutions

What is the best feature?


DT: Feature Measurement

Entropy is used to evaluate features

Measure of uncertainty

Range: 0 – 1 (for a binary variable)

Smaller value, less uncertainty

If all samples belong to xi, then p(xi) = 1 and p(xj) = 0 for j ≠ i

Thus, H(X) = 0 (no uncertainty)

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

X: a random variable with n outcomes, X = {xi | i = 1, 2, …, n}

where p(x) is the probability mass function of outcome x


DT: Feature Measurement

Information Gain

Reduction in entropy (uncertainty) due to sorting on a feature A:

$$Gain(X, A) = \underbrace{H(X)}_{\text{current entropy}} - \underbrace{H(X\,|\,A)}_{\text{entropy after using feature } A}$$
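A small sketch of these two quantities in plain Python; the helper names entropy and information_gain are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p(x_i) log2 p(x_i) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(X, A) = H(X) - H(X|A), where A splits the samples by its values."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [y for a, y in zip(feature_values, labels) if a == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_cond
```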

DT: Example

Which feature is the best?

Recall:

$$Gain(X, A) = H(X) - H(X|A), \qquad H(X) = -\sum_{i} p(x_i)\log_2 p(x_i)$$

Current: the uncertainty is high without any sorting by a feature. With 9 "yes" and 5 "no" samples:

$$H(X) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.941$$

DT: Example

Outlook splits the samples into three branches:

Branch     No  Yes
Sunny      3   2
Rain       2   3
Overcast   0   4

Let A = Outlook

$$H(X|A) = H(X|sunny)P(sunny) + H(X|rain)P(rain) + H(X|overcast)P(overcast)$$

Recall: $Gain(X, A) = H(X) - H(X|A)$

DT: Example

Let A = Outlook

Branch     No  Yes
Sunny      3   2
Rain       2   3
Overcast   0   4

$$H(X|sunny) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$$

$$H(X|rain) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$$

$$H(X|overcast) = -\tfrac{0}{4}\log_2\tfrac{0}{4} - \tfrac{4}{4}\log_2\tfrac{4}{4} = 0$$

$$H(X|A) = \tfrac{5}{14}(0.971) + \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) = 0.694$$

DT: Example

Current: H(X) = 0.941

H(X|Outlook) = 0.694

Similarly, for each feature:

H(X|Temperature) = 0.911

H(X|Humidity) = 0.789

H(X|Wind) = 0.892

Recall: Gain(X, A) = H(X) − H(X|A)

The information gains are:

Gain(X, Outlook) = 0.247

Gain(X, Temperature) = 0.030

Gain(X, Humidity) = 0.152

Gain(X, Wind) = 0.049

Outlook is the best feature and should be used as the first node.
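The Outlook numbers can be checked directly with a few lines of Python (the entropy helper from the earlier sketch is repeated here so the snippet is self-contained; the branch counts are those from the slide):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Branch counts: Sunny 3 No / 2 Yes, Rain 2 No / 3 Yes, Overcast 0 No / 4 Yes
play_all   = ["no"] * 5 + ["yes"] * 9
h_x        = entropy(play_all)                                   # 0.940
h_sunny    = entropy(["no"] * 3 + ["yes"] * 2)                   # 0.971
h_rain     = entropy(["no"] * 2 + ["yes"] * 3)                   # 0.971
h_overcast = entropy(["yes"] * 4)                                # 0.0
h_x_given_outlook = (5/14) * h_sunny + (5/14) * h_rain + (4/14) * h_overcast  # 0.694
print(round(h_x - h_x_given_outlook, 3))                         # Gain(X, Outlook) = 0.247
```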

DT: Example

Next Step

Repeat the steps for each sub-branch

Until there is no ambiguity (all samples are of the same class)

[Figure: the Outlook node with branches Sunny (3 No / 2 Yes: continue selecting the next feature), Overcast (0 No / 4 Yes: done), and Rain (2 No / 3 Yes: continue selecting the next feature)]

DT: Continuous-Valued Feature

So far, we have handled features with categorical values

How to build a decision tree whose features are numerical?

[Figure: samples with a numerical feature taking values 17.0, 18.9, 20.4, 21.2, 24.0, 24.4, 24.5, 25.1, 25.5, 26.4, 27.7, 28.2, 29.9, 35.2]

DT: Continuous-Valued Feature

Accomplished by partitioning the continuous attribute value into a discrete set of intervals

In particular, a new Boolean feature Ac (A < c) can be created

How to select the best value for c?

Value: 17.0 18.9 20.4 21.2 24.0 24.4 24.5 25.1 25.5 26.4 27.7 28.2 29.9 35.2
Class: Yes  Yes  Yes  No   Yes  No   Yes  Yes  No   Yes  Yes  No   No   Yes

(candidate thresholds lie between some adjacent values)

DT: Continuous-Valued Feature

Objective is to minimize the entropy (or maximize the information gain)

Entropy only needs to be evaluated between points of different classes

Best c is the candidate threshold with the smallest H(X|Ac), i.e. the largest information gain.

[Figure: candidate thresholds c1 ... c8 placed between adjacent values whose classes differ]