
Machine Learning

Lecture 5

Supervised Learning

RBFNN, SVM, k-nn and DT

Dr. Patrick Chan ([email protected])

South China University of Technology, China


Agenda

Radial Basis Function Neural Network

Support Vector Machine

K-Nearest Neighbor

Decision Tree


Recall…

Multi-Layer Perceptron (MLP)

Black-Box

No idea what happens inside the process

Complex parameter space

High training complexity

Sub-optimal solution

[Figure: MLP with inputs x1, x2, weighted connections (w11, w21, w31, w'11, w'21), hidden layers, and outputs g1, g2]

RBF Neural Network

Radial Basis Function (RBF) Neural Network is a special type of ANN

Only three layers

Input Layer

Hidden Layer

Activation function: RBF (Gaussian)

Output Layer

Linearly weighted Sum

Parameters are meaningful

[Figure: RBFNN architecture with input, hidden, and output layers; weights w11, w12, w13, ..., w1m, ..., wc1, wc2, wc3, ..., wcm connect the hidden neurons to the outputs]

RBF Neural Network

Recall the Gaussian function

The most common distribution function

Univariate function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Two parameters:

μ : mean

σ² : variance


RBF Neural Network

Each neuron of an RBFNN represents a Gaussian distribution with center μ and width σ. Its output measures the effect of the neuron on an input x.

If the output is large, x and the neuron are similar in nature.

[Figure: Gaussians with different centers (μ) and different widths (σ); x is more related to the neuron whose Gaussian gives the larger response at x]

RBF Neural Network

If the weight is larger, the neuron is more important, as it has a larger effect on the output.

[Figure: an output built up by adding neurons one at a time, with weights w1 = 1, w2 = 1, w3 = 0.5]

RBF Neural Network

The output of the RBFNN for class k is:

$$g_k(x) = \sum_{j=1}^{m} w_{jk}\, \phi_j(x), \qquad \phi_j(x) = \exp\!\left(-\frac{\|x-\mu_j\|^2}{2\sigma_j^2}\right)$$

w : the weight (importance of the neuron)

m : the number of neurons

[Figure: RBFNN with hidden neurons φ1 ... φm connected to the outputs g1 ... gc by weights w11 ... wmc]

RBF Neural Network

[Figure: example with three Gaussian neurons weighted by w1 = 1.5, w2 = 0.3, w3 = 2, giving Blue / Red / Red decision regions]

RBF Neural Network

Parameter Determination

Parameters (μ, σ and w) are meaningful

They can be determined not only by gradient descent but also by clustering techniques

Objective is to seek the “natural” clusters in the data

Clustering will be discussed later in detail


RBF Neural Network: Parameter Determination

Gradient Descent

Weights (w)

Centers (μ)

Widths (σ)

All three sets of parameters can be updated by gradient descent on the training error of

$$g_k(x) = \sum_{j=1}^{m} w_{jk} \exp\!\left(-\frac{1}{2\sigma_j^2}\|x-\mu_j\|^2\right)$$

The learning rates of the different parameter groups can be different.

RBF Neural Network: Parameter Determination

Clustering Technique

Given a dataset

Separate the samples of each class into groups by a clustering technique, e.g. K-means. The number of clusters per class may need to be determined in advance.

Center (μ): mean of the samples belonging to the same cluster

Width (σ²): variance of the samples belonging to the same cluster

[Figure: scatter plot of a dataset with the clusters found for each class]
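A minimal sketch of this idea, assuming scikit-learn's KMeans is available; the per-class cluster count is an arbitrary choice for illustration, not a value from the lecture.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_params_by_clustering(X, y, clusters_per_class=2):
    """Estimate RBF centers and widths by clustering each class separately."""
    centers, widths = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0).fit(Xc)
        for j in range(clusters_per_class):
            members = Xc[km.labels_ == j]
            centers.append(km.cluster_centers_[j])                      # center = cluster mean
            widths.append(np.sqrt(members.var(axis=0).mean()) + 1e-8)   # width from cluster variance
    return np.array(centers), np.array(widths)
```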

RBF Neural Network: Parameter Determination

Clustering Technique

φ is fixed once μ and σ are determined

RBFNN: $g_k(x) = \sum_{j} w_{jk}\, \phi_j(x)$

Since g is linear in w,

the weights (w) can be calculated by the

pseudoinverse (refer to the previous lecture)
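A one-line sketch of this pseudoinverse step, assuming the hidden-layer activations have already been collected in a design matrix Phi (one row per training sample) and the targets in Y; the names are mine, not from the slides.

```python
import numpy as np

def solve_output_weights(Phi, Y):
    """Least-squares output weights: W = pinv(Phi) @ Y, so that Phi @ W ~= Y.

    Phi : (n, m) hidden-layer activations for the n training samples
    Y   : (n, c) target outputs (e.g. one-hot or +/-1 class codes)
    """
    return np.linalg.pinv(Phi) @ Y
```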

RBF Neural Network: Parameter Determination

2-Phase Learning

Initialization of gradient descent affects the performance significantly

2-Phase Learning combines clustering and gradient descent

1. A clustering technique initializes the values of μ, σ and w

2. Gradient descent fine-tunes these values

RBF Network

How to determine the number of neurons?

Fewer neurons

Low complexity

Low training cost

More neurons

Can represent more complicated distributions

Special case: maximum m = number of training samples (n)

Mean: each training sample is a center

Covariance: controls the influence of a sample on unseen samples

[Figure: the same sample set with a larger covariance vs. a smaller covariance]

RBF Neural Network

XOR Example

Input space: x = (x1, x2) ∈ {(0,0), (0,1), (1,0), (1,1)}

Output space: y ∈ {0, 1}

[Figure: the four XOR points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane, coloured by class]

RBF Neural Network

XOR Example

The feature space:

x        φ1(x)    φ2(x)
(1, 1)   1        0.1353
(0, 1)   0.3678   0.3678
(0, 0)   0.1353   1
(1, 0)   0.3678   0.3678

where

$$\phi_1(x) = e^{-\|x-\mu_1\|^2}, \quad \mu_1 = [1, 1]^T$$

$$\phi_2(x) = e^{-\|x-\mu_2\|^2}, \quad \mu_2 = [0, 0]^T$$

[Figure: left, the four XOR points in the input space (x1, x2); right, the same points mapped into the feature space (φ1, φ2), where they become linearly separable]
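A small sketch that reproduces this feature-space table; the centers μ1 = [1,1] and μ2 = [0,0] are taken from the slide.

```python
import numpy as np

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
mu1, mu2 = np.array([1, 1]), np.array([0, 0])

phi1 = np.exp(-np.sum((X - mu1) ** 2, axis=1))   # e^{-||x - mu1||^2}
phi2 = np.exp(-np.sum((X - mu2) ** 2, axis=1))   # e^{-||x - mu2||^2}

for x, p1, p2 in zip(X, phi1, phi2):
    print(x, round(p1, 4), round(p2, 4))
# (1,1) -> 1.0    0.1353
# (0,1) -> 0.3679 0.3679
# (0,0) -> 0.1353 1.0
# (1,0) -> 0.3679 0.3679
```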

RBF Neural Network

XOR Example

Weight Determination:

$$\begin{bmatrix} 1 & 0.1353 & 1 \\ 0.3678 & 0.3678 & 1 \\ 0.1353 & 1 & 1 \\ 0.3678 & 0.3678 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}$$

where the columns are φ1(x), φ2(x) and the bias term, and the right-hand side holds the class outputs of the samples. Solving with the pseudoinverse gives

$$\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 2.284 \\ 2.284 \\ -1.692 \end{bmatrix}$$

[Figure: the resulting decision boundary in the feature space (φ1, φ2)]

RBF Network: Characteristic

Advantage

RBF Network trains faster than MLP

The hidden layer is easier to interpret than MLP

Disadvantage

During the test phase, the calculation of a neuron in an RBF network is slower than in an MLP

Architecture is fixed


Support Vector Machine (SVM)

Which one is the best linear separator?


Support Vector Machine (SVM)

A clever sheep dog was herding his sheep… It runs between the sheep and tries to separate the black sheep from the white sheep.

[Figure: the sheep dog running between the white sheep and the black sheep]

Support Vector Machine (SVM)

The sheep dog keeps running… The sheep start to grow wool… The dog feels that the gap between the black sheep and the white sheep is getting narrower…

Support Vector Machine (SVM)

The wool grows bigger and bigger… Finally, only one path is left…

Support Vector Machine (SVM)

(Figure from: Learning with Kernels, Schölkopf & Smola)

The sheep find out that the single path relies only on some of the sheep. These sheep are the "sheep vectors", i.e. the "support vectors" in SVM.

[Figure: the largest-margin path, with the sheep vectors lying on its edges]

Support Vector Machine (SVM)

Similar to the sheep, if each sample keeps growing…

this should be the best classifier

Support Vector Machine (SVM)

A classifier with the largest margin

The concept is the same


Support Vector Machine (SVM)

The Maximum Margin Classifier depends ONLY on a few samples, called Support Vectors.

[Figure: the maximum-margin boundary with the support vectors highlighted]

Support Vector Machine

Statistical Learning Theory

Recall, a classifier is trained by minimizing the training error (Remp, empirical risk)

Our ultimate objective is to minimize the error on unseen samples (R, true risk)

Training error can be reduced easily by using a more complicated classifier

but such a classifier overfits easily too

A simpler classifier is preferred (assume empirical risk is the same)


Support Vector Machine

Statistical Learning Theory

SLT was proposed by Prof. Vladimir Vapnik

It relates R, Remp and the classifier complexity

With probability 1 − η, the bound on the true risk (R) is:

$$R \le R_{emp} + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}$$

n = number of training samples

h = VC (Vapnik-Chervonenkis) dimension

h measures the complexity of the classifier; the square-root term is the complexity (VC confidence) term.

Support Vector Machine

Statistical Learning Theory

With probability 1 − η, the bound on the true risk (R) is, as above:

$$\underbrace{R}_{\text{true error}} \le \underbrace{R_{emp}}_{\text{training error}} + \underbrace{\sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}}_{\text{complexity}}$$

Observation: when n → ∞, Remp → R

More samples, the smaller the complexity term

This bound is independent of the sample distribution

SVM: Statistical Learning Theory

VC Dimension

The VC dimension of a set of functions { f } is h if there exists a set of h points that can be shattered by { f }, but no set of h+1 points.

Shattered means that any labeling of the points can be classified correctly by a function from { f }

Larger h = more complicated classifier


SVM: SLT: VC Dimension

Linear Classifier Example

2 features

Linear Classifier is a straight line

h is the number of points

[Figure: sets of 1, 2 and 3 points in general position can all be shattered by a straight line (Yes! for h = 1, 2, 3)]

SVM: SLT: VC Dimension

Linear Classifier Example

The maximum value of h is 3

Therefore, the VC dimension of a linear classifier in 2D is 3

[Figure: a set of 4 points cannot be shattered by a straight line (No! for h = 4)]

Support Vector Machine (SVM)

SVM with a larger margin has a smaller VC dimension

Not only Remp but also the margin is optimized in SVM training

This implicitly optimizes R

How to find the boundary?

Separable case

Non-separable case

Slack variables

Kernel method


SVM: Linearly Separable

The hyperplane is:

wx + b = 0

Suppose the dotted lines (margin boundaries) are:

wx + b = ±ε

If we rescale w → λw, b → λb and ε → λε, the equations still describe the same lines.

We therefore choose ε = 1.

[Figure: the hyperplane wx + b = 0 with the two margin boundaries wx + b = 1 and wx + b = −1, and the margin width between them]

SVM: Linearly Separable

The distance from the origin to the hyperplane wx + b = 0 is |b| / ||w||

Similarly, the distances from the origin to the dotted lines wx + b = ±1 are |b − 1| / ||w|| and |b + 1| / ||w||

Therefore, the margin width is:

$$\frac{b+1}{\|w\|} - \frac{b-1}{\|w\|} = \frac{2}{\|w\|}$$

[Figure: the hyperplane wx + b = 0, the margin boundaries wx + b = ±1, and the margin width 2/||w||]

SVM: Linearly Separable

The problem can be formulated as a Quadratic Optimization Problem and solved for w and b:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad \text{subject to} \qquad y_i(w x_i + b) \ge 1, \quad i = 1, \dots, n$$

where the margin width is 2/||w||, so minimizing ||w|| maximizes the margin.

[Figure: the hyperplane y(x) = wx + b = 0 with the margin boundaries y(x) = 1 and y(x) = −1]

SVM: Linearly Separable

This optimization problem can be reformulated as its Dual Problem using the Lagrangian method:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \qquad \text{subject to} \qquad \alpha_i \ge 0 \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

The weight is then determined by:

$$w = \sum_{i=1}^{n}\alpha_i y_i x_i$$

SVM: Linearly Separable

Many αi are zero

xi with non-zero αi are support vectors (SVs)

The decision boundary is determined only by the SVs

[Figure: example with α1 = 0.6, α2 = 0.8, α3 = 1.4 for the support vectors; all other samples (non-SVs) have αi = 0]
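For intuition, a small scikit-learn sketch (assuming scikit-learn is installed; the toy data is made up) that fits a linear SVM and inspects which training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin

print("support vectors:\n", clf.support_vectors_)   # only these define the boundary
print("alpha_i * y_i:", clf.dual_coef_)             # non-zero only for support vectors
print("w:", clf.coef_, " b:", clf.intercept_)
```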

SVM: Linearly Separable

For testing with a new data point z:

$$f(z) = w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b$$

If f(z) > 0, class 1

If f(z) < 0, class 2

Note: w need not be formed explicitly

SVM: Non-Linearly Separable

How about the non-linearly separable case? The margin cannot be defined any more.

Two approaches:

Add slack variables

Use a kernel (non-linear SVM)

SVM: Non-Linearly Separable

Slack Variable

A slack variable (ξ) is added as a penalty to allow a sample to lie inside, or on the wrong side of, the margin.

Optimization:

$$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad \text{subject to} \qquad y_i(w x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$

C : trade-off parameter between error and margin

[Figure: samples with ξ > 1 are misclassified (error), samples with 0 < ξ < 1 are inside the margin but still correct; the ξ of all other samples is 0]
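A quick sketch of the role of C, using scikit-learn's soft-margin SVC on made-up overlapping data; a small C tolerates more margin violations, a large C penalizes them heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])  # overlapping classes
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, "
          f"train acc={clf.score(X, y):.2f}")
```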

SVM: Non-Linearly Separable

Slack Variable

The dual of this new constrained optimization problem is

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \qquad \text{subject to} \qquad 0 \le \alpha_i \le C \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

It is the same as the linearly separable case, except that there is an upper bound C on αi

w is calculated by $w = \sum_{i=1}^{n}\alpha_i y_i x_i$

[Figure: the support vectors (C ≥ αi > 0) are the samples on the margin and the samples inside or beyond the margin (ξ > 0); non-SVs have αi = 0]

SVM: Non-Linearly Separable

Kernel Method

Map samples to higher dimensional space

Input space: the original space of xi

Feature space: the space of φ(xi)

Linear operation in the feature space is equivalent to non-linear operation in input space

Classification can become easier with a proper transformation


SVM: Non-Linearly Separable

Kernel Method

A kernel is a function that implicitly maps the input into a high-dimensional feature space

A linear SVM is then constructed in the feature space

[Figure: non-linearly separable data in the input space becomes linearly separable in the feature space]

SVM: Non-Linearly Separable

Kernel Method: Trick

Recall the linearly separable case:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, \underbrace{x_i^T x_j}_{\text{inner product}} \qquad \text{subject to} \qquad \alpha_i \ge 0 \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

The data points appear only as inner products

As long as the inner product can be calculated, the mapping φ is not required explicitly

Define the kernel function K by $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$

SVM: Non-Linearly Separable

Kernel Method: Trick

Suppose φ(·) is given, e.g. $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^T$

An inner product in the feature space is then $\phi(x)^T\phi(z) = (x^T z)^2$

The kernel function is defined as $K(x, z) = \phi(x)^T\phi(z) = (x^T z)^2$

There is no need to define φ(·) explicitly

This is known as the kernel trick
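A tiny numeric check of this identity, using the quadratic mapping above (an illustrative choice rather than the specific mapping on the original slide):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)       # inner product computed in the feature space
kernel = (x @ z) ** 2            # kernel trick: (x^T z)^2, no phi needed
print(explicit, kernel)          # both 16.0
```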

SVM: Non-Linearly Separable

Kernel Method: Example

Polynomial kernel with degree d: $K(x, z) = (x^T z + 1)^d$

Radial basis function kernel with width σ: $K(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$

Closely related to radial basis function neural networks

The feature space is infinite-dimensional

Sigmoid kernel with parameters κ and θ: $K(x, z) = \tanh(\kappa\, x^T z + \theta)$

It does not satisfy the Mercer condition for all κ and θ

SVM: Non-Linearly Separable

Kernel Method: Example

Similar to the linearly separable case, but all inner products are changed to kernel functions:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j) \qquad \text{subject to} \qquad 0 \le \alpha_i \le C \ \text{ and } \ \sum_{i=1}^{n}\alpha_i y_i = 0$$

SVM: Non-Linearly Separable

Kernel Method: Example

For testing with a new data point z:

$$f(z) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, z) + b$$

If f(z) > 0, class 1

If f(z) < 0, class 2

Note: φ (and w) need not be formed explicitly

SVM: Example

Given:

5 one-dimensional training samples:

Polynomial kernel of degree 2 is used:

K(x, y) = (xy + 1)²

C is set to 100

x y

1 1

2 1

4 -1

5 -1

6 1


SVM: Example

By maximizing

$$\sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i x_j + 1)^2 \qquad \text{subject to} \qquad 0 \le \alpha_i \le 100 \ \text{ and } \ \sum_{i=1}^{5}\alpha_i y_i = 0$$

Result:

α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833

The support vectors are {x2 = 2, x4 = 5, x5 = 6}

The discriminant function is

$$f(z) = 2.5\,(1)(2z+1)^2 + 7.333\,(-1)(5z+1)^2 + 4.833\,(1)(6z+1)^2 + b = 0.6667\,z^2 - 5.333\,z + b$$

SVM: Example

b can be determined by using the points on the margin,

f (2) = 1, f (6) = 1 and f (5) = -1

Therefore, b = 9, and $f(z) = 0.6667\,z^2 - 5.333\,z + 9$
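A quick numeric check of this worked example, using the α values and kernel given above:

```python
import numpy as np

alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])
x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
b = 9.0

def f(z):
    # f(z) = sum_i alpha_i y_i K(x_i, z) + b, with K(x, z) = (x z + 1)^2
    return np.sum(alpha * y * (x * z + 1) ** 2) + b

for z in x:
    print(z, round(f(z), 3), "-> class", 1 if f(z) > 0 else 2)
# the sign of f matches the labels: +, +, -, -, +
```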

SVM: Example

[Figure: the five training points 1, 2, 4, 5, 6 on the real line; f(z) > 0 (class 1) on the outer regions and f(z) < 0 (class 2) in between; the support vectors are x = 2, 5, 6]

SVM: Multi-Class Problem

SVM can only handle 2-class problems

How to handle a multi-class problem?

g in an LDF can be formulated as an estimate of the posterior probability of a class

However, an SVM always considers two classes

It does not estimate the probability of a class

so the max method cannot be applied to SVM

SVM: Multi-Class Problem

How to handle a multi-class problem?

1-against-All

1-against-1

Discussed in the previous lecture

[Figure: 1-against-All and 1-against-1 decompositions of a 4-class problem (y1, y2, y3, y4)]

SVM: Characteristic

Advantages: Training is relatively easy

No local optima

It scales relatively well to high dimensional data (inner product)

Tradeoff between classifier complexity and error can be controlled explicitly

Disadvantages: Slow when the number of samples is large

Need to choose a “good” kernel function


K-Nearest Neighbor (K-NN)

A new pattern is classified by a majority vote of its k nearest neighbors (training samples)

n distances are calculated for each new sample (n: the number of training samples)

[Figure: an unseen sample with its k = 1, k = 3 and k = 5 nearest neighbors; the vote outcome (Triangle or Circle) depends on k]
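A minimal sketch of the vote itself, in plain NumPy with Euclidean distance; the toy data is made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # n distances per query
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(["circle", "circle", "circle", "triangle", "triangle", "triangle"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3))  # -> "triangle"
```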

K-Nearest Neighbor (K-NN)

Target function for the entire space may be described as a combination of less complex local approximations


K-Nearest Neighbor (K-NN)

How to determine k?

Small k

Sensitive to noise

Large k

Neighbours may be too far away from the unseen sample

Less representative

[Figure: the same unseen sample is classified as Circle or Triangle depending on whether k = 1, 3, 5 or 19]

K-NN: Characteristic

Advantages: Very simple

No training is needed

All computations deferred until classification

Disadvantages: Difficult to determine k

Affected by noisy training data

Classification is time consuming

Need to calculate the distance between the unseen sample and each training sample


Decision Tree (DT)

Most classifiers are black-box

A DT provides an explanation of its decisions

One of the most widely used and practical methods for inductive inference

Approximates discrete-valued functions (including disjunctions)


DT: Example

Do we go to play tennis today?

If Outlook is Sunny AND Humidity is Normal → Yes

If Outlook is Overcast → Yes

If Outlook is Rain AND Wind is Weak → Yes

Other situations → No

DT: Classification

Internal nodes can be univariate (only one feature is used per node)

[Figure: a tree that first tests x1 > a and then x2 > b, and the corresponding axis-aligned decision regions in the (x1, x2) plane]

DT: Classification

Internal nodes can also be multivariate: more than one feature is used per node, and the shape of the decision region can be irregular.

[Figure: a node testing a·x1 + b·x2 + c > 0 and the resulting oblique decision boundary in the (x1, x2) plane]

DT: Learning Algorithm

LOOP:

1. Select the best feature (A)

2. For each value of A, create a new descendant of the node

3. Sort the training samples to the leaf nodes

STOP when the training samples are perfectly classified

#   x1  x2  x3
1   1   3   5
2   1   4   2
3   3   1   5
4   3   5   6
5   3   3   4
6   4   5   7

[Figure: a tree that first tests x1 > 2 and then x2 > 2, with STOP leaves]

DT: Learning Algorithm

Observation

Many different trees may encode a training set without any error

Finding the smallest tree is an NP-hard problem

A local search algorithm is used to find reasonable solutions

What is the best feature?


DT: Feature Measurement

Entropy is used to evaluate features

Measure of uncertainty

Range: 0 – 1 (for a binary variable)

Smaller value, less uncertainty

If all samples belong to xi, then p(xi) = 1 and p(xj) = 0 for j ≠ i

Thus, H(X) = 0 (no uncertainty)

$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

X: a random variable with n outcomes, X = {xi | i = 1, 2, …, n}

where p(x) is the probability mass function of outcome x


DT: Feature Measurement

Information Gain

Reduction in entropy (uncertainty) due to sorting on a feature A:

$$Gain(X, A) = \underbrace{H(X)}_{\text{current entropy}} - \underbrace{H(X\,|\,A)}_{\text{entropy after using feature } A}$$
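A small sketch of these two quantities in plain Python; the helper names entropy and information_gain are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p(x_i) log2 p(x_i) over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Gain(X, A) = H(X) - H(X|A), where A splits the samples by its values."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [y for a, y in zip(feature_values, labels) if a == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - h_cond
```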

DT: Example

Which feature is the best?

Recall:

$$Gain(X, A) = H(X) - H(X|A), \qquad H(X) = -\sum_{i} p(x_i)\log_2 p(x_i)$$

Current: the uncertainty is high without any sorting by a feature. With 9 "yes" and 5 "no" samples:

$$H(X) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.941$$

DT: Example

Outlook splits the samples into three branches:

Branch     No  Yes
Sunny      3   2
Rain       2   3
Overcast   0   4

Let A = Outlook

$$H(X|A) = H(X|sunny)P(sunny) + H(X|rain)P(rain) + H(X|overcast)P(overcast)$$

Recall: $Gain(X, A) = H(X) - H(X|A)$

DT: Example

Let A = Outlook

Branch     No  Yes
Sunny      3   2
Rain       2   3
Overcast   0   4

$$H(X|sunny) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$$

$$H(X|rain) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$$

$$H(X|overcast) = -\tfrac{0}{4}\log_2\tfrac{0}{4} - \tfrac{4}{4}\log_2\tfrac{4}{4} = 0$$

$$H(X|A) = \tfrac{5}{14}(0.971) + \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) = 0.694$$

DT: Example

Current: H(X) = 0.941

H(X|Outlook) = 0.694

Similarly, for each feature:

H(X|Temperature) = 0.911

H(X|Humidity) = 0.789

H(X|Wind) = 0.892

Recall: Gain(X, A) = H(X) − H(X|A)

The information gains are:

Gain(X, Outlook) = 0.247

Gain(X, Temperature) = 0.030

Gain(X, Humidity) = 0.152

Gain(X, Wind) = 0.049

Outlook is the best feature and should be used as the first node.
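The Outlook numbers can be checked directly with a few lines of Python (the entropy helper from the earlier sketch is repeated here so the snippet is self-contained; the branch counts are those from the slide):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Branch counts: Sunny 3 No / 2 Yes, Rain 2 No / 3 Yes, Overcast 0 No / 4 Yes
play_all   = ["no"] * 5 + ["yes"] * 9
h_x        = entropy(play_all)                                   # 0.940
h_sunny    = entropy(["no"] * 3 + ["yes"] * 2)                   # 0.971
h_rain     = entropy(["no"] * 2 + ["yes"] * 3)                   # 0.971
h_overcast = entropy(["yes"] * 4)                                # 0.0
h_x_given_outlook = (5/14) * h_sunny + (5/14) * h_rain + (4/14) * h_overcast  # 0.694
print(round(h_x - h_x_given_outlook, 3))                         # Gain(X, Outlook) = 0.247
```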

DT: Example

Next Step

Repeat the steps for each sub-branch

Until there is no ambiguity (all samples are of the same class)

[Figure: the Outlook node with branches Sunny (3 No / 2 Yes: continue selecting the next feature), Overcast (0 No / 4 Yes: done), and Rain (2 No / 3 Yes: continue selecting the next feature)]

DT: Continuous-Valued Feature

So far, we have handled features with categorical values

How to build a decision tree whose features are numerical?

[Figure: samples with a numerical feature taking values 17.0, 18.9, 20.4, 21.2, 24.0, 24.4, 24.5, 25.1, 25.5, 26.4, 27.7, 28.2, 29.9, 35.2]

DT: Continuous-Valued Feature

Accomplished by partitioning the continuous attribute value into a discrete set of intervals

In particular, a new Boolean feature Ac (A < c) can be created

How to select the best value for c?

Value: 17.0 18.9 20.4 21.2 24.0 24.4 24.5 25.1 25.5 26.4 27.7 28.2 29.9 35.2
Class: Yes  Yes  Yes  No   Yes  No   Yes  Yes  No   Yes  Yes  No   No   Yes

(candidate thresholds lie between some adjacent values)

DT: Continuous-Valued Feature

Objective is to minimize the entropy (or maximize the information gain)

Entropy only needs to be evaluated between points of different classes

Best c is the candidate threshold with the smallest H(X|Ac), i.e. the largest information gain.

[Figure: candidate thresholds c1 ... c8 placed between adjacent values whose classes differ]