Machine Learning
Lecture 5
Supervised Learning
RBFNN, SVM, k-nn and DT
Dr. Patrick Chan (patrickchan@ieee.org)
South China University of Technology, China
Agenda
Radial Basis Function Neural Network
Support Vector Machine
K-Nearest Neighbor
Decision Tree
Recall…
Multi-Layer Perceptron
Black box
No idea what happens inside the process
Complex parameter space
High training complexity
Sub-optimal solution
[Figure: MLP with an input layer (x1, x2), hidden layers and an output layer (g1, g2), connected by weights]
RBF Neural Network
Radial Basis Function (RBF) Neural Network is a special type of ANN
Only three layers:
Input Layer
Hidden Layer
Activation function: RBF (Gaussian)
Output Layer
Linear weighted sum
Parameters are meaningful
[Figure: RBFNN with an input layer, a hidden layer of RBF neurons and an output layer; weights w11 ... wcm connect the hidden neurons to the outputs]
RBF Neural Network
Recall, Gaussian function
The most common distribution function
Univariate function:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
Two parameters:
μ: mean
σ²: variance
RBF Neural Network
A neuron φ of an RBFNN represents a Gaussian distribution with center μ and width σ; its output measures the effect of the neuron on x
If the output is large, x and the neuron have a more similar nature
[Figure: Gaussians with different centers (μ1, μ2) and different widths (σ1, σ2); depending on where x falls, it is more related to φ1 or to φ2]
RBF Neural Network
If the weight is larger, the neuron is more important, as it has a larger effect on the output
[Figure: outputs built from one, two and three weighted Gaussian neurons with w1 = 1, w2 = 1, w3 = 0.5]
RBF Neural Network
The output of the RBFNN for class k is:
$g_k(\mathbf{x}) = \sum_{i=1}^{m} w_{ik}\,\varphi_i(\mathbf{x})$
w: the weight (importance of the neuron)
m: the number of neurons
[Figure: RBFNN with input, hidden and output layers; hidden neurons φ1 ... φm connect to the outputs g1 ... gc through weights w11 ... wmc]
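To make this weighted sum concrete, here is a minimal NumPy sketch of the RBFNN forward pass; the function name, array shapes and the toy values are assumptions for illustration, not part of the lecture.

import numpy as np

def rbf_forward(x, centers, widths, W):
    """Forward pass of an RBFNN.
    x: (d,) input; centers: (m, d) Gaussian centers mu_i;
    widths: (m,) Gaussian widths sigma_i; W: (m, c) hidden-to-output weights."""
    # Hidden layer: Gaussian activation of each neuron
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * widths ** 2))
    # Output layer: linear weighted sum g_k(x) = sum_i w_ik * phi_i(x)
    return phi @ W

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([0.5, 0.5])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
print(rbf_forward(np.array([0.2, 0.1]), centers, widths, W))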
RBF Neural Network
[Figure: a two-class example in which three Gaussian neurons with w1 = 1.5, w2 = 0.3 and w3 = 2 produce regions labelled Blue, Red and Red]
RBF Neural Network
Parameter Determination
Parameters (μ, σ and w) are meaningful
They can be determined not only by gradient descent but also by a clustering technique
Objective is to seek the "natural" clusters in the data
Clustering will be discussed later in detail
RBF Neural Network: Parameter Determination
Gradient Descent
Weights (w)
Centers (μ)
Widths (σ)
$g_k(\mathbf{x}) = \sum_{i=1}^{m} w_{ik}\exp\left(-\frac{1}{2\sigma_i^2}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2\right)$
All three sets of parameters are updated by gradient descent on the training error
Note: the learning rate of each type of parameter can be different
RBF Neural Network: Parameter Determination
Clustering Technique
Given a dataset
Separate the samples of each class into groups by a clustering technique, e.g. k-means
The number of clusters for each class may need to be determined in advance
Center: mean of the samples belonging to the same cluster
Width: variance of the samples belonging to the same cluster
(a code sketch follows below)
[Figure: clusters found in the samples of one class]
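A minimal sketch of this clustering-based initialization, assuming scikit-learn's KMeans is available; the function name and the per-class toy data are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

def rbf_params_by_clustering(X_class, n_clusters):
    """Estimate RBF centers and widths for one class via k-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_class)
    centers = km.cluster_centers_                        # mean of each cluster
    widths = np.array([X_class[km.labels_ == j].var()    # variance of each cluster
                       for j in range(n_clusters)])
    return centers, widths

X_class = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(rbf_params_by_clustering(X_class, n_clusters=2))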
RBF Neural Network: Parameter Determination
Clustering Technique
φ is fixed once μ and σ are determined
RBFNN: $g_k(\mathbf{x}) = \sum_{i=1}^{m} w_{ik}\,\varphi_i(\mathbf{x})$
As g and w have a linear relation, the weight (w) can be calculated by the
pseudoinverse (refer to the previous lecture)
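As a sketch of the pseudoinverse step, the matrices below are small made-up placeholders: Phi holds the hidden activations of the training samples and T the target outputs.

import numpy as np

# Phi: (n samples x m neurons) hidden activations; T: (n x c) targets
Phi = np.array([[1.0, 0.14], [0.37, 0.37], [0.14, 1.0]])
T = np.array([[1.0], [0.0], [1.0]])

# Least-squares weights via the Moore-Penrose pseudoinverse
W = np.linalg.pinv(Phi) @ T
print(W)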
RBF Neural Network: Parameter Determination
2-Phase Learning
Initialization of gradient descent affects the performance significantly
2-Phase Learning combines clustering and gradient descent
1. A clustering technique initializes the values of μ, σ and w
2. Gradient descent fine-tunes the values
RBF Network
How to determine the number of neurons?
Fewer neurons
Low complexity
Low training cost
More neurons
Represents more complicated distribution
Special case: maximum m = number of training samples (n)
Mean: a training sample
Covariance: controls the influence of a sample on unseen samples
[Figure: a larger covariance spreads a sample's influence more widely than a smaller covariance]
RBF Neural Network
XOR Example
Input space: the four points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane
Output space: y ∈ {0, 1}
[Figure: the four input points plotted in the (x1, x2) plane]
RBF Neural Network
XOR Example
The feature space:
x | φ1 | φ2
(1, 1) | 1 | 0.1353
(0, 1) | 0.3678 | 0.3678
(0, 0) | 0.1353 | 1
(1, 0) | 0.3678 | 0.3678
where $\varphi_1(\mathbf{x}) = e^{-\|\mathbf{x}-\boldsymbol{\mu}_1\|^2}$ with $\boldsymbol{\mu}_1 = [1, 1]^T$, and $\varphi_2(\mathbf{x}) = e^{-\|\mathbf{x}-\boldsymbol{\mu}_2\|^2}$ with $\boldsymbol{\mu}_2 = [0, 0]^T$
[Figure: the four points in the input space (x1, x2) and in the feature space (φ1, φ2), where they become linearly separable]
RBF Neural Network
XOR Example
Weight Determination
With a bias weight w3, writing one row per training pattern gives
$\begin{bmatrix} 1 & 0.1353 & 1 \\ 0.3678 & 0.3678 & 1 \\ 0.1353 & 1 & 1 \\ 0.3678 & 0.3678 & 1 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}$
Solving with the pseudoinverse gives $w_1 = 2.284$, $w_2 = 2.284$, $w_3 = -1.692$
[Figure: the resulting decision boundary in the feature space (φ1, φ2), together with the class outputs of the samples]
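A short NumPy sketch of this weight determination; the 1/0 target column follows the table above, and since the exact setup of the slide is not fully recoverable, the computed weights may differ somewhat from the quoted values.

import numpy as np

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)
mu1, mu2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])

# Hidden activations phi_i(x) = exp(-||x - mu_i||^2), plus a bias column
Phi = np.column_stack([
    np.exp(-np.sum((X - mu1) ** 2, axis=1)),
    np.exp(-np.sum((X - mu2) ** 2, axis=1)),
    np.ones(len(X)),
])
d = np.array([1.0, 0.0, 1.0, 0.0])     # assumed targets, as in the column above

w = np.linalg.pinv(Phi) @ d            # least-squares weights [w1, w2, w3]
print(Phi.round(4))
print(w.round(3))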
RBF Network: Characteristic
Advantage
RBF Network trains faster than MLP
The hidden layer is easier to interpret than MLP
Disadvantage
During the test phase, the computation of a neuron in an RBF network is slower than in an MLP
Architecture is fixed
Support Vector Machine (SVM)
Which one is the best linear separator?
Support Vector Machine (SVM)
A clever sheep dog was herding his sheep… It runs between the sheep and tries to separate the black sheep and the white sheep.
[Figure: the sheep dog between the white sheep and the black sheep]
Support Vector Machine (SVM)
The sheep dog keeps running… The sheep start to grow wool… The dog feels that the gap between the black sheep and the white sheep is getting narrower…
Support Vector Machine (SVM)
The wool grows bigger and bigger… Finally, only one path is left…
Support Vector Machine (SVM)
From: Learning with Kernels, Schölkopf & Smola
The sheep found out that the single path relies only on some of the sheep. These sheep are "sheep vectors" (* "support vectors" in SVM).
[Figure: the largest-margin path, determined by the sheep vectors]
Support Vector Machine (SVM)
Similar to the sheep, if each sample is allowed to grow…
This should be the best classifier
Support Vector Machine (SVM)
A classifier with the largest margin
The concept is the same
Support Vector Machine (SVM)
The Maximum Margin Classifier depends ONLY on a few samples, called Support Vectors
[Figure: the maximum-margin boundary and its support vectors]
Support Vector Machine
Statistical Learning Theory
Recall, a classifier is trained by minimizing the training error (Remp, empirical risk)
Our ultimate objective is to minimize the error on unseen sample (R, true risk)
Training error can be reduced easily by using a more complicated classifier
Overfit easily too
A simpler classifier is preferred (assume empirical risk is the same)
Support Vector Machine
Statistical Learning Theory
SLT was proposed by Prof. Vladimir Vapnik
It describes the relation between R, Remp and the classifier complexity
With probability 1 - η, the bound on the true risk (R) is:
$R \le R_{emp} + \sqrt{\frac{h\left(\ln\frac{2n}{h}+1\right) - \ln\frac{\eta}{4}}{n}}$
n: number of training samples
h: VC (Vapnik-Chervonenkis) dimension, a measure of the complexity of the classifier
Support Vector Machine
Statistical Learning Theory
With 1-η probability, the bound on true risk (R) is:
Observation: when n → ∞, Remp → R
With more samples, the complexity term becomes smaller
This bound is independent of the sample distribution
[Figure: the bound on the true error as the training error plus a complexity term]
SVM: Statistical Learning Theory
VC Dimension
The VC dimension of a set of functions { f } is h if there exists a set of h points that can be shattered by { f }, but no set of h+1 points can be
Shattered means that any labeling of the points can be classified correctly by a function from { f }
Larger h = more complicated classifier
SVM: SLT: VC Dimension
Linear Classifier Example
2 features
Linear Classifier is a straight line
h is the number of points
[Figure: sets of h = 1, 2 and 3 points in the plane can all be shattered by a straight line (Yes!)]
SVM: SLT: VC Dimension
Linear Classifier Example
The maximum value of h is 3
Therefore, the VC dimension of a linear classifier (in 2-D) is 3
[Figure: a set of 4 points cannot always be shattered by a straight line (h = 4: No!)]
Support Vector Machine (SVM)
SVM with a larger margin has a smaller VC dimension
Not only Remp but also the margin is optimized in SVM training
It implicitly optimizes R
How to find the boundary?
Separable case
Non-separable case
Slack variables
Kernel method
SVM: Linearly Separable
The hyperplane is:
wx + b = 0
Suppose the dotted lines are: wx + b = ±δ
If we rescale w → λw, b → λb and δ → λδ, the equations are still valid
We choose δ = 1
[Figure: the separating hyperplane wx + b = 0 with margin lines wx + b = 1 and wx + b = -1; the margin width is the distance between them]
SVM: Linearly Separable
The distance from the origin to the hyperplane is |b| / ||w||
Similarly, the distances from the origin to the dotted lines are (b - 1) / ||w|| and (b + 1) / ||w||
Therefore, the margin is:
$\frac{b+1}{\|\mathbf{w}\|} - \frac{b-1}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$
SVM: Linearly Separable
The problem can be formulated as a Quadratic Optimization Problem and solved for w and b:
minimize over w, b:   $\frac{1}{2}\|\mathbf{w}\|^2$   (margin width = $\frac{2}{\|\mathbf{w}\|}$)
subject to   $y(\mathbf{x}_i) \ge 1$ for class 1 samples and $y(\mathbf{x}_i) \le -1$ for class 2 samples
where $y(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$
SVM: Linearly Separable
This optimization problem can be formulated as a Dual Problem using the Lagrangian method:
maximize over α:   $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j$
subject to   $\alpha_i \ge 0$   and   $\sum_{i=1}^{n}\alpha_i y_i = 0$
The weight is then determined by:   $\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i\,\mathbf{x}_i$
SVM: Linearly Separable
Many αi are zero
xi with a non-zero αi are support vectors (SVs); the decision boundary is determined only by the SVs
[Figure: example with α1 = 0.6, α2 = 0.8, α3 = 1.4 (support vectors) and all other αi = 0 (non-SVs)]
SVM: Linearly Separable
For testing with a new data point z:
$f(\mathbf{z}) = \mathbf{w}\cdot\mathbf{z} + b = \sum_{i=1}^{n}\alpha_i y_i\,\mathbf{x}_i^T\mathbf{z} + b$
If f(z) > 0, class 1
If f(z) < 0, class 2
Note: w need not be formed explicitly
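For illustration only (not part of the lecture), here is a scikit-learn sketch of training a linear SVM and classifying a new point by the sign of f(z); the toy data and parameter values are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 5], [5, 3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C approximates a hard margin
print(clf.support_vectors_)                   # the support vectors
z = np.array([[3.0, 2.0]])
print(clf.decision_function(z))               # f(z); its sign gives the class
print(clf.predict(z))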
SVM: Non-Linearly Separable
How about the non-linearly separable case? The margin cannot be defined anymore
Two approaches:
Add slack variables
Use a kernel (Non-Linear SVM)
SVM: Non-Linearly Separable
Slack Variable
A slack variable (ξ) is added as a penalty to allow a sample to lie inside the margin or on the wrong side of it
Optimization:
minimize over w, b:   $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i$
subject to   $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i$   and   $\xi_i \ge 0$,   for $i = 1,\dots,n$
C: tradeoff parameter between error and margin
[Figure: samples with ξ > 1 are errors, samples with 0 < ξ < 1 are correct but inside the margin; ξ of all other samples is 0]
SVM: Non-Linearly Separable
Slack Variable
The dual of this new constrained optimization problem is
It is the same as the linearly separable case, except that there is an upper bound C on αi:
maximize over α:   $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^T\mathbf{x}_j$
subject to   $C \ge \alpha_i \ge 0$   and   $\sum_{i=1}^{n}\alpha_i y_i = 0$
w is calculated by   $\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i\,\mathbf{x}_i$
Support vectors (0 < αi ≤ C) are the samples on the margin or inside / beyond it (ξ > 0); non-SVs have αi = 0
SVM: Non-Linearly Separable
Kernel Method
Map samples to higher dimensional space
Input space: the original space of xi
Feature space: the space of the mapped samples φ(xi)
Linear operation in the feature space is equivalent to non-linear operation in input space
Classification can become easier with a proper transformation
SVM: Non-Linearly Separable
Kernel Method
Kernel is a function which maps input into high dimensional feature space
Construct linear SVM in feature space
[Figure: samples mapped from the input space to the feature space, where they become linearly separable]
SVM: Non-Linearly Separable
Kernel Method: Trick
Recall, the linearly separable case:
The data points only appear as inner products
As long as the inner product can be calculated, the mapping need not be performed explicitly
Define the kernel function K by
$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$
so that the dual becomes
maximize over α:   $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,K(\mathbf{x}_i,\mathbf{x}_j)$
subject to   $\alpha_i \ge 0$,   $i = 1,\dots,n$,   and   $\sum_{i=1}^{n}\alpha_i y_i = 0$
SVM: Non-Linearly Separable
Kernel Method: Trick
Suppose φ(·) is given as follows:
$\phi(\mathbf{x}) = [\,x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\,]^T$
An inner product in the feature space is
$\phi(\mathbf{x})^T\phi(\mathbf{y}) = x_1^2 y_1^2 + 2x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (\mathbf{x}^T\mathbf{y})^2$
so the kernel function is defined as $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2$
There is no need to define φ(·) explicitly
This is known as the kernel trick
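A quick numeric check of this identity; the explicit mapping below is the standard degree-2 example and the test vectors are made up.

import numpy as np

def phi(v):
    # Explicit degree-2 feature map [x1^2, sqrt(2)*x1*x2, x2^2]
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))    # inner product in the feature space
print((x @ y) ** 2)       # kernel K(x, y) = (x.y)^2, same value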
SVM: Non-Linearly Separable
Kernel Method: Example
Polynomial kernel with degree d:   $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^d$
Radial basis function kernel with width σ:   $K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right)$
Closely related to radial basis function neural networks
The feature space is infinite-dimensional
Sigmoid kernel with parameters κ and θ:   $K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\,\mathbf{x}^T\mathbf{y} + \theta)$
It does not satisfy the Mercer condition for all κ and θ
SVM: Non-Linearly Separable
Kernel Method: Example
Similar to the linearly separable case but change all inner products to kernel functions
minimize over w, b:   $\frac{1}{2}\|\mathbf{w}\|^2$   subject to   $y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \ge 1$
maximize over α:   $\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,K(\mathbf{x}_i,\mathbf{x}_j)$
subject to   $\alpha_i \ge 0$   and   $\sum_{i=1}^{n}\alpha_i y_i = 0$
SVM: Non-Linearly Separable
Kernel Method: Example
For testing with a new data point z:
$f(\mathbf{z}) = \sum_{i=1}^{n}\alpha_i y_i\,K(\mathbf{x}_i, \mathbf{z}) + b$
If f(z) > 0, class 1
If f(z) < 0, class 2
Note: φ(·) need not be formed explicitly
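Again as an aside from the slides, a scikit-learn sketch of a kernel SVM on a toy problem with a circular class boundary; the kernel choice and parameter values are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)     # circular class boundary

clf = SVC(kernel='rbf', gamma=1.0, C=10.0).fit(X, y)
print(clf.n_support_)                                 # support vectors per class
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))          # inside vs. outside the circle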
SVM: Example
Given:
5 one-dimensional training samples:
Polynomial kernel of degree 2 is used: $K(x, y) = (xy + 1)^2$
C is set to 100

x | y
1 | 1
2 | 1
4 | -1
5 | -1
6 | 1
SVM: Example
By maximizing
$\sum_{i=1}^{5}\alpha_i - \frac{1}{2}\sum_{i=1}^{5}\sum_{j=1}^{5}\alpha_i\alpha_j y_i y_j (x_i x_j + 1)^2$   subject to   $100 \ge \alpha_i \ge 0$   and   $\sum_{i=1}^{5}\alpha_i y_i = 0$
the result is:
α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833
The support vectors are {x2 = 2, x4 = 5, x5 = 6}
The discriminant function is
$f(z) = \sum_{i \in SV}\alpha_i y_i\,(x_i z + 1)^2 + b = 0.6667\,z^2 - 5.333\,z + b$
SVM: Example
b can be determined by using the points on the margin,
f (2) = 1, f (6) = 1 and f (5) = -1
Therefore, b = 9
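A small NumPy check of this example, using the α values quoted above; it reproduces b ≈ 9 and the margin conditions up to rounding of the α's.

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
alpha = np.array([0.0, 2.5, 0.0, 7.333, 4.833])

def f(z, b=0.0):
    # Discriminant with the polynomial kernel K(x, z) = (x*z + 1)^2
    return np.sum(alpha * y * (x * z + 1) ** 2) + b

b = 1 - f(2.0)                                   # enforce f(2) = 1 on a margin point
print(round(b, 2))                               # approximately 9
print(round(f(6.0, b), 2), round(f(5.0, b), 2))  # approximately 1 and -1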
SVM: Example
[Figure: the five training points 1, 2, 4, 5, 6 on a line; f(z) > 0 (class 1) in the outer regions, f(z) < 0 (class 2) in between; the support vectors are 2, 5 and 6]
SVM: Multi-Class Problem
SVM can only handle 2-class problems
How to handle a multi-class problem?
g in an LDF can be formulated as an estimate of the posterior probability of a class
However, an SVM always considers two classes and does not estimate the probability of a class
The max method cannot be applied to SVM
SVM: Multi-Class Problem
How to handle multi-class problem?
1-against-All
1-against-1
Discussed in previous lecture
[Figure: decision functions y1 ... y4 for the 1-against-all and the 1-against-1 schemes on a four-class problem]
SVM: Characteristic
Advantages: Training is relatively easy
No local optima
It scales relatively well to high-dimensional data (inner product)
The tradeoff between classifier complexity and error can be controlled explicitly
Disadvantages: Slow when the number of samples is large
Need to choose a "good" kernel function
K-Nearest Neighbor (K-NN)
A new pattern is classified by a majority vote of its k nearest neighbors (training
samples)
n distances are
calculated for each new sample
n: the number of training samples
[Figure: an unseen sample and its nearest neighbors; the predicted class (Triangle or Circle) changes for k = 1, 3 and 5]
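A minimal NumPy sketch of this majority-vote rule; Euclidean distance is assumed, and the data, labels and function name are made up.

import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=3):
    """Classify x_new by a majority vote of its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # n distances, one per training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]], dtype=float)
y_train = np.array(['Triangle', 'Triangle', 'Triangle', 'Circle', 'Circle', 'Circle'])
print(knn_predict(np.array([2.0, 2.0]), X_train, y_train, k=3))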
K-Nearest Neighbor (K-NN)
Target function for the entire space may be described as a combination of less complex local approximations
K-Nearest Neighbor (K-NN)
How to determine k?
Small k
Noise Sensitive
Large k
Neighbours may be too far away from the unseen sample
Less representative
[Figure: the same unseen sample classified with k = 1, 3, 5 and 19; the predicted class flips between Triangle and Circle as k changes]
K-NN: Characteristic
Advantages: Very simple
No training is needed
All computations deferred until classification
Disadvantages: Difficult to determine k
Affected by noisy training data
Classification is time consuming
Need to calculate the distance between the unseen sample and each training sample
Decision Tree (DT)
Most classifiers are black-box
DT provides explanation on decisions
One of the most widely used and practical methods for inductive inference
Approximates discrete-valued functions (including disjunctions)
DT: Example
Do we go to play tennis today?
If Outlook is Sunny AND Humidity is Normal, then Yes
If Outlook is Overcast, then Yes
If Outlook is Rain AND Wind is Weak, then Yes
Any other situation? No
DT: Classification
Internal nodes can be univariate (only one feature is used per node)
[Figure: a tree that first tests x1 > a and then x2 > b; the decision regions are axis-parallel rectangles in the (x1, x2) plane]
DT: Classification
Internal nodes can be multivariate: more than one feature is used, and the shape of the decision region is irregular
[Figure: a node testing a·x1 + b·x2 + c > 0 splits the (x1, x2) plane with an oblique boundary]
DT: Learning Algorithm
LOOP:
1. Select the best feature (A)
2. For each value of A, create a new descendant of the node
3. Sort the training samples to the leaf nodes
STOP when the training samples are perfectly classified

# | x1 | x2 | x3
1 | 1 | 3 | 5
2 | 1 | 4 | 2
3 | 3 | 1 | 5
4 | 3 | 5 | 6
5 | 3 | 3 | 4
6 | 4 | 5 | 7

[Figure: the resulting tree first tests x1 > 2 and then x2 > 2; each branch ends in a STOP leaf]
DT: Learning Algorithm
Observation
Many trees may code a training set without any error
Finding the smallest tree is an NP-hard problem
A local search algorithm is used to find reasonable solutions
What is the best feature?
DT: Feature Measurement
Entropy is used to evaluate features
Measure of uncertainty
Range: 0 - 1
Smaller value, less uncertainty
If all samples belong to outcome xi, then p(xi) = 1 and p(xj) = 0 for all j ≠ i
Thus, H(X) = 0 (no uncertainty)
$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$
where
X: a random variable with n outcomes, X = {xi | i = 1, 2, …, n}
p(x): the probability mass function of outcome x
DT: Feature Measurement
Information Gain
Reduction in entropy (reduced uncertainty) due to sorting on a feature A:
$Gain(X, A) = H(X) - H(X|A)$
(the current entropy minus the entropy after using feature A)
DT: Example
Which feature is the best?
With the two outcomes x1 = yes and x2 = no:
Current: $H(X) = -p(\text{yes})\log_2 p(\text{yes}) - p(\text{no})\log_2 p(\text{no})$
Uncertainty is high without any sorting by a feature
[Figure: the training samples (yes / no) before any split]
DT: Example
Let A = Outlook
Outlook = Sunny: No: 3, Yes: 2
Outlook = Rain: No: 2, Yes: 3
Outlook = Overcast: No: 0, Yes: 4
Recall: $Gain(X, A) = H(X) - H(X|A)$
$H(X|A) = H(X|A{=}\text{sunny})P(A{=}\text{sunny}) + H(X|A{=}\text{rain})P(A{=}\text{rain}) + H(X|A{=}\text{overcast})P(A{=}\text{overcast})$
DT: Example
Let A = Outlook
$H(X|\text{sunny}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$
$H(X|\text{rain}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$
$H(X|\text{overcast}) = -\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4} = 0$
$H(X|A) = \frac{5}{14}\times 0.971 + \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 = 0.694$
DT: Example
Current: H(X) = 0.941
H(X|Outlook) = 0.694
Similarly, for each feature:
H(X|Temperature) = 0.911
H(X|Humidity) = 0.789
H(X|Wind) = 0.892
Recall: Gain(X, A) = H(X) - H(X|A)
The information gains are:
Gain(X, Outlook) = 0.247
Gain(X, Temperature) = 0.030
Gain(X, Humidity) = 0.152
Gain(X, Wind) = 0.049
Outlook is the best feature and should be used as the first node
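These numbers can be reproduced with a short script; only the class counts given above are needed (the helper names below are made up).

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()              # skip zero counts (0*log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

total = [9, 5]                          # overall [yes, no] counts
outlook = {'sunny': [2, 3], 'rain': [3, 2], 'overcast': [4, 0]}

H_X = entropy(total)
H_X_given_A = sum(sum(c) / sum(total) * entropy(c) for c in outlook.values())
print(round(H_X, 3), round(H_X_given_A, 3), round(H_X - H_X_given_A, 3))
# prints roughly 0.940, 0.694 and the gain 0.247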
DT: Example
Next Step
Repeat the steps for each sub-branch
Until there is no ambiguity (all samples are of the same class)
[Figure: the Outlook node with branches Sunny (3 No / 2 Yes), Rain (2 No / 3 Yes) and Overcast (0 No / 4 Yes); the Overcast branch is done, the other branches continue to select the next feature]
DT: Continuous-Valued Feature
So far, we have handled features with categorical values
How to build a decision tree whose features are numerical?
[Figure: 14 numerical values of a continuous feature, e.g. 29.9, 28.2, 35.2, 26.4, …]
DT: Continuous-Valued Feature
Accomplished by partitioning the continuous attribute value into a discrete set of intervals
In particular, a new Boolean feature Ac (A < c) can be created
How to select the best value for c?

value: 17.0 18.9 20.4 21.2 24.0 24.4 24.5 25.1 25.5 26.4 27.7 28.2 29.9 35.2
class:  Yes  Yes  Yes  No   Yes  No   Yes  Yes  No   Yes  Yes  No   No   Yes
DT: Continuous-Valued Feature
Objective is to minimize the entropy (or maximize the information gain)
Entropy only needs to be evaluated between points of different classes
The best c is chosen among the candidate cut points c1 … c8, which lie between adjacent points of different classes:

value: 17.0 18.9 20.4 21.2 24.0 24.4 24.5 25.1 25.5 26.4 27.7 28.2 29.9 35.2
class:  Yes  Yes  Yes  No   Yes  No   Yes  Yes  No   Yes  Yes  No   No   Yes
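A sketch of this threshold search: candidate cut points are taken midway between adjacent points of different classes, and the one with the highest information gain is kept. The code is illustrative, not the lecture's.

import numpy as np

values = np.array([17.0, 18.9, 20.4, 21.2, 24.0, 24.4, 24.5, 25.1,
                   25.5, 26.4, 27.7, 28.2, 29.9, 35.2])
labels = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1])   # 1 = Yes, 0 = No

def entropy(y):
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H = entropy(labels)
best_c, best_gain = None, -1.0
for i in range(len(values) - 1):
    if labels[i] != labels[i + 1]:                    # only between points of different classes
        c = (values[i] + values[i + 1]) / 2           # candidate threshold
        left, right = labels[values < c], labels[values >= c]
        H_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if H - H_cond > best_gain:
            best_c, best_gain = c, H - H_cond
print(best_c, round(best_gain, 3))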