MLDM2004S Lecture 11: An Introduction to Support Vector Machines


  • An introduction to Support Vector Machine

    Presenter: 黃立德

    References: Simon Haykin, "Neural Networks: A Comprehensive Foundation", 2nd edition, 1999, Chapters 2, 6

    Nello Cristianini, John Shawe-Taylor, "An Introduction to Support Vector Machines", 2000, Chapters 3–6

  • 2

    Outline
    • Drawbacks of learning
    • Overview of SVM
    • The Empirical Risk Minimization Principle
    • VC-dimension
    • Structural Risk Minimization
    • Linearly separable patterns
    • Non-linearly separable patterns
    • How to build a SVM for pattern recognition
    • Example: XOR problem
    • Properties and expansions of SVM
    • Conclusion
    • Applications of SVM
    • LIBSVM

  • 3

    Drawbacks of learning

    • The choice of the class of functions from which the input/output mapping must be sought.
      – Learning in three-node neural networks is known to be NP-complete.

  • 4

    Drawbacks of learning (cont.)
    • In practice, the following problems arise:
      – The learning algorithm may prove inefficient, as for example in the case of local minima.
      – The size of the output hypothesis can frequently become very large and impractical.
      – If there are only a limited number of training examples, the hypothesis found by the learning algorithm will lead to overfitting and hence poor generalization.
      – The learning algorithm is usually controlled by a large number of parameters that are often chosen by tuning heuristics, making the system difficult and unreliable to use.

  • 5

    Overview of SVM

    • What is SVM?
      – A linear machine with some very nice properties.
      – The goal is to construct a decision surface such that the margin of separation between positive and negative samples is maximized.
      – SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.

  • 6

    The Empirical Risk Minimization Principle

    • Given a set of data
      $(x_1, y_1), \ldots, (x_N, y_N), \quad x_i \in \mathbb{R}^n, \; y_i \in \{-1, +1\}$

    • Also given a set of decision functions
      $\{ f_\lambda : \lambda \in I \}, \quad \text{where } f_\lambda : \mathbb{R}^n \to \{-1, +1\}$

    • The expected risk is
      $R(\lambda) = \int \lvert f_\lambda(x) - y \rvert \, dP(x, y)$

  • 7

    The Empirical Risk Minimization Principle (cont.)

    • The approximation (empirical risk) is
      $R_{\mathrm{emp}}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \lvert f_\lambda(x_i) - y_i \rvert$

    • Theory of uniform convergence in probability:
      $\lim_{N \to \infty} P \left\{ \sup_{\lambda \in I} \left( R(\lambda) - R_{\mathrm{emp}}(\lambda) \right) > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0$
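To make the empirical-risk formula concrete, here is a minimal numpy sketch; the data, the candidate decision function `f`, and the helper name are illustrative, not taken from the slides:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp(lambda) = (1/N) * sum_i |f(x_i) - y_i| for labels y_i in {-1, +1}.

    Because both f(x) and y take values in {-1, +1}, each term is 0 or 2,
    so R_emp is twice the fraction of misclassified samples.
    """
    preds = np.array([f(x) for x in X])
    return np.mean(np.abs(preds - y))

# Toy illustration: a fixed linear rule on four 2-D points (made-up data).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
f = lambda x: 1 if x[0] + x[1] > 0 else -1   # one candidate f_lambda
print(empirical_risk(f, X, y))               # 0.0 -> no training errors
```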

  • 8

    Vapnik-Chervonenkis dimension

    • VC-dimension
      – It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.

  • 9

    Structural Risk Minimization

    • Let $I_k$ be a subset of $I$ and define $S_k = \{ f_\lambda : \lambda \in I_k \}$.

    • Define a structure of nested subsets
      $S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots$

    • Each subset satisfies the condition
      $h_1 \le h_2 \le \ldots \le h_n \le \ldots$, where $h_i$ is the VC dimension of $S_i$.

  • 10

    Structural Risk Minimization (cont.)

    • Implementing SRM can be difficult because the VC dimension of $S_n$ could be hard to compute.

    • Support Vector Machines (SVMs) are able to achieve the goal of minimizing the upper bound of $R(\lambda)$ by minimizing a bound on the VC dimension $h$ and $R_{\mathrm{emp}}(\lambda)$ at the same time:
      $\min_n \left( R_{\mathrm{emp}}(\lambda) + \frac{h_n}{N} \right)$
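A toy sketch of the SRM selection rule above: given empirical risks and VC dimensions for the nested classes S_1 ⊂ S_2 ⊂ ..., pick the index minimizing the bound. All numbers here are invented purely for illustration.

```python
import numpy as np

# Hypothetical nested structure S_1 ⊂ S_2 ⊂ ...: richer classes fit the
# training data better (lower R_emp) but have larger VC dimension h_n.
R_emp = np.array([0.30, 0.18, 0.10, 0.08, 0.02, 0.00])   # made-up values
h     = np.array([2,    4,    8,    16,   32,   64])      # made-up VC dims
N     = 100                                                # training set size

bound = R_emp + h / N          # the guaranteed-risk bound from the slide
best  = np.argmin(bound) + 1   # 1-based index of the selected subset S_n
print(bound, "-> choose S_%d" % best)
```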

  • 11

    Concepts of SVM

    • SVM is an approximate implementation of the method of structural risk minimization

    • It does not incorporate problem-domain knowledge

  • 12

    Linearly separable patterns

    • A training sample: $\{ (x_i, d_i) \}_{i=1}^{N}$
    • The patterns are linearly separable.
    • The equation of a decision surface that does the separation is
      $w^T x + b = 0$
    • And we can write
      $w^T x_i + b \ge 0$ for $d_i = +1$
      $w^T x_i + b < 0$ for $d_i = -1$
      which, after rescaling $w$ and $b$, can be written as
      $d_i (w^T x_i + b) \ge 1$ for $i = 1, 2, \ldots, N$
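A small numpy check of the separability constraint d_i(w^T x_i + b) ≥ 1, using made-up data and an assumed (w, b) that are not from the slides:

```python
import numpy as np

# Made-up linearly separable sample {(x_i, d_i)}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([+1, +1, -1, -1])

# An assumed separating hyperplane w^T x + b = 0 (not the optimal one).
w = np.array([1.0, 1.0])
b = -1.0

margins = d * (X @ w + b)          # d_i (w^T x_i + b)
print(margins)                      # all entries >= 1 -> constraints hold
print(np.all(margins >= 1.0))
```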

  • 13

    Linearly separable patterns (cont.)

    • The discriminant function of the optimal hyperplane is
      $g(x) = w_0^T x + b_0$
    • Maximum Margin Rule
      – We select the hyperplane with the maximum margin to the nearest data points (the support vectors).

  • 14

    Linearly separable patterns (cont.)

    • The separating hyperplane and the margin boundaries through the support vectors are
      $w_0^T x + b_0 = +1, \quad w_0^T x + b_0 = 0, \quad w_0^T x + b_0 = -1$
    • The margin of separation is
      $\text{Margin} = \frac{2}{\lVert w_0 \rVert}$

  • 15

    Linearly separable patterns (cont.)

    • The final goal is to minimize the cost function
      $\Phi(w) = \frac{1}{2} w^T w$
    • We may solve the constrained optimization problem using the method of Lagrange multipliers. The Lagrangian function is
      $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$
      where the $\alpha_i$ are the Lagrange multipliers.

  • 16

    Linearly separable patterns (cont.)

    • Dual form
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $\alpha_i \ge 0$ for $i = 1, 2, \ldots, N$
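The dual objective and its constraints can be written compactly with a Gram matrix; a minimal numpy sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def dual_objective(alpha, X, d):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j d_i d_j x_i^T x_j."""
    G = (X @ X.T) * np.outer(d, d)          # G_ij = d_i d_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def dual_constraints_ok(alpha, d, tol=1e-8):
    """Constraints (1) sum_i alpha_i d_i = 0 and (2) alpha_i >= 0."""
    return abs(alpha @ d) < tol and np.all(alpha >= -tol)
```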

  • 17

    Linearly separable patterns (cont.)

    • Having determined the optimum Lagrange multipliers $\alpha_{0,i}$, we can compute the optimum weight vector and bias:
      $w_0 = \sum_{i=1}^{N} \alpha_{0,i} d_i x_i$
      $b_0 = 1 - w_0^T x^{(s)}$ for a support vector $x^{(s)}$ with $d^{(s)} = 1$
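Given solved multipliers, recovering (w_0, b_0) from the two formulas above is direct; a hedged sketch assuming `alpha`, `X`, and `d` are arrays shaped as in the previous snippet:

```python
import numpy as np

def recover_hyperplane(alpha, X, d, tol=1e-8):
    """w_0 = sum_i alpha_{0,i} d_i x_i ;  b_0 = 1 - w_0^T x^(s) for d^(s) = +1."""
    w0 = (alpha * d) @ X                          # weighted sum of the inputs
    s = np.where((alpha > tol) & (d > 0))[0][0]   # any positive support vector
    b0 = 1.0 - w0 @ X[s]
    return w0, b0
```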

  • 18

    Non-linearly separable patterns

    • Allow training errors.
    • The definition of the decision surface becomes
      $d_i (w^T x_i + b) \ge 1 - \xi_i$ for $i = 1, 2, \ldots, N$
      where the $\xi_i \ge 0$ are slack variables.
    • At last, we only have to minimize the following function:
      $\Phi(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$

  • 19

    Non-linearly separable patterns (cont.)

    • Dual form
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$
      where C is a user-specified positive parameter.
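The slides do not prescribe a solver for this box-constrained dual. As one possible sketch, it can be handed to a generic optimizer such as SciPy's SLSQP; dedicated QP/SMO solvers (as used in LIBSVM) are far more efficient, so this is only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, d, C=1.0):
    """Maximize Q(alpha) subject to sum_i alpha_i d_i = 0 and 0 <= alpha_i <= C."""
    N = len(d)
    G = (X @ X.T) * np.outer(d, d)                     # d_i d_j x_i^T x_j
    neg_Q = lambda a: 0.5 * a @ G @ a - a.sum()        # minimize -Q(alpha)
    res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ d}])
    return res.x

# Tiny made-up example: two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = solve_soft_margin_dual(X, d, C=10.0)
print(np.round(alpha, 4))
```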

  • 20

    Non-linearly separable patterns (cont.)

    • After the optimum Lagrange multipliers have been determined, we can compute the optimum weight vector and bias:
      $w_0 = \sum_{i=1}^{N_s} \alpha_{0,i} d_i x_i$
      $b_0 = 1 - w_0^T x^{(s)}$ for $d^{(s)} = 1$

  • 21

    How to build a SVM for pattern recognition

    • Steps for constructing a SVM
      – Nonlinear mapping of input vectors into a high-dimensional feature space that is hidden from both the input and output.
      – Construction of an optimal hyperplane for separating the features.

  • 22

    How to build a SVM for pattern recognition (cont.)

    • x denotes a vector from the input space.
    • $\{ \varphi_j(x) \}_{j=1}^{m_1}$ denotes a set of nonlinear mappings from the input space to the feature space.
    • Define a hyperplane as follows:
      $\sum_{j=1}^{m_1} w_j \varphi_j(x) + b = 0$
      or equivalently, taking $\varphi_0(x) = 1$ and $w_0 = b$,
      $\sum_{j=0}^{m_1} w_j \varphi_j(x) = 0$
    • Define the vector
      $\varphi(x) = \left[ \varphi_0(x), \varphi_1(x), \ldots, \varphi_{m_1}(x) \right]^T$

  • 23

    How to build a SVM for pattern recognition (cont.)

    • We can write the equation in the compact form
      $w^T \varphi(x) = 0$    (Eq. 1)
    • Because the features are linearly separable, we may write
      $w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$    (Eq. 2)
    • Substituting Eq. 2 into Eq. 1, we get
      $\sum_{i=1}^{N} \alpha_i d_i \varphi^T(x_i) \varphi(x) = 0$

  • 24

    How to build a SVM for pattern recognition (cont.)

    • Define the inner-product kernel, denoted by
      $K(x, x_i) = \varphi^T(x) \varphi(x_i) = \sum_{j=0}^{m_1} \varphi_j(x) \varphi_j(x_i)$ for $i = 1, \ldots, N$
    • Now we may use the inner-product kernel to construct the optimal decision surface in the feature space without considering the feature space in explicit form:
      $\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0$

  • 25

    How to build a SVM for pattern recognition (cont.)

    • Optimum design of a SVM (dual form)
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(x_i, x_j)$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$
      where C is a user-specified positive parameter.

  • 26

    How to build a SVM for pattern recognition (cont.)

    • We may view $K(x_i, x_j)$ as the ij-th element of a symmetric N-by-N matrix K:
      $\mathbf{K} = \left\{ K(x_i, x_j) \right\}_{(i,j)=1}^{N}$
    • Having found the optimum values of $\alpha_{o,i}$, we can get
      $w_o = \sum_{i=1}^{N} \alpha_{o,i} d_i \varphi(x_i)$
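A minimal sketch of the kernel machinery above: building the N-by-N matrix K and evaluating the decision surface Σ_i α_i d_i K(x, x_i) without forming φ explicitly. The polynomial kernel shown is one possible choice (it is the one used in the XOR example that follows):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """One possible inner-product kernel: K(x, z) = (1 + x^T z)^degree."""
    return (1.0 + x @ z) ** degree

def kernel_matrix(X, kernel):
    """K_ij = K(x_i, x_j), the symmetric N-by-N matrix used in the dual."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def decision_surface(x, alpha, d, X, kernel):
    """sum_i alpha_i d_i K(x, x_i); the surface is where this expression is 0."""
    return sum(a * di * kernel(x, xi) for a, di, xi in zip(alpha, d, X))
```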

  • 27

    How to build a SVM for pattern recognition (cont.)

  • 28

    Example: XOR problem

    • First, we choose the kernel as
      $K(x, x_i) = \left( 1 + x^T x_i \right)^2$
    • With $x = [x_1, x_2]^T$ and $x_i = [x_{i1}, x_{i2}]^T$, we get
      $K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
      $\varphi(x) = \left[ 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right]^T$
      $\varphi(x_i) = \left[ 1, \; x_{i1}^2, \; \sqrt{2}\, x_{i1} x_{i2}, \; x_{i2}^2, \; \sqrt{2}\, x_{i1}, \; \sqrt{2}\, x_{i2} \right]^T, \quad i = 1, 2, 3, 4$

  • 29

    Example: XOR problem (cont.)

    • We also find that
      $\mathbf{K} = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}$
    • The objective function for the dual form is
      $Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)$

  • 30

    Example: XOR problem (cont.)

    • Optimizing $Q(\alpha)$ with respect to the Lagrange multipliers yields
      $\partial Q(\alpha)/\partial \alpha_1 \;\Rightarrow\; 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_2 \;\Rightarrow\; -\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_3 \;\Rightarrow\; -\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_4 \;\Rightarrow\; \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$

  • 31

    Example: XOR problem (cont.)

    • The optimum values of the $\alpha_{o,i}$ are
      $\alpha_{o,1} = \alpha_{o,2} = \alpha_{o,3} = \alpha_{o,4} = \frac{1}{8}$
      $Q(\alpha_o) = \frac{1}{4}$
      $\frac{1}{2} \lVert w_o \rVert^2 = \frac{1}{4} \;\Rightarrow\; \lVert w_o \rVert = \frac{1}{\sqrt{2}}$

  • 32

    Example: XOR problem (cont.)

    • We find that the optimum weight vector is
      $w_o = \frac{1}{8} \left[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \right] = \left[ 0, \; 0, \; -\tfrac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right]^T$

  • 33

    Example: XOR problem (cont.)

    • The optimal hyperplane is defined by
      $w_o^T \varphi(x) = 0$
      $\left[ 0, \; 0, \; -\tfrac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right] \left[ 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right]^T = 0$
      which reduces to
      $-x_1 x_2 = 0$
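A quick check that the resulting decision function g(x) = −x_1 x_2 reproduces XOR on the same assumed ±1-encoded inputs:

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
d = np.array([-1, 1, 1, -1])

g = -X[:, 0] * X[:, 1]                 # g(x) = -x1 * x2 from the hyperplane
print(np.sign(g))                       # [-1  1  1 -1] -> agrees with d
```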

  • 34

    Properties and expansions of SVM

    • Two important features:
      – Duality is the first feature of SVM.
      – SVM operates in a kernel-induced feature space.
    • Several expansions of SVM:
      – C-Support Vector Classification (binary case)
      – ν-Support Vector Classification (binary case)
      – Distribution Estimation (one-class SVM)
      – ε-Support Vector Regression (ε-SVR)
      – ν-Support Vector Regression (ν-SVR)

  • 35

    Conclusion
    • The SVM is an elegant and highly principled learning method for the design of classifiers for nonlinear input data.
    • Compared with the back-propagation algorithm:
      – It operates only in a batch mode.
      – Whatever the learning task, it provides a method for controlling model complexity independently of dimensionality.
      – It is guaranteed to find a global extremum of the error surface.
      – The computation can be performed efficiently.
      – By using a suitable inner-product kernel, the SVM computes all the important network parameters automatically.

  • 36

    Applications of SVM

    • Classification
    • Regression
    • Recognition
    • Bioinformatics

  • 37

    LIBSVM

    • A Library for Support Vector Machines
    • Made by Chih-Jen Lin and Chih-Chung Chang
    • Both C++ and Java sources
    • http://www.csie.ntu.edu.tw/~cjlin/
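As a closing illustration (not part of the original slides), this is one way the XOR example could be run through LIBSVM's Python interface; the import path assumes the pip-installable `libsvm` package, and the kernel options are chosen to reproduce K(x, x_i) = (1 + x^T x_i)^2:

```python
# Hedged sketch: assumes the pip package `libsvm`; with the source
# distribution the same functions live in python/svmutil.py instead.
from libsvm.svmutil import svm_train, svm_predict

# XOR in the ±1 encoding used in the example above.
x = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
y = [-1, 1, 1, -1]

# -s 0: C-SVC, -t 1: polynomial kernel (gamma*u'v + coef0)^degree,
# -d 2 -g 1 -r 1: reproduces K(x, x_i) = (1 + x^T x_i)^2, -c 10: C parameter.
model = svm_train(y, x, '-s 0 -t 1 -d 2 -g 1 -r 1 -c 10')

# Predict back on the training points; accuracy should be 100%.
labels, acc, vals = svm_predict(y, x, model)
print(labels)
```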
