MLDM2004S Lecture 11: An Introduction to Support Vector Machines


  • An introduction to Support Vector Machine

    Presenter: 黃立德

    References: Simon Haykin, "Neural Networks: A Comprehensive Foundation", 2nd edition, 1999, Chapters 2, 6

    Nello Cristianini, John Shawe-Taylor, "An Introduction to Support Vector Machines", 2000, Chapters 3–6

  • 2

    Outline
    • Drawbacks of learning
    • Overview of SVM
    • The Empirical Risk Minimization Principle
    • VC-dimension
    • Structural Risk Minimization
    • Linearly separable patterns
    • Non-linearly separable patterns
    • How to build a SVM for pattern recognition
    • Example: XOR problem
    • Properties and expansions of SVM
    • Conclusion
    • Applications of SVM
    • LIBSVM

  • 3

    Drawbacks of learning

    • The choice of the class of functions from which the input/output mapping must be sought.
      – Learning in three-node neural networks is known to be NP-complete.

  • 4

    Drawbacks of learning (cont.)
    • In practice, the following problems arise:
      – The learning algorithm may prove inefficient, as for example in the case of local minima.
      – The size of the output hypothesis can frequently become very large and impractical.
      – If there are only a limited number of training examples, the hypothesis found by the learning algorithm will lead to overfitting and hence poor generalization.
      – The learning algorithm is usually controlled by a large number of parameters that are often chosen by tuning heuristics, making the system difficult and unreliable to use.

  • 5

    Overview of SVM

    • What is SVM?
      – A linear machine with some very nice properties.
      – The goal is to construct a decision surface such that the margin of separation between positive and negative samples is maximized.
      – SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.

  • 6

    The Empirical Risk Minimization Principle

    • Given a set of data
      $(x_1, y_1), \ldots, (x_N, y_N), \quad x_i \in \mathbb{R}^n, \; y_i \in \{-1, +1\}$

    • Also given a set of decision functions
      $\{ f_\lambda : \lambda \in I \}, \quad \text{where } f_\lambda : \mathbb{R}^n \to \{-1, +1\}$

    • The expected risk is
      $R(\lambda) = \int \lvert f_\lambda(x) - y \rvert \, dP(x, y)$

  • 7

    The Empirical Risk Minimization Principle (cont.)

    • The approximation (empirical risk) is
      $R_{\mathrm{emp}}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \lvert f_\lambda(x_i) - y_i \rvert$

    • Theory of uniform convergence in probability:
      $\lim_{N \to \infty} P \left\{ \sup_{\lambda \in I} \left( R(\lambda) - R_{\mathrm{emp}}(\lambda) \right) > \varepsilon \right\} = 0, \quad \forall \varepsilon > 0$
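To make the empirical-risk formula concrete, here is a minimal numpy sketch; the data, the candidate decision function `f`, and the helper name are illustrative, not taken from the slides:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp(lambda) = (1/N) * sum_i |f(x_i) - y_i| for labels y_i in {-1, +1}.

    Because both f(x) and y take values in {-1, +1}, each term is 0 or 2,
    so R_emp is twice the fraction of misclassified samples.
    """
    preds = np.array([f(x) for x in X])
    return np.mean(np.abs(preds - y))

# Toy illustration: a fixed linear rule on four 2-D points (made-up data).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
f = lambda x: 1 if x[0] + x[1] > 0 else -1   # one candidate f_lambda
print(empirical_risk(f, X, y))               # 0.0 -> no training errors
```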

  • 8

    Vapnik-Chervonenkis dimension

    • VC-dimension
      – It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.

  • 9

    Structural Risk Minimization

    • Let $I_k$ be a subset of $I$ and define $S_k = \{ f_\lambda : \lambda \in I_k \}$.

    • Define a structure of nested subsets
      $S_1 \subset S_2 \subset \ldots \subset S_n \subset \ldots$

    • Each subset satisfies the condition
      $h_1 \le h_2 \le \ldots \le h_n \le \ldots$, where $h_i$ is the VC dimension of $S_i$.

  • 10

    Structural Risk Minimization (cont.)

    • Implementing SRM can be difficult because the VC dimension of $S_n$ could be hard to compute.

    • Support Vector Machines (SVMs) are able to achieve the goal of minimizing the upper bound of $R(\lambda)$ by minimizing a bound on the VC dimension $h$ and $R_{\mathrm{emp}}(\lambda)$ at the same time:
      $\min_n \left( R_{\mathrm{emp}}(\lambda) + \frac{h_n}{N} \right)$
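A toy sketch of the SRM selection rule above: given empirical risks and VC dimensions for the nested classes S_1 ⊂ S_2 ⊂ ..., pick the index minimizing the bound. All numbers here are invented purely for illustration.

```python
import numpy as np

# Hypothetical nested structure S_1 ⊂ S_2 ⊂ ...: richer classes fit the
# training data better (lower R_emp) but have larger VC dimension h_n.
R_emp = np.array([0.30, 0.18, 0.10, 0.08, 0.02, 0.00])   # made-up values
h     = np.array([2,    4,    8,    16,   32,   64])      # made-up VC dims
N     = 100                                                # training set size

bound = R_emp + h / N          # the guaranteed-risk bound from the slide
best  = np.argmin(bound) + 1   # 1-based index of the selected subset S_n
print(bound, "-> choose S_%d" % best)
```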

  • 11

    Concepts of SVM

    • SVM is an approximate implementation of the method of structural risk minimization

    • It does not incorporate problem-domain knowledge

  • 12

    Linearly separable patterns

    • A training sample: $\{ (x_i, d_i) \}_{i=1}^{N}$
    • The patterns are linearly separable.
    • The equation of a decision surface that does the separation is
      $w^T x + b = 0$
    • And we can write
      $w^T x_i + b \ge 0$ for $d_i = +1$
      $w^T x_i + b < 0$ for $d_i = -1$
      which, after rescaling $w$ and $b$, can be written as
      $d_i (w^T x_i + b) \ge 1$ for $i = 1, 2, \ldots, N$
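A small numpy check of the separability constraint d_i(w^T x_i + b) ≥ 1, using made-up data and an assumed (w, b) that are not from the slides:

```python
import numpy as np

# Made-up linearly separable sample {(x_i, d_i)}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([+1, +1, -1, -1])

# An assumed separating hyperplane w^T x + b = 0 (not the optimal one).
w = np.array([1.0, 1.0])
b = -1.0

margins = d * (X @ w + b)          # d_i (w^T x_i + b)
print(margins)                      # all entries >= 1 -> constraints hold
print(np.all(margins >= 1.0))
```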

  • 13

    Linearly separable patterns (cont.)

    • The discriminant function of the optimal hyperplane is
      $g(x) = w_0^T x + b_0$
    • Maximum Margin Rule
      – We select the hyperplane with the maximum margin to the nearest data points (the support vectors).

  • 14

    Linearly separable patterns (cont.)

    • The separating hyperplane and the margin boundaries through the support vectors are
      $w_0^T x + b_0 = +1, \quad w_0^T x + b_0 = 0, \quad w_0^T x + b_0 = -1$
    • The margin of separation is
      $\text{Margin} = \frac{2}{\lVert w_0 \rVert}$

  • 15

    Linearly separable patterns (cont.)

    • The final goal is to minimize the cost function
      $\Phi(w) = \frac{1}{2} w^T w$
    • We may solve the constrained optimization problem using the method of Lagrange multipliers. The Lagrangian function is
      $J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$
      where the $\alpha_i$ are the Lagrange multipliers.

  • 16

    Linearly separable patterns (cont.)

    • Dual form
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $\alpha_i \ge 0$ for $i = 1, 2, \ldots, N$
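The dual objective and its constraints can be written compactly with a Gram matrix; a minimal numpy sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def dual_objective(alpha, X, d):
    """Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j d_i d_j x_i^T x_j."""
    G = (X @ X.T) * np.outer(d, d)          # G_ij = d_i d_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def dual_constraints_ok(alpha, d, tol=1e-8):
    """Constraints (1) sum_i alpha_i d_i = 0 and (2) alpha_i >= 0."""
    return abs(alpha @ d) < tol and np.all(alpha >= -tol)
```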

  • 17

    Linearly separable patterns (cont.)

    • Having determined the optimum Lagrange multipliers $\alpha_{0,i}$, we can compute the optimum weight vector and bias:
      $w_0 = \sum_{i=1}^{N} \alpha_{0,i} d_i x_i$
      $b_0 = 1 - w_0^T x^{(s)}$ for a support vector $x^{(s)}$ with $d^{(s)} = 1$
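Given solved multipliers, recovering (w_0, b_0) from the two formulas above is direct; a hedged sketch assuming `alpha`, `X`, and `d` are arrays shaped as in the previous snippet:

```python
import numpy as np

def recover_hyperplane(alpha, X, d, tol=1e-8):
    """w_0 = sum_i alpha_{0,i} d_i x_i ;  b_0 = 1 - w_0^T x^(s) for d^(s) = +1."""
    w0 = (alpha * d) @ X                          # weighted sum of the inputs
    s = np.where((alpha > tol) & (d > 0))[0][0]   # any positive support vector
    b0 = 1.0 - w0 @ X[s]
    return w0, b0
```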

  • 18

    Non-linearly separable patterns

    • Allow training errors.
    • The definition of the decision surface becomes
      $d_i (w^T x_i + b) \ge 1 - \xi_i$ for $i = 1, 2, \ldots, N$
      where the $\xi_i \ge 0$ are slack variables.
    • At last, we only have to minimize the following function:
      $\Phi(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$

  • 19

    Non-linearly separable patterns (cont.)

    • Dual form
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$
      where C is a user-specified positive parameter.
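The slides do not prescribe a solver for this box-constrained dual. As one possible sketch, it can be handed to a generic optimizer such as SciPy's SLSQP; dedicated QP/SMO solvers (as used in LIBSVM) are far more efficient, so this is only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, d, C=1.0):
    """Maximize Q(alpha) subject to sum_i alpha_i d_i = 0 and 0 <= alpha_i <= C."""
    N = len(d)
    G = (X @ X.T) * np.outer(d, d)                     # d_i d_j x_i^T x_j
    neg_Q = lambda a: 0.5 * a @ G @ a - a.sum()        # minimize -Q(alpha)
    res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
                   bounds=[(0.0, C)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ d}])
    return res.x

# Tiny made-up example: two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = solve_soft_margin_dual(X, d, C=10.0)
print(np.round(alpha, 4))
```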

  • 20

    Non-linearly separable patterns (cont.)

    • After the optimum Lagrange multipliers have been determined, we can compute the optimum weight vector and bias:
      $w_0 = \sum_{i=1}^{N_s} \alpha_{0,i} d_i x_i$
      $b_0 = 1 - w_0^T x^{(s)}$ for $d^{(s)} = 1$

  • 21

    How to build a SVM for pattern recognition

    • Steps for constructing a SVM
      – Nonlinear mapping of input vectors into a high-dimensional feature space that is hidden from both the input and output.
      – Construction of an optimal hyperplane for separating the features.

  • 22

    How to build a SVM for pattern recognition (cont.)

    • x denotes a vector from the input space.
    • $\{ \varphi_j(x) \}_{j=1}^{m_1}$ denotes a set of nonlinear mappings from the input space to the feature space.
    • Define a hyperplane as follows:
      $\sum_{j=1}^{m_1} w_j \varphi_j(x) + b = 0$
      or equivalently, taking $\varphi_0(x) = 1$ and $w_0 = b$,
      $\sum_{j=0}^{m_1} w_j \varphi_j(x) = 0$
    • Define the vector
      $\varphi(x) = \left[ \varphi_0(x), \varphi_1(x), \ldots, \varphi_{m_1}(x) \right]^T$

  • 23

    How to build a SVM for pattern recognition (cont.)

    • We can write the equation in the compact form
      $w^T \varphi(x) = 0$    (Eq. 1)
    • Because the features are linearly separable, we may write
      $w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$    (Eq. 2)
    • Substituting Eq. 2 into Eq. 1, we get
      $\sum_{i=1}^{N} \alpha_i d_i \varphi^T(x_i) \varphi(x) = 0$

  • 24

    How to build a SVM for pattern recognition (cont.)

    • Define the inner-product kernel, denoted by
      $K(x, x_i) = \varphi^T(x) \varphi(x_i) = \sum_{j=0}^{m_1} \varphi_j(x) \varphi_j(x_i)$ for $i = 1, \ldots, N$
    • Now we may use the inner-product kernel to construct the optimal decision surface in the feature space without considering the feature space in explicit form:
      $\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0$

  • 25

    How to build a SVM for pattern recognition (cont.)

    • Optimum design of a SVM (dual form)
      – Given the training sample $\{ (x_i, d_i) \}_{i=1}^{N}$, find the Lagrange multipliers $\{ \alpha_i \}_{i=1}^{N}$ that maximize the objective function
        $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(x_i, x_j)$
      subject to the constraints
      (1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
      (2) $0 \le \alpha_i \le C$ for $i = 1, 2, \ldots, N$
      where C is a user-specified positive parameter.

  • 26

    How to build a SVM for pattern recognition (cont.)

    • We may view $K(x_i, x_j)$ as the ij-th element of a symmetric N-by-N matrix K:
      $\mathbf{K} = \left\{ K(x_i, x_j) \right\}_{(i,j)=1}^{N}$
    • Having found the optimum values of $\alpha_{o,i}$, we can get
      $w_o = \sum_{i=1}^{N} \alpha_{o,i} d_i \varphi(x_i)$
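A minimal sketch of the kernel machinery above: building the N-by-N matrix K and evaluating the decision surface Σ_i α_i d_i K(x, x_i) without forming φ explicitly. The polynomial kernel shown is one possible choice (it is the one used in the XOR example that follows):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """One possible inner-product kernel: K(x, z) = (1 + x^T z)^degree."""
    return (1.0 + x @ z) ** degree

def kernel_matrix(X, kernel):
    """K_ij = K(x_i, x_j), the symmetric N-by-N matrix used in the dual."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

def decision_surface(x, alpha, d, X, kernel):
    """sum_i alpha_i d_i K(x, x_i); the surface is where this expression is 0."""
    return sum(a * di * kernel(x, xi) for a, di, xi in zip(alpha, d, X))
```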

  • 27

    How to build a SVM for pattern recognition (cont.)

  • 28

    Example: XOR problem

    • First, we choose the kernel as
      $K(x, x_i) = \left( 1 + x^T x_i \right)^2$
    • With $x = [x_1, x_2]^T$ and $x_i = [x_{i1}, x_{i2}]^T$, we get
      $K(x, x_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
      $\varphi(x) = \left[ 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right]^T$
      $\varphi(x_i) = \left[ 1, \; x_{i1}^2, \; \sqrt{2}\, x_{i1} x_{i2}, \; x_{i2}^2, \; \sqrt{2}\, x_{i1}, \; \sqrt{2}\, x_{i2} \right]^T, \quad i = 1, 2, 3, 4$

  • 29

    Example: XOR problem (cont.)

    • We also find that
      $\mathbf{K} = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}$
    • The objective function for the dual form is
      $Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \left( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \right)$

  • 30

    Example: XOR problem (cont.)

    • Optimizing $Q(\alpha)$ with respect to the Lagrange multipliers yields
      $\partial Q(\alpha)/\partial \alpha_1 \;\Rightarrow\; 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_2 \;\Rightarrow\; -\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_3 \;\Rightarrow\; -\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1$
      $\partial Q(\alpha)/\partial \alpha_4 \;\Rightarrow\; \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1$

  • 31

    Example: XOR problem (cont.)

    • The optimum values of the $\alpha_{o,i}$ are
      $\alpha_{o,1} = \alpha_{o,2} = \alpha_{o,3} = \alpha_{o,4} = \frac{1}{8}$
      $Q(\alpha_o) = \frac{1}{4}$
      $\frac{1}{2} \lVert w_o \rVert^2 = \frac{1}{4} \;\Rightarrow\; \lVert w_o \rVert = \frac{1}{\sqrt{2}}$

  • 32

    Example: XOR problem (cont.)

    • We find that the optimum weight vector is
      $w_o = \frac{1}{8} \left[ -\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4) \right] = \left[ 0, \; 0, \; -\tfrac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right]^T$

  • 33

    Example: XOR problem (cont.)

    • The optimal hyperplane is defined by
      $w_o^T \varphi(x) = 0$
      $\left[ 0, \; 0, \; -\tfrac{1}{\sqrt{2}}, \; 0, \; 0, \; 0 \right] \left[ 1, \; x_1^2, \; \sqrt{2}\, x_1 x_2, \; x_2^2, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2 \right]^T = 0$
      which reduces to
      $-x_1 x_2 = 0$
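A quick check that the resulting decision function g(x) = −x_1 x_2 reproduces XOR on the same assumed ±1-encoded inputs:

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
d = np.array([-1, 1, 1, -1])

g = -X[:, 0] * X[:, 1]                 # g(x) = -x1 * x2 from the hyperplane
print(np.sign(g))                       # [-1  1  1 -1] -> agrees with d
```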

  • 34

    Properties and expansions of SVM

    • Two important features:
      – Duality is the first feature of SVM.
      – SVM operates in a kernel-induced feature space.
    • Several expansions of SVM:
      – C-Support Vector Classification (binary case)
      – ν-Support Vector Classification (binary case)
      – Distribution Estimation (one-class SVM)
      – ε-Support Vector Regression (ε-SVR)
      – ν-Support Vector Regression (ν-SVR)

  • 35

    Conclusion
    • The SVM is an elegant and highly principled learning method for the design of classifiers for nonlinear input data.
    • Compared with the back-propagation algorithm:
      – It operates only in a batch mode.
      – Whatever the learning task, it provides a method for controlling model complexity independently of dimensionality.
      – It is guaranteed to find a global extremum of the error surface.
      – The computation can be performed efficiently.
      – By using a suitable inner-product kernel, the SVM computes all the important network parameters automatically.

  • 36

    Applications of SVM

    • Classification
    • Regression
    • Recognition
    • Bioinformatics

  • 37

    LIBSVM

    • A Library for Support Vector Machines
    • Made by Chih-Jen Lin and Chih-Chung Chang
    • Both C++ and Java sources
    • http://www.csie.ntu.edu.tw/~cjlin/
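As a closing illustration (not part of the original slides), this is one way the XOR example could be run through LIBSVM's Python interface; the import path assumes the pip-installable `libsvm` package, and the kernel options are chosen to reproduce K(x, x_i) = (1 + x^T x_i)^2:

```python
# Hedged sketch: assumes the pip package `libsvm`; with the source
# distribution the same functions live in python/svmutil.py instead.
from libsvm.svmutil import svm_train, svm_predict

# XOR in the ±1 encoding used in the example above.
x = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
y = [-1, 1, 1, -1]

# -s 0: C-SVC, -t 1: polynomial kernel (gamma*u'v + coef0)^degree,
# -d 2 -g 1 -r 1: reproduces K(x, x_i) = (1 + x^T x_i)^2, -c 10: C parameter.
model = svm_train(y, x, '-s 0 -t 1 -d 2 -g 1 -r 1 -c 10')

# Predict back on the training points; accuracy should be 100%.
labels, acc, vals = svm_predict(y, x, model)
print(labels)
```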
