Support Vector Machines (Jie Tang)


TRANSCRIPT

  • Slide 1/28: Introduction to Support Vector Machines

    Jie Tang

    25 July 2005

  • Slide 2/28: Introduction

    The Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory.

    It was developed in the 1990s to address problems in traditional statistical learning (overfitting, capacity control, ...).

    It has achieved the best performance in practical applications such as handwritten digit recognition and text categorization.

  • Slide 3/28: Classification Problem

    Given a training set S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}, with x_i \in X = R^n and y_i \in Y = {1, -1}, i = 1, 2, ..., l.

    The goal is to learn a function g(x) such that the decision function f(x) = sgn(g(x)) can classify a new input x.

    So this is a supervised batch learning method.

  • Slide 4/28: Linear classifier

        g(x) = w^T x + b

        f(x) = sgn(g(x)),   where sgn(g(x)) = +1 if g(x) >= 0, and -1 if g(x) < 0.

    (A small code sketch of this decision rule follows below.)
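    The slide has no code, so here is a minimal sketch to make the notation concrete. The 2-D toy dataset and the values of w and b are assumptions picked by hand for illustration; nothing here is learned (training is derived on the following slides):

        import numpy as np

        # Toy 2-D training set: x_i in R^2, y_i in {+1, -1} (illustrative values only).
        X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
        y = np.array([1, 1, -1, -1])

        # A candidate linear classifier g(x) = w^T x + b; w and b are chosen by hand.
        w = np.array([1.0, 1.0])
        b = -1.0

        def g(x):
            return w @ x + b

        def f(x):
            # Decision function f(x) = sgn(g(x)), mapping g(x) = 0 to +1 as on the slide.
            return 1 if g(x) >= 0 else -1

        print([f(x) for x in X])   # -> [1, 1, -1, -1]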

  • Slide 5/28: Maximum Margin Classifier

    [Figure: a separating hyperplane w^T x + b = 0 with two parallel margin hyperplanes through the closest points x^{(1)} and x^{(2)}, at distances (w^T x^{(1)} + b) / ||w|| and (w^T x^{(2)} + b) / ||w|| from it.]

  • Slide 6/28: Maximum Margin Classifier

    Let us select two points x^{(1)} and x^{(2)} on the two margin hyperplanes, respectively.

    The functional margin is min_i y^{(i)} (w^T x^{(i)} + b), which is > 0 when the data are separated.

    The signed distance from a point x to the hyperplane w^T x + b = 0 is (w^T x + b) / ||w||; in particular, the distance from the hyperplane to the origin is |b| / ||w||. So the two selected points lie at distances (w^T x^{(1)} + b) / ||w|| and -(w^T x^{(2)} + b) / ||w|| from the separating hyperplane.

  • Slide 7/28: Maximum Margin Classifier

    [Figure: the same construction, with w and b scaled so that the two margin hyperplanes are w^T x + b = +1 (through x^{(1)}) and w^T x + b = -1 (through x^{(2)}). The margin, the gap between the two hyperplanes, is (w^T x^{(1)} + b) / ||w|| - (w^T x^{(2)} + b) / ||w|| = 2 / ||w||.]

  • Slide 8/28: Then

        max_{w,b}  2 / ||w||    is equivalent to    min_{w,b}  ||w||^2 / 2

    Note: we have the constraints

        w^T x^{(i)} + b >= \hat{\gamma}    for the positive examples,
        w^T x^{(j)} + b <= -\hat{\gamma}   for the negative examples,

    which are equivalent to

        y^{(i)} (w^T x^{(i)} + b) >= \hat{\gamma},   i \in [1, m].

    By scaling w and b we can set the functional margin \hat{\gamma} = 1, so the problem becomes

        min_{w,b}  ||w||^2 / 2
        s.t.  y^{(i)} (w^T x^{(i)} + b) >= 1,   i \in [1, m]

    (A short numeric illustration follows below.)
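    As a quick sketch of the primal objective and its constraints (illustration only; the toy data and the candidate w, b are assumptions, not the solution of the optimization problem):

        import numpy as np

        X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
        y = np.array([1, 1, -1, -1])
        w = np.array([0.5, 0.5])
        b = -1.0

        objective = 0.5 * np.dot(w, w)        # ||w||^2 / 2, the quantity to minimize
        constraints = y * (X @ w + b)         # each entry must be >= 1
        margin = 2.0 / np.linalg.norm(w)      # the geometric margin 2 / ||w||

        print(objective, constraints, margin)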


  • Slide 10/28: Let us review the generalized Lagrangian

    The general problem:

        min_w  f(w)
        s.t.   g_i(w) <= 0,  i = 1, ..., k
               h_j(w) = 0,   j = 1, ..., l

    By the Lagrangian:

        L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w),   with \alpha_i >= 0

    Let us consider

        \theta_P(w) = max_{\alpha, \beta : \alpha_i >= 0} L(w, \alpha, \beta)

    Note: the constraints must be satisfied; otherwise, the max of L will be infinite.

  • Slide 11/28: Let us review the generalized Lagrangian

    If the constraints are satisfied, then we must have

        max_{\alpha, \beta : \alpha_i >= 0} L(w, \alpha, \beta) = f(w)

    Now you can see that max L takes the same value as the objective of our problem, f(w). Therefore, we can consider the minimization problem

        min_w \theta_P(w) = min_w max_{\alpha, \beta : \alpha_i >= 0} L(w, \alpha, \beta)

    Let us define the optimal value of the primal problem as p*.

    Then, let us define the dual problem

        \theta_D(\alpha, \beta) = min_w L(w, \alpha, \beta)

        max_{\alpha, \beta : \alpha_i >= 0} \theta_D(\alpha, \beta) = max_{\alpha, \beta : \alpha_i >= 0} min_w L(w, \alpha, \beta)

    They are similar.

    Now we define the optimal value of the dual problem as d*.

  • Slide 12/28: Relationship between Primal and Dual Problems

        d* = max_{\alpha, \beta : \alpha_i >= 0} min_w L(w, \alpha, \beta)  <=  min_w max_{\alpha, \beta : \alpha_i >= 0} L(w, \alpha, \beta) = p*

    Why? Just remember it.

    Then, if under some conditions d* = p*, we can solve the dual problem in lieu of the primal problem.

    What are the conditions?

  • Slide 13/28: The famous KKT conditions

    Karush-Kuhn-Tucker conditions:

        \partial L(w*, \alpha*, \beta*) / \partial w_i = 0       for every component w_i
        \partial L(w*, \alpha*, \beta*) / \partial \beta_i = 0,  i \in [1, l]
        \alpha_i* g_i(w*) = 0,                                   i \in [1, k]
        g_i(w*) <= 0,                                            i \in [1, k]
        \alpha_i* >= 0,                                          i \in [1, k]

    What does it imply?

  • Slide 14/28: The famous KKT conditions

    Karush-Kuhn-Tucker conditions (the complementary slackness condition):

        \alpha_i* g_i(w*) = 0,   i \in [1, k]

        If \alpha_i* > 0, then g_i(w*) = 0.
        If g_i(w*) < 0, then \alpha_i* = 0. ???

    Very important!

  • Slide 15/28: Return to our problem

        min_{w,b}  ||w||^2 / 2
        s.t.  y^{(i)} (w^T x^{(i)} + b) >= 1,   i \in [1, m]

        L(w, b, \alpha) = ||w||^2 / 2 - \sum_{i=1}^{m} \alpha_i [ y^{(i)} (w^T x^{(i)} + b) - 1 ]

    Let us first solve min_w L with respect to w:

        \nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0   =>   w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

        \partial L(w, b, \alpha) / \partial b = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0

    Substitute the two equations back into L(w, b, \alpha).

  • Slide 16/28: We have

        L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - (1/2) \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}

    Then we have the following maximization problem with respect to \alpha:

        max_\alpha  L(\alpha) = \sum_{i=1}^{m} \alpha_i - (1/2) \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j <x^{(i)}, x^{(j)}>
        s.t.  \alpha_i >= 0,   i \in [1, m]
              \sum_{i=1}^{m} \alpha_i y^{(i)} = 0

    Now we have only one set of parameters, \alpha. We can solve for \alpha, then for w, and then for b, because:

        b* = - ( max_{i: y^{(i)} = -1} w*^T x^{(i)} + min_{i: y^{(i)} = 1} w*^T x^{(i)} ) / 2

    (A short sketch of recovering w and b from \alpha follows below.)
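    A minimal sketch of that last step, assuming the optimal dual variables \alpha have already been obtained from some QP or SMO solver (the numbers below are purely illustrative):

        import numpy as np

        X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
        y = np.array([1.0, 1.0, -1.0, -1.0])
        alpha = np.array([0.1, 0.0, 0.1, 0.0])    # assumed solver output; satisfies sum_i alpha_i y_i = 0

        w = (alpha * y) @ X                        # w = sum_i alpha_i y^{(i)} x^{(i)}
        b = -0.5 * (np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w))

        print(w, b)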

  • Slide 17/28: How to predict

    For a new sample x, we can predict its label by computing:

        w^T x + b = ( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} )^T x + b
                  = \sum_{i=1}^{m} \alpha_i y^{(i)} <x^{(i)}, x> + b

    (A small sketch of this rule follows below.)
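    A sketch of this dual-form prediction rule, reusing the toy quantities from the previous sketch (illustrative values, not a trained model):

        import numpy as np

        def predict(x, X, y, alpha, b):
            # f(x) = sgn( sum_i alpha_i y^{(i)} <x^{(i)}, x> + b )
            g = np.sum(alpha * y * (X @ x)) + b
            return 1 if g >= 0 else -1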

  • Slide 18/28: Non-separable case

    What is the non-separable case? I will not give an example; I suppose you know that.

    Then the optimization problem is:

        min_{w,b}  ||w||^2 / 2 + C \sum_{i=1}^{m} \xi_i
        s.t.  y^{(i)} (w^T x^{(i)} + b) >= 1 - \xi_i,   i \in [1, m]
              \xi_i >= 0,                               i \in [1, m]

    Next, form the Lagrangian:

        L(w, b, \xi, \alpha, r) = ||w||^2 / 2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i [ y^{(i)} (w^T x^{(i)} + b) - 1 + \xi_i ] - \sum_{i=1}^{m} r_i \xi_i

    (A short numeric sketch of the objective with its slack variables follows below.)
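    A sketch of evaluating the soft-margin objective for a candidate (w, b). At the optimum each slack equals max(0, 1 - y^{(i)} (w^T x^{(i)} + b)), so that expression is used here; the data and the value of C are assumptions for illustration:

        import numpy as np

        def soft_margin_objective(w, b, X, y, C):
            slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))    # xi_i >= 0
            return 0.5 * np.dot(w, w) + C * np.sum(slacks)

        X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [0.5, 0.5]])   # last point violates the margin
        y = np.array([1.0, 1.0, -1.0, -1.0])
        print(soft_margin_objective(np.array([0.5, 0.5]), -1.0, X, y, C=1.0))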

  • Slide 19/28: Dual form

        max_\alpha  L(\alpha) = \sum_{i=1}^{m} \alpha_i - (1/2) \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j <x^{(i)}, x^{(j)}>
        s.t.  0 <= \alpha_i <= C,   i \in [1, m]
              \sum_{i=1}^{m} \alpha_i y^{(i)} = 0

    What is the difference from the previous form??!!

    Also note the following conditions:

        \alpha_i = 0        =>  y^{(i)} (w^T x^{(i)} + b) >= 1
        \alpha_i = C        =>  y^{(i)} (w^T x^{(i)} + b) <= 1
        0 < \alpha_i < C    =>  y^{(i)} (w^T x^{(i)} + b) = 1

  • Slide 20/28: How to train an SVM = how to solve the optimization problem

    Sequential minimal optimization (SMO) algorithm, due to John Platt.

    First, let us introduce the coordinate ascent algorithm (a runnable sketch follows below):

        Loop until convergence: {
            For i = 1, ..., m {
                a_i := argmax_{a_i} L(a_1, ..., a_{i-1}, a_i, a_{i+1}, ..., a_m)
            }
        }
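    A runnable sketch of plain coordinate ascent (not SMO itself). To keep the single-coordinate argmax in closed form, it maximizes an assumed concave quadratic L(a) = c^T a - 0.5 a^T Q a with a positive-definite Q; the SVM dual adds the box and equality constraints that SMO handles on the next slides:

        import numpy as np

        Q = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive definite (illustrative)
        c = np.array([1.0, 1.0])
        a = np.zeros(2)

        for _ in range(100):                      # "loop until convergence"
            for i in range(len(a)):
                # argmax over a_i with the other coordinates fixed:
                # dL/da_i = c_i - Q_ii a_i - sum_{j != i} Q_ij a_j = 0
                rest = Q[i] @ a - Q[i, i] * a[i]
                a[i] = (c[i] - rest) / Q[i, i]

        print(a)                                  # approaches the exact maximizer Q^{-1} c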


  • Slide 22/28: SMO

    Because of the constraint \sum_{i=1}^{m} \alpha_i y^{(i)} = 0, we cannot update a single \alpha on its own. Holding \alpha_3, ..., \alpha_m fixed:

        \alpha_1 y^{(1)} + \alpha_2 y^{(2)} = - \sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta

        \alpha_1 = ( \zeta - \alpha_2 y^{(2)} ) y^{(1)}

        L(\alpha) = L( ( \zeta - \alpha_2 y^{(2)} ) y^{(1)}, \alpha_2, ..., \alpha_m )

    Change the algorithm as follows; this is just SMO. Repeat until convergence: {

        1. Select some pair \alpha_i and \alpha_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).

        2. Re-optimize L(\alpha) with respect to \alpha_i and \alpha_j, while holding all the other \alpha's fixed.

    }

  • Slide 23/28: SMO (2)

        L(\alpha) = L( ( \zeta - \alpha_2 y^{(2)} ) y^{(1)}, \alpha_2, ..., \alpha_m )

    This is a quadratic function in \alpha_2, i.e. it can be written as

        a \alpha_2^2 + b \alpha_2 + c

  • Slide 24/28: Solving \alpha_2

    For the quadratic function a \alpha_2^2 + b \alpha_2 + c, we can simply solve it by setting its derivative to zero. Let us call the resulting value \alpha_2^{new, unclipped}, and then clip it to the feasible interval [L, H]:

        \alpha_2^{new} = H                          if \alpha_2^{new, unclipped} > H
        \alpha_2^{new} = \alpha_2^{new, unclipped}  if L <= \alpha_2^{new, unclipped} <= H
        \alpha_2^{new} = L                          if \alpha_2^{new, unclipped} < L

    Having found \alpha_2, we can go back and find the optimal \alpha_1.

    Please read Platt's paper if you want more details. (A small sketch of this update follows below.)
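    A sketch of this single update, assuming the (negative) quadratic coefficient and the box endpoints L, H are already known; the numbers are illustrative, not computed from kernel entries as in Platt's paper:

        import numpy as np

        def update_alpha2(a, b, L, H):
            # Maximize the concave quadratic a*alpha_2^2 + b*alpha_2 + c (a < 0):
            # zero the derivative 2*a*alpha_2 + b, then clip to [L, H].
            assert a < 0
            unclipped = -b / (2.0 * a)               # alpha_2^{new, unclipped}
            return float(np.clip(unclipped, L, H))   # alpha_2^{new}

        print(update_alpha2(a=-2.0, b=3.0, L=0.0, H=0.5))   # unclipped 0.75, clipped to 0.5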

  • Slide 25/28: Kernel

    1. Why a kernel?
    2. What is a feature space mapping?

        x -> \phi(x)

        K(x, z) = \phi(x)^T \phi(z)    (the kernel function)

    With a kernel, what is more interesting to us?

  • Slide 26/28: We can compute the kernel without calculating the mapping

        max_\alpha  L(\alpha) = \sum_{i=1}^{m} \alpha_i - (1/2) \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j <\phi(x^{(i)}), \phi(x^{(j)})>
        s.t.  \alpha_i >= 0,   i \in [1, m]
              \sum_{i=1}^{m} \alpha_i y^{(i)} = 0

        w^T \phi(x) + b = ( \sum_{i=1}^{m} \alpha_i y^{(i)} \phi(x^{(i)}) )^T \phi(x) + b
                        = \sum_{i=1}^{m} \alpha_i y^{(i)} <\phi(x^{(i)}), \phi(x)> + b

    Now we would need to compute \phi(x) first. That may be expensive. But with a kernel, we can skip that step. Why?

    Because both in training and at test time, only the inner products appear. Replace every <\phi(x^{(i)}), \phi(x)> by the kernel

        K(x, z) = \phi(x)^T \phi(z)

    For example, the Gaussian kernel:

        K(x, z) = exp( - ||x - z||^2 / (2 \sigma^2) )

    (A small sketch of the kernelized decision function follows below.)
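    A sketch of the kernelized decision function with the Gaussian kernel. Note that K(x, z) is computed directly from x and z; the feature map \phi is never evaluated. The training data, alpha, b and sigma are assumed to be given (illustrative, as before):

        import numpy as np

        def gaussian_kernel(x, z, sigma=1.0):
            return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

        def kernel_predict(x, X, y, alpha, b, sigma=1.0):
            # f(x) = sgn( sum_i alpha_i y^{(i)} K(x^{(i)}, x) + b )
            g = sum(alpha[i] * y[i] * gaussian_kernel(X[i], x, sigma) for i in range(len(y))) + b
            return 1 if g >= 0 else -1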

  • Slide 27/28: References

    Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 1998.

    Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.

    Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.

    Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

  • Slide 28/28: People

    Vladimir Vapnik
    J. Platt
    N. Cristianini
    J. Shawe-Taylor
    C. J. C. Burges
    Thorsten Joachims
    etc.