Introduction to Support Vector Machines
Jie Tang
25 July 2005
Introduction
Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory.
Developed in the 1990s to address problems in traditional statistical learning (overfitting, capacity control, ...).
It has achieved the best performance in practical applications such as handwritten digit recognition and text categorization.
Classification Problem
Given a training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, with $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{1, -1\}$, $i = 1, 2, \ldots, l$.
The goal is to learn a function $g(x)$ such that the decision function $f(x) = \operatorname{sgn}(g(x))$ can classify a new input $x$.
So this is a supervised batch learning method.
Linear Classifier
$$g(x) = w^\top x + b$$
$$f(x) = \operatorname{sgn}(g(x)) = \begin{cases} 1, & g(x) \ge 0 \\ -1, & g(x) < 0 \end{cases}$$
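As a concrete illustration, here is a minimal numpy sketch of this decision rule (the function name and toy data are ours, not from the slides):

```python
import numpy as np

def predict_linear(w, b, X):
    """f(x) = sgn(w^T x + b), using the slide's convention sgn(0) = 1."""
    return np.where(X @ w + b >= 0, 1, -1)

# Toy usage: two points on either side of the line x1 + x2 = 0.
w = np.array([1.0, 1.0])
b = 0.0
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
print(predict_linear(w, b, X))  # [ 1 -1]
```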
Maximum Margin Classifier
[Figure: a separating hyperplane $w^\top x + b = 0$ with two points $x^{(1)}$ and $x^{(2)}$ on the margin boundaries; their signed distances to the hyperplane are $\frac{w^\top x^{(1)} + b}{\|w\|}$ and $\frac{w^\top x^{(2)} + b}{\|w\|}$.]
Maximum Margin Classifier
Let us select two points on the two hyperplanes respectively.
The margin is
$$\gamma = \min_i \, y^{(i)} (w^\top x^{(i)} + b).$$
For the two chosen points:
$$w^\top x^{(1)} + b = \delta, \qquad w^\top x^{(2)} + b = -\delta \qquad (\delta > 0)$$
Dividing by $\|w\|$:
$$\frac{w^\top x^{(1)}}{\|w\|} + \frac{b}{\|w\|} = \frac{\delta}{\|w\|}, \qquad \frac{w^\top x^{(2)}}{\|w\|} + \frac{b}{\|w\|} = -\frac{\delta}{\|w\|}$$
(Here $\frac{|b|}{\|w\|}$ is the distance from the hyperplane to the origin.)
Maximum Margin Classifier
[Figure: the same margin picture as before.]
Subtracting the two normalized equations, the distance between the two margin hyperplanes, i.e. the margin, is
$$\frac{w^\top x^{(1)} + b}{\|w\|} - \frac{w^\top x^{(2)} + b}{\|w\|} = \frac{2\delta}{\|w\|},$$
which becomes $\frac{2}{\|w\|}$ once we normalize $\delta = 1$.
Then
$$\max_{w,b} \frac{2}{\|w\|}$$
is equivalent to
$$\min_{w,b} \frac{\|w\|}{2}$$
Note: we have the constraints
$$\text{s.t. } w^\top x^{(i)} + b \ge \delta \text{ (positive examples)}, \qquad w^\top x^{(j)} + b \le -\delta \text{ (negative examples)},$$
which are equivalent to
$$y^{(i)} (w^\top x^{(i)} + b) \ge \delta, \quad i \in [1, m].$$
By scaling $w$ and $b$ we can set $\delta = 1$, giving
$$\min_{w,b} \frac{\|w\|^2}{2} \qquad \text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1, \; i \in [1, m].$$
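Since this primal is a standard quadratic program, a generic QP solver can handle small instances directly. A minimal sketch assuming numpy and cvxopt (with its `solvers.qp` interface) are available; the stacked variable `[w; b]` and the helper name are our choices, not part of the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1, as a generic QP.
    X: (m, n) float array; y: (m,) array of +/-1.0 floats."""
    m, n = X.shape
    # QP variable z = [w_1, ..., w_n, b]; the objective does not penalize b.
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)
    q = np.zeros(n + 1)
    # y_i (w^T x_i + b) >= 1 rewritten as G z <= h with rows -y_i [x_i, 1].
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]  # w, b
```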
Let us review the generalized Lagrangian
Consider the problem
$$\min f(w) \qquad \text{s.t. } g_i(w) \le 0, \; i = 1, \ldots, k; \qquad h_j(w) = 0, \; j = 1, \ldots, l.$$
The Lagrangian is
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w), \qquad \text{s.t. } \alpha_i \ge 0.$$
Let us consider
$$\theta_P(w) = \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta).$$
Note: the constraints must be satisfied; otherwise the max of $L$ will be infinite.
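Spelling out that remark as a case analysis:
$$\theta_P(w) = \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints,} \\ +\infty & \text{otherwise.} \end{cases}$$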
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have
$$\max_{\alpha, \beta} L(w, \alpha, \beta) = f(w).$$
So $\theta_P(w)$ takes the same value as the objective of our problem, $f(w)$, at every feasible point. Therefore, we can consider the minimization problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta).$$
Let us define the optimal value of the primal problem as $p^*$.
Then, let us define the dual problem:
$$\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)$$
$$\max_{\alpha, \beta :\, \alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta :\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$$
The two problems are similar: the same Lagrangian, with the order of min and max exchanged.
Now we define the optimal value of the dual problem as $d^*$.
Relationship between Primal and Dual Problems
$$d^* = \max_{\alpha, \beta :\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$$
Why? This is the general fact that a max-min never exceeds the corresponding min-max (weak duality); just remember it.
Then, under certain conditions, $d^* = p^*$, and we can solve the dual problem in lieu of the primal problem.
What are these conditions?
The famous KKT conditions
Karush-Kuhn-Tucker conditions
$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, n]$$
$$\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, l]$$
$$\alpha_i^* g_i(w^*) = 0, \quad i \in [1, k]$$
$$g_i(w^*) \le 0, \quad i \in [1, k]$$
$$\alpha_i^* \ge 0, \quad i \in [1, k]$$
What does it imply?
The famous KKT conditions
Karush-Kuhn-Tucker conditions
$$\alpha_i^* g_i(w^*) = 0, \quad i \in [1, k]$$
$$\alpha_i^* > 0 \;\Rightarrow\; g_i(w^*) = 0$$
$$g_i(w^*) < 0 \;\Rightarrow\; \alpha_i^* = 0$$
Very important!
Return to our problem
$$\min_{w,b} \frac{\|w\|^2}{2} \qquad \text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1, \; i \in [1, m]$$
The Lagrangian:
$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 \right]$$
Let us first solve $\min_w L$ with respect to $w$ and $b$:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Substitute the two equations back into $L(w, b, \alpha)$.
We have
$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^\top x^{(j)}$$
Then we have the following maximization problem with respect to $\alpha$:
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t. } \alpha_i \ge 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Now we have only one set of parameters: $\alpha$. We can solve for $\alpha$, then recover $w$, and then $b$, because:
$$b^* = -\frac{\max_{i : y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i : y^{(i)} = 1} w^{*\top} x^{(i)}}{2}$$
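Given a dual solution $\alpha$ from any QP solver, recovering $w$ and $b$ takes a couple of lines of numpy (a sketch; the helper name is ours):

```python
import numpy as np

def recover_w_b(alpha, X, y):
    """Recover the primal (w, b) from dual variables alpha."""
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    scores = X @ w                           # w^T x_i for every training point
    b = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0
    return w, b
```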
How to predict
For a new sample $x$, we can predict it by the sign of:
$$w^\top x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^\top x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$
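In code, the dual-form prediction reads as follows (a sketch; only support vectors, i.e. points with $\alpha_i > 0$, actually contribute to the sum):

```python
import numpy as np

def predict_dual(alpha, X, y, b, x_new):
    """sgn( sum_i alpha_i y_i <x_i, x_new> + b ) for a new sample x_new."""
    g = (alpha * y) @ (X @ x_new) + b
    return 1 if g >= 0 else -1
```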
Non-separable case
What is the non-separable case? I will not give an example; I suppose you know it.
Then the optimization problem becomes:
$$\min_{w,b} \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1 - \xi_i, \; i \in [1, m], \qquad \xi_i \ge 0, \; i \in [1, m]$$
Next, form the Lagrangian:
$$L(w, b, \xi, \alpha, r) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$$
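For intuition about the slack penalty $C$, here is a quick experiment with an off-the-shelf solver (scikit-learn is our choice of tool, not the slides'): small $C$ tolerates more margin violations, while large $C$ approaches the hard-margin behavior.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="linear").fit(X, y)
    print(C, clf.n_support_)  # per-class support-vector counts; typically shrink as C grows
```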
Dual form
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t. } 0 \le \alpha_i \le C, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
What is the difference from the previous form? Only that each $\alpha_i$ is now upper-bounded by $C$.
Also note the following conditions:
$$\alpha_i = 0 \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) \ge 1$$
$$\alpha_i = C \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) \le 1$$
$$0 < \alpha_i < C \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) = 1$$
How to train an SVM = how to solve the optimization problem
Sequential minimal optimization (SMO) algorithm, due to John Platt.
First, let us introduce the coordinate ascent algorithm:
Loop until convergence: {
    For i = 1, ..., m {
        α_i := argmax_{α_i} L(α_1, ..., α_{i-1}, α_i, α_{i+1}, ..., α_m)
    }
}
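A runnable illustration of coordinate ascent on a simple concave quadratic (the objective here is an arbitrary stand-in for $L$, not the SVM dual):

```python
import numpy as np

def coordinate_ascent(Q, c, n_sweeps=50):
    """Maximize f(a) = c^T a - (1/2) a^T Q a one coordinate at a time.
    With the others fixed, the optimum in a_i solves c_i - (Q a)_i = 0."""
    a = np.zeros(len(c))
    for _ in range(n_sweeps):
        for i in range(len(c)):
            a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
    return a

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite, so f is concave
c = np.array([1.0, 1.0])
print(coordinate_ascent(Q, c))           # approaches the closed form Q^{-1} c
print(np.linalg.solve(Q, c))
```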
SMO
In the SVM dual we cannot optimize a single $\alpha_i$ in isolation: the constraint $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ determines it from the others. Fixing $\alpha_3, \ldots, \alpha_m$:
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta$$
$$\alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}$$
$$L(\alpha) = L\big( (\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \ldots, \alpha_m \big)$$
Change the algorithm as follows; this is just SMO:
Repeat until convergence {
    1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
    2. Re-optimize L(α) with respect to α_i and α_j, while holding all the other α_k fixed.
}
SMO(2)
$$L(\alpha) = L\big( (\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \ldots, \alpha_m \big)$$
This is a quadratic function in $\alpha_2$, i.e. it can be written as
$$a \alpha_2^2 + b \alpha_2 + c$$
Solving $\alpha_2$
For the quadratic function $a \alpha_2^2 + b \alpha_2 + c$, we can simply solve it by setting its derivative to zero. Let us call the resulting value $\alpha_2^{new,unclipped}$.
The constraints confine $\alpha_2$ to an interval $[L, H]$, so we clip:
$$\alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped} & \text{if } L \le \alpha_2^{new,unclipped} \le H \\ L & \text{if } \alpha_2^{new,unclipped} < L \end{cases}$$
Having found $\alpha_2$, we can go back and find the optimal $\alpha_1$.
Please read Platt's paper if you want more details.
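Putting the pieces together, here is a sketch of the analytic pair update at the heart of SMO, using the standard formulas from Platt's paper. `E` is assumed to hold the prediction errors $E_k = g(x^{(k)}) - y^{(k)}$ computed elsewhere, and `K` the Gram matrix; the selection heuristics and the threshold update are omitted.

```python
import numpy as np

def smo_pair_update(alpha, y, K, E, i, j, C):
    """Re-optimize (alpha_i, alpha_j) analytically, holding the rest fixed."""
    if y[i] != y[j]:                           # feasible segment for alpha_j
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature along the segment
    if eta <= 0:
        return alpha                           # skip degenerate pairs in this sketch
    aj = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)  # the clipping step above
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)  # keeps sum alpha_k y_k fixed
    out = alpha.copy()
    out[i], out[j] = ai, aj
    return out
```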
Kernel
1. Why a kernel?
2. What is a feature space mapping?
$$x \mapsto \phi(x)$$
$$K(x, z) = \phi(x)^\top \phi(z) \qquad \text{(the kernel function)}$$
With a kernel, what is more interesting to us?
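A classic small example (in the spirit of Ng's notes): for $K(x, z) = (x^\top z)^2$, the explicit mapping $\phi$ lists all pairwise coordinate products, yet the kernel value never materializes it:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2: all products x_i * x_j."""
    return np.outer(x, x).ravel()    # n^2 features

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])
print(phi(x) @ phi(z))               # 1024.0, via the n^2-dimensional mapping
print((x @ z) ** 2)                  # 1024.0, with O(n) work and no mapping
```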
We can compute the kernel without calculating the mapping
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$$
$$\text{s.t. } \alpha_i \ge 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
$$w^\top \phi(x) + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} \phi(x^{(i)}) \right)^\top \phi(x) + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b$$
Now, we would need to compute $\phi(x)$ first. That may be expensive.
But with a kernel, we can skip that step.
Why? Because both in training and in testing, the mapping appears only through the inner products $\langle \phi(x^{(i)}), \phi(x) \rangle$. Replace every occurrence of $\langle x^{(i)}, x \rangle$, i.e. $\langle \phi(x^{(i)}), \phi(x) \rangle$, by
$$K(x, z) = \phi(x)^\top \phi(z).$$
For example:
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
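The Gaussian kernel above is a one-liner, and the Gram matrix a dual solver would consume follows directly ($\sigma$ is a free bandwidth parameter; the names here are ours):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

X = np.random.default_rng(1).normal(size=(5, 3))
gram = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(gram.shape, gram[0, 0])  # (5, 5); diagonal entries are exactly 1.0
```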
References
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 1998.
Andrew Ng. CS229 Lecture Notes, lectures from 10/19/03 to 10/26/03. Part V: Support Vector Machines.
Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
People
Vladimir Vapnik
John Platt
Nello Cristianini
John Shawe-Taylor
Christopher J. C. Burges
Thorsten Joachims
etc.