Introduction to Support Vector Machines
Jie Tang
25 July 2005
Introduction
Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory.
Developed in the 1990s to address problems in traditional statistical learning (overfitting, capacity control, ...).
It has achieved the best performance in practical applications such as handwritten digit recognition and text categorization.
Classification Problem
Given a training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, with $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{1, -1\}$, $i = 1, 2, \ldots, l$.
The goal is to learn a function $g(x)$ such that the decision function $f(x) = \operatorname{sgn}(g(x))$ can classify a new input $x$.
So this is a supervised batch learning method.
Linear Classifier
$$g(x) = w^\top x + b$$
$$f(x) = \operatorname{sgn}(g(x)) = \begin{cases} 1, & g(x) \ge 0 \\ -1, & g(x) < 0 \end{cases}$$
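As a concrete illustration, here is a minimal numpy sketch of this decision rule (the function name and toy data are ours, not from the slides):

```python
import numpy as np

def predict_linear(w, b, X):
    """f(x) = sgn(w^T x + b), using the slide's convention sgn(0) = 1."""
    return np.where(X @ w + b >= 0, 1, -1)

# Toy usage: two points on either side of the line x1 + x2 = 0.
w = np.array([1.0, 1.0])
b = 0.0
X = np.array([[2.0, 1.0], [-1.0, -3.0]])
print(predict_linear(w, b, X))  # [ 1 -1]
```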
Maximum Margin Classifier
[Figure: a separating hyperplane $w^\top x + b = 0$ with two points $x^{(1)}$ and $x^{(2)}$ on the margin boundaries; their signed distances to the hyperplane are $\frac{w^\top x^{(1)} + b}{\|w\|}$ and $\frac{w^\top x^{(2)} + b}{\|w\|}$.]
Maximum Margin Classifier
Let us select two points on the two hyperplanes respectively.
The margin is
$$\gamma = \min_i \, y^{(i)} (w^\top x^{(i)} + b).$$
For the two chosen points:
$$w^\top x^{(1)} + b = \delta, \qquad w^\top x^{(2)} + b = -\delta \qquad (\delta > 0)$$
Dividing by $\|w\|$:
$$\frac{w^\top x^{(1)}}{\|w\|} + \frac{b}{\|w\|} = \frac{\delta}{\|w\|}, \qquad \frac{w^\top x^{(2)}}{\|w\|} + \frac{b}{\|w\|} = -\frac{\delta}{\|w\|}$$
(Here $\frac{|b|}{\|w\|}$ is the distance from the hyperplane to the origin.)
Maximum Margin Classifier
[Figure: the same margin picture as before.]
Subtracting the two normalized equations, the distance between the two margin hyperplanes, i.e. the margin, is
$$\frac{w^\top x^{(1)} + b}{\|w\|} - \frac{w^\top x^{(2)} + b}{\|w\|} = \frac{2\delta}{\|w\|},$$
which becomes $\frac{2}{\|w\|}$ once we normalize $\delta = 1$.
Then
$$\max_{w,b} \frac{2}{\|w\|}$$
is equivalent to
$$\min_{w,b} \frac{\|w\|}{2}$$
Note: we have the constraints
$$\text{s.t. } w^\top x^{(i)} + b \ge \delta \text{ (positive examples)}, \qquad w^\top x^{(j)} + b \le -\delta \text{ (negative examples)},$$
which are equivalent to
$$y^{(i)} (w^\top x^{(i)} + b) \ge \delta, \quad i \in [1, m].$$
By scaling $w$ and $b$ we can set $\delta = 1$, giving
$$\min_{w,b} \frac{\|w\|^2}{2} \qquad \text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1, \; i \in [1, m].$$
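Since this primal is a standard quadratic program, a generic QP solver can handle small instances directly. A minimal sketch assuming numpy and cvxopt (with its `solvers.qp` interface) are available; the stacked variable `[w; b]` and the helper name are our choices, not part of the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1, as a generic QP.
    X: (m, n) float array; y: (m,) array of +/-1.0 floats."""
    m, n = X.shape
    # QP variable z = [w_1, ..., w_n, b]; the objective does not penalize b.
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)
    q = np.zeros(n + 1)
    # y_i (w^T x_i + b) >= 1 rewritten as G z <= h with rows -y_i [x_i, 1].
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:n], z[n]  # w, b
```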
Let us review the generalized Lagrangian
Consider the problem
$$\min f(w) \qquad \text{s.t. } g_i(w) \le 0, \; i = 1, \ldots, k; \qquad h_j(w) = 0, \; j = 1, \ldots, l.$$
The Lagrangian is
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w), \qquad \text{s.t. } \alpha_i \ge 0.$$
Let us consider
$$\theta_P(w) = \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta).$$
Note: the constraints must be satisfied; otherwise the max of $L$ will be infinite.
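Spelling out that remark as a case analysis:
$$\theta_P(w) = \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints,} \\ +\infty & \text{otherwise.} \end{cases}$$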
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have
$$\max_{\alpha, \beta} L(w, \alpha, \beta) = f(w).$$
So $\theta_P(w)$ takes the same value as the objective of our problem, $f(w)$, at every feasible point. Therefore, we can consider the minimization problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta).$$
Let us define the optimal value of the primal problem as $p^*$.
Then, let us define the dual problem:
$$\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta)$$
$$\max_{\alpha, \beta :\, \alpha_i \ge 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta :\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$$
The two problems are similar: the same Lagrangian, with the order of min and max exchanged.
Now we define the optimal value of the dual problem as $d^*$.
Relationship between Primal and Dual Problems
$$d^* = \max_{\alpha, \beta :\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta :\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$$
Why? This is the general fact that a max-min never exceeds the corresponding min-max (weak duality); just remember it.
Then, under certain conditions, $d^* = p^*$, and we can solve the dual problem in lieu of the primal problem.
What are these conditions?
The famous KKT conditions
Karush-Kuhn-Tucker conditions
$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, n]$$
$$\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i \in [1, l]$$
$$\alpha_i^* g_i(w^*) = 0, \quad i \in [1, k]$$
$$g_i(w^*) \le 0, \quad i \in [1, k]$$
$$\alpha_i^* \ge 0, \quad i \in [1, k]$$
What does it imply?
The famous KKT conditions
Karush-Kuhn-Tucker conditions
$$\alpha_i^* g_i(w^*) = 0, \quad i \in [1, k]$$
$$\alpha_i^* > 0 \;\Rightarrow\; g_i(w^*) = 0$$
$$g_i(w^*) < 0 \;\Rightarrow\; \alpha_i^* = 0$$
Very important!
Return to our problem
$$\min_{w,b} \frac{\|w\|^2}{2} \qquad \text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1, \; i \in [1, m]$$
The Lagrangian:
$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 \right]$$
Let us first solve $\min_w L$ with respect to $w$ and $b$:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Substitute the two equations back into $L(w, b, \alpha)$.
We have
$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^\top x^{(j)}$$
Then we have the following maximization problem with respect to $\alpha$:
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t. } \alpha_i \ge 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Now we have only one set of parameters: $\alpha$. We can solve for $\alpha$, then recover $w$, and then $b$, because:
$$b^* = -\frac{\max_{i : y^{(i)} = -1} w^{*\top} x^{(i)} + \min_{i : y^{(i)} = 1} w^{*\top} x^{(i)}}{2}$$
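Given a dual solution $\alpha$ from any QP solver, recovering $w$ and $b$ takes a couple of lines of numpy (a sketch; the helper name is ours):

```python
import numpy as np

def recover_w_b(alpha, X, y):
    """Recover the primal (w, b) from dual variables alpha."""
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    scores = X @ w                           # w^T x_i for every training point
    b = -(scores[y == -1].max() + scores[y == 1].min()) / 2.0
    return w, b
```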
How to predict
For a new sample $x$, we can predict it by the sign of:
$$w^\top x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^\top x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$
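In code, the dual-form prediction reads as follows (a sketch; only support vectors, i.e. points with $\alpha_i > 0$, actually contribute to the sum):

```python
import numpy as np

def predict_dual(alpha, X, y, b, x_new):
    """sgn( sum_i alpha_i y_i <x_i, x_new> + b ) for a new sample x_new."""
    g = (alpha * y) @ (X @ x_new) + b
    return 1 if g >= 0 else -1
```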
Non-separable case
What is the non-separable case? I will not give an example; I suppose you know it.
Then the optimization problem becomes:
$$\min_{w,b} \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t. } y^{(i)} (w^\top x^{(i)} + b) \ge 1 - \xi_i, \; i \in [1, m], \qquad \xi_i \ge 0, \; i \in [1, m]$$
Next, form the Lagrangian:
$$L(w, b, \xi, \alpha, r) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$$
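For intuition about the slack penalty $C$, here is a quick experiment with an off-the-shelf solver (scikit-learn is our choice of tool, not the slides'): small $C$ tolerates more margin violations, while large $C$ approaches the hard-margin behavior.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(C=C, kernel="linear").fit(X, y)
    print(C, clf.n_support_)  # per-class support-vector counts; typically shrink as C grows
```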
Dual form
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t. } 0 \le \alpha_i \le C, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
What is the difference from the previous form? Only that each $\alpha_i$ is now upper-bounded by $C$.
Also note the following conditions:
$$\alpha_i = 0 \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) \ge 1$$
$$\alpha_i = C \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) \le 1$$
$$0 < \alpha_i < C \;\Rightarrow\; y^{(i)} (w^\top x^{(i)} + b) = 1$$
How to train an SVM = how to solve the optimization problem
Sequential minimal optimization (SMO) algorithm, due to John Platt.
First, let us introduce the coordinate ascent algorithm:
Loop until convergence: {
    For i = 1, ..., m {
        α_i := argmax_{α_i} L(α_1, ..., α_{i-1}, α_i, α_{i+1}, ..., α_m)
    }
}
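A runnable illustration of coordinate ascent on a simple concave quadratic (the objective here is an arbitrary stand-in for $L$, not the SVM dual):

```python
import numpy as np

def coordinate_ascent(Q, c, n_sweeps=50):
    """Maximize f(a) = c^T a - (1/2) a^T Q a one coordinate at a time.
    With the others fixed, the optimum in a_i solves c_i - (Q a)_i = 0."""
    a = np.zeros(len(c))
    for _ in range(n_sweeps):
        for i in range(len(c)):
            a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
    return a

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite, so f is concave
c = np.array([1.0, 1.0])
print(coordinate_ascent(Q, c))           # approaches the closed form Q^{-1} c
print(np.linalg.solve(Q, c))
```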
SMO
In the SVM dual we cannot optimize a single $\alpha_i$ in isolation: the constraint $\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$ determines it from the others. Fixing $\alpha_3, \ldots, \alpha_m$:
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta$$
$$\alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}$$
$$L(\alpha) = L\big( (\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \ldots, \alpha_m \big)$$
Change the algorithm as follows; this is just SMO:
Repeat until convergence {
    1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
    2. Re-optimize L(α) with respect to α_i and α_j, while holding all the other α_k fixed.
}
SMO(2)
$$L(\alpha) = L\big( (\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \ldots, \alpha_m \big)$$
This is a quadratic function in $\alpha_2$, i.e. it can be written as
$$a \alpha_2^2 + b \alpha_2 + c$$
Solving $\alpha_2$
For the quadratic function $a \alpha_2^2 + b \alpha_2 + c$, we can simply solve it by setting its derivative to zero. Let us call the resulting value $\alpha_2^{new,unclipped}$.
The constraints confine $\alpha_2$ to an interval $[L, H]$, so we clip:
$$\alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped} & \text{if } L \le \alpha_2^{new,unclipped} \le H \\ L & \text{if } \alpha_2^{new,unclipped} < L \end{cases}$$
Having found $\alpha_2$, we can go back and find the optimal $\alpha_1$.
Please read Platt's paper if you want more details.
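Putting the pieces together, here is a sketch of the analytic pair update at the heart of SMO, using the standard formulas from Platt's paper. `E` is assumed to hold the prediction errors $E_k = g(x^{(k)}) - y^{(k)}$ computed elsewhere, and `K` the Gram matrix; the selection heuristics and the threshold update are omitted.

```python
import numpy as np

def smo_pair_update(alpha, y, K, E, i, j, C):
    """Re-optimize (alpha_i, alpha_j) analytically, holding the rest fixed."""
    if y[i] != y[j]:                           # feasible segment for alpha_j
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature along the segment
    if eta <= 0:
        return alpha                           # skip degenerate pairs in this sketch
    aj = np.clip(alpha[j] + y[j] * (E[i] - E[j]) / eta, L, H)  # the clipping step above
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)  # keeps sum alpha_k y_k fixed
    out = alpha.copy()
    out[i], out[j] = ai, aj
    return out
```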
Kernel
1. Why a kernel?
2. What is a feature space mapping?
$$x \mapsto \phi(x)$$
$$K(x, z) = \phi(x)^\top \phi(z) \qquad \text{(the kernel function)}$$
With a kernel, what is more interesting to us?
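A classic small example (in the spirit of Ng's notes): for $K(x, z) = (x^\top z)^2$, the explicit mapping $\phi$ lists all pairwise coordinate products, yet the kernel value never materializes it:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2: all products x_i * x_j."""
    return np.outer(x, x).ravel()    # n^2 features

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])
print(phi(x) @ phi(z))               # 1024.0, via the n^2-dimensional mapping
print((x @ z) ** 2)                  # 1024.0, with O(n) work and no mapping
```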
We can compute the kernel without calculating the mapping
$$\max_\alpha L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$$
$$\text{s.t. } \alpha_i \ge 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
$$w^\top \phi(x) + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} \phi(x^{(i)}) \right)^\top \phi(x) + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b$$
Now, we would need to compute $\phi(x)$ first. That may be expensive.
But with a kernel, we can skip that step.
Why? Because both in training and in testing, the mapping appears only through the inner products $\langle \phi(x^{(i)}), \phi(x) \rangle$. Replace every occurrence of $\langle x^{(i)}, x \rangle$, i.e. $\langle \phi(x^{(i)}), \phi(x) \rangle$, by
$$K(x, z) = \phi(x)^\top \phi(z).$$
For example:
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
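The Gaussian kernel above is a one-liner, and the Gram matrix a dual solver would consume follows directly ($\sigma$ is a free bandwidth parameter; the names here are ours):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

X = np.random.default_rng(1).normal(size=(5, 3))
gram = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(gram.shape, gram[0, 0])  # (5, 5); diagonal entries are exactly 1.0
```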
References
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 1998.
Andrew Ng. CS229 Lecture Notes, lectures from 10/19/03 to 10/26/03. Part V: Support Vector Machines.
Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
People
Vladimir Vapnik
John Platt
Nello Cristianini
John Shawe-Taylor
Christopher J. C. Burges
Thorsten Joachims
etc.