Chapter 2
A Brief Introduction to Support Vector Machine (SVM)
January 25, 2011
Overview
• A new, powerful method for 2-class classification
  – Original idea: Vapnik, 1965, for linear classifiers
  – SVM: Cortes and Vapnik, 1995
  – Became very hot since 2001
• Better generalization (less overfitting)
• Can do linearly unseparable classification with global optimality
• Key ideas
  – Use a kernel function to transform low-dimensional training samples to a higher dimension (for the linear separability problem)
  – Use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum)
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: input x → classifier f(x, w, b) → estimate y_est; 2-D data points labeled +1 and −1]
How would you classify this data?
Support Vector Machines: Slide 3
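The decision rule on this slide can be sketched in a few lines of Python. The weight vector w, bias b, and the two test points below are made-up values for illustration, not a trained classifier.

```python
def linear_classify(x, w, b):
    """Classify a point x with the rule f(x, w, b) = sign(w . x - b)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if activation >= 0 else -1

# Made-up 2-D example: the boundary is the line x1 + x2 = 1.
w, b = (1.0, 1.0), 1.0
print(linear_classify((2.0, 2.0), w, b))    # → 1
print(linear_classify((-1.0, -1.0), w, b))  # → -1
```

Any (w, b) defines one candidate boundary; the slides that follow ask which of the many boundaries that classify the training data correctly is the best one.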
[Slides 4–6 repeat the same figure and question, each showing a different candidate linear boundary drawn through the data.]
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: several candidate linear boundaries through points labeled +1 and −1]
Any of these would be fine...
...but which is best?
Support Vector Machines: Slide 7
Classifier Margin
f(x, w, b) = sign(w · x − b)
[Figure: points labeled +1 and −1, with a band of width M around the boundary]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Support Vector Machines: Slide 8
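The definition above can be computed directly: the margin is twice the distance from the boundary to the nearest datapoint. The points and boundary below are made-up values for illustration.

```python
import math

def margin_width(points, w, b):
    """Width the boundary w.x + b = 0 could be increased by before
    hitting a datapoint: twice the smallest point-to-boundary distance."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    nearest = min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
                  for x in points)
    return 2.0 * nearest

# Made-up example: boundary is the vertical line x1 = 0;
# the nearest points sit at x1 = +1 and x1 = -1.
points = [(1.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, 5.0)]
print(margin_width(points, (1.0, 0.0), 0.0))  # → 2.0
```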
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the boundary with the widest margin through points labeled +1 and −1]
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM: Linear SVM).
Support Vector Machines: Slide 9
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the maximum margin boundary, with the support vectors highlighted]
The maximum margin linear classifier is the linear classifier with the maximum margin.
Support vectors are those datapoints that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM: Linear SVM).
Support Vector Machines: Slide 10
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
3. It also helps generalization.
4. There's some theory that this is a good thing.
5. Empirically it works very, very well.
Support Vector Machines: Slide 11
Computing the Margin Width
M = margin width
How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• M = 2 / √(w · w)
Support Vector Machines: Slide 12
Copyright © 2001, 2003, Andrew W. Moore
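The formula M = 2/√(w · w) follows because, along the unit normal w/|w|, the value of w · x changes by |w| per unit of distance, so moving from the minus-plane (value −1) to the plus-plane (value +1) takes 2/|w| units. A quick numeric check, with a made-up w chosen so the answer is easy to verify by hand:

```python
import math

def plane_gap(w):
    """Distance between { x : w.x + b = +1 } and { x : w.x + b = -1 }.
    Along the unit normal, w.x changes by |w| per unit of distance,
    so crossing from -1 to +1 takes 2 / |w| = 2 / sqrt(w.w) units."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Made-up example: w = (3, 4), so |w| = 5 and M = 2/5.
print(plane_gap((3.0, 4.0)))  # → 0.4
```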
Learning the Maximum Margin Classifier
M = margin width = 2 / √(w · w)
[Figure: a point x+ on the plus-plane and a point x− on the minus-plane]
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated annealing?
Support Vector Machines: Slide 13
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.
• Minimize both w · w (to maximize M) and the misclassification error.
Support Vector Machines: Slide 14
Quadratic Programming

Find  arg max_u  c + dᵀu + (uᵀRu)/2   (quadratic criterion)

Subject to n additional linear inequality constraints:
  a11·u1 + a12·u2 + … + a1m·um ≤ b1
  a21·u1 + a22·u2 + … + a2m·um ≤ b2
  ⋮
  an1·u1 + an2·u2 + … + anm·um ≤ bn

And subject to e additional linear equality constraints:
  a(n+1)1·u1 + a(n+1)2·u2 + … + a(n+1)m·um = b(n+1)
  ⋮
  a(n+e)1·u1 + a(n+e)2·u2 + … + a(n+e)m·um = b(n+e)

Support Vector Machines: Slide 15
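To make the general form above concrete, here is a toy one-variable instance small enough to solve by hand (all coefficient values are made up, and this is not a general-purpose QP solver): with one variable and one inequality constraint, a concave quadratic criterion is maximized either at its unconstrained peak or at the constraint boundary, whichever the constraint allows.

```python
def solve_toy_qp():
    """Maximize c + d*u + r*u*u/2 subject to u <= ub, for a single
    variable u with r < 0 (concave criterion).  The unconstrained peak
    is at u* = -d/r; the inequality constraint simply clips it."""
    c, d, r = 3.0, 2.0, -2.0   # criterion: 3 + 2u - u^2, peak at u = 1
    ub = 0.5                   # single linear constraint: u <= 0.5
    u_star = -d / r            # unconstrained maximizer
    u = min(u_star, ub)        # enforce the constraint
    value = c + d * u + r * u * u / 2.0
    return u, value

u, value = solve_toy_qp()
print(u, value)  # → 0.5 3.75
```

Real SVM training problems have many variables and many constraints, which is why a proper QP solver is used; the point here is only the shape of the problem.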
Learning Maximum Margin with Noise
M = 2 / √(w · w)
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?
Support Vector Machines: Slide 16
Learning Maximum Margin with Noise
M = 2 / √(w · w)
[Figure: the margin, with error points at distances ε1, ε2, ε7 from their correct zones]
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = ±1.
What should our quadratic optimization criterion be?
  Minimize  (1/2) w · w + C Σk εk
How many constraints will we have? 2R
What should they be?
  w · xk + b ≥ +1 − εk  if yk = +1
  w · xk + b ≤ −1 + εk  if yk = −1
  εk ≥ 0 for all k
εk = distance of error points to their correct place
Support Vector Machines: Slide 17
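The slides solve this criterion with a QP solver. As an illustration only (not the method the slides describe), the same objective can be minimized approximately by subgradient descent on its equivalent unconstrained "hinge loss" form, ½ w·w + C Σ max(0, 1 − yk(w·xk + b)). The toy data, C, learning rate, and epoch count below are all made up.

```python
def train_soft_margin(points, labels, C=10.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*w.w + C*sum(max(0, 1 - y*(w.x + b)))."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0            # gradient of the 0.5*w.w term
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                for i in range(dim):     # subgradient of an active hinge term
                    gw[i] -= C * y * x[i]
                gb -= C * y
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

# Made-up separable data: 1-D points embedded in 2-D.
pts = [(-2.0, 1.0), (-1.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys  = [-1, -1, +1, +1]
w, b = train_soft_margin(pts, ys)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b) * y > 0
          for x, y in zip(pts, ys)))  # → True
```

Unlike QP, this gives only an approximate solution and no global-optimality guarantee; it is shown purely to make the optimization criterion tangible.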
From LSVM to general SVM
Suppose we are in 1 dimension. What would SVMs do with this data?
[Figure: separable points on a line around x = 0]
Support Vector Machines: Slide 18
Not a big surprise
[Figure: a positive "plane" and a negative "plane" on either side of a threshold near x = 0]
Support Vector Machines: Slide 19
Harder 1-dimensional dataset
The points are not linearly separable. What can we do now?
[Figure: interleaved positive and negative points on a line around x = 0]
Support Vector Machines: Slide 20
Harder 1-dimensional dataset
Transform the data points from 1-dim to 2-dim by some nonlinear basis function (called a kernel function):
  zk = (xk, xk²)
These points are sometimes called feature vectors.
Support Vector Machines: Slide 21
Harder 1-dimensional dataset
These points are linearly separable now!
The boundary can be found by QP.
  zk = (xk, xk²)
Support Vector Machines: Slide 22
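The lifting trick from the last two slides is easy to check numerically. The dataset below is made up to match the slide's picture: negatives in the middle, positives on both sides, not separable by any single threshold in 1-D.

```python
def lift(x):
    """Map a 1-D point into 2-D feature space: z = (x, x^2)."""
    return (x, x * x)

# Made-up harder 1-D data: negatives in the middle, positives on the sides.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [+1, +1, -1, -1, -1, +1, +1]

# In 1-D no threshold separates the classes, but after lifting, the
# horizontal line z2 = 2 does: x^2 > 2 for every positive point and
# x^2 < 2 for every negative point.
zs = [lift(x) for x in xs]
print(all((z[1] > 2.0) == (y == +1) for z, y in zip(zs, ys)))  # → True
```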
Common SVM basis functions
zk = (polynomial terms of xk of degree 1 to q)
zk = (radial basis functions of xk):
  zk[j] = φj(xk) = KernelFn(|xk − cj| / KW)
zk = (sigmoid functions of xk)
Support Vector Machines: Slide 23
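The three families above can be sketched for scalar inputs as follows. The centers cj, kernel width KW, degree q, and the Gaussian choice of KernelFn are all assumed example values, not prescribed by the slide.

```python
import math

def poly_features(x, q):
    """Polynomial terms of a scalar x, degree 1 to q."""
    return [x ** d for d in range(1, q + 1)]

def rbf_features(x, centers, kw):
    """Radial basis functions: z[j] = KernelFn(|x - c_j| / KW), here
    with a Gaussian shape for KernelFn (an assumed, common choice)."""
    return [math.exp(-((x - c) / kw) ** 2) for c in centers]

def sigmoid_features(x, offsets):
    """Sigmoid (tanh) functions of x at made-up offsets."""
    return [math.tanh(x - o) for o in offsets]

x = 1.5
print(poly_features(x, 3))               # → [1.5, 2.25, 3.375]
print(rbf_features(x, [0.0, 1.5], 1.0))  # second feature peaks at its center
print(sigmoid_features(x, [0.0, 2.0]))
```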
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2).
• What can be done?
• Answer: with N classes, learn N SVMs
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  :
  • SVM N learns "Output == N" vs "Output != N"
• Then, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
Support Vector Machines: Slide 24
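The one-vs-rest scheme above can be sketched as follows. The three scorers are made-up linear decision functions standing in for trained SVMs, chosen by hand rather than produced by QP.

```python
def one_vs_rest_predict(x, scorers):
    """Score x with each per-class "Output == i vs rest" classifier and
    return the class whose decision value reaches furthest into the
    positive region, i.e. the one with the largest score."""
    scores = [f(x) for f in scorers]
    return max(range(len(scores)), key=lambda i: scores[i]) + 1

# Made-up stand-ins for N = 3 trained SVMs (hand-picked, not learned).
scorers = [
    lambda x: -x[0] + 1.0,             # class 1: small x0
    lambda x: 1.0 - abs(x[0] - 2.0),   # class 2: x0 near 2
    lambda x: x[0] - 3.0,              # class 3: large x0
]
print(one_vs_rest_predict((0.0,), scorers))  # → 1
print(one_vs_rest_predict((2.0,), scorers))  # → 2
print(one_vs_rest_predict((5.0,), scorers))  # → 3
```

One caveat worth noting: comparing raw decision values across independently trained classifiers is a heuristic, since their scores are not on a common scale.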
Compare SVM with NN
• Similarity
  − SVM + sigmoid kernel ~ two-layer feedforward NN
  − SVM + Gaussian kernel ~ RBF network
  − For most problems, SVM and NN have similar performance
• Advantages
  − Based on sound mathematical theory
  − Learning result is more robust
  − Over-fitting is not common
  − Not trapped in local minima (because of QP)
  − Fewer parameters to consider (kernel, error cost C)
  − Works well with fewer training samples (the number of support vectors does not matter much)
• Disadvantages
  − Problem needs to be formulated as 2-class classification
  − Learning takes a long time (QP optimization)
Support Vector Machines: Slide 25
Advantages and Disadvantages of SVM
• Advantages
  • Prediction accuracy is generally high
  • Robust: works when training examples contain errors
  • Fast evaluation of the learned target function
• Criticism
  • Long training time
  • Difficult to understand the learned function (weights)
  • Not easy to incorporate domain knowledge
Support Vector Machines: Slide 26
SVM Related Links
• Representative implementations
  • LIBSVM: an efficient implementation of SVM; supports multi-class classification, nu-SVM, and one-class SVM, and includes interfaces for Java, Python, etc.
  • SVM-light: simpler, but its performance is not better than LIBSVM's; supports only binary classification and only the C language
  • SVM-torch: another recent implementation, also written in C
Support Vector Machines: Slide 27