An Introduction to Support Vector Machine (SVM)
Famous Examples that helped SVM become popular
Classification
Every day, all the time, we classify things.
E.g. crossing the street: Is there a car coming? At what speed? How far is it to the other side? Classification: safe to walk or not!
Discriminant Function
It can be an arbitrary function of x, such as:
Nearest Neighbor
Decision Tree
Linear functions: g(x) = w^T x + b (see the sketch after this list)
Nonlinear functions
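For concreteness, a minimal Python sketch of such a linear discriminant; the weight vector w and bias b below are made-up values, not learned from data:

```python
import numpy as np

# Minimal sketch of a linear discriminant g(x) = w^T x + b.
# w and b are illustrative values, not learned from data.
w = np.array([2.0, -1.0])
b = 0.5

def g(x):
    """Positive on one side of the hyperplane, negative on the other."""
    return w @ x + b

x = np.array([1.0, 0.5])
label = 1 if g(x) > 0 else -1
print(g(x), label)   # 2.0, 1
```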
Background – Classification Problem
Applications:
Personal Identification
Credit Rating
Medical Diagnosis
Text Categorization
Denial of Service Detection
Character Recognition
Biometrics
Image Classification
Classification Formulation
Given:
an input space X
a set of classes C = {c_1, c_2, ..., c_c}
the classification problem is to define a mapping f: X → C where each x in X is assigned to one class.
This mapping function is called a Decision Function.
Decision Function
The basic problem in classification is to find c decision functions
d_1(x), d_2(x), ..., d_c(x)
with the property that, if a pattern x belongs to class i, then
d_i(x) > d_j(x), j = 1, 2, ..., c; j ≠ i,
where d_i(x) is some similarity measure between x and class i, such as a distance or a probability.
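As one concrete choice of d_i(x), here is a minimal sketch of a minimum-distance classifier, d_i(x) = −||x − μ_i||; the class means are made-up values:

```python
import numpy as np

# Decision functions d_i(x) as (negated) distances to class means:
# the minimum-distance classifier. Class means are made-up numbers.
means = {1: np.array([0.0, 0.0]),
         2: np.array([4.0, 0.0]),
         3: np.array([2.0, 3.0])}

def classify(x):
    # d_i(x) = -||x - mu_i||; pick the class with the largest d_i(x)
    d = {i: -np.linalg.norm(x - mu) for i, mu in means.items()}
    return max(d, key=d.get)

print(classify(np.array([3.5, 0.5])))  # -> 2
```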
Decision Function
Example
[Figure: a three-class example. The boundaries d_1 = d_2, d_1 = d_3, and d_2 = d_3 divide the plane into three decision regions: Class 1 where d_2, d_3 < d_1; Class 2 where d_1, d_3 < d_2; Class 3 where d_1, d_2 < d_3.]
Single Classifier
Most popular single classifiers:
Minimum Distance Classifier
Bayes Classifier
K-Nearest Neighbor
Decision Tree
Neural Network
Support Vector Machine
Support Vector Machines (SVM) (Separable case)
Which is the best separating hyperplane?
The one with the largest margin!
SVM
Linearly Separable Classes
Support Vector Machine
Basically a 2-class classifier, introduced in its modern maximum-margin form by Boser, Guyon, and Vapnik (1992)
Which line is optimal?
Training data: (x_i, y_i), i = 1, ..., n, x_i ∈ R^p, y_i ∈ {−1, 1}
Maximizing margin:
min (1/2)||w||^2
Correct separation:
s.t. y_i[(w · x_i) + b] − 1 ≥ 0, i = 1, 2, ..., n
that is,
w · x_i + b ≥ +1 for y_i = +1
w · x_i + b ≤ −1 for y_i = −1
Decision function: f(x) = (w · x) + b
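A hedged illustration of this formulation, assuming scikit-learn is available: a very large C approximates the hard-margin (separable) problem on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM:
# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```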
Support Vector Machines (SVM)
A large margin provides better generalization ability.
Why named “Support Vector Machine”?
Only the support vectors (the training points lying on the two margin hyperplanes) appear in the final decision function:
f(x) = sgn( Σ_{i=1}^{# of SVs} α_i* y_i (x_i · x) + b* )
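A minimal sketch of evaluating this decision function; the support vectors, labels, α_i*, and b* below are placeholder values that a trained SVM would supply:

```python
import numpy as np

# Evaluate f(x) = sgn( sum_i alpha_i * y_i * (x_i . x) + b )
# using only the support vectors. alpha and b are placeholders
# that a trained SVM would supply.
sv    = np.array([[1.0, 1.0], [-1.0, -1.0]])  # support vectors
sv_y  = np.array([1.0, -1.0])                 # their labels
alpha = np.array([0.25, 0.25])                # their multipliers
b     = 0.0

def f(x):
    return np.sign(np.sum(alpha * sv_y * (sv @ x)) + b)

print(f(np.array([2.0, 0.5])))   # +1 side
```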
Support Vector Machine
Training vectors: x_i, i = 1, ..., n
Consider a simple case with two classes. Define a vector y with
y_i = +1 if x_i is in class 1
y_i = −1 if x_i is in class 2
We look for a hyperplane which separates all the data.
[Figure: two classes separated by a hyperplane. The separating plane lies midway between the support vectors of Class 1 and Class 2; ρ denotes the margin and r the distance from a training point to the separating plane.]
2.8 SVM
Linearly Separable SVM
Label the training data {x_i, y_i}, i = 1, ..., n.
Suppose we have a hyperplane which separates the "+" from the "−" examples (a separating hyperplane).
Points x which lie on the hyperplane satisfy w · x + b = 0, where w is normal to the hyperplane and |b|/||w|| is the perpendicular distance from the hyperplane to the origin.
Linearly Separable SVM
Define two support hyperplanes as
H1: w^T x + b = +δ and H2: w^T x + b = −δ
To remove the extra degree of freedom (the problem is over-parameterized), set δ = 1.
Margin = distance between H1 and H2 = 2/||w||
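Written out, the step from the two support hyperplanes to the 2/||w|| margin:

```latex
% Take x1 on H1 (w^T x1 + b = +1) and x2 on H2 (w^T x2 + b = -1),
% and project x1 - x2 onto the unit normal w/||w||:
\[
  \text{margin}
    \;=\; \frac{w^{T}(x_1 - x_2)}{\lVert w \rVert}
    \;=\; \frac{(1 - b) - (-1 - b)}{\lVert w \rVert}
    \;=\; \frac{2}{\lVert w \rVert}.
\]
```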
The Primal Problem of SVM
Goal: find the separating hyperplane with the largest margin. An SVM finds the w and b that
(1) minimize ||w||^2/2 = w^T w/2
(2) subject to y_i(x_i · w + b) − 1 ≥ 0
Switch the above problem to a Lagrangian formulation for two reasons:
(1) the constraints become easier to handle, and the problem turns into a quadratic program;
(2) the training data then appear only in the form of dot products between vectors, so the method can be generalized to the nonlinear case.
Lagrange Multiplier Method
A method to find the extremum of a multivariate function f(x_1, x_2, ..., x_n) subject to the constraint g(x_1, x_2, ..., x_n) = 0.
For an extremum of f to exist on g, the gradient of f must line up with the gradient of g:
∂f/∂x_k + λ ∂g/∂x_k = 0 for all k = 1, ..., n,
where the constant λ is called the Lagrange multiplier.
The Lagrangian transformation of the SVM problem is
L_P = (1/2)||w||^2 − Σ_{i=1}^{n} α_i [y_i(x_i · w + b) − 1], with multipliers α_i ≥ 0.
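A small worked example of the method (unrelated to SVM), just to fix ideas:

```latex
% Worked example: extremize f(x, y) = x + y subject to
% g(x, y) = x^2 + y^2 - 1 = 0.
% The condition  df/dx_k + lambda * dg/dx_k = 0  gives
\[
  1 + 2\lambda x = 0, \qquad 1 + 2\lambda y = 0
  \;\;\Longrightarrow\;\; x = y = -\tfrac{1}{2\lambda},
\]
% and substituting into the constraint x^2 + y^2 = 1 yields
\[
  x = y = \pm\tfrac{1}{\sqrt{2}},
\]
% i.e. a maximum f = \sqrt{2} and a minimum f = -\sqrt{2}.
```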
Lagrange Multiplier Method
To find the extremum, we set the gradient of L_P with respect to w and b to zero:
(1) ∂L_P/∂w = 0 ⟹ w = Σ_i α_i y_i x_i
(2) ∂L_P/∂b = 0 ⟹ Σ_i α_i y_i = 0
Substituting them into the Lagrangian form, we obtain the dual problem:
maximize L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
The data appear only through inner products x_i · x_j, so the method can be generalized to the nonlinear case by applying a kernel.
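As a sketch, the dual can be solved numerically with a general-purpose optimizer (SciPy is assumed available here; real SVM implementations use dedicated QP solvers such as SMO). b is then recovered from the KKT conditions described next:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve the SVM dual numerically with a general-purpose
# optimizer (a real implementation would use a dedicated QP solver).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                 # minimize the negative dual objective
    return 0.5 * a @ G @ a - a.sum()

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                    # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)               # from KKT: y_i(w.x_i + b) = 1
print("alpha =", alpha.round(3), "w =", w, "b =", b)
```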
KKT Conditions
Since the SVM problem is convex, the KKT conditions are necessary and sufficient for w, b, and α to be a solution.
w is determined by the training procedure. b is easily found from the KKT complementary slackness condition
α_i [y_i(x_i · w + b) − 1] = 0,
by choosing any i for which α_i ≠ 0: then y_i(x_i · w + b) = 1, so b = y_i − x_i · w.
2.8 SVM
What about a non-linear boundary?
Non-Linearly Separable SVM: Kernel
To extend to the non-linear case, we need to map the data to some other (possibly higher-dimensional) Euclidean space.
Kernel
Φ is a mapping function. Since the training algorithm depends on the data only through dot products, we can use a "kernel function" K such that
K(x_i, x_j) = Φ(x_i) · Φ(x_j).
One commonly used example is the radial basis function (RBF).
An RBF is a real-valued function whose value depends only on the distance from the origin, so that Φ(x) = Φ(||x||); or alternatively on the distance from some other point c, called a center, so that Φ(x, c) = Φ(||x − c||).
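A minimal sketch of the corresponding RBF (Gaussian) kernel matrix, assuming the common parameterization K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)):

```python
import numpy as np

# RBF (Gaussian) kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)),
# which the dual algorithm can use in place of the dot product x_i . x_j.
def rbf_kernel(X, sigma=1.0):
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X).round(3))
```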
Non-Separable SVM
Real-world applications usually have no optimal separating hyperplane (OSH): the classes overlap. We therefore add slack (error) terms ζ_i ≥ 0:
y_i(x_i · w + b) ≥ 1 − ζ_i
To penalize the error terms, define the objective as (1/2)||w||^2 + C Σ_i ζ_i. The new Lagrangian form is the same as before, and the resulting dual differs only in the bounds on the multipliers: 0 ≤ α_i ≤ C.
Non-Separable SVM: New KKT Conditions (as before, plus a complementary condition for each slack variable ζ_i and the bound 0 ≤ α_i ≤ C)
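A hedged illustration of the soft-margin trade-off, assuming scikit-learn is available: the penalty C controls how strongly slack/errors are punished, shown here on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of a soft-margin SVM on overlapping (non-separable) toy data.
# Small C tolerates more margin violations; large C fits the training
# data more tightly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C:6}: {clf.n_support_.sum()} support vectors, "
          f"train acc={clf.score(X, y):.2f}")
```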