An Introduction to Support Vector Machine (SVM)
Famous Examples that helped SVM become popular
Classification
Every day, all the time, we classify things.
E.g. crossing the street: Is there a car coming? At what speed? How far is it to the other side? Classification: safe to walk or not!
Discriminant Function
It can be an arbitrary function of x, such as:
Nearest Neighbor
Decision Tree
Linear functions: g(x) = w^T x + b (see the sketch after this list)
Nonlinear functions
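For concreteness, a minimal Python sketch of such a linear discriminant; the weight vector w and bias b below are made-up values, not learned from data:

```python
import numpy as np

# Minimal sketch of a linear discriminant g(x) = w^T x + b.
# w and b are illustrative values, not learned from data.
w = np.array([2.0, -1.0])
b = 0.5

def g(x):
    """Positive on one side of the hyperplane, negative on the other."""
    return w @ x + b

x = np.array([1.0, 0.5])
label = 1 if g(x) > 0 else -1
print(g(x), label)   # 2.0, 1
```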
Background – Classification Problem
Applications:
Personal Identification
Credit Rating
Medical Diagnosis
Text Categorization
Denial of Service Detection
Character Recognition
Biometrics
Image Classification
Classification Formulation
Given:
an input space X
a set of classes C = {c_1, c_2, ..., c_c}
the classification problem is to define a mapping f: X → C where each x in X is assigned to one class.
This mapping function is called a Decision Function.
Decision Function
The basic problem in classification is to find c decision functions
d_1(x), d_2(x), ..., d_c(x)
with the property that, if a pattern x belongs to class i, then
d_i(x) > d_j(x), j = 1, 2, ..., c; j ≠ i,
where d_i(x) is some similarity measure between x and class i, such as a distance or a probability.
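As one concrete choice of d_i(x), here is a minimal sketch of a minimum-distance classifier, d_i(x) = −||x − μ_i||; the class means are made-up values:

```python
import numpy as np

# Decision functions d_i(x) as (negated) distances to class means:
# the minimum-distance classifier. Class means are made-up numbers.
means = {1: np.array([0.0, 0.0]),
         2: np.array([4.0, 0.0]),
         3: np.array([2.0, 3.0])}

def classify(x):
    # d_i(x) = -||x - mu_i||; pick the class with the largest d_i(x)
    d = {i: -np.linalg.norm(x - mu) for i, mu in means.items()}
    return max(d, key=d.get)

print(classify(np.array([3.5, 0.5])))  # -> 2
```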
Decision Function
Example
[Figure: a three-class example. The boundaries d_1 = d_2, d_1 = d_3, and d_2 = d_3 divide the plane into three decision regions: Class 1 where d_2, d_3 < d_1; Class 2 where d_1, d_3 < d_2; Class 3 where d_1, d_2 < d_3.]
Single Classifier
Most popular single classifiers:
Minimum Distance Classifier
Bayes Classifier
K-Nearest Neighbor
Decision Tree
Neural Network
Support Vector Machine
Support Vector Machines (SVM) (Separable case)
Which is the best separating hyperplane?
The one with the largest margin!
SVM
Linearly Separable Classes
Support Vector Machine
Basically a 2-class classifier, introduced in its modern maximum-margin form by Boser, Guyon, and Vapnik (1992)
Which line is optimal?
Training data: (x_i, y_i), i = 1, ..., n, x_i ∈ R^p, y_i ∈ {−1, 1}
Maximizing margin:
min (1/2)||w||^2
Correct separation:
s.t. y_i[(w · x_i) + b] − 1 ≥ 0, i = 1, 2, ..., n
that is,
w · x_i + b ≥ +1 for y_i = +1
w · x_i + b ≤ −1 for y_i = −1
Decision function: f(x) = (w · x) + b
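A hedged illustration of this formulation, assuming scikit-learn is available: a very large C approximates the hard-margin (separable) problem on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM:
# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```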
Support Vector Machines (SVM)
A large margin provides better generalization ability.
Why named “Support Vector Machine”?
Only the support vectors (the training points lying on the two margin hyperplanes) appear in the final decision function:
f(x) = sgn( Σ_{i=1}^{# of SVs} α_i* y_i (x_i · x) + b* )
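A minimal sketch of evaluating this decision function; the support vectors, labels, α_i*, and b* below are placeholder values that a trained SVM would supply:

```python
import numpy as np

# Evaluate f(x) = sgn( sum_i alpha_i * y_i * (x_i . x) + b )
# using only the support vectors. alpha and b are placeholders
# that a trained SVM would supply.
sv    = np.array([[1.0, 1.0], [-1.0, -1.0]])  # support vectors
sv_y  = np.array([1.0, -1.0])                 # their labels
alpha = np.array([0.25, 0.25])                # their multipliers
b     = 0.0

def f(x):
    return np.sign(np.sum(alpha * sv_y * (sv @ x)) + b)

print(f(np.array([2.0, 0.5])))   # +1 side
```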
Support Vector Machine
Training vectors: x_i, i = 1, ..., n
Consider a simple case with two classes. Define a vector y with
y_i = +1 if x_i is in class 1
y_i = −1 if x_i is in class 2
We look for a hyperplane which separates all the data.
[Figure: two classes separated by a hyperplane. The separating plane lies midway between the support vectors of Class 1 and Class 2; ρ denotes the margin and r the distance from a training point to the separating plane.]
2.8 SVM
Linearly Separable SVM
Label the training data {x_i, y_i}, i = 1, ..., n.
Suppose we have a hyperplane which separates the "+" from the "−" examples (a separating hyperplane).
Points x which lie on the hyperplane satisfy w · x + b = 0, where w is normal to the hyperplane and |b|/||w|| is the perpendicular distance from the hyperplane to the origin.
Linearly Separable SVM
Define two support hyperplanes as
H1: w^T x + b = +δ and H2: w^T x + b = −δ
To remove the extra degree of freedom (the problem is over-parameterized), set δ = 1.
Margin = distance between H1 and H2 = 2/||w||
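Written out, the step from the two support hyperplanes to the 2/||w|| margin:

```latex
% Take x1 on H1 (w^T x1 + b = +1) and x2 on H2 (w^T x2 + b = -1),
% and project x1 - x2 onto the unit normal w/||w||:
\[
  \text{margin}
    \;=\; \frac{w^{T}(x_1 - x_2)}{\lVert w \rVert}
    \;=\; \frac{(1 - b) - (-1 - b)}{\lVert w \rVert}
    \;=\; \frac{2}{\lVert w \rVert}.
\]
```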
The Primal Problem of SVM
Goal: find the separating hyperplane with the largest margin. An SVM finds the w and b that
(1) minimize ||w||^2/2 = w^T w/2
(2) subject to y_i(x_i · w + b) − 1 ≥ 0
Switch the above problem to a Lagrangian formulation for two reasons:
(1) the constraints become easier to handle, and the problem turns into a quadratic program;
(2) the training data then appear only in the form of dot products between vectors, so the method can be generalized to the nonlinear case.
Lagrange Multiplier Method
A method to find the extremum of a multivariate function f(x_1, x_2, ..., x_n) subject to the constraint g(x_1, x_2, ..., x_n) = 0.
For an extremum of f to exist on g, the gradient of f must line up with the gradient of g:
∂f/∂x_k + λ ∂g/∂x_k = 0 for all k = 1, ..., n,
where the constant λ is called the Lagrange multiplier.
The Lagrangian transformation of the SVM problem is
L_P = (1/2)||w||^2 − Σ_{i=1}^{n} α_i [y_i(x_i · w + b) − 1], with multipliers α_i ≥ 0.
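A small worked example of the method (unrelated to SVM), just to fix ideas:

```latex
% Worked example: extremize f(x, y) = x + y subject to
% g(x, y) = x^2 + y^2 - 1 = 0.
% The condition  df/dx_k + lambda * dg/dx_k = 0  gives
\[
  1 + 2\lambda x = 0, \qquad 1 + 2\lambda y = 0
  \;\;\Longrightarrow\;\; x = y = -\tfrac{1}{2\lambda},
\]
% and substituting into the constraint x^2 + y^2 = 1 yields
\[
  x = y = \pm\tfrac{1}{\sqrt{2}},
\]
% i.e. a maximum f = \sqrt{2} and a minimum f = -\sqrt{2}.
```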
Lagrange Multiplier Method
To find the extremum, we set the gradient of L_P with respect to w and b to zero:
(1) ∂L_P/∂w = 0 ⟹ w = Σ_i α_i y_i x_i
(2) ∂L_P/∂b = 0 ⟹ Σ_i α_i y_i = 0
Substituting them into the Lagrangian form, we obtain the dual problem:
maximize L_D = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
The data appear only through inner products x_i · x_j, so the method can be generalized to the nonlinear case by applying a kernel.
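As a sketch, the dual can be solved numerically with a general-purpose optimizer (SciPy is assumed available here; real SVM implementations use dedicated QP solvers such as SMO). b is then recovered from the KKT conditions described next:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve the SVM dual numerically with a general-purpose
# optimizer (a real implementation would use a dedicated QP solver).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                 # minimize the negative dual objective
    return 0.5 * a @ G @ a - a.sum()

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,                    # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})

alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)               # from KKT: y_i(w.x_i + b) = 1
print("alpha =", alpha.round(3), "w =", w, "b =", b)
```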
KKT Conditions
Since the SVM problem is convex, the KKT conditions are necessary and sufficient for w, b, and α to be a solution.
w is determined by the training procedure. b is easily found from the KKT complementary slackness condition
α_i [y_i(x_i · w + b) − 1] = 0,
by choosing any i for which α_i ≠ 0: then y_i(x_i · w + b) = 1, so b = y_i − x_i · w.
2.8 SVM
What about a non-linear boundary?
Non-Linearly Separable SVM: Kernel
To extend to the non-linear case, we need to map the data to some other (possibly higher-dimensional) Euclidean space.
Kernel
Φ is a mapping function. Since the training algorithm depends on the data only through dot products, we can use a "kernel function" K such that
K(x_i, x_j) = Φ(x_i) · Φ(x_j).
One commonly used example is the radial basis function (RBF).
An RBF is a real-valued function whose value depends only on the distance from the origin, so that Φ(x) = Φ(||x||); or alternatively on the distance from some other point c, called a center, so that Φ(x, c) = Φ(||x − c||).
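A minimal sketch of the corresponding RBF (Gaussian) kernel matrix, assuming the common parameterization K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)):

```python
import numpy as np

# RBF (Gaussian) kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)),
# which the dual algorithm can use in place of the dot product x_i . x_j.
def rbf_kernel(X, sigma=1.0):
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel(X).round(3))
```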
Non-Separable SVM
Real-world applications usually have no optimal separating hyperplane (OSH): the classes overlap. We therefore add slack (error) terms ζ_i ≥ 0:
y_i(x_i · w + b) ≥ 1 − ζ_i
To penalize the error terms, define the objective as (1/2)||w||^2 + C Σ_i ζ_i. The new Lagrangian form is the same as before, and the resulting dual differs only in the bounds on the multipliers: 0 ≤ α_i ≤ C.
Non-Separable SVM: New KKT Conditions (as before, plus a complementary condition for each slack variable ζ_i and the bound 0 ≤ α_i ≤ C)
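A hedged illustration of the soft-margin trade-off, assuming scikit-learn is available: the penalty C controls how strongly slack/errors are punished, shown here on made-up overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of a soft-margin SVM on overlapping (non-separable) toy data.
# Small C tolerates more margin violations; large C fits the training
# data more tightly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C:6}: {clf.n_support_.sum()} support vectors, "
          f"train acc={clf.score(X, y):.2f}")
```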