Support Vector Machine (SVM)
Based on Nello Cristianini's presentation: http://www.support-vector.net/tutorial.html
Basic Idea
• Use a Linear Learning Machine (LLM).
• Overcome the linearity constraint: map non-linearly to a higher dimension.
• Select between hyperplanes: use the margin as the criterion.
• Generalization depends on the margin.
General idea
[Figure: original problem → transformed problem]
Kernel Based Algorithms
• Two separate functions:
• Learning algorithm: works in the embedded space.
• Kernel function: performs the embedding.
Basic Example: Kernel Perceptron
• Hyperplane classification: f(x) = <w, x> + b = <w', x'> (the bias absorbed into augmented vectors w', x'), h(x) = sign(f(x))
• Perceptron Algorithm: sample (x_i, t_i), t_i ∈ {-1, +1}
IF t_i <w_k, x_i> < 0 THEN /* error */
w_{k+1} = w_k + t_i x_i
k = k + 1
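A minimal sketch of this update rule (an illustration, not part of the slides), assuming the sample is given as NumPy arrays X (one row per x_i) and t (labels in {-1, +1}):

import numpy as np

def perceptron(X, t, epochs=100):
    # Linear Learning Machine: f(x) = <w, x> + b, h(x) = sign(f(x))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            if t_i * (np.dot(w, x_i) + b) <= 0:   # error on (x_i, t_i); <= so the all-zero start also updates
                w = w + t_i * x_i                 # w_{k+1} = w_k + t_i x_i
                b = b + t_i
    return w, b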
Recall
• Margin of hyperplane w: D(w) = min_{x_i ∈ S} t_i (<w, x_i> + b)
• Mistake bound: M ≤ ( max_i ||x_i|| · ||w|| / D(w) )²
Observations
• Solution is a linear combination of the inputs: w = Σ_i a_i t_i x_i, with a_i ≥ 0 (a_i counts the mistakes made on x_i).
• Mistake driven: only points on which we make a mistake influence the solution!
• Support vectors: the points with non-zero a_i.
Dual representation
• Rewrite the basic function: f(x) = <w, x> + b = Σ_i a_i t_i <x_i, x> + b, where w = Σ_i a_i t_i x_i
• Change the update rule: IF t_j (Σ_i a_i t_i <x_i, x_j> + b) < 0 THEN a_j = a_j + 1 (see the sketch below)
• Observation: the data appears only inside inner products!
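A matching sketch of the dual (kernel) perceptron following the update rule above; the kernel K and the array shapes are assumptions made for illustration:

import numpy as np

def kernel_perceptron(X, t, K, epochs=100):
    m = len(X)
    # Gram matrix: <x_i, x_j> replaced by K(x_i, x_j)
    G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])
    a = np.zeros(m)   # a_i = number of mistakes made on x_i
    b = 0.0
    for _ in range(epochs):
        for j in range(m):
            if t[j] * (np.sum(a * t * G[:, j]) + b) <= 0:   # mistake on (x_j, t_j)
                a[j] += 1                                   # a_j = a_j + 1
                b += t[j]
    return a, b

def f(x, X, t, a, b, K):
    # f(x) = sum_i a_i t_i K(x_i, x) + b
    return sum(a_i * t_i * K(x_i, x) for a_i, t_i, x_i in zip(a, t, X)) + b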
Limitation of Perceptron
• Only linear separations
• Only converges for linearly separable data
• Only defined on vectorial data
The idea of a Kernel
• Embed data to a different space
• Possibly higher dimension
• Linearly separable in the new space.
[Figure: original problem → transformed problem]
Kernel Mapping
• Need only to compute inner-products.
• Mapping: M(x)
• Kernel: K(x,y) = < M(x) , M(y)>
• Dimensionality of M(x): unimportant!
• Need only to compute K(x,y)
• Using it in the embedded space: Replace <x,y> by K(x,y)
Example
x = (x_1, x_2); z = (z_1, z_2); K(x, z) = (<x, z>)²
(<x, z>)² = (x_1 z_1 + x_2 z_2)² = x_1² z_1² + 2 x_1 z_1 x_2 z_2 + x_2² z_2²
= < [x_1², x_2², √2 x_1 x_2], [z_1², z_2², √2 z_1 z_2] >
= <M(x), M(z)>
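A quick numerical check of the identity above (the toy vectors are chosen here only for illustration):

import numpy as np

def K(x, z):
    return np.dot(x, z) ** 2                      # K(x, z) = (<x, z>)^2

def M(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])  # explicit embedding

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(K(x, z), np.dot(M(x), M(z)))                # both print 121.0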
Polynomial Kernel
[Figure: original problem → transformed problem]
Kernel Matrix
K(1,1)  K(1,2)  K(1,3)  K(1,4)
K(2,1)  K(2,2)  K(2,3)  K(2,4)
K(3,1)  K(3,2)  K(3,3)  K(3,4)
K(4,1)  K(4,2)  K(4,3)  K(4,4)
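A sketch of how such a matrix is computed from a sample X and a kernel function K (names are illustrative):

import numpy as np

def kernel_matrix(X, K):
    # entry (i, j) holds K(x_i, x_j)
    m = len(X)
    return np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])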
Example of Basic Kernels
• Polynomial: K(x, z) = (<x, z>)^d
• Gaussian: K(x, z) = exp{ -||x - z||² / 2σ² }
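Both kernels as plain functions; the degree d and width sigma are free parameters, and the defaults below are only illustrative:

import numpy as np

def polynomial_kernel(x, z, d=2):
    return np.dot(x, z) ** d                                       # (<x, z>)^d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))  # exp{-||x - z||^2 / 2 sigma^2}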
Kernel: Closure Properties
• K(x,z) = K1(x,z) + c, for c ≥ 0
• K(x,z) = c·K1(x,z), for c ≥ 0
• K(x,z) = K1(x,z) · K2(x,z)
• K(x,z) = K1(x,z) + K2(x,z)
• Create new kernels using basic ones!
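These closure properties translate directly into code: new kernel functions can be composed from existing ones (a small illustrative sketch; the helper names are not from the slides):

def kernel_sum(K1, K2):
    return lambda x, z: K1(x, z) + K2(x, z)        # K1 + K2

def kernel_product(K1, K2):
    return lambda x, z: K1(x, z) * K2(x, z)        # K1 * K2

def kernel_scale(K1, c):
    return lambda x, z: c * K1(x, z)               # c * K1, c >= 0

# e.g. combine the basic kernels defined above:
# K = kernel_sum(polynomial_kernel, gaussian_kernel)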
Support Vector Machines
• Linear Learning Machines (LLM)
• Use dual representation
• Work in the kernel-induced feature space: f(x) = Σ_i a_i t_i K(x_i, x) + b
• Which hyperplane to select?
Generalization of SVM
• PAC theory: error = O(VC-dim / m)
• Problem: VC-dim >> m
• No preference between consistent hyperplanes
Margin based bounds
• H: Basic Hypothesis class
• conv(H): finite convex combinations of H
• D: a distribution over X × {+1, -1}
• S: Sample of size m over D
Margin based bounds
• THEOREM: with probability at least 1 - δ over the sample S, for every f in conv(H) and every θ > 0:
Pr_D[ y f(x) ≤ 0 ] ≤ Pr_S[ y f(x) ≤ θ ] + L
where L = O( (1/√m) · √( (log m · log|H|) / θ² + log(1/δ) ) )
Maximal Margin Classifier
• Maximizes the margin
• Minimizes the overfitting due to margin selection.
• Increases the margin rather than reducing the dimensionality
SVM: Support Vectors
Margins
• Functional margin: min_i t_i f(x_i)
• Geometric margin: min_i t_i f(x_i) / ||w||
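A direct translation of the two definitions, assuming a linear f(x) = <w, x> + b and NumPy arrays (an illustration, not from the slides):

import numpy as np

def margins(w, b, X, t):
    f = X @ w + b                       # f(x_i) for all i
    functional = np.min(t * f)          # min_i t_i f(x_i)
    geometric = functional / np.linalg.norm(w)
    return functional, geometric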
Main trick in SVM
• Insist on a functional margin of at least 1; support vectors have margin exactly 1.
• Geometric margin = 1 / ||w||
• Proof: for a support vector x_i, t_i (<w, x_i> + b) = 1, so its distance to the hyperplane is t_i (<w, x_i> + b) / ||w|| = 1 / ||w||.
SVM criteria
• Find a hyperplane (w,b)
• that minimizes: ||w||² = <w, w>
• subject to: for all i, t_i (<w, x_i> + b) ≥ 1
Quadratic Programming
• Quadratic objective function.
• Linear constraints.
• Unique optimum (convex problem).
• Polynomial-time algorithms.
Dual Problem
• Maximize: W(a) = Σ_i a_i - ½ Σ_{i,j} a_i a_j t_i t_j K(x_i, x_j)
• Subject to: Σ_i a_i t_i = 0
and a_i ≥ 0 for all i
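In practice this QP is handed to a solver; below is a minimal sketch using scikit-learn's SVC (the library and the toy data are assumptions, not part of the slides). A very large C approximates the hard-margin problem stated above:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
t = np.array([-1, -1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # hard-margin-like linear SVM
clf.fit(X, t)

print(clf.support_vectors_)         # the x_i with a_i > 0
print(clf.dual_coef_)               # the products a_i * t_i for the support vectors
print(clf.intercept_)               # b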
Applications: Text
• Classify a text into given categories: sports, news, business, science, …
• Feature space: bag of words, a huge sparse vector!
Applications: Text
• Practicalities: word w gets weight M_w(x) = tf_w · log(idf_w) / K
tf_w = term frequency of w in the document
idf_w = inverse document frequency = (# documents) / (# documents containing w)
K = a normalization constant
• Inner product <M(x), M(z)>: cheap to compute on sparse vectors
• SVM: finds a hyperplane in "document space"
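A sketch of the whole pipeline with scikit-learn (the library choice and the toy documents are assumptions made for illustration): the tf-idf vectorizer produces the huge sparse bag-of-words vectors M(x), and a linear SVM then finds a hyperplane in document space.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the team won the match", "stocks fell sharply today",
        "the striker scored twice", "markets rallied after the report"]
labels = ["sports", "business", "sports", "business"]

vectorizer = TfidfVectorizer()            # bag of words -> sparse tf-idf vectors M(x)
X = vectorizer.fit_transform(docs)

clf = LinearSVC()                         # linear SVM in "document space"
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["the match ended in a draw"])))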