Classification / Regression
Support Vector Machines
Jeff Howbert, Introduction to Machine Learning, Winter 2014
Topics
– SVM classifiers for linearly separable classes
– SVM classifiers for non-linearly separable classes
– SVM classifiers for nonlinear decision boundaries
  kernel functions
– Other applications of SVMs
– Software
Support vector machines
Support vector machines
Goal: find a linear decision boundary (hyperplane) that separates the classes

Linearly separable classes
Support vector machines
One possible solution
Support vector machines
Another possible solution
Support vector machines
Other possible solutions
Support vector machines
Which one is better? B1 or B2? How do you define better?
Support vector machines
Hyperplane that maximizes the margin will have better generalization => B1 is better than B2
Support vector machines

Hyperplane that maximizes the margin will have better generalization => B1 is better than B2

[Figure: decision boundaries B1 and B2 with their margin hyperplanes b11, b12 and b21, b22, the margin width, and a test sample]
Support vector machines

[Figure: decision boundary B1 with margin hyperplanes b11 and b12 and normal vector w]

Decision boundary: $\mathbf{w} \cdot \mathbf{x} + b = 0$

Margin hyperplanes: $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$

$$y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$

$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
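As a quick numerical check of the margin formula (the numbers are illustrative, not from the slides): for a weight vector $\mathbf{w} = (3, 4)$,

$$\|\mathbf{w}\| = \sqrt{3^2 + 4^2} = 5, \qquad \text{margin} = \frac{2}{\|\mathbf{w}\|} = \frac{2}{5} = 0.4$$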
Support vector machines

We want to maximize:
$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$

Which is equivalent to minimizing:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2}$$

But subject to the following constraints:
$$y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$$

– This is a constrained convex optimization problem
– Solve with numerical approaches, e.g. quadratic programming
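A minimal sketch of solving this problem in practice, assuming scikit-learn is available (the slides do not prescribe a library); a very large C approximates the hard-margin formulation above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters in 2-D (illustrative values).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin SVM (essentially no slack allowed).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # weight vector w
b = clf.intercept_[0]     # bias b
print("w =", w, "b =", b)
print("margin = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
```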
Support vector machines

Solving for w that gives maximum margin:
1. Combine objective function and constraints into new objective function, using Lagrange multipliers λi:
$$L_{primal} = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left( y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right)$$
2. To minimize this Lagrangian, we take derivatives of w and b and set them to 0:
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i \qquad\qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0$$
Support vector machines

Solving for w that gives maximum margin:
3. Substituting and rearranging gives the dual of the Lagrangian:
$$L_{dual} = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$
which we try to maximize (not minimize).
4. Once we have the λi, we can substitute into previous equations to get w and b.
5. This defines w and b as linear combinations of the training data.
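A sketch of maximizing this dual with an off-the-shelf QP solver, here cvxopt (my choice; the slides only say quadratic programming). The toy data, the small ridge added for numerical stability, and the 1e-6 threshold for deciding which λi are nonzero are all illustrative assumptions:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def svm_dual_hard_margin(X, y):
    """Maximize L_dual by minimizing (1/2) l^T Q l - 1^T l with QP."""
    N = X.shape[0]
    Xy = y[:, None] * X                          # rows are y_i * x_i
    Q = Xy @ Xy.T + 1e-8 * np.eye(N)             # Q_ij = y_i y_j (x_i . x_j), small ridge
    P = matrix(Q)
    q = matrix(-np.ones(N))                      # the -sum_i lambda_i term
    G = matrix(-np.eye(N))                       # -lambda_i <= 0, i.e. lambda_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))   # sum_i lambda_i y_i = 0
    b = matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    # Recover w and b from the support vectors (lambda_i > 0).
    w = (lam * y) @ X
    sv = lam > 1e-6
    b_val = np.mean(y[sv] - X[sv] @ w)
    return w, b_val, lam

# Illustrative linearly separable toy data.
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b_val, lam = svm_dual_hard_margin(X, y)
print("lambda:", lam)
print("w:", w, " b:", b_val)
```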
Optimizing the dual is easier.
– Function of λi only, not λi and w.
Convex optimization ⇒ guaranteed to find global optimum.
Most of the λi go to zero.
– The xi for which λi ≠ 0 are called the support vectors. These "support" (lie on) the margin boundaries.
– The xi for which λi = 0 lie away from the margin boundaries and are not required for defining the maximum margin hyperplane.
Support vector machines
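Continuing the earlier scikit-learn sketch (again an assumed tool, not one the slides name): a fitted linear SVC exposes exactly these quantities, with dual_coef_ holding λi·yi for the support vectors, so w can be rebuilt as a linear combination of the training data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)    # the x_i with lambda_i != 0
print("lambda_i * y_i:", clf.dual_coef_[0])          # nonzero dual coefficients

# w = sum_i lambda_i y_i x_i, summed over the support vectors only.
w_rebuilt = clf.dual_coef_[0] @ clf.support_vectors_
print("w rebuilt from support vectors:", w_rebuilt)
print("w from the solver:             ", clf.coef_[0])
```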
Support vector machines
Example of solving for maximum margin hyperplane
Support vector machines
What if the classes are not linearly separable?
Support vector machines
Now which one is better? B1 or B2? How do you define better?
Support vector machines

What if the problem is not linearly separable?
Solution: introduce slack variables ξi
– Need to minimize:
$$L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^k$$
– Subject to:
$$y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$$
– C is an important hyperparameter, whose value is usually optimized by cross-validation.
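A hedged sketch of choosing C by cross-validation, assuming scikit-learn and an arbitrary grid of C values (neither is specified by the slides); note that scikit-learn's soft-margin objective effectively uses k = 1, a plain sum of slacks:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Overlapping (non-separable) toy data.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, random_state=0)

# Search over C with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", search.best_score_)
```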
Slack variables for nonseparable data
Support vector machines
Support vector machines
What if decision boundary is not linear?
Support vector machines
Solution: nonlinear transform of attributes
$$\Phi: [x_1, x_2] \rightarrow [x_1, (x_1 + x_2)^4]$$
Support vector machines
Solution: nonlinear transform of attributes
$$\Phi: [x_1, x_2] \rightarrow [(x_1^2 - x_2^2), (x_1 - x_2)]$$
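To see why such a transform helps, here is a deliberately simple illustration (the transform Φ: [x1, x2] → [x1², x2²] and the ring-shaped data are my choices, not necessarily what the slide shows): a circular boundary becomes linear in the squared coordinates.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original attributes.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit nonlinear transform Phi: [x1, x2] -> [x1^2, x2^2].
X_phi = X ** 2

linear_raw = SVC(kernel="linear").fit(X, y)
linear_phi = SVC(kernel="linear").fit(X_phi, y)

print("accuracy in original space:   ", linear_raw.score(X, y))      # poor
print("accuracy in transformed space:", linear_phi.score(X_phi, y))  # near 1.0
```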
Issues with finding useful nonlinear transforms
– Not feasible to do manually as number of attributes grows (i.e. any real-world problem)
– Usually involves transformation to higher dimensional space
  increases computational burden of SVM optimization
  curse of dimensionality
With SVMs, can circumvent all the above via the kernel trick
Support vector machines
Support vector machines
Kernel trick
– Don’t need to specify the attribute transform Φ( x )
– Only need to know how to calculate the dot product of any two transformed samples:
  k( x1, x2 ) = Φ( x1 ) ⋅ Φ( x2 )
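A small numerical check of this identity for one concrete case (the kernel and feature map below are my choices, not the slides'): in 2-D, the degree-2 polynomial kernel (x1 ⋅ x2 + 1)² equals the ordinary dot product of an explicit 6-dimensional transform.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product reproduces (x . z + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def k_poly(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

print("kernel value:     ", k_poly(x, z))
print("dot of transforms:", np.dot(phi(x), phi(z)))   # identical value
```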
Support vector machines
Kernel trick (cont’d.)
– The kernel function k( ⋅, ⋅ ) is substituted into the dual of the Lagrangian, allowing determination of a maximum margin hyperplane in the (implicitly) transformed space Φ( x ):
$$L_{dual} = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \mathrm{k}(\mathbf{x}_i, \mathbf{x}_j)$$
– All subsequent calculations, including predictions on test samples, are done using the kernel in place of Φ( x1 ) ⋅ Φ( x2 )
Support vector machines

Common kernel functions for SVM
– linear: $k(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1 \cdot \mathbf{x}_2$
– polynomial: $k(\mathbf{x}_1, \mathbf{x}_2) = (\gamma\, \mathbf{x}_1 \cdot \mathbf{x}_2 + c)^d$
– Gaussian or radial basis: $k(\mathbf{x}_1, \mathbf{x}_2) = \exp\!\left(-\gamma \|\mathbf{x}_1 - \mathbf{x}_2\|^2\right)$
– sigmoid: $k(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\gamma\, \mathbf{x}_1 \cdot \mathbf{x}_2 + c)$
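If scikit-learn is used (an assumption; the slides list other software later), these four kernels map directly onto SVC's kernel options, where c is called coef0 and d is called degree:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

models = {
    "linear":     SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0),  # (gamma x1.x2 + c)^d
    "RBF":        SVC(kernel="rbf", gamma=1.0),                        # exp(-gamma ||x1-x2||^2)
    "sigmoid":    SVC(kernel="sigmoid", gamma=0.1, coef0=0.0),         # tanh(gamma x1.x2 + c)
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:10s} training accuracy: {model.score(X, y):.2f}")
```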
For some kernels (e.g. Gaussian) the implicit transform Φ( x ) is infinite-dimensional!
– But calculations with the kernel are done in the original space, so computational burden and curse of dimensionality aren’t a problem.
Support vector machines
Support vector machines
Support vector machines
Applications of SVMs to machine learning
– Classification
  binary
  multiclass
  one-class
– Regression
– Transduction (semi-supervised learning)
– Ranking
– Clustering
– Structured labels
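Several of these variants have off-the-shelf implementations; a brief sketch of the corresponding scikit-learn classes (that library's names, assumed here, not something the slides specify):

```python
from sklearn.svm import SVC, SVR, OneClassSVM

clf = SVC(kernel="rbf")          # binary / multiclass classification
reg = SVR(kernel="rbf")          # support vector regression
novelty = OneClassSVM(nu=0.05)   # one-class SVM for novelty / outlier detection
```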
Support vector machines
Software
– SVMlight
  http://svmlight.joachims.org/
– libSVM
  http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  includes MATLAB / Octave interface
– MATLAB svmtrain / svmclassify
  only supports binary classification
Online demos
– http://cs.stanford.edu/people/karpathy/svmjs/demo/
Support vector machines