Machine Learning 4771, Columbia University (jebara/4771/notes/class6x.pdf)
TRANSCRIPT
Tony Jebara, Columbia University
Topic 6
• Review: Support Vector Machines
• Primal & Dual Solution
• Non-separable SVMs
• Kernels
• SVM Demo
Review: SVM
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize margin to reduce guaranteed risk J(θ)
• Assume first the 2-class data is linearly separable: have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,+1\}$
• Decision boundary or hyperplane given by $w^T x + b = 0$, with classifier $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• Note: can scale w & b while keeping same boundary
• Many solutions exist which have empirical error Remp(θ)=0
• Want widest or thickest one (max margin), also it’s unique!
Support Vector Machines
• Define:
$H$ = decision hyperplane $w^T x + b = 0$
$H_+$ = positive margin hyperplane
$H_-$ = negative margin hyperplane
$q$ = distance from decision plane to origin
$$q = \min_x \|x - 0\| \ \text{subject to} \ w^T x + b = 0 \quad\Rightarrow\quad \min_x \tfrac{1}{2}\|x - 0\|^2 - \lambda\,(w^T x + b)$$
1) gradient: $\frac{\partial}{\partial x}\left(\tfrac{1}{2}x^T x - \lambda\,(w^T x + b)\right) = 0 \;\Rightarrow\; x - \lambda w = 0 \;\Rightarrow\; x = \lambda w$
2) plug into constraint: $w^T x + b = 0 \;\Rightarrow\; w^T(\lambda w) + b = 0 \;\Rightarrow\; \lambda = -\frac{b}{w^T w}$
3) Sol’n: $\hat{x} = -\frac{b}{w^T w}\, w$
4) distance: $q = \|\hat{x} - 0\| = \left\|\frac{b}{w^T w}\, w\right\| = \frac{|b|}{w^T w}\sqrt{w^T w} = \frac{|b|}{\|w\|}$ (checked numerically in the sketch below)
5) Define without loss of generality, since can scale b & w:
$$H \rightarrow w^T x + b = 0, \qquad H_+ \rightarrow w^T x + b = +1, \qquad H_- \rightarrow w^T x + b = -1$$
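A quick numerical check of steps 3) and 4), a minimal MATLAB sketch with arbitrary example values for w and b:

```matlab
% Sanity check of q = |b|/||w||: the closest point on the hyperplane
% w'*x + b = 0 to the origin is x_hat = -(b/(w'*w))*w.
w = [3; 4]; b = 2;                   % arbitrary example values
x_hat = -(b / (w' * w)) * w;         % step 3) solution
disp(w' * x_hat + b)                 % ~0: x_hat lies on the hyperplane
disp([norm(x_hat), abs(b)/norm(w)])  % both equal 0.4: the distance q
```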
Support Vector Machines
• The constraints on the SVM for Remp(θ)=0 are thus:
$$w^T x_i + b \geq +1 \ \ \forall\, y_i = +1 \qquad\qquad w^T x_i + b \leq -1 \ \ \forall\, y_i = -1$$
• Or more simply: $y_i\,(w^T x_i + b) - 1 \geq 0$
[Figure: margin hyperplanes $H_+ \rightarrow w^T x + b = +1$ and $H_- \rightarrow w^T x + b = -1$ around $H \rightarrow w^T x + b = 0$, with distances $d_+$ and $d_-$]
• The margin of the SVM is: $m = d_+ + d_-$
• Distance to origin: $H \rightarrow q = \frac{|b|}{\|w\|}$, $\quad H_+ \rightarrow q_+ = \frac{|b - 1|}{\|w\|}$, $\quad H_- \rightarrow q_- = \frac{|-1 - b|}{\|w\|}$
• Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and margin $m = \frac{2}{\|w\|}$
• Want to max margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem: $\min \ \tfrac{1}{2}\|w\|^2$ subject to $y_i\,(w^T x_i + b) - 1 \geq 0$
• This is a quadratic program!
• Can plug this into a matlab function called “qp()”, done! (see the quadprog sketch below)
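Not from the slides, but a minimal sketch of how the primal problem maps onto MATLAB's quadprog (assumes the Optimization Toolbox and a toy, linearly separable dataset; variable names are illustrative):

```matlab
% Hard-margin primal SVM via quadprog: variables z = [w; b].
X = [2 2; 3 3; -1 -1; -2 -1];        % toy inputs, one row per point (N x D)
y = [ 1;  1; -1; -1];                % labels in {-1,+1}
[N, D] = size(X);

H    = blkdiag(eye(D), 0);           % cost 0.5*||w||^2, no penalty on b
f    = zeros(D + 1, 1);
A    = -[diag(y) * X, y];            % y_i*(w'*x_i + b) >= 1  <=>  A*z <= bvec
bvec = -ones(N, 1);

z = quadprog(H, f, A, bvec);         % may warn that H is only semidefinite
w = z(1:D);  b = z(D+1);
margin = 2 / norm(w)                 % the maximum margin 2/||w||
```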
SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem $L_P$:
$$\min \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w^T x_i + b) - 1 \geq 0$$
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex hull)
• Subtract from the cost each inequality times a Lagrange multiplier $\alpha_i$, then take derivatives with respect to w & b:
$$L_P = \min_{w,b} \max_{\alpha \geq 0} \ \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i\,(w^T x_i + b) - 1 \right)$$
$$\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \qquad\qquad \frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$$
• Plug back in, dual:
$$L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j$$
• Also have constraints: $\sum_i \alpha_i y_i = 0$ & $\alpha_i \geq 0$
• Above $L_D$ must be maximized! convex duality… also qp() (see the dual quadprog sketch below)
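Again not from the slides, a minimal sketch of the hard-margin dual in quadprog form (reuses X, y, N from the primal sketch above; quadprog minimizes, so the dual $L_D$ is negated):

```matlab
% Hard-margin dual SVM via quadprog: solve for N alphas.
Q   = (y * y') .* (X * X');          % Q_ij = y_i * y_j * x_i' * x_j
f   = -ones(N, 1);                   % from maximizing sum_i alpha_i
Aeq = y';  beq = 0;                  % sum_i alpha_i * y_i = 0
lb  = zeros(N, 1);                   % alpha_i >= 0 (no upper bound yet)

alpha = quadprog(Q, f, [], [], Aeq, beq, lb, []);
w     = X' * (alpha .* y);           % w = sum_i alpha_i * y_i * x_i
```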
SVM Dual Solution Properties
• We have the dual convex program:
$$\max_\alpha \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \ \& \ \ \alpha_i \geq 0$$
• Solve for N alphas (one per data point) instead of D w’s
• Still convex (qp) so unique solution, gives alphas
• Alphas can be used to get w: $w = \sum_i \alpha_i y_i x_i$
• Support Vectors: have non-zero alphas (shown with thicker circles) and all live on the margin: $w^T x_i + b = \pm 1$
• Solution is sparse: most alphas = 0, and these are non-support vectors. The SVM ignores them: if they move (without crossing the margin) or if they are deleted, the SVM doesn’t change (stays robust)
SVM Dual Solution Properties
• Primal & Dual Illustration [figure]
• Recall we could get w from the alphas: $w = \sum_i \alpha_i y_i x_i$
• Or could use the alphas as is: $f(x) = \mathrm{sign}(x^T w + b) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, x^T x_i + b \right)$
• Karush-Kuhn-Tucker Conditions (KKT): solve for the value of b. On the margin (for nonzero alphas) the constraints $w^T x_i + b \geq \pm 1$ are tight, i.e. $w^T x_i + b = y_i$, so using the known w, compute b for each support vector and then average (see the sketch below):
$$b_i = y_i - w^T x_i \quad \forall\, i: \alpha_i > 0, \qquad b = \text{average}(b_i)$$
• Sparsity (few nonzero alphas) is useful for several reasons
• Means the SVM only uses some of the training data to learn
• Should help improve its ability to generalize to test data
• Computationally faster when using the final learned classifier
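A minimal sketch of this recovery step, continuing the dual quadprog example above (the 1e-6 tolerance for "nonzero" is an arbitrary choice):

```matlab
% Recover the solution from the dual alphas.
sv  = find(alpha > 1e-6);            % indices of the support vectors
w   = X' * (alpha .* y);             % w = sum_i alpha_i * y_i * x_i
b_i = y(sv) - X(sv, :) * w;          % KKT: b_i = y_i - w'*x_i on the margin
b   = mean(b_i);                     % average over the support vectors

predict = @(xnew) sign(xnew * w + b);  % linear SVM classifier (xnew is 1 x D)
```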
Non-Separable SVMs
• What happens when the data is non-separable? There is no solution and the convex hull shrinks to nothing
• Not all constraints can be resolved; their alphas go to $\infty$
• Instead of perfectly classifying each point, we “relax” the problem with (positive) slack variables $\xi_i$ that allow data to (sometimes) fall on the wrong side, for example $w^T x_i + b \geq -0.03$ if $y_i = +1$ instead of requiring $y_i\,(w^T x_i + b) \geq 1$
• New constraints:
$$w^T x_i + b \geq +1 - \xi_i \ \text{ if } y_i = +1, \qquad w^T x_i + b \leq -1 + \xi_i \ \text{ if } y_i = -1, \qquad \text{where } \xi_i \geq 0$$
• But too much $\xi_i$ means too much slack, so penalize it (see the slack-variable quadprog sketch below):
$$L_P: \ \min \ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w^T x_i + b) - 1 + \xi_i \geq 0$$
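A minimal sketch of this relaxed primal in quadprog form, stacking the variables as z = [w; b; ξ] (C = 1 is just an arbitrary illustrative choice; X and y are as in the earlier sketches):

```matlab
% Soft-margin primal SVM via quadprog: variables z = [w; b; xi].
C = 1;                                       % arbitrary slack penalty
[N, D] = size(X);

H    = blkdiag(eye(D), 0, zeros(N));         % only ||w||^2 is quadratic
f    = [zeros(D + 1, 1); C * ones(N, 1)];    % + C * sum_i xi_i
A    = -[diag(y) * X, y, eye(N)];            % y_i*(w'*x_i + b) - 1 + xi_i >= 0
bvec = -ones(N, 1);
lb   = [-inf(D + 1, 1); zeros(N, 1)];        % xi_i >= 0; w and b are free

z  = quadprog(H, f, A, bvec, [], [], lb, []);
w  = z(1:D);  b = z(D+1);  xi = z(D+2:end);
```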
Non-Separable SVMs
• This new problem is still convex, still qp()!
• The user chooses the scalar C (or cross-validates it), which controls how much slack $\xi_i$ to use (how non-separable the data is allowed to be) and how robust the SVM is to outliers or bad points on the wrong side
$$L_P: \ \min \ \underbrace{\tfrac{1}{2}\|w\|^2}_{\text{large margin}} + \underbrace{C \sum_i \xi_i}_{\text{low slack}} - \underbrace{\sum_i \alpha_i \left( y_i\,(w^T x_i + b) - 1 + \xi_i \right)}_{\text{on right side}} - \underbrace{\sum_i \beta_i \xi_i}_{\text{for } \xi \text{ positivity}}$$
$$\frac{\partial L_P}{\partial b} \ \text{and} \ \frac{\partial L_P}{\partial w} \ \text{as before}\ldots \qquad \frac{\partial L_P}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \alpha_i = C - \beta_i, \ \text{but } \alpha_i, \beta_i \geq 0 \ \therefore \ 0 \leq \alpha_i \leq C$$
• Can now write the dual problem (to maximize):
$$L_D: \ \max_\alpha \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0 \ \text{ and } \ \alpha_i \in [0, C]$$
• Same dual as before but the alphas can’t grow beyond C (only the upper bound changes in the quadprog call, as sketched below)
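In the dual quadprog sketch from earlier, the only change is the box constraint on the alphas (Q, f, Aeq, beq, lb are as defined there):

```matlab
% Soft-margin dual: identical to the hard-margin dual quadprog call,
% except each alpha is now bounded above by C (0 <= alpha_i <= C).
ub    = C * ones(N, 1);
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, ub);
```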
Non-Separable SVMs
• As we try to enforce a classification for a non-separable data point, its Lagrange multiplier alpha keeps growing endlessly
• Clamping alpha to stop growing at C makes the SVM “give up” on those non-separable points
• The dual program is now solved as before, with the extra constraints that the alphas are positive AND less than C… this gives the alphas… and from the alphas we get $w = \sum_i \alpha_i y_i x_i$
• Karush-Kuhn-Tucker Conditions (KKT): solve for the value of b on the margin using the alphas that are nonzero AND not equal to C; for those support vectors, assume $\xi_i = 0$ and use the formula $y_i\,(w^T x_i + b_i) - 1 + \xi_i = 0$ to get each $b_i$, then $b = \text{average}(b_i)$
• Mechanical analogy: support vector forces & torques
Nonlinear SVMs
• What if the problem is not linear? We can use our old trick…
• Map d-dimensional x data from L-space to a high dimensional H (Hilbert) feature-space via basis functions Φ(x)
• For example, quadratic classifier: $x_i \rightarrow \Phi(x_i)$ via
$$\Phi(x) = \begin{bmatrix} x \\ \mathrm{vec}(x x^T) \end{bmatrix}$$
• Call the phi’s feature vectors computed from the original x inputs
• Replace all x’s in the SVM equations with phi’s
• Now solve the following learning problem:
$$L_D: \ \max_\alpha \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \phi(x_i)^T \phi(x_j) \quad \text{s.t.} \quad \alpha_i \in [0, C], \ \ \sum_i \alpha_i y_i = 0$$
• Which gives a nonlinear classifier in the original space (see the feature-map sketch below):
$$f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b \right)$$
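A minimal sketch of this explicit quadratic feature map; the dual quadprog setup from earlier is reused unchanged except that X*X' is replaced by Phi*Phi':

```matlab
% Expand each input into Phi(x) = [x; vec(x*x')] and form the dual Hessian.
Phi = zeros(N, D + D^2);
for i = 1:N
    xi = X(i, :)';
    Phi(i, :) = [xi; reshape(xi * xi', [], 1)]';   % [x; vec(x x^T)]
end
Q = (y * y') .* (Phi * Phi');        % inner products now in feature space
```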
Kernels (see http://www.youtube.com/watch?v=3liCbRZPrZA)
• One important aspect of SVMs: all the math involves only the inner products between the phi features!
$$f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, \phi(x)^T \phi(x_i) + b \right)$$
• Replace all inner products with a general kernel function
• Mercer kernel: accepts 2 inputs and outputs a scalar via:
$$k(x, x') = \langle \phi(x), \phi(x') \rangle = \begin{cases} \phi(x)^T \phi(x') & \text{for finite } \phi \\ \int_t \phi(x, t)\, \phi(x', t)\, dt & \text{otherwise} \end{cases}$$
• Example: quadratic polynomial (verified numerically in the sketch below)
$$\phi(x) = \begin{bmatrix} x_1^2 & \sqrt{2}\, x_1 x_2 & x_2^2 \end{bmatrix}^T$$
$$k(x, x') = \phi(x)^T \phi(x') = x_1^2 x_1'^2 + 2\, x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 = \left( x_1 x_1' + x_2 x_2' \right)^2$$
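A quick numerical check of this identity, a minimal MATLAB sketch with arbitrary example points:

```matlab
% phi(x)'*phi(xp) equals (x'*xp)^2 for the 2-d quadratic feature map above.
phi = @(x) [x(1)^2; sqrt(2) * x(1) * x(2); x(2)^2];
x  = [1; 2];  xp = [3; -1];              % arbitrary 2-d example points
disp([phi(x)' * phi(xp), (x' * xp)^2])   % both print 1
```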
Kernels
• Sometimes, many Φ(x) will produce the same k(x,x’)
• Sometimes k(x,x’) is computable but the features are huge or infinite!
• Example: polynomials. With an explicit polynomial mapping, the feature space Φ(x) is huge:
for d-dimensional data and a p-th order polynomial, $\dim(\mathcal{H}) = \binom{d + p - 1}{p}$
images of size 16x16 with p=4 have dim(H) = 183 million (see the sketch below)
but can equivalently just use the kernel: $k(x, y) = (x^T y)^p$
• Multinomial Theorem ($w_r$ = weight on term):
$$k(x, x') = (x^T x')^p = \left( \sum_i x_i x_i' \right)^p \propto \sum_r \frac{p!}{r_1!\, r_2!\, r_3! \cdots (p - r_1 - r_2 - \cdots)!}\; x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\; x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d}$$
$$\propto \sum_r \left( w_r\, x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d} \right) \left( w_r\, x_1'^{r_1} x_2'^{r_2} \cdots x_d'^{r_d} \right) \propto \phi(x)^T \phi(x') \qquad \text{Equivalent!}$$
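A one-line check of the feature-space size claim (MATLAB sketch):

```matlab
% A 16x16 image has d = 256 pixels; with a 4th-order polynomial the number
% of degree-4 monomials is nchoosek(d + p - 1, p), about 183 million.
d = 16 * 16;  p = 4;
dimH = nchoosek(d + p - 1, p)        % 183,181,376
```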
Kernels
• Replace each $x_i^T x_j \rightarrow k(x_i, x_j)$, for example:
P-th Order Polynomial Kernel: $k(x, x') = (x^T x' + 1)^p$
RBF Kernel (infinite!): $k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|^2 \right)$
Sigmoid (hyperbolic tan) Kernel: $k(x, x') = \tanh(\kappa\, x^T x' - \delta)$
• Using kernels we get the generalized inner product SVM:
$$L_D: \ \max_\alpha \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, k(x_i, x_j) \quad \text{s.t.} \quad \alpha_i \in [0, C], \ \ \sum_i \alpha_i y_i = 0$$
$$f(x) = \mathrm{sign}\left( \sum_i \alpha_i y_i\, k(x_i, x) + b \right)$$
• Still a qp solver, just use the Gram matrix K (positive definite), $K_{ij} = k(x_i, x_j)$:
$$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & k(x_1, x_3) \\ k(x_1, x_2) & k(x_2, x_2) & k(x_2, x_3) \\ k(x_1, x_3) & k(x_2, x_3) & k(x_3, x_3) \end{bmatrix}$$
Kernelized SVMs
• Polynomial kernel: $k(x_i, x_j) = (x_i^T x_j + 1)^p$
• Radial basis function kernel: $k(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \|x_i - x_j\|^2 \right)$
[Figure: decision boundaries learned with a Polynomial Kernel and an RBF kernel]
• Least-squares, logistic-regression, perceptron are also kernelizable (a full kernelized SVM sketch follows)
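Putting the pieces together, a compact kernelized soft-margin SVM sketch (RBF kernel; sigma and C are arbitrary illustrative values, X is N x D with labels y in {-1,+1}, and quadprog from the Optimization Toolbox is assumed):

```matlab
% Kernelized soft-margin SVM via quadprog.
sigma = 1;  C = 1;                              % illustrative hyperparameters
[N, D] = size(X);
sq = sum(X.^2, 2);                              % ||x_i||^2 for each row
D2 = bsxfun(@plus, sq, sq') - 2 * (X * X');     % squared pairwise distances
K  = exp(-D2 / (2 * sigma^2));                  % RBF Gram matrix K_ij = k(x_i,x_j)

alpha = quadprog((y * y') .* K, -ones(N, 1), [], [], ...
                 y', 0, zeros(N, 1), C * ones(N, 1));

sv = find(alpha > 1e-6 & alpha < C - 1e-6);     % on-margin support vectors
b  = mean(y(sv) - K(sv, :) * (alpha .* y));     % KKT: average b over them

% Predict a new point xnew (1 x D) without ever forming phi explicitly.
xnew  = zeros(1, D);                            % e.g. the origin as a query
kx    = exp(-(sum(xnew.^2) + sq' - 2 * xnew * X') / (2 * sigma^2));
ypred = sign(kx * (alpha .* y) + b);
```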
SVM Demo
• SVM Demo by Steve Gunn: http://www.isis.ecs.soton.ac.uk/resources/svminfo/
• In svc.m replace [alpha lambda how] = qp(…); with [alpha lambda how] = quadprog(H,c,[],[],A,b,vlb,vub,x0);
• This replaces the old Matlab command qp (quadratic programming) with quadprog, its replacement in more recent Matlab versions