Source: cse.msu.edu/~cse802/s17/slides/lec_17_mar27_svm.pdf
TRANSCRIPT
Support Vector Machine and Kernel Methods
Jiayu Zhou
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI USA
February 26, 2017
Which Separator Do You Pick?
Robustness to Noisy Data
Being robust to noise (measurement error) is good (remember regularization).
Thicker Cushion Means More Robustness
We call such hyperplanes fat
Two Crucial Questions
1 Can we efficiently find the fattest separating hyperplane?
2 Is a fatter hyperplane better than a thin one?
Pulling Out the Bias
Before:
x ∈ {1} × R^d, w ∈ R^{d+1}
x = [1, x_1, ..., x_d]^T, w = [w_0, w_1, ..., w_d]^T
signal = w^T x

After:
x ∈ R^d, w ∈ R^d, bias b ∈ R
x = [x_1, ..., x_d]^T, w = [w_1, ..., w_d]^T
signal = w^T x + b
Separating The Data
Hyperplane h = (b, w). h separates the data means:
y_n(w^T x_n + b) > 0
By rescaling the weights and bias,
min_{n=1,...,N} y_n(w^T x_n + b) = 1
Distance to the Hyperplane
w is normal to the hyperplane (why?): w^T(x_2 − x_1) = w^T x_2 − w^T x_1 = −b + b = 0
Scalar projection: a^T b = ∥a∥ ∥b∥ cos(a, b) ⇒ a^T b / ∥b∥ = ∥a∥ cos(a, b)
Let x_⊥ be the orthogonal projection of x onto h; the distance to the hyperplane is given by the projection of x − x_⊥ onto w (why?):
dist(x, h) = (1/∥w∥) · |w^T x − w^T x_⊥| = (1/∥w∥) · |w^T x + b|
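As a quick numeric check of this formula (a minimal sketch, not from the slides; the hyperplane values are taken from the toy example solved a few slides later):

```python
import numpy as np

# Hypothetical numbers: hyperplane with w = (1, -1), b = -1 and a query point x = (3, 0).
w = np.array([1.0, -1.0])
b = -1.0
x = np.array([3.0, 0.0])

dist = abs(w @ x + b) / np.linalg.norm(w)   # dist(x, h) = |w^T x + b| / ||w||
print(dist)                                 # 2 / sqrt(2) ~ 1.414
```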
Fatness of a Separating Hyperplane
dist(x_n, h) = (1/∥w∥) · |w^T x_n + b| = (1/∥w∥) · |y_n(w^T x_n + b)| = (1/∥w∥) · y_n(w^T x_n + b)

Fatness = distance to the closest point:
Fatness = min_n dist(x_n, h) = (1/∥w∥) min_n y_n(w^T x_n + b) = 1/∥w∥
Maximizing the Margin
Formal definition of margin: γ(h) = 1/∥w∥
NOTE: Bias b does not appear in the margin.

Objective maximizing the margin:
min_{b,w} (1/2) w^T w
subject to: min_{n=1,...,N} y_n(w^T x_n + b) = 1

An equivalent objective:
min_{b,w} (1/2) w^T w
subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N
Example - Our Toy Data Set
min_{b,w} (1/2) w^T w
subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Training Data:
X = [0 0; 2 2; 2 0; 3 0]  (rows are data points), y = [−1, −1, +1, +1]^T

What is the margin?
Example - Our Toy Data Set
min_{b,w} (1/2) w^T w
subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

X = [0 0; 2 2; 2 0; 3 0], y = [−1, −1, +1, +1]^T ⇒
(1): −b ≥ 1
(2): −(2w_1 + 2w_2 + b) ≥ 1
(3): 2w_1 + b ≥ 1
(4): 3w_1 + b ≥ 1

(1) + (3) → w_1 ≥ 1
(2) + (3) → w_2 ≤ −1
⇒ (1/2) w^T w = (1/2)(w_1² + w_2²) ≥ 1

Thus: w_1 = 1, w_2 = −1, b = −1
Example - Our Toy Data Set
Given data X = [0 0; 2 2; 2 0; 3 0]
Optimal solution: w* = [w_1 = 1, w_2 = −1]^T, b* = −1
Optimal hyperplane: g(x) = sign(x_1 − x_2 − 1)
margin: 1/∥w∥ = 1/√2 ≈ 0.707
For data points (1), (2) and (3): y_n(x_n^T w* + b*) = 1 → Support Vectors
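A quick sanity check of this by-hand solution (my own numpy sketch, not part of the slides): verify y_n(w*^T x_n + b*) ≥ 1, see which constraints are tight (the support vectors), and compute the margin 1/∥w*∥.

```python
import numpy as np

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)
w, b = np.array([1.0, -1.0]), -1.0

activations = y * (X @ w + b)          # y_n (w^T x_n + b) for each point
print(activations)                     # [1. 1. 1. 2.]
print(np.isclose(activations, 1.0))    # tight constraints: points (1), (2), (3) are support vectors
print(1.0 / np.linalg.norm(w))         # margin = 1/sqrt(2) ~ 0.707
```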
Solver: Quadratic Programming
min_{u ∈ R^q} (1/2) u^T Q u + p^T u
subject to: A u ≥ c

u* ← QP(Q, p, A, c)

(Q = 0 is linear programming.)
http://cvxopt.org/examples/tutorial/qp.html
Maximum Margin Hyperplane is QP
min_{b,w} (1/2) w^T w
subject to: y_n(w^T x_n + b) ≥ 1, ∀n

min_{u ∈ R^q} (1/2) u^T Q u + p^T u
subject to: A u ≥ c

u = [b; w] ∈ R^{d+1} ⇒
(1/2) w^T w = (1/2) [b, w^T] [0 0_d^T; 0_d I_d] [b; w] = (1/2) u^T [0 0_d^T; 0_d I_d] u
Q = [0 0_d^T; 0_d I_d], p = 0_{d+1}

y_n(w^T x_n + b) ≥ 1 ⇔ [y_n, y_n x_n^T] u ≥ 1 ⇒
[y_1 y_1 x_1^T; ...; y_N y_N x_N^T] u ≥ [1; ...; 1]
A = [y_1 y_1 x_1^T; ...; y_N y_N x_N^T], c = [1; ...; 1]
Back To Our Example
Exercise:
X = [0 0; 2 2; 2 0; 3 0], y = [−1, −1, +1, +1]^T
(1): −b ≥ 1
(2): −(2w_1 + 2w_2 + b) ≥ 1
(3): 2w_1 + b ≥ 1
(4): 3w_1 + b ≥ 1
Show the corresponding Q, p, A, c.

Q = [0 0 0; 0 1 0; 0 0 1], p = [0; 0; 0], A = [−1 0 0; −1 −2 −2; 1 2 0; 1 3 0], c = [1; 1; 1; 1]
Use your QP-solver to give
u* = [b*, w_1*, w_2*]^T = [−1, 1, −1]^T
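To actually run this exercise, here is a minimal sketch using the CVXOPT package linked on the QP slide (assumptions: cvxopt and numpy are installed). CVXOPT's solvers.qp minimizes (1/2)u^T P u + q^T u subject to G u ≤ h, so the constraint A u ≥ c is passed as −A u ≤ −c.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

# Q, p, A, c exactly as on this slide, with u = [b, w1, w2].
Q = np.diag([0.0, 1.0, 1.0])
p = np.zeros((3, 1))
A = np.array([[-1.0,  0.0,  0.0],
              [-1.0, -2.0, -2.0],
              [ 1.0,  2.0,  0.0],
              [ 1.0,  3.0,  0.0]])
c = np.ones((4, 1))

# CVXOPT wants G u <= h, so flip the sign of "A u >= c".
sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))
print(np.array(sol['x']).ravel())   # expected: approximately [-1.  1. -1.]
```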
Primal QP algorithm for linear-SVM
1. Let p = 0_{d+1} be the (d+1)-vector of zeros and c = 1_N the N-vector of ones. Construct matrices Q and A, where
   A = [y_1 y_1 x_1^T; ...; y_N y_N x_N^T], Q = [0 0_d^T; 0_d I_d]
2. Return [b*; w*] = u* ← QP(Q, p, A, c).
3. The final hypothesis is g(x) = sign(x^T w* + b*).
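These three steps translate almost directly into code; a hedged sketch (the function name and the CVXOPT translation are my own, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_primal(X, y):
    """Hard-margin linear SVM via the primal QP. X: (N, d) floats, y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                                    # Q = [[0, 0_d^T], [0_d, I_d]]
    p = np.zeros((d + 1, 1))                                 # p = 0_{d+1}
    A = np.hstack([y.reshape(-1, 1), y.reshape(-1, 1) * X]).astype(float)  # rows [y_n, y_n x_n^T]
    c = np.ones((N, 1))                                      # c = 1_N
    # CVXOPT solves min (1/2)u^T Q u + p^T u s.t. G u <= h; encode A u >= c as -A u <= -c.
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))
    u = np.array(sol['x']).ravel()
    return u[0], u[1:]                                       # b*, w*

# Final hypothesis: g = lambda x, b, w: np.sign(x @ w + b)
```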
Link to Regularization
min_w E_in(w)
subject to: w^T w ≤ C

              optimal hyperplane    regularization
minimize      w^T w                 E_in
subject to    E_in = 0              w^T w ≤ C
How to Handle Non-Separable Data?
(a) Tolerate noisy data points: soft-margin SVM.
(b) Inherent nonlinear boundary: non-linear transformation.
Non-Linear Transformation
Φ_1(x) = (x_1, x_2)
Φ_2(x) = (x_1, x_2, x_1², x_1 x_2, x_2²)
Φ_3(x) = (x_1, x_2, x_1², x_1 x_2, x_2², x_1³, x_1² x_2, x_1 x_2², x_2³)
Non-Linear Transformation
Using the nonlinear transform with the optimal hyperplane: apply a transform Φ: R^d → R^d̃,
z_n = Φ(x_n)
Solve the hard-margin SVM in the Z-space for (w̃*, b̃*):
min_{b̃,w̃} (1/2) w̃^T w̃
subject to: y_n(w̃^T z_n + b̃) ≥ 1, ∀n
Final hypothesis:
g(x) = sign(w̃*^T Φ(x) + b̃*)
SVM and non-linear transformation
The margin is shaded in yellow, and the support vectors are boxed.
For Φ_2, d̃_2 = 5 and for Φ_3, d̃_3 = 9. d̃_3 is nearly double d̃_2, yet the resulting SVM separator is not severely overfitting with Φ_3 (regularization?).
Support Vector Machine Summary
A very powerful, easy-to-use linear model which comes with automatic regularization.
Fully exploit SVM: Kernel
Potential robustness to overfitting even after transforming to a much higher dimension.
How about infinite-dimensional transforms?
Kernel Trick
SVM Dual: Formulation
Primal and dual in optimization.
The dual view of SVM enables us to exploit the kernel trick.
In the primal SVM problem we solve for w ∈ R^d and b, while in the dual problem we solve for α ∈ R^N.
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

which is also a QP problem.
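A hedged CVXOPT translation of this dual (my own sketch, not from the slides): flip the max into a min, put Σ_n y_n α_n = 0 in the equality slot, and express α ≥ 0 as −α ≤ 0.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Hard-margin dual SVM. X: (N, d), y: (N,) in {-1, +1}. Returns the optimal alpha."""
    N = X.shape[0]
    yX = y.reshape(-1, 1).astype(float) * X
    Q = yX @ yX.T                                                # Q_nm = y_n y_m x_n^T x_m
    P, q = matrix(Q), matrix(-np.ones((N, 1)))                   # min (1/2) a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros((N, 1)))          # -alpha <= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)   # y^T alpha = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```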
SVM Dual: Prediction
We can obtain the primal solution:
w* = Σ_{n=1}^N y_n α_n* x_n
where for support vectors α_n* > 0.
The optimal hypothesis:
g(x) = sign(w*^T x + b*) = sign(Σ_{n=1}^N y_n α_n* x_n^T x + b*) = sign(Σ_{α_n* > 0} y_n α_n* x_n^T x + b*)
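The slides do not spell out how b* is recovered; a standard choice (my assumption, consistent with the support-vector condition y_s(w*^T x_s + b*) = 1) is b* = y_s − w*^T x_s for any support vector x_s. A small sketch continuing from the svm_dual function above:

```python
import numpy as np

def primal_from_dual(alpha, X, y, tol=1e-6):
    """Recover (w*, b*) from the dual solution alpha."""
    w = (alpha * y) @ X                   # w* = sum_n y_n alpha_n* x_n
    sv = alpha > tol                      # support vectors: alpha_n* > 0
    b = np.mean(y[sv] - X[sv] @ w)        # b* = y_s - w*^T x_s, averaged over SVs for stability
    return w, b

# Prediction: g(x) = sign(w @ x + b)
```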
Dual SVM: Summary
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

w* = Σ_{n=1}^N y_n α_n* x_n
Common SVM Basis Functions
z_k = polynomial terms of x_k of degree 1 to q
z_k = radial basis functions of x_k: z_k(j) = φ_j(x_k) = exp(−|x_k − c_j|² / σ²)
z_k = sigmoid functions of x_k
Quadratic Basis Functions
Φ(x) = (1, √2 x_1, ..., √2 x_d, x_1², ..., x_d², √2 x_1 x_2, ..., √2 x_1 x_d, √2 x_2 x_3, ..., √2 x_{d−1} x_d)^T

Including the constant term, linear terms, pure quadratic terms, and quadratic cross-terms.
The number of terms is approximately d²/2.
You may be wondering what those √2's are doing. You'll find out why they're there soon.
Dual SVM: Non-linear Transformation
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

Need to prepare a matrix Q, Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).
Cost? We must do N²/2 dot products to get this matrix ready. Each dot product requires d²/2 additions and multiplications, so the whole thing costs N²d²/4.
Quadratic Dot Products
Φ(a)^T Φ(b) = (1, √2 a_1, ..., √2 a_d, a_1², ..., a_d², √2 a_1 a_2, ..., √2 a_1 a_d, √2 a_2 a_3, ..., √2 a_{d−1} a_d)^T (1, √2 b_1, ..., √2 b_d, b_1², ..., b_d², √2 b_1 b_2, ..., √2 b_1 b_d, √2 b_2 b_3, ..., √2 b_{d−1} b_d)

Constant term: 1
Linear terms: Σ_{i=1}^d 2 a_i b_i
Pure quadratic terms: Σ_{i=1}^d a_i² b_i²
Quadratic cross-terms: Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j
Quadratic Dot Product
Does Φ(a)^T Φ(b) look familiar?

Φ(a)^T Φ(b) = 1 + 2 Σ_{i=1}^d a_i b_i + Σ_{i=1}^d a_i² b_i² + Σ_{i=1}^d Σ_{j=i+1}^d 2 a_i a_j b_i b_j

Try this: (a^T b + 1)²
(a^T b + 1)² = (a^T b)² + 2 a^T b + 1
            = (Σ_{i=1}^d a_i b_i)² + 2 Σ_{i=1}^d a_i b_i + 1
            = Σ_{i=1}^d Σ_{j=1}^d a_i b_i a_j b_j + 2 Σ_{i=1}^d a_i b_i + 1
            = Σ_{i=1}^d a_i² b_i² + 2 Σ_{i=1}^d Σ_{j=i+1}^d a_i a_j b_i b_j + 2 Σ_{i=1}^d a_i b_i + 1
They’re the same! And this is only O(d) to compute!
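A numeric check of this identity (my own sketch, not from the slides): build the explicit quadratic Φ with the √2 factors from the earlier slide and compare Φ(a)^T Φ(b) against (a^T b + 1)².

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit quadratic feature map: (1, sqrt(2)x_i, x_i^2, sqrt(2)x_i x_j for i < j)."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
print(phi_quadratic(a) @ phi_quadratic(b))   # explicit feature-space dot product
print((a @ b + 1) ** 2)                      # kernel shortcut; the two values agree
```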
Dual SVM: Non-linear Transformation
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

w* = Σ_{n=1}^N y_n α_n* Φ(x_n)

Need to prepare a matrix Q, Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).
Cost? We must do N²/2 dot products to get this matrix ready. Each dot product requires d additions and multiplications.
Higher Order Polynomials
Using explicit Φ(x):                     Cost          d = 100
  Quadratic   d²/2 terms                 d²N²/4        2.5k N²
  Cubic       d³/6 terms                 d³N²/12       83k N²
  Quartic     d⁴/24 terms                d⁴N²/48       1.96m N²

Using the kernel Φ(a)^T Φ(b):            Cost          d = 100
  Quadratic   (a^T b + 1)²               dN²/2         50 N²
  Cubic       (a^T b + 1)³               dN²/2         50 N²
  Quartic     (a^T b + 1)⁴               dN²/2         50 N²
Dual SVM with Quintic basis functions
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m), with Φ(x_n)^T Φ(x_m) = (x_n^T x_m + 1)⁵
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

Classification:
g(x) = sign(w*^T Φ(x) + b*) = sign(Σ_{α_n* > 0} y_n α_n* Φ(x_n)^T Φ(x) + b*) = sign(Σ_{α_n* > 0} y_n α_n* (x_n^T x + 1)⁵ + b*)
Dual SVM with general kernel functions
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m K(x_n, x_m)
subject to: Σ_{n=1}^N y_n α_n = 0, α_n ≥ 0, ∀n

Classification:
g(x) = sign(w*^T Φ(x) + b*) = sign(Σ_{α_n* > 0} y_n α_n* Φ(x_n)^T Φ(x) + b*) = sign(Σ_{α_n* > 0} y_n α_n* K(x_n, x) + b*)
Kernel Tricks
Replacing the dot product with a kernel function.
Not all functions are kernel functions!
  Need to be decomposable: K(a, b) = Φ(a)^T Φ(b)
  Could K(a, b) = (a − b)³ be a kernel function?
  Could K(a, b) = (a − b)⁴ − (a + b)² be a kernel function?
Mercer's condition
  To expand a kernel function K(a, b) into a dot product, i.e., K(a, b) = Φ(a)^T Φ(b), K(a, b) has to be a positive semi-definite function.
  The kernel matrix K is always symmetric PSD for any given x_1, ..., x_N.
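A quick numeric illustration of Mercer's condition (a sketch of my own, not from the slides): build the Gaussian-RBF kernel matrix on random points and check that it is symmetric with non-negative eigenvalues (up to floating-point tolerance).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = 0.5

# K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # PSD: eigenvalues >= 0 numerically
```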
Kernel Design: expression kernel
mRNA expression data:
Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.
(Figure: examples of similar vs. dissimilar expression profiles.)
Kernel:
K(x, y) = Σ_i x_i y_i / (√(Σ_i x_i x_i) · √(Σ_i y_i y_i))
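This expression kernel is just cosine similarity between two expression profiles; a one-line numpy sketch (mine, not from the slides):

```python
import numpy as np

def expression_kernel(x, y):
    """K(x, y) = sum_i x_i y_i / (sqrt(sum_i x_i x_i) * sqrt(sum_i y_i y_i)), i.e. cosine similarity."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```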
Kernel Design: sequence kernel
Work with non-vectorial data
Scalar product on a pair of variable-length, discrete strings?

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
Commonly Used SVM Kernel Functions
K(a, b) = (α · a^T b + β)^Q is an example of an SVM kernel function.
Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function.
Radial-basis-style (RBF) / Gaussian kernel function: K(a, b) = exp(−γ ∥a − b∥²)
Sigmoid functions
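For concreteness, the kernel families listed above written as plain numpy functions (a sketch; the parameter names and default values are illustrative, not from the slides):

```python
import numpy as np

def poly_kernel(a, b, alpha=1.0, beta=1.0, Q=2):
    """Polynomial kernel (alpha * a^T b + beta)^Q."""
    return (alpha * (a @ b) + beta) ** Q

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian / RBF kernel exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sigmoid_kernel(a, b, kappa=1.0, theta=0.0):
    """Sigmoid-style kernel tanh(kappa * a^T b + theta); PSD only for some parameter choices."""
    return np.tanh(kappa * (a @ b) + theta)
```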
2nd Order Polynomial Kernel
K(a, b) = (α · a^T b + β)²
Gaussian Kernels
K(a, b) = exp(−γ ∥a − b∥²)

When γ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small γ, the sophisticated boundary discovered by SVM with the Gaussian-RBF kernel looks quite good.
Gaussian Kernels
(a) shows a noisy data set where a linear classifier appears to work quite well; (b) shows that using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.
From hard-margin to soft-margin
When there are outliers, the hard-margin SVM with a Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.
Remedy: a soft formulation that allows small violations of the margin or even some classification errors.
Soft-margin: introduce a margin violation ε_n ≥ 0 for each data point (x_n, y_n) and require that
y_n(w^T x_n + b) ≥ 1 − ε_n
ε_n captures by how much (x_n, y_n) fails to be separated.
Soft-Margin SVM
We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

min_{b,w,ε} (1/2) w^T w + C Σ_{n=1}^N ε_n
subject to: y_n(w^T x_n + b) ≥ 1 − ε_n for n = 1, ..., N
            ε_n ≥ 0 for n = 1, ..., N

The meaning of C?
When C is large, it means we care more about violating the margin, which gets us closer to the hard-margin SVM.
When C is small, on the other hand, we care less about violating the margin.
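In practice C is the same trade-off parameter exposed by off-the-shelf SVM solvers; a hedged scikit-learn sketch on the earlier toy data (assumes scikit-learn is installed; not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1])

# Large C approximates the hard-margin SVM; small C tolerates more margin violations.
for C in (1e6, 0.01):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.coef_, clf.intercept_, clf.support_)
```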
Soft Margin Example
Soft Margin and Hard Margin
min_{b,w,ε} (1/2) w^T w + C Σ_{n=1}^N ε_n   (first term: margin; second term: error tolerance)
subject to: y_n(w^T x_n + b) ≥ 1 − ε_n, ε_n ≥ 0, ∀n
The Hinge Loss
The trade-off sounds very similar, right?
We have ε_n ≥ 0, and that y_n(w^T x_n + b) ≥ 1 − ε_n ⇒ ε_n ≥ 1 − y_n(w^T x_n + b)

The SVM loss (aka hinge loss) function:
E_SVM(b, w) = (1/N) Σ_{n=1}^N max(1 − y_n(w^T x_n + b), 0)

The soft-margin SVM can be re-written as the following optimization problem:
min_{b,w} E_SVM(b, w) + λ w^T w
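A small numpy illustration of the hinge-loss objective (my own sketch; the value of λ is illustrative):

```python
import numpy as np

def svm_objective(w, b, X, y, lam=0.01):
    """E_SVM(b, w) + lambda * w^T w with the average hinge loss."""
    margins = y * (X @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0).mean()
    return hinge + lam * (w @ w)

# Example with the toy data and the hard-margin solution w = (1, -1), b = -1:
X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)
print(svm_objective(np.array([1.0, -1.0]), -1.0, X, y))  # hinge term is 0 for this separator
```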
Dual Soft-Margin SVM
max_{α ∈ R^N} Σ_{n=1}^N α_n − (1/2) Σ_{m=1}^N Σ_{n=1}^N α_n α_m y_n y_m x_n^T x_m
subject to: Σ_{n=1}^N y_n α_n = 0, 0 ≤ α_n ≤ C, ∀n

w* = Σ_{n=1}^N y_n α_n* x_n
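Relative to the hard-margin dual sketched earlier, the only change is the box constraint 0 ≤ α_n ≤ C; in the CVXOPT translation that means adding an upper-bound block to G and h (again a hedged sketch; C and N are illustrative):

```python
import numpy as np
from cvxopt import matrix

C, N = 1.0, 4    # illustrative values
# Inequalities: -alpha_n <= 0 and alpha_n <= C, stacked into G alpha <= h.
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.vstack([np.zeros((N, 1)), C * np.ones((N, 1))]))
```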
Summary of Dual SVM
Deliver a large-margin hyperplane, and in so doing it can control the effective model complexity.
Deal with high- or infinite-dimensional transforms using the kernel trick.
Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.
Control the sensitivity to outliers and regularize the solution through setting C appropriately.
Support Vector Machine (closing concept map): a robust classifier via maximum margin. Hard-margin design: primal objective (QP solver in d) and dual objective (QP solver in N, support vectors, kernel trick). Soft-margin design (allow training error, hinge loss): primal objective (QP solver in d) and dual objective (QP solver in N, support vectors, kernel trick).