
Support Vector Machine and Kernel Methods

Jiayu Zhou

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

February 26, 2017


Which Separator Do You Pick?


Robustness to Noisy Data

Being robust to noise (measurement error) is good (remember regularization).


Thicker Cushion Means More Robustness

We call such hyperplanes fat.


Two Crucial Questions

1. Can we efficiently find the fattest separating hyperplane?

2. Is a fatter hyperplane better than a thin one?


Pulling Out the Bias

Before:

    x ∈ {1} × R^d,  w ∈ R^(d+1)

    x = [1, x_1, ..., x_d]^T,  w = [w_0, w_1, ..., w_d]^T

    signal = w^T x

After:

    x ∈ R^d,  w ∈ R^d,  bias b ∈ R

    x = [x_1, ..., x_d]^T,  w = [w_1, ..., w_d]^T

    signal = w^T x + b


Separating The Data

A hyperplane h = (b, w) separates the data when:

    y_n(w^T x_n + b) > 0 for n = 1, ..., N

By rescaling the weights and bias, we can always normalize so that

    min_{n=1,...,N} y_n(w^T x_n + b) = 1


Distance to the Hyperplane

w is normal to the hyperplane (why?): for any two points x_1, x_2 on h,

    w^T(x_2 − x_1) = w^T x_2 − w^T x_1 = −b + b = 0

Scalar projection:

    a^T b = ∥a∥ ∥b∥ cos(a, b)  ⇒  a^T b / ∥b∥ = ∥a∥ cos(a, b)

Let x⊥ be the orthogonal projection of x onto h; the distance to the hyperplane is given by the projection of x − x⊥ onto w (why?):

    dist(x, h) = (1/∥w∥) · |w^T x − w^T x⊥| = (1/∥w∥) · |w^T x + b|
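As a quick numeric sketch of this formula (the hyperplane w = (1, −1), b = −1 is borrowed from the toy example later in these slides; the variable names are ours):

    import numpy as np

    w = np.array([1.0, -1.0])   # normal vector of the hyperplane x_1 - x_2 - 1 = 0
    b = -1.0
    x = np.array([3.0, 0.0])    # a point off the hyperplane

    # dist(x, h) = |w^T x + b| / ||w||
    print(abs(w @ x + b) / np.linalg.norm(w))   # 2/sqrt(2) ≈ 1.414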


Fatness of a Separating Hyperplane

For any data point (x_n, y_n) correctly separated by h,

    dist(x_n, h) = (1/∥w∥) · |w^T x_n + b| = (1/∥w∥) · |y_n(w^T x_n + b)| = (1/∥w∥) · y_n(w^T x_n + b)

Fatness = distance to the closest point:

    Fatness = min_n dist(x_n, h) = (1/∥w∥) min_n y_n(w^T x_n + b) = 1/∥w∥

where the last step uses the normalization min_n y_n(w^T x_n + b) = 1.


Maximizing the Margin

Formal definition of margin:

    margin: γ(h) = 1/∥w∥

NOTE: the bias b does not appear in the margin.

Maximizing γ(h) = 1/∥w∥ is equivalent to minimizing (1/2) w^T w, so the objective for maximizing the margin is:

    min_{b,w} (1/2) w^T w
    subject to: min_{n=1,...,N} y_n(w^T x_n + b) = 1

An equivalent objective:

    min_{b,w} (1/2) w^T w
    subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N


Example - Our Toy Data Set

    min_{b,w} (1/2) w^T w
    subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

Training Data:

    X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ −1, −1, +1, +1 ]^T

What is the margin?


Example - Our Toy Data Set

    min_{b,w} (1/2) w^T w
    subject to: y_n(w^T x_n + b) ≥ 1 for n = 1, ..., N

    X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ −1, −1, +1, +1 ]^T

The constraints, one per data point:

    (1): −b ≥ 1
    (2): −(2w_1 + 2w_2 + b) ≥ 1
    (3): 2w_1 + b ≥ 1
    (4): 3w_1 + b ≥ 1

    (1) + (3) → w_1 ≥ 1
    (2) + (3) → w_2 ≤ −1

    ⇒ (1/2) w^T w = (1/2)(w_1^2 + w_2^2) ≥ 1

Thus: w_1 = 1, w_2 = −1, b = −1.


Example - Our Toy Data Set

Given data

    X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ]

Optimal solution:

    w* = [ w_1 = 1 ; w_2 = −1 ],  b* = −1

Optimal hyperplane:

    g(x) = sign(x_1 − x_2 − 1)

margin:

    1/∥w*∥ = 1/√2 ≈ 0.707

For data points (1), (2), and (3), y_n(x_n^T w* + b*) = 1: these are the Support Vectors.


Solver: Quadratic Programming

    min_{u ∈ R^q} (1/2) u^T Q u + p^T u
    subject to: A u ≥ c

    u* ← QP(Q, p, A, c)

(Q = 0 gives linear programming.)
http://cvxopt.org/examples/tutorial/qp.html
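The cvxopt tutorial linked above states QPs as min (1/2) u^T P u + q^T u subject to G u ≤ h, so a thin wrapper with a sign flip matches the A u ≥ c convention used here. A minimal sketch, assuming cvxopt is installed (solvers.qp is cvxopt's real entry point; the qp helper name is ours):

    import numpy as np
    from cvxopt import matrix, solvers

    def qp(Q, p, A, c):
        # Solve min_u (1/2) u^T Q u + p^T u  subject to  A u >= c.
        # cvxopt's solvers.qp uses G u <= h, hence G = -A and h = -c.
        P = matrix(np.asarray(Q, dtype=float))
        q = matrix(np.asarray(p, dtype=float).reshape(-1, 1))
        G = matrix(-np.asarray(A, dtype=float))
        h = matrix(-np.asarray(c, dtype=float).reshape(-1, 1))
        sol = solvers.qp(P, q, G, h)
        return np.array(sol['x']).ravel()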


Maximum Margin Hyperplane is QP

    min_{b,w} (1/2) w^T w
    subject to: y_n(w^T x_n + b) ≥ 1, ∀n

    min_{u ∈ R^q} (1/2) u^T Q u + p^T u
    subject to: A u ≥ c

Take u = [b ; w] ∈ R^(d+1). Then

    (1/2) w^T w = (1/2) [b, w^T] [ 0 0_d^T ; 0_d I_d ] [b ; w] = (1/2) u^T [ 0 0_d^T ; 0_d I_d ] u

so

    Q = [ 0 0_d^T ; 0_d I_d ],  p = 0_{d+1}

Each constraint y_n(w^T x_n + b) ≥ 1 can be written as [y_n, y_n x_n^T] u ≥ 1, so stacking all N of them:

    A = [ y_1 y_1 x_1^T ; ... ; y_N y_N x_N^T ],  c = [1, ..., 1]^T


Back To Our Example

Exercise:

    X = [ 0 0 ; 2 2 ; 2 0 ; 3 0 ],   y = [ −1, −1, +1, +1 ]^T

    (1): −b ≥ 1
    (2): −(2w_1 + 2w_2 + b) ≥ 1
    (3): 2w_1 + b ≥ 1
    (4): 3w_1 + b ≥ 1

Show the corresponding Q, p, A, c:

    Q = [ 0 0 0 ; 0 1 0 ; 0 0 1 ],  p = [ 0 ; 0 ; 0 ],
    A = [ −1 0 0 ; −1 −2 −2 ; 1 2 0 ; 1 3 0 ],  c = [ 1 ; 1 ; 1 ; 1 ]

Use your QP solver to get

    u* = [b*, w_1*, w_2*]^T = [−1, 1, −1]^T
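Plugging these matrices into the qp helper sketched after the QP slide reproduces the hand-derived solution (a usage sketch under the same assumptions):

    import numpy as np

    Q = np.array([[0., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.]])
    p = np.zeros(3)
    A = np.array([[-1.,  0.,  0.],
                  [-1., -2., -2.],
                  [ 1.,  2.,  0.],
                  [ 1.,  3.,  0.]])
    c = np.ones(4)
    print(np.round(qp(Q, p, A, c), 3))   # [-1.  1. -1.]  ->  b* = -1, w* = (1, -1)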


Primal QP algorithm for linear-SVM

1. Let p = 0_{d+1} be the (d+1)-vector of zeros and c = 1_N the N-vector of ones. Construct matrices Q and A, where

    A = [ y_1 y_1 x_1^T ; ... ; y_N y_N x_N^T ],  Q = [ 0 0_d^T ; 0_d I_d ]

2. Return [b* ; w*] = u* ← QP(Q, p, A, c).

3. The final hypothesis is g(x) = sign(x^T w* + b*).
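As a compact sketch of these three steps in numpy (reusing the hypothetical qp helper from above; this is not the slides' own code):

    import numpy as np

    def train_linear_svm(X, y):
        N, d = X.shape
        Q = np.zeros((d + 1, d + 1))
        Q[1:, 1:] = np.eye(d)                        # penalize w only; b is free
        p = np.zeros(d + 1)
        A = np.hstack([y[:, None], y[:, None] * X])  # rows [y_n, y_n x_n^T]
        c = np.ones(N)
        u = qp(Q, p, A, c)
        return u[0], u[1:]                           # b*, w*

    b, w = train_linear_svm(np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]]),
                            np.array([-1., -1., 1., 1.]))
    # final hypothesis: g(x) = sign(x @ w + b)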


Link to Regularization

    min_w E_in(w)
    subject to: w^T w ≤ C

                   optimal hyperplane       regularization
    minimize:      w^T w                    E_in
    subject to:    E_in = 0                 w^T w ≤ C


How to Handle Non-Separable Data?

(a) Tolerate noisy data points: soft-margin SVM.

(b) Inherent nonlinear boundary: non-linear transformation.


Non-Linear Transformation

    Φ_1(x) = (x_1, x_2)

    Φ_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)

    Φ_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)


Non-Linear Transformation

Use the optimal hyperplane together with a nonlinear transform Φ: R^d → R^d̃:

    z_n = Φ(x_n)

Solve the hard-margin SVM in the Z-space for (w̃*, b̃*):

    min_{b̃,w̃} (1/2) w̃^T w̃
    subject to: y_n(w̃^T z_n + b̃) ≥ 1, ∀n

Final hypothesis:

    g(x) = sign(w̃*^T Φ(x) + b̃*)


SVM and non-linear transformation

The margin is shaded in yellow, and the support vectors are boxed.

For Φ_2, d̃_2 = 5, and for Φ_3, d̃_3 = 9: d̃_3 is nearly double d̃_2, yet the resulting SVM separator with Φ_3 is not severely overfitting (regularization?).


Support Vector Machine Summary

A very powerful, easy-to-use linear model which comes with automatic regularization.

Fully exploit SVM: Kernel

Potential robustness to overfitting even after transforming to a much higher dimension. How about infinite-dimensional transforms? The Kernel Trick.


SVM Dual: Formulation

Primal and dual in optimization.

The dual view of SVM enables us to exploit the kernel trick.

In the primal SVM problem we solve for w ∈ R^d and b, while in the dual problem we solve for α ∈ R^N:

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m x_n^T x_m
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

which is also a QP problem.
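A hedged sketch of this dual QP with cvxopt, minimizing the negated objective (solvers.qp is cvxopt's API; train_dual_svm and the 1e-6 support-vector threshold are our assumptions):

    import numpy as np
    from cvxopt import matrix, solvers

    def train_dual_svm(X, y):
        # min_a (1/2) a^T P a - 1^T a  s.t.  y^T a = 0, a >= 0
        N = X.shape[0]
        P = matrix(np.outer(y, y) * (X @ X.T))
        q = matrix(-np.ones((N, 1)))
        G = matrix(-np.eye(N))                 # a >= 0 written as -a <= 0
        h = matrix(np.zeros((N, 1)))
        Aeq = matrix(y.reshape(1, -1))
        beq = matrix(np.zeros((1, 1)))
        sol = solvers.qp(P, q, G, h, Aeq, beq)
        alpha = np.array(sol['x']).ravel()
        w = (alpha * y) @ X                    # w* = sum_n y_n alpha_n* x_n
        sv = alpha > 1e-6                      # support vectors: alpha_n* > 0
        b = float(np.mean(y[sv] - X[sv] @ w))  # from y_n(w^T x_n + b) = 1
        return alpha, w, b

    X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
    y = np.array([-1., -1., 1., 1.])
    print(train_dual_svm(X, y)[1:])            # ~ (array([ 1., -1.]), -1.0)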


SVM Dual: Prediction

We can obtain the primal solution:

    w* = ∑_{n=1}^N y_n α_n* x_n

where α_n* > 0 exactly for the support vectors.

The optimal hypothesis:

    g(x) = sign(w*^T x + b*)
         = sign( ∑_{n=1}^N y_n α_n* x_n^T x + b* )
         = sign( ∑_{α_n* > 0} y_n α_n* x_n^T x + b* )


Dual SVM: Summary

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m x_n^T x_m
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

    w* = ∑_{n=1}^N y_n α_n* x_n


Common SVM Basis Functions

    z_k = polynomial terms of x_k of degree 1 to q

    z_k = radial basis functions of x_k:  z_k(j) = φ_j(x_k) = exp(−|x_k − c_j|^2 / σ^2)

    z_k = sigmoid functions of x_k


Quadratic Basis Functions

    Φ(x) = ( 1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, √2 x_1 x_2, ..., √2 x_1 x_d, √2 x_2 x_3, ..., √2 x_{d−1} x_d )^T

including the constant term, linear terms, pure quadratic terms, and quadratic cross-terms.

The number of terms is approximately d^2/2.

You may be wondering what those √2's are doing. You'll find out why they're there soon.


Dual SVM: Non-linear Transformation

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

    w* = ∑_{n=1}^N y_n α_n* Φ(x_n)

Need to prepare a matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must do N^2/2 dot products to get this matrix ready. Each dot product requires d^2/2 additions and multiplications, so the whole thing costs N^2 d^2 / 4.


Quadratic Dot Products

Φ(a)^T Φ(b) is the dot product of the two quadratic feature vectors

    Φ(a) = ( 1, √2 a_1, ..., √2 a_d, a_1^2, ..., a_d^2, √2 a_1 a_2, ..., √2 a_1 a_d, √2 a_2 a_3, ..., √2 a_{d−1} a_d )^T
    Φ(b) = ( 1, √2 b_1, ..., √2 b_d, b_1^2, ..., b_d^2, √2 b_1 b_2, ..., √2 b_1 b_d, √2 b_2 b_3, ..., √2 b_{d−1} b_d )^T

Expanding term by term:

    Constant term: 1
    Linear terms: ∑_{i=1}^d 2 a_i b_i
    Pure quadratic terms: ∑_{i=1}^d a_i^2 b_i^2
    Quadratic cross-terms: ∑_{i=1}^d ∑_{j=i+1}^d 2 a_i a_j b_i b_j


Quadratic Dot Product

Does Φ(a)^T Φ(b) look familiar?

    Φ(a)^T Φ(b) = 1 + 2 ∑_{i=1}^d a_i b_i + ∑_{i=1}^d a_i^2 b_i^2 + ∑_{i=1}^d ∑_{j=i+1}^d 2 a_i a_j b_i b_j

Try this: (a^T b + 1)^2

    (a^T b + 1)^2 = (a^T b)^2 + 2 a^T b + 1
                  = ( ∑_{i=1}^d a_i b_i )^2 + 2 ∑_{i=1}^d a_i b_i + 1
                  = ∑_{i=1}^d ∑_{j=1}^d a_i b_i a_j b_j + 2 ∑_{i=1}^d a_i b_i + 1
                  = ∑_{i=1}^d a_i^2 b_i^2 + 2 ∑_{i=1}^d ∑_{j=i+1}^d a_i a_j b_i b_j + 2 ∑_{i=1}^d a_i b_i + 1

They're the same! And this is only O(d) to compute!
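A quick numeric sanity check of this identity (a sketch; phi here is our explicit construction of the quadratic feature map above):

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(5), rng.standard_normal(5)

    def phi(x):
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

    print(np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2))   # True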


Dual SVM: Non-linear Transformation

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m)
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

    w* = ∑_{n=1}^N y_n α_n* Φ(x_n)

Need to prepare a matrix Q with Q_nm = y_n y_m Φ(x_n)^T Φ(x_m).

Cost? We must still do N^2/2 dot products to get this matrix ready, but each dot product now requires only d additions and multiplications.


Higher Order Polynomials

    Φ(x)         # terms          Cost           d = 100
    Quadratic    d^2/2 terms      d^2 N^2/4      2.5k N^2
    Cubic        d^3/6 terms      d^3 N^2/12     83k N^2
    Quartic      d^4/24 terms     d^4 N^2/48     1.96m N^2

    Φ(a)^T Φ(b)                   Cost           d = 100
    Quadratic    (a^T b + 1)^2    d N^2/2        50 N^2
    Cubic        (a^T b + 1)^3    d N^2/2        50 N^2
    Quartic      (a^T b + 1)^4    d N^2/2        50 N^2


Dual SVM with Quintic basis functions

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m Φ(x_n)^T Φ(x_m),
    where Φ(x_n)^T Φ(x_m) = (x_n^T x_m + 1)^5
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

Classification:

    g(x) = sign(w*^T Φ(x) + b*) = sign( ∑_{α_n* > 0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
         = sign( ∑_{α_n* > 0} y_n α_n* (x_n^T x + 1)^5 + b* )


Dual SVM with general kernel functions

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m K(x_n, x_m)
    subject to: ∑_{n=1}^N y_n α_n = 0,  α_n ≥ 0, ∀n

Classification:

    g(x) = sign(w*^T Φ(x) + b*) = sign( ∑_{α_n* > 0} y_n α_n* Φ(x_n)^T Φ(x) + b* )
         = sign( ∑_{α_n* > 0} y_n α_n* K(x_n, x) + b* )


Kernel Tricks

Replacing the dot product with a kernel function.

Not all functions are kernel functions!

Need to be decomposable: K(a, b) = Φ(a)^T Φ(b).

    Could K(a, b) = (a − b)^3 be a kernel function?
    Could K(a, b) = (a − b)^4 − (a + b)^2 be a kernel function?

Mercer’s condition

To expand a kernel function K(a, b) into a dot product, i.e., K(a, b) = Φ(a)^T Φ(b), K(a, b) has to be a positive semi-definite function: the kernel matrix K must be symmetric PSD for any given x_1, ..., x_N.
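An empirical sketch of this condition for the quadratic kernel (a spot check on random points, not a proof; the tolerance is our choice):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 3))
    K = (X @ X.T + 1.0) ** 2                       # kernel matrix, symmetric by construction
    print(np.linalg.eigvalsh(K).min() >= -1e-8)    # True: PSD up to round-off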


Kernel Design: expression kernel

mRNA expression data:

Each matrix entry is an mRNA expression measurement. Each column is an experiment. Each row corresponds to a gene.

(Figure: pairs of gene expression profiles labeled similar vs. dissimilar.)

Kernel:

    K(x, y) = ∑_i x_i y_i / ( √(∑_i x_i x_i) · √(∑_i y_i y_i) )

i.e., the cosine similarity of the two expression profiles.
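A one-function sketch of this kernel in numpy (the function name is ours):

    import numpy as np

    def expression_kernel(x, y):
        # Cosine similarity of two expression profiles.
        return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

    print(expression_kernel(np.array([1.0, 0.5, 2.0]),
                            np.array([0.9, 0.6, 1.8])))   # near 1 for similar profiles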


Kernel Design: sequence kernel

Work with non-vectorial data

Scalar product on a pair of variable-length, discrete strings?

    >ICYA_MANSE
    GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYMENSHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

    >LACB_BOVIN
    MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI


Commonly Used SVM Kernel Functions

K(a, b) = (α · a^T b + β)^Q is an example of an SVM kernel function (a polynomial kernel of degree Q).

Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right kernel function:

Radial-basis style (RBF) / Gaussian kernel function:

    K(a, b) = exp(−γ ∥a − b∥^2)

Sigmoid functions.
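A small sketch of the Gaussian kernel as a pairwise matrix in numpy (rbf_kernel is our name, not a library call):

    import numpy as np

    def rbf_kernel(A, B, gamma):
        # K[i, j] = exp(-gamma * ||A_i - B_j||^2) for row vectors A_i, B_j.
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
        return np.exp(-gamma * sq)

    X = np.random.default_rng(0).standard_normal((4, 2))
    print(rbf_kernel(X, X, gamma=0.5))   # symmetric, ones on the diagonal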


2nd Order Polynomial Kernel

    K(a, b) = (α · a^T b + β)^2


Gaussian Kernels

    K(a, b) = exp(−γ ∥a − b∥^2)

When γ is large, we clearly see that even the protection of a large margin cannot suppress overfitting. However, for a reasonably small γ, the sophisticated boundary discovered by SVM with the Gaussian-RBF kernel looks quite good.


Gaussian Kernels

(a) shows a noisy data set where a linear classifier appears to work quite well; (b) shows that using the Gaussian-RBF kernel with the hard-margin SVM leads to overfitting.


From hard-margin to soft-margin

When there are outliers, the hard-margin SVM + Gaussian-RBF kernel results in an unnecessarily complicated decision boundary that overfits the training noise.

Remedy: a soft formulation that allows small violations of the margin or even some classification errors.

Soft-margin: introduce a margin violation ε_n ≥ 0 for each data point (x_n, y_n) and require that

    y_n(w^T x_n + b) ≥ 1 − ε_n

ε_n captures by how much (x_n, y_n) fails to meet the margin.


Soft-Margin SVM

We modify the hard-margin SVM to the soft-margin SVM by allowing margin violations but adding a penalty term to discourage large violations:

    min_{b,w,ε} (1/2) w^T w + C ∑_{n=1}^N ε_n
    subject to: y_n(w^T x_n + b) ≥ 1 − ε_n for n = 1, ..., N
                ε_n ≥ 0 for n = 1, ..., N

The meaning of C?

When C is large, we care more about violating the margin, which gets us closer to the hard-margin SVM. When C is small, on the other hand, we care less about violating the margin.
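A hedged illustration with scikit-learn, assuming it is available (sklearn.svm.SVC is the library's soft-margin SVM; the specific C values are just for contrast):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
    y = np.array([-1, -1, 1, 1])

    nearly_hard = SVC(kernel='linear', C=1e6).fit(X, y)   # large C ~ hard margin
    soft = SVC(kernel='linear', C=0.1).fit(X, y)          # small C tolerates violations
    print(nearly_hard.coef_, nearly_hard.intercept_)      # ~ [[ 1. -1.]] [-1.]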


Soft Margin Example


Soft Margin and Hard Margin

    min_{b,w,ε} (1/2) w^T w + C ∑_{n=1}^N ε_n
    subject to: y_n(w^T x_n + b) ≥ 1 − ε_n,  ε_n ≥ 0, ∀n

The first term controls the margin; the second is the error tolerance.


The Hinge Loss

The trade-off sounds very similar, right?

We have ε_n ≥ 0 and y_n(w^T x_n + b) ≥ 1 − ε_n, so

    ε_n ≥ 1 − y_n(w^T x_n + b)

The SVM loss (a.k.a. hinge loss) function:

    E_SVM(b, w) = (1/N) ∑_{n=1}^N max(1 − y_n(w^T x_n + b), 0)

The soft-margin SVM can be re-written as the following optimization problem:

    min_{b,w} E_SVM(b, w) + λ w^T w
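This regularized hinge objective is a few lines of numpy (a sketch; the function name and the λ value are ours):

    import numpy as np

    def svm_objective(b, w, X, y, lam):
        margins = y * (X @ w + b)
        hinge = np.maximum(1.0 - margins, 0.0).mean()   # E_SVM(b, w)
        return hinge + lam * (w @ w)                    # + lambda * w^T w

    X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
    y = np.array([-1., -1., 1., 1.])
    print(svm_objective(-1.0, np.array([1.0, -1.0]), X, y, lam=0.1))   # 0.2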


Dual Soft-Margin SVM

    max_{α ∈ R^N} ∑_{n=1}^N α_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N α_n α_m y_n y_m x_n^T x_m
    subject to: ∑_{n=1}^N y_n α_n = 0,  0 ≤ α_n ≤ C, ∀n

    w* = ∑_{n=1}^N y_n α_n* x_n


Summary of Dual SVM

Deliver a large-margin hyperplane, and in so doing control the effective model complexity.

Deal with high- or infinite-dimensional transforms using the kernel trick.

Express the final hypothesis g(x) using only a few support vectors, their corresponding dual variables (Lagrange multipliers), and the kernel.

Control the sensitivity to outliers and regularize the solution by setting C appropriately.


Support Vector Machine

(Course-map diagram.) Robust classifier via maximum margin, with two designs:

    Hard margin: primal objective (QP solver in d) / dual objective (QP solver in N) → support vectors → kernel trick.
    Soft margin (allow training error): primal objective (QP solver in d; hinge loss) / dual objective (QP solver in N) → support vectors → kernel trick.
