CptS 570 – Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1


Post on 20-May-2020


Page 1: CptS 570 – Machine Learning School of EECS Washington ...holder/courses/CptS570/... · CptS 570 – Machine Learning. School of EECS. Washington State University. ... are basis

CptS 570 – Machine Learning
School of EECS
Washington State University

CptS 570 - Machine Learning 1

Kernel machine, or support vector machine (SVM)
Discriminant-based method
◦ Learn class boundaries
Support vectors are the examples closest to the boundary
Kernel computes similarity between examples
◦ Maps instance space to a higher-dimensional space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines among best-performing learners

Likely to underfit using only hyperplanes
But we can map the data to a nonlinear space and use hyperplanes there
◦ $\Phi: \mathbb{R}^d \to F$
◦ $\mathbf{x} \mapsto \Phi(\mathbf{x})$

Given the sample $X = \{\mathbf{x}^t, r^t\}$, where

$$ r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases} $$

find $\mathbf{w}$ and $w_0$ such that

$$ \mathbf{w}^T\mathbf{x}^t + w_0 \ge +1 \text{ for } r^t = +1 $$
$$ \mathbf{w}^T\mathbf{x}^t + w_0 \le -1 \text{ for } r^t = -1 $$

which can be rewritten as

$$ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \ge +1 $$

Note we want ≥ +1, not ≥ 0
Want instances some distance from hyperplane

Distance from instance $\mathbf{x}^t$ to the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$:

$$ \frac{|\mathbf{w}^T\mathbf{x}^t + w_0|}{\|\mathbf{w}\|} \quad \text{or} \quad \frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|} $$

Distance from hyperplane to closest instances is the margin
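A minimal sketch of the distance and margin computation, assuming NumPy, a hypothetical four-point dataset, and a hand-picked hyperplane (none of these values come from the slides):

```python
import numpy as np

def signed_distances(X, r, w, w0):
    """Return r^t (w^T x^t + w0) / ||w|| for every row x^t of X."""
    return r * (X @ w + w0) / np.linalg.norm(w)

# Hypothetical toy data: two instances per class in 2-D.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, 2.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = np.array([1.0, 0.0]), 0.0   # assumed hyperplane x1 = 0

dist = signed_distances(X, r, w, w0)
margin = dist.min()   # distance from the hyperplane to the closest instance
```

A positive value in `dist` means the instance is on the correct side; the smallest entry is the margin of this particular hyperplane.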

Optimal separating hyperplane is the one maximizing the margin
We want to choose $\mathbf{w}$ maximizing $\rho$ such that

$$ \frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \ge \rho, \quad \forall t $$

Infinite number of solutions by scaling $\mathbf{w}$
Thus, we fix $\rho\|\mathbf{w}\| = 1$ and choose the solution minimizing $\|\mathbf{w}\|$

$$ \min \frac{1}{2}\|\mathbf{w}\|^2 \text{ subject to } r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \ge +1, \quad \forall t $$

$$ \min \frac{1}{2}\|\mathbf{w}\|^2 \text{ subject to } r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \ge +1, \quad \forall t $$

Quadratic optimization problem with complexity polynomial in d (#features)
Kernel will eventually map d-dimensional space to higher-dimensional space
Prefer complexity not based on #dimensions

Convert optimization problem to depend on number of training examples N (not d)
◦ Still polynomial in N
But optimization will depend only on closest examples (support vectors)
◦ Typically ≪ N

Rewrite quadratic optimization problem using Lagrange multipliers $\alpha^t$, $1 \le t \le N$

$$ \min \frac{1}{2}\|\mathbf{w}\|^2 \text{ subject to } r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \ge +1, \quad \forall t $$

$$ \begin{aligned}
L_p &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 \right] \\
    &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\mathbf{w}^T\mathbf{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t
\end{aligned} $$

Minimize $L_p$

Equivalently, we can maximize $L_p$ with respect to $\alpha^t \ge 0$, subject to the constraints that the gradients with respect to $\mathbf{w}$ and $w_0$ are zero:

$$ \frac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t $$
$$ \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0 $$

Plugging these into $L_p$ …

$$ \begin{aligned}
L_d &= \frac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T \sum_t \alpha^t r^t \mathbf{x}^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\
    &= -\frac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t \\
    &= -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t
\end{aligned} $$

subject to $\sum_t \alpha^t r^t = 0$ and $\alpha^t \ge 0, \ \forall t$

Maximize $L_d$ with respect to $\alpha^t$ only
Complexity $O(N^3)$

Most $\alpha^t = 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) > 1$ ($\mathbf{x}^t$ lie outside margin)
Support vectors: $\mathbf{x}^t$ such that $\alpha^t > 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) = 1$ ($\mathbf{x}^t$ lie on margin)
$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$
$w_0 = r^t - \mathbf{w}^T\mathbf{x}^t$ for any support vector $\mathbf{x}^t$
◦ Typically average over all support vectors
Resulting discriminant is called the support vector machine (SVM)
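A small sketch of the dual objective and of recovering $\mathbf{w}$ and $w_0$ from the multipliers, assuming NumPy and a hypothetical two-point dataset. The multipliers are not produced by a QP solver here; they are the values that maximize $L_d$ for this toy set, worked out by hand.

```python
import numpy as np

def dual_objective(alpha, X, r):
    """L_d = -1/2 sum_t sum_s a^t a^s r^t r^s (x^t)^T x^s + sum_t a^t."""
    v = (alpha * r)[:, None] * X          # rows are alpha^t r^t x^t
    return -0.5 * np.sum(v @ v.T) + alpha.sum()

def recover_hyperplane(alpha, X, r, tol=1e-8):
    """w = sum_t alpha^t r^t x^t; w0 averaged over the support vectors."""
    w = ((alpha * r)[:, None] * X).sum(axis=0)
    sv = alpha > tol                      # support vectors have alpha^t > 0
    w0 = np.mean(r[sv] - X[sv] @ w)
    return w, w0

# Hypothetical toy set: one instance per class.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
r = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])              # assumed optimal multipliers (hand-derived)

w, w0 = recover_hyperplane(alpha, X, r)
primal = 0.5 * w @ w
dual = dual_objective(alpha, X, r)
```

For this toy set the primal and dual values coincide, which is what one would expect at the optimum of a convex quadratic program.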

[Figure: separating hyperplane and margin; O = support vectors]

Data not linearly separable
Find hyperplane with least error
Define slack variables $\xi^t \ge 0$ storing deviation from the margin

$$ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \ge 1 - \xi^t $$

(a) Correctly classified example far from margin ($\xi^t = 0$)
(b) Correctly classified example on the margin ($\xi^t = 0$)
(c) Correctly classified example, but inside the margin ($0 < \xi^t < 1$)
(d) Incorrectly classified example ($\xi^t \ge 1$)

Soft error = $\sum_t \xi^t$
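The four cases can be reproduced numerically. This is a sketch assuming NumPy, a hand-picked hyperplane, and four hypothetical positive-class instances chosen so that each one lands in one of the cases (a) to (d):

```python
import numpy as np

def slack(X, r, w, w0):
    """xi^t = max(0, 1 - r^t (w^T x^t + w0))."""
    return np.maximum(0.0, 1.0 - r * (X @ w + w0))

w, w0 = np.array([1.0, 0.0]), 0.0       # assumed hyperplane x1 = 0
X = np.array([[3.0, 0.0],               # (a) far from the margin
              [1.0, 0.0],               # (b) on the margin
              [0.5, 0.0],               # (c) inside the margin
              [-1.0, 0.0]])             # (d) misclassified
r = np.ones(4)                          # all four belong to the positive class

xi = slack(X, r, w, w0)
soft_error = xi.sum()
```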

[Figure: soft-margin hyperplane and margin; O = support vectors]

Lagrangian equation with slack variables

$$ L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t $$

C is penalty factor
$\mu^t \ge 0$, new set of Lagrange multipliers
Want to minimize $L_p$

Minimize $L_p$ by setting derivatives to zero

$$ \frac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t $$
$$ \frac{\partial L_p}{\partial w_0} = 0 \Rightarrow \sum_{t=1}^{N} \alpha^t r^t = 0 $$
$$ \frac{\partial L_p}{\partial \xi^t} = 0 \Rightarrow C - \alpha^t - \mu^t = 0 $$

Plugging these into $L_p$ yields dual $L_d$
Maximize $L_d$ with respect to $\alpha^t$

$$ L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t $$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \ \forall t$

Quadratic optimization problem
Support vectors have $\alpha^t > 0$
◦ Examples on margin: $\alpha^t < C$
◦ Examples inside margin or misclassified: $\alpha^t = C$

$$ L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T \mathbf{x}^s + \sum_t \alpha^t $$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \ \forall t$

C is a regularization parameter
◦ High C: high penalty for non-separable examples (overfit)
◦ Low C: less penalty (underfit)
◦ Determine using validation set (C=1 typical)
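To see where C enters in practice, here is a rough sketch that trains a linear soft-margin classifier. Note the swap: instead of solving the dual QP described on the slides, it runs plain subgradient descent on the equivalent primal objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \max(0, 1 - r^t(\mathbf{w}^T\mathbf{x}^t + w_0))$. The function name, learning rate, epoch count, and toy data are all assumptions.

```python
import numpy as np

def train_soft_margin(X, r, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_t max(0, 1 - r^t (w^T x^t + w0))."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = r * (X @ w + w0) < 1.0            # examples with nonzero slack
        rv, Xv = r[viol], X[viol]
        # Subgradient of the objective: w - C * sum over violators of r^t x^t
        w = w - lr * (w - C * (rv[:, None] * Xv).sum(axis=0))
        w0 = w0 + lr * C * rv.sum()
    return w, w0

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])

w, w0 = train_soft_margin(X, r, C=1.0)
pred = np.sign(X @ w + w0)
```

Raising `C` makes each margin violation more expensive relative to the $\|\mathbf{w}\|^2$ term; comparing a few values of `C` on a held-out validation set is the selection procedure the slide suggests.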

To use previous approaches, data must be near linearly separable
If not, perhaps a transformation $\boldsymbol{\phi}(\mathbf{x})$ will help
$\boldsymbol{\phi}(\mathbf{x})$ are basis functions

Transform d-dimensional x space to k-dimensional z space using basis functions $\boldsymbol{\phi}(\mathbf{x})$
$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$ where $z_j = \phi_j(\mathbf{x})$, $j = 1, \ldots, k$

$$ g(\mathbf{z}) = \mathbf{w}^T\mathbf{z} $$
$$ g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{k} w_j \phi_j(\mathbf{x}) $$

Instead of $w_0$, assume $z_1 = \phi_1(\mathbf{x}) \equiv 1$
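As an illustration of why a basis-function transform can help, the sketch below uses an XOR-style dataset that no hyperplane in x space separates. The basis functions and the weight vector are hand-picked assumptions for this example, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Assumed basis functions [1, x1, x2, x1*x2]; phi_1 = 1 plays the role of w0."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2])

# XOR-style labels: positive when the two coordinates differ in sign.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
r = np.array([-1.0, 1.0, 1.0, -1.0])

Z = np.array([phi(x) for x in X])       # instances mapped to z space
w = np.array([0.0, 0.0, 0.0, -1.0])     # hand-picked linear model in z space
g = Z @ w                               # g(x) = w^T phi(x)
```

In z space the single coordinate $x_1 x_2$ is enough, so a linear discriminant there reproduces the labels exactly.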

$$ L_p = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t \xi^t - \sum_t \alpha^t \left[ r^t \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t $$

$$ L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x}^s) + \sum_t \alpha^t $$

subject to $\sum_t \alpha^t r^t = 0$ and $0 \le \alpha^t \le C, \ \forall t$

Replace inner product of basis functions $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ with kernel function $K(\mathbf{x}^t, \mathbf{x}^s)$

$$ L_d = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s K(\mathbf{x}^t, \mathbf{x}^s) + \sum_t \alpha^t $$

Kernel $K(\mathbf{x}^t, \mathbf{x}^s)$ computes z-space product $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ in x-space

$$ \mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t) $$
$$ g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T \boldsymbol{\phi}(\mathbf{x}) $$
$$ g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x}) $$

Matrix of kernel values $\mathbf{K}$, where $K_{ts} = K(\mathbf{x}^t, \mathbf{x}^s)$, called the Gram matrix
$\mathbf{K}$ should be symmetric and positive semidefinite
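A short sketch of building a Gram matrix, checking the two properties just listed, and evaluating the kernelized discriminant. It assumes NumPy, a plain linear kernel, and a hypothetical two-point dataset with hand-derived multipliers; the helper names are illustrative only.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[t, s] = kernel(x^t, x^s) for every pair of training instances."""
    n = len(X)
    return np.array([[kernel(X[t], X[s]) for s in range(n)] for t in range(n)])

def discriminant(x, X, r, alpha, kernel):
    """g(x) = sum_t alpha^t r^t K(x^t, x)."""
    return sum(a * rt * kernel(xt, x) for a, rt, xt in zip(alpha, r, X))

def linear(u, v):
    return float(u @ v)

X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # hypothetical training set
r = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])              # assumed multipliers for this toy set

K = gram_matrix(X, linear)
is_symmetric = np.allclose(K, K.T)
# Small negative tolerance absorbs floating-point round-off in the eigenvalues.
is_psd = bool(np.all(np.linalg.eigvalsh(K) >= -1e-10))
g = discriminant(np.array([2.0, 3.0]), X, r, alpha, linear)
```

Swapping `linear` for any other kernel function changes the discriminant without touching the rest of the code, which is the practical appeal of the kernel formulation.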

Polynomial kernel of degree q

$$ K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T\mathbf{x}^t + 1)^q $$

If q=1, then use original features
For example, when q=2 and d=2

$$ \begin{aligned}
K(\mathbf{x}, \mathbf{y}) &= (\mathbf{x}^T\mathbf{y} + 1)^2 \\
  &= (x_1 y_1 + x_2 y_2 + 1)^2 \\
  &= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2
\end{aligned} $$

$$ \boldsymbol{\phi}(\mathbf{x}) = \left[ 1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2 \right]^T $$
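The identity between the degree-2 polynomial kernel and the explicit six-dimensional basis can be checked numerically. A minimal sketch, assuming NumPy and two arbitrary example vectors:

```python
import numpy as np

def poly_kernel(x, y, q=2):
    """K(x, y) = (x^T y + 1)^q, computed in the original x space."""
    return (x @ y + 1.0) ** q

def phi(x):
    """Explicit basis for q = 2, d = 2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, s * x1 * x2, x1 ** 2, x2 ** 2])

x = np.array([1.0, 2.0])     # arbitrary example vectors
y = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, y)     # one inner product in 2-D
k_explicit = phi(x) @ phi(y)       # inner product in the 6-D z space
```

Both routes give the same number, but the kernel route never forms the six-dimensional vectors.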

Polynomial kernel of degree 2

[Figure: decision boundary and margin; O = support vectors]

Radial basis functions (Gaussian kernel)

$$ K(\mathbf{x}^t, \mathbf{x}) = \exp\left[ -\frac{\|\mathbf{x}^t - \mathbf{x}\|^2}{2s^2} \right] $$

$\mathbf{x}^t$ is the center
$s$ is the radius
Larger $s$ implies smoother boundaries
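A brief sketch of the Gaussian kernel and of how the radius $s$ changes the similarity assigned to a fixed pair of points. NumPy and the example points are assumptions.

```python
import numpy as np

def rbf_kernel(xt, x, s=1.0):
    """K(x^t, x) = exp(-||x^t - x||^2 / (2 s^2))."""
    return np.exp(-np.sum((xt - x) ** 2) / (2.0 * s ** 2))

xt = np.array([0.0, 0.0])    # center
x = np.array([1.0, 1.0])     # query point at squared distance 2

k_self = rbf_kernel(xt, xt)          # identical points
k_narrow = rbf_kernel(xt, x, s=0.5)  # small radius: similarity decays quickly
k_wide = rbf_kernel(xt, x, s=2.0)    # large radius: similarity decays slowly
```

With a larger `s`, distant training instances still contribute to the discriminant, which is one way to read the remark that larger $s$ gives smoother boundaries.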


Sigmoidal functions

$$ K(\mathbf{x}^t, \mathbf{x}) = \tanh(2\mathbf{x}^T\mathbf{x}^t + 1) $$

Kernel $K(\mathbf{x}, \mathbf{y})$ increases with similarity between $\mathbf{x}$ and $\mathbf{y}$
Prior knowledge can be included in the kernel function
E.g., training examples are documents
◦ $K(D_1, D_2)$ = # shared words
E.g., training examples are strings (e.g., DNA)
◦ $K(S_1, S_2)$ = 1 / edit distance between $S_1$ and $S_2$
◦ Edit distance is the number of insertions, deletions and/or substitutions to transform $S_1$ into $S_2$
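Both application-specific kernels are easy to sketch in plain Python. Two assumptions are made that the slide leaves open: "shared words" is read as distinct whitespace-separated words after lowercasing, and identical strings (edit distance 0) are given infinite similarity. All function names are illustrative.

```python
def shared_words_kernel(d1, d2):
    """K(D1, D2) = number of distinct words the two documents share."""
    return len(set(d1.lower().split()) & set(d2.lower().split()))

def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming over prefixes."""
    prev = list(range(len(s2) + 1))          # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                           # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]

def string_kernel(s1, s2):
    """1 / edit distance; infinity for identical strings (an assumption)."""
    d = edit_distance(s1, s2)
    return float("inf") if d == 0 else 1.0 / d
```

For instance, `string_kernel("ACGT", "AGT")` is 1.0 because a single deletion turns one string into the other, and the value shrinks as the strings diverge.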

E.g., training examples are nodes in a graph (e.g., social network)
◦ $K(N_1, N_2)$ = 1 / length of shortest path connecting nodes
◦ $K(N_1, N_2)$ = # paths connecting nodes
Diffusion kernel

Training examples are graphs, not feature vectors
◦ E.g., carcinogenic vs. non-carcinogenic chemical structures
Compare substructures of graphs
◦ E.g., walks, paths, cycles, trees, subgraphs
$K(G_1, G_2)$ = number of identical random walks in both graphs
$K(G_1, G_2)$ = number of subgraphs shared by both graphs

Training data from multiple modalities (e.g., biometrics, social network, audio/visual)
Construct new kernels by combining simpler kernels
If $K_1(\mathbf{x}, \mathbf{y})$ and $K_2(\mathbf{x}, \mathbf{y})$ are valid kernels, and $c$ is a positive constant, then

$$ K(\mathbf{x}, \mathbf{y}) = \begin{cases} c K_1(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y}) \\ K_1(\mathbf{x}, \mathbf{y}) \cdot K_2(\mathbf{x}, \mathbf{y}) \end{cases} $$

◦ are valid kernels
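The closure rules can be spot-checked numerically: for a valid kernel the Gram matrix is positive semidefinite, so the smallest eigenvalue of each combined Gram matrix should be non-negative up to round-off. This sketch assumes NumPy, random hypothetical data, and a linear and a Gaussian kernel as the two base kernels. It is a sanity check on one sample, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                 # hypothetical sample of 6 instances

K1 = X @ X.T                                # Gram matrix of the linear kernel
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2.0)                      # Gram matrix of the Gaussian kernel, s = 1

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

combos = {
    "c*K1": 3.0 * K1,       # scaling by a positive constant
    "K1+K2": K1 + K2,       # sum of kernels
    "K1*K2": K1 * K2,       # product of kernels = elementwise product of Gram matrices
}
smallest = {name: min_eig(K) for name, K in combos.items()}
```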

Adaptive kernel combination

$$ K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i K_i(\mathbf{x}, \mathbf{y}) $$
$$ L_d = \sum_t \alpha^t - \frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s) $$
$$ g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}) $$

Learn both the $\alpha^t$ and the $\eta_i$

For a K-class problem, learn K different kernel machines $g_i(\mathbf{x})$
◦ Each uses one class as positive, remaining classes as negative
◦ Choose class $i$ such that $i = \arg\max_j g_j(\mathbf{x})$
◦ Best approach in practice

Learn K(K-1)/2 kernel machines
◦ Each uses one class as positive and another class as negative
◦ Easier (faster) learning per kernel machine
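The bookkeeping for the two multiclass schemes can be sketched as follows, assuming NumPy and a hypothetical table of discriminant values; training the individual kernel machines is left out.

```python
import numpy as np
from itertools import combinations

def one_vs_all_predict(scores):
    """scores[n, j] = g_j(x_n); pick the class with the largest discriminant."""
    return np.argmax(scores, axis=1)

def pairwise_tasks(num_classes):
    """All (positive, negative) class pairs for the pairwise scheme."""
    return list(combinations(range(num_classes), 2))

# Hypothetical discriminant values for 2 instances and 3 classes.
scores = np.array([[0.2, -1.0, 0.7],
                   [1.5, 0.1, -0.3]])
pred = one_vs_all_predict(scores)

pairs = pairwise_tasks(4)   # K = 4 classes gives K(K-1)/2 pairwise problems
```

Each pairwise problem sees only the instances of its two classes, which is why each machine trains faster even though there are more of them.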

Learn all margins at once

$$ \min \frac{1}{2} \sum_{i=1}^{K} \|\mathbf{w}_i\|^2 + C \sum_i \sum_t \xi_i^t $$

subject to

$$ \mathbf{w}_{z^t}^T\mathbf{x}^t + w_{z^t 0} \ge \mathbf{w}_i^T\mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \quad \forall i \ne z^t, \quad \xi_i^t \ge 0 $$

◦ $z^t$ is the class index of $\mathbf{x}^t$
K*N variables to optimize (expensive)

Normally, we would use squared error

$$ e_2(r^t, f(\mathbf{x}^t)) = [r^t - f(\mathbf{x}^t)]^2, \qquad f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 $$

For SVM, we use ε-sensitive loss

$$ e_\varepsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \varepsilon \\ |r^t - f(\mathbf{x}^t)| - \varepsilon & \text{otherwise} \end{cases} $$

◦ Tolerate errors up to ε
◦ Errors beyond ε have only linear effect
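The two losses can be compared side by side. A minimal sketch, assuming NumPy and hypothetical target and prediction values:

```python
import numpy as np

def squared_loss(r, f):
    """e_2 = (r - f)^2."""
    return (r - f) ** 2

def eps_sensitive_loss(r, f, eps):
    """e_eps = 0 if |r - f| < eps, else |r - f| - eps."""
    return np.maximum(0.0, np.abs(r - f) - eps)

r = np.array([1.0, 1.0, 1.0])        # hypothetical targets
f = np.array([1.05, 1.5, 3.0])       # hypothetical predictions
eps = 0.1

sq = squared_loss(r, f)
es = eps_sensitive_loss(r, f, eps)
```

The first prediction falls within ε and costs nothing under the ε-sensitive loss, while the largest error grows only linearly instead of quadratically, which makes the fit less sensitive to outliers.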

Use slack variables to account for deviations beyond ε
◦ $\xi_+^t$ for positive deviations
◦ $\xi_-^t$ for negative deviations

$$ \min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t (\xi_+^t + \xi_-^t) $$

Subject to

$$ r^t - (\mathbf{w}^T\mathbf{x}^t + w_0) \le \varepsilon + \xi_+^t $$
$$ (\mathbf{w}^T\mathbf{x}^t + w_0) - r^t \le \varepsilon + \xi_-^t $$
$$ \xi_+^t, \xi_-^t \ge 0 $$

$$ L_d = -\frac{1}{2} \sum_t \sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)(\mathbf{x}^t)^T\mathbf{x}^s - \varepsilon \sum_t (\alpha_+^t + \alpha_-^t) + \sum_t r^t (\alpha_+^t - \alpha_-^t) $$

subject to $0 \le \alpha_+^t \le C$, $0 \le \alpha_-^t \le C$, $\sum_t (\alpha_+^t - \alpha_-^t) = 0$

Non-support vectors (inside the ε-tube): $\alpha_+^t = \alpha_-^t = 0$
Support vectors
◦ ⊗ on the margin: $0 < \alpha_+^t < C$ or $0 < \alpha_-^t < C$
◦ ⊠ outside margin (outlier): $\alpha_+^t = C$ or $\alpha_-^t = C$


Fitted line f(x) is weighted sum of support vectors

$$ \mathbf{w} = \sum_t (\alpha_+^t - \alpha_-^t)\mathbf{x}^t $$
$$ f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)(\mathbf{x}^t)^T\mathbf{x} + w_0 $$

Average $w_0$ over:

$$ r^t = \mathbf{w}^T\mathbf{x}^t + w_0 + \varepsilon, \quad \text{if } 0 < \alpha_+^t < C $$
$$ r^t = \mathbf{w}^T\mathbf{x}^t + w_0 - \varepsilon, \quad \text{if } 0 < \alpha_-^t < C $$

$$ L_d = -\frac{1}{2} \sum_t \sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s) K(\mathbf{x}^t, \mathbf{x}^s) - \varepsilon \sum_t (\alpha_+^t + \alpha_-^t) + \sum_t r^t (\alpha_+^t - \alpha_-^t) $$

subject to $0 \le \alpha_+^t \le C$, $0 \le \alpha_-^t \le C$, $\sum_t (\alpha_+^t - \alpha_-^t) = 0$

$$ f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = \sum_t (\alpha_+^t - \alpha_-^t) K(\mathbf{x}^t, \mathbf{x}) + w_0 $$

[Figures: polynomial (quadratic) kernel; Gaussian kernel]

Classification: SMO
Regression: SMOreg
Sequential Minimal Optimization (SMO)
Kernels
◦ Polynomial
◦ RBF
◦ String

Seek optimal separating hyperplane
Support vector machine (SVM) finds hyperplane using only closest examples
Kernel function allows SVM to operate in higher dimensions
Kernel regression
Choosing correct kernel is crucial
Kernel machines among best-performing learners