Classification / Regression
Support Vector Machines

Jeff Howbert, Introduction to Machine Learning, Winter 2014
Source: courses.washington.edu/css581/lecture_slides/16_support_vector_machines.pdf

TRANSCRIPT

Page 1

Classification / Regression

Support Vector Machines

Page 2

Support vector machines

Topics
– SVM classifiers for linearly separable classes
– SVM classifiers for non-linearly separable classes
– SVM classifiers for nonlinear decision boundaries
   - kernel functions
– Other applications of SVMs
– Software

Page 3

Support vector machines

Goal: find a linear decision boundary (hyperplane) that separates the classes.

Linearly separable classes

Page 4

Support vector machines

One possible solution

Page 5

Support vector machines

Another possible solution

Page 6

Support vector machines

Other possible solutions

Page 7

Support vector machines

Which one is better? B1 or B2? How do you define better?

Page 8

Support vector machines

Hyperplane that maximizes the margin will have better generalization ⇒ B1 is better than B2

Page 9

Support vector machines

Hyperplane that maximizes the margin will have better generalization ⇒ B1 is better than B2

[Figure: boundaries B1 and B2 with their margin boundaries b11, b12 and b21, b22; the margin and a test sample are marked.]

Page 10

Support vector machines

Hyperplane that maximizes the margin will have better generalization ⇒ B1 is better than B2

[Figure: same labels as on page 9.]

Page 11

Support vector machines

Decision boundary B1 with margin boundaries b11 and b12; $\mathbf{w}$ is normal to the hyperplane.

Decision boundary: $\mathbf{w} \cdot \mathbf{x} + b = 0$
Margin boundaries: $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$

$$ y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases} $$

$$ \text{margin} = \frac{2}{\|\mathbf{w}\|} $$
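A quick numeric check of these relations, assuming NumPy (the hyperplane and points below are made up for illustration):

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0   # illustrative hyperplane: w . x + b = 0

# Points lying on the two margin boundaries w . x + b = +1 and w . x + b = -1.
x_plus, x_minus = np.array([1.0, 1.0]), np.array([0.0, 0.0])
print(np.dot(w, x_plus) + b, np.dot(w, x_minus) + b)   # +1.0 -1.0

# The margin is the distance between the two boundaries: 2 / ||w||.
print(2.0 / np.linalg.norm(w))                          # ~1.4142
```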

Page 12

Support vector machines

We want to maximize:

$$ \text{margin} = \frac{2}{\|\mathbf{w}\|} $$

which is equivalent to minimizing:

$$ L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} $$

but subject to the following constraints:

$$ y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases} $$

– This is a constrained convex optimization problem.
– Solve with numerical approaches, e.g. quadratic programming.
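A minimal sketch of solving this primal problem with an off-the-shelf numerical optimizer, assuming SciPy (the toy data is made up; a dedicated quadratic programming solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):
    # Variables packed as z = [w1, w2, b]; minimize L(w) = ||w||^2 / 2.
    w = z[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint per sample: y_i (w . x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(y))
]

result = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = result.x[:2], result.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```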

Page 13

Support vector machines

Solving for w that gives maximum margin:
1. Combine objective function and constraints into a new objective function, using Lagrange multipliers $\lambda_i$:

$$ L_{primal} = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left( y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right) $$

2. To minimize this Lagrangian, we take derivatives with respect to w and b and set them to 0:

$$ \frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \lambda_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0 $$

Page 14

Support vector machines

Solving for w that gives maximum margin:
3. Substituting and rearranging gives the dual of the Lagrangian:

$$ L_{dual} = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j $$

which we try to maximize (not minimize).
4. Once we have the $\lambda_i$, we can substitute into the previous equations to get w and b.
5. This defines w and b as linear combinations of the training data.
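The substitution in step 3 is worth seeing once. Plugging $\mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i$ into $L_{primal}$ and using $\sum_i \lambda_i y_i = 0$ to drop the $b$ term:

$$ L = \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j \;-\; \sum_{i,j}\lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j \;-\; b\sum_i \lambda_i y_i \;+\; \sum_i \lambda_i \;=\; \sum_i \lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j $$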

Page 15

Support vector machines

Optimizing the dual is easier.
– Function of the $\lambda_i$ only, not of the $\lambda_i$ and $\mathbf{w}$.
Convex optimization ⇒ guaranteed to find global optimum.
Most of the $\lambda_i$ go to zero.
– The $\mathbf{x}_i$ for which $\lambda_i \neq 0$ are called the support vectors; these "support" (lie on) the margin boundaries.
– The $\mathbf{x}_i$ for which $\lambda_i = 0$ lie away from the margin boundaries and are not required for defining the maximum margin hyperplane.
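This sparsity is easy to inspect in practice. A sketch assuming scikit-learn, whose fitted SVC exposes the support vectors and the products $\lambda_i y_i$ (the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, three points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 1.0],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

print("support vector indices:", clf.support_)  # only a few of the six points
print("lambda_i * y_i:", clf.dual_coef_)        # the nonzero multipliers
print("w:", clf.coef_, " b:", clf.intercept_)   # w = sum_i lambda_i y_i x_i
```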

Page 16

Support vector machines

Example of solving for maximum margin hyperplane
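A minimal example of the kind the slide works through (the points are made up for illustration). Take $\mathbf{x}_+ = (1,1)$ with $y = +1$ and $\mathbf{x}_- = (0,0)$ with $y = -1$. Both constraints are tight at the optimum:

$$ \mathbf{w}\cdot\mathbf{x}_+ + b = +1, \quad \mathbf{w}\cdot\mathbf{x}_- + b = -1 \;\Rightarrow\; b = -1,\; w_1 + w_2 = 2 $$

Minimizing $\|\mathbf{w}\|^2/2$ subject to $w_1 + w_2 = 2$ gives $\mathbf{w} = (1, 1)$, so the margin is $2/\|\mathbf{w}\| = \sqrt{2}$, exactly the distance between the two points, as expected when both are support vectors.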

Page 17

Support vector machines

What if the classes are not linearly separable?

Page 18

Support vector machines

Now which one is better? B1 or B2? How do you define better?

Page 19

Support vector machines

What if the problem is not linearly separable?
Solution: introduce slack variables $\xi_i$.
– Need to minimize:

$$ L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^k $$

– Subject to:

$$ y_i = f(\mathbf{x}_i) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases} $$

– C is an important hyperparameter, whose value is usually optimized by cross-validation.
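A common way to choose C by cross-validation, sketched with scikit-learn (the grid values and toy data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Illustrative data; replace with your own feature matrix X and labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Search a logarithmic grid of C values with 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], " cv accuracy:", grid.best_score_)
```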

Page 20

Support vector machines

Slack variables for nonseparable data

Page 21

Support vector machines

What if decision boundary is not linear?

Page 22

Support vector machines

Solution: nonlinear transform of attributes

$$ \Phi: [x_1, x_2] \rightarrow [x_1, (x_1 + x_2)^4] $$

Page 23

Support vector machines

Solution: nonlinear transform of attributes

$$ \Phi: [x_1, x_2] \rightarrow [(x_1 - x_2), (x_1 - x_2)^2] $$
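A sketch of the idea, assuming scikit-learn (the quadratic feature map here is illustrative, not the slide's exact Φ): classes separated by a circle become linearly separable after the transform.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)   # circular class boundary

def phi(X):
    # Append squared attributes: the circle x1^2 + x2^2 = 0.5 becomes a hyperplane.
    return np.column_stack([X, X[:, 0] ** 2, X[:, 1] ** 2])

print(SVC(kernel="linear").fit(X, y).score(X, y))            # mediocre: no separating line
print(SVC(kernel="linear").fit(phi(X), y).score(phi(X), y))  # ~1.0: separable after transform
```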

Page 24

Support vector machines

Issues with finding useful nonlinear transforms
– Not feasible to do manually as number of attributes grows (i.e. any real-world problem)
– Usually involves transformation to higher dimensional space
   - increases computational burden of SVM optimization
   - curse of dimensionality
With SVMs, can circumvent all the above via the kernel trick

Page 25

Support vector machines

Kernel trick
– Don't need to specify the attribute transform Φ( x )
– Only need to know how to calculate the dot product of any two transformed samples:

$$ k(\mathbf{x}_1, \mathbf{x}_2) = \Phi(\mathbf{x}_1) \cdot \Phi(\mathbf{x}_2) $$
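A quick numerical check of this identity for the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$, whose explicit transform in two dimensions is $\Phi([x_1, x_2]) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$ (a standard textbook pairing, not from the slides):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # The same quantity computed directly in the original space.
    return np.dot(x, z) ** 2

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x1), phi(x2)))   # 16.0
print(k(x1, x2))                  # 16.0 -- identical, no transform needed
```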

Page 26

Support vector machines

Kernel trick (cont'd.)
– The kernel function k( ⋅, ⋅ ) is substituted into the dual of the Lagrangian, allowing determination of a maximum margin hyperplane in the (implicitly) transformed space Φ( x ):

$$ L_{dual} = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \, k(\mathbf{x}_i, \mathbf{x}_j) $$

– All subsequent calculations, including predictions on test samples, are done using the kernel in place of Φ( x₁ ) ⋅ Φ( x₂ ).
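One way to see that only kernel values are ever needed: scikit-learn's SVC accepts a precomputed Gram matrix in place of raw features (a sketch; the kernel and data are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # nonlinear class boundary

def rbf_gram(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2), computed entirely in the original space.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

clf = SVC(kernel="precomputed").fit(rbf_gram(X, X), y)

# Predictions likewise need only kernel values between test and training samples.
X_test = rng.normal(size=(5, 2))
print(clf.predict(rbf_gram(X_test, X)))
```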

Page 27

Support vector machines

Common kernel functions for SVM
– linear: $k(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1 \cdot \mathbf{x}_2$
– polynomial: $k(\mathbf{x}_1, \mathbf{x}_2) = (\gamma\, \mathbf{x}_1 \cdot \mathbf{x}_2 + c)^d$
– Gaussian or radial basis: $k(\mathbf{x}_1, \mathbf{x}_2) = \exp(-\gamma\, \|\mathbf{x}_1 - \mathbf{x}_2\|^2)$
– sigmoid: $k(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\gamma\, \mathbf{x}_1 \cdot \mathbf{x}_2 + c)$
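The same four kernels as plain functions, assuming NumPy (parameter names follow the slide's γ, c, d):

```python
import numpy as np

def linear(x1, x2):
    return np.dot(x1, x2)

def polynomial(x1, x2, gamma=1.0, c=1.0, d=3):
    return (gamma * np.dot(x1, x2) + c) ** d

def gaussian(x1, x2, gamma=1.0):
    # Also called the radial basis function (RBF) kernel.
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def sigmoid(x1, x2, gamma=1.0, c=1.0):
    return np.tanh(gamma * np.dot(x1, x2) + c)
```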

Page 28

Support vector machines

For some kernels (e.g. Gaussian) the implicit transform Φ( x ) is infinite-dimensional!
– But calculations with the kernel are done in the original space, so the computational burden and the curse of dimensionality aren't a problem.

Page 29

Support vector machines

Page 30

Support vector machines

Applications of SVMs to machine learning
– Classification
   - binary
   - multiclass
   - one-class
– Regression
– Transduction (semi-supervised learning)
– Ranking
– Clustering
– Structured labels
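Several of these variants ship with common libraries; for example, scikit-learn (assuming that library; the class names are its own, not the slides'):

```python
from sklearn.svm import SVC, SVR, OneClassSVM

clf = SVC()              # binary / multiclass classification (one-vs-one internally)
reg = SVR()              # support vector regression
novelty = OneClassSVM()  # one-class SVM for novelty / outlier detection
```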

Page 31

Support vector machines

Software
– SVMlight
   http://svmlight.joachims.org/
– libSVM
   http://www.csie.ntu.edu.tw/~cjlin/libsvm/
   includes MATLAB / Octave interface
– MATLAB svmtrain / svmclassify
   only supports binary classification
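A minimal libSVM run through its bundled Python interface, as a hedged sketch (the module path and flags may differ by version; older installs expose the module as plain `svmutil`):

```python
from libsvm.svmutil import svm_train, svm_predict

# Labels and features as plain Python lists.
y = [1, 1, -1, -1]
x = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]]

# '-t 2' selects the RBF kernel, '-c 1' sets the cost parameter C.
model = svm_train(y, x, "-t 2 -c 1")
labels, accuracy, values = svm_predict(y, x, model)
```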

Page 32

Support vector machines

Online demos
– http://cs.stanford.edu/people/karpathy/svmjs/demo/