Support Vector Machines (Western University: dlizotte/teaching/slides/svm_1.pdf)



Nov 23rd, 2001    Copyright © 2001, 2003, Andrew W. Moore

Support Vector Machines

Andrew W. Moore
Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 2

Linear Classifiers

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x + b)

How would you classify this data?
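The decision rule on this slide is easy to state in code. Below is a minimal sketch (not from the slides; NumPy and all the numbers are illustrative assumptions) of f(x, w, b) = sign(w · x + b).

```python
import numpy as np

def f(x, w, b):
    """The slide's linear classifier: predict sign(w . x + b)."""
    return np.sign(np.dot(w, x) + b)

# Made-up parameters and a made-up query point, purely for illustration.
w = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 0.5])
print(f(x, w, b))   # +1.0 or -1.0, depending on which side of the boundary x falls
```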


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 3

Linear Classifiers

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 4

Linear Classifiers

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

How would you classify this data?


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 5

Linear Classifiers

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 6

Linear Classifiers

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

Any of these would be fine..

..but which is best?


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 7

Classifier Margin

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 8

Maximum Margin

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM)

Linear SVM

Note: SMO (Sequential Minimal Optimization) is one algorithm for computing w and b.
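As a companion to the SMO note, here is a hedged sketch of fitting a linear SVM with scikit-learn (an assumed library, not part of the slides); its SVC with a linear kernel uses libsvm's SMO-style solver and exposes the learned w and b. Note that scikit-learn uses the w · x + b sign convention, while later slides write w · x - b; the two differ only in the sign of b. The data below are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up, linearly separable dataset; labels are +1 / -1 as on the slides.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin (maximum margin) linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]         # learned weight vector
b = clf.intercept_[0]    # learned bias (w . x + b convention)
print("w =", w, " b =", b)
print("prediction at [2, 1]:", clf.predict([[2.0, 1.0]]))
```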


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 9

Maximum Margin

[Figure: datapoints labeled +1 and -1; diagram x → f(x, α) → y_est]

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM)

Support Vectors are those datapoints that the margin pushes up against

Linear SVM

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 10

Why Maximum Margin?

denotes +1

denotes -1

f(x,w,b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM)

Support Vectors are those datapoints that the margin pushes up against

1. Intuitively this feels safest.

2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints (see the sketch after this list).

4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.
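Point 3 in the list above can be checked empirically. The sketch below (scikit-learn again assumed; made-up data) refits after removing one non-support-vector point and shows that the learned hyperplane is essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 4.0],
              [0.0, 0.0], [1.0, 0.5], [-1.0, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vector indices:", clf.support_)

# Drop one point that is NOT a support vector, then refit.
non_sv = [i for i in range(len(X)) if i not in clf.support_][0]
keep = np.arange(len(X)) != non_sv
clf2 = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

print("w before:", clf.coef_[0],  " b before:", clf.intercept_[0])
print("w after: ", clf2.coef_[0], " b after: ", clf2.intercept_[0])  # should match
```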


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 11

Specifying a line and margin

• How do we represent this mathematically?
• …in m input dimensions?
• …so that we can maximize the margin?

[Figure: Classifier Boundary with parallel Plus-Plane and Minus-Plane; "Predict Class = +1" zone on one side, "Predict Class = -1" zone on the other]

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 12

Specifying a line and margin

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

[Figure: Classifier Boundary with parallel Plus-Plane and Minus-Plane; "Predict Class = +1" zone on one side, "Predict Class = -1" zone on the other]

Classify as:
  +1                 if  w · x + b >= 1
  -1                 if  w · x + b <= -1
  Universe explodes  if  -1 < w · x + b < 1

[Figure: the lines w·x+b = +1, w·x+b = 0, w·x+b = -1]


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 13

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 14

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
• Definitions: “vector” == “point”
• x1 perpendicular to x2 iff x1 · x2 == 0

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?

Let u and v be two vectors on the Plus Plane. What is w · (u - v)?

And so of course the vector w is also perpendicular to the Minus Plane.
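A quick numeric sanity check of this claim (made-up numbers, not from the slides): take two points u and v on the Plus Plane and verify that w · (u - v) = 0.

```python
import numpy as np

w = np.array([2.0, 1.0])   # made-up weight vector
b = -3.0

# Two made-up points chosen to lie on the Plus Plane w.x + b = +1, i.e. 2*x1 + x2 = 4.
u = np.array([1.0, 2.0])
v = np.array([0.0, 4.0])
assert np.isclose(w @ u + b, 1.0) and np.isclose(w @ v + b, 1.0)

print(w @ (u - v))   # 0.0: w is perpendicular to every direction lying within the plane
```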


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 15

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?

x- and x+ are any locations in ℜ^m: not necessarily datapoints.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 16

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?

• Note: λ is scalar and positive.

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?



Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 17

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?

• Note: λ is scalar and positive.

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?


The line from x- to x+ is perpendicular to the planes.

So to get from x- to x+, travel some distance in the direction of w.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 18

Computing the margin width

• Plus-plane = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?

• Note: λ is scalar and positive.

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

How do we compute M in terms of w and b?


The line from x- to x+ is perpendicular to the planes.

So to get from x- to x+, travel some distance in the direction of w.

So now we know that x+ - x- = λ w.


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 19

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ - x- = λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 20

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ - x- = λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width

w · (x- + λ w) + b = 1
⇒  (w · x- + b) + λ w · w = 1
⇒  -1 + λ w · w = 1
⇒  λ = 2 / (w · w)


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 21

Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
• λ = 2 / (w · w)

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width = 2 / √(w · w)

M = |x+ - x-| = |λ w| = λ |w| = λ √(w · w) = 2 √(w · w) / (w · w) = 2 / √(w · w)

Yay! Just maximize 2 / √(w · w)

Wait…OMG the data! I forgot!
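Before worrying about the data, the margin-width algebra above is easy to verify numerically. This sketch (made-up w and b, not from the slides) starts from a point on the minus-plane, steps λ = 2/(w·w) along w, and confirms the distance traveled is 2/√(w·w).

```python
import numpy as np

w = np.array([3.0, 4.0])   # made-up weight vector
b = -2.0

# Any point on the minus-plane w.x + b = -1; here 3*x1 + 4*x2 = 1.
x_minus = np.array([-1.0, 1.0])
assert np.isclose(w @ x_minus + b, -1.0)

lam = 2.0 / (w @ w)                    # lambda = 2 / (w . w), from the derivation
x_plus = x_minus + lam * w             # the closest plus-plane point
assert np.isclose(w @ x_plus + b, +1.0)

M = np.linalg.norm(x_plus - x_minus)   # margin width
print(M, 2.0 / np.sqrt(w @ w))         # both 0.4, i.e. 2 / |w|
```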

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 22

Learning the Maximum Margin Classifier

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = Margin Width = 2 / √(w · w)


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 23

Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

• It will solve our problem for us!

• It doesn’t matter how it works!

• Popular ML approach:
  • Describe your learning problem as optimization…
  • …and give it to somebody else to solve!

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 24

Quadratic Programming - Don’t Panic

Find

  argmin_w  c + dᵀw + (wᵀKw)/2        (Quadratic criterion)

Subject to n additional linear inequality constraints:

  a_11 w_1 + a_12 w_2 + … + a_1m w_m <= b_1
  a_21 w_1 + a_22 w_2 + … + a_2m w_m <= b_2
  ⋮
  a_n1 w_1 + a_n2 w_2 + … + a_nm w_m <= b_n

And subject to e additional linear equality constraints:

  a_(n+1)1 w_1 + a_(n+1)2 w_2 + … + a_(n+1)m w_m = b_(n+1)
  a_(n+2)1 w_1 + a_(n+2)2 w_2 + … + a_(n+2)m w_m = b_(n+2)
  ⋮
  a_(n+e)1 w_1 + a_(n+e)2 w_2 + … + a_(n+e)m w_m = b_(n+e)

Note: w · x = wᵀx
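To make "give it to somebody else to solve" concrete, here is a sketch of passing a tiny QP of exactly this shape to an off-the-shelf solver. CVXPY and all the numbers are assumptions for illustration, not something the slides prescribe; the constant c is dropped since it does not affect the argmin.

```python
import numpy as np
import cvxpy as cp

# A tiny made-up instance of the QP above (m = 2 variables).
d = np.array([1.0, -1.0])                      # linear term d
K = np.array([[2.0, 0.0], [0.0, 2.0]])         # symmetric PSD quadratic term K
A = np.array([[1.0, 1.0]]); b_ineq = np.array([1.0])   # n = 1 inequality: w1 + w2 <= 1
E = np.array([[1.0, -1.0]]); b_eq = np.array([0.0])    # e = 1 equality:   w1 - w2  = 0

w = cp.Variable(2)
objective = cp.Minimize(d @ w + 0.5 * cp.quad_form(w, K))
constraints = [A @ w <= b_ineq, E @ w == b_eq]
cp.Problem(objective, constraints).solve()

print("optimal w:", w.value)
```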


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 25

Quadratic Programming

(Same QP formulation as on the previous slide.)

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent.

(But they are very fiddly… you probably don't want to write one yourself)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 26

Learning the Maximum Margin Classifier

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?

Given guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 27

Learning the Maximum Margin Classifier
Given guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

What should our quadratic optimization criterion be?
Minimize w · w

How many constraints will we have?  R

What should they be?
  w · x_k + b >= 1   if y_k = 1
  w · x_k + b <= -1  if y_k = -1

[Figure: "Predict Class = +1" zone and "Predict Class = -1" zone, with the planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)
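Putting this slide's criterion and constraints together, here is a sketch (CVXPY assumed again, made-up separable data) of the hard-margin QP: minimize w · w subject to one constraint per datapoint. The two per-class constraints above are folded into the single equivalent form y_k (w · x_k + b) >= 1.

```python
import numpy as np
import cvxpy as cp

# Made-up linearly separable data; labels are +1 / -1 as on the slides.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
R, m = X.shape

w = cp.Variable(m)
b = cp.Variable()

# R constraints, one per datapoint: y_k (w . x_k + b) >= 1.
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)   # minimize w . w
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin width =", 2 / np.linalg.norm(w.value))
```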

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 28

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 29

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?

Idea 1:

Find minimum w·w, while minimizing number of training set errors.

Problemette: Two things to minimize make for an ill-defined optimization

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 30

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?

Idea 1.1:

Minimize  w·w + C (#train errors)
(C is a tradeoff parameter)

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 31

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?

Idea 1.1:

Minimize  w·w + C (#train errors)
(C is a tradeoff parameter)

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?

Can't be expressed as a Quadratic Programming problem. Solving it may be too slow.

(Also, doesn't distinguish between disastrous errors and near misses)

So… any other ideas?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 32

Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?

Idea 2.0:

Minimize  w·w + C (distance from incorrectly labeled points to their correct place)


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 33

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 34

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?

Minimize  (1/2) w · w + C Σ_{k=1}^{R} ε_k

[Figure labels: slacks ε2, ε7, ε11 marked on points violating the margin]

How many constraints will we have?  R

What should they be?
  w · x_k + b >= (1 - ε_k)   if y_k = 1
  w · x_k + b <= (-1 + ε_k)  if y_k = -1


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 35

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?

Minimize  (1/2) w · w + C Σ_{k=1}^{R} ε_k

[Figure labels: slacks ε2, ε7, ε11 marked on points violating the margin]

How many constraints will we have?  R

What should they be?
  w · x_k + b >= (1 - ε_k)   if y_k = 1
  w · x_k + b <= (-1 + ε_k)  if y_k = -1

Our original (noiseless data) QP had m+1 variables: w_1, w_2, … w_m, and b.

Our new (noisy data) QP has m+1+R variables: w_1, w_2, … w_m, b, ε_1, … ε_R

m = # input dimensions
R = # records

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 36

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?

Minimize  (1/2) w · w + C Σ_{k=1}^{R} ε_k

[Figure labels: slacks ε2, ε7, ε11 marked on points violating the margin]

How many constraints will we have?  R

What should they be?
  w · x_k + b >= (1 - ε_k)   if y_k = 1
  w · x_k + b <= (-1 + ε_k)  if y_k = -1

There’s a bug in this QP. Can you spot it?


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 37

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?

Minimize  (1/2) w · w + C Σ_{k=1}^{R} ε_k

[Figure labels: slacks ε2, ε7, ε11 marked on points violating the margin]

How many constraints will we have?  2R

What should they be?
  w · x_k + b >= (1 - ε_k)   if y_k = 1
  w · x_k + b <= (-1 + ε_k)  if y_k = -1
  ε_k >= 0 for all k
The ε_k are called “slack variables”
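The completed soft-margin QP on this slide can be handed to a solver in the same way. Below is a sketch (CVXPY assumed, made-up overlapping data) with R slack variables and the 2R constraints.

```python
import numpy as np
import cvxpy as cp

# Made-up data that is NOT linearly separable (a +1 point sits among the -1s and vice versa).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.2, 0.2],
              [0.0, 0.0], [1.0, 0.5], [2.2, 2.1]])
y = np.array([+1, +1, +1, -1, -1, -1])
R, m = X.shape
C = 1.0                                   # the tradeoff parameter from the slides

w, b = cp.Variable(m), cp.Variable()
eps = cp.Variable(R)                      # slack variables eps_1 ... eps_R

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + b) >= 1 - eps,   # R margin constraints
               eps >= 0]                               # R slack constraints
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slacks:", np.round(eps.value, 3))  # nonzero entries mark margin violations
```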

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 38

Learning Maximum Margin with Noise
Given guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (x_k, y_k) where y_k = +/- 1

[Figure: planes w·x+b = +1, w·x+b = 0, w·x+b = -1]

M = 2 / √(w · w)

What should our quadratic optimization criterion be?

Minimize  (1/2) w · w + C Σ_{k=1}^{R} ε_k

[Figure labels: slacks ε2, ε7, ε11 marked on points violating the margin]

How many constraints will we have?  2R

What should they be?
  w · x_k + b >= (1 - ε_k)   if y_k = 1
  w · x_k + b <= (-1 + ε_k)  if y_k = -1
  ε_k >= 0 for all k
The ε_k are called “slack variables”

Big C means “Fit the training data as much as possible!” (at the expense of maximizing margin)

Small C means “Maximize the margin as much as possible!” (at the expense of fitting the training data)
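The effect of C is easy to see empirically. The sketch below (scikit-learn assumed, synthetic overlapping data; not from the slides) fits a linear SVM with a big and a small C and reports the resulting margin width and training fit.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping made-up clusters, so no perfect separation exists.
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2)])
y = np.array([+1] * 50 + [-1] * 50)

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width={margin:.2f}, "
          f"support vectors={len(clf.support_)}, "
          f"train accuracy={clf.score(X, y):.2f}")
```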


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 39

What do we have?
• Method for learning a maximum-margin linear classifier when the data are
  • “Linearly separable” - i.e. there is a line that gets 0 training error
  • Not linearly separable - there is no such line.
• If not linearly separable, we make a trade-off between maximizing margin and minimizing “stuff-is-on-the-wrong-side-ness”

• Still, our output for a given x is

f(x,w,b) = sign(w · x + b)