New SVMs and Kernels


  • Slide 1/76

    Support Vector Machines and Kernel Methods

    Geoff Gordon

    [email protected]

    June 15, 2004

  • Slide 2/76

    Support vector machines

    The SVM is a machine learning algorithm which

    solves classification problems uses a flexible representation of the class boundaries

    implements automatic complexity control to reduce overfitting has a single global minimum which can be found in polynomial

    time

    It is popular because

    it can be easy to use

    it often has good generalization performance

    the same algorithm solves a variety of problems with little tuning

  • Slide 3/76

    SVM concepts

    Perceptrons

    Convex programming and duality

    Using maximum margin to control complexity

    Representing nonlinear boundaries with feature expansion

    The kernel trick for efficient optimization

  • Slide 4/76

    Outline

    Classification problems
    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 5/76

    Classification example: Fisher's irises

    [Scatter plot of petal measurements for the three species: setosa, versicolor, virginica]

  • Slide 6/76

    Iris data

    Three species of iris

    Measurements of petal length, width

    Iris setosa is linearly separable from I. versicolor and I. virginica

  • Slide 7/76

    Example: Boston housing data

    [Scatter plot; x-axis: Industry & Pollution, y-axis: % Middle & Upper Class]

  • Slide 8/76

    A good linear classifier

    [The same Boston housing scatter plot with a linear class boundary; x-axis: Industry & Pollution, y-axis: % Middle & Upper Class]

  • Slide 9/76

    Example: NIST digits

    [Several example digit images]

    28 × 28 = 784 features in [0, 1]

  • Slide 10/76

    Class means

    [Images of the class-mean digits]

  • Slide 11/76

  • Slide 12/76

    Sometimes a nonlinear classifier is better

    [Digit images]

  • Slide 13/76

    Sometimes a nonlinear classifier is better

    [2-D scatter plot of a dataset that no single line separates well]

  • Slide 14/76

    Classification problem

    [Figure: a data matrix X with a label vector y, illustrated on two example datasets]

    Data points X = [x_1; x_2; x_3; ...] with x_i ∈ R^n

    Labels y = [y_1; y_2; y_3; ...] with y_i ∈ {−1, +1}

    Solution is a subset of R^n, the classifier

    Often represented as a test f(x, learnable parameters) ≥ 0

    Define: decision surface, linear separator, linearly separable

  • Slide 15/76

    What is the goal?

    Classify new data with the fewest possible mistakes

    Proxy: minimize some loss function on the training data

        min_w Σ_i l(y_i f(x_i; w)) + l_0(w)

    That's l(f(x)) for positive examples and l(−f(x)) for negative ones

    [Plots of two loss functions: piecewise-linear loss and logistic loss]
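    The two losses plotted above can be written down directly. A minimal NumPy sketch (assuming the piecewise-linear loss is the perceptron penalty [−y f(x)]₊ used on slide 18; the margin values are illustrative):

```python
import numpy as np

def piecewise_linear_loss(margin):
    # Perceptron-style piecewise-linear loss: l(y f(x)) = max(0, -y f(x))
    return np.maximum(0.0, -margin)

def logistic_loss(margin):
    # Logistic loss: l(y f(x)) = log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-margin))

margins = np.linspace(-3, 3, 7)   # values of y * f(x), as on the plots' x-axis
print(piecewise_linear_loss(margins))
print(logistic_loss(margins))
```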

  • Slide 16/76

    Getting fancy

    Text? Hyperlinks? Relational database records?

    difficult to featurize w/ reasonable number of features

    but what if we could handle large or infinite feature sets?

  • Slide 17/76

    Outline

    Classification problems
    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 18/76

    Perceptrons

    [Plot of the perceptron penalty as a function of y f(x)]

    Weight vector w, bias c

    Classification rule: sign(f(x)) where f(x) = x·w + c

    Penalty for mispredicting: l(y f(x)) = [−y f(x)]₊

    This penalty is convex in w, so all minima are global

    Note: unit-length x vectors

  • Slide 19/76

    Training perceptrons

    Perceptron learning rule: on mistake,

    w += yx

    c += y

    That is, gradient descent on l(y f(x)), since

        ∇_w [−y(x·w + c)]₊ = −y x   if y(x·w + c) ≤ 0
                           = 0      otherwise
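    A minimal NumPy sketch of this update rule (the epoch count and toy data below are illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron learning rule from the slide: on each mistake, w += y*x, c += y.

    X: (m, n) array of examples (the slide assumes unit-length rows);
    y: (m,) array of labels in {-1, +1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    c = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + c) <= 0:      # mistake (or on the boundary)
                w += yi * xi                # gradient step on [-y(x.w + c)]_+
                c += yi
    return w, c

# Toy usage on a linearly separable 2-D problem
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, c = train_perceptron(X, y)
print(np.sign(X @ w + c))   # should match y
```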

  • Slide 20/76

    Perceptron demo

    [Figures from the live perceptron demo]

  • Slide 21/76

    Perceptrons as linear inequalities

    Linear inequalities (for separable case):

        y(x·w + c) > 0

    That is:
    x·w + c > 0 for positive examples
    x·w + c < 0 for negative examples

  • Slide 22/76

    Version space

        x·w + c = 0

    As a function of x: a hyperplane with normal w at distance c/‖w‖ from the origin

    As a function of w: a hyperplane with normal x at distance c/‖x‖ from the origin

  • Slide 23/76

    Convex programs

    Convex program:

        min f(x)   subject to   g_i(x) ≤ 0,   i = 1 ... m

    where f and the g_i are convex functions

    The perceptron is almost a convex program (strict > vs. ≥)

    Trick: write

        y(x·w + c) ≥ 1

  • Slide 24/76

    Slack variables

    [Loss plot]

    If not linearly separable, add slack variables s ≥ 0:

        y(x·w + c) + s ≥ 1

    Then Σ_i s_i is the total amount by which the constraints are violated, so try to make Σ_i s_i as small as possible

  • Slide 25/76

    Perceptron as convex program

    The final convex program for the perceptron is:

        min Σ_i s_i   subject to
        (y_i x_i)·w + y_i c + s_i ≥ 1
        s_i ≥ 0

    We will try to understand this program using convex duality
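    Since this is a linear program, one way to see it concretely is to hand it to a generic LP solver. A hedged sketch using scipy.optimize.linprog with illustrative toy data (the slides themselves do not prescribe a solver):

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-D data: rows of X are examples, y in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape
Z = y[:, None] * X                       # rows z_i = y_i x_i

# Variables stacked as [w (n), c (1), s (m)]; minimize sum(s)
cost = np.concatenate([np.zeros(n + 1), np.ones(m)])
# Constraint (y_i x_i).w + y_i c + s_i >= 1  rewritten as  -(Z w + y c + s) <= -1
A_ub = np.hstack([-Z, -y[:, None], -np.eye(m)])
b_ub = -np.ones(m)
bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # w, c free; s >= 0

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, c, s = res.x[:n], res.x[n], res.x[n + 1:]
print("w =", w, "c =", c, "total slack =", s.sum())
```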

  • Slide 26/76

    Duality

    To every convex program corresponds a dual

    Solving original (primal) is equivalent to solving dual

    Dual often provides insight

    Can derive the dual by using Lagrange multipliers to eliminate constraints

  • Slide 27/76

    Lagrange Multipliers

    Way to phrase constrained optimization problem as a game

        max_x f(x)   subject to   g(x) ≥ 0

    (assume f, g are convex downward, i.e. concave)

        max_x min_{a≥0} f(x) + a·g(x)

    If x plays a point with g(x) < 0, the a-player can drive the payoff to −∞, so the inner min enforces the constraint

  • Slide 28/76

    Lagrange Multipliers: the picture

    [Figure: level sets of f(x, y) and the circular constraint boundary for the example on the next slide]

  • Slide 29/76

    Lagrange Multipliers: the caption

    Problem: maximize

    f(x, y) = 6x + 8y

    subject to

    g(x, y) = 1 − x² − y² ≥ 0

    Using a Lagrange multiplier a,

        max_{x,y} min_{a≥0} f(x, y) + a·g(x, y)

    At the optimum,

        0 = ∇f(x, y) + a·∇g(x, y) = (6, 8) − 2a·(x, y)
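    Completing the example numerically (this completion is not on the slide; it assumes the sign convention reconstructed above and uses scipy only as a check):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = 6x + 8y subject to g(x, y) = 1 - x^2 - y^2 >= 0.
# Analytically: 0 = grad f + a * grad g  =>  (6, 8) = 2a (x, y)  =>  (x, y) = (3/a, 4/a);
# the active constraint x^2 + y^2 = 1 then gives a = 5, so (x, y) = (0.6, 0.8) and f = 10.
res = minimize(lambda p: -(6 * p[0] + 8 * p[1]),
               x0=[0.0, 0.0],
               constraints=[{"type": "ineq", "fun": lambda p: 1 - p[0]**2 - p[1]**2}],
               method="SLSQP")
print(res.x, -res.fun)   # approximately [0.6, 0.8] and 10
```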

  • Slide 30/76

    Duality for the perceptron

    Notation: z_i = y_i x_i and Z = [z_1; z_2; ...], so that:

        min_{s,w,c} Σ_i s_i   subject to   Zw + cy + s ≥ 1,   s ≥ 0

    Using a Lagrange multiplier vector a ≥ 0,

        min_{s,w,c} max_a Σ_i s_i − aᵀ(Zw + cy + s − 1)

        subject to   s ≥ 0,   a ≥ 0

  • Slide 31/76

    Duality, cont'd

    From the last slide:

        min_{s,w,c} max_a Σ_i s_i − aᵀ(Zw + cy + s − 1)
        subject to   s ≥ 0,   a ≥ 0

    Minimize with respect to w, c explicitly by setting the gradient to 0:

        0 = aᵀZ
        0 = aᵀy

    Minimizing with respect to s yields an inequality:

        0 ≤ 1 − a

  • Slide 32/76

    Duality, cont'd

    Final form of the dual program for the perceptron:

        max_a 1ᵀa   subject to
        0 = aᵀZ
        0 = aᵀy
        0 ≤ a ≤ 1

  • Slide 33/76

    Problems with perceptrons

    [2-D scatter plot]

    Vulnerable to overfitting when there are many input features

    Not very expressive (e.g., cannot represent XOR)

  • Slide 34/76

    Outline

    Classification problems

    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 35/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion

    Kernel trick

    Result is called a Support Vector Machine (reason given below)

  • Slide 36/76

    Margins

    The margin is the signed distance from an example to the decision boundary

    Positive margin: correctly classified; negative margin: error

  • Slide 37/76

    SVMs are maximum margin

    Maximize minimum distance from data to separator

    Ball center of version space (caveats)

    Other centers: analytic center, center of mass, Bayes point

    Note: if not linearly separable, must trade margin vs. errors

  • Slide 38/76

    Why do margins help?

    If our hypothesis is near the boundary of decision space, we don't necessarily learn much from our mistakes

    If we're far away from any boundary, a mistake has to eliminate a large volume from version space

  • Slide 39/76

    Why margins help, explanation 2

    Occam's razor: simple classifiers are likely to do better in practice

    Why? There are fewer simple classifiers than complicated ones, so we are less likely to be able to fool ourselves by finding a really good fit by accident.

    What does "simple" mean? Anything, as long as you tell me before you see the data.

  • Slide 40/76

    Explanation 2, cont'd

    Simple can mean:

    Low-dimensional

    Large margin

    Short description length

    For this lecture we are interested in large margins and compact descriptions

    By contrast, many classical complexity-control methods (AIC, BIC) rely on low dimensionality alone

  • Slide 41/76

    Why margins help, explanation 3

    [Loss plots]

    Margin loss is an upper bound on the number of mistakes

  • Slide 42/76

    Why margins help, explanation 4

  • Slide 43/76

    Optimizing the margin

    Most common method: convex quadratic program

    Efficient algorithms exist (essentially the same as some interior-point LP algorithms)

    Because the QP is strictly convex, it has a unique global optimum (if you ignore the intercept term)

    Next few slides derive the QP. Notation: assume w.l.o.g. ‖x_i‖ = 1; ignore slack variables for now (i.e., assume linearly separable)

  • Slide 44/76

    Optimizing the margin, cont'd

    The margin M is the distance to the decision surface: for a positive example, (x − M w/‖w‖)·w + c = 0, i.e.

        x·w + c = M w·w/‖w‖ = M‖w‖

    The SVM maximizes M > 0 such that all margins are ≥ M:

        max_{M>0, w, c} M   subject to   (y_i x_i)·w + y_i c ≥ M‖w‖

    Notation: z_i = y_i x_i and Z = [z_1; z_2; ...], so that:

        Zw + yc ≥ M‖w‖

    Note: (w, c) is a solution exactly when any rescaling (λw, λc), λ > 0, is

  • Slide 45/76

    Optimizing the margin, cont'd

    Divide by M‖w‖ to get (Zw + yc)/(M‖w‖) ≥ 1

    Define v = w/(M‖w‖) and d = c/(M‖w‖), so that ‖v‖ = ‖w‖/(M‖w‖) = 1/M

        max_{v,d} 1/‖v‖   subject to   Zv + yd ≥ 1

    Maximizing 1/‖v‖ is minimizing ‖v‖, which is minimizing ‖v‖²:

        min_{v,d} ‖v‖²   subject to   Zv + yd ≥ 1

    Add slack variables to handle the non-separable case:

        min_{s≥0, v, d} ‖v‖² + C Σ_i s_i   subject to   Zv + yd + s ≥ 1
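    A hedged sketch of this final QP using the cvxpy modeling library (the toy data and the value of C are illustrative; the slides do not specify a solver):

```python
import numpy as np
import cvxpy as cp

# Toy data: rows of X are examples, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.5, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X                      # rows z_i = y_i x_i
m, n = X.shape
C = 1.0                                 # slack penalty

v = cp.Variable(n)
d = cp.Variable()
s = cp.Variable(m, nonneg=True)
constraints = [Z @ v + d * y + s >= 1]
objective = cp.Minimize(cp.sum_squares(v) + C * cp.sum(s))
cp.Problem(objective, constraints).solve()

print("v =", v.value, "d =", d.value)
print("margin M = 1/||v|| =", 1.0 / np.linalg.norm(v.value))
```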

  • Slide 46/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion
    Kernel trick

  • Slide 47/76

    Feature expansion

    Given an example x = [a b ...]

    Could add new features like a², ab, a⁷b³, sin(b), ...

    Same optimization as before, but with longer x vectors and so a longer w vector

    Classifier: is 3a + 2b + a² + 3ab − a⁷b³ + 4 sin(b) ≥ 2.6?

    This classifier is nonlinear in the original features, but linear in the expanded feature space

    We have replaced x by φ(x) for some nonlinear φ, so the decision boundary is the nonlinear surface w·φ(x) + c = 0
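    A tiny illustration of an explicit feature expansion φ and a classifier that is linear in φ(x) but nonlinear in the original features (the particular features and weights below are hypothetical, not the slide's):

```python
import numpy as np

def phi(x):
    # A hypothetical explicit feature expansion of x = [a, b]
    a, b = x
    return np.array([a, b, a**2, a*b, np.sin(b)])

# A linear classifier in the expanded space is nonlinear in (a, b)
w = np.array([3.0, 2.0, 1.0, 3.0, 4.0])   # weights on the expanded features
c = -2.6
x = np.array([0.5, 1.2])
print(np.sign(phi(x) @ w + c))
```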

  • Slide 48/76

    Feature expansion example

    [Figure: the same dataset plotted with features (x, y) and with expanded features (x, y, xy)]

  • Slide 49/76

    Some popular feature sets

    Polynomials of degree k

        1, a, a², b, b², ab, ...

    Neural nets (sigmoids)

        tanh(3a + 2b − 1), tanh(5a − 4), ...

    RBFs of radius σ

        exp(−((a − a₀)² + (b − b₀)²) / (2σ²))

  • Slide 50/76

    Feature expansion problems

    Feature expansion techniques yield lots of features

    E.g., a polynomial of degree k on n original features yields O(n^k) expanded features

    E.g. RBFs yield infinitely many expanded features!

    Inefficient (for i = 1 to infinity do ...)

    Overfitting (VC-dimension argument)

  • Slide 51/76

    How to fix feature expansion

    We have already shown we can handle the overfitting problem: even if we have lots of parameters, large margins make simple classifiers

    All that's left is efficiency

    Solution: kernel trick

  • Slide 52/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion

    Kernel trick

  • Slide 53/76

    Kernel trick

    Way to make optimization efficient when there are lots of features

    Compute one Lagrange multiplier per training example instead of one weight per feature (part I)

    Use a kernel function to avoid ever representing w (part II)

    Will mean we can handle infinitely many features!

  • Slide 54/76

    Kernel trick, part I

        min_{w,c} ‖w‖²/2   subject to   Zw + yc ≥ 1

        min_{w,c} max_{a≥0} wᵀw/2 + a·(1 − Zw − yc)

    Minimize with respect to w, c by setting derivatives to 0:

        0 = w − Zᵀa        0 = a·y

    Substitute back in for w, c:

        max_{a≥0} a·1 − aᵀZZᵀa/2   subject to   a·y = 0

    Note: to allow slacks, add an upper bound a ≤ C

  • Slide 55/76

    What did we just do?

        max_{0≤a≤C} a·1 − aᵀZZᵀa/2   subject to   a·y = 0

    Now we have a QP in a instead of in w, c

    Once we solve for a, we can find w = Zᵀa to use for classification

    We also need c, which we can get from complementarity:

        y_i x_i·w + y_i c = 1   whenever a_i > 0

    or as the dual variable for the constraint a·y = 0

  • Slide 56/76

    Representation of w

    The optimal w = Zᵀa is a linear combination of the rows of Z

    I.e., w is a linear combination of (signed) training examples

    I.e., w has a finite representation even if there are infinitely many features

  • Slide 57/76

    Support vectors

    Examples with a_i > 0 are called support vectors

    Support vector machine = learning algorithm ("machine") based on support vectors

    There are often many fewer support vectors than training examples (a is sparse)

    This is the "short description" of an SVM mentioned above

  • Slide 58/76

    Intuition for support vectors

    [Scatter plot illustrating which examples become support vectors]

  • Slide 59/76

  • Slide 60/76

    At the end of optimization

    The gradient with respect to a_i is 1 − y_i(x_i·w + c)

    Increase a_i if the (scaled) margin is < 1

    Stable iff (a_i = 0 AND margin ≥ 1) OR margin = 1

  • Slide 61/76

    How to avoid writing down weights

    Suppose number of features is really big or even infinite?

    Can't write down X, so how do we solve the QP?

    Can't even write down w, so how do we classify new examples?

  • Slide 62/76

  • Slide 63/76

    Kernel trick, part II

    Yes, we can compute G directly, sometimes!

    Recall that x_i was the result of applying a nonlinear feature expansion function φ to some shorter vector (say q_i)

    Define K(q_i, q_j) = φ(q_i)·φ(q_j)

  • Slide 64/76

  • Slide 65/76

    Example kernels

    Polynomial (a typical component of φ might be 17 q_1² q_2³ q_4)

        K(q, q′) = (1 + q·q′)^k

    Sigmoid (typical component tanh(q_1 + 3q_2))

        K(q, q′) = tanh(a q·q′ + b)

    Gaussian RBF (typical component exp(−(q_1 − 5)²/(2σ²)))

        K(q, q′) = exp(−‖q − q′‖²/(2σ²))
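    Sketches of these three kernels as plain Python functions (the hyperparameter names k, a, b, sigma follow the slide's notation; the default values and test vectors are illustrative):

```python
import numpy as np

def poly_kernel(q, qp, k=2):
    return (1.0 + q @ qp) ** k

def sigmoid_kernel(q, qp, a=1.0, b=0.0):
    return np.tanh(a * (q @ qp) + b)

def rbf_kernel(q, qp, sigma=1.0):
    return np.exp(-np.sum((q - qp) ** 2) / (2.0 * sigma ** 2))

q, qp = np.array([1.0, 2.0, 0.5]), np.array([0.0, 1.0, -1.0])
print(poly_kernel(q, qp), sigmoid_kernel(q, qp), rbf_kernel(q, qp))
```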

  • Slide 66/76

    Detail: polynomial kernel

    Suppose x = φ(q) = (1, √2 q, q²)

    Then x·x′ = 1 + 2qq′ + q²(q′)²

    From the previous slide,

        K(q, q′) = (1 + qq′)² = 1 + 2qq′ + q²(q′)²

    Dot product + addition + exponentiation vs. O(n^k) terms
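    A quick numeric check of this identity for a scalar q (the values are illustrative):

```python
import numpy as np

def phi(q):
    # Explicit degree-2 feature map for scalar q, as on the slide
    return np.array([1.0, np.sqrt(2.0) * q, q ** 2])

def K(q, qp):
    # Degree-2 polynomial kernel
    return (1.0 + q * qp) ** 2

q, qp = 0.7, -1.3
print(phi(q) @ phi(qp))   # explicit feature expansion
print(K(q, qp))           # kernel shortcut: same value
```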

  • Slide 67/76

    The new decision rule

    Recall the original decision rule: sign(x·w + c)

    Use the representation in terms of support vectors:

        sign(x·Zᵀa + c) = sign(Σ_i x·x_i y_i a_i + c) = sign(Σ_i K(q, q_i) y_i a_i + c)

    Since there are usually not too many support vectors, this is a reasonably fast calculation

  • Slide 68/76

    Summary of SVM algorithm

    Training:

    Compute the Gram matrix G_ij = y_i y_j K(q_i, q_j)
    Solve the QP to get a
    Compute the intercept c by using complementarity or duality

    Classification:

    Compute k_i = K(q, q_i) for the support vectors q_i
    Compute f = c + Σ_i a_i k_i y_i
    Test sign(f)
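    An end-to-end sketch of these training and classification steps, using the dual QP from slides 54-55, an RBF kernel, and cvxpy as the QP solver (the toy data, C, and sigma are illustrative; this is one possible implementation, not the slides' own code):

```python
import numpy as np
import cvxpy as cp

def rbf(qi, qj, sigma=1.0):
    return np.exp(-np.sum((qi - qj) ** 2) / (2.0 * sigma ** 2))

# Illustrative toy training set (XOR-like, so a linear classifier would fail)
Q = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

# Training step 1: kernel matrix and Gram matrix G_ij = y_i y_j K(q_i, q_j)
K = np.array([[rbf(qi, qj) for qj in Q] for qi in Q])
G = np.outer(y, y) * K

# Training step 2: solve the dual QP  max_a 1'a - a'Ga/2  s.t.  0 <= a <= C, a'y = 0.
# The quadratic a'Ga is written as ||L'(y*a)||^2 with K ~ L L' (Cholesky plus a small
# jitter), which keeps the problem in a form cvxpy accepts without PSD warnings.
L = np.linalg.cholesky(K + 1e-9 * np.eye(m))
a = cp.Variable(m)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(L.T @ cp.multiply(y, a)))
cp.Problem(obj, [a >= 0, a <= C, a @ y == 0]).solve()
alpha = a.value

# Training step 3: intercept c from complementarity (use an i with 0 < a_i < C)
i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
c = y[i] - np.sum(alpha * y * K[i])

# Classification: f(q) = c + sum_i a_i K(q, q_i) y_i, then take the sign
def classify(q):
    k = np.array([rbf(q, qi) for qi in Q])
    return np.sign(c + np.sum(alpha * k * y))

print([classify(q) for q in Q])   # should reproduce the training labels
```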

  • Slide 69/76

    Outline

    Classification problems

    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 70/76

    Advanced kernels

    All problems so far: each example is a list of numbers

    What about text, relational DBs, . . . ?

    Insight: K(x, y) can be defined when x and y are not fixed length

    Examples:

    String kernels
    Path kernels
    Tree kernels

    Graph kernels

  • Slide 71/76

    String kernels

    Pick λ ∈ (0, 1)

    cat → c, a, t, λ ca, λ at, λ² ct, λ² cat

    Strings are similar if they share lots of nearly-contiguous substrings

    Works for words in phrases too: "man bites dog" is similar to "man bites hot dog", less similar to "dog bites man"

    There is an efficient dynamic-programming algorithm to evaluate this kernel (Lodhi et al., 2002)
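    For intuition, a brute-force version of this feature map and kernel (assuming the λ^(span−1) weighting reconstructed in the "cat" example above; it is exponential in the string length, so it is for illustration only, and the dynamic program of Lodhi et al. is the practical route):

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, lam=0.5):
    """Brute-force feature map: every (possibly gappy) subsequence u of s,
    weighted by lam**(span - 1), where span is the stretch of s it covers."""
    feats = defaultdict(float)
    for r in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), r):
            u = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            feats[u] += lam ** (span - 1)
    return feats

def string_kernel(s, t, lam=0.5):
    fs, ft = subsequence_features(s, lam), subsequence_features(t, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

print(sorted(subsequence_features("cat").items()))   # matches the slide's example
print(string_kernel("man bites dog", "dog bites man"))
```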

  • Slide 72/76

    Combining kernels

    Suppose K(x, y) and K′(x, y) are kernels

    Then so are

        K + K′   and   αK for α > 0

    Given a set of kernels K_1, K_2, ..., we can search for the best

        K = α_1 K_1 + α_2 K_2 + ...

    using cross-validation, etc.
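    A small sketch of this closure property as code (it reuses the hypothetical kernel functions sketched after slide 65; the weights are arbitrary):

```python
def combine_kernels(kernels, weights):
    """A nonnegative weighted sum of kernel functions is again a kernel."""
    def K(x, y):
        return sum(w * k(x, y) for k, w in zip(kernels, weights))
    return K

# e.g. mix the RBF and degree-2 polynomial kernels defined earlier:
# K_mix = combine_kernels([rbf_kernel, poly_kernel], [0.7, 0.3])
```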

  • Slide 73/76

    Kernel X

    The kernel trick isn't limited to SVMs

    Works whenever we can express an algorithm using only sums and dot products of training examples

    Examples:

    kernel Fisher discriminant
    kernel logistic regression
    kernel linear and ridge regression
    kernel SVD or PCA
    1-class learning / density estimation

  • Slide 74/76

    Summary

    Perceptrons are a simple, popular way to learn a classifier

    They suffer from inefficient use of data, overfitting, and lack of expressiveness

    SVMs fix these problems using margins and feature expansion

    In order to make feature expansion computationally feasible, we need the kernel trick

    The kernel trick avoids writing out high-dimensional feature vectors by use of Lagrange multipliers and the representer theorem

    SVMs are popular classifiers because they usually achieve good error rates and can handle unusual types of data

  • Slide 75/76

    References

    http://www.cs.cmu.edu/~ggordon/SVMs

    these slides, together with code

    http://svm.research.bell-labs.com/SVMdoc.html

    Burges' SVM tutorial

    http://citeseer.nj.nec.com/burges98tutorial.html
    Burges' paper "A Tutorial on Support Vector Machines for Pattern Recognition" (1998)

  • Slide 76/76

    References

    Huma Lodhi, Craig Saunders, Nello Cristianini, John Shawe-Taylor, Chris Watkins. Text Classification using String Kernels. 2002.

    http://www.cs.rhbnc.ac.uk/research/compint/areas/comp_learn/sv/pub/slide1.ps
    Slides by Stitson & Weston

    http://oregonstate.edu/dept/math/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
    Lagrange Multipliers

    http://svm.research.bell-labs.com/SVT/SVMsvt.html

    on-line SVM applet