Margins, Support Vectors, and Linear Programming
Thanks to Terran Lane and S. Dreiseitl
Exercise
• Derive the vector derivative expressions: $\nabla_w (w^T x) = x$ and $\nabla_w (w^T A w) = (A + A^T)\,w$
• Find an expression for the minimum squared-error weight vector, w, in the loss function: $L(w) = \sum_i (t_i - w^T x_i)^2 = \|t - Xw\|^2$
Solution to LSE regression
Setting the gradient of the loss to zero: $\nabla_w L(w) = -2X^T(t - Xw) = 0 \;\Rightarrow\; X^T X w = X^T t \;\Rightarrow\; w = (X^T X)^{-1} X^T t$
The LSE method
• The quantity $X^T X$ is called a Gram matrix and is positive semidefinite and symmetric
• The quantity $(X^T X)^{-1} X^T$ is the pseudoinverse of X
  • May not exist if the Gram matrix is not invertible
• The complete “learning algorithm” is 2 whole lines of Matlab code (sketch below)
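The slides' Matlab itself isn't reproduced here; a numpy sketch of the same two-line idea, assuming X is the n×d design matrix and t the target vector:

```python
import numpy as np

def lse_fit(X, t):
    # w = (X^T X)^{-1} X^T t; pinv also copes with a singular Gram matrix
    return np.linalg.pinv(X) @ t
```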
LSE example
[Figure: 1-D regression fit t = y(x1, w) with learned weights w = [6.72, -0.36]]
The LSE method
• So far, we have a regressor -- it estimates a real-valued t_i for each x_i
• Can convert it to a classifier by assigning t = +1 or -1 to binary-class training data (sketched below)
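A minimal sketch of that conversion, reusing the `lse_fit` idea above (helper names are ours, not the slides'):

```python
import numpy as np

def lse_classify(X_train, t_pm1, X_query):
    # Fit the regressor on +/-1 targets, then threshold its output at zero
    w = np.linalg.pinv(X_train) @ t_pm1
    return np.sign(X_query @ w)
```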
Multiclass trouble
[Figure: multiclass data where binary linear separators leave an ambiguous “?” region]
Handling non-binary data
• All against all (sketched in code after this list):
  • Train O(c^2) classifiers, one for each pair of classes
  • Run every test point through all classifiers
  • Majority vote for the final classification
• More stable than 1-vs-many
• Lots more overhead, especially for large c
• Data may be more balanced
• Each classifier is trained on a very small part of the data
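A minimal sketch of the all-against-all scheme, assuming `fit(X, t)` / `predict(model, x)` callables for any binary ±1 classifier (these names are hypothetical):

```python
from itertools import combinations
import numpy as np

def train_all_pairs(X, y, fit):
    # Train one binary classifier per pair of class labels
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        t = np.where(y[mask] == a, 1, -1)      # relabel the pair as +1 / -1
        models[(a, b)] = fit(X[mask], t)
    return models

def predict_all_pairs(models, predict, x):
    # Run x through every pairwise classifier and take a majority vote
    votes = {}
    for (a, b), m in models.items():
        winner = a if predict(m, x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```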
Support Vector Machines
Linear separators are nice
• ... but what if your data looks like this:
[Figure: two classes arranged so that no single line separates them]
Linearly nonseparable data
• 2 possibilities:
  • Use nonlinear separators (different hypothesis space)
    • Possibly an intersection of multiple linear separators, etc. (e.g., a decision tree)
  • Change the data
    • Nonlinear projection of the data
• These turn out to be flip sides of each other
• It's easier to think about (do the math for) the 2nd case
Nonlinear data projection
• Suppose you have a “projection function”: $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$
  • Original feature space: $\mathbb{R}^d$
  • “Projected” space: $\mathbb{R}^{d'}$
  • Usually $d' \gg d$
• Do the learning with a linear model in $\mathbb{R}^{d'}$
• Ex: a degree-2 polynomial map (sketched below)
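A sketch of one such projection; the specific degree-2 map here is an assumption (the slide's own example did not survive extraction):

```python
import numpy as np

def phi(x):
    # Map a 2-D point to 3-D so a quadratic boundary becomes linear
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])
```

With this φ, a linear separator in the projected space corresponds to a quadratic (elliptical) boundary in the original space.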
[Figures: the projection φ mapping linearly nonseparable data into a space where it becomes linearly separable]
Common projections
• Degree-k polynomials: all monomials of the input features up to degree k
• Fourier expansions: sines and cosines of the input features at increasing frequencies
Example nonlinear surfaces
[Figures: decision surfaces induced by nonlinear projections; SVM images from lecture notes by S. Dreiseitl]
The catch...
• How many dimensions does $\phi(x)$ have?
• For degree-k polynomial expansions: $d' = \binom{d+k}{k}$ (all monomials up to degree k)
• E.g., for k=4, d=256 (16x16 images), $d' \approx 1.9 \times 10^8$ (checked numerically below)
• Yike!
• For “radial basis functions”, $d' = \infty$
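A quick check of that count (assuming the “all monomials up to degree k” convention for the expansion):

```python
from math import comb

d, k = 256, 4
print(comb(d + k, k))  # 186043585 -- roughly 1.9e8 dimensions
```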
Linear surfaces for cheap
• Can’t directly find linear surfaces in $\mathbb{R}^{d'}$
• Have to find a clever method for finding them indirectly
• It’ll take (quite) a bit of work to get there...
• Will need a different criterion than squared error
• We’ll look for the “maximum margin” classifier
  • A surface s.t. class +1 (“true”) data falls as far as possible on one side; class -1 (“false”) falls as far as possible on the other
Max margin hyperplanes
[Figure: several separating hyperplanes and their margins]

Max margin is unique
[Figure: the maximum-margin hyperplane and its margin]
Back to SVMs & margins
• The margins are parallel to the hyperplane, so they are defined by the same w, plus constant offsets ±b
• Want to ensure that all data points are “outside” the margins
[Figure: hyperplane with normal vector w and margin planes offset by b on each side]
Maximizing the margin
• So now we have a learning criterion function:
  • Pick w to maximize the margin offset b s.t. all points still satisfy $t_i\,(w \cdot x_i + w_0) \ge b$
• Note: w.l.o.g. can rescale w arbitrarily (why?), so we can fix b = 1
• So can formulate the full problem as:
  • Minimize: $\frac{1}{2}\|w\|^2$
  • Subject to: $t_i\,(w \cdot x_i + w_0) \ge 1 \quad \forall i$
• But how do you do that? And how does this help?
Quadratic programming
• Problems of the form
  • Minimize: $\frac{1}{2} u^T Q u + c^T u$
  • Subject to: $A u \le d$
• are called “quadratic programming” (QP) problems
• There are off-the-shelf methods to solve them
• Actually solving this is way, way beyond the scope of this class
  • Consider it a black box (a sketch using one off-the-shelf solver follows)
• If a solution exists, it will be found & be unique
• Expensive, but not intractably so
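For a concrete sense of “off-the-shelf”, here is a sketch of the hard-margin problem posed to cvxopt's QP solver; the variable stacking u = [w, w0] is our choice, and the data is assumed linearly separable:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, t):
    # Minimize (1/2)||w||^2 subject to t_i (w . x_i + w0) >= 1,
    # written as a generic QP over the stacked variable u = [w, w0]
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)                                # quadratic term touches w only
    q = np.zeros(d + 1)
    G = -t[:, None] * np.hstack([X, np.ones((n, 1))])    # -t_i [x_i, 1] u <= -1
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:d], u[d]                                   # w, w0
```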
Nonseparable data
• What if the data isn’t linearly separable?
  • Project into a higher-dimensional space (we’ll get there)
  • Allow some “slop” in the system: let the margins be violated “a little”
[Figure: hyperplane with a few points falling inside the margin]
The new “slackful” QP
• The $\xi_i$ are “slack variables”
  • Allow margins to be violated a little
• Still want to minimize margin violations, so add them to the QP instance:
  • Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
  • Subject to: $t_i\,(w \cdot x_i + w_0) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \quad \forall i$
You promised nonlinearity!
• Where did the nonlinear transform go in all this?
• Another clever trick
• With a little algebra (& help from Lagrange multipliers), can rewrite our QP in the dual form:
  • Maximize: $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j\,(x_i \cdot x_j)$
  • Subject to: $0 \le \alpha_i \le C,\quad \sum_i \alpha_i t_i = 0$
Kernel functions
• So??? It’s still the same linear system
• Note, though, that x appears in the system only as a dot product: $(x_i \cdot x_j)$
• Can replace $(x_i \cdot x_j)$ with $\phi(x_i) \cdot \phi(x_j)$:
  • Maximize: $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j\,\phi(x_i) \cdot \phi(x_j)$
• The inner product $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is called a “kernel function” (verified numerically below)
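A tiny numeric check of that identity for the degree-2 projection sketched earlier: the kernel computed in the low-dimensional space matches the dot product in the projected space.

```python
import numpy as np

def phi(x):  # degree-2 projection from before
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)     # kernel in the original space: 1.0
print(phi(x) @ phi(z))  # dot product in the projected space: 1.0
```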
Why are kernel fns cool?
• The cool trick is that many useful projections can be written as kernel functions in closed form
• I.e., can work with K() rather than $\phi()$
• If you know $K(x_i, x_j)$ for every (i, j) pair, then you can construct the maximum margin hyperplane between the projected data without ever explicitly doing the projection!
Example kernels
• Homogeneous degree-k polynomial: $K(x_i, x_j) = (x_i \cdot x_j)^k$
• Inhomogeneous degree-k polynomial: $K(x_i, x_j) = (x_i \cdot x_j + 1)^k$
• Gaussian radial basis function: $K(x_i, x_j) = \exp\!\big(-\|x_i - x_j\|^2 / 2\sigma^2\big)$
• Sigmoidal (neural network): $K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + \theta)$
(All four are sketched in code below.)
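Sketches of these four kernels in numpy (the parameter names sigma, kappa, theta are our notation):

```python
import numpy as np

def k_poly_hom(x, z, k):             # homogeneous degree-k polynomial
    return (x @ z) ** k

def k_poly_inhom(x, z, k):           # inhomogeneous degree-k polynomial
    return (x @ z + 1) ** k

def k_rbf(x, z, sigma):              # Gaussian radial basis function
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def k_sigmoid(x, z, kappa, theta):   # sigmoidal / neural-network kernel
    return np.tanh(kappa * (x @ z) + theta)
```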
Side note on kernels
• What precisely do kernel functions mean?
• Metric functions take two points and return a (generalized) distance between them
• What is the equivalent interpretation for kernels?
• Hint: think about what the kernel function replaces in the max margin QP formulation
Side note on kernels
• Kernel functions are generalized inner products
• Essentially, they give you the cosine of the angle between vectors
• Recall the law of cosines: $\|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2\,\|x\|\|z\|\cos\theta$, where $x \cdot z = \|x\|\|z\|\cos\theta$
Side note on kernels
• Replace the traditional dot product with the “generalized inner product” and get: $\|\phi(x) - \phi(z)\|^2 = K(x, x) + K(z, z) - 2K(x, z)$
• Kernel (essentially) represents:
  • The angle between vectors in the projected, high-dimensional space
• Alternatively:
  • A nonlinear distribution of angles in the low-dim space
Example of kernel nonlinearity
[Figures: how the kernel reshapes distances and angles compared to the plain dot product]
Using the classifier
• Solution of the QP gives back a set of $\alpha_i$
• Data points for which $\alpha_i > 0$ are called “support vectors”
• Turns out that we can write w as $w = \sum_i \alpha_i t_i\,\phi(x_i)$
Using the classifier
• And our classification rule for query pt x was: $y(x) = \mathrm{sign}\big(w \cdot \phi(x) + b\big)$
• So: $y(x) = \mathrm{sign}\big(\sum_i \alpha_i t_i\,K(x_i, x) + b\big)$ (sketched below)
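A sketch of that rule in code, keeping only the support vectors (helper names are ours):

```python
import numpy as np

def svm_predict(alpha, t_sv, X_sv, b, K, x):
    # y(x) = sign( sum_i alpha_i t_i K(x_i, x) + b ), over support vectors only
    s = sum(a * ti * K(xi, x) for a, ti, xi in zip(alpha, t_sv, X_sv))
    return np.sign(s + b)
```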
Using the classifier
[Figure: decision surface with the support vectors highlighted; SVM images from lecture notes by S. Dreiseitl]
Putting it all together
1. Start with the original (low-dimensional) data matrix.
2. Apply the kernel function to all pairs of points to build the kernel matrix.
3. Combine the kernel matrix with the original labels to form the Quadratic Program instance (the maximize / subject-to system above).
4. Hand the QP instance to a QP solver subroutine; it returns the support vector weights $\alpha_i$.
5. The support vector weights define a hyperplane in the projected space $\mathbb{R}^{d'}$.
6. Via the kernel, that hyperplane is the final classifier: a nonlinear classifier in the original space.
(The whole pipeline is sketched in code below.)
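The pipeline fits in a short sketch, again leaning on cvxopt as the QP-solver black box (the 1e-6 tolerance and helper names are our choices):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_train(X, t, K, C=1.0):
    # Steps 1-4: kernel matrix -> QP instance -> QP solver -> SV weights
    n = X.shape[0]
    Kmat = np.array([[K(xi, xj) for xj in X] for xi in X])
    P = matrix(np.outer(t, t) * Kmat)                 # quadratic term of the dual
    q = matrix(-np.ones(n))                           # i.e., maximize sum(alpha)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(t.reshape(1, n).astype(float))         # sum_i alpha_i t_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    sv = alpha > 1e-6                                 # the support vectors
    # Steps 5-6: recover the bias from a margin support vector, keep only SVs
    i = np.flatnonzero(sv & (alpha < C - 1e-6))[0]
    bias = t[i] - np.sum(alpha[sv] * t[sv] * Kmat[sv, i])
    return alpha[sv], t[sv], X[sv], bias
```

Classifying a new point then uses only the returned support vectors, as in the `svm_predict` sketch above.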
Final notes on SVMs
• Note that only the $x_i$ for which $\alpha_i > 0$ actually contribute to the final classifier
  • This is why they are called support vectors
  • All the rest of the training data can be discarded
Final notes on SVMs
• Complexity of training (& ability to generalize) is based only on the amount of training data
  • Not based on the dimension of the hyperplane space ($\mathbb{R}^{d'}$)
• Good classification performance
  • In practice, SVMs are among the strongest classifiers we have
• Closely related to neural nets, boosting, etc.