Margins, Support Vectors, and Linear Programming
Thanks to Terran Lane and S. Dreiseitl
Exercise
• Derive the vector derivative expressions: $\nabla_w (w^T x) = x$ and $\nabla_w (w^T A w) = (A + A^T)\,w$
• Find an expression for the minimum squared-error weight vector, w, in the loss function: $L(w) = \sum_i (t_i - w^T x_i)^2 = \|t - Xw\|^2$
Solution to LSE regression
Setting the gradient of the loss to zero: $\nabla_w L(w) = -2X^T(t - Xw) = 0 \;\Rightarrow\; X^T X w = X^T t \;\Rightarrow\; w = (X^T X)^{-1} X^T t$
The LSE method
• The quantity $X^T X$ is called a Gram matrix and is positive semidefinite and symmetric
• The quantity $(X^T X)^{-1} X^T$ is the pseudoinverse of X
  • May not exist if the Gram matrix is not invertible
• The complete “learning algorithm” is 2 whole lines of Matlab code (sketch below)
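The slides' Matlab itself isn't reproduced here; a numpy sketch of the same two-line idea, assuming X is the n×d design matrix and t the target vector:

```python
import numpy as np

def lse_fit(X, t):
    # w = (X^T X)^{-1} X^T t; pinv also copes with a singular Gram matrix
    return np.linalg.pinv(X) @ t
```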
LSE example
[Figure: 1-D regression fit t = y(x1, w) with learned weights w = [6.72, -0.36]]
The LSE method
• So far, we have a regressor -- it estimates a real-valued t_i for each x_i
• Can convert it to a classifier by assigning t = +1 or -1 to binary-class training data (sketched below)
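A minimal sketch of that conversion, reusing the `lse_fit` idea above (helper names are ours, not the slides'):

```python
import numpy as np

def lse_classify(X_train, t_pm1, X_query):
    # Fit the regressor on +/-1 targets, then threshold its output at zero
    w = np.linalg.pinv(X_train) @ t_pm1
    return np.sign(X_query @ w)
```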
Multiclass trouble
[Figure: multiclass data where binary linear separators leave an ambiguous “?” region]
Handling non-binary data
• All against all (sketched in code after this list):
  • Train O(c^2) classifiers, one for each pair of classes
  • Run every test point through all classifiers
  • Majority vote for the final classification
• More stable than 1-vs-many
• Lots more overhead, especially for large c
• Data may be more balanced
• Each classifier is trained on a very small part of the data
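A minimal sketch of the all-against-all scheme, assuming `fit(X, t)` / `predict(model, x)` callables for any binary ±1 classifier (these names are hypothetical):

```python
from itertools import combinations
import numpy as np

def train_all_pairs(X, y, fit):
    # Train one binary classifier per pair of class labels
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        t = np.where(y[mask] == a, 1, -1)      # relabel the pair as +1 / -1
        models[(a, b)] = fit(X[mask], t)
    return models

def predict_all_pairs(models, predict, x):
    # Run x through every pairwise classifier and take a majority vote
    votes = {}
    for (a, b), m in models.items():
        winner = a if predict(m, x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```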
Support Vector Machines
Linear separators are nice
• ... but what if your data looks like this:
[Figure: two classes arranged so that no single line separates them]
Linearly nonseparable data
• 2 possibilities:
  • Use nonlinear separators (different hypothesis space)
    • Possibly an intersection of multiple linear separators, etc. (e.g., a decision tree)
  • Change the data
    • Nonlinear projection of the data
• These turn out to be flip sides of each other
• It's easier to think about (do the math for) the 2nd case
Nonlinear data projection
• Suppose you have a “projection function”: $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$
  • Original feature space: $\mathbb{R}^d$
  • “Projected” space: $\mathbb{R}^{d'}$
  • Usually $d' \gg d$
• Do the learning with a linear model in $\mathbb{R}^{d'}$
• Ex: a degree-2 polynomial map (sketched below)
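A sketch of one such projection; the specific degree-2 map here is an assumption (the slide's own example did not survive extraction):

```python
import numpy as np

def phi(x):
    # Map a 2-D point to 3-D so a quadratic boundary becomes linear
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])
```

With this φ, a linear separator in the projected space corresponds to a quadratic (elliptical) boundary in the original space.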
[Figures: the projection φ mapping linearly nonseparable data into a space where it becomes linearly separable]
Common projections
• Degree-k polynomials: all monomials of the input features up to degree k
• Fourier expansions: sines and cosines of the input features at increasing frequencies
Example nonlinear surfaces
[Figures: decision surfaces induced by nonlinear projections; SVM images from lecture notes by S. Dreiseitl]
The catch...
• How many dimensions does $\phi(x)$ have?
• For degree-k polynomial expansions: $d' = \binom{d+k}{k}$ (all monomials up to degree k)
• E.g., for k=4, d=256 (16x16 images), $d' \approx 1.9 \times 10^8$ (checked numerically below)
• Yike!
• For “radial basis functions”, $d' = \infty$
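A quick check of that count (assuming the “all monomials up to degree k” convention for the expansion):

```python
from math import comb

d, k = 256, 4
print(comb(d + k, k))  # 186043585 -- roughly 1.9e8 dimensions
```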
Linear surfaces for cheap
• Can’t directly find linear surfaces in $\mathbb{R}^{d'}$
• Have to find a clever method for finding them indirectly
• It’ll take (quite) a bit of work to get there...
• Will need a different criterion than squared error
• We’ll look for the “maximum margin” classifier
  • A surface s.t. class +1 (“true”) data falls as far as possible on one side; class -1 (“false”) falls as far as possible on the other
Max margin hyperplanes
[Figure: several separating hyperplanes and their margins]

Max margin is unique
[Figure: the maximum-margin hyperplane and its margin]
Back to SVMs & margins
• The margins are parallel to the hyperplane, so they are defined by the same w, plus constant offsets ±b
• Want to ensure that all data points are “outside” the margins
[Figure: hyperplane with normal vector w and margin planes offset by b on each side]
Maximizing the margin
• So now we have a learning criterion function:
  • Pick w to maximize the margin offset b s.t. all points still satisfy $t_i\,(w \cdot x_i + w_0) \ge b$
• Note: w.l.o.g. can rescale w arbitrarily (why?), so we can fix b = 1
• So can formulate the full problem as:
  • Minimize: $\frac{1}{2}\|w\|^2$
  • Subject to: $t_i\,(w \cdot x_i + w_0) \ge 1 \quad \forall i$
• But how do you do that? And how does this help?
Quadratic programming
• Problems of the form
  • Minimize: $\frac{1}{2} u^T Q u + c^T u$
  • Subject to: $A u \le d$
• are called “quadratic programming” (QP) problems
• There are off-the-shelf methods to solve them
• Actually solving this is way, way beyond the scope of this class
  • Consider it a black box (a sketch using one off-the-shelf solver follows)
• If a solution exists, it will be found & be unique
• Expensive, but not intractably so
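For a concrete sense of “off-the-shelf”, here is a sketch of the hard-margin problem posed to cvxopt's QP solver; the variable stacking u = [w, w0] is our choice, and the data is assumed linearly separable:

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, t):
    # Minimize (1/2)||w||^2 subject to t_i (w . x_i + w0) >= 1,
    # written as a generic QP over the stacked variable u = [w, w0]
    n, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)                                # quadratic term touches w only
    q = np.zeros(d + 1)
    G = -t[:, None] * np.hstack([X, np.ones((n, 1))])    # -t_i [x_i, 1] u <= -1
    h = -np.ones(n)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:d], u[d]                                   # w, w0
```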
Nonseparable data
• What if the data isn’t linearly separable?
  • Project into a higher-dimensional space (we’ll get there)
  • Allow some “slop” in the system: let the margins be violated “a little”
[Figure: hyperplane with a few points falling inside the margin]
The new “slackful” QP
• The $\xi_i$ are “slack variables”
  • Allow margins to be violated a little
• Still want to minimize margin violations, so add them to the QP instance:
  • Minimize: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
  • Subject to: $t_i\,(w \cdot x_i + w_0) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \quad \forall i$
You promised nonlinearity!
• Where did the nonlinear transform go in all this?
• Another clever trick
• With a little algebra (& help from Lagrange multipliers), can rewrite our QP in the dual form:
  • Maximize: $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j\,(x_i \cdot x_j)$
  • Subject to: $0 \le \alpha_i \le C,\quad \sum_i \alpha_i t_i = 0$
Kernel functions
• So??? It’s still the same linear system
• Note, though, that x appears in the system only as a dot product: $(x_i \cdot x_j)$
• Can replace $(x_i \cdot x_j)$ with $\phi(x_i) \cdot \phi(x_j)$:
  • Maximize: $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j t_i t_j\,\phi(x_i) \cdot \phi(x_j)$
• The inner product $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is called a “kernel function” (verified numerically below)
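A tiny numeric check of that identity for the degree-2 projection sketched earlier: the kernel computed in the low-dimensional space matches the dot product in the projected space.

```python
import numpy as np

def phi(x):  # degree-2 projection from before
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z) ** 2)     # kernel in the original space: 1.0
print(phi(x) @ phi(z))  # dot product in the projected space: 1.0
```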
Why are kernel fns cool?
• The cool trick is that many useful projections can be written as kernel functions in closed form
• I.e., can work with K() rather than $\phi()$
• If you know $K(x_i, x_j)$ for every (i, j) pair, then you can construct the maximum margin hyperplane between the projected data without ever explicitly doing the projection!
Example kernels
• Homogeneous degree-k polynomial: $K(x_i, x_j) = (x_i \cdot x_j)^k$
• Inhomogeneous degree-k polynomial: $K(x_i, x_j) = (x_i \cdot x_j + 1)^k$
• Gaussian radial basis function: $K(x_i, x_j) = \exp\!\big(-\|x_i - x_j\|^2 / 2\sigma^2\big)$
• Sigmoidal (neural network): $K(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + \theta)$
(All four are sketched in code below.)
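Sketches of these four kernels in numpy (the parameter names sigma, kappa, theta are our notation):

```python
import numpy as np

def k_poly_hom(x, z, k):             # homogeneous degree-k polynomial
    return (x @ z) ** k

def k_poly_inhom(x, z, k):           # inhomogeneous degree-k polynomial
    return (x @ z + 1) ** k

def k_rbf(x, z, sigma):              # Gaussian radial basis function
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def k_sigmoid(x, z, kappa, theta):   # sigmoidal / neural-network kernel
    return np.tanh(kappa * (x @ z) + theta)
```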
Side note on kernels
• What precisely do kernel functions mean?
• Metric functions take two points and return a (generalized) distance between them
• What is the equivalent interpretation for kernels?
• Hint: think about what the kernel function replaces in the max margin QP formulation
Side note on kernels
• Kernel functions are generalized inner products
• Essentially, they give you the cosine of the angle between vectors
• Recall the law of cosines: $\|x - z\|^2 = \|x\|^2 + \|z\|^2 - 2\,\|x\|\|z\|\cos\theta$, where $x \cdot z = \|x\|\|z\|\cos\theta$
Side note on kernels
• Replace the traditional dot product with the “generalized inner product” and get: $\|\phi(x) - \phi(z)\|^2 = K(x, x) + K(z, z) - 2K(x, z)$
• Kernel (essentially) represents:
  • The angle between vectors in the projected, high-dimensional space
• Alternatively:
  • A nonlinear distribution of angles in the low-dim space
Example of kernel nonlinearity
[Figures: how the kernel reshapes distances and angles compared to the plain dot product]
Using the classifier
• Solution of the QP gives back a set of $\alpha_i$
• Data points for which $\alpha_i > 0$ are called “support vectors”
• Turns out that we can write w as $w = \sum_i \alpha_i t_i\,\phi(x_i)$
Using the classifier
• And our classification rule for query pt x was: $y(x) = \mathrm{sign}\big(w \cdot \phi(x) + b\big)$
• So: $y(x) = \mathrm{sign}\big(\sum_i \alpha_i t_i\,K(x_i, x) + b\big)$ (sketched below)
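A sketch of that rule in code, keeping only the support vectors (helper names are ours):

```python
import numpy as np

def svm_predict(alpha, t_sv, X_sv, b, K, x):
    # y(x) = sign( sum_i alpha_i t_i K(x_i, x) + b ), over support vectors only
    s = sum(a * ti * K(xi, x) for a, ti, xi in zip(alpha, t_sv, X_sv))
    return np.sign(s + b)
```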
Using the classifier
[Figure: decision surface with the support vectors highlighted; SVM images from lecture notes by S. Dreiseitl]
Putting it all together
1. Start with the original (low-dimensional) data matrix.
2. Apply the kernel function to all pairs of points to build the kernel matrix.
3. Combine the kernel matrix with the original labels to form the Quadratic Program instance (the maximize / subject-to system above).
4. Hand the QP instance to a QP solver subroutine; it returns the support vector weights $\alpha_i$.
5. The support vector weights define a hyperplane in the projected space $\mathbb{R}^{d'}$.
6. Via the kernel, that hyperplane is the final classifier: a nonlinear classifier in the original space.
(The whole pipeline is sketched in code below.)
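The pipeline fits in a short sketch, again leaning on cvxopt as the QP-solver black box (the 1e-6 tolerance and helper names are our choices):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_train(X, t, K, C=1.0):
    # Steps 1-4: kernel matrix -> QP instance -> QP solver -> SV weights
    n = X.shape[0]
    Kmat = np.array([[K(xi, xj) for xj in X] for xi in X])
    P = matrix(np.outer(t, t) * Kmat)                 # quadratic term of the dual
    q = matrix(-np.ones(n))                           # i.e., maximize sum(alpha)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(t.reshape(1, n).astype(float))         # sum_i alpha_i t_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    sv = alpha > 1e-6                                 # the support vectors
    # Steps 5-6: recover the bias from a margin support vector, keep only SVs
    i = np.flatnonzero(sv & (alpha < C - 1e-6))[0]
    bias = t[i] - np.sum(alpha[sv] * t[sv] * Kmat[sv, i])
    return alpha[sv], t[sv], X[sv], bias
```

Classifying a new point then uses only the returned support vectors, as in the `svm_predict` sketch above.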
Final notes on SVMs
• Note that only the $x_i$ for which $\alpha_i > 0$ actually contribute to the final classifier
  • This is why they are called support vectors
  • All the rest of the training data can be discarded
Final notes on SVMs
• Complexity of training (& ability to generalize) is based only on the amount of training data
  • Not based on the dimension of the hyperplane space ($\mathbb{R}^{d'}$)
• Good classification performance
  • In practice, SVMs are among the strongest classifiers we have
• Closely related to neural nets, boosting, etc.