Kernelized Perceptron & Support Vector Machines
CSE 446: Machine Learning
Emily Fox, University of Washington
February 13, 2017
©2017 Emily Fox
What is the perceptron optimizing?
The perceptron algorithm [Rosenblatt ‘58, ‘62]
• Classification setting: y in {-1,+1}
• Linear model
  - Prediction: ŷ = sign(w · x)
• Training:
  - Initialize weight vector: w(0) = 0
  - At each time step t:
    • Observe features: xt
    • Make prediction: ŷt = sign(w(t) · xt)
    • Observe true class: yt
    • Update model, only if the prediction is not equal to the truth (i.e., yt (w(t) · xt) ≤ 0):
      w(t+1) ← w(t) + yt xt
  (a minimal sketch of this loop is given below)
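As a concrete illustration of the update rule above, here is a minimal NumPy sketch of the mistake-driven training loop. The function names, the epoch count, and the data arrays X, y are my own placeholders, not from the slides; labels are assumed to be in {-1,+1}.

```python
import numpy as np

def perceptron_train(X, y, n_epochs=10):
    """Mistake-driven perceptron: update w only when sign(w.x) disagrees with y."""
    n, d = X.shape
    w = np.zeros(d)                           # initialize weight vector w(0) = 0
    for _ in range(n_epochs):
        for t in range(n):
            if y[t] * np.dot(w, X[t]) <= 0:   # mistake (or zero margin)
                w = w + y[t] * X[t]           # w(t+1) = w(t) + yt * xt
    return w

def perceptron_predict(w, X):
    return np.sign(X @ w)
```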
Perceptron prediction: Margin of confidence
• The magnitude of w · x acts as a margin of confidence: predictions with larger |w · x| lie further from the decision boundary.
Hinge loss
• Perceptron prediction: ŷ = sign(w · x)
• Makes a mistake when: yt (w · xt) ≤ 0
• Hinge loss (closely tied to the margin maximization used by SVMs): ℓt(w) = max(0, -yt (w · xt)), written out below
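For reference, the two standard forms of the hinge loss can be written side by side: the perceptron variant used on this slide and the margin-based variant that appears later for SVMs. This is the usual textbook notation, not copied verbatim from the slide.

```latex
\ell^{\text{perc}}_t(\mathbf{w}) = \max\bigl(0,\; -y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr),
\qquad
\ell^{\text{svm}}_t(\mathbf{w}) = \max\bigl(0,\; 1 - y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr).
```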
Minimizing hinge loss in batch setting
• Given a dataset: {(xt, yt)}, t = 1, …, N
• Minimize average hinge loss: min over w of (1/N) Σt max(0, -yt (w · xt))
• How do we compute the gradient? The hinge has a kink at yt (w · xt) = 0, so it is not differentiable everywhere; this motivates subgradients.
Subgradients of convex functions
• Gradients lower bound convex functions: f(y) ≥ f(x) + ∇f(x) · (y - x) for all y
• Gradients are unique at x if the function is differentiable at x
• Subgradients generalize gradients to non-differentiable points:
  - Any plane that lower bounds the function: g is a subgradient of f at x if f(y) ≥ f(x) + g · (y - x) for all y
Subgradient of hinge
• Hinge loss: ℓt(w) = max(0, -yt (w · xt))
• Subgradient of hinge loss:
  - If yt (w · xt) > 0: 0
  - If yt (w · xt) < 0: -yt xt
  - If yt (w · xt) = 0: anything on the segment between 0 and -yt xt
  - In one line: -1[yt (w · xt) ≤ 0] yt xt, where 1[·] is the indicator function (a code sketch follows below)
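A small NumPy sketch of this case analysis; the function name and the tie-breaking choice at the kink (returning -yt xt) are my own, consistent with the "one line" rule above.

```python
import numpy as np

def hinge_subgradient(w, x_t, y_t):
    """Subgradient of max(0, -y_t * (w . x_t)) with respect to w."""
    if y_t * np.dot(w, x_t) > 0:
        return np.zeros_like(w)      # loss is 0 and flat here
    # at the kink (== 0) any point on the segment works; we pick -y_t * x_t
    return -y_t * x_t
```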
Subgradient descent for hinge minimization
• Given data: {(xt, yt)}, t = 1, …, N
• Want to minimize: (1/N) Σt max(0, -yt (w · xt))
• Subgradient descent works the same as gradient descent:
  - But if there are multiple subgradients at a point, just pick (any) one: w(k+1) ← w(k) - η g(k), with g(k) any subgradient of the objective at w(k) (see the sketch below)
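A minimal batch subgradient-descent sketch for the average hinge loss above; the step size eta and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

def average_hinge_subgradient_descent(X, y, eta=0.1, n_iters=100):
    """Subgradient descent on (1/N) * sum_t max(0, -y_t * (w . x_t))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # a subgradient of the average hinge: -(1/N) * sum over violated points of y_t * x_t
        violated = (y * (X @ w)) <= 0
        g = -(X[violated] * y[violated, None]).sum(axis=0) / n
        w = w - eta * g                      # step along a negative subgradient
    return w
```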
Perceptron revisited
• Perceptron update: w(t+1) ← w(t) + yt xt, applied only when a mistake is made on the single point xt
• Batch hinge minimization update: w(k+1) ← w(k) + η (1/N) Σt 1[yt (w(k) · xt) ≤ 0] yt xt
• Difference? The perceptron is essentially stochastic (one point at a time) subgradient descent on the hinge loss with step size 1, while the batch update averages the subgradient over all points.
The kernelized perceptron
What if the data are not linearly separable?
Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for polynomial degrees p = 2, 3, 4]
• d – input dimension, p – degree of polynomial
• The number of monomial terms grows fast! For p = 6 and d = 100: about 1.6 billion terms (a quick check is sketched below).
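As a sanity check on that count (my own calculation, not from the slides): the number of monomials of degree exactly p in d variables is C(d + p - 1, p), which the Python standard library computes directly.

```python
from math import comb

d, p = 100, 6
print(comb(d + p - 1, p))   # 1609344100 -> about 1.6 billion monomials of degree exactly 6
```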
Perceptron revisited
• Given weight vector w(t), predict point x by: ŷ = sign(w(t) · x)
• Mistake at time t: w(t+1) ← w(t) + yt xt
• Thus, write the weight vector in terms of the mistaken data points only:
  - Let M(t) be the set of time steps up to t when mistakes were made: w(t) = Σ_{j in M(t)} yj xj
• Prediction rule now: ŷ = sign( Σ_{j in M(t)} yj (xj · x) )
• When using high-dimensional features φ(x): the prediction only needs dot products φ(xj) · φ(x), never the feature vectors themselves.
Dot-product of polynomials
• For feature maps built from monomials of degree exactly p, the dot product of the features collapses to a power of the original dot product: φ(x) · φ(x') = (x · x')^p (worked out for p = 2 below).
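A worked instance of this reduction for d = 2 inputs and degree p = 2, with the usual √2 scaling on the cross term (standard derivation; the notation is mine):

```latex
\phi(\mathbf{x}) = \bigl(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\bigr)
\;\Longrightarrow\;
\phi(\mathbf{x})\cdot\phi(\mathbf{x}')
= x_1^2 x_1'^2 + 2\,x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2
= (x_1 x_1' + x_2 x_2')^2 = (\mathbf{x}\cdot\mathbf{x}')^2 .
```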
Finally, the kernel trick! Kernelized perceptron
• Every time you make a mistake, remember (xt, yt)
• Kernelized perceptron prediction for x: ŷ = sign( Σ over mistakes (xj, yj) of yj k(xj, x) ), where k(xj, x) = φ(xj) · φ(x) (sketch below)
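A compact sketch of this mistake-driven kernelized perceptron; the class name and the default polynomial kernel degree are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def poly_kernel(x, z, p=2):
    """All monomials of degree exactly p: k(x, z) = (x . z)^p."""
    return np.dot(x, z) ** p

class KernelizedPerceptron:
    def __init__(self, kernel=poly_kernel):
        self.kernel = kernel
        self.mistakes = []                   # list of (x_j, y_j) where we erred

    def predict(self, x):
        s = sum(y_j * self.kernel(x_j, x) for x_j, y_j in self.mistakes)
        return 1 if s > 0 else -1

    def fit(self, X, y, n_epochs=5):
        for _ in range(n_epochs):
            for x_t, y_t in zip(X, y):
                if self.predict(x_t) != y_t:     # mistake: remember the point
                    self.mistakes.append((x_t, y_t))
        return self
```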
Polynomial kernels
• All monomials of degree exactly p in O(d) operations: k(x, x') = (x · x')^p
• How about all monomials of degree up to p?
  - Solution 0: sum the degree-exactly-j kernels, Σ_{j=0}^{p} (x · x')^j
  - Better solution: a single dot product and power, k(x, x') = (x · x' + 1)^p
Common kernels
• Polynomials of degree exactly p: k(x, x') = (x · x')^p
• Polynomials of degree up to p: k(x, x') = (x · x' + 1)^p
• Gaussian (squared exponential) kernel: k(x, x') = exp( -||x - x'||² / (2σ²) )
• Sigmoid: k(x, x') = tanh( γ (x · x') + r )
(all four sketched as code below)
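The same four kernels as plain NumPy functions; the parameter names (p, sigma, gamma, r) and their default values are illustrative, not from the lecture.

```python
import numpy as np

def poly_exact(x, z, p=3):
    return np.dot(x, z) ** p                       # monomials of degree exactly p

def poly_up_to(x, z, p=3):
    return (np.dot(x, z) + 1.0) ** p               # monomials of degree up to p

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)
```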
What you need to know
• Linear separability in higher-dim feature space
• The kernel trick
• Kernelized perceptron
• Derive polynomial kernel
• Common kernels
Support vector machines (SVMs)
Linear classifiers—Which line is better?
Pick the one with the largest margin!
Maximize the margin
But there are many planes…
Review: Normal to a plane
A convention: Normalized margin (canonical hyperplanes)
[Figure: points x⁻ and x⁺ on the two canonical hyperplanes]
Margin maximization using canonical hyperplanes
• Unnormalized problem: maximize the margin directly, i.e., the smallest distance yi (w · xi + b) / ||w|| over the data
• Normalized problem: rescale so that min over i of yi (w · xi + b) = 1; maximizing the margin is then equivalent to minimizing ||w||² (both written out below)
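Written out in the usual notation (the standard SVM derivation; the exact symbols are mine), the canonical-hyperplane convention turns margin maximization into a quadratic program:

```latex
\text{Unnormalized:}\quad
\max_{\mathbf{w}, b}\; \min_i \frac{y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}
\qquad\Longleftrightarrow\qquad
\text{Normalized:}\quad
\min_{\mathbf{w}, b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\;\;\text{s.t.}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\;\forall i .
```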
Support vector machines (SVMs)
• Solve efficiently by many methods, e.g.,
  - Quadratic programming (QP): well-studied solution algorithms
  - Stochastic gradient descent
• The resulting hyperplane is defined by the support vectors
What if data are not linearly separable?
Use features of features of features of features….
What if data are still not linearly separable?
• If data are not linearly separable, some points don't satisfy the margin constraint yi (w · xi + b) ≥ 1
• How bad is the violation? Measure it with a slack ξi ≥ 0 such that yi (w · xi + b) ≥ 1 - ξi
• Tradeoff margin violation with ||w||: minimize ||w||² plus a penalty on the total slack (written out below)
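In full, this tradeoff leads to the standard soft-margin (slack-variable) formulation, which can equivalently be written as regularized hinge loss; the tradeoff parameter C (or λ) is the usual one, not a value given in the lecture.

```latex
\min_{\mathbf{w}, b, \boldsymbol{\xi}}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \xi_i
\;\;\text{s.t.}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
\qquad\Longleftrightarrow\qquad
\min_{\mathbf{w}, b}\; \sum_i \max\bigl(0,\, 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr) + \lambda\lVert\mathbf{w}\rVert^2 .
```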
SVMs for non-linearly separable data, meet my friend the Perceptron…
• The perceptron was minimizing the hinge loss: Σt max(0, -yt (w · xt))
• SVMs minimize the regularized hinge loss!! Σt max(0, 1 - yt (w · xt)) + λ ||w||²
Stochastic gradient descent for SVMs
• Perceptron minimization: Σt max(0, -yt (w · xt))
• SGD for the perceptron: pick a point t; if yt (w(t) · xt) ≤ 0, update w(t+1) ← w(t) + yt xt
• SVM minimization: Σt max(0, 1 - yt (w · xt)) + λ ||w||²
• SGD for SVMs: pick a point t and step along a negative subgradient of its regularized hinge term (a Pegasos-style sketch follows below)
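A minimal Pegasos-style SGD sketch for the regularized hinge objective; the learning-rate schedule, lambda value, and function name are illustrative assumptions rather than the lecture's exact choices.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, n_epochs=10):
    """SGD on (1/N) * sum_t max(0, 1 - y_t * (w . x_t)) + (lam/2) * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    step = 0
    for _ in range(n_epochs):
        for t in np.random.permutation(n):
            step += 1
            eta = 1.0 / (lam * step)                 # decaying step size (Pegasos-style)
            grad = lam * w                           # gradient of the L2 regularizer
            if y[t] * np.dot(w, X[t]) < 1:           # hinge is active: add its subgradient
                grad -= y[t] * X[t]
            w = w - eta * grad
    return w
```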
Mixture model example
From Hastie, Tibshirani, Friedman book
Mixture model example – kernels
From Hastie, Tibshirani, Friedman book
What you need to know…
• Maximizing margin
• Derivation of SVM formulation
• Non-linearly separable case
  - Hinge loss
  - a.k.a. adding slack variables
• SVMs = Perceptron + L2 regularization
• Can optimize SVMs with SGD
  - Many other approaches possible