Kernelized Perceptron & Support Vector Machines
CSE 446: Machine Learning
Emily Fox, University of Washington
February 13, 2017
©2017 Emily Fox
What is the perceptron optimizing?
The perceptron algorithm [Rosenblatt ‘58, ‘62]
• Classification setting: y in {-1,+1}
• Linear model
  - Prediction: ŷ = sign(w · x)
• Training:
  - Initialize weight vector: w(0) = 0
  - At each time step t:
    • Observe features: xt
    • Make prediction: ŷt = sign(w(t) · xt)
    • Observe true class: yt
    • Update model, only if the prediction is not equal to the truth (i.e., yt (w(t) · xt) ≤ 0):
      w(t+1) ← w(t) + yt xt
  (a minimal sketch of this loop is given below)
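As a concrete illustration of the update rule above, here is a minimal NumPy sketch of the mistake-driven training loop. The function names, the epoch count, and the data arrays X, y are my own placeholders, not from the slides; labels are assumed to be in {-1,+1}.

```python
import numpy as np

def perceptron_train(X, y, n_epochs=10):
    """Mistake-driven perceptron: update w only when sign(w.x) disagrees with y."""
    n, d = X.shape
    w = np.zeros(d)                           # initialize weight vector w(0) = 0
    for _ in range(n_epochs):
        for t in range(n):
            if y[t] * np.dot(w, X[t]) <= 0:   # mistake (or zero margin)
                w = w + y[t] * X[t]           # w(t+1) = w(t) + yt * xt
    return w

def perceptron_predict(w, X):
    return np.sign(X @ w)
```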
Perceptron prediction: Margin of confidence
• The magnitude of w · x acts as a margin of confidence: predictions with larger |w · x| lie further from the decision boundary.
Hinge loss
• Perceptron prediction: ŷ = sign(w · x)
• Makes a mistake when: yt (w · xt) ≤ 0
• Hinge loss (closely tied to the margin maximization used by SVMs): ℓt(w) = max(0, -yt (w · xt)), written out below
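For reference, the two standard forms of the hinge loss can be written side by side: the perceptron variant used on this slide and the margin-based variant that appears later for SVMs. This is the usual textbook notation, not copied verbatim from the slide.

```latex
\ell^{\text{perc}}_t(\mathbf{w}) = \max\bigl(0,\; -y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr),
\qquad
\ell^{\text{svm}}_t(\mathbf{w}) = \max\bigl(0,\; 1 - y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr).
```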
Minimizing hinge loss in batch setting
• Given a dataset: {(xt, yt)}, t = 1, …, N
• Minimize average hinge loss: min over w of (1/N) Σt max(0, -yt (w · xt))
• How do we compute the gradient? The hinge has a kink at yt (w · xt) = 0, so it is not differentiable everywhere; this motivates subgradients.
Subgradients of convex functions
• Gradients lower bound convex functions: f(y) ≥ f(x) + ∇f(x) · (y - x) for all y
• Gradients are unique at x if the function is differentiable at x
• Subgradients generalize gradients to non-differentiable points:
  - Any plane that lower bounds the function: g is a subgradient of f at x if f(y) ≥ f(x) + g · (y - x) for all y
Subgradient of hinge
• Hinge loss: ℓt(w) = max(0, -yt (w · xt))
• Subgradient of hinge loss:
  - If yt (w · xt) > 0: 0
  - If yt (w · xt) < 0: -yt xt
  - If yt (w · xt) = 0: anything on the segment between 0 and -yt xt
  - In one line: -1[yt (w · xt) ≤ 0] yt xt, where 1[·] is the indicator function (a code sketch follows below)
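A small NumPy sketch of this case analysis; the function name and the tie-breaking choice at the kink (returning -yt xt) are my own, consistent with the "one line" rule above.

```python
import numpy as np

def hinge_subgradient(w, x_t, y_t):
    """Subgradient of max(0, -y_t * (w . x_t)) with respect to w."""
    if y_t * np.dot(w, x_t) > 0:
        return np.zeros_like(w)      # loss is 0 and flat here
    # at the kink (== 0) any point on the segment works; we pick -y_t * x_t
    return -y_t * x_t
```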
Subgradient descent for hinge minimization
• Given data: {(xt, yt)}, t = 1, …, N
• Want to minimize: (1/N) Σt max(0, -yt (w · xt))
• Subgradient descent works the same as gradient descent:
  - But if there are multiple subgradients at a point, just pick (any) one: w(k+1) ← w(k) - η g(k), with g(k) any subgradient of the objective at w(k) (see the sketch below)
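A minimal batch subgradient-descent sketch for the average hinge loss above; the step size eta and iteration count are illustrative choices, not values from the lecture.

```python
import numpy as np

def average_hinge_subgradient_descent(X, y, eta=0.1, n_iters=100):
    """Subgradient descent on (1/N) * sum_t max(0, -y_t * (w . x_t))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # a subgradient of the average hinge: -(1/N) * sum over violated points of y_t * x_t
        violated = (y * (X @ w)) <= 0
        g = -(X[violated] * y[violated, None]).sum(axis=0) / n
        w = w - eta * g                      # step along a negative subgradient
    return w
```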
Perceptron revisited
• Perceptron update: w(t+1) ← w(t) + yt xt, applied only when a mistake is made on the single point xt
• Batch hinge minimization update: w(k+1) ← w(k) + η (1/N) Σt 1[yt (w(k) · xt) ≤ 0] yt xt
• Difference? The perceptron is essentially stochastic (one point at a time) subgradient descent on the hinge loss with step size 1, while the batch update averages the subgradient over all points.
The kernelized perceptron
What if the data are not linearly separable?
Use features of features of features of features….
Feature space can get really large really quickly!
Higher order polynomials
[Figure: number of monomial terms vs. number of input dimensions, for polynomial degrees p = 2, 3, 4]
• d – input dimension, p – degree of polynomial
• The number of monomial terms grows fast! For p = 6 and d = 100: about 1.6 billion terms (a quick check is sketched below).
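As a sanity check on that count (my own calculation, not from the slides): the number of monomials of degree exactly p in d variables is C(d + p - 1, p), which the Python standard library computes directly.

```python
from math import comb

d, p = 100, 6
print(comb(d + p - 1, p))   # 1609344100 -> about 1.6 billion monomials of degree exactly 6
```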
Perceptron revisited
• Given weight vector w(t), predict point x by: ŷ = sign(w(t) · x)
• Mistake at time t: w(t+1) ← w(t) + yt xt
• Thus, write the weight vector in terms of the mistaken data points only:
  - Let M(t) be the set of time steps up to t when mistakes were made: w(t) = Σ_{j in M(t)} yj xj
• Prediction rule now: ŷ = sign( Σ_{j in M(t)} yj (xj · x) )
• When using high-dimensional features φ(x): the prediction only needs dot products φ(xj) · φ(x), never the feature vectors themselves.
Dot-product of polynomials
• For feature maps built from monomials of degree exactly p, the dot product of the features collapses to a power of the original dot product: φ(x) · φ(x') = (x · x')^p (worked out for p = 2 below).
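A worked instance of this reduction for d = 2 inputs and degree p = 2, with the usual √2 scaling on the cross term (standard derivation; the notation is mine):

```latex
\phi(\mathbf{x}) = \bigl(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\bigr)
\;\Longrightarrow\;
\phi(\mathbf{x})\cdot\phi(\mathbf{x}')
= x_1^2 x_1'^2 + 2\,x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2
= (x_1 x_1' + x_2 x_2')^2 = (\mathbf{x}\cdot\mathbf{x}')^2 .
```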
Finally, the kernel trick! Kernelized perceptron
• Every time you make a mistake, remember (xt, yt)
• Kernelized perceptron prediction for x: ŷ = sign( Σ over mistakes (xj, yj) of yj k(xj, x) ), where k(xj, x) = φ(xj) · φ(x) (sketch below)
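A compact sketch of this mistake-driven kernelized perceptron; the class name and the default polynomial kernel degree are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def poly_kernel(x, z, p=2):
    """All monomials of degree exactly p: k(x, z) = (x . z)^p."""
    return np.dot(x, z) ** p

class KernelizedPerceptron:
    def __init__(self, kernel=poly_kernel):
        self.kernel = kernel
        self.mistakes = []                   # list of (x_j, y_j) where we erred

    def predict(self, x):
        s = sum(y_j * self.kernel(x_j, x) for x_j, y_j in self.mistakes)
        return 1 if s > 0 else -1

    def fit(self, X, y, n_epochs=5):
        for _ in range(n_epochs):
            for x_t, y_t in zip(X, y):
                if self.predict(x_t) != y_t:     # mistake: remember the point
                    self.mistakes.append((x_t, y_t))
        return self
```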
Polynomial kernels
• All monomials of degree exactly p in O(d) operations: k(x, x') = (x · x')^p
• How about all monomials of degree up to p?
  - Solution 0: sum the degree-exactly-j kernels, Σ_{j=0}^{p} (x · x')^j
  - Better solution: a single dot product and power, k(x, x') = (x · x' + 1)^p
Common kernels
• Polynomials of degree exactly p: k(x, x') = (x · x')^p
• Polynomials of degree up to p: k(x, x') = (x · x' + 1)^p
• Gaussian (squared exponential) kernel: k(x, x') = exp( -||x - x'||² / (2σ²) )
• Sigmoid: k(x, x') = tanh( γ (x · x') + r )
(all four sketched as code below)
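The same four kernels as plain NumPy functions; the parameter names (p, sigma, gamma, r) and their default values are illustrative, not from the lecture.

```python
import numpy as np

def poly_exact(x, z, p=3):
    return np.dot(x, z) ** p                       # monomials of degree exactly p

def poly_up_to(x, z, p=3):
    return (np.dot(x, z) + 1.0) ** p               # monomials of degree up to p

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)
```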
What you need to know
• Linear separability in higher-dim feature space
• The kernel trick
• Kernelized perceptron
• Derive polynomial kernel
• Common kernels
Support vector machines (SVMs)
Linear classifiers—Which line is better?
Pick the one with the largest margin!
Maximize the margin
But there are many planes…
Review: Normal to a plane
A convention: Normalized margin (canonical hyperplanes)
[Figure: points x⁻ and x⁺ on the two canonical hyperplanes]
Margin maximization using canonical hyperplanes
• Unnormalized problem: maximize the margin directly, i.e., the smallest distance yi (w · xi + b) / ||w|| over the data
• Normalized problem: rescale so that min over i of yi (w · xi + b) = 1; maximizing the margin is then equivalent to minimizing ||w||² (both written out below)
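Written out in the usual notation (the standard SVM derivation; the exact symbols are mine), the canonical-hyperplane convention turns margin maximization into a quadratic program:

```latex
\text{Unnormalized:}\quad
\max_{\mathbf{w}, b}\; \min_i \frac{y_i(\mathbf{w}\cdot\mathbf{x}_i + b)}{\lVert\mathbf{w}\rVert}
\qquad\Longleftrightarrow\qquad
\text{Normalized:}\quad
\min_{\mathbf{w}, b}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\;\;\text{s.t.}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\;\forall i .
```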
Support vector machines (SVMs)
• Solve efficiently by many methods, e.g.,
  - Quadratic programming (QP): well-studied solution algorithms
  - Stochastic gradient descent
• The resulting hyperplane is defined by the support vectors
What if data are not linearly separable?
Use features of features of features of features….
What if data are still not linearly separable?
• If data are not linearly separable, some points don't satisfy the margin constraint yi (w · xi + b) ≥ 1
• How bad is the violation? Measure it with a slack ξi ≥ 0 such that yi (w · xi + b) ≥ 1 - ξi
• Tradeoff margin violation with ||w||: minimize ||w||² plus a penalty on the total slack (written out below)
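In full, this tradeoff leads to the standard soft-margin (slack-variable) formulation, which can equivalently be written as regularized hinge loss; the tradeoff parameter C (or λ) is the usual one, not a value given in the lecture.

```latex
\min_{\mathbf{w}, b, \boldsymbol{\xi}}\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \xi_i
\;\;\text{s.t.}\;\; y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
\qquad\Longleftrightarrow\qquad
\min_{\mathbf{w}, b}\; \sum_i \max\bigl(0,\, 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr) + \lambda\lVert\mathbf{w}\rVert^2 .
```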
SVMs for non-linearly separable data, meet my friend the Perceptron…
• The perceptron was minimizing the hinge loss: Σt max(0, -yt (w · xt))
• SVMs minimize the regularized hinge loss!! Σt max(0, 1 - yt (w · xt)) + λ ||w||²
Stochastic gradient descent for SVMs
• Perceptron minimization: Σt max(0, -yt (w · xt))
• SGD for the perceptron: pick a point t; if yt (w(t) · xt) ≤ 0, update w(t+1) ← w(t) + yt xt
• SVM minimization: Σt max(0, 1 - yt (w · xt)) + λ ||w||²
• SGD for SVMs: pick a point t and step along a negative subgradient of its regularized hinge term (a Pegasos-style sketch follows below)
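A minimal Pegasos-style SGD sketch for the regularized hinge objective; the learning-rate schedule, lambda value, and function name are illustrative assumptions rather than the lecture's exact choices.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, n_epochs=10):
    """SGD on (1/N) * sum_t max(0, 1 - y_t * (w . x_t)) + (lam/2) * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    step = 0
    for _ in range(n_epochs):
        for t in np.random.permutation(n):
            step += 1
            eta = 1.0 / (lam * step)                 # decaying step size (Pegasos-style)
            grad = lam * w                           # gradient of the L2 regularizer
            if y[t] * np.dot(w, X[t]) < 1:           # hinge is active: add its subgradient
                grad -= y[t] * X[t]
            w = w - eta * grad
    return w
```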
Mixture model example
From Hastie, Tibshirani, Friedman book
Mixture model example – kernels
From Hastie, Tibshirani, Friedman book
What you need to know…
• Maximizing margin
• Derivation of SVM formulation
• Non-linearly separable case
  - Hinge loss
  - a.k.a. adding slack variables
• SVMs = Perceptron + L2 regularization
• Can optimize SVMs with SGD
  - Many other approaches possible