

Kernelized Perceptron & Support Vector Machines

CSE 446: Machine Learning
Emily Fox, University of Washington
February 13, 2017
©2017 Emily Fox


What is the perceptron optimizing?


The perceptron algorithm [Rosenblatt ‘58, ‘62]

• Classification setting: y in {-1,+1}

• Linear model
- Prediction: ŷ_t = sign(w·x_t)

• Training:
- Initialize weight vector: w^(1) = 0
- At each time step t:
• Observe features: x_t
• Make prediction: ŷ_t = sign(w^(t)·x_t)
• Observe true class: y_t
• Update model: w^(t+1) ← w^(t) + y_t x_t
- Only if the prediction does not equal the truth (ŷ_t ≠ y_t); otherwise keep w^(t+1) = w^(t) (see the sketch below)
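As a concrete reference, here is a minimal sketch of this training loop in NumPy, following the mistake-driven update above; the function and variable names are illustrative choices, not from the slides:

```python
import numpy as np

def perceptron_train(X, y, n_passes=10):
    """Train a perceptron on features X (N x d) and labels y in {-1, +1}.

    Follows the slide's recipe: predict with sign(w . x_t) and, on a
    mistake, update w <- w + y_t * x_t.
    """
    N, d = X.shape
    w = np.zeros(d)                                   # initialize weight vector
    for _ in range(n_passes):                         # several passes over the data
        for t in range(N):
            y_hat = 1 if np.dot(w, X[t]) >= 0 else -1 # make prediction
            if y_hat != y[t]:                         # update only on mistakes
                w = w + y[t] * X[t]
    return w
```

Usage would be, for example, `w = perceptron_train(X, y)` followed by predictions via `np.sign(X_new @ w)`.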


Perceptron prediction: Margin of confidence

• The sign of w·x gives the predicted class, while the magnitude |w·x| acts as a margin of confidence: points far from the decision boundary get large scores, points near it get scores close to zero.


Hinge loss

• Perceptron prediction: ŷ_t = sign(w·x_t)

• Makes a mistake when: y_t (w·x_t) ≤ 0

• Hinge loss (closely related to the margin maximization used by SVMs; written out below)
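Written out (my reconstruction of the blank left on the slide), the perceptron's per-example hinge loss is

```latex
\ell_t(\mathbf{w}) \;=\; \max\bigl(0,\; -\,y_t\,(\mathbf{w}\cdot\mathbf{x}_t)\bigr),
```

which is zero when the prediction is correct and grows linearly with the violation otherwise. The SVM version later in the lecture shifts the hinge point from 0 to 1, i.e. $\max(0,\ 1 - y_t(\mathbf{w}\cdot\mathbf{x}_t))$.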


Minimizing hinge loss in batch setting

• Given a dataset: {(x_i, y_i)}, i = 1, …, N

• Minimize the average hinge loss over the N examples (written out below)

• How do we compute the gradient, given that the hinge has a kink?
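Filling in the blank with the standard form, the batch objective is the average of the per-example hinge losses:

```latex
L(\mathbf{w}) \;=\; \frac{1}{N}\sum_{i=1}^{N} \max\bigl(0,\; -\,y_i\,(\mathbf{w}\cdot\mathbf{x}_i)\bigr).
```

The max makes L non-differentiable wherever some $y_i(\mathbf{w}\cdot\mathbf{x}_i) = 0$, which is why the next slide introduces subgradients.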


Subgradients of convex functions

• Gradients lower-bound convex functions: the tangent plane at any point lies below the function everywhere

• Gradients are unique at x if the function is differentiable at x

• Subgradients generalize gradients to non-differentiable points:
- Any plane that lower-bounds the function (formalized below)
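Concretely, a vector g is a subgradient of a convex function f at x if the corresponding plane lower-bounds f everywhere:

```latex
f(\mathbf{y}) \;\ge\; f(\mathbf{x}) + \mathbf{g}\cdot(\mathbf{y}-\mathbf{x}) \quad \text{for all } \mathbf{y}.
```

At differentiable points the only such g is the gradient; at a kink there can be many.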


Subgradient of hinge

• Hinge loss:

• Subgradient of hinge loss (all four cases are written out below):
- If y_t (w·x_t) > 0:

- If y_t (w·x_t) < 0:

- If y_t (w·x_t) = 0:

- In one line:
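A standard way to fill in these cases (the slide leaves them for in-class derivation) is:

```latex
\partial_{\mathbf{w}}\, \max\bigl(0,\ -y_t(\mathbf{w}\cdot\mathbf{x}_t)\bigr) \;\ni\;
\begin{cases}
\mathbf{0} & \text{if } y_t(\mathbf{w}\cdot\mathbf{x}_t) > 0,\\
-\,y_t\,\mathbf{x}_t & \text{if } y_t(\mathbf{w}\cdot\mathbf{x}_t) < 0,\\
-\,\alpha\, y_t\,\mathbf{x}_t,\ \ \alpha \in [0,1] & \text{if } y_t(\mathbf{w}\cdot\mathbf{x}_t) = 0,
\end{cases}
```

so in one line a valid choice is $-\,\mathbb{1}\bigl[y_t(\mathbf{w}\cdot\mathbf{x}_t) \le 0\bigr]\, y_t\,\mathbf{x}_t$.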


Subgradient descent for hinge minimization

• Given data: {(x_i, y_i)}, i = 1, …, N

• Want to minimize: the average hinge loss from the previous slide

• Subgradient descent works the same as gradient descent:
- But if there are multiple subgradients at a point, just pick (any) one (see the sketch below)
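A minimal sketch of batch subgradient descent on the average hinge loss above; the step size and names are illustrative choices, not from the slides:

```python
import numpy as np

def hinge_subgradient_descent(X, y, step_size=0.1, n_iters=100):
    """Minimize (1/N) * sum_i max(0, -y_i * (w . x_i)) by subgradient descent.

    At the kink y_i * (w . x_i) = 0 we simply pick the subgradient
    -y_i * x_i (any valid choice works).
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)                 # y_i * (w . x_i) for every example
        active = margins <= 0                 # points contributing a nonzero subgradient
        subgrad = -(X[active].T @ y[active]) / N
        w = w - step_size * subgrad           # standard descent step
    return w
```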


Perceptron revisited

• Perceptron update: on a mistake at time t, w^(t+1) ← w^(t) + y_t x_t (one point at a time)

• Batch hinge minimization update: w^(t+1) ← w^(t) + (η/N) Σ_i 𝟙[y_i (w^(t)·x_i) ≤ 0] y_i x_i (sums the subgradient over all N points)

• Difference? Up to the step size, the perceptron is stochastic subgradient descent on the hinge loss: it applies the same kind of update using a single point per step instead of the whole dataset.


The kernelized perceptron


What if the data are not linearly separable?


Use features of features of features of features….

Feature space can get really large really quickly!


Higher order polynomials


[Plot: number of monomial terms vs. number of input dimensions, shown for polynomial degrees p = 2, 3, 4]

d – input dimension; p – degree of polynomial

The number of monomial terms grows fast! For p = 6 and d = 100, there are about 1.6 billion terms (see the quick check below).
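To see how fast the feature space grows: the number of monomials of degree exactly p in d variables is the binomial coefficient C(d + p − 1, p). A quick check of the slide's example, as a small sketch using only the Python standard library:

```python
from math import comb

def num_monomials(d, p):
    """Number of monomials of degree exactly p in d input variables."""
    return comb(d + p - 1, p)

print(num_monomials(100, 6))   # 1,609,344,100 -- about 1.6 billion, matching the slide
```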


Perceptron revisited

• Given weight vector w^(t), predict point x by:

• Mistake at time t: w^(t+1) ← w^(t) + y_t x_t

• Thus, write the weight vector in terms of the mistaken data points only (written out below):
- Let M(t) be the set of time steps up to t at which mistakes were made:

• Prediction rule now:

• When using high-dimensional features:
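Spelling out the blanks: since every update adds $y_t\mathbf{x}_t$, the weight vector built from the mistakes in $M(t)$ and the resulting prediction rule are

```latex
\mathbf{w}^{(t)} \;=\; \sum_{i \in M(t)} y_i\, \mathbf{x}_i,
\qquad
\hat{y} \;=\; \mathrm{sign}\Bigl(\,\sum_{i \in M(t)} y_i\,(\mathbf{x}_i \cdot \mathbf{x})\Bigr).
```

With a high-dimensional feature map $\phi$, each term becomes $y_i\,(\phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}))$, and the only quantity ever needed is that dot product.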


Dot-product of polynomials


Consider feature maps consisting of all monomials of degree exactly p (a small worked example follows).
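As an example of the kind derived on this slide: for $d = 2$ and degree exactly $p = 2$, take $\phi(\mathbf{u}) = (u_1^2,\ \sqrt{2}\,u_1 u_2,\ u_2^2)$; then

```latex
\phi(\mathbf{u})\cdot\phi(\mathbf{v})
= u_1^2 v_1^2 + 2\,u_1 u_2\, v_1 v_2 + u_2^2 v_2^2
= (u_1 v_1 + u_2 v_2)^2
= (\mathbf{u}\cdot\mathbf{v})^2,
```

and in general, with suitably scaled monomials of degree exactly $p$, the dot product in feature space is just $(\mathbf{u}\cdot\mathbf{v})^p$.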


Finally, the kernel trick! Kernelized perceptron

• Every time you make a mistake, remember (x_t, y_t)

• Kernelized perceptron prediction for x:
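A minimal sketch of the kernelized perceptron described by these two bullets; the kernel choice and names are illustrative:

```python
import numpy as np

def poly_kernel(u, v, p=2):
    """Polynomial kernel of degree exactly p: k(u, v) = (u . v)^p."""
    return np.dot(u, v) ** p

def kernelized_perceptron(X, y, kernel=poly_kernel, n_passes=5):
    """Store the mistaken (x_t, y_t) pairs and predict with
    sign( sum_i y_i * k(x_i, x) )."""
    mistakes = []                                    # (x_i, y_i) pairs remembered on mistakes
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            score = sum(y_i * kernel(x_i, x_t) for x_i, y_i in mistakes)
            y_hat = 1 if score >= 0 else -1
            if y_hat != y_t:
                mistakes.append((x_t, y_t))          # remember the mistake
    return mistakes

def kernelized_predict(mistakes, x, kernel=poly_kernel):
    score = sum(y_i * kernel(x_i, x) for x_i, y_i in mistakes)
    return 1 if score >= 0 else -1
```

Note that the weight vector in feature space is never formed explicitly; only kernel evaluations are used.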


Polynomial kernels

• All monomials of degree exactly p in O(d) operations:

• How about all monomials of degree up to p?
- Solution 0:

- Better solution:



Common kernels

• Polynomials of degree exactly p

• Polynomials of degree up to p

• Gaussian (squared exponential) kernel

• Sigmoid
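Filling in the standard forms for these kernels (the parameter names are conventions, not taken from the slides):

```latex
\begin{aligned}
\text{degree exactly } p:&\quad k(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v})^{p}\\
\text{degree up to } p:&\quad k(\mathbf{u},\mathbf{v}) = (1 + \mathbf{u}\cdot\mathbf{v})^{p}\\
\text{Gaussian (squared exponential)}:&\quad k(\mathbf{u},\mathbf{v}) = \exp\!\Bigl(-\tfrac{\lVert \mathbf{u}-\mathbf{v}\rVert^{2}}{2\sigma^{2}}\Bigr)\\
\text{sigmoid}:&\quad k(\mathbf{u},\mathbf{v}) = \tanh(\gamma\,\mathbf{u}\cdot\mathbf{v} + r)
\end{aligned}
```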


What you need to know

• Linear separability in higher-dim feature space

• The kernel trick

• Kernelized perceptron

• Derive polynomial kernel

• Common kernels


Support vector machines (SVMs)


Linear classifiers: Which line is better?


Pick the one with the largest margin!


Maximize the margin


But there are many planes…


Review: Normal to a plane


A convention: Normalized margin (canonical hyperplanes)


[Figure: the closest points x+ and x− on either side of the separating hyperplane]


Margin maximization using canonical hyperplanes


Unnormalized problem:

Normalized problem (both written out below):
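In standard notation, the two problems on this slide take the following form (a reconstruction consistent with the rest of the lecture):

```latex
\text{Unnormalized:}\quad \max_{\mathbf{w},\,b,\,\gamma}\ \gamma
\quad\text{s.t.}\ \ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge \gamma\ \ \forall i,\ \ \lVert\mathbf{w}\rVert = 1
```

```latex
\text{Normalized (canonical):}\quad \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{s.t.}\ \ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1\ \ \forall i
```

Rescaling w and b so that the closest points satisfy the constraint with equality is what makes the two formulations equivalent.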


Support vector machines (SVMs)

• Solve efficiently by many methods, e.g.,
- quadratic programming (QP)
- stochastic gradient descent

• Well-studied solution algorithms

• Hyperplane defined by support vectors


What if data are not linearly separable?


Use features of features of features of features….


What if data are still not linearly separable?


• If data are not linearly separable, some points don’t satisfy margin constraint:

• How bad is the violation?

• Tradeoff margin violation with ||w||:
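Measuring each violation with a slack variable $\xi_i \ge 0$ and trading it off against $\lVert\mathbf{w}\rVert$ gives the usual soft-margin program (standard form, reconstructing the blanks on the slide):

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad\text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ \forall i.
```

The constant C controls how heavily margin violations are penalized relative to the margin width.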


SVMs for non-linearly separable data, meet my friend the Perceptron…

• Perceptron was minimizing the hinge loss:

• SVMs minimize the regularized hinge loss!! (written out below)
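Eliminating the slack variables turns the program above into an unconstrained objective, which makes the claimed connection explicit: hinge loss plus an L2 penalty (up to how the tradeoff constant is written):

```latex
\min_{\mathbf{w},\, b}\ \ \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0,\ 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr) \;+\; \lambda\,\lVert\mathbf{w}\rVert^{2}.
```

The perceptron's objective is the same hinge without the +1 offset and without the $\lambda\lVert\mathbf{w}\rVert^2$ term.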


Stochastic gradient descent for SVMs

• Perceptron minimization:

• SGD for Perceptron:

• SVMs minimization:

• SGD for SVMs:
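A minimal sketch of SGD applied to the regularized hinge objective above; the step-size schedule and names are my choices, not from the slides:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, step_size=0.1, n_iters=1000, seed=0):
    """Stochastic subgradient descent on
       (1/N) * sum_i max(0, 1 - y_i * (w . x_i)) + lam * ||w||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(N)                    # sample one example
        eta = step_size / np.sqrt(t)           # decaying step size
        grad = 2 * lam * w                     # subgradient of the L2 term
        if y[i] * np.dot(w, X[i]) < 1:         # hinge active: add -y_i * x_i
            grad = grad - y[i] * X[i]
        w = w - eta * grad
    return w
```

Dropping the regularizer and the +1 offset, and using a step size of 1, recovers the perceptron's mistake-driven update.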


Mixture model example


From Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning


Mixture model example – kernels


From Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning


What you need to know…

• Maximizing margin

• Derivation of SVM formulation

• Non-linearly separable case
- Hinge loss
- a.k.a. adding slack variables

• SVMs = Perceptron + L2 regularization

• Can optimize SVMs with SGD
- Many other approaches possible
