-
CSE 575: Statistical Machine Learning
Jingrui He, CIDSE, ASU
-
Instance-based Learning
-
3
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor (see the sketch below).
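As an illustration of these four choices (a minimal sketch, not code from the lecture), a 1-NN predictor with a Euclidean metric can be written as:

    import numpy as np

    def one_nn_predict(X_train, y_train, x_query):
        # Distance from the query to every stored training example (Euclidean metric)
        dists = np.linalg.norm(X_train - x_query, axis=1)
        # Predict the same output as the single nearest neighbor
        return y_train[np.argmin(dists)]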
-
4
Consistency of 1-NN
Consider an estimator f_n trained on n examples, e.g., 1-NN, regression, ...
The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, consistent if the error of f_n goes to zero as n goes to infinity.
Regression is not consistent! (Representation bias)
1-NN is consistent (under some mild fine print)
What about variance???
-
5
1-NN overfits?
-
6
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors (see the sketch below).
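Again as an illustrative sketch (not the lecture's code), the only change from 1-NN is averaging the outputs of the k closest points:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=9):
        # Unweighted k-NN with a Euclidean metric
        dists = np.linalg.norm(X_train - x_query, axis=1)
        nearest = np.argsort(dists)[:k]    # indices of the k closest training examples
        return y_train[nearest].mean()     # predict the plain average, as on the slide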
-
7
k-Nearest Neighbor (here k=9)
k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
-
8
Curse of dimensionality for instance-based learning
Must store and retrieve all data! Most real work is done during testing: for every test sample, we must search through the whole dataset, which is very slow! There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ... (see the sketch below)
Instance-based learning is often poor with noisy or irrelevant features
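As one concrete example of such a fast method (a sketch, not from the lecture), a k-d tree built once over the training set replaces the linear scan at test time; note that tree-based search itself degrades as the dimension grows, which is part of the curse of dimensionality:

    import numpy as np
    from scipy.spatial import cKDTree

    X_train = np.random.rand(100_000, 3)   # stored training inputs (low-dimensional here)
    tree = cKDTree(X_train)                # build the tree once, up front

    x_test = np.random.rand(3)
    dist, idx = tree.query(x_test, k=1)    # nearest neighbor without scanning every point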
-
Support Vector Machines
-
10
w.x = Σ_j w^(j) x^(j)
Linear classifiers: Which line is better?
Data: example i is (x_i, y_i), with label y_i in {-1, +1}
-
11
Pick the one with the largest margin!
w.x = Σ_j w^(j) x^(j)
w.x + b = 0
-
12
Maximize the margin
w.x + b = 0
-
13
But there are many such planes
w.x + b = 0
-
14
w.x + b = 0
Review: Normal to a plane
-
15
Normalized margin: canonical hyperplanes
[Figure: separating hyperplane w.x + b = 0 with canonical hyperplanes w.x + b = +1 and w.x + b = -1 through the closest points x+ and x-; the margin between them is 2γ]
-
16
Normalized margin: canonical hyperplanes
[Figure repeated: hyperplanes w.x + b = +1, 0, -1; closest points x+ and x-; margin 2γ]
-
17
Margin maximiza*on using canonical hyperplanes
[Figure: canonical hyperplanes w.x + b = +1, 0, -1; margin 2γ]
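With the canonical hyperplanes fixed at w.x + b = ±1, margin maximization is usually written as the following quadratic program (standard formulation, reconstructed here because the slide's equations did not survive extraction):

    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w}
    \quad \text{subject to} \quad
    y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n

Maximizing the margin 2γ = 2/||w|| is the same as minimizing w.w.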
-
18
Support vector machines (SVMs)
[Figure: canonical hyperplanes w.x + b = +1, 0, -1; margin 2γ]
Solve efficiently by quadratic programming (QP): well-studied solution algorithms
Hyperplane defined by the support vectors
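As a quick way to see the support vectors in practice (a scikit-learn sketch, not part of the lecture; the toy data is made up, and a large C approximates the hard-margin QP):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0., 0.], [1., 1.], [2., 2.5], [3., 3.]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # QP solved internally
    print(clf.support_vectors_)                  # the hyperplane is defined by these points
    print(clf.coef_, clf.intercept_)             # w and b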
-
19
What if the data is not linearly separable?
Use features of features of features of features.
-
20
What if the data is still not linearly separable?
Minimize w.w and the number of training mistakes. Trade off the two criteria?
Trade off #(mistakes) and w.w: 0/1 loss with slack penalty C. Not QP anymore! Also doesn't distinguish near misses from really bad mistakes
-
21
Slack variables: hinge loss
If margin ≥ 1, don't care. If margin < 1, pay a linear penalty.
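Written out (standard soft-margin SVM; the slide's own formulas were lost in extraction), the slack-variable program is

    \min_{\mathbf{w},\,b,\,\xi}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_i \xi_i
    \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

which is equivalent to minimizing \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_i \max\{0,\ 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\}. The max term is the hinge loss: zero when the margin is at least 1, linear in the violation otherwise.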
-
22
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss. Logistic regression: log loss.
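For reference (standard forms, not recovered from the slide), the two losses on an example (x, y) with y in {-1, +1} are

    \text{hinge (SVM):}\quad \max\{0,\ 1 - y\,(\mathbf{w}\cdot\mathbf{x} + b)\}
    \qquad
    \text{log loss (logistic regression):}\quad \log\bigl(1 + e^{-y\,(\mathbf{w}\cdot\mathbf{x} + b)}\bigr)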
-
23
Constrained optimization
-
24
Lagrange multipliers: dual variables
Moving the constraint into the objective function. Lagrangian:
Solve:
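In its general form (standard material, added because the slide's equations were lost): for min_x f(x) subject to g_i(x) ≤ 0, introduce a dual variable α_i ≥ 0 per constraint and move the constraints into the objective,

    L(x, \alpha) = f(x) + \sum_i \alpha_i\, g_i(x)

then solve \min_x \max_{\alpha \ge 0} L(x, \alpha); under suitable conditions this equals the dual \max_{\alpha \ge 0} \min_x L(x, \alpha).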
-
25
Lagrange multipliers: dual variables
Solving:
-
26
Dual SVM derivation (1): the linearly separable case
-
27
Dual SVM derivation (2): the linearly separable case
-
28
Dual SVM interpretation
w.x + b = 0
-
29
Dual SVM formulation: the linearly separable case
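The resulting dual for the separable case (standard formulation, reconstructed because the slide's math did not survive extraction):

    \max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)
    \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0

with w = Σ_i α_i y_i x_i; only the examples with α_i > 0 (the support vectors) contribute.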
-
30
Dual SVM derivation: the non-separable case
-
31
Dual SVM formulation: the non-separable case
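With slack penalty C the dual keeps exactly the same objective; the only change (standard result, stated here for completeness) is a box constraint on the multipliers:

    0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0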
-
32
Why did we learn about the dual SVM?
There are some quadratic programming algorithms that can solve the dual faster than the primal
But, more importantly, the kernel trick!!! Another little detour
-
33
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features.
Feature space can get really large really quickly!
-
34
Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4]
m = number of input features, d = degree of polynomial
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms
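That figure can be checked directly: the number of monomials of degree exactly d in m variables is C(m + d - 1, d) (a quick check, not from the slides):

    from math import comb

    m, d = 100, 6
    print(comb(m + d - 1, d))   # 1,609,344,100 -- about 1.6 billion degree-6 monomials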
-
35
Dual formulation only depends on dot products, not on w!
-
36
Dot-product of polynomials
-
37
Finally: the kernel trick!
Never represent features explicitly; compute dot products in closed form
Constant-time high-dimensional dot products for many classes of features
Very interesting theory: Reproducing Kernel Hilbert Spaces
-
38
Polynomial kernels
All monomials of degree d in O(d) operations:
How about all monomials of degree up to d? Solution 0:
Better solution:
-
39
Common kernels
Polynomials of degree d
Polynomials of degree up to d
Gaussian kernels
Sigmoid
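Written out (standard definitions; the slide's formulas were lost in extraction, and the sigmoid parameters below are generic placeholders):

    \text{polynomial of degree } d:\qquad K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v})^d
    \text{polynomial of degree up to } d:\qquad K(\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v} + 1)^d
    \text{Gaussian (RBF):}\qquad K(\mathbf{u},\mathbf{v}) = \exp\!\bigl(-\|\mathbf{u}-\mathbf{v}\|^2 / 2\sigma^2\bigr)
    \text{sigmoid:}\qquad K(\mathbf{u},\mathbf{v}) = \tanh(\eta\,\mathbf{u}\cdot\mathbf{v} + \nu)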
-
40
Overfitting?
Huge feature space with kernels: what about overfitting??? Maximizing the margin leads to a sparse set of support vectors
Some interesting theory says that SVMs search for a simple hypothesis with a large margin
Often robust to overfitting
-
41
What about at classification time? For a new input x, if we need to represent φ(x), we are in trouble!
Recall the classifier: sign(w.φ(x) + b). Using kernels we are cool!
-
42
SVMs with kernels: Choose a set of features and a kernel function. Solve the dual problem to obtain the support vector weights α_i
At classification time, compute:
Classify as
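A sketch of that classification rule (illustrative code, not from the lecture; the dual weights alpha, labels y, support vectors, and bias b are assumed to come from the solved dual, and the Gaussian kernel is just one choice):

    import numpy as np

    def rbf_kernel(u, v, sigma=1.0):
        # K(u, v) = exp(-||u - v||^2 / (2 sigma^2)): a feature-space dot product in closed form
        return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

    def svm_predict(x, support_vectors, y_sv, alpha_sv, b, kernel=rbf_kernel):
        # sign( sum_i alpha_i y_i K(x_i, x) + b ) -- phi(x) is never formed explicitly
        score = sum(a * y * kernel(sv, x) for a, y, sv in zip(alpha_sv, y_sv, support_vectors))
        return np.sign(score + b)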
-
43
What's the difference between SVMs and Logistic Regression?
                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          No
-
44
Kernels in logistic regression
Define the weights in terms of the support vectors:
Derive a simple gradient descent rule on α_i
-
45
What's the difference between SVMs and Logistic Regression? (Revisited)
                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       Margin        Real probabilities