
Page 1:

Introduction to Machine Learning

Non-linear prediction with kernels

Prof. Andreas Krause
Learning and Adaptive Systems (las.ethz.ch)

Page 2:

Recall: Linear classifiers

[Figure: 2-D scatter of + and – labeled points separated by a linear decision boundary, with weight (normal) vector w]

$$y = \mathrm{sign}(w^T x)$$

Page 3:

Recall: The Perceptron problem
Solve
$$\hat{w} = \arg\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ell_P(w; x_i, y_i)$$
where
$$\ell_P(w; x_i, y_i) = \max(0, -y_i w^T x_i)$$
Optimize via Stochastic Gradient Descent
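A minimal NumPy sketch of this stochastic (sub)gradient update, assuming a data matrix X (n×d) and labels y in {−1, +1}; the function name perceptron_sgd, the fixed learning rate, and the iteration count are illustrative, not from the slides.

```python
import numpy as np

def perceptron_sgd(X, y, n_iter=1000, eta=1.0, seed=0):
    """SGD on the perceptron loss max(0, -y_i w^T x_i), starting from w = 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)                  # pick a data point uniformly at random
        if y[i] * (X[i] @ w) <= 0:           # misclassified (or on the boundary)
            w += eta * y[i] * X[i]           # subgradient step on the perceptron loss
        # correctly classified points contribute zero gradient
    return w
```

Predictions then follow the linear rule from page 2, e.g. `np.sign(X @ w)`.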

Page 4:

Solving non-linear classification tasks
How can we find non-linear classification boundaries?
As in regression, we can use non-linear transformations of the feature vectors, followed by linear classification.


Page 5:

Recall: linear regression for polynomials
We can fit non-linear functions via linear regression, using non-linear features of our data (basis functions):
$$f(x) = \sum_{i=1}^{d} w_i \phi_i(x)$$
For example: polynomials (in 1-D)
$$f(x) = \sum_{i=0}^{m} w_i x^i$$
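A small sketch of this idea in NumPy, fitting a degree-m polynomial by ordinary least squares on monomial features; the helper name poly_features and the toy data are illustrative.

```python
import numpy as np

def poly_features(x, m):
    """Map scalars x to monomial features [1, x, x^2, ..., x^m]."""
    return np.vander(x, N=m + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)   # illustrative noisy target
Phi = poly_features(x, m=5)                              # non-linear features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # linear least squares
y_hat = Phi @ w                                          # non-linear fit via a linear model
```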

Page 6:

Polynomials in higher dimensions
Suppose we wish to use polynomial features, but our input is higher-dimensional.
We can still use monomial features.
Example: Monomials in 2 variables, degree = 2
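For instance (an added illustration, not from the slide), with $x = (x_1, x_2)$ the monomials of degree exactly 2 are
$$x_1^2, \quad x_1 x_2, \quad x_2^2,$$
and allowing all degrees up to 2 adds $1,\ x_1,\ x_2$.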


Page 7:

Avoiding the feature explosion
Need $O(d^k)$ dimensions to represent (multivariate) polynomials of degree k on d features.
Example: d = 10000, k = 2 ⇒ need ~100M dimensions

In the following, we will see how we can efficiently and implicitly operate in such high-dimensional feature spaces (i.e., without ever explicitly computing the transformation).


Page 8:

Revisiting the Perceptron/SVM
Fundamental insight: the optimal hyperplane lies in the span of the data,
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$
(Handwavy) proof: (stochastic) gradient descent starting from 0 constructs such a representation.

Perceptron: $w_{t+1} = w_t + \eta_t y_t x_t\,[y_t w_t^T x_t < 0]$
SVM: $w_{t+1} = w_t(1 - 2\lambda\eta_t) + \eta_t y_t x_t\,[y_t w_t^T x_t < 1]$

More abstract proof: follows from the "representer theorem" (not discussed here).
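To make the handwavy argument slightly more concrete (an added induction step, using the SVM update; the perceptron case is the one with $\lambda = 0$): $w_0 = 0$ trivially lies in the span, and each step
$$w_{t+1} = (1 - 2\lambda\eta_t)\, w_t + \eta_t y_t x_t$$
only rescales the old coefficients by $(1 - 2\lambda\eta_t)$ and adds $\eta_t$ to the coefficient of the point picked at step $t$ (when the update fires), so every iterate, and hence the final $w$, stays in the span of the data.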

Page 9:

Reformulating the Perceptron
Substituting $w = \sum_j \alpha_j y_j x_j$ into the perceptron objective gives an equivalent problem in the coefficients $\alpha$ alone (next slide).


Page 10:

Advantage of reformulation
$$\hat\alpha = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0,\; -\sum_{j=1}^{n} \alpha_j y_i y_j\, x_i^T x_j\Big\}$$
Key observation: the objective only depends on inner products of pairs of data points.
Thus, we can implicitly work in high-dimensional spaces, as long as we can compute inner products efficiently:
$$x \mapsto \phi(x), \qquad x^T x' \mapsto \phi(x)^T \phi(x') =: k(x, x')$$

Page 11:

"Kernels = efficient inner products"
Often, $k(x, x')$ can be computed much more efficiently than $\phi(x)^T \phi(x')$.

Simple example: polynomial kernel of degree 2

Page 12:

Polynomial kernels (degree 2)
Suppose $x = [x_1, \ldots, x_d]^T$ and $x' = [x'_1, \ldots, x'_d]^T$. Then
$$(x^T x')^2 = \Big(\sum_{i=1}^{d} x_i x'_i\Big)^2 = \sum_{i=1}^{d}\sum_{j=1}^{d} (x_i x_j)(x'_i x'_j) = \phi(x)^T \phi(x'),$$
where $\phi(x) = [x_i x_j]_{i,j=1}^{d}$ collects all $d^2$ degree-2 monomials. Evaluating $k(x, x') = (x^T x')^2$ costs $O(d)$, while the explicit inner product in feature space costs $O(d^2)$.
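A quick NumPy check of this identity; the explicit feature map phi2, which materializes all $d^2$ monomials, is only for illustration.

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map: all products x_i * x_j, flattened (d^2 entries)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)

k_implicit = (x @ xp) ** 2            # O(d): evaluate the kernel directly
k_explicit = phi2(x) @ phi2(xp)       # O(d^2): inner product in feature space
assert np.isclose(k_implicit, k_explicit)
```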

Page 13:

Polynomial kernels: Fixed degree
The kernel
$$k(x, x') = (x^T x')^m$$
implicitly represents all monomials of degree m.

How can we get monomials up to order m?

Page 14:

Polynomial kernels
The polynomial kernel
$$k(x, x') = (1 + x^T x')^m$$
implicitly represents all monomials of up to degree m.

Representing the monomials (and computing the inner product explicitly) is exponential in m!
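To see why adding the constant 1 yields the lower-order monomials (a short added step), expand the kernel with the binomial theorem:
$$(1 + x^T x')^m = \sum_{j=0}^{m} \binom{m}{j}\, (x^T x')^j,$$
a weighted sum of the fixed-degree kernels for $j = 0, \ldots, m$, each of which represents the monomials of degree exactly $j$.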

Page 15:

The "Kernel Trick"
Express the problem s.t. it only depends on inner products.
Replace inner products $x_i^T x_j$ by kernels $k(x_i, x_j)$.

This "trick" is very widely applicable!


Page 17:

The "Kernel Trick"
Express the problem s.t. it only depends on inner products.
Replace inner products by kernels.

Example: Perceptron
$$\hat\alpha = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0,\; -\sum_{j=1}^{n} \alpha_j y_i y_j\, x_i^T x_j\Big\}$$
becomes
$$\hat\alpha = \arg\min_{\alpha_{1:n}} \frac{1}{n} \sum_{i=1}^{n} \max\Big\{0,\; -\sum_{j=1}^{n} \alpha_j y_i y_j\, k(x_i, x_j)\Big\}$$

Will see further examples later.

Page 18:

Derivation: Kernelized Perceptron


Page 19:

Kernelized Perceptron

Training:
Initialize $\alpha_1 = \cdots = \alpha_n = 0$
For $t = 1, 2, \ldots$:
  Pick a data point $(x_i, y_i)$ uniformly at random
  Predict $\hat y = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x_i)\Big)$
  If $\hat y \neq y_i$, set $\alpha_i \leftarrow \alpha_i + \eta_t$

Prediction:
For a new point $x$, predict $\hat y = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\Big)$
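A minimal NumPy sketch of this training/prediction loop. The kernel helper, the constant learning rate, and the function names are illustrative assumptions, not from the slides; any valid kernel could be plugged in.

```python
import numpy as np

def rbf_kernel(A, B, h=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / h**2)

def train_kernel_perceptron(X, y, kernel=rbf_kernel, n_iter=5000, eta=1.0, seed=0):
    """Kernelized perceptron: keep one coefficient alpha_i per training point."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    K = kernel(X, X)                             # precompute the Gram matrix
    for _ in range(n_iter):
        i = rng.integers(n)                      # pick a data point uniformly at random
        y_hat = np.sign(np.sum(alpha * y * K[:, i]))
        if y_hat != y[i]:                        # on a mistake, upweight that point
            alpha[i] += eta
    return alpha

def predict_kernel_perceptron(alpha, X_train, y_train, X_new, kernel=rbf_kernel):
    """Predict sign(sum_j alpha_j y_j k(x_j, x)) for each new point x."""
    K_new = kernel(X_train, X_new)               # shape (n_train, n_new)
    return np.sign((alpha * y_train) @ K_new)
```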

Page 20:

Demo: Kernelized Perceptron


Page 21:

Questions
What are valid kernels?
How can we select a good kernel for our problem?
Can we use kernels beyond the perceptron?
Kernels work in very high-dimensional spaces. Doesn't this lead to overfitting?


Page 22:

Properties of kernel functions
Data space X. A kernel is a function $k : X \times X \to \mathbb{R}$.
Can we use any function?

k must be an inner product in a suitable space
⇒ k must be symmetric!
⇒ Are there other properties that it must satisfy?

Page 23:

Positive semi-definite matrices
A symmetric matrix $M \in \mathbb{R}^{n \times n}$ is positive semidefinite iff $x^T M x \geq 0$ for all $x \in \mathbb{R}^n$.
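Two equivalent characterizations (standard linear-algebra facts, added here for reference):
$$M \succeq 0 \;\iff\; \text{all eigenvalues of } M \text{ are } \geq 0 \;\iff\; M = B^T B \text{ for some matrix } B.$$
The last form is exactly what connects positive semi-definiteness to inner products of feature vectors (the columns of $B$).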

Page 24:

Kernels ⇒ semi-definite matrices
Data space X (possibly infinite), kernel function $k : X \times X \to \mathbb{R}$.
Take any finite subset of data $S = \{x_1, \ldots, x_n\} \subseteq X$.
Then the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T\phi(x_1) & \ldots & \phi(x_1)^T\phi(x_n) \\ \vdots & & \vdots \\ \phi(x_n)^T\phi(x_1) & \ldots & \phi(x_n)^T\phi(x_n) \end{pmatrix}$$
is positive semidefinite.
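Why this holds (a one-line argument, added since the slide only states the claim): for any $v \in \mathbb{R}^n$,
$$v^T K v = \sum_{i,j} v_i v_j\, \phi(x_i)^T \phi(x_j) = \Big\| \sum_i v_i\, \phi(x_i) \Big\|^2 \;\geq\; 0.$$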

Page 25:

Semi-definite matrices ⇒ kernels
Suppose the data space X = {1, ..., n} is finite, and we are given a positive semidefinite matrix $K \in \mathbb{R}^{n \times n}$.
Then we can always construct a feature map $\phi : X \to \mathbb{R}^n$ such that $K_{i,j} = \phi(i)^T \phi(j)$.
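One standard construction (not spelled out on the slide): take the eigendecomposition $K = V \Lambda V^T$ (possible since K is symmetric, with $\Lambda \geq 0$ by positive semi-definiteness) and let $\phi(i)$ be the i-th column of $\Lambda^{1/2} V^T$; then $\phi(i)^T \phi(j) = (V \Lambda V^T)_{i,j} = K_{i,j}$. A small NumPy check of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
K = B @ B.T                                   # an arbitrary symmetric PSD matrix

lam, V = np.linalg.eigh(K)                    # eigendecomposition K = V diag(lam) V^T
lam = np.clip(lam, 0.0, None)                 # guard against tiny negative round-off
Phi = np.sqrt(lam)[:, None] * V.T             # column i is the feature vector phi(i)

assert np.allclose(Phi.T @ Phi, K)            # phi(i)^T phi(j) recovers K_{i,j}
```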

Page 26:

Outlook: Mercer's Theorem
Let X be a compact subset of $\mathbb{R}^n$ and $k : X \times X \to \mathbb{R}$ a kernel function.
Then one can expand k in a uniformly convergent series of bounded functions $\phi_i$ s.t.
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(x')$$
Can be generalized even further.

Page 27:

Definition: kernel functions
Data space X. A kernel is a function $k : X \times X \to \mathbb{R}$ satisfying
1) Symmetry: For any $x, x' \in X$ it must hold that $k(x, x') = k(x', x)$
2) Positive semi-definiteness: For any n and any set $S = \{x_1, \ldots, x_n\} \subseteq X$, the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \ldots & k(x_1, x_n) \\ \vdots & & \vdots \\ k(x_n, x_1) & \ldots & k(x_n, x_n) \end{pmatrix}$$
must be positive semi-definite.

Page 28:

Examples of kernels on $\mathbb{R}^d$
Linear kernel: $k(x, x') = x^T x'$
Polynomial kernel: $k(x, x') = (x^T x' + 1)^m$
Gaussian (RBF, squared exp.) kernel: $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$
Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|_1 / h)$
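A short sketch of these four kernels as NumPy functions on single vectors; the bandwidth h and degree m defaults are arbitrary illustrations.

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, m=3):
    return (x @ xp + 1.0) ** m

def gaussian_kernel(x, xp, h=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / h**2)

def laplacian_kernel(x, xp, h=1.0):
    return np.exp(-np.sum(np.abs(x - xp)) / h)

# Any of these can be plugged into the kernelized perceptron from page 19,
# e.g. k(x_j, x) = gaussian_kernel(x_j, x) inside the prediction sum.
x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(gaussian_kernel(x, xp))
```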