
Page 1: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernel Ridge Regression

Prof. Bennett
Based on Chapter 2 of

Shawe-Taylor and Cristianini

Page 2: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Outline

- Overview
- Ridge Regression
- Kernel Ridge Regression
- Other Kernels
- Summary

Page 3: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Recall E&K model

$R(t) = at^2 + bt + c$ is linear in its parameters. Define the mapping $\theta(t)$ and make the function linear in $\theta(t)$, i.e. in feature space:

$$\theta(t) = [t^2 \;\; t \;\; 1]', \qquad s = [a \;\; b \;\; c]'$$

$$R(t) = \theta(t) \cdot s = [t^2 \;\; t \;\; 1]\begin{bmatrix} a \\ b \\ c \end{bmatrix} = at^2 + bt + c$$

Page 4: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Linear versus Nonlinear

$f(t) = bt + c$ versus $f(\theta(t)) = at^2 + bt + c$

Page 5: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernel Method

Two parts:
- Mapping into an embedding or feature space defined by the kernel.
- A learning algorithm for discovering linear patterns in that space.

Illustrate using linear ridge regression.

Page 6: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Linear Regression in Feature Space

Key idea: map the data to a higher-dimensional space (the feature space) and perform linear regression in the embedded space.

Embedding map:
$$\phi:\; \mathbf{x} \in \mathbb{R}^n \;\mapsto\; \phi(\mathbf{x}) \in F \subseteq \mathbb{R}^N, \qquad N \gg n$$

Page 7: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Nonlinear Regression in Feature Space

Input space:
$$\mathbf{x} = [r, s], \qquad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$

Feature space:
$$\theta(\mathbf{x}) = \left[r^2,\; s^2,\; \sqrt{2}\,rs\right], \qquad g(\mathbf{x}) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^2 + w_2 s^2 + w_3\sqrt{2}\,rs$$

Page 8: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Nonlinear Regression in Feature Space

Input space:
$$\mathbf{x} = [r, s], \qquad g(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle = w_1 r + w_2 s$$

Feature space (all monomials up to degree 3):
$$\theta(\mathbf{x}) = \left[r^3,\; s^3,\; r^2 s,\; r s^2,\; r^2,\; s^2,\; rs,\; r,\; s,\; 1\right]$$
$$g(\mathbf{x}) = \langle \theta(\mathbf{x}), \mathbf{w} \rangle_F = w_1 r^3 + w_2 s^3 + \dots + w_{10}$$

Page 9: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Let’s try a quadratic on Aquasol

What are all the terms we need to add to our 525-dimensional input space to map it into feature space?

Just do squared terms and cross terms.
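As a quick size check (a minimal sketch, assuming the 525 descriptors mentioned above and counting only the new squared and cross terms):

```python
# Count the extra coordinates a quadratic feature map adds to an n-dimensional input:
# n squared terms plus n*(n-1)/2 distinct cross terms.
n = 525                    # number of input descriptors (from the slide)
squared = n                # x_i^2 terms
cross = n * (n - 1) // 2   # x_i * x_j terms with i < j
print(squared + cross)     # 138075 extra coordinates on top of the original 525
```

Explicitly constructing roughly 138,000 extra columns is exactly the cost that duality and the kernel trick (next slides) let us avoid.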

Page 10: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernel and Duality to the rescue

- Duality: an alternative but equivalent view of the problem.
- Kernel trick: makes the mapping to the feature space efficient.

Page 11: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Linear Regression

Given training data (points and labels):
$$S = \left((\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_\ell, y_\ell)\right), \qquad \mathbf{x}_i \in \mathbb{R}^n, \quad y_i \in \mathbb{R}$$

Construct a linear function:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}'\mathbf{x} = \sum_{i=1}^{n} w_i x_i$$

Page 12: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Least Squares Approximation

Want:
$$g(\mathbf{x}) \approx y$$

Define error:
$$f(\mathbf{x}, y) = y - g(\mathbf{x}) = \xi$$

Minimize loss:
$$L(g, S) = L(\mathbf{w}, S) = \sum_{i=1}^{\ell} \left(y_i - g(\mathbf{x}_i)\right)^2 = \sum_{i=1}^{\ell} \xi_i^2$$

Page 13: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Ridge Regression

Use the least norm solution for fixed $\lambda > 0$.

Regularized problem:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2$$

Optimality condition:
$$\frac{\partial L_\lambda(\mathbf{w}, S)}{\partial \mathbf{w}} = 2\lambda\mathbf{w} - 2X'\mathbf{y} + 2X'X\mathbf{w} = 0$$
$$\Rightarrow\; (X'X + \lambda I_n)\,\mathbf{w} = X'\mathbf{y}$$

Requires $O(n^3)$ operations.
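A minimal NumPy sketch of this primal solve (variable names are illustrative, not from the slides): form $X'X + \lambda I_n$ and solve the $n \times n$ system.

```python
import numpy as np

def primal_ridge(X, y, lam):
    """Solve (X'X + lam*I_n) w = X'y; the n-by-n solve is the O(n^3) step."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy example: l = 100 points, n = 5 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
print(primal_ridge(X, y, lam=0.1))
```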

Page 14: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Ridge Regression (cont)

The inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = (X'X + \lambda I)^{-1} X'\mathbf{y}$$

Alternative representation:
$$(X'X + \lambda I)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \lambda\mathbf{w} = X'\mathbf{y} - X'X\mathbf{w}$$
$$\Rightarrow\; \mathbf{w} = \lambda^{-1}X'(\mathbf{y} - X\mathbf{w}) = X'\boldsymbol{\alpha}, \qquad \text{where } \boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w})$$

Page 15: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Ridge Regression (cont)

The inverse always exists for any $\lambda > 0$:
$$\mathbf{w} = (X'X + \lambda I)^{-1} X'\mathbf{y}$$

Alternative representation:
$$(X'X + \lambda I)\mathbf{w} = X'\mathbf{y} \;\Rightarrow\; \lambda\mathbf{w} = X'\mathbf{y} - X'X\mathbf{w} \;\Rightarrow\; \mathbf{w} = \lambda^{-1}X'(\mathbf{y} - X\mathbf{w}) = X'\boldsymbol{\alpha}$$
$$\boldsymbol{\alpha} = \lambda^{-1}(\mathbf{y} - X\mathbf{w}) \;\Rightarrow\; \lambda\boldsymbol{\alpha} = \mathbf{y} - X\mathbf{w} = \mathbf{y} - XX'\boldsymbol{\alpha}$$
$$\Rightarrow\; (G + \lambda I)\,\boldsymbol{\alpha} = \mathbf{y} \;\Rightarrow\; \boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \qquad \text{where } G = XX'$$

Solving the $\ell \times \ell$ system is $O(\ell^3)$.


Page 17: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Gram or Kernel Matrix

Gram Matrix

Composed of inner products of data

$$K = G = XX', \qquad K_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

Page 18: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Dual Ridge Regression

To predict new point:

$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \qquad \text{where } z_i = \langle \mathbf{x}_i, \mathbf{x} \rangle$$

$$G = XX', \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

Note: we need only compute $G$, the Gram matrix.

Ridge Regression requires only inner products between data points
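A minimal sketch of the dual route (illustrative names; G is the l-by-l Gram matrix and z holds the inner products of the training points with the new point):

```python
import numpy as np

def dual_ridge_fit(X, y, lam):
    """alpha = (G + lam*I_l)^{-1} y with G = X X'; the l-by-l solve is the O(l^3) step."""
    G = X @ X.T                                   # Gram matrix of pairwise inner products
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def dual_ridge_predict(alpha, X, x_new):
    """g(x) = sum_i alpha_i <x_i, x>, i.e. alpha'z with z_i = <x_i, x_new>."""
    return alpha @ (X @ x_new)
```

Recovering the primal weights as w = X'alpha gives the same predictions, which is what makes the two views equivalent.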

Page 19: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Efficiency

Computing $\mathbf{w}$ in primal ridge regression is $O(n^3)$; computing $\boldsymbol{\alpha}$ in dual ridge regression is $O(\ell^3)$.

To predict a new point $\mathbf{x}$:

Primal, $O(n)$:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{n} w_i x_i$$

Dual, $O(n\ell)$:
$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \left( \sum_{j=1}^{n} x_{ij} x_j \right)$$

The dual is better if $n \gg \ell$.

Page 20: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Notes on Ridge Regression

- "Regularization" is key to addressing stability.
- Regularization lets the method work when n >> l.
- The dual is more efficient when n >> l.
- The dual only requires inner products of the data.

Page 21: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Nonlinear Regression in Feature Space

In the dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle$$

So if we can efficiently compute the inner product, our method is efficient.

Page 22: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Let's try it for our sample problem

$$\langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle = \left\langle (u_1^2,\; u_2^2,\; \sqrt{2}\,u_1 u_2),\; (v_1^2,\; v_2^2,\; \sqrt{2}\,v_1 v_2) \right\rangle = u_1^2 v_1^2 + u_2^2 v_2^2 + 2 u_1 u_2 v_1 v_2 = \left(u_1 v_1 + u_2 v_2\right)^2 = \langle \mathbf{u}, \mathbf{v} \rangle^2$$

Define: $K(\mathbf{u}, \mathbf{v}) = \langle \mathbf{u}, \mathbf{v} \rangle^2$
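A small numerical check of this identity (a sketch; `phi` is the explicit quadratic map from the slide and `K` the kernel):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map from the slide: (u1^2, u2^2, sqrt(2)*u1*u2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(u, v):
    """Kernel trick: <phi(u), phi(v)> = <u, v>^2, without building the features."""
    return (u @ v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v), K(u, v))   # both print 1.0
```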

Page 23: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernel Function

A kernel is a function K such that

$$K(\mathbf{x}, \mathbf{u}) = \langle \phi(\mathbf{x}), \phi(\mathbf{u}) \rangle_F$$

where $\phi$ is a mapping from input space to feature space $F$.

There are many possible kernels. The simplest is the linear kernel:
$$K(\mathbf{x}, \mathbf{u}) = \langle \mathbf{x}, \mathbf{u} \rangle$$

Page 24: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Dual Ridge Regression

To predict new point:

$$g(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \qquad \text{where } z_i = \langle \mathbf{x}_i, \mathbf{x} \rangle$$

$$G = XX', \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$$

Note: we need only compute $G$, the Gram matrix.

Ridge Regression requires only inner products between data points

Page 25: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Ridge Regression in Feature Space

To predict new point:

$$g(\phi(\mathbf{x})) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle = \mathbf{y}'(G + \lambda I)^{-1}\mathbf{z}, \qquad \text{where } z_i = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle$$

To compute the Gram matrix:
$$G = \phi(X)\,\phi(X)', \qquad G_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$$

Use kernel to compute inner product
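Putting the pieces together, a minimal kernel ridge regression sketch (illustrative function names; any kernel `K(u, v)` can be plugged in, here the degree-2 kernel from the earlier slides):

```python
import numpy as np

def kernel_ridge_fit(X, y, lam, K):
    """alpha = (G + lam*I)^{-1} y with G_ij = K(x_i, x_j)."""
    l = len(y)
    G = np.array([[K(X[i], X[j]) for j in range(l)] for i in range(l)])
    return np.linalg.solve(G + lam * np.eye(l), y)

def kernel_ridge_predict(alpha, X, x_new, K):
    """g(x) = sum_i alpha_i K(x_i, x)."""
    return alpha @ np.array([K(xi, x_new) for xi in X])

# Fit a purely quadratic target with the degree-2 kernel K(u, v) = <u, v>^2.
quad = lambda u, v: (u @ v) ** 2
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = X[:, 0] ** 2 + 2.0 * X[:, 0] * X[:, 1]
alpha = kernel_ridge_fit(X, y, lam=1e-3, K=quad)
print(kernel_ridge_predict(alpha, X, X[0], K=quad), y[0])   # should be close
```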

Page 26: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Nonlinear Regression in Feature Space

Kernel trick works for any dual representation:
$$g(\mathbf{x}) = \langle \phi(\mathbf{x}), \mathbf{w} \rangle_F = \sum_{i=1}^{\ell} \alpha_i \langle \phi(\mathbf{x}), \phi(\mathbf{x}_i) \rangle = \sum_{i=1}^{\ell} \alpha_i K(\mathbf{x}, \mathbf{x}_i)$$

Page 27: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Popular Kernels based on vectors

By Hilbert-Schmidt Kernels (Courant and Hilbert 1953)

$$\langle \theta(\mathbf{u}), \theta(\mathbf{v}) \rangle \equiv K(\mathbf{u}, \mathbf{v})$$

for certain $\theta$ and $K$, e.g.:

Degree $d$ polynomial: $K(\mathbf{u}, \mathbf{v}) = \left(\langle \mathbf{u}, \mathbf{v} \rangle + 1\right)^d$

Radial Basis Function Machine: $K(\mathbf{u}, \mathbf{v}) = \exp\!\left(-\dfrac{\|\mathbf{u} - \mathbf{v}\|^2}{\sigma}\right)$

Two-Layer Neural Network: $K(\mathbf{u}, \mathbf{v}) = \operatorname{sigmoid}\left(\eta\,\langle \mathbf{u}, \mathbf{v} \rangle + c\right)$
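Sketches of these three kernels in NumPy (the parameter names d, sigma, eta, c follow the slide; exact scaling conventions for the RBF denominator vary across texts, so treat them as assumptions):

```python
import numpy as np

def poly_kernel(u, v, d=2):
    """Degree-d polynomial kernel: (<u, v> + 1)^d."""
    return (u @ v + 1.0) ** d

def rbf_kernel(u, v, sigma=1.0):
    """Radial basis function kernel: exp(-||u - v||^2 / sigma)."""
    return np.exp(-np.sum((u - v) ** 2) / sigma)

def sigmoid_kernel(u, v, eta=1.0, c=0.0):
    """Two-layer neural network kernel: sigmoid(eta*<u, v> + c); tanh is the usual sigmoid here."""
    return np.tanh(eta * (u @ v) + c)
```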

Page 28: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernels Intuition

Kernels encode the notion of similarity to be used for a specific application.

Documents use cosine of "bags of text". Gene sequences can use edit distance.

Similarity defines distance:
$$\|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v})'(\mathbf{u} - \mathbf{v}) = \langle \mathbf{u}, \mathbf{u} \rangle - 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle$$

Trick is to get right encoding for domain.

Page 29: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Important Points

- Kernel method = linear method + embedding in feature space.
- Kernel functions used to do embedding efficiently.
- Feature space is higher dimensional space so must regularize.
- Choose kernel appropriate to domain.

Page 30: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Kernel Ridge Regression

- Simple to derive kernel method.
- Works great in practice with some finessing.
- Next time: practical issues; more standard dual derivation.

Page 31: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Optimal Solution

Want:
$$\mathbf{y} \approx X\mathbf{w} + b\mathbf{e}, \qquad \mathbf{e} \text{ a vector of ones}$$

Mathematical Model:
$$\min_{\mathbf{w}, b} L(\mathbf{w}, b, S) = \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2 + \lambda\|\mathbf{w}\|^2$$

Optimality Conditions:
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial \mathbf{w}} = -2X'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) + 2\lambda\mathbf{w} = 0$$
$$\frac{\partial L(\mathbf{w}, b, S)}{\partial b} = -2\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0$$
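One step the slide leaves implicit: solving the second optimality condition for $b$ gives the mean residual (using $\mathbf{e}'\mathbf{e} = \ell$), which is what motivates the centering on the next slides.

$$\mathbf{e}'(\mathbf{y} - X\mathbf{w} - b\mathbf{e}) = 0 \;\Rightarrow\; b = \frac{1}{\ell}\,\mathbf{e}'(\mathbf{y} - X\mathbf{w})$$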

Page 32: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Let's Try It: In-Class Lab

Basic ridge regression:
$$\min_{\mathbf{w}} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - X\mathbf{w}\|^2, \qquad \lambda > 0$$

Optimality condition:
$$\mathbf{w} = (X'X + \lambda I_n)^{-1} X'\mathbf{y}$$

Dual equivalent:
$$\boldsymbol{\alpha} = (G + \lambda I)^{-1}\mathbf{y}, \qquad \text{where } G_{i,j} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle, \qquad \mathbf{w} = X'\boldsymbol{\alpha}$$
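For the lab, a minimal check that the primal and dual solutions agree (illustrative data and names):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))     # l = 30 points, n = 4 features
y = rng.standard_normal(30)
lam = 0.5

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)   # primal: n-by-n solve
alpha = np.linalg.solve(X @ X.T + lam * np.eye(30), y)           # dual: l-by-l solve
w_dual = X.T @ alpha                                             # w = X' alpha

print(np.allclose(w_primal, w_dual))   # True
```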

Page 33: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Better model

Basic ridge regression (with bias):
$$\min_{\mathbf{w}, b} L_\lambda(\mathbf{w}, S) = \lambda\|\mathbf{w}\|^2 + \|\mathbf{y} - (X\mathbf{w} + b\mathbf{e})\|^2, \qquad \lambda > 0$$

Center $X$ and $\mathbf{y}$ to get $X_c$, $\mathbf{y}_c$:
$$\mathbf{w} = (X_c'X_c + \lambda I_n)^{-1} X_c'\mathbf{y}_c$$

Dual equivalent:
$$\boldsymbol{\alpha} = (G_c + \lambda I)^{-1}\mathbf{y}_c, \qquad \text{where } (G_c)_{i,j} = \langle \mathbf{x}_{c,i}, \mathbf{x}_{c,j} \rangle, \qquad \mathbf{w} = X_c'\boldsymbol{\alpha}$$

Page 34: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Recenter Data

Shift y by its mean:
$$\mu = \frac{1}{\ell}\sum_{i=1}^{\ell} y_i, \qquad y_{c,i} := y_i - \mu$$

Shift x by its mean:
$$\bar{\mathbf{x}} = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{x}_i, \qquad \mathbf{x}_{c,i} := \mathbf{x}_i - \bar{\mathbf{x}}$$

Page 35: Kernel Ridge Regression - Rensselaer Polytechnic Institute

Ridge Regression with bias

Center the data:
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}', \qquad \mathbf{y}_c = \mathbf{y} - \mu\mathbf{e}$$

Calculate $\mathbf{w}$:
$$\mathbf{w} = (X_c'X_c + \lambda I)^{-1} X_c'\mathbf{y}_c$$

Calculate $b$:
$$b = \mu$$

To predict a new point:
$$g(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{w} + b$$

b µ=( ) ( ) 'g x x x w b= − +