Introduction: Overview of Kernel Methods
Statistical Data Analysis with Positive Definite Kernels

Kenji Fukumizu
Institute of Statistical Mathematics, ROIS
Department of Statistical Science, Graduate University for Advanced Studies

October 6-10, 2008, Kyushu University




Outline

• Basic idea of kernel methods
  - Linear and nonlinear data analysis
  - Essence of kernel methodology

• Some examples of kernel methods
  - Kernel PCA: nonlinear extension of PCA
  - Ridge regression and its kernelization


Nonlinear Data Analysis I

• Classical linear methods
  • Data is expressed by a matrix.

\[
X = \begin{pmatrix}
X_1^1 & X_1^2 & \cdots & X_1^m \\
X_2^1 & X_2^2 & \cdots & X_2^m \\
\vdots & \vdots & & \vdots \\
X_N^1 & X_N^2 & \cdots & X_N^m
\end{pmatrix}
\qquad (m\text{-dimensional, } N \text{ data})
\]

  • Linear operations (matrix operations) are used for data analysis, e.g.
    - Principal component analysis (PCA)
    - Canonical correlation analysis (CCA)
    - Linear regression analysis
    - Fisher discriminant analysis (FDA)
    - Logistic regression, etc.


Nonlinear Data Analysis II

Are linear methods sufficient? Nonlinear transform can help.

• Example 1: classification

[Figure: left panel, two classes in the (x1, x2) plane, linearly inseparable; right panel, the same data in (z1, z2, z3) space, linearly separable.]

\[
(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2)
\]

(Unclear? watch http://jp.youtube.com/watch?v=3liCbRZPrZA)
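
The effect of the quadratic map in Example 1 can be checked numerically. The sketch below is only an illustration: the two concentric circles (radii 1 and 3) and the threshold 4 are hypothetical toy choices, not the data behind the slide's figure.

```python
import numpy as np

def feature_map(x):
    """Map each row (x1, x2) to (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

# Hypothetical toy data: one class on a circle of radius 1, the other on radius 3.
# No straight line in the (x1, x2) plane separates them.
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 3.0 * inner

z_inner = feature_map(inner)
z_outer = feature_map(outer)

# In z-space, z1 + z2 = x1^2 + x2^2 is the squared radius, so the plane
# z1 + z2 = 4 (any value between 1 and 9 would do) separates the classes.
w = np.array([1.0, 1.0, 0.0])
threshold = 4.0
inner_side = z_inner @ w < threshold
outer_side = z_outer @ w > threshold
print(inner_side.all(), outer_side.all())
```

A linear classifier in (z1, z2, z3) therefore corresponds to a quadratic decision boundary in the original coordinates.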


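• Example 2: dependence between two variables

Correlation

\[
\rho_{XY} = \frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}
= \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{E[(X - E[X])^2]\; E[(Y - E[Y])^2]}}.
\]

• Correlation measures only linear dependence, so transforming the data to incorporate higher-order moments seems attractive.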
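
A small numerical illustration of this point, assuming a hypothetical deterministic example Y = X^2 with X placed symmetrically around zero: the correlation between X and Y vanishes although Y is a function of X, while the correlation between the transformed variable X^2 and Y is 1.

```python
import numpy as np

# Hypothetical example: X symmetric around 0 and Y fully determined by X.
x = np.linspace(-1.0, 1.0, 201)
y = x**2

rho_xy = np.corrcoef(x, y)[0, 1]       # correlation of the raw variables
rho_x2y = np.corrcoef(x**2, y)[0, 1]   # correlation after the transform x -> x^2

print(rho_xy, rho_x2y)
```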


Feature space for transforming data

• Kernel methodology = a systematic way of analyzing data by transforming them into a high-dimensional feature space.

Apply linear methods on the feature space.

• Which type of space serves as a feature space?
  • The space should incorporate various nonlinear information of the original data.
  • The inner product of the feature space is essential for data analysis (seen in the next subsection).


Computational problem of inner product

• For example, how about this?

\[
(X, Y, Z) \mapsto (X, Y, Z, X^2, Y^2, Z^2, XY, YZ, ZX, \ldots).
\]

• But for high-dimensional data, the above expansion makes the feature space huge!

e.g. If X is 100-dimensional and the moments up to the third order are used, the dimensionality of the feature space is

\[
{}_{100}C_1 + {}_{100}C_2 + {}_{100}C_3 = 100 + 4950 + 161700 = 166750.
\]

• This causes a serious computational problem in working with the inner product of the feature space. We need a cleverer way of computing it. ⇒ Kernel method.
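
The count above can be reproduced with binomial coefficients. In this sketch the helper names `n_products` and `n_monomials` are illustrative. The first reproduces the slide's count (products of distinct coordinates); the second, added only for comparison, also counts powers such as X^2 and X^3 and gives a somewhat larger number.

```python
from math import comb

def n_products(m, max_order):
    """Number of products of 1..max_order distinct coordinates out of m."""
    return sum(comb(m, r) for r in range(1, max_order + 1))

def n_monomials(m, max_order):
    """Number of non-constant monomials of degree <= max_order in m variables."""
    return comb(m + max_order, max_order) - 1

dim_slide = n_products(100, 3)        # 100 + 4950 + 161700
dim_with_powers = n_monomials(100, 3)
print(dim_slide, dim_with_powers)
```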


Inner product by positive definite kernel

• A positive definite kernel gives efficient computation of the inner product:

With a special choice of the feature space, we have a function $k(x, y)$ such that

\[
\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j) \qquad \text{(positive definite kernel)},
\]

where

\[
\mathcal{X} \ni x \mapsto \Phi(x) \in \mathcal{H} \quad \text{(feature space)}.
\]

• Many linear methods use only the inner product, without needing the explicit form of the vector $\Phi(X)$.
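
As a concrete instance, the quadratic feature map from Example 1 has the kernel k(x, y) = (x^T y)^2, so the inner product in the 3-dimensional feature space can be computed from the 2-dimensional inputs alone. A minimal check, with arbitrarily chosen test points:

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

def k(x, y):
    """Kernel for this feature map: squared Euclidean inner product."""
    return float(np.dot(x, y)) ** 2

# Arbitrary test points (illustrative values only).
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = float(np.dot(phi(x), phi(y)))   # inner product in the feature space
rhs = k(x, y)                         # kernel evaluated on the inputs
print(lhs, rhs)
```

Both values agree, and the kernel evaluation never forms $\Phi(x)$.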


Review of PCA I

$X_1, \ldots, X_N$: m-dimensional data.

Principal Component Analysis (PCA)
• Find d directions to maximize the variance.
• Purpose: represent the structure of the data in a low-dimensional space.


Review of PCA II

The first principal direction:

\[
u_1 = \arg\max_{\|u\|=1} \frac{1}{N} \sum_{i=1}^N \Bigl\{ u^T \Bigl( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Bigr) \Bigr\}^2
= \arg\max_{\|u\|=1} u^T V u,
\]

where $V$ is the variance-covariance matrix:

\[
V = \frac{1}{N} \sum_{i=1}^N \Bigl( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Bigr) \Bigl( X_i - \frac{1}{N} \sum_{j=1}^N X_j \Bigr)^T.
\]

- Eigenvectors $u_1, \ldots, u_m$ of $V$ (eigenvalues in descending order).
- The p-th principal axis $= u_p$.
- The p-th principal component of $X_i$ $= u_p^T X_i$.

Observation: PCA can be done if we can compute the inner product:
• the covariance matrix $V$,
• the inner product between the unit eigenvector and the data.
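
The recipe above can be sketched in a few lines of NumPy. The data are synthetic and the function name `pca_directions` is illustrative. One assumption differs slightly from the slide: the components are computed from centered data, so they differ from $u_p^T X_i$ only by a constant shift.

```python
import numpy as np

def pca_directions(X):
    """Eigenvalues (descending) and eigenvectors (as columns) of the covariance matrix V."""
    Xc = X - X.mean(axis=0)                 # subtract the sample mean
    V = Xc.T @ Xc / X.shape[0]              # 1/N normalization, as in the slide
    eigvals, eigvecs = np.linalg.eigh(V)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Synthetic data with clearly different variances along the coordinate axes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

eigvals, U = pca_directions(X)
u1 = U[:, 0]                                # first principal direction
first_component = (X - X.mean(axis=0)) @ u1
proj_var = np.var(first_component)          # equals u1^T V u1, the largest eigenvalue
print(eigvals, proj_var)
```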


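Kernel PCA I

$X_1, \ldots, X_N$: m-dimensional data.
Transform the data by a feature map $\Phi$ into a feature space $\mathcal{H}$:

\[
X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N)
\]

Assume that the feature space has the inner product $\langle\ ,\ \rangle$.

Apply PCA to the transformed data:
• Maximize the variance of the projection onto the unit vector $f$.

\[
\max_{\|f\|=1} \mathrm{Var}[\langle f, \Phi(X) \rangle]
= \max_{\|f\|=1} \frac{1}{N} \sum_{i=1}^N \Bigl( \Bigl\langle f,\ \Phi(X_i) - \frac{1}{N} \sum_{j=1}^N \Phi(X_j) \Bigr\rangle \Bigr)^2
\]

• Note: it suffices to use $f = \sum_{i=1}^N a_i \tilde{\Phi}(X_i)$, where

\[
\tilde{\Phi}(X_i) = \Phi(X_i) - \frac{1}{N} \sum_{j=1}^N \Phi(X_j).
\]

The direction orthogonal to $\mathrm{Span}\{\tilde{\Phi}(X_1), \ldots, \tilde{\Phi}(X_N)\}$ does not contribute.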


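Kernel PCA II

• The PCA solution:

\[
\max_a\ a^T \tilde{K}^2 a \quad \text{subject to} \quad a^T \tilde{K} a = 1,
\]

where $\tilde{K}$ is the $N \times N$ matrix with $\tilde{K}_{ij} = \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$ ($\tilde{\Phi}$: centered feature vectors).

Note:

\[
\frac{1}{N} \sum_{i=1}^N \langle f, \tilde{\Phi}(X_i) \rangle^2
= \frac{1}{N} \sum_{i=1}^N \Bigl\langle \sum_{j=1}^N a_j \tilde{\Phi}(X_j),\ \tilde{\Phi}(X_i) \Bigr\rangle^2
= \frac{1}{N}\, a^T \tilde{K}^2 a,
\]

\[
\|f\|^2 = \Bigl\langle \sum_{i=1}^N a_i \tilde{\Phi}(X_i),\ \sum_{i=1}^N a_i \tilde{\Phi}(X_i) \Bigr\rangle = a^T \tilde{K} a.
\]

• The first principal component of the data $X_i$ is

\[
\langle \tilde{\Phi}(X_i), f \rangle = \sum_{j=1}^N \tilde{K}_{ij}\, a_j = \sqrt{\lambda_1}\, u^1_i,
\]

where $\tilde{K} = \sum_{p=1}^N \lambda_p u^p (u^p)^T$ is the eigendecomposition ($\lambda_1 \geq \cdots \geq \lambda_N \geq 0$) and the maximizer is $a = u^1 / \sqrt{\lambda_1}$.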
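
A minimal sketch of this computation, working only with a Gram matrix. The function names are illustrative, and the linear kernel is used purely as a sanity check, since in that case the kernel PCA scores must agree with ordinary PCA scores up to sign. Centering is done with the matrix $H = I_N - \frac{1}{N} \mathbf{1}\mathbf{1}^T$, which is equivalent to subtracting the feature mean.

```python
import numpy as np

def double_center(K):
    """Gram matrix of the centered feature vectors: H K H with H = I - (1/N) 11^T."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def kernel_pca_first_component(K):
    """Coefficients a and first principal component scores from a Gram matrix K."""
    Kc = double_center(K)
    eigvals, eigvecs = np.linalg.eigh(Kc)   # ascending eigenvalues
    lam1, u1 = eigvals[-1], eigvecs[:, -1]  # largest eigenvalue and its unit eigenvector
    a = u1 / np.sqrt(lam1)                  # normalization so that a^T Kc a = 1
    scores = Kc @ a                         # <centered Phi(X_i), f> for every i
    return a, scores, lam1, u1

# Synthetic data; the linear kernel k(x, y) = x^T y is chosen only for the check below.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)) * np.array([2.0, 0.5])
K = X @ X.T
a, scores, lam1, u1 = kernel_pca_first_component(K)

# Ordinary PCA scores on the same data, for comparison.
Xc = X - X.mean(axis=0)
V = Xc.T @ Xc / X.shape[0]
w = np.linalg.eigh(V)[1][:, -1]
linear_scores = Xc @ w
print(np.allclose(np.abs(scores), np.abs(linear_scores)))
```

Replacing `K` by the Gram matrix of any other positive definite kernel gives the nonlinear version without further changes.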


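Kernel PCA III

Observation:
• PCA in the feature space can be done if we can compute $\langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle$ or

\[
\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j).
\]

• The principal direction is obtained in the form $f = \sum_i a_i \tilde{\Phi}(X_i)$, i.e., in the linear hull of the data.

Note:

\[
\begin{aligned}
\tilde{K}_{ij} &= \langle \tilde{\Phi}(X_i), \tilde{\Phi}(X_j) \rangle \\
&= \langle \Phi(X_i), \Phi(X_j) \rangle
 - \frac{1}{N} \sum_{b=1}^N \langle \Phi(X_i), \Phi(X_b) \rangle
 - \frac{1}{N} \sum_{a=1}^N \langle \Phi(X_a), \Phi(X_j) \rangle
 + \frac{1}{N^2} \sum_{a,b=1}^N \langle \Phi(X_a), \Phi(X_b) \rangle \\
&= k(X_i, X_j)
 - \frac{1}{N} \sum_{b=1}^N k(X_i, X_b)
 - \frac{1}{N} \sum_{a=1}^N k(X_a, X_j)
 + \frac{1}{N^2} \sum_{a,b=1}^N k(X_a, X_b).
\end{aligned}
\]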
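
The four-term centering formula can be verified against an explicit feature map. The sketch below assumes the quadratic map and kernel $k(x, y) = (x^T y)^2$ from the classification example, applied to synthetic data. It compares the formula with the Gram matrix of explicitly centered feature vectors.

```python
import numpy as np

def center_gram(K):
    """Apply the four-term centering formula to all entries of K at once."""
    row_mean = K.mean(axis=1, keepdims=True)   # (1/N) sum_b k(X_i, X_b)
    col_mean = K.mean(axis=0, keepdims=True)   # (1/N) sum_a k(X_a, X_j)
    total_mean = K.mean()                      # (1/N^2) sum_{a,b} k(X_a, X_b)
    return K - row_mean - col_mean + total_mean

def phi(X):
    """Explicit quadratic feature map, applied row-wise."""
    return np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2.0) * X[:, 0] * X[:, 1]])

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

K = (X @ X.T) ** 2               # kernel values only, no feature vectors needed
K_centered = center_gram(K)

F = phi(X)
Fc = F - F.mean(axis=0)          # centered feature vectors
K_explicit = Fc @ Fc.T

print(np.allclose(K_centered, K_explicit))
```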


Review: Linear Regression I

Linear regression
• Data: $(X_1, Y_1), \ldots, (X_N, Y_N)$
  • $X_i$: explanatory variable, covariate (m-dimensional)
  • $Y_i$: response variable (1-dimensional)

• Regression model: find the best linear relation

\[
Y_i = a^T X_i + \varepsilon_i
\]


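Review: Linear Regression II

• Least squares method:

\[
\min_a \sum_{i=1}^N \| Y_i - a^T X_i \|^2.
\]

• Matrix expression

\[
X = \begin{pmatrix}
X_1^1 & X_1^2 & \cdots & X_1^m \\
X_2^1 & X_2^2 & \cdots & X_2^m \\
\vdots & \vdots & & \vdots \\
X_N^1 & X_N^2 & \cdots & X_N^m
\end{pmatrix},
\qquad
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}.
\]

• Solution:

\[
a = (X^T X)^{-1} X^T Y
\]

\[
y = a^T x = Y^T X (X^T X)^{-1} x.
\]

Observation: Linear regression can be done if we can compute the inner products $X^T X$, $a^T x$ and so on.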
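
A short numerical sketch of the least squares solution on synthetic data (the coefficients `a_true` and the noise level are hypothetical). It solves the normal equations rather than forming the inverse explicitly, and compares the result with NumPy's built-in least squares routine.

```python
import numpy as np

# Synthetic data: N points in m dimensions with a known linear relation plus noise.
rng = np.random.default_rng(0)
N, m = 50, 3
X = rng.normal(size=(N, m))
a_true = np.array([1.0, -2.0, 0.5])
Y = X @ a_true + 0.1 * rng.normal(size=N)

# Normal equations (X^T X) a = X^T Y.
a_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Prediction for a new input x: y = a^T x.
x_new = np.array([0.2, -0.1, 1.0])
y_new = a_hat @ x_new

# Reference solution from NumPy's least squares solver.
a_lstsq = np.linalg.lstsq(X, Y, rcond=None)[0]
print(a_hat, a_lstsq, y_new)
```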


Ridge Regression

Ridge regression:
• Find a linear relation by

\[
\min_a \sum_{i=1}^N \| Y_i - a^T X_i \|^2 + \lambda \|a\|^2.
\]

$\lambda$: regularization coefficient.

• Solution

\[
a = (X^T X + \lambda I_m)^{-1} X^T Y
\]

For a general $x$,

\[
y(x) = a^T x = Y^T X (X^T X + \lambda I_m)^{-1} x.
\]

• Ridge regression is useful when $(X^T X)^{-1}$ does not exist, or when the inversion is numerically unstable.
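
The last point can be illustrated with a case where $X^T X$ is singular. In the sketch below the sizes N = 5 and m = 10 and the value $\lambda = 0.1$ are arbitrary illustrative choices; with fewer data points than dimensions the plain least squares formula breaks down, while the ridge system remains solvable.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Solve (X^T X + lam * I_m) a = X^T Y for the ridge coefficients a."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

# Fewer data points than dimensions: X^T X (m x m) has rank at most N < m.
rng = np.random.default_rng(0)
N, m = 5, 10
X = rng.normal(size=(N, m))
Y = rng.normal(size=N)
lam = 0.1

rank = np.linalg.matrix_rank(X.T @ X)
a_ridge = ridge_fit(X, Y, lam)

# Gradient of the ridge objective at the solution; it should vanish.
grad = 2.0 * X.T @ (X @ a_ridge - Y) + 2.0 * lam * a_ridge
print(rank, np.abs(grad).max())
```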


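Kernelization of Ridge Regression I

$(X_1, Y_1), \ldots, (X_N, Y_N)$ ($Y_i$: 1-dimensional)

Transform $X_i$ by a feature map $\Phi$ into a feature space $\mathcal{H}$:

\[
X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N)
\]

Assume that the feature space has the inner product $\langle\ ,\ \rangle$.

Apply ridge regression to the transformed data:
• Find the vector $f$ such that

\[
\min_{f \in \mathcal{H}} \sum_{i=1}^N | Y_i - \langle f, \Phi(X_i) \rangle_{\mathcal{H}} |^2 + \lambda \|f\|^2_{\mathcal{H}}.
\]

• Similarly to kernel PCA, we can assume $f = \sum_{j=1}^N c_j \Phi(X_j)$.

\[
\min_c \sum_{i=1}^N \Bigl| Y_i - \Bigl\langle \sum_{j=1}^N c_j \Phi(X_j),\ \Phi(X_i) \Bigr\rangle_{\mathcal{H}} \Bigr|^2
+ \lambda \Bigl\| \sum_{j=1}^N c_j \Phi(X_j) \Bigr\|^2_{\mathcal{H}}
\]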


Kernelization of Ridge Regression II

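• Solution:

\[
c = (K + \lambda I_N)^{-1} Y, \qquad \text{where } K_{ij} = \langle \Phi(X_i), \Phi(X_j) \rangle_{\mathcal{H}} = k(X_i, X_j).
\]

For a general $x$,

\[
y(x) = \langle f, \Phi(x) \rangle_{\mathcal{H}}
= \Bigl\langle \sum_j c_j \Phi(X_j),\ \Phi(x) \Bigr\rangle_{\mathcal{H}}
= Y^T (K + \lambda I_N)^{-1} \mathbf{k},
\]

where

\[
\mathbf{k} = \begin{pmatrix} \langle \Phi(X_1), \Phi(x) \rangle \\ \vdots \\ \langle \Phi(X_N), \Phi(x) \rangle \end{pmatrix}
= \begin{pmatrix} k(X_1, x) \\ \vdots \\ k(X_N, x) \end{pmatrix}.
\]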
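
A compact sketch of kernel ridge regression following these formulas. Everything concrete here is an assumption for illustration: the synthetic data, the Gaussian kernel with bandwidth `sigma`, the value of `lam`, and the function names. The linear kernel is included as a sanity check, because in that case the kernelized prediction must coincide with ordinary ridge regression.

```python
import numpy as np

def linear_kernel(A, B):
    """k(x, y) = x^T y for all pairs of rows of A and B."""
    return A @ B.T

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(K, Y, lam):
    """Solve (K + lam * I_N) c = Y for the coefficients c."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
X_test = rng.normal(size=(5, 3))
lam = 0.5

# Linear kernel: y(x) = sum_j c_j k(X_j, x) should match primal ridge regression.
c_lin = krr_fit(linear_kernel(X, X), Y, lam)
pred_dual = linear_kernel(X_test, X) @ c_lin
a_primal = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
pred_primal = X_test @ a_primal

# Gaussian kernel: same code path, nonlinear regression function.
c_rbf = krr_fit(gaussian_kernel(X, X), Y, lam)
pred_rbf = gaussian_kernel(X_test, X) @ c_rbf
print(np.allclose(pred_dual, pred_primal), pred_rbf)
```

Only the Gram matrix and the vector of kernel values at the new point enter the computation, which is why the kernel can be swapped without touching the solver.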


Kernelization of Ridge Regression III

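Proof. Matrix expression gives

\[
\begin{aligned}
\sum_{i=1}^N \Bigl| Y_i - \Bigl\langle \sum_{j=1}^N c_j \Phi(X_j),\ \Phi(X_i) \Bigr\rangle_{\mathcal{H}} \Bigr|^2
+ \lambda \Bigl\| \sum_{j=1}^N c_j \Phi(X_j) \Bigr\|^2_{\mathcal{H}}
&= (Y - Kc)^T (Y - Kc) + \lambda c^T K c \\
&= c^T (K^2 + \lambda K) c - 2 Y^T K c + Y^T Y.
\end{aligned}
\]

It follows that the optimal $c$ is given by

\[
c = (K + \lambda I_N)^{-1} Y.
\]

Inserting this into $y(x) = \langle \sum_j c_j \Phi(X_j), \Phi(x) \rangle_{\mathcal{H}}$, we have the claim.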
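
The step from the quadratic form to the optimal $c$ can be spelled out by setting the gradient to zero. The following is a short sketch, assuming $K$ is symmetric positive semidefinite and $\lambda > 0$, so that $K + \lambda I_N$ is invertible and the objective (written $J(c)$ here) is convex:

```latex
\begin{align*}
J(c) &= c^T (K^2 + \lambda K) c - 2 Y^T K c + Y^T Y, \\
\nabla_c J(c) &= 2 (K^2 + \lambda K) c - 2 K Y
               = 2 K \bigl( (K + \lambda I_N) c - Y \bigr), \\
(K + \lambda I_N) c = Y
  &\;\Longrightarrow\; \nabla_c J(c) = 0,
  \qquad \text{i.e. } c = (K + \lambda I_N)^{-1} Y \text{ is a minimizer.}
\end{align*}
```

If $K$ is singular, any other minimizer differs from this $c$ by a vector in the null space of $K$ and therefore defines the same $f$.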


Kernelization of Ridge Regression IV

Observation:
• Ridge regression in the feature space can be done if we can compute the inner product

\[
\langle \Phi(X_i), \Phi(X_j) \rangle = k(X_i, X_j).
\]

• The resulting coefficient is of the form $f = \sum_i c_i \Phi(X_i)$, i.e., in the linear hull of the data. The orthogonal directions do not contribute to the objective function.


Kernel methodology

• A feature space $\mathcal{H}$ with inner product $\langle\ ,\ \rangle$.

• Mapping of the data into a feature space:

\[
X_1, \ldots, X_N \mapsto \Phi(X_1), \ldots, \Phi(X_N) \in \mathcal{H}.
\]

• If the computation of the inner product $\langle \Phi(X_i), \Phi(X_j) \rangle$ is tractable, various linear methods can be extended to the feature space.

• This gives methods of nonlinear data analysis.

How can we prepare such a feature space? ⇒ Positive definite kernel!
