
Manifold Learning: From Linear to Nonlinear

Presenter: Wei-Lun (Harry) Chao

Date: April 26 and May 3, 2012

At: AMMAI 2012

Preview

• Goal:

Dimensionality reduction

Classification and clustering

• Main idea:

What information and properties to preserve or enhance?


Outline

• Notation and fundamentals of linear algebra

• PCA and LDA

• Topology, manifold, and embedding

• MDS

• ISOMAP

• LLE

• Laplacian eigenmap

• Graph embedding and supervised, semi-supervised extensions

• Other manifold learning algorithms

• Manifold ranking

• Other cases

Reference

• [1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007

• [2] R. O. Duda et al., Pattern Classification, 2001

• [3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997

• [4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000

• [5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000

• [6] L. K. Saul et al., Think globally, fit locally, 2003

• [7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003

• [8] T. F. Cootes et al., Active appearance models, 1998

Notation

• Data set:
  high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$, low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$

• Matrix: $A_{n \times m} = [\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(m)}] = [a_{ij}]_{1 \le i \le n,\, 1 \le j \le m}$

• Vector: $\mathbf{a}^{(i)} = [a_{1}^{(i)}, a_{2}^{(i)}, \ldots, a_{d}^{(i)}]^{T}$

• Matrix form of data set: $X_{d \times N} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(N)}]$

Fundamentals of Linear Algebra

• SVD (singular value decomposition):
  $X_{d \times N} = U_{d \times d}\, \Sigma_{d \times N}\, V^{T}_{N \times N}$, where
  $U = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(d)}]$, $V = [\mathbf{v}^{(1)}, \mathbf{v}^{(2)}, \ldots, \mathbf{v}^{(N)}]$,
  $\Sigma$ has the singular values $\sigma_{11} \ge \sigma_{22} \ge \ldots \ge 0$ on its diagonal and zeros elsewhere, and
  $U^{T}U = UU^{T} = I_{d \times d}$, $V^{T}V = VV^{T} = I_{N \times N}$ (the columns of $U$ and $V$ are orthonormal).

Fundamentals of Linear Algebra

• SVD (singular value decomposition): block-matrix illustration of $X_{d \times N} = U_{d \times d}\, \Sigma_{d \times N}\, V^{T}_{N \times N}$, with $\Sigma$ diagonal (figure).
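As a quick numerical companion (not part of the original slides), here is a minimal NumPy sketch that factorizes a small made-up data matrix and checks the properties listed above; all variable names and the random data are my own.

```python
import numpy as np

# Made-up d x N data matrix, just to illustrate X = U Sigma V^T.
rng = np.random.default_rng(0)
d, N = 4, 7
X = rng.standard_normal((d, N))

U, s, Vt = np.linalg.svd(X)              # U: d x d, s: singular values, Vt: N x N
Sigma = np.zeros((d, N))
Sigma[:d, :d] = np.diag(s)               # rebuild the rectangular Sigma

print(np.allclose(X, U @ Sigma @ Vt))    # X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(d)))   # U^T U = I_d
print(np.allclose(Vt @ Vt.T, np.eye(N))) # V^T V = I_N
```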

Fundamentals of Linear Algebra

• EVD (eigenvector decomposition):
  $A_{N \times N} U = U \Lambda$, i.e. $A = U \Lambda U^{-1}$, where
  $U = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(N)}]$ collects the eigenvectors and $\Lambda = \mathrm{diag}(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{N})$ the eigenvalues.

• Caution: Eigenvectors are not always orthogonal!

• Caution: Not all matrices have an EVD.

Fundamentals of Linear Algebra

• Determinant: $\det(A_{N \times N}) = \prod_{n=1}^{N} \lambda_{n}$

• Trace: $\mathrm{tr}(A_{N \times N}) = \sum \mathrm{diag}(A) = \sum_{n=1}^{N} a_{nn} = \sum_{n=1}^{N} \lambda_{n}$, and $\mathrm{tr}(A_{d \times N} B_{N \times d}) = \mathrm{tr}(B_{N \times d} A_{d \times N})$

• Rank: $\mathrm{rank}(A) = \mathrm{rank}(U \Sigma V^{T})$ = # nonzero diagonal elements of $\Sigma$ = # independent columns of $A$ = # nonzero eigenvalues (square $A$);
  $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$, $\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$

Fundamentals of Linear Algebra

• SVD vs. EVD (for a symmetric positive semi-definite matrix):
  $A = XX^{T} = (U \Sigma V^{T})(U \Sigma V^{T})^{T} = U \Sigma V^{T} V \Sigma^{T} U^{T} = U (\Sigma \Sigma^{T}) U^{T}$,
  so $AU = U(\Sigma \Sigma^{T})$: the left singular vectors $U$ of $X$ are the eigenvectors of $XX^{T}$, with eigenvalues $\lambda_{i} = \sigma_{ii}^{2}$.

• Hermitian matrix: $A^{H} = \overline{A}^{T} = A$ (for a real matrix, simply $A^{T} = A$).

• Hermitian matrices have orthonormal eigenvectors: $A = U \Lambda U^{T}$ with $U^{T}U = UU^{T} = I$, i.e. $U^{T} A U = \Lambda$.
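To make the SVD-vs-EVD connection concrete, a small NumPy check I am adding (arbitrary random data, names are mine): the eigenvectors of the symmetric PSD matrix $XX^{T}$ match the left singular vectors of $X$, and its eigenvalues are the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))             # d x N

U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = X @ X.T                                  # symmetric positive semi-definite

evals, evecs = np.linalg.eigh(A)             # eigh: EVD for symmetric/Hermitian matrices
evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending, matching the SVD

print(np.allclose(evals, s**2))              # eigenvalues = squared singular values
print(np.allclose(np.abs(evecs), np.abs(U))) # eigenvectors = left singular vectors (up to sign)
```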

Dimensionality reduction

• Operation: map high-D $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$ to low-D $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$, with $p \le d$

• Reason:

  Compression

  Knowledge discovery or feature extraction

  Irrelevant and noisy feature removal

  Visualization

  Curse of dimensionality

Dimensionality reduction

• Methods:

  Feature transform: $f: \mathbb{R}^{d} \to \mathbb{R}^{p},\; p \le d,\; \mathbf{x} \mapsto \mathbf{y}$
    linear form: $\mathbf{y} = Q_{d \times p}^{T}\, \mathbf{x}$

  Feature selection: $\mathbf{y} = [x_{s(1)}, x_{s(2)}, \ldots, x_{s(p)}]^{T}$, where $s$ denotes the selected indices

• Criterion:

  Preserving some properties or structures of the high-D feature
  space in the low-D feature space.

  These properties are measured from data.

Dimensionality reduction

• Model:

  Linear projection: $\mathbf{y} = Q_{d \times p}^{T}\, \mathbf{x}$

  Direct re-embedding: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d} \;\rightarrow\; Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$

  Learning a mapping function: $f: \mathbb{R}^{d} \to \mathbb{R}^{p},\; p \le d,\; \mathbf{x} \mapsto \mathbf{y}$

Principal Component Analysis

(PCA)


[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007

[2] R. O. Duda et al., Pattern Classification, 2001

Principal component analysis (PCA)

• PCA:
  $\mathbf{y} = Q^{T}\mathbf{x}$, $\hat{\mathbf{x}} = Q\mathbf{y} = QQ^{T}\mathbf{x} = \sum_{i=1}^{p} y_{i}\, \mathbf{q}^{(i)}$, where
  $Q_{d \times p} = [\mathbf{q}^{(1)}, \mathbf{q}^{(2)}, \ldots, \mathbf{q}^{(p)}]$ and $Q^{T}Q = I_{p \times p}$.

Principal component analysis (PCA)

• Surprising usage: face recognition and encoding

(Figure: a face image encoded as a weighted sum of eigenfaces, e.g. $-2181\,\mathbf{q}^{(1)} + 627\,\mathbf{q}^{(2)} + 389\,\mathbf{q}^{(3)} + \ldots$)

Principal component analysis (PCA)

• PCA is basic yet important and useful:

Easy to train and use

Lots of additional functionalities:

noise reduction, ellipse fitting, ……

• Also named Karhunen-Loeve transform (KL transform)

• Criteria:

Maximum variance (with decorrelation)

Minimum reconstruction error


Principal component analysis (PCA)

• Maximum variance (with decorrelation)

• Minimum reconstruction error


Principal component analysis (PCA)

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

• Preprocessing: centering (the mean can be added back)
  $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}$, $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or in matrix form $X \leftarrow X - \bar{\mathbf{x}}\mathbf{e}^{T} = X(I_{N} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T})$

• Model: $\mathbf{y} = Q^{T}\mathbf{x} \in \mathbb{R}^{p}$, $\hat{\mathbf{x}} = Q\mathbf{y} = QQ^{T}\mathbf{x}$, where $Q$ is $d \times p$ and $Q^{T}Q = I_{p \times p}$ (orthonormal columns)

Maximum variance (with decorrelation)

• The low-D feature vectors should be decorrelated

• Covariance:
  $\mathrm{cov}(x_{1}, x_{2}) = E[(x_{1} - \bar{x}_{1})(x_{2} - \bar{x}_{2})] = \frac{1}{N}\sum_{n=1}^{N} (x_{1}^{(n)} - \bar{x}_{1})(x_{2}^{(n)} - \bar{x}_{2})$

• Covariance matrix:
  $C_{xx} = [\mathrm{cov}(x_{i}, x_{j})]_{d \times d} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}^{(n)} - \bar{\mathbf{x}})(\mathbf{x}^{(n)} - \bar{\mathbf{x}})^{T} = \frac{1}{N} X X^{T}$ (for centered $X$)

Maximum variance (with decorrelation)

• Covariance matrix:
  $\mathrm{cov}(x_{1}, x_{2}) = E[(x_{1} - \bar{x}_{1})(x_{2} - \bar{x}_{2})] = \frac{1}{N}\sum_{n=1}^{N} (x_{1}^{(n)} - \bar{x}_{1})(x_{2}^{(n)} - \bar{x}_{2})$

Maximum variance (with decorrelation)

• Decorrelation:
  $\bar{\mathbf{y}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{y}^{(n)} = \frac{1}{N} Q^{T} \sum_{n=1}^{N}\mathbf{x}^{(n)} = \mathbf{0}$
  $C_{yy} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{y}^{(n)}\mathbf{y}^{(n)T} = \frac{1}{N}\sum_{n=1}^{N} Q^{T}\mathbf{x}^{(n)}\mathbf{x}^{(n)T}Q = Q^{T} C_{xx} Q$, which should be a diagonal matrix

Maximum variance (with decorrelation)

• Maximum variance:
  $Q^{*} = \arg\max_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \| \mathbf{y}^{(n)} \|^{2} = \arg\max_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \| Q^{T}\mathbf{x}^{(n)} \|^{2} = \arg\max_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} Q Q^{T} \mathbf{x}^{(n)}$  ($1 \times 1$)
  $= \arg\max_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \mathrm{tr}\{ Q^{T}\mathbf{x}^{(n)}\mathbf{x}^{(n)T} Q \} = \arg\max_{Q^{T}Q=I} \mathrm{tr}\{ Q^{T} C_{xx} Q \}$  ($p \times p$ trace)

Maximum variance (with decorrelation)

• Optimization problem:
  $Q^{*} = \arg\max_{Q_{d \times p}} \mathrm{tr}\{ Q^{T} C_{xx} Q \}$, subject to $C_{yy} = Q^{T} C_{xx} Q$ diagonal and $Q^{T}Q = I$

• Solution:
  $Q^{*} = [\mathbf{q}^{(1)}, \mathbf{q}^{(2)}, \ldots, \mathbf{q}^{(p)}]$, where $\mathbf{q}^{(i)}$ is the eigenvector of $C_{xx} \in \mathbb{R}^{d \times d}$ with the $i$-th largest eigenvalue;
  $C_{xx} = \frac{1}{N} X X^{T} = \frac{1}{N} U \Sigma V^{T} V \Sigma^{T} U^{T} = U \left( \tfrac{1}{N} \Sigma \Sigma^{T} \right) U^{T}$, and $Q^{*T}Q^{*} = I$

Maximum variance (with decorrelation)

• Proof: assume $p = 1$, so $\mathrm{tr}\{ Q^{T} C_{xx} Q \} = \mathbf{q}^{T} C_{xx} \mathbf{q}$:
  $\mathbf{q}^{*} = \arg\max_{\mathbf{q}^{T}\mathbf{q} = 1} \mathbf{q}^{T} C_{xx} \mathbf{q} = \arg\max_{\mathbf{q}, \lambda} E(\mathbf{q}, \lambda) = \arg\max_{\mathbf{q}, \lambda}\; \mathbf{q}^{T} C_{xx} \mathbf{q} + \lambda (1 - \mathbf{q}^{T}\mathbf{q})$  (Lagrange multiplier)
  Take the partial derivative and set it to 0:
  $\frac{\partial E}{\partial \mathbf{q}} = 2 C_{xx}\mathbf{q} - 2\lambda\mathbf{q} = 0 \;\Rightarrow\; C_{xx}\mathbf{q} = \lambda\mathbf{q}$  ($\mathbf{q}$ is an eigenvector),
  and $\mathbf{q}^{*T} C_{xx} \mathbf{q}^{*} = \lambda$, so $\mathbf{q}^{*}$ is the largest eigenvector (largest $\lambda$) of $C_{xx} = U \Lambda U^{T}$.

Maximum variance (with decorrelation)

• Assume $p = 2$: $\mathrm{tr}\{ Q^{T} C_{xx} Q \} = \mathbf{q}^{(1)T} C_{xx} \mathbf{q}^{(1)} + \mathbf{q}^{T} C_{xx} \mathbf{q}$,
  with the constraints $\mathbf{q}^{T}\mathbf{q} = 1$ and $\mathbf{q}^{T}\mathbf{q}^{(1)} = 0$:
  $\mathbf{q}^{*} = \arg\max \mathbf{q}^{T} C_{xx} \mathbf{q}$ = the second largest eigenvector of $C_{xx} = U \Lambda U^{T}$

• Assume $p = r + 1$ with $\mathbf{q}^{(1)}, \ldots, \mathbf{q}^{(r)}$ already chosen; with
  $\mathbf{q}^{T}\mathbf{q} = 1$ and $\mathbf{q}^{T}\mathbf{q}^{(i)} = 0$ for $i = 1, \ldots, r$:
  $\mathbf{q}^{*} = \arg\max \mathbf{q}^{T} C_{xx} \mathbf{q}$ = the $(r+1)$-th largest eigenvector of $C_{xx} = U \Lambda U^{T}$

Minimum reconstruction error

• Mean square error is preferred:
  $Q^{*} = \arg\min_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \| \mathbf{x}^{(n)} - \hat{\mathbf{x}}^{(n)} \|^{2} = \arg\min_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \| \mathbf{x}^{(n)} - Q Q^{T} \mathbf{x}^{(n)} \|^{2}$
  $= \arg\min_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \left( (I_{d} - Q Q^{T})\mathbf{x}^{(n)} \right)^{T} \left( (I_{d} - Q Q^{T})\mathbf{x}^{(n)} \right) = \arg\min_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} (I_{d} - 2QQ^{T} + QQ^{T}QQ^{T}) \mathbf{x}^{(n)}$
  $= \arg\min_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \left( \mathbf{x}^{(n)T}\mathbf{x}^{(n)} - \mathbf{x}^{(n)T} Q Q^{T} \mathbf{x}^{(n)} \right) = \arg\max_{Q^{T}Q=I} \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)T} Q Q^{T} \mathbf{x}^{(n)} = \arg\max_{Q^{T}Q=I} \mathrm{tr}\{ Q^{T} C_{xx} Q \}$

Algorithm

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

• Preprocessing: centering (the mean can be added back)
  $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}$, $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or $X \leftarrow X(I_{N} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T})$

• Model: $\mathbf{y} = Q^{T}\mathbf{x} \in \mathbb{R}^{p}$, $\hat{\mathbf{x}} = Q\mathbf{y} = QQ^{T}\mathbf{x}$, where $Q$ is $d \times p$ and $Q^{T}Q = I_{p \times p}$ (orthonormal columns)

Algorithm

• Algorithm 1: (EVD)

  1. $C_{xx} = U \Lambda U^{T}$, where the $\lambda_{i}$ in $\Lambda$ are in descending order

  2. $Q = U I_{d \times p} = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(p)}]$ (the first $p$ columns of $U$)

• Algorithm 2: (SVD)

  1. $X = U \Sigma V^{T}$, where the $\sigma_{i}$ in $\Sigma$ are in descending order

  2. $Q = U I_{d \times p} = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(p)}]$ (the first $p$ columns of $U$)
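A minimal NumPy sketch of Algorithm 2 (PCA via SVD of the centered data), assuming a data matrix with one sample per column as in the slides; the helper names and the toy data are my own.

```python
import numpy as np

def pca_fit(X, p):
    """X: d x N data matrix (one sample per column). Returns (Q, mean)."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                           # centering (the mean can be added back)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Q = U[:, :p]                            # first p left singular vectors, Q^T Q = I_p
    return Q, mean

def pca_project(X, Q, mean):
    return Q.T @ (X - mean)                 # y = Q^T (x - mean)

def pca_reconstruct(Y, Q, mean):
    return Q @ Y + mean                     # x_hat = Q y + mean

# toy usage with made-up data
rng = np.random.default_rng(2)
X = rng.standard_normal((10, 200))
Q, mean = pca_fit(X, p=3)
Y = pca_project(X, Q, mean)
X_hat = pca_reconstruct(Y, Q, mean)
print(Y.shape, np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```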

Illustration

• What is PCA doing:

(Figure: a 2-D data cloud with its two principal directions drawn.)

Summary

• PCA exploits 2nd-order statistical properties measured
  in the data (simple and not vulnerable to over-fitting)

• Usually used as a "preprocessing step" in applications

• Rank:
  $C_{xx} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}^{(n)}\mathbf{x}^{(n)T} = \frac{1}{N} X X^{T} = U \Lambda U^{T}$,
  $\mathrm{rank}(C_{xx}) \le N - 1$ in general (for centered data), so $p \le N - 1$

Optimization problem

• Convex or not?
  $\mathbf{q}^{*} = \arg\max_{\mathbf{q}} \mathbf{q}^{T} C_{xx} \mathbf{q}$, s.t. $\mathbf{q}^{T}\mathbf{q} = 1$
  (1) $C_{xx} = \frac{1}{N} X X^{T} = U \Lambda U^{T}$ is positive semi-definite
  (2) $\mathbf{q}^{T}\mathbf{q} = 1$ is a quadratic equality constraint

• Convex or not?
  $\mathbf{q}^{*} = \arg\min_{\mathbf{q}} \mathbf{q}^{T} C_{xx} \mathbf{q}$, s.t. $\mathbf{q}^{T}\mathbf{q} = 1$
  (1) $C_{xx} = \frac{1}{N} X X^{T} = U \Lambda U^{T}$ is positive semi-definite
  (2) $\mathbf{q}^{T}\mathbf{q} = 1$ is a quadratic equality constraint

Examples

• Active appearance model: $\hat{\mathbf{x}} = Q\mathbf{y} = QQ^{T}\mathbf{x}$

(Figures from [8].)

Linear Discriminant Analysis

(LDA)


[2] R. O. Duda et al., Pattern Classification, 2001

[3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997

Linear discriminant analysis (LDA)

• PCA is unsupervised

• LDA takes the label information into consideration

• The resulting low-D features are efficient for discrimination.

Linear discriminant analysis (LDA)

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$,
  with labels $label(\mathbf{x}^{(n)}) \in L = \{ l_{1}, l_{2}, \ldots, l_{c} \}$

• Model: $\mathbf{y} = Q^{T}\mathbf{x} \in \mathbb{R}^{p}$, where $Q$ is $d \times p$

• Notation:
  $X_{i} = \{ \mathbf{x}^{(n)} \mid label(\mathbf{x}^{(n)}) = l_{i} \}$, $N_{i}$ = # samples in $X_{i}$
  class mean: $\boldsymbol{\mu}_{i} = \frac{1}{N_{i}} \sum_{\mathbf{x}^{(n)} \in X_{i}} \mathbf{x}^{(n)}$
  total mean: $\boldsymbol{\mu} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}^{(n)}$
  between-class scatter: $S_{B} = \sum_{i=1}^{c} N_{i}\, (\boldsymbol{\mu}_{i} - \boldsymbol{\mu})(\boldsymbol{\mu}_{i} - \boldsymbol{\mu})^{T}$
  within-class scatter: $S_{W} = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_{i}} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})^{T}$

Linear discriminant analysis (LDA)

• Properties of the scatter matrices:
  $S_{B} = \sum_{i=1}^{c} N_{i}\, (\boldsymbol{\mu}_{i} - \boldsymbol{\mu})(\boldsymbol{\mu}_{i} - \boldsymbol{\mu})^{T}$ measures the inter-class separation
  $S_{W} = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_{i}} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})^{T}$ measures the intra-class tightness

• Scatter matrices in low-D:
  between-class: $\sum_{i=1}^{c} N_{i}\, (Q^{T}\boldsymbol{\mu}_{i} - Q^{T}\boldsymbol{\mu})(Q^{T}\boldsymbol{\mu}_{i} - Q^{T}\boldsymbol{\mu})^{T} = Q^{T} S_{B}\, Q$
  within-class: $\sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_{i}} (Q^{T}\mathbf{x}^{(n)} - Q^{T}\boldsymbol{\mu}_{i})(Q^{T}\mathbf{x}^{(n)} - Q^{T}\boldsymbol{\mu}_{i})^{T} = Q^{T} S_{W}\, Q$


Criterion and algorithm

• Criterion of LDA:
  Maximize the ratio of $Q^{T} S_{B} Q$ to $Q^{T} S_{W} Q$ "in some sense"

• Determinant and trace are suitable scalar measures:
  $Q^{*} = \arg\max_{Q} \frac{| Q^{T} S_{B} Q |}{| Q^{T} S_{W} Q |}$ or $\arg\max_{Q} \frac{\mathrm{tr}( Q^{T} S_{B} Q )}{\mathrm{tr}( Q^{T} S_{W} Q )}$

• With the Rayleigh quotient: if $S_{B}$ and $S_{W}$ are both symmetric positive semi-definite and $S_{W}$ is nonsingular,
  $Q^{*} = [\mathbf{u}^{(1)}, \mathbf{u}^{(2)}, \ldots, \mathbf{u}^{(p)}]$, where the $\mathbf{u}^{(i)}$ are the generalized eigenvectors,
  $S_{B}\, \mathbf{u}^{(i)} = \lambda_{i}\, S_{W}\, \mathbf{u}^{(i)}$, with the $\lambda_{i}$ in descending order.
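A NumPy sketch of this criterion, assuming labeled data with one sample per column; it solves $S_{W}^{-1} S_{B} \mathbf{u} = \lambda \mathbf{u}$ directly, so it presumes $S_{W}$ is nonsingular (see the PCA+LDA fix two slides below). The function name and toy data are mine.

```python
import numpy as np

def lda_fit(X, labels, p):
    """X: d x N, labels: length-N array. Returns Q (d x p), assuming S_W is nonsingular."""
    d, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xi = X[:, labels == c]
        Ni = Xi.shape[1]
        mu_i = Xi.mean(axis=1, keepdims=True)
        S_B += Ni * (mu_i - mu) @ (mu_i - mu).T       # between-class scatter
        S_W += (Xi - mu_i) @ (Xi - mu_i).T            # within-class scatter
    # generalized problem S_B u = lambda S_W u, solved as an EVD of S_W^{-1} S_B
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(evals.real)[::-1]              # descending eigenvalues
    return evecs[:, order[:p]].real                   # only p <= c - 1 useful directions

# toy usage with two made-up Gaussian classes
rng = np.random.default_rng(3)
X = np.hstack([rng.normal(0, 1, (5, 100)), rng.normal(2, 1, (5, 100))])
labels = np.array([0] * 100 + [1] * 100)
Q = lda_fit(X, labels, p=1)
print((Q.T @ X).shape)
```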

Note and Problem

• Note:
  $S_{B}\, \mathbf{u}^{(i)} = \lambda_{i}\, S_{W}\, \mathbf{u}^{(i)} \;\Rightarrow\; S_{W}^{-1} S_{B}\, \mathbf{u}^{(i)} = \lambda_{i}\, \mathbf{u}^{(i)}$
  $\mathrm{rank}(S_{B}) \le c - 1$, so there are at most $c - 1$ nonzero $\lambda_{i}$, hence $p \le c - 1$

• Problem:
  $\mathrm{rank}(S_{W}) \le N - c$, and $S_{W}$ is $d \times d$;
  if $\mathrm{rank}(S_{W}) < d$, then $S_{W}$ is singular and the Rayleigh quotient is useless

Solution

• Problem: $S_{W} = \sum_{i=1}^{c} \sum_{\mathbf{x}^{(n)} \in X_{i}} (\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})(\mathbf{x}^{(n)} - \boldsymbol{\mu}_{i})^{T}$ is singular

• Solution:

  PCA+LDA:
  1. Perform PCA on $X$: $\mathbf{x}^{(n)} \to Q_{PCA}^{T}\, \mathbf{x}^{(n)} \in \mathbb{R}^{N-c}$ (with $Q_{PCA}$ of size $d \times (N-c)$)
  2. Compute $S_{W}$ of size $(N-c) \times (N-c)$; if it is nonsingular, the problem is solved
  3. For new samples: $\mathbf{y} = Q_{LDA}^{T}\, Q_{PCA}^{T}\, \mathbf{x}$

  Null-space:
  1. $Q^{*} = \arg\max_{Q} | Q^{T} S_{B} Q |$; find $Q$ that makes $Q^{T} S_{W} Q = 0$
  2. Extract the columns of $Q^{*}$ from the null space of $S_{W}$

Example

(Figure from [3].)

Topology, Manifold, and

Embedding


[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007

Topology

• Geometrical point of view

If two or more features are latently dependent, their joint

distribution does not span the whole feature space.

The dependence induces some structures (object) in the

feature space.

$(x_{1}, x_{2}) = \mathbf{g}(s),\; a \le s \le b$

(Figure: a curve running from $\mathbf{g}(a)$ to $\mathbf{g}(b)$ in the feature space.)

Topology

• Topology:

Allowed: Deformation, twisting, and stretching

Not allowed: Tearing

A topological object is characterized by its properties and structures

A topological object (space) is represented (embedded) as a
spatial object in the feature space.

Topology abstracts the intrinsic structure, but ignores the
details of the spatial object.

Ex: a circle and an ellipse are topologically homeomorphic

Manifold

• Feature space: dimensionality + structure

• Neighborhood: $Nei(\mathbf{x}) = $ ball $B_{\varepsilon}(\mathbf{x}) = \{ \mathbf{x}' \in \mathbb{R}^{d} : \| \mathbf{x}' - \mathbf{x} \|_{L_{2}} \le \varepsilon \}$

• A topological space can be characterized by neighborhoods

• A manifold is a locally Euclidean topological space

• Euclidean space: $dis(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = \| \mathbf{x}^{(1)} - \mathbf{x}^{(2)} \|_{L_{2}}$ is meaningful

• In general, any spatial object that is nearly flat on a small
  scale is a manifold

Manifold

(Figure from [5]: a 2-D manifold embedded in 3-D; the space is non-Euclidean along the manifold.)

Embedding

• Embedding:

  An embedding is a representation of a topological object (e.g. a
  manifold or a graph) in a certain feature space, in such a way that the
  topological properties are preserved.

  A smooth manifold is differentiable and has a "functional
  structure" linking the features with the latent variables.

  The dimensionality of a manifold is the number of latent variables.

  A k-manifold can be embedded in any d-dimensional space
  with d equal to or larger than 2k+1.

Manifold learning

• Manifold learning:

Recover the original embedding function from data.

• Dimensionality reduction with the manifold property:

Re-embed a k-manifold in a d-dimensional space into a p-
dimensional space with d > p

(Figure: latent variables $s$, the d-dimensional space, and the p-dimensional space, connected by the mappings $g_{1}(s)$, $g_{2}(s)$, $h(s)$, and $f(s)$.)

Example

• A curve in 3-D: $(x_{1}, x_{2}, x_{3}) = \mathbf{g}_{1}(s),\; a \le s \le b$

• Latent variable: $s \in [a, b]$

• Re-embedding into 2-D: $(x_{1}, x_{2}) = \mathbf{g}_{2}(s),\; a \le s \le b$

• Re-embedding map: $f: \mathbf{g}_{1}(s) \to \mathbf{g}_{2}(s)$

(Figure: the curve from $\mathbf{g}_{1}(a)$ to $\mathbf{g}_{1}(b)$ is re-embedded as the curve from $\mathbf{g}_{2}(a)$ to $\mathbf{g}_{2}(b)$.)

Manifold learning

• Properties to preserve:

  Isometric embedding: distance preserving, $dis(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}) = dis(\mathbf{y}^{(1)}, \mathbf{y}^{(2)})$

  Conformal embedding: angle preserving, $angle(\mathbf{x}^{(1)} - \mathbf{x}^{(3)}, \mathbf{x}^{(2)} - \mathbf{x}^{(3)}) = angle(\mathbf{y}^{(1)} - \mathbf{y}^{(3)}, \mathbf{y}^{(2)} - \mathbf{y}^{(3)})$

  Topological embedding: neighbor / local preserving

• Input space: locally Euclidean

• Output space: user defined

Multidimensional Scaling

(MDS)


[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007

Multidimensional Scaling (MDS)

• Distance preserving: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = dis(\mathbf{y}^{(i)}, \mathbf{y}^{(j)})$

• Scaling refers to constructing a configuration of samples
  in a target metric space from information about inter-
  point distances

(Figure: points known only through pairwise distances such as 10, 4, 9, 8.5, 6.5, with one unknown distance to recover.)

Multidimensional Scaling (MDS)

• MDS: a scaling where the target space is Euclidean

• Here we discuss classical metric MDS

• Metric MDS indeed preserves pairwise inner products
  rather than pairwise distances

• Metric MDS is unsupervised

Multidimensional Scaling (MDS)

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

• Preprocessing: centering (the mean can be added back)
  $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}^{(n)}$, $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \bar{\mathbf{x}}$, or $X \leftarrow X(I_{N} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T})$

• Model: $f: \mathbb{R}^{d} \to \mathbb{R}^{p},\; p \le d,\; \mathbf{x} \mapsto \mathbf{y}$; there is no explicit $Q$ to train

Criterion

• Inner product (scalar product): $s_{X}(i,j) = \langle \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \rangle = (\mathbf{x}^{(i)})^{T}\mathbf{x}^{(j)}$

• Gram matrix: records all pairwise inner products,
  $S = [s_{X}(i,j)]_{1 \le i,j \le N} = X^{T}X$
  (compare the covariance matrix $C_{xx} = \frac{1}{N}XX^{T}$ for centered $X$)

• Usually, we only know $S$, but not $X$

Criterion

• Criterion 1:
  $Y^{*} = \arg\min_{Y_{p \times N}} \sum_{i=1}^{N}\sum_{j=1}^{N} \left( s_{X}(i,j) - (\mathbf{y}^{(i)})^{T}\mathbf{y}^{(j)} \right)^{2} = \arg\min_{Y} \| S - Y^{T}Y \|_{F}^{2}$,
  where $\|\cdot\|_{F}$ is the matrix (Frobenius) norm: $\|A\|_{F} = \big(\sum_{i,j} a_{ij}^{2}\big)^{1/2} = \big(\mathrm{tr}(A^{T}A)\big)^{1/2}$

• Criterion 2:
  $X^{T}X = S \approx Y^{T}Y$, with $Y = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(N)}]$

Algorithm

• Rank: (assume $N > d$)
  $\mathrm{rank}(X^{T}X) \le \min(N, d)$, $\mathrm{rank}(Y^{T}Y) \le \min(N, p) = p$

• Low-rank approximation:
  Given $A \in \mathbb{R}^{d \times N}$ with $\mathrm{rank}(A) = r$ and SVD $A = U \Sigma V^{T}$,
  $B^{*} = \arg\min_{B,\ \mathrm{rank}(B) \le k} \| A - B \|_{F} = U \begin{bmatrix} \Sigma_{k} & O \\ O & O \end{bmatrix} V^{T}$ for $k \le r$,
  i.e. keep only the $k$ largest singular values (and the corresponding columns of $U$ and $V$).

Algorithm

• EVD: ($S$ is a Hermitian matrix)
  $S = X^{T}X = (U\Sigma V^{T})^{T}(U\Sigma V^{T}) = V \Sigma^{T}\Sigma V^{T} = V \Lambda V^{T}$

• Solution: keep the $p$ largest eigenvalues,
  $Y^{T}Y \approx V \begin{bmatrix} \Lambda_{p} & O \\ O & O \end{bmatrix} V^{T} \;\Rightarrow\; Y^{*} = R\, \Lambda_{p}^{1/2}\, V_{p}^{T} = R\, I_{p \times N}\, \Lambda^{1/2}\, V^{T}$,
  where $R$ is an arbitrary orthonormal (unitary) matrix accounting for rotation.
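A minimal NumPy sketch of this solution, assuming the Gram matrix of centered data is available and dropping the rotation $R$ (as on the next slide); the helper name is hypothetical.

```python
import numpy as np

def mds_from_gram(S, p):
    """S: N x N Gram matrix (X^T X of centered data). Returns Y: p x N."""
    evals, V = np.linalg.eigh(S)                # S is symmetric PSD
    evals, V = evals[::-1], V[:, ::-1]          # descending order
    lam = np.clip(evals[:p], 0.0, None)         # guard tiny negative values from round-off
    return np.sqrt(lam)[:, None] * V[:, :p].T   # Y = Lambda_p^{1/2} V_p^T

# toy usage: build S from made-up centered data and embed into 2-D
rng = np.random.default_rng(4)
X = rng.standard_normal((5, 50))
X -= X.mean(axis=1, keepdims=True)
Y = mds_from_gram(X.T @ X, p=2)
print(Y.shape)                                  # (2, 50)
```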

PCA vs. MDS

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$, SVD: $X = U\Sigma V^{T}$

• PCA: EVD on the covariance matrix
  $C_{xx} = \frac{1}{N}XX^{T} = \frac{1}{N}U\Sigma\Sigma^{T}U^{T}$, $Y_{PCA} = Q_{PCA}^{T}X = I_{p \times d}\,U^{T}X$

• MDS: EVD on the Gram matrix
  $S_{MDS} = X^{T}X = V\Sigma^{T}\Sigma V^{T}$, $Y_{MDS} = I_{p \times N}\,(\Sigma^{T}\Sigma)^{1/2}\,V^{T}$

PCA vs. MDS

• Discard the rotation term; with some derivations:
  $Y_{MDS} = I_{p \times N}\,\Lambda^{1/2}\,V^{T} = I_{p \times N}\,(\Sigma^{T}\Sigma)^{1/2}\,V^{T}$
  $Y_{PCA} = I_{p \times d}\,U^{T}X = I_{p \times d}\,U^{T}U\Sigma V^{T} = I_{p \times d}\,\Sigma V^{T}$
  so the two embeddings coincide.

• Comparison

  PCA: EVD on the $d \times d$ matrix $C_{xx} = \frac{1}{N}XX^{T}$

  MDS: EVD on the $N \times N$ matrix $S = X^{T}X$

  SVD: SVD on the $d \times N$ matrix $X$

For test data

• Model: $\mathbf{y} = Q^{T}\mathbf{x}$, $\hat{\mathbf{x}} = Q\mathbf{y}$ (generative view)
  Use $Q = U I_{d \times p}$ from PCA for convenience

• For a newly arriving test sample $\mathbf{x}$:
  $\mathbf{s} = X^{T}\mathbf{x} = V\Sigma^{T}U^{T}\mathbf{x}$ (the inner products with the training samples)
  $\Rightarrow\; \Lambda^{-1/2}V^{T}\mathbf{s} = (\Sigma^{T}\Sigma)^{-1/2}\Sigma^{T}U^{T}\mathbf{x}$

• Finally: $\mathbf{y} = I_{p \times N}\,\Lambda^{-1/2}\,V^{T}\mathbf{s}$ (with $Q = U I_{d \times p}$)

MDS with pairwise distance

• How about a training set with pairwise distance?

$D = [d_{ij}]_{1 \le i,j \le N} = [dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]$, with no $X$ and no $S$ available

(Figure: points known only through pairwise distances such as 10, 4, 9, 8.5, 6.5, with one distance to recover.)

Distance metric

• A distance "metric" satisfies:

  Nonnegative: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \ge 0$, and $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = 0$ iff $\mathbf{x}^{(i)} = \mathbf{x}^{(j)}$

  Symmetric: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = dis(\mathbf{x}^{(j)}, \mathbf{x}^{(i)})$

  Triangular: $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \le dis(\mathbf{x}^{(i)}, \mathbf{x}^{(k)}) + dis(\mathbf{x}^{(k)}, \mathbf{x}^{(j)})$

• Minkowski distance (order $p$): $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \left( \sum_{k=1}^{d} | x_{k}^{(i)} - x_{k}^{(j)} |^{p} \right)^{1/p}$
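For concreteness, a one-function NumPy sketch of the Minkowski distance above (my own helper, not from the slides):

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance of order p between two vectors (p=2: Euclidean, p=1: Manhattan)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1), minkowski(x, y, p=2))   # 5.0, ~3.606
```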

Distance metric

• (Training) data set: $D = [d_{ij}]_{1 \le i,j \le N} = [dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]$, with no $X$ and no $S$

• Euclidean distance and inner product:
  $dis(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \|_{L_{2}} = \left( \sum_{k=1}^{d} (x_{k}^{(i)} - x_{k}^{(j)})^{2} \right)^{1/2}$
  $dis^{2}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = (\mathbf{x}^{(i)} - \mathbf{x}^{(j)})^{T}(\mathbf{x}^{(i)} - \mathbf{x}^{(j)}) = s_{X}(i,i) - 2\,s_{X}(i,j) + s_{X}(j,j)$
  $\Rightarrow\; s_{X}(i,j) = \frac{1}{2}\left\{ s_{X}(i,i) + s_{X}(j,j) - dis^{2}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \right\}$

Distance to inner product

• Define the squared distance matrix: $D_{2} = [d_{ij}^{2}]_{1 \le i,j \le N} = [dis^{2}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})]$

• Double centering (for centered data):
  $S_{X} = -\frac{1}{2}\left( D_{2} - \frac{1}{N} D_{2}\mathbf{e}\mathbf{e}^{T} - \frac{1}{N} \mathbf{e}\mathbf{e}^{T}D_{2} + \frac{1}{N^{2}} \mathbf{e}\mathbf{e}^{T}D_{2}\mathbf{e}\mathbf{e}^{T} \right)$
  $s_{X}(i,j) = -\frac{1}{2}\left( d_{ij}^{2} - \frac{1}{N}\sum_{k} d_{ik}^{2} - \frac{1}{N}\sum_{m} d_{mj}^{2} + \frac{1}{N^{2}}\sum_{m}\sum_{k} d_{mk}^{2} \right)$

Proof

• Proof: assume the data are centered, $\sum_{m=1}^{N} \mathbf{x}^{(m)} = \mathbf{0}$. Then
  $\frac{1}{N}\sum_{m=1}^{N} d_{mj}^{2} = \frac{1}{N}\sum_{m=1}^{N} \left( s_{X}(m,m) - 2\,s_{X}(m,j) + s_{X}(j,j) \right) = \frac{1}{N}\sum_{m=1}^{N} \| \mathbf{x}^{(m)} \|^{2} + \| \mathbf{x}^{(j)} \|^{2}$,
  because $\sum_{m} s_{X}(m,j) = \left( \sum_{m} \mathbf{x}^{(m)} \right)^{T}\mathbf{x}^{(j)} = 0$.
  Similarly, $\frac{1}{N}\sum_{k=1}^{N} d_{ik}^{2} = \frac{1}{N}\sum_{k=1}^{N} \| \mathbf{x}^{(k)} \|^{2} + \| \mathbf{x}^{(i)} \|^{2}$.

Proof

• Proof (continued):
  $\frac{1}{N^{2}}\sum_{m=1}^{N}\sum_{k=1}^{N} d_{mk}^{2} = \frac{2}{N}\sum_{m=1}^{N} \| \mathbf{x}^{(m)} \|^{2}$

• Finally, substituting into the double-centering formula:
  $-\frac{1}{2}\left( d_{ij}^{2} - \frac{1}{N}\sum_{k} d_{ik}^{2} - \frac{1}{N}\sum_{m} d_{mj}^{2} + \frac{1}{N^{2}}\sum_{m}\sum_{k} d_{mk}^{2} \right) = (\mathbf{x}^{(i)})^{T}\mathbf{x}^{(j)} = s_{X}(i,j)$

Algorithm

• Given X:

Get S, perform MDS

• Given S:

Perform MDS

• Given D:

  Square each entry of D (to obtain the squared-distance matrix D₂)

  Perform double centering

  Perform MDS
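A sketch of the "given D" branch in NumPy, assuming D holds pairwise Euclidean distances: square the entries, double-center, and reuse the Gram-matrix MDS step (e.g. the `mds_from_gram` helper sketched earlier). Names and the toy data are mine.

```python
import numpy as np

def gram_from_distances(D):
    """D: N x N matrix of pairwise distances. Returns the double-centered Gram matrix."""
    D2 = D ** 2                                  # squared-distance matrix
    N = D.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix H = I - ee^T / N
    return -0.5 * H @ D2 @ H                     # S = -1/2 H D2 H

# toy usage: distances from made-up 2-D points; recover a 2-D configuration
rng = np.random.default_rng(5)
X = rng.standard_normal((2, 30))
D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)   # pairwise Euclidean distances
S = gram_from_distances(D)
evals, V = np.linalg.eigh(S)
Y = np.sqrt(np.clip(evals[::-1][:2], 0, None))[:, None] * V[:, ::-1][:, :2].T
print(np.allclose(np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0), D))  # distances preserved
```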

Summary

• Metric MDS preserves pairwise inner products instead
  of pairwise distances

• It preserves linear properties

• Extension:

  Sammon's nonlinear mapping: $E_{NLM} = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{\left( dis_{X}(i,j) - dis_{Y}(i,j) \right)^{2}}{dis_{X}(i,j)}$

  Curvilinear component analysis (CCA): $E_{CCA} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \left( dis_{X}(i,j) - dis_{Y}(i,j) \right)^{2}\, h\big( dis_{Y}(i,j) \big)$

From Linear To Nonlinear


Linear

• PCA, LDA, MDS are linear:

  Matrix operation

  Linear properties (sum, scaling, commutativity, …)

• Inner product, covariance:
  $\langle \mathbf{x}^{(i)} + \mathbf{x}^{(j)}, \mathbf{x}^{(k)} \rangle = \langle \mathbf{x}^{(i)}, \mathbf{x}^{(k)} \rangle + \langle \mathbf{x}^{(j)}, \mathbf{x}^{(k)} \rangle$,
  i.e. $(\mathbf{x}^{(k)})^{T}(\mathbf{x}^{(i)} + \mathbf{x}^{(j)}) = (\mathbf{x}^{(k)})^{T}\mathbf{x}^{(i)} + (\mathbf{x}^{(k)})^{T}\mathbf{x}^{(j)}$

• Assumption on the original feature space:

  Euclidean, or Euclidean with "rotation and scaling"

Problem

• If there exists structure in the feature space:

$(x_{1}, x_{2}, x_{3}) = \mathbf{g}_{1}(s),\; a \le s \le b$

(Figure: the curve from $\mathbf{g}_{1}(a)$ to $\mathbf{g}_{1}(b)$; a linear projection makes the structure collapse, labeled "crashed".)

Manifold way

• Assumption:

The latent space is nonlinearly embedding in the feature space

The latent space is a manifold, and so is the feature space

The feature space is locally smooth and Euclidean

• Local geometry or property:

Distance preserving: ISOMAP

Neighborhood (topology) preserving (LLE)

Locality (topology) preserving (LE)

• Caution:

These properties and structures are measured in the feature
space.

Isometric Feature Mapping

(ISOMAP)

[4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000

ISOMAP

• Distance metric in feature space: Geodesic distance

• How to measure:

Small scale: Euclidean distance in $\mathbb{R}^{d}$

Large scale: shortest path in a connected graph

• The space to re-embed:

p-dimensional Euclidean space

After we get the pairwise distances, we can embed them in many
kinds of spaces.

Graph

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

  Vertices: each sample $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}$ is a vertex (assume the samples are placed in order)

Small scale

• Small scale: Euclidean; large scale: graph distance

  Vertices + edges (assume the samples are placed in order)

  $dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) = \sum_{i=1}^{N-1} \| \mathbf{x}^{(i)} - \mathbf{x}^{(i+1)} \|_{L_{2}}$

Distance metric

• MDS: distance preserving

  Vertices + edges (assume the samples are placed in order)

  $dis(\mathbf{y}^{(1)}, \mathbf{y}^{(N)}) = \| \mathbf{y}^{(1)} - \mathbf{y}^{(N)} \|_{L_{2}} \;\approx\; dis(\mathbf{x}^{(1)}, \mathbf{x}^{(N)}) = \sum_{i=1}^{N-1} \| \mathbf{x}^{(i)} - \mathbf{x}^{(i+1)} \|_{L_{2}}$

Algorithm

• Presetting:

  Define the distance matrix $D_{N \times N} = [d_{ij}]_{1 \le i,j \le N}$

  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$ (undirected)

• (1) Geodesic distance in the neighborhood:

  for i = 1 : N
    for j = 1 : N
      if ($\mathbf{x}^{(j)} \in Nei(i)$ and $i \ne j$)
        $d_{ij} = \| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \|_{L_{2}}$
      end
    end
  end
  (entries for non-neighbor pairs are kept at $\infty$)

Algorithm

• (1) Geodesic distance in the neighborhood:

  Neighbor definitions:

  ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\| \mathbf{x}^{(j)} - \mathbf{x}^{(i)} \|_{L_{2}} \le \varepsilon$

  K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$

• (2) Geodesic distance on a large scale: (shortest path) Floyd's algorithm

  for each pair (i, j)
    for k = 1 : N
      $d_{ij} = \min\{ d_{ij},\; d_{ik} + d_{kj} \}$
    end
  end
  Run several rounds until convergence.

Algorithm

• (3) MDS:

  Transfer pairwise distances into inner products:
  $S = -\frac{1}{2} H D_{2} H$, where $H = I_{N} - \frac{1}{N}\mathbf{e}\mathbf{e}^{T}$ (for centering)

  EVD: $S = U \Lambda U^{T}$ ($S$ is Hermitian), $Y = I_{p \times N}\, \Lambda^{1/2}\, U^{T}$

  Proof:
  $-\frac{1}{2} H D_{2} H = -\frac{1}{2} \big( I_{N} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T} \big) D_{2} \big( I_{N} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T} \big)
  = -\frac{1}{2}\left( D_{2} - \tfrac{1}{N}\mathbf{e}\mathbf{e}^{T}D_{2} - \tfrac{1}{N}D_{2}\mathbf{e}\mathbf{e}^{T} + \tfrac{1}{N^{2}}\mathbf{e}\mathbf{e}^{T}D_{2}\mathbf{e}\mathbf{e}^{T} \right) = S$
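Putting the three steps together, a compact NumPy sketch of ISOMAP under the slides' description (K-NN graph, Floyd-style shortest paths, then MDS on the geodesic distances). The parameter values, names, and toy data are my own, and the O(N³) shortest-path loop is only meant for small N.

```python
import numpy as np

def isomap(X, p=2, K=8):
    """X: d x N data matrix. Returns Y: p x N."""
    N = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)   # Euclidean distances

    # (1) keep only distances to the K nearest neighbors (symmetrized), others -> infinity
    G = np.full((N, N), np.inf)
    nn = np.argsort(D, axis=1)[:, 1:K + 1]
    for i in range(N):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.minimum(G, G.T)
    np.fill_diagonal(G, 0.0)

    # (2) geodesic distances: Floyd-Warshall shortest paths
    for k in range(N):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])

    # (3) classical MDS on the geodesic distances
    H = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * H @ (G ** 2) @ H
    evals, V = np.linalg.eigh(S)
    lam = np.clip(evals[::-1][:p], 0, None)
    return np.sqrt(lam)[:, None] * V[:, ::-1][:, :p].T

# toy usage on a made-up noisy circle (a 1-D manifold in 2-D)
rng = np.random.default_rng(6)
t = np.sort(rng.uniform(0, 2 * np.pi, 200))
X = np.vstack([np.cos(t), np.sin(t)]) + 0.01 * rng.standard_normal((2, 200))
Y = isomap(X, p=2, K=10)
print(Y.shape)
```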

Example

• Swiss roll: (figure from [4])

Example

• Swiss roll, 350 points: MDS vs. ISOMAP embeddings (figure from [1])

Example

(Figure from [4].)

Summary

• Compared to MDS, ISOMAP has the ability to discover
  the underlying structure (latent variables) which is
  nonlinearly embedded in the feature space

• It is a "global method", which preserves all pairwise
  distances.

• The Euclidean-space assumption in the low-D space
  implies a convexity property, which sometimes fails.

Locally Linear Embedding

(LLE)

[5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000

[6] L. K. Saul et al., Think globally, fit locally, 2003

LLE

• Neighborhood preserving:

Based on the fundamental manifold properties.

Preserve the local geometry of each sample and its neighbors.

Ignore the global geometry in large scale

• Assumption:

Well-sampled with sufficient data.

Each sample and its neighbors lie on or close to a locally
linear patch (sub-plane) of the manifold.

LLE

• Properties:

Local geometry is characterized by linear coefficients that

reconstruct each sample from its neighbors

These coefficients are robust to RST: rotation, scaling, and

translation.

• Re-embedding:

Assume the target space is locally smooth (manifold)

Locally Euclidean, but not necessary in large scale

Reconstruction coefficients are still meaningful

Stick local patches onto the low-D global coordinates

LLE

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

Neighborhood properties

• Linear reconstruction coefficients:


Re-embedding

• Local patches into global coordinate:


Illustration

(Figure from [5].)

Algorithm

• Presetting:

  Define the weight matrix $W_{N \times N} = [w_{ij} \ge 0]_{1 \le i,j \le N}$, whose $i$-th row is $(\mathbf{w}^{(i)})^{T}$

• (1) Find the neighbors of each sample:

  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$ (undirected)

  ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\| \mathbf{x}^{(j)} - \mathbf{x}^{(i)} \|_{L_{2}} \le \varepsilon$

  K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ (or, if symmetrized, $\mathbf{x}^{(i)} \in KNN(j)$)

Algorithm

• (2) Linear reconstruction coefficients:

  Objective function:
  $\min_{W} E(W) = \min_{W} \sum_{i=1}^{N} \left\| \mathbf{x}^{(i)} - \sum_{j=1}^{N} w_{ij}\, \mathbf{x}^{(j)} \right\|_{L_{2}}^{2} = \min_{W} \sum_{i=1}^{N} \left\| \mathbf{x}^{(i)} - X \mathbf{w}^{(i)} \right\|_{L_{2}}^{2}$

  Constraints (for RST invariance):
  for all $i$: $w_{ij} = 0$ if $\mathbf{x}^{(j)} \notin Nei(i)$, and $\sum_{j=1}^{N} w_{ij} = 1$

  Translation-invariance check: if $\mathbf{x}^{(j)} \to \mathbf{x}^{(j)} + \mathbf{x}^{(0)}$ for all $j$, then
  $(\mathbf{x}^{(i)} + \mathbf{x}^{(0)}) - \sum_{j} w_{ij} (\mathbf{x}^{(j)} + \mathbf{x}^{(0)}) = \mathbf{x}^{(i)} - \sum_{j} w_{ij}\, \mathbf{x}^{(j)}$ whenever $\sum_{j} w_{ij} = 1$

Algorithm

• (2) Linear reconstruction coefficients: (for each sample)

  Define the neighbor index set of $\mathbf{x}^{(i)}$ and let $X^{(i)} = [\mathbf{x}^{(h_{1})}, \ldots, \mathbf{x}^{(h_{m})}]$ collect its $m = |Nei(i)|$ neighbors; $\mathbf{w}^{(i)} \in \mathbb{R}^{m}$ holds the nonzero weights, with $\mathbf{1}^{T}\mathbf{w}^{(i)} = 1$.

  $E^{(i)}(\mathbf{w}^{(i)}) = \left\| \mathbf{x}^{(i)} - X^{(i)}\mathbf{w}^{(i)} \right\|_{L_{2}}^{2}
  = \left\| \big( \mathbf{x}^{(i)}\mathbf{1}^{T} - X^{(i)} \big)\, \mathbf{w}^{(i)} \right\|_{L_{2}}^{2}
  = \mathbf{w}^{(i)T} C^{(i)} \mathbf{w}^{(i)}$,

  where $C^{(i)} = \big( \mathbf{x}^{(i)}\mathbf{1}^{T} - X^{(i)} \big)^{T}\big( \mathbf{x}^{(i)}\mathbf{1}^{T} - X^{(i)} \big)$ is the local covariance (Gram) matrix.

Algorithm

• (2) Linear reconstruction coefficients:

  With a Lagrange multiplier for the constraint $\mathbf{1}^{T}\mathbf{w} = 1$:
  $\frac{\partial}{\partial \mathbf{w}} \left( \mathbf{w}^{T} C \mathbf{w} + \lambda (1 - \mathbf{1}^{T}\mathbf{w}) \right) = 2 C \mathbf{w} - \lambda \mathbf{1} = 0 \;\Rightarrow\; \mathbf{w} = \frac{C^{-1}\mathbf{1}}{\mathbf{1}^{T} C^{-1} \mathbf{1}}$

  Algorithm: run for each sample $i$
  Define $X^{(i)}$ and $C^{(i)} = \big( \mathbf{x}^{(i)}\mathbf{1}^{T} - X^{(i)} \big)^{T}\big( \mathbf{x}^{(i)}\mathbf{1}^{T} - X^{(i)} \big)$
  $\mathbf{w}^{(i)} = \dfrac{(C^{(i)})^{-1}\mathbf{1}}{\mathbf{1}^{T} (C^{(i)})^{-1} \mathbf{1}}$
  for $m = 1 : |Nei(i)|$: set $w_{i, h_{m}} = w^{(i)}_{m}$
  end
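A NumPy sketch of step (2), following the per-sample algorithm above (local Gram matrix, solve $C\mathbf{w} = \mathbf{1}$, normalize). The small regularizer added to C is my own assumption, since in practice C can be singular when K > d; names and toy data are also mine.

```python
import numpy as np

def lle_weights(X, K=10, reg=1e-3):
    """X: d x N. Returns the N x N weight matrix W with rows summing to 1."""
    d, N = X.shape
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    nn = np.argsort(D, axis=1)[:, 1:K + 1]          # K nearest neighbors of each sample
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[:, nn[i]] - X[:, [i]]                 # neighbors shifted so x^(i) is the origin
        C = Z.T @ Z                                  # local covariance (Gram) matrix, K x K
        C += reg * np.trace(C) * np.eye(K)           # regularization (assumption: keeps C invertible)
        w = np.linalg.solve(C, np.ones(K))           # solve C w = 1
        W[i, nn[i]] = w / w.sum()                    # enforce sum_j w_ij = 1
    return W

# toy usage
rng = np.random.default_rng(7)
X = rng.standard_normal((3, 100))
W = lle_weights(X, K=8)
print(np.allclose(W.sum(axis=1), 1.0))
```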

Algorithm

• (3) Re-embedding: (minimize the reconstruction error again)

  $\min_{Y} \Phi(Y) = \min_{Y} \sum_{i=1}^{N} \left\| \mathbf{y}^{(i)} - \sum_{j=1}^{N} w_{ij}\, \mathbf{y}^{(j)} \right\|_{L_{2}}^{2}$, with $Y = [\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(N)}]$

  $= \| Y - YW^{T} \|_{F}^{2}
  = \mathrm{tr}\{ Y (I_{N} - W^{T})(I_{N} - W) Y^{T} \}
  = \mathrm{tr}\{ Y (I_{N} - W - W^{T} + W^{T}W) Y^{T} \}$

Algorithm

• (3) Re-embedding:

  Definition: $M_{N \times N} = (I_{N} - W)^{T}(I_{N} - W)$,
  $m_{ij} = \delta_{ij} - w_{ij} - w_{ji} + \sum_{k} w_{ki} w_{kj}$

  Constraints (to avoid degenerate solutions):
  $\sum_{n=1}^{N} \mathbf{y}^{(n)} = \mathbf{0}$, $\frac{1}{N}\sum_{n=1}^{N} \mathbf{y}^{(n)} (\mathbf{y}^{(n)})^{T} = \frac{1}{N} Y Y^{T} = I$

  Optimization:
  $Y^{*} = \arg\min_{Y} \mathrm{tr}\{ Y M Y^{T} \}$, subject to the constraints above
  (apply the Rayleigh-Ritz theorem)

Algorithm

• (3) Re-embedding:

  Additional property: every row of $M$ sums to 0,
  $\sum_{j=1}^{N} m_{ij} = \sum_{j} \left( \delta_{ij} - w_{ij} - w_{ji} + \sum_{k} w_{ki} w_{kj} \right) = 1 - 1 - \sum_{j} w_{ji} + \sum_{k} w_{ki} = 0$ (using $\sum_{j} w_{kj} = 1$),
  so $\mathbf{1}_{N}$ is an eigenvector of $M$ with eigenvalue $0$.

  Solution (EVD): $M = U \Lambda U^{T}$;
  $Y^{*} = \arg\min_{Y} \mathrm{tr}\{ Y M Y^{T} \}$ takes as its rows the eigenvectors of $M$ with the smallest nonzero eigenvalues
  (the 2nd through the $(p+1)$-th smallest, discarding the constant eigenvector $\mathbf{1}_{N}$).
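And a sketch of step (3): build $M = (I - W)^{T}(I - W)$, take the eigenvectors with the smallest eigenvalues, and discard the constant one. It continues the `lle_weights` sketch above; the scaling convention follows the slides' constraint $(1/N)YY^{T} = I$.

```python
import numpy as np

def lle_embed(W, p=2):
    """W: N x N LLE weight matrix. Returns Y: p x N."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    evals, U = np.linalg.eigh(M)              # ascending eigenvalues; evals[0] ~ 0 (constant vector)
    Y = U[:, 1:p + 1].T                       # discard the constant eigenvector, keep the next p
    return Y * np.sqrt(N)                     # scale so that (1/N) Y Y^T = I

# toy usage, continuing from the lle_weights sketch:
# Y = lle_embed(lle_weights(X, K=8), p=2)
```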

Algorithm

• Proof: assume $p = 1$, so $Y = \mathbf{q}^{T}$ (one row):
  $\arg\min_{Y} \mathrm{tr}\{ Y M Y^{T} \} = \arg\min_{\mathbf{q}} \mathbf{q}^{T} M \mathbf{q}$,
  subject to $\mathbf{q}^{T}\mathbf{q} = N$ and $\mathbf{q}^{T}\mathbf{1}_{N} = 0$.
  Setting the derivative of the Lagrangian $\mathbf{q}^{T} M \mathbf{q} + \lambda (N - \mathbf{q}^{T}\mathbf{q})$ to zero gives $M\mathbf{q} = \lambda\mathbf{q}$,
  so $\mathbf{q}^{*}$ is the eigenvector of $M$ with the smallest eigenvalue among those orthogonal to $\mathbf{1}_{N}$,
  i.e. the eigenvector with the 2nd smallest eigenvalue (because $\mathbf{1}_{N}$ is the eigenvector with $\lambda = 0$).

  In general, $Y_{p \times N} = [\mathbf{q}^{(1)}, \ldots, \mathbf{q}^{(p)}]^{T}$: each column is a sample, each row a dimension.

Algorithm

• Proof (continued): assume $Y$ is $(r+1) \times N$ with the first $r$ rows already chosen as above; then
  $\mathrm{tr}\{ Y M Y^{T} \} = \sum_{i=1}^{r} \mathbf{q}^{(i)T} M \mathbf{q}^{(i)} + \mathbf{q}^{T} M \mathbf{q}$,
  and minimizing over $\mathbf{q}$ subject to $\mathbf{q}^{T}\mathbf{q} = N$, $\mathbf{q}^{T}\mathbf{1}_{N} = 0$, $\mathbf{q}^{T}\mathbf{q}^{(i)} = 0$ ($i = 1, \ldots, r$)
  gives $\mathbf{q}^{*}$ = the eigenvector of $M = U \Lambda U^{T}$ with the next smallest eigenvalue (skipping the constant eigenvector $\mathbf{1}_{N}$).

Example

• Swiss roll:

350 points (figures from [5] and [1])

Example

• S shape:

(Figure from [6].)

Example

(Figure from [5].)

Summary

• Although the global geometry isn’t explicitly preserved

during LLE, it can still be reconstructed from the

overlapping local neighborhoods.

• The matrix M to perform EVD is indeed sparse.

• K is a key factor in LLE, as it is in ISOMAP.

• LLE cannot handle holes very well.

Laplacian eigenmap

[7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003

Review and Comparison

• Data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$, low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$

• ISOMAP: (isometric embedding)
  $dis_{geodesic}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \;\approx\; dis(\mathbf{y}^{(i)}, \mathbf{y}^{(j)}) = \| \mathbf{y}^{(i)} - \mathbf{y}^{(j)} \|_{L_{2}}$

• LLE: (neighborhood preserving)
  $\min_{W} E(W) = \min_{W} \sum_{i=1}^{N} \left\| \mathbf{x}^{(i)} - \sum_{j=1}^{N} w_{ij}\, \mathbf{x}^{(j)} \right\|_{L_{2}}^{2}$, then
  $\min_{Y} \Phi(Y) = \min_{Y} \sum_{i=1}^{N} \left\| \mathbf{y}^{(i)} - \sum_{j=1}^{N} w_{ij}\, \mathbf{y}^{(j)} \right\|_{L_{2}}^{2}$

Laplacian eigenmap (LE)

• LE: (locality preserving)

  Model: $y_{l}^{(n)} = f_{l}(\mathbf{x}^{(n)})$, with $f_{l}: M \to \mathbb{R}$ defined on the manifold $M$, $l = 1, \ldots, p$

  Criterion: neighboring (similar) samples should receive similar embeddings,
  $\arg\min_{f_{l}} \sum_{i=1}^{N}\sum_{j=1}^{N} \left( f_{l}(\mathbf{x}^{(i)}) - f_{l}(\mathbf{x}^{(j)}) \right)^{2} w_{ij}
  \;=\; \arg\min_{Y} \sum_{i=1}^{N}\sum_{j=1}^{N} \left\| \mathbf{y}^{(i)} - \mathbf{y}^{(j)} \right\|_{L_{2}}^{2} w_{ij}$,
  where $w_{ij}$ measures the similarity between samples $i$ and $j$.

General setting

• (Training) data set: high-D: $X = \{\mathbf{x}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$

• Preprocessing: centering (the mean can be added back)
  $\mathbf{x}^{(n)} \leftarrow \mathbf{x}^{(n)} - \frac{1}{N}\sum_{m=1}^{N}\mathbf{x}^{(m)}$, or in matrix form $X \leftarrow X - \bar{\mathbf{x}}\mathbf{e}^{T}$

• Want to achieve: low-D: $Y = \{\mathbf{y}^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$

Algorithm

• Fundamental:

  Laplace-Beltrami operator (for smoothness)

• Presetting:

  Define the weight matrix $W_{N \times N} = [w_{ij} \ge 0]_{1 \le i,j \le N}$

• (1) Neighborhood definition:

  Set $Nei(i)$ as the neighbor set of $\mathbf{x}^{(i)}$ (undirected)

  ε-neighbor: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\| \mathbf{x}^{(j)} - \mathbf{x}^{(i)} \|_{L_{2}} \le \varepsilon$

  K-NN: $\mathbf{x}^{(j)} \in Nei(i)$ iff $\mathbf{x}^{(j)} \in KNN(i)$ or $\mathbf{x}^{(i)} \in KNN(j)$

Algorithm

• (2) Weight computation: (heat kernel)
  $w_{ij} = \exp\!\left( -\frac{\| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \|_{L_{2}}^{2}}{t} \right)$ if $\mathbf{x}^{(j)} \in Nei(i)$, and $w_{ij} = 0$ otherwise; note $w_{ij} = w_{ji}$

• (3) Re-embedding:
  $E(Y) = \sum_{i=1}^{N}\sum_{j=1}^{N} \left\| \mathbf{y}^{(i)} - \mathbf{y}^{(j)} \right\|_{L_{2}}^{2} w_{ij}
  = \sum_{i}\sum_{j} \left( \mathbf{y}^{(i)T}\mathbf{y}^{(i)} - 2\,\mathbf{y}^{(i)T}\mathbf{y}^{(j)} + \mathbf{y}^{(j)T}\mathbf{y}^{(j)} \right) w_{ij}$
  $= \sum_{i}\sum_{j} \mathbf{y}^{(i)T}\mathbf{y}^{(i)} w_{ij} + \sum_{i}\sum_{j} \mathbf{y}^{(j)T}\mathbf{y}^{(j)} w_{ij} - 2\sum_{i}\sum_{j} \mathbf{y}^{(i)T}\mathbf{y}^{(j)} w_{ij}$

Algorithm

• (3) Re-embedding:
  $E(Y) = \sum_{i} d_{ii}\, \mathbf{y}^{(i)T}\mathbf{y}^{(i)} + \sum_{j} d_{jj}\, \mathbf{y}^{(j)T}\mathbf{y}^{(j)} - 2\sum_{i}\sum_{j} w_{ij}\, \mathbf{y}^{(i)T}\mathbf{y}^{(j)}$  (ignore the scalar 2)
  $= \mathrm{tr}(Y D Y^{T}) - \mathrm{tr}(Y W Y^{T}) = \mathrm{tr}\big( Y (D - W) Y^{T} \big) = \mathrm{tr}(Y L Y^{T})$,
  where $D$ is the $N \times N$ diagonal matrix with $d_{ii} = \sum_{j=1}^{N} w_{ij} = \sum_{j=1}^{N} w_{ji}$, and $L = D - W$ is the graph Laplacian.

Optimization

• Optimization:
  $Y^{*} = \arg\min_{Y} \mathrm{tr}(Y L Y^{T})$, subject to $Y D Y^{T} = I_{p \times p}$
  Solution: the generalized eigenvalue problem $L \mathbf{u} = \lambda D \mathbf{u}$; take the eigenvectors with the smallest nonzero eigenvalues
  ($\mathbf{1}_{N}$ is an eigenvector with $\lambda = 0$)

• Constraint:
  $Y D Y^{T} = \sum_{i=1}^{N} d_{ii}\, \mathbf{y}^{(i)} \mathbf{y}^{(i)T} = I$
  (samples with large $d_{ii}$ carry more weight than those with small $d_{ii}$)
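A NumPy sketch of the whole LE pipeline under the slides' recipe (K-NN neighborhood, heat-kernel weights, generalized eigenvectors of (L, D) obtained via the $D^{-1/2} L D^{-1/2}$ trick proved on the following slides). Parameter choices, names, and toy data are my own.

```python
import numpy as np

def laplacian_eigenmap(X, p=2, K=8, t=1.0):
    """X: d x N. Returns Y: p x N."""
    N = X.shape[1]
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # squared distances

    # (1)-(2) symmetrized K-NN graph with heat-kernel weights
    nn = np.argsort(D2, axis=1)[:, 1:K + 1]
    W = np.zeros((N, N))
    for i in range(N):
        W[i, nn[i]] = np.exp(-D2[i, nn[i]] / t)
    W = np.maximum(W, W.T)                                       # w_ij = w_ji

    # (3) graph Laplacian and generalized EVD  L u = lambda D u
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A = D_inv_sqrt @ L @ D_inv_sqrt                              # symmetric normalized Laplacian
    evals, vecs = np.linalg.eigh(A)
    U = D_inv_sqrt @ vecs                                        # generalized eigenvectors, U^T D U = I
    return U[:, 1:p + 1].T                                       # drop the trivial (lambda = 0) eigenvector

# toy usage with made-up data
rng = np.random.default_rng(8)
X = rng.standard_normal((3, 200))
Y = laplacian_eigenmap(X, p=2, K=10)
print(Y.shape)
```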

Optimization

• Proof: assume $p = 1$, so $Y = \mathbf{q}^{T}$:
  $\min_{Y} \mathrm{tr}(Y L Y^{T}) = \min_{\mathbf{q}} \mathbf{q}^{T} L \mathbf{q}$, subject to $\mathbf{q}^{T} D \mathbf{q} = 1$.
  Setting the derivative of the Lagrangian $\mathbf{q}^{T} L \mathbf{q} + \lambda (1 - \mathbf{q}^{T} D \mathbf{q})$ to zero gives the
  generalized eigenvalue problem $L \mathbf{q} = \lambda D \mathbf{q}$,
  so $\mathbf{q}^{*}$ is the generalized eigenvector with the 2nd smallest eigenvalue
  (because $\mathbf{1}_{N}$ is the generalized eigenvector with $\lambda = 0$, i.e. $L \mathbf{1}_{N} = 0 \cdot D \mathbf{1}_{N}$).

  In general, $Y_{p \times N} = [\mathbf{q}^{(1)}, \ldots, \mathbf{q}^{(p)}]^{T}$: each column is a sample, each row a dimension.

Optimization

• Proof (continued): assume $Y$ is $(r+1) \times N$ with the first $r$ rows already chosen; minimizing
  $\mathrm{tr}(Y L Y^{T}) = \sum_{i=1}^{r} \mathbf{q}^{(i)T} L \mathbf{q}^{(i)} + \mathbf{q}^{T} L \mathbf{q}$
  subject to $\mathbf{q}^{T} D \mathbf{q} = 1$ and $\mathbf{q}^{T} D \mathbf{q}^{(i)} = 0$ gives
  $\mathbf{q}^{*}$ = the generalized eigenvector of $(L, D)$ with the next smallest eigenvalue.

• Proof that the generalized problem reduces to a symmetric one:
  $L U = D U \Lambda$; set $A = D^{1/2} U$, then $(D^{-1/2} L D^{-1/2}) A = A \Lambda$.
  $D^{-1/2} L D^{-1/2}$ is Hermitian, so $A^{T} A = I$, i.e. $U^{T} D U = I$.

• Spectral clustering: $Y = U^{T} D^{1/2}$

Example

• Swiss roll: 2000 points

(Figure from [7].)

Example

• Example: From 3D to 3D

(Figure from [1].)

• Constraints used in LLE and LE: $YY^{T} = I$ or $YDY^{T} = I$

• $I$ can be replaced by any diagonal matrix with positive entries:
  $YY^{T} = B$ or $YDY^{T} = B$, with $B = \mathrm{diag}(b_{11}, \ldots, b_{pp})$, $b_{ii} > 0$,
  which simply rescales each output dimension, $y_{i}^{(n)} \to b_{ii}^{1/2}\, y_{i}^{(n)}$

  Is the constraint meaningful?

Thank you for listening
