Manifold Learning: From Linear to Nonlinear
Presenter: Wei-Lun (Harry) Chao
Date: April 26 and May 3, 2012
At: AMMAI 2012
Preview
• Goal:
Dimensionality reduction
Classification and clustering
• Main idea:
What information and properties to preserve or enhance?
Outline
• Notation and fundamentals of linear algebra
• PCA and LDA
• Topology, manifold, and embedding
• MDS
• ISOMAP
• LLE
• Laplacian eigenmap
• Graph embedding and supervised, semi-supervised extensions
• Other manifold learning algorithms
• Manifold ranking
• Other cases
Reference
• [1] J. A. Lee et al., Nonlinear Dimensionality Reduction, 2007
• [2] R. O. Duda et al., Pattern Classification, 2001
• [3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997
• [4] J. B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, 2000
• [5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, 2000
• [6] L. K. Saul et al., Think globally, fit locally, 2003
• [7] M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, 2003
• [8] T. F. Cootes et al., Active appearance models, 1998
Notation
• Data set:
  high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
  low-D: $Y = \{y^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$
• Matrix: $A_{n\times m} = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = [a_{ij}]_{1\le i\le n,\,1\le j\le m}$
• Vector: $a^{(i)} = [a_{1}^{(i)}, a_{2}^{(i)}, \ldots, a_{d}^{(i)}]^{T}$
• Matrix form of data set: $X_{d\times N} = [x^{(1)}, x^{(2)}, \ldots, x^{(N)}]$
Fundamentals of Linear Algebra
• SVD (singular value decomposition):
  $X_{d\times N} = U_{d\times d}\,\Sigma_{d\times N}\,V_{N\times N}^{T}$, where
  $U = [u^{(1)}, u^{(2)}, \ldots, u^{(d)}]$, $V = [v^{(1)}, v^{(2)}, \ldots, v^{(N)}]$,
  $\Sigma$ is diagonal (padded with zeros) with $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$,
  $U^{T}U = UU^{T} = I_d$, $V^{T}V = VV^{T} = I_N$.
Fundamentals of Linear Algebra
• EVD (eigenvector decomposition):
  $A_{N\times N} U = U\Lambda$, i.e., $A u^{(n)} = \lambda_n u^{(n)}$, so $A = U\Lambda U^{-1}$,
  where $U = [u^{(1)}, u^{(2)}, \ldots, u^{(N)}]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$.
• Caution: eigenvectors are not always orthogonal!
• Caution: not all matrices have an EVD.
Fundamentals of Linear Algebra
• Determinant: $\det(A_{N\times N}) = \prod_{n=1}^{N}\lambda_n$
• Trace: $\mathrm{tr}(A_{N\times N}) = \sum_{i=1}^{N} a_{ii} = \sum_{n=1}^{N}\lambda_n$, and $\mathrm{tr}(A_{d\times N} B_{N\times d}) = \mathrm{tr}(B_{N\times d} A_{d\times N})$
• Rank: $\mathrm{rank}(A) = \mathrm{rank}(U\Sigma V^{T})$
  = # nonzero diagonal elements of $\Sigma$
  = # independent columns of $A$
  = # nonzero eigenvalues (for square $A$)
  $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$, $\mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$
Fundamentals of Linear Algebra
• SVD vs. EVD (symmetric positive semi-definite):
  $A = XX^{T} = (U\Sigma V^{T})(U\Sigma V^{T})^{T} = U(\Sigma\Sigma^{T})U^{T} \;\Rightarrow\; AU = U(\Sigma\Sigma^{T})$,
  so the left singular vectors $U$ of $X$ are the eigenvectors of $XX^{T}$ (and $V$ those of $X^{T}X$).
• Hermitian matrix: $A^{H} = \mathrm{conj}(A)^{T} = A$; if $A$ is real, $A^{T} = A$.
• Hermitian matrices have orthonormal eigenvectors:
  $A = U\Lambda U^{T}$, $U^{T}U = UU^{T} = I$, hence $U^{T}AU = \Lambda$.
Dimensionality reduction
• Operation:
  high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$ $\rightarrow$ low-D: $Y = \{y^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$, with $p \ll d$
• Reason:
  Compression
  Knowledge discovery or feature extraction
  Irrelevant and noisy feature removal
  Visualization
  Curse of dimensionality
Dimensionality reduction
• Methods:
  Feature transform: $f: \mathbb{R}^d \rightarrow \mathbb{R}^p$, $p \ll d$, $y = f(x)$; linear form: $y = Q_{d\times p}^{T} x$
  Feature selection: $y = [x_{s(1)}, x_{s(2)}, \ldots, x_{s(p)}]^{T}$, where $s$ denotes the selected indices
• Criterion:
  Preserve some properties or structures of the high-D feature space in the low-D feature space.
  These properties are measured from data.
Dimensionality reduction
• Model:
  Linear projection: $y = Q_{d\times p}^{T} x$
  Direct re-embedding: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^d \rightarrow Y = \{y^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^p$
  Learning a mapping function: $f: \mathbb{R}^d \rightarrow \mathbb{R}^p$, $p \ll d$, $y = f(x)$
Principal Component Analysis
(PCA)
[1] J. A. Lee et al., Nonlinear Dimensionality reduction, 2007
[2] R. O. Duda et al., Pattern Classification, 2001
Principal component analysis (PCA)
• PCA:
  $y = Q^{T} x$, where $Q_{d\times p} = [q^{(1)}, q^{(2)}, \ldots, q^{(p)}]$ and $Q^{T}Q = I_p$ (orthonormal columns)
  Reconstruction: $\hat{x} = Qy = \sum_{i=1}^{p} y_i\, q^{(i)}$
Principal component analysis (PCA)
• Surprising usage: face recognition and encoding
  (Figure: a face image expressed as a linear combination of eigenfaces, with example coefficients $-2181$, $627$, $389$, ..., reconstructing the face.)
Principal component analysis (PCA)
• PCA is basic yet important and useful:
Easy to train and use
Lots of additional functionalities:
noise reduction, ellipse fitting, ……
• Also named the Karhunen-Loève transform (KL transform)
• Criteria:
Maximum variance (with decorrelation)
Minimum reconstruction error
Principal component analysis (PCA)
• Maximum variance (with decorrelation)
• Minimum reconstruction error
Principal component analysis (PCA)
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
• Preprocessing: centering (the mean can be added back):
  $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$, $x^{(n)} \leftarrow x^{(n)} - \bar{x}$, or equivalently $X \leftarrow X(I_N - \frac{1}{N}ee^{T})$
• Model: $y = Q^{T}x$, $\hat{x} = Qy = QQ^{T}x$, where $Q_{d\times p}$ ($d \ge p$) is orthonormal ($Q^{T}Q = I_p$)
Maximum variance (with decorrelation)
• The low-D feature vectors should be decorrelated.
• Covariance:
  $\mathrm{cov}(x_1, x_2) = E[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)] = \frac{1}{N}\sum_{n=1}^{N}(x_1^{(n)} - \bar{x}_1)(x_2^{(n)} - \bar{x}_2)$
• Covariance matrix:
  $C_{xx} = \frac{1}{N}\sum_{n=1}^{N}(x^{(n)} - \bar{x})(x^{(n)} - \bar{x})^{T} = \frac{1}{N}X\big(I - \tfrac{1}{N}ee^{T}\big)\big(I - \tfrac{1}{N}ee^{T}\big)^{T}X^{T}$
  (= $\frac{1}{N}XX^{T}$ once $X$ is centered); its $(i,j)$ entry is $\mathrm{cov}(x_i, x_j)$.
Maximum variance (with decorrelation)
• Decorrelation: require the low-D covariance $C_{yy}$ to be a diagonal matrix.
  $\bar{y} = \frac{1}{N}\sum_n y^{(n)} = \frac{1}{N}\sum_n Q^{T}x^{(n)} = Q^{T}\bar{x} = 0$ (data centered)
  $C_{yy} = \frac{1}{N}\sum_n y^{(n)}y^{(n)T} = \frac{1}{N}\sum_n Q^{T}x^{(n)}x^{(n)T}Q = Q^{T}C_{xx}Q$
  $\Rightarrow$ require $Q^{T}C_{xx}Q$ to be diagonal.
Maximum variance (with decorrelation)
• Maximum variance:
  $Q^{*} = \arg\max_{Q^{T}Q=I}\frac{1}{N}\sum_n \|y^{(n)}\|^2 = \arg\max_{Q^{T}Q=I}\frac{1}{N}\sum_n \|Q^{T}x^{(n)}\|^2$
  $= \arg\max_{Q^{T}Q=I}\frac{1}{N}\sum_n (Q^{T}x^{(n)})^{T}(Q^{T}x^{(n)}) = \arg\max_{Q^{T}Q=I}\frac{1}{N}\sum_n \mathrm{tr}(Q^{T}x^{(n)}x^{(n)T}Q)$
  (a scalar equals its own trace, and the trace is cyclic)
  $= \arg\max_{Q^{T}Q=I}\mathrm{tr}\Big(Q^{T}\Big[\tfrac{1}{N}\sum_n x^{(n)}x^{(n)T}\Big]Q\Big) = \arg\max_{Q^{T}Q=I}\mathrm{tr}(Q^{T}C_{xx}Q)$
Maximum variance (with decorrelation)
• Optimization problem:
  $Q^{*} = \arg\max_{Q}\mathrm{tr}(Q^{T}C_{xx}Q)$, subject to $C_{yy} = Q^{T}C_{xx}Q$ diagonal and $Q^{T}Q = I_p$
• Solution:
  $Q^{*} = [u^{(1)}, u^{(2)}, \ldots, u^{(p)}]$, where $u^{(i)}$ is the eigenvector of $C_{xx}$ ($d\times d$) with the $i$-th largest eigenvalue:
  $C_{xx} = \frac{1}{N}XX^{T} = \frac{1}{N}U\Sigma V^{T}V\Sigma^{T}U^{T} = U\big(\tfrac{1}{N}\Sigma\Sigma^{T}\big)U^{T}$, and $Q^{*T}Q^{*} = I_p$.
Maximum variance (with decorrelation)
• Proof (Lagrange multiplier):
  Assume $p$ is 1: $\mathrm{tr}(q^{T}C_{xx}q) = q^{T}C_{xx}q$
  $q^{*} = \arg\max_{q^{T}q=1} q^{T}C_{xx}q$
  $E(q, \lambda) = q^{T}C_{xx}q - \lambda(q^{T}q - 1)$
  Take the partial derivative: $\frac{\partial E}{\partial q} = 2C_{xx}q - 2\lambda q = 0 \Rightarrow C_{xx}q = \lambda q$ ($q$ is an eigenvector of $C_{xx}$)
  $q^{*T}C_{xx}q^{*} = \lambda\, q^{*T}q^{*} = \lambda$
  $\Rightarrow q^{*}$ is the eigenvector with the largest eigenvalue of $C_{xx} = U\Lambda U^{T}$.
Maximum variance (with decorrelation)
  Assume $p$ is 2, $Q = [q^{(1)}, q]$:
  $\mathrm{tr}(Q^{T}C_{xx}Q) = q^{(1)T}C_{xx}q^{(1)} + q^{T}C_{xx}q$
  $q^{*} = \arg\max_{q^{T}q=1,\, q\perp q^{(1)}} q^{T}C_{xx}q$ $\Rightarrow$ the second largest eigenvector of $C_{xx} = U\Lambda U^{T}$
  Assume $p$ is $r+1$, $Q = [q^{(1)}, \ldots, q^{(r)}, q]$:
  $\mathrm{tr}(Q^{T}C_{xx}Q) = \sum_{i=1}^{r} q^{(i)T}C_{xx}q^{(i)} + q^{T}C_{xx}q$
  $q^{*} = \arg\max_{q^{T}q=1,\, q\perp q^{(1)},\ldots,q^{(r)}} q^{T}C_{xx}q$ $\Rightarrow$ the $(r+1)$-th largest eigenvector of $C_{xx} = U\Lambda U^{T}$
Minimum reconstruction error
• Mean square error is preferred:
  $Q^{*} = \arg\min_{Q^{T}Q=I}\frac{1}{N}\sum_n \|x^{(n)} - \hat{x}^{(n)}\|^2 = \arg\min_{Q^{T}Q=I}\frac{1}{N}\sum_n \|x^{(n)} - QQ^{T}x^{(n)}\|^2$
  $= \arg\min_{Q^{T}Q=I}\frac{1}{N}\sum_n \big((I - QQ^{T})x^{(n)}\big)^{T}\big((I - QQ^{T})x^{(n)}\big)$
  $= \arg\min_{Q^{T}Q=I}\frac{1}{N}\sum_n \big[x^{(n)T}x^{(n)} - 2x^{(n)T}QQ^{T}x^{(n)} + x^{(n)T}QQ^{T}QQ^{T}x^{(n)}\big]$
  $= \arg\min_{Q^{T}Q=I}\frac{1}{N}\sum_n \big[x^{(n)T}x^{(n)} - x^{(n)T}QQ^{T}x^{(n)}\big]$ (since $Q^{T}Q = I$)
  $= \arg\max_{Q^{T}Q=I}\frac{1}{N}\sum_n x^{(n)T}QQ^{T}x^{(n)} = \arg\max_{Q^{T}Q=I}\mathrm{tr}(Q^{T}C_{xx}Q)$
  $\Rightarrow$ the same solution as maximum variance.
Algorithm
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
• Preprocessing: centering (the mean can be added back):
  $\bar{x} = \frac{1}{N}\sum_n x^{(n)}$, $x^{(n)} \leftarrow x^{(n)} - \bar{x}$, or $X \leftarrow X(I_N - \frac{1}{N}ee^{T})$
• Model: $y = Q^{T}x$, $\hat{x} = Qy = QQ^{T}x$, where $Q_{d\times p}$ ($d \ge p$) is orthonormal ($Q^{T}Q = I_p$)

Algorithm
• Algorithm 1 (EVD):
  1. $C_{xx}U = U\Lambda$, where the $\lambda_i$ in $\Lambda$ are in descending order
  2. $Q = U\begin{bmatrix} I_p \\ O \end{bmatrix}_{d\times p} = [u^{(1)}, u^{(2)}, \ldots, u^{(p)}]$
• Algorithm 2 (SVD):
  1. $X = U\Sigma V^{T}$, where the $\sigma_i$ in $\Sigma$ are in descending order
  2. $Q = U\begin{bmatrix} I_p \\ O \end{bmatrix}_{d\times p} = [u^{(1)}, u^{(2)}, \ldots, u^{(p)}]$
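To make Algorithm 2 concrete, here is a minimal NumPy sketch; the function name `pca` and the $d \times N$ data layout (samples as columns) are our own conventions for illustration, not part of the original slides.

```python
# Minimal PCA sketch following Algorithm 2 (SVD). X is d x N (columns = samples).
import numpy as np

def pca(X, p):
    x_bar = X.mean(axis=1, keepdims=True)              # mean (can be added back)
    Xc = X - x_bar                                     # centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U diag(s) Vt, s descending
    Q = U[:, :p]                                       # Q = [u^(1), ..., u^(p)]
    Y = Q.T @ Xc                                       # y^(n) = Q^T (x^(n) - x_bar)
    return Q, Y, x_bar

# usage: project synthetic 5-D data to 2-D and check the reconstruction error
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))
Q, Y, x_bar = pca(X, p=2)
X_hat = Q @ Y + x_bar                                  # x_hat = Q y + mean
print(np.linalg.norm(X - X_hat))
```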
Summary
• PCA exploits 2nd-order statistical properties measured from the data (simple and not vulnerable to over-fitting).
• Usually used as a "preprocessing step" in applications.
• Rank:
  $C_{xx} = \frac{1}{N}\sum_n x^{(n)}x^{(n)T} = \frac{1}{N}XX^{T}$, $C_{xx}U = U\Lambda$
  After centering, $\mathrm{rank}(C_{xx}) \le N - 1$ in general, so choose $p \le N - 1$.
Optimization problem
• Convex or not?
  $q^{*} = \arg\max_q q^{T}C_{xx}q$, s.t. $q^{T}q = 1$
  (1) $C_{xx} = \frac{1}{N}XX^{T} = U\Lambda U^{T}$ is positive semi-definite
  (2) $q^{T}q = 1$ is a quadratic equality constraint
• Convex or not?
  $q^{*} = \arg\min_q q^{T}C_{xx}q$, s.t. $q^{T}q = 1$
  (1) $C_{xx} = \frac{1}{N}XX^{T} = U\Lambda U^{T}$ is positive semi-definite
  (2) $q^{T}q = 1$ is a quadratic equality constraint
Linear Discriminant Analysis
(LDA)
[2] R. O. Duda et al., Pattern Classification, 2001
[3] P. N. Belhumeur et al., Eigenfaces vs. Fisherfaces, 1997
Linear discriminant analysis (LDA)
• PCA is unsupervised
• LDA takes the label information into consideration
• Achieved low-D features are efficient for discrimination.
Linear discriminant analysis (LDA)
• (Training) data set:
  high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$, with labels $\mathrm{label}(x^{(n)}) \in L = \{l_1, l_2, \ldots, l_c\}$
• Model: $y = Q^{T}x$, where $Q$ is $d \times p$
• Notation:
  $X_i = \{x^{(n)} \mid \mathrm{label}(x^{(n)}) = l_i\}$, $N_i$ = # samples in $X_i$
  class mean: $\bar{x}_i = \frac{1}{N_i}\sum_{x^{(n)}\in X_i} x^{(n)}$
  total mean: $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$
  between-class scatter: $S_B = \sum_{i=1}^{c} N_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{T}$
  within-class scatter: $S_W = \sum_{i=1}^{c}\sum_{x^{(n)}\in X_i}(x^{(n)} - \bar{x}_i)(x^{(n)} - \bar{x}_i)^{T}$
Linear discriminant analysis (LDA)
• Properties of the scatter matrices:
  $S_B = \sum_{i=1}^{c} N_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{T}$ measures inter-class separation
  $S_W = \sum_{i=1}^{c}\sum_{x^{(n)}\in X_i}(x^{(n)} - \bar{x}_i)(x^{(n)} - \bar{x}_i)^{T}$ measures intra-class tightness
• Scatter matrices in low-D ($y = Q^{T}x$):
  between-class: $\sum_{i=1}^{c} N_i(Q^{T}\bar{x}_i - Q^{T}\bar{x})(Q^{T}\bar{x}_i - Q^{T}\bar{x})^{T} = Q^{T}S_B Q$
  within-class: $\sum_{i=1}^{c}\sum_{x^{(n)}\in X_i}(Q^{T}x^{(n)} - Q^{T}\bar{x}_i)(Q^{T}x^{(n)} - Q^{T}\bar{x}_i)^{T} = Q^{T}S_W Q$
Criterion and algorithm
• Criterion of LDA:
  Maximize the ratio of $Q^{T}S_B Q$ to $Q^{T}S_W Q$ "in some sense"
• Determinant and trace are suitable scalar measures:
  $Q^{*} = \arg\max_Q \dfrac{|Q^{T}S_B Q|}{|Q^{T}S_W Q|}$ or $\arg\max_Q \dfrac{\mathrm{tr}(Q^{T}S_B Q)}{\mathrm{tr}(Q^{T}S_W Q)}$
• With the Rayleigh quotient:
  $S_B$ and $S_W$ are both symmetric positive semi-definite; if $S_W$ is nonsingular,
  $Q^{*} = [u^{(1)}, u^{(2)}, \ldots, u^{(p)}]$, where $S_W^{-1}S_B\, u^{(i)} = \lambda_i u^{(i)}$ and the $\lambda_i$ are in descending order.
Note and Problem
• Note:
  $S_B u^{(i)} = \lambda_i S_W u^{(i)} \;\Leftrightarrow\; S_W^{-1}S_B u^{(i)} = \lambda_i u^{(i)}$
  $\mathrm{rank}(S_B) \le c - 1$, so there are at most $c - 1$ nonzero $\lambda_i$, and hence $p \le c - 1$.
• Problem:
  $\mathrm{rank}(S_W) \le N - c$, and $S_W$ is $d \times d$.
  If $\mathrm{rank}(S_W) < d$, $S_W$ is singular and the Rayleigh quotient is useless.
Solution
• Problem: $S_W = \sum_{i=1}^{c}\sum_{x^{(n)}\in X_i}(x^{(n)} - \bar{x}_i)(x^{(n)} - \bar{x}_i)^{T}$ is singular
• Solution:
  PCA + LDA:
  1. Perform PCA on $x^{(n)}$: $\tilde{x}^{(n)} = Q_{PCA}^{T}x^{(n)} \in \mathbb{R}^{N-c}$ ($Q_{PCA}$ is $d \times (N-c)$)
  2. Compute $\tilde{S}_W$ (of size $(N-c)\times(N-c)$); if it is nonsingular, the problem is solved
  3. For new samples, $y = Q_{LDA}^{T}Q_{PCA}^{T}x$
  Null-space method:
  1. $Q^{*} = \arg\max_Q Q^{T}S_B Q$ over the $Q$ that make $Q^{T}S_W Q = 0$
  2. Extract the columns of $Q^{*}$ from the null space of $S_W$
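As a concrete illustration of the scatter matrices and the generalized eigenproblem, a minimal sketch follows. It assumes $S_W$ is nonsingular (e.g., after the PCA step above); the $d \times N$ data layout and integer labels are our own conventions.

```python
# Minimal LDA sketch: build S_B and S_W, then solve S_W^{-1} S_B u = lambda u.
import numpy as np

def lda(X, labels, p):
    d, N = X.shape
    x_bar = X.mean(axis=1, keepdims=True)              # total mean
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(labels):
        Xi = X[:, labels == c]                         # samples of class c
        Ni = Xi.shape[1]
        mi = Xi.mean(axis=1, keepdims=True)            # class mean
        S_B += Ni * (mi - x_bar) @ (mi - x_bar).T      # between-class scatter
        S_W += (Xi - mi) @ (Xi - mi).T                 # within-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-evals.real)[:p]                # at most c - 1 useful directions
    Q = evecs[:, order].real                           # d x p projection
    return Q, Q.T @ X
```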
Topology
• Geometrical point of view
If two or more features are latently dependent, their joint
distribution does not span the whole feature space.
  The dependence induces some structure (an object) in the feature space.
  $(x_1, x_2) = g(s)$, $a \le s \le b$
  (Figure: the curve from $g(a)$ to $g(b)$ traced in the 2-D feature space.)
Topology
• Topology:
Allowed: Deformation, twisting, and stretching
Not allowed: Tearing
  A topological object is characterized by its properties and structures.
  A topological object (space) is represented (embedded) as a spatial object in the feature space.
  Topology abstracts the intrinsic structure and ignores the details of the spatial object.
  Ex: a circle and an ellipse are topologically homeomorphic.
Manifold
• Feature space: dimensionality + structure
• Neighborhood: $Nei(x^{(i)})$ in $\mathbb{R}^{d}$: the ball $B(x^{(i)}) = \{x : \|x - x^{(i)}\|_2 < \varepsilon\}$
• A topological space can be characterized by neighborhoods.
• A manifold is a locally Euclidean topological space.
• Euclidean space: $dis(x^{(1)}, x^{(2)}) = \|x^{(1)} - x^{(2)}\|_2$ is meaningful.
• In general, any spatial object that is nearly flat at small scale is a manifold.
Embedding
• Embedding:
  Embedding is a representation of a topological object (e.g., a manifold or a graph) in a certain feature space, in such a way that the topological properties are preserved.
  A smooth manifold is differentiable and has a "functional structure" linking the features with latent variables.
  The dimensionality of a manifold is the number of latent variables.
  A $k$-manifold can be embedded into any $d$-dimensional space with $d$ equal to or larger than $2k+1$.
Manifold learning
• Manifold learning:
Recover the original embedding function from data.
• Dimensionality reduction with the manifold property:
  Re-embed a $k$-manifold in a $d$-dimensional space into a $p$-dimensional space with $d > p$.
  (Figure: latent variables $s$ mapped into the $d$-dimensional space by $g_1(s), g_2(s)$ and into the $p$-dimensional space by $h(s), f(s)$.)
Example
  $(x_1, x_2, x_3) = g_1(s)$, $a \le s \le b$ (a curve in 3-D, from $g_1(a)$ to $g_1(b)$)
  $(x_1, x_2) = g_2(s)$, $a \le s \le b$ (a curve in 2-D, from $g_2(a)$ to $g_2(b)$)
  Re-embedding of the latent variable: $f: g_1(s) \rightarrow g_2(s)$
Manifold learning
• Properties to preserve:
  Isometric embedding: distance preserving, $dis(x^{(1)}, x^{(2)}) = dis(y^{(1)}, y^{(2)})$
  Conformal embedding: angle preserving, $angle(x^{(1)} - x^{(3)}, x^{(2)} - x^{(3)}) = angle(y^{(1)} - y^{(3)}, y^{(2)} - y^{(3)})$
  Topological embedding: neighbor / local preserving
• Input space: locally Euclidean
• Output space: user defined
Multidimensional Scaling (MDS)
• Distance preserving: $dis(x^{(i)}, x^{(j)}) = dis(y^{(i)}, y^{(j)})$
• Scaling refers to constructing a configuration of samples in a target metric space from information about inter-point distances.
  (Figure: points with pairwise distances 10, 4, 9, 8.5, 6.5, and one unknown distance "?")
Multidimensional Scaling (MDS)
• MDS: a scaling where the target space is Euclidean
• Here we discuss classical metric MDS
• Metric MDS indeed preserves pairwise inner product
rather than pairwise distance
• Metric MDS is unsupervised
Multidimensional Scaling (MDS)
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
• Preprocessing: centering (the mean can be added back):
  $\bar{x} = \frac{1}{N}\sum_n x^{(n)}$, $x^{(n)} \leftarrow x^{(n)} - \bar{x}$, or $X \leftarrow X(I_N - \frac{1}{N}ee^{T})$
• Model: $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{p}$, $p \ll d$, $x \mapsto y$; there is no $Q$ to train.
Criterion
• Inner product (scalar product): $s_X(i,j) = s(x^{(i)}, x^{(j)}) = \langle x^{(i)}, x^{(j)}\rangle = x^{(i)T}x^{(j)}$
• Gram matrix: records the pairwise inner products:
  $S_X = [s_X(i,j)]_{1\le i,j\le N} = X^{T}X$ ($N \times N$)
  Covariance matrix: $C_{xx} = \frac{1}{N}X\big(I - \tfrac{1}{N}ee^{T}\big)\big(I - \tfrac{1}{N}ee^{T}\big)^{T}X^{T}$
• Usually we only know $S$ (or the pairwise distances), not $X$.
Criterion
• Criterion 1:
  $Y^{*} = \arg\min_{Y_{p\times N}}\sum_{i=1}^{N}\sum_{j=1}^{N}\big(s_X(i,j) - y^{(i)T}y^{(j)}\big)^2 = \arg\min_Y \|S - Y^{T}Y\|_F^2$
  where $\|\cdot\|_F$ is the matrix (Frobenius) norm: $\|A\|_F = \big(\sum_{i,j} a_{ij}^2\big)^{1/2} = \big(\mathrm{tr}(A^{T}A)\big)^{1/2}$
• Criterion 2:
  find $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(N)}]$ such that $X^{T}X = S \approx Y^{T}Y$
Algorithm
• Rank (assume $N > d$):
  $\mathrm{rank}(X^{T}X) \le \min(N, d) = d$, $\mathrm{rank}(Y^{T}Y) \le \min(N, p) = p$
• Low-rank approximation:
  For $A \in \mathbb{R}^{d\times N}$ with $\mathrm{rank}(A) = r$ and $A = U\Sigma V^{T}$,
  $B^{*} = \arg\min_{\mathrm{rank}(B)\le k,\, k\le r}\|A - B\|_F = U\begin{bmatrix}\Sigma_k & O\\ O & O\end{bmatrix}V^{T}$
  (keep only the $k$ largest singular values).
Algorithm
• EVD ($S$ is a Hermitian matrix):
  $S = X^{T}X = (U\Sigma V^{T})^{T}(U\Sigma V^{T}) = V(\Sigma^{T}\Sigma)V^{T} = V\Lambda V^{T}$
  $Y^{T}Y \approx V\begin{bmatrix}\Lambda_p & O\\ O & O\end{bmatrix}V^{T}$ (keep the $p$ largest eigenvalues)
• Solution:
  $Y = R\,[\lambda_1^{1/2}v^{(1)}, \ldots, \lambda_p^{1/2}v^{(p)}]^{T} = R\,I_{p\times N}\Lambda^{1/2}V^{T}$,
  where $R$ is an arbitrary $p \times p$ orthonormal (unitary) matrix (a rotation).
PCA vs. MDS
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$ (centered), SVD: $X = U\Sigma V^{T}$
• PCA: EVD on the covariance matrix
  $C_{xx} = \frac{1}{N}XX^{T} = \frac{1}{N}U\Sigma V^{T}V\Sigma^{T}U^{T} = U\big(\tfrac{1}{N}\Sigma\Sigma^{T}\big)U^{T}$
  $Y_{PCA} = Q^{T}X = (U I_{d\times p})^{T}X = I_{p\times d}U^{T}X$
• MDS: EVD on the Gram matrix
  $S_{MDS} = X^{T}X = V\Sigma^{T}U^{T}U\Sigma V^{T} = V(\Sigma^{T}\Sigma)V^{T}$
  $Y_{MDS} = I_{p\times N}\Lambda^{1/2}V^{T}$
PCA vs. MDS
• Discard the rotation term; with some derivation:
  $Y_{MDS} = I_{p\times N}\Lambda^{1/2}V^{T} = I_{p\times N}(\Sigma^{T}\Sigma)^{1/2}V^{T} = I_{p\times d}\Sigma V^{T}$
  $Y_{PCA} = I_{p\times d}U^{T}X = I_{p\times d}U^{T}U\Sigma V^{T} = I_{p\times d}\Sigma V^{T}$
  $\Rightarrow$ PCA and MDS give the same embedding.
• Comparison:
  PCA: EVD on the $d \times d$ matrix $C_{xx} \propto XX^{T}$
  MDS: EVD on the $N \times N$ matrix $S = X^{T}X$
  SVD: SVD on the $d \times N$ matrix $X$
For test data
• Model: $\hat{x} = Qy \Rightarrow y = Q^{T}x$ (generative view); use $Q = U I_{d\times p}$ from PCA for convenience.
• For a new test sample $x$:
  $y = Q^{T}x = I_{p\times d}U^{T}x$
• Finally, on the training data ($X = U\Sigma V^{T}$, so $U^{T}X = \Sigma V^{T}$) this reproduces the MDS embedding:
  $Y = I_{p\times d}U^{T}X = I_{p\times d}\Sigma V^{T} = I_{p\times N}\Lambda^{1/2}V^{T}$.
MDS with pairwise distance
• What if the training set is given only as pairwise distances?
  $D = [d_{ij}]_{1\le i,j\le N} = [dis(x^{(i)}, x^{(j)})]$, with no $X$ and no $S$
  (Figure: points with pairwise distances 10, 4, 9, 8.5, 6.5, and one unknown distance "?")
Distance metric
• A distance "metric" satisfies:
  Nonnegative: $dis(x^{(i)}, x^{(j)}) \ge 0$, and $dis(x^{(i)}, x^{(j)}) = 0$ iff $x^{(i)} = x^{(j)}$
  Symmetric: $dis(x^{(i)}, x^{(j)}) = dis(x^{(j)}, x^{(i)})$
  Triangular: $dis(x^{(i)}, x^{(j)}) \le dis(x^{(i)}, x^{(k)}) + dis(x^{(k)}, x^{(j)})$
• Minkowski distance (order $p$):
  $dis(x^{(i)}, x^{(j)}) = \big(\sum_{k=1}^{d} |x_k^{(i)} - x_k^{(j)}|^p\big)^{1/p}$
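A tiny numerical example of the Minkowski distance (the helper name is our own):

```python
# Minkowski distance of order p; p = 2 recovers the Euclidean distance.
import numpy as np

def minkowski(x, y, p=2):
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, p=1))   # 5.0   (city-block)
print(minkowski(x, y, p=2))   # 3.605... (Euclidean, sqrt(13))
```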
Distance metric
• (Training) data set: $D = [d_{ij}]_{1\le i,j\le N} = [dis(x^{(i)}, x^{(j)})]$, with no $X$ and no $S_X$
• Euclidean distance and inner product:
  $dis(x^{(i)}, x^{(j)}) = \|x^{(i)} - x^{(j)}\|_2 = \big(\sum_{k=1}^{d}(x_k^{(i)} - x_k^{(j)})^2\big)^{1/2}$
  $dis^2(x^{(i)}, x^{(j)}) = (x^{(i)} - x^{(j)})^{T}(x^{(i)} - x^{(j)}) = x^{(i)T}x^{(i)} - 2x^{(i)T}x^{(j)} + x^{(j)T}x^{(j)}$
  $\phantom{dis^2(x^{(i)}, x^{(j)})} = s_X(i,i) - 2s_X(i,j) + s_X(j,j)$
  $\Rightarrow s_X(i,j) = -\tfrac{1}{2}\{dis^2(x^{(i)}, x^{(j)}) - s_X(i,i) - s_X(j,j)\}$
Distance to inner product
• Define the squared-distance matrix: $D_2 = [d_{ij}^2]_{1\le i,j\le N} = [dis^2(x^{(i)}, x^{(j)})]$
• Double centering:
  $S_X = -\tfrac{1}{2}\big(D_2 - \tfrac{1}{N}D_2\,ee^{T} - \tfrac{1}{N}ee^{T}D_2 + \tfrac{1}{N^2}ee^{T}D_2\,ee^{T}\big)$
  $s_X(i,j) = -\tfrac{1}{2}\big(d_{ij}^2 - \tfrac{1}{N}\sum_k d_{ik}^2 - \tfrac{1}{N}\sum_m d_{mj}^2 + \tfrac{1}{N^2}\sum_m\sum_k d_{mk}^2\big)$
Proof
• Proof (data centered, so $\sum_m x^{(m)} = 0$):
  $\frac{1}{N}\sum_m d_{mj}^2 = \frac{1}{N}\sum_m \big[s_X(m,m) - 2s_X(m,j) + s_X(j,j)\big]$
  $= \frac{1}{N}\sum_m \langle x^{(m)}, x^{(m)}\rangle - \frac{2}{N}\big\langle \textstyle\sum_m x^{(m)}, x^{(j)}\big\rangle + \langle x^{(j)}, x^{(j)}\rangle$
  $= \frac{1}{N}\sum_m \langle x^{(m)}, x^{(m)}\rangle + \langle x^{(j)}, x^{(j)}\rangle$
  Similarly, $\frac{1}{N}\sum_k d_{ik}^2 = \frac{1}{N}\sum_k \langle x^{(k)}, x^{(k)}\rangle + \langle x^{(i)}, x^{(i)}\rangle$.
Proof
• Proof (continued):
  $\frac{1}{N^2}\sum_m\sum_k d_{mk}^2 = \frac{1}{N^2}\sum_m\sum_k \big[s_X(m,m) - 2s_X(m,k) + s_X(k,k)\big]$
  $= \frac{1}{N}\sum_m \langle x^{(m)}, x^{(m)}\rangle - \frac{2}{N^2}\big\langle \textstyle\sum_m x^{(m)}, \sum_k x^{(k)}\big\rangle + \frac{1}{N}\sum_k \langle x^{(k)}, x^{(k)}\rangle = \frac{2}{N}\sum_m \langle x^{(m)}, x^{(m)}\rangle$
• Finally:
  $-\tfrac{1}{2}\big(d_{ij}^2 - \tfrac{1}{N}\sum_k d_{ik}^2 - \tfrac{1}{N}\sum_m d_{mj}^2 + \tfrac{1}{N^2}\sum_m\sum_k d_{mk}^2\big) = -\tfrac{1}{2}\big(-2\langle x^{(i)}, x^{(j)}\rangle\big) = \langle x^{(i)}, x^{(j)}\rangle = s_X(i,j)$
Algorithm
• Given $X$:
  Compute $S = X^{T}X$, perform MDS
• Given $S$:
  Perform MDS
• Given $D$:
  Square each entry of $D$ (to obtain $D_2$)
  Perform double centering
  Perform MDS
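The "Given D" branch above can be written as a short sketch (classical MDS); the function name and the clipping of tiny negative eigenvalues are our own choices.

```python
# Classical MDS from a pairwise-distance matrix D: square, double-center, EVD.
import numpy as np

def classical_mds(D, p):
    N = D.shape[0]
    D2 = D ** 2                                   # squared-distance matrix D_2
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix I - (1/N) e e^T
    S = -0.5 * H @ D2 @ H                         # Gram matrix via double centering
    evals, evecs = np.linalg.eigh(S)              # ascending eigenvalues (S symmetric)
    order = np.argsort(-evals)[:p]                # keep the p largest
    lam = np.clip(evals[order], 0.0, None)        # guard against tiny negative values
    return (evecs[:, order] * np.sqrt(lam)).T     # Y = Lambda_p^{1/2} V_p^T, p x N
```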
Summary
• Metric MDS preserves pairwise inner products instead of pairwise distances.
• It preserves linear properties.
• Extensions:
  Sammon's nonlinear mapping: $E_{NLM} = \sum_{i=1}^{N}\sum_{j=1}^{N}\dfrac{\big(dis_X(i,j) - dis_Y(i,j)\big)^2}{dis_X(i,j)}$
  Curvilinear component analysis (CCA): $E_{CCA} = \dfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\big(dis_X(i,j) - dis_Y(i,j)\big)^2\, h\big(dis_Y(i,j)\big)$
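For reference, the Sammon stress above is straightforward to evaluate given the two distance matrices; this small sketch skips the diagonal terms (where $dis_X(i,i) = 0$) and assumes all off-diagonal high-D distances are nonzero.

```python
# Sammon's stress E_NLM from high-D and low-D pairwise-distance matrices.
import numpy as np

def sammon_stress(DX, DY):
    mask = ~np.eye(DX.shape[0], dtype=bool)       # exclude i == j terms
    return float(np.sum((DX[mask] - DY[mask]) ** 2 / DX[mask]))
```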
Linear
• PCA, LDA, MDS are linear:
  Matrix operations
  Linear properties (sum, scaling, commutativity, ...)
• Inner product, covariance:
  $\langle x^{(i)} + x^{(j)}, x^{(k)}\rangle = \langle x^{(i)}, x^{(k)}\rangle + \langle x^{(j)}, x^{(k)}\rangle$, i.e., $(x^{(i)} + x^{(j)})^{T}x^{(k)} = x^{(i)T}x^{(k)} + x^{(j)T}x^{(k)}$
• Assumption on the original feature space:
  Euclidean, or Euclidean up to rotation and scaling
Problem
• If there exists structure in the feature space:
  $(x_1, x_2, x_3) = g_1(s)$, $a \le s \le b$
  (Figure: a linear projection of the curve from $g_1(a)$ to $g_1(b)$ collapses ("crashes") the structure.)
Manifold way
• Assumption:
  The latent space is nonlinearly embedded in the feature space.
  The latent space is a manifold, and so is the feature space.
  The feature space is locally smooth and Euclidean.
• Local geometry or property:
  Distance preserving: ISOMAP
  Neighborhood (topology) preserving: LLE
  Locality (topology) preserving: LE
• Caution:
  These properties and structures are measured in the feature space.
Isometric Feature Mapping
(ISOMAP)
[4] J. B. Tenenbaum et al., A global geometric framework for
nonlinear dimensionality reduction, 2000
ISOMAP
• Distance metric in the feature space: geodesic distance
• How to measure it:
  Small scale: Euclidean distance in $\mathbb{R}^{d}$
  Large scale: shortest path in a connected graph
• The space to re-embed into:
  a $p$-dimensional Euclidean space
  Once the pairwise geodesic distances are obtained, they can be embedded into many kinds of spaces.
Graph
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
  The samples are the vertices of a graph.
  (Figure: samples $x^{(1)}, \ldots, x^{(N)}$ along a curve, assumed placed in order.)
Small scale
• Small scale: Euclidean distance; large scale: graph distance
  Vertices + edges (samples assumed placed in order):
  $dis(x^{(1)}, x^{(N)}) = \sum_{i=1}^{N-1}\|x^{(i)} - x^{(i+1)}\|_2$
Distance metric
• MDS: distance preserving
  Vertices + edges (samples assumed placed in order):
  $dis(y^{(1)}, y^{(N)}) = \|y^{(1)} - y^{(N)}\|_2 \approx dis(x^{(1)}, x^{(N)}) = \sum_{i=1}^{N-1}\|x^{(i)} - x^{(i+1)}\|_2$
Algorithm
• Presetting:
  Define the $N \times N$ distance matrix $D = [d_{ij}]_{1\le i,j\le N}$ (non-neighbor entries initialized to $\infty$)
  Set $Nei(i)$ as the neighbor set of $x^{(i)}$ (undirected)
• (1) Geodesic distance within neighborhoods:
  for $i = 1:N$
    for $j = 1:N$
      if ($x^{(j)} \in Nei(i)$ and $i \ne j$)
        $d_{ij} = \|x^{(i)} - x^{(j)}\|_2$
      end
    end
  end
Algorithm
• (1) Geodesic distance within neighborhoods:
  Neighbor definitions:
  $\varepsilon$-neighbor: $x^{(j)} \in Nei(i)$ iff $\|x^{(j)} - x^{(i)}\|_2 \le \varepsilon$
  $K$-NN: $x^{(j)} \in Nei(i)$ iff $x^{(j)} \in KNN(i)$ or $x^{(i)} \in KNN(j)$
• (2) Geodesic distance at large scale (shortest path), Floyd's algorithm:
  for each pair $(i, j)$
    for $k = 1:N$
      $d_{ij} = \min\{d_{ij},\, d_{ik} + d_{kj}\}$
    end
  end
  Run several rounds until convergence.
Algorithm
• (3) MDS:
  Transfer the pairwise distances into inner products (double centering):
  $S = -\tfrac{1}{2}HD_2H$, where $H = I_N - \tfrac{1}{N}ee^{T}$, i.e., $h(i,j) = \delta_{ij} - \tfrac{1}{N}$ (for centering)
  EVD: $S = U\Lambda U^{T}$, $Y = I_{p\times N}\Lambda^{1/2}U^{T}$ ($p \le d \le N-1$)
  Proof:
  $-\tfrac{1}{2}HD_2H = -\tfrac{1}{2}\big(I_N - \tfrac{1}{N}ee^{T}\big)D_2\big(I_N - \tfrac{1}{N}ee^{T}\big)$
  $= -\tfrac{1}{2}\big(D_2 - \tfrac{1}{N}ee^{T}D_2 - \tfrac{1}{N}D_2\,ee^{T} + \tfrac{1}{N^2}ee^{T}D_2\,ee^{T}\big) = S_X$
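Putting steps (1)-(3) together, a minimal ISOMAP sketch follows; it reuses the `classical_mds` sketch from the MDS section and assumes the K-NN graph is connected (otherwise infinite distances remain).

```python
# Minimal ISOMAP sketch: K-NN graph, Floyd-Warshall shortest paths, classical MDS.
# X is d x N (columns = samples); assumes the neighborhood graph is connected.
import numpy as np

def isomap(X, K, p):
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    E = np.sqrt((diff ** 2).sum(axis=0))           # N x N Euclidean distances
    # (1) geodesic distances inside the K-NN neighborhoods
    D = np.full((N, N), np.inf)
    np.fill_diagonal(D, 0.0)
    for i in range(N):
        nbrs = np.argsort(E[i])[1:K + 1]           # K nearest neighbors of x^(i)
        D[i, nbrs] = E[i, nbrs]
        D[nbrs, i] = E[nbrs, i]                    # keep the graph symmetric
    # (2) geodesic distances at large scale: shortest paths (Floyd-Warshall)
    for k in range(N):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    # (3) re-embed the geodesic distances with classical MDS (earlier sketch)
    return classical_mds(D, p)
```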
Summary
• Compared to MDS, ISOMAP has the ability to discover the underlying structure (latent variables) that is nonlinearly embedded in the feature space.
• It is a "global method": it preserves all pairwise distances.
• The Euclidean-space assumption in the low-D space implies convexity, which sometimes fails.
  (Figure: a non-convex example where the pairwise distances cannot all be preserved in a Euclidean embedding.)
Locally Linear Embedding
(LLE)
[5] S. T. Roweis et al., Nonlinear dimensionality reduction by locally
linear embedding, 2000
[6] L. K. Saul et al., Think globally, fit locally, 2003
LLE
• Neighborhood preserving:
  Based on the fundamental manifold properties.
  Preserve the local geometry of each sample and its neighbors.
  Ignore the global geometry at large scale.
• Assumption:
  Well-sampled with sufficient data.
  Each sample and its neighbors lie on or close to a locally linear patch (sub-plane) of the manifold.
LLE
• Properties:
  Local geometry is characterized by the linear coefficients that reconstruct each sample from its neighbors.
  These coefficients are invariant to RST: rotation, scaling, and translation.
• Re-embedding:
  Assume the target space is locally smooth (a manifold).
  Locally Euclidean, but not necessarily at large scale.
  The reconstruction coefficients are still meaningful.
  Stitch the local patches onto the low-D global coordinates.
Algorithm
• Presetting:
  Define the $N \times N$ weight matrix $W = [w_{ij}]_{1\le i,j\le N}$, initialized to 0; its $i$-th row is $w^{(i)T}$, so $W = [w^{(1)}, w^{(2)}, \ldots, w^{(N)}]^{T}$
  Set $Nei(i)$ as the neighbor set of $x^{(i)}$ (undirected)
• (1) Find the neighbors of each sample:
  $\varepsilon$-neighbor: $x^{(j)} \in Nei(i)$ iff $\|x^{(j)} - x^{(i)}\|_2 \le \varepsilon$
  $K$-NN: $x^{(j)} \in Nei(i)$ iff $x^{(j)} \in KNN(i)$ (or $x^{(i)} \in KNN(j)$)
Algorithm
• (2) Linear reconstruction coefficients:
  Objective function:
  $\min_W E(W) = \min_W \sum_{i=1}^{N}\big\|x^{(i)} - \sum_{j=1}^{N} w_{ij}x^{(j)}\big\|_2^2 = \min_W \sum_{i=1}^{N}\big\|x^{(i)} - Xw^{(i)}\big\|_2^2$
  Constraints (for RST invariance):
  for all $i$: $w_{ij} = 0$ if $x^{(j)} \notin Nei(i)$, and $\sum_{j=1}^{N} w_{ij} = 1$
  Translation invariance check: if $x^{(j)} \leftarrow x^{(j)} + x^{(0)}$, then
  $\big(x^{(i)} + x^{(0)}\big) - \sum_j w_{ij}\big(x^{(j)} + x^{(0)}\big) = x^{(i)} - \sum_j w_{ij}x^{(j)}$ (since $\sum_j w_{ij} = 1$)
Algorithm
• (2) Linear reconstruction coefficients (for each sample $i$):
  Define the neighbor indices of $x^{(i)}$ as $h_1, \ldots, h_m$, $m = |Nei(i)|$, and
  $X^{(i)} = [x^{(h_1)}, \ldots, x^{(h_m)}]$ ($d \times m$), $\tilde{w}^{(i)} \in \mathbb{R}^{m}$ with $\mathbf{1}^{T}\tilde{w}^{(i)} = 1$
  $E^{(i)}(\tilde{w}^{(i)}) = \big\|x^{(i)} - X^{(i)}\tilde{w}^{(i)}\big\|_2^2 = \big\|\big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)\tilde{w}^{(i)}\big\|_2^2$ (using $\mathbf{1}^{T}\tilde{w}^{(i)} = 1$)
  $= \tilde{w}^{(i)T}\big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)^{T}\big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)\tilde{w}^{(i)} = \tilde{w}^{(i)T}C^{(i)}\tilde{w}^{(i)}$
  where $C^{(i)} = \big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)^{T}\big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)$ is the local Gram matrix.
Algorithm
• (2) Linear reconstruction coefficients:
  Minimize $\tilde{w}^{T}C\tilde{w}$ subject to $\mathbf{1}^{T}\tilde{w} = 1$ (Lagrange multiplier):
  $E(\tilde{w}, \lambda) = \tilde{w}^{T}C\tilde{w} - \lambda(\mathbf{1}^{T}\tilde{w} - 1)$, $\frac{\partial E}{\partial\tilde{w}} = 2C\tilde{w} - \lambda\mathbf{1} = 0 \Rightarrow \tilde{w} \propto C^{-1}\mathbf{1}$
  Algorithm (run for each sample $i$):
  1. $C^{(i)} = \big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)^{T}\big(x^{(i)}\mathbf{1}^{T} - X^{(i)}\big)$
  2. $\tilde{w}^{(i)} = \dfrac{C^{(i)-1}\mathbf{1}}{\mathbf{1}^{T}C^{(i)-1}\mathbf{1}}$
  3. for $m = 1:|Nei(i)|$: $w_{i,h_m} = \tilde{w}^{(i)}_m$
Algorithm
• (3) Re-embedding (minimize the reconstruction error again, now over $Y$):
  $\min_Y \sum_{i=1}^{N}\big\|y^{(i)} - \sum_{j=1}^{N} w_{ij}y^{(j)}\big\|^2$, with $Y_{p\times N} = [y^{(1)}, \ldots, y^{(N)}]$
  $= \big\|Y - YW^{T}\big\|_F^2 = \mathrm{tr}\big\{(Y - YW^{T})(Y - YW^{T})^{T}\big\}$
  $= \mathrm{tr}\big\{Y(I_N - W^{T})(I_N - W)Y^{T}\big\} = \mathrm{tr}\big\{Y(I_N - W - W^{T} + W^{T}W)Y^{T}\big\}$
Algorithm
• (3) Re-embedding:
  Definition: $M = (I_N - W)^{T}(I_N - W)$, i.e., $m_{ij} = \delta_{ij} - w_{ij} - w_{ji} + \sum_{k=1}^{N} w_{ki}w_{kj}$
  Constraints (to avoid degenerate solutions):
  $\sum_{n=1}^{N} y^{(n)} = 0$, $\frac{1}{N}\sum_{n=1}^{N} y^{(n)}y^{(n)T} = \frac{1}{N}YY^{T} = I_p$
  Optimization:
  $Y^{*} = \arg\min_Y \mathrm{tr}(YMY^{T})$, subject to the constraints above; apply the Rayleigh–Ritz theorem.
Algorithm
• (3) Re-embedding:
  Additional property (each row of $M$ sums to 0):
  $\sum_{j=1}^{N} m_{ij} = \sum_j\big[\delta_{ij} - w_{ij} - w_{ji} + \sum_k w_{ki}w_{kj}\big] = 1 - 1 - \sum_j w_{ji} + \sum_k w_{ki} = 0$
  $\Rightarrow \mathbf{1}_N$ is an eigenvector of $M$ with eigenvalue 0.
• Solution (EVD): $M = U\Lambda U^{T}$ with eigenvalues sorted in ascending order;
  $Y^{*} = \arg\min_Y \mathrm{tr}(YMY^{T})$ is formed from the bottom eigenvectors of $M$, discarding $\mathbf{1}_N$ (the one with $\lambda = 0$).
Algorithm
• Proof: assume $p$ is 1, $Y = q^{T}$ with $q \in \mathbb{R}^{N}$:
  $q^{*} = \arg\min_q \mathrm{tr}(q^{T}Mq) = \arg\min_q q^{T}Mq$, s.t. $\tfrac{1}{N}q^{T}q = 1$
  $E(q, \lambda) = q^{T}Mq - \lambda\big(\tfrac{1}{N}q^{T}q - 1\big)$, $\frac{\partial E}{\partial q} = 2Mq - \tfrac{2\lambda}{N}q = 0 \Rightarrow q$ is an eigenvector of $M$
  $\Rightarrow q^{*}$ is the eigenvector with the smallest nonzero eigenvalue,
  with $q^{*} \perp \mathbf{1}_N$, because $\mathbf{1}_N$ is the eigenvector with $\lambda = 0$.
  $Y_{p\times N} = [q^{(1)}, \ldots, q^{(p)}]^{T}$: each column is a sample $y^{(n)}$, each row an embedding dimension.
Algorithm
  Assume $p$ is $r+1$, $Y = [q^{(1)}, \ldots, q^{(r)}, q]^{T}$:
  $\mathrm{tr}(YMY^{T}) = \sum_{i=1}^{r} q^{(i)T}Mq^{(i)} + q^{T}Mq$
  $q^{*} = \arg\min_q q^{T}Mq$ (with $q$ orthogonal to $q^{(1)}, \ldots, q^{(r)}$ and to $\mathbf{1}_N$)
  $\Rightarrow$ the eigenvector of $M = U\Lambda U^{T}$ with the $(r+1)$-th smallest nonzero eigenvalue.
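A minimal end-to-end LLE sketch following steps (1)-(3); the regularizer `reg` added to the local Gram matrix is our own assumption (needed when K > d and C^(i) is singular), and the scaling by sqrt(N) matches the constraint (1/N)YY^T = I used above.

```python
# Minimal LLE sketch: K-NN, reconstruction weights, bottom eigenvectors of M.
import numpy as np

def lle(X, K, p, reg=1e-3):
    d, N = X.shape
    diff = X[:, :, None] - X[:, None, :]
    E = np.sqrt((diff ** 2).sum(axis=0))               # pairwise distances
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(E[i])[1:K + 1]               # (1) K nearest neighbors
        Z = X[:, nbrs] - X[:, [i]]                     # neighbors shifted to x^(i)
        C = Z.T @ Z                                    # (2) local Gram matrix C^(i)
        C += reg * np.trace(C) * np.eye(K)             # regularization (assumption)
        w = np.linalg.solve(C, np.ones(K))             # w proportional to C^{-1} 1
        W[i, nbrs] = w / w.sum()                       # enforce sum-to-one
    # (3) bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    evals, evecs = np.linalg.eigh(M)                   # ascending eigenvalues
    return np.sqrt(N) * evecs[:, 1:p + 1].T            # p x N embedding
```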
Summary
• Although the global geometry isn't explicitly preserved in LLE, it can still be reconstructed from the overlapping local neighborhoods.
• The matrix M on which the EVD is performed is in fact sparse.
• K is a key factor in LLE, as it is in ISOMAP.
• LLE cannot handle holes very well.
Laplacian eigenmap
[7] M. Belkin et al., Laplacian eigenmaps for dimensionality
reduction and data representation, 2003
Review and Comparison
• Data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$, low-D: $Y = \{y^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$
• ISOMAP (isometric embedding):
  $dis_{geodesic}(x^{(i)}, x^{(j)}) \approx dis(y^{(i)}, y^{(j)}) = \|y^{(i)} - y^{(j)}\|_2$
• LLE (neighborhood preserving):
  $\min_W \sum_{i=1}^{N}\big\|x^{(i)} - \sum_{j=1}^{N} w_{ij}x^{(j)}\big\|_2^2 \;\rightarrow\; \min_Y \sum_{i=1}^{N}\big\|y^{(i)} - \sum_{j=1}^{N} w_{ij}y^{(j)}\big\|_2^2$
Laplacian eigenmap (LE)
• LE:
  model: $f_l: M \rightarrow \mathbb{R}$, $x^{(n)} \mapsto y_l^{(n)} = f_l(x^{(n)})$, defined on the manifold $M \subset \mathbb{R}^{d}$ ($l = 1, \ldots, p$)
  criterion: if $x^{(i)}$ and $x^{(j)}$ are close on $M$, then $f_l(x^{(i)})$ and $f_l(x^{(j)})$ should be close:
  $\arg\min_{f_l}\sum_{i=1}^{N}\sum_{j=1}^{N}\big(f_l(x^{(i)}) - f_l(x^{(j)})\big)^2 w_{ij} = \arg\min\sum_{i=1}^{N}\sum_{j=1}^{N}\big(y_l^{(i)} - y_l^{(j)}\big)^2 w_{ij}$
  where $w_{ij}$ is the sample similarity.
General setting
• (Training) data set: high-D: $X = \{x^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{d}$
• Preprocessing: centering (the mean can be added back):
  $\bar{x} = \frac{1}{N}\sum_n x^{(n)}$, $x^{(n)} \leftarrow x^{(n)} - \bar{x}$
• Want to achieve: low-D: $Y = \{y^{(n)}\}_{n=1}^{N} \subset \mathbb{R}^{p}$
Algorithm
• Fundamental:
  the Laplace–Beltrami operator (for smoothness)
• Presetting:
  Define the $N \times N$ weight matrix $W = [w_{ij}]_{1\le i,j\le N}$, initialized to 0
• (1) Neighborhood definition:
  Set $Nei(i)$ as the neighbor set of $x^{(i)}$ (undirected)
  $\varepsilon$-neighbor: $x^{(j)} \in Nei(i)$ iff $\|x^{(j)} - x^{(i)}\|_2 \le \varepsilon$
  $K$-NN: $x^{(j)} \in Nei(i)$ iff $x^{(j)} \in KNN(i)$ or $x^{(i)} \in KNN(j)$
Algorithm
• (2) Weight computation (heat kernel):
  $w_{ij} = \exp\big(-\tfrac{\|x^{(i)} - x^{(j)}\|_2^2}{t}\big)$ if $x^{(j)} \in Nei(i)$, and $w_{ij} = w_{ji}$
• (3) Re-embedding:
  $E(Y) = \frac{1}{2}\sum_{l=1}^{p}\sum_{i=1}^{N}\sum_{j=1}^{N}\big(y_l^{(i)} - y_l^{(j)}\big)^2 w_{ij} = \frac{1}{2}\sum_i\sum_j \big\|y^{(i)} - y^{(j)}\big\|_2^2\, w_{ij}$
  $= \frac{1}{2}\sum_i\sum_j\big(y^{(i)T}y^{(i)} - 2y^{(i)T}y^{(j)} + y^{(j)T}y^{(j)}\big)w_{ij}$
  $= \frac{1}{2}\Big(\sum_i\sum_j y^{(i)T}y^{(i)}w_{ij} + \sum_i\sum_j y^{(j)T}y^{(j)}w_{ij} - 2\sum_i\sum_j y^{(i)T}y^{(j)}w_{ij}\Big)$
Algorithm
• (3) Re-embedding:
  $E(Y) = \frac{1}{2}\Big(2\sum_i d_{ii}\,y^{(i)T}y^{(i)} - 2\sum_i\sum_j w_{ij}\,y^{(i)T}y^{(j)}\Big)$ ...... ignore the scalar
  $= \mathrm{tr}(YDY^{T}) - \mathrm{tr}(YWY^{T}) = \mathrm{tr}\big(Y(D - W)Y^{T}\big) = \mathrm{tr}(YLY^{T})$
  where $D$ is the $N \times N$ diagonal matrix with $d_{ii} = \sum_j w_{ij} = \sum_j w_{ji}$, and $L = D - W$ is the graph Laplacian.
Optimization
• Optimization:
  $Y^{*} = \arg\min_Y \mathrm{tr}(YLY^{T})$, subject to $YDY^{T} = I_p$
  Solution: the generalized EVD $L\,u^{(i)} = \lambda_i D\,u^{(i)}$; take the eigenvectors with the smallest nonzero eigenvalues
  ($\mathbf{1}_N$ is a generalized eigenvector with $\lambda = 0$ and is discarded).
• Constraint:
  $YDY^{T} = \sum_{i=1}^{N} d_{ii}\,y^{(i)}y^{(i)T} = I_p$
  (samples with large $d_{ii}$ carry more weight in the constraint than those with small $d_{ii}$).
Optimization
  Proof: assume $p$ is 1, $Y = q^{T}$:
  $\min_q \mathrm{tr}(q^{T}Lq) = \min_q q^{T}Lq$, s.t. $q^{T}Dq = 1$
  $E(q, \lambda) = q^{T}Lq - \lambda(q^{T}Dq - 1)$, $\frac{\partial E}{\partial q} = 2Lq - 2\lambda Dq = 0 \Rightarrow Lq = \lambda Dq$ (generalized eigenvector)
  $q^{T}Lq = \lambda\,q^{T}Dq = \lambda$
  $\Rightarrow q^{*}$ is the generalized eigenvector of $(L, D)$ with the smallest nonzero eigenvalue
  ($q^{*} \perp D\mathbf{1}_N$, because $L\mathbf{1}_N = 0$, i.e., $\mathbf{1}_N$ has $\lambda = 0$).
  $Y_{p\times N} = [q^{(1)}, \ldots, q^{(p)}]^{T}$
Optimization
  Assume $p$ is $r+1$, $Y = [q^{(1)}, \ldots, q^{(r)}, q]^{T}$, with $q^{(i)T}Dq^{(j)} = \delta_{ij}$:
  $\mathrm{tr}(YLY^{T}) = \sum_{i=1}^{r} q^{(i)T}Lq^{(i)} + q^{T}Lq$
  $q^{*} = \arg\min_q q^{T}Lq$, s.t. $q^{T}Dq = 1$, $q^{T}Dq^{(i)} = 0$ $\Rightarrow$ the $(r+1)$-th smallest generalized eigenvector of $(L, D)$.
  Proof (reduction to an ordinary EVD): $LU = DU\Lambda$ ...... set $A = D^{1/2}U$
  $\Rightarrow D^{-1/2}LD^{-1/2}A = A\Lambda$
  $D^{-1/2}LD^{-1/2}$ is Hermitian, so $A^{T}A = I$, i.e., $U^{T}DU = I$, with $U = D^{-1/2}A$.
  Spectral clustering uses $Y = (D^{1/2}U)^{T}$, i.e., the eigenvectors $A$ of $D^{-1/2}LD^{-1/2}$.
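Finally, a minimal Laplacian-eigenmap sketch that mirrors steps (1)-(3) and the generalized eigenproblem L u = lambda D u; the K-NN construction and the use of SciPy's generalized `eigh` are our own choices.

```python
# Minimal Laplacian-eigenmap sketch: K-NN graph, heat-kernel weights, L u = lambda D u.
import numpy as np
from scipy.linalg import eigh                       # handles the generalized problem

def laplacian_eigenmap(X, K, p, t=1.0):
    N = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]
    E2 = (diff ** 2).sum(axis=0)                    # squared Euclidean distances
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(E2[i])[1:K + 1]           # (1) K nearest neighbors
        W[i, nbrs] = np.exp(-E2[i, nbrs] / t)       # (2) heat-kernel weights
    W = np.maximum(W, W.T)                          # symmetrize: w_ij = w_ji
    D = np.diag(W.sum(axis=1))                      # degree matrix d_ii = sum_j w_ij
    L = D - W                                       # graph Laplacian
    evals, evecs = eigh(L, D)                       # (3) generalized EVD, ascending
    return evecs[:, 1:p + 1].T                      # skip the constant eigenvector
```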
• Constraints used in LLE and LE:
  $YY^{T} = I$ or $YDY^{T} = I$
• $I$ can be replaced by a diagonal matrix with positive entries $B = \mathrm{diag}(b_{11}, \ldots, b_{pp})$:
  $YY^{T} = B$ or $YDY^{T} = B$, i.e., $\sum_n \big(y_i^{(n)}\big)^2 = b_{ii}$
  Is the constraint meaningful?