
Eric Xing 1

Machine Learning

10-701/15-781, Spring 2008

Factor Analysis and Metric Learning

Eric Xing

Lecture 26, April 23, 2008

Reading: Chap. 12.3-4 of C. Bishop's book

Eric Xing 2

Outline

Probabilistic PCA (brief)

Factor Analysis (in some detail)

ICA (will skip)

Distance metric learning from very little side info (a very cool method)


Eric Xing 3

Recap of PCA

Popular dimensionality reduction technique: project data onto the directions of greatest variation.

The leading principal direction solves
$u_1 = \arg\max_u\, u^T \mathrm{Cov}(y)\, u = \arg\max_u\, \frac{1}{m}\sum_{i=1}^m u^T y_i y_i^T u = \arg\max_u\, \frac{1}{m}\sum_{i=1}^m (u^T y_i)^2 .$

Projecting onto the top q eigenvectors $U_q = [u_1, \dots, u_q]$ gives the low-dimensional representation
$x_i = U_q^T y_i = \begin{bmatrix} u_1^T y_i \\ \vdots \\ u_q^T y_i \end{bmatrix} \in \mathbb{R}^q .$

Consequence: the $x_i$ are uncorrelated, so their covariance matrix is diagonal, $\mathrm{diag}(\gamma_1, \dots, \gamma_q)$.

Truncation error: with the sample covariance $S = \frac{1}{m}\sum_{i=1}^m y_i y_i^T = \sum_{k=1}^K \gamma_k u_k u_k^T$, keeping only $q < K$ terms gives the approximation $\Sigma_y \approx \sum_{k=1}^q \gamma_k u_k u_k^T$, and the data are reconstructed as $y_i \approx U_q x_i$.
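To make the recap concrete, here is a minimal numpy sketch of PCA via the eigendecomposition of the sample covariance (an illustration added for this write-up, not code from the lecture):

import numpy as np

def pca(Y, q):
    """Project the rows of Y (shape m x d) onto the top-q principal directions.
    Returns the scores X (m x q), the directions U_q (d x q), and the eigenvalues."""
    Yc = Y - Y.mean(axis=0)                  # center the data first
    S = (Yc.T @ Yc) / len(Yc)                # sample covariance S = (1/m) sum_i y_i y_i^T
    evals, evecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # sort descending by variance
    U_q = evecs[:, order[:q]]                # top-q eigenvectors (directions of greatest variation)
    X = Yc @ U_q                             # scores x_i = U_q^T y_i
    return X, U_q, evals[order]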

Eric Xing 4

Recap of PCA

Popular dimensionality reduction technique: project data onto the directions of greatest variation.

Useful tool for visualising patterns and clusters within the data set, but ...

Needs centering

Does not explicitly model data noise


Eric Xing 5

Probabilistic Interpretation?

[Figure: two graphical models relating a continuous X and a continuous Y; the first is labeled "regression", the second "?" (what is the corresponding unsupervised model?).]

Eric Xing 6

Probabilistic PCA

PCA can be cast as a probabilistic model with q-dimensional latent variables:
$y_n = \Lambda x_n + \mu + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2 I), \qquad x_n \sim \mathcal{N}(0, I).$

The resulting data distribution is
$y_n \sim \mathcal{N}(\mu, \Lambda\Lambda^T + \sigma^2 I).$

The maximum likelihood solution is equivalent to PCA:
$\mu_{ML} = \frac{1}{N}\sum_n y_n, \qquad \Lambda_{ML} = U_q\,(\Gamma_q - \sigma^2 I)^{1/2},$
where the diagonal matrix $\Gamma_q$ contains the top q sample-covariance eigenvalues and $U_q$ contains the associated eigenvectors.

Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999).
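A minimal numpy sketch of this closed-form ML fit (an added illustration; it also uses the standard result, not shown on the slide, that sigma^2_ML is the average of the discarded eigenvalues):

import numpy as np

def ppca_ml(Y, q):
    """Closed-form ML fit of probabilistic PCA.
    Y: (N, d) data matrix, q: latent dimension. Returns mu, Lambda, sigma2."""
    N, d = Y.shape
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False, bias=True)           # sample covariance
    evals, evecs = np.linalg.eigh(S)                  # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]        # descending
    sigma2 = evals[q:].mean()                         # average of the discarded eigenvalues
    Lambda = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return mu, Lambda, sigma2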


Eric Xing 7

Factor analysis

An unsupervised linear regression model:
$p(x) = \mathcal{N}(x;\, 0, I)$
$p(y \mid x) = \mathcal{N}(y;\, \mu + \Lambda x,\, \Psi)$
where $\Lambda$ is called the factor loading matrix and $\Psi$ is diagonal.

[Graphical model: latent x generating observed y.]

Geometric interpretation: to generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Eric Xing 8

Relationship between PCA and FA

Probabilistic PCA is equivalent to factor analysis with equal noise for every dimension, i.e., an isotropic Gaussian $\epsilon_n \sim \mathcal{N}(0, \sigma^2 I)$.

In factor analysis the noise has a general diagonal covariance matrix, $\epsilon_n \sim \mathcal{N}(0, \Psi)$ with $\Psi$ diagonal.

An iterative algorithm (e.g., EM) is required to find the parameters if the precisions are not known in advance.


Eric Xing 9

Factor analysis

An unsupervised linear regression model:
$p(x) = \mathcal{N}(x;\, 0, I)$
$p(y \mid x) = \mathcal{N}(y;\, \mu + \Lambda x,\, \Psi)$
where $\Lambda$ is called the factor loading matrix and $\Psi$ is diagonal.

[Graphical model: latent x generating observed y.]

Geometric interpretation: to generate data, first generate a point within the manifold, then add noise. The coordinates of the point are the components of the latent variable.

Eric Xing 10

Marginal data distribution

A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is a joint Gaussian. Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x,y)) is also a Gaussian.

Since the marginal is Gaussian, we can determine it by just computing its mean and variance. (Assume the noise is uncorrelated with the data.) Writing $Y = \mu + \Lambda X + W$ with $W \sim \mathcal{N}(0, \Psi)$:

$E[Y] = E[\mu + \Lambda X + W] = \mu + \Lambda E[X] + E[W] = \mu + 0 + 0 = \mu$

$\mathrm{Var}[Y] = E[(Y - \mu)(Y - \mu)^T] = E[(\mu + \Lambda X + W - \mu)(\mu + \Lambda X + W - \mu)^T]$
$\qquad = E[(\Lambda X + W)(\Lambda X + W)^T] = \Lambda E[X X^T]\Lambda^T + E[W W^T] = \Lambda\Lambda^T + \Psi$
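A quick simulation confirms these moments; the sketch below is an added illustration with arbitrary made-up sizes and parameters:

import numpy as np

rng = np.random.default_rng(0)
d, q, N = 5, 2, 200_000                       # observed dim, latent dim, samples (arbitrary)
Lam = rng.normal(size=(d, q))                 # factor loading matrix
Psi = np.diag(rng.uniform(0.1, 1.0, size=d))  # diagonal noise covariance
mu = rng.normal(size=d)

X = rng.normal(size=(N, q))                                   # x_n ~ N(0, I)
W = rng.multivariate_normal(np.zeros(d), Psi, size=N)         # w_n ~ N(0, Psi)
Y = mu + X @ Lam.T + W                                        # y_n = mu + Lambda x_n + w_n

print(np.allclose(Y.mean(axis=0), mu, atol=0.02))                          # mean ~ mu
print(np.allclose(np.cov(Y, rowvar=False), Lam @ Lam.T + Psi, atol=0.05))  # cov ~ LL^T + Psi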


Eric Xing 11

FA = Constrained-Covariance Gaussian

Marginal density for factor analysis (y is p-dimensional, x is k-dimensional):
$p(y \mid \theta) = \mathcal{N}(y;\, \mu,\, \Lambda\Lambda^T + \Psi)$

So the effective covariance is the low-rank outer product of two long skinny matrices plus a diagonal matrix.

In other words, factor analysis is just a constrained Gaussian model. (If $\Psi$ were not diagonal then we could model any Gaussian and it would be pointless.)

Eric Xing 12

Review: A primer on the multivariate Gaussian

Multivariate Gaussian density:
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\}$

A joint Gaussian:
$p\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \mu, \Sigma \right) = \mathcal{N}\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\Big|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$

How do we write down p(x1), p(x1|x2) or p(x2|x1) using the block elements of μ and Σ?

Formulas to remember:

Marginal: $p(x_2) = \mathcal{N}(x_2 \mid m_2, V_2)$ with $m_2 = \mu_2$ and $V_2 = \Sigma_{22}$.

Conditional: $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid m_{1|2}, V_{1|2})$ with
$m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$
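These two formulas are easy to wrap in a small helper; the sketch below is an added illustration (block 1 plays the role of x and block 2 the role of y in the factor analysis application that follows):

import numpy as np

def gaussian_condition(mu, Sigma, idx1, idx2, x2):
    """Conditional p(x1 | x2) of a joint Gaussian N(mu, Sigma).
    idx1/idx2: index arrays of the two blocks; x2: observed value of block 2."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)       # Sigma_12 Sigma_22^{-1}
    m = mu1 + K @ (x2 - mu2)           # conditional mean m_{1|2}
    V = S11 - K @ S12.T                # conditional covariance V_{1|2}
    return m, V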


Eric Xing 13

Review: Some matrix algebra

Trace:
$\mathrm{tr}[A] \stackrel{\text{def}}{=} \sum_i a_{ii}$

Cyclical permutations:
$\mathrm{tr}[ABC] = \mathrm{tr}[CAB] = \mathrm{tr}[BCA]$

Derivatives:
$\frac{\partial}{\partial A}\mathrm{tr}[BA] = B^T, \qquad \frac{\partial}{\partial A}\mathrm{tr}[x^T A x] = \frac{\partial}{\partial A}\mathrm{tr}[x x^T A] = x x^T$

Determinants and derivatives:
$\frac{\partial}{\partial A}\log|A| = A^{-T}$

Eric Xing 14

FA joint distribution

Model:
$p(x) = \mathcal{N}(x;\, 0, I), \qquad p(y \mid x) = \mathcal{N}(y;\, \mu + \Lambda x,\, \Psi)$

Covariance between x and y (assume the noise is uncorrelated with the data and the latent variables):
$\mathrm{Cov}[X, Y] = E[(X - 0)(Y - \mu)^T] = E[X(\mu + \Lambda X + W - \mu)^T] = E[X X^T \Lambda^T + X W^T] = \Lambda^T$

Hence the joint distribution of x and y:
$p\!\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \mathcal{N}\!\left( \begin{bmatrix} x \\ y \end{bmatrix} \,\Big|\, \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{bmatrix} \right)$


Eric Xing 15

Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived above, where
$\Sigma_{11} = I, \qquad \Sigma_{12} = \Sigma_{21}^T = \Lambda^T, \qquad \Sigma_{22} = \Lambda\Lambda^T + \Psi.$

We can now derive the posterior of the latent variable x given observation y, $p(x \mid y) = \mathcal{N}(x \mid m_{1|2}, V_{1|2})$, where
$m_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2) = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}(y - \mu)$
$V_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda$

Applying the matrix inversion lemma
$(E - FH^{-1}G)^{-1} = E^{-1} + E^{-1}F(H - GE^{-1}F)^{-1}GE^{-1}$
we get
$V_{1|2} = (I + \Lambda^T\Psi^{-1}\Lambda)^{-1}, \qquad m_{1|2} = V_{1|2}\,\Lambda^T\Psi^{-1}(y - \mu).$

Here we only need to invert a matrix of size |x|×|x|, instead of |y|×|y|.
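The equivalence of the two expressions for the posterior is easy to verify numerically; the sketch below is an added illustration with arbitrary sizes:

import numpy as np

rng = np.random.default_rng(1)
d, q = 6, 2                                    # |y|, |x| (arbitrary)
Lam = rng.normal(size=(d, q))
Psi = np.diag(rng.uniform(0.2, 1.0, size=d))
mu = rng.normal(size=d)
y = rng.normal(size=d)

# Direct conditioning: invert the (d x d) matrix Lam Lam^T + Psi.
C = Lam @ Lam.T + Psi
V_big = np.eye(q) - Lam.T @ np.linalg.solve(C, Lam)
m_big = Lam.T @ np.linalg.solve(C, y - mu)

# After the matrix inversion lemma: invert only a (q x q) matrix.
V_small = np.linalg.inv(np.eye(q) + Lam.T @ np.linalg.inv(Psi) @ Lam)
m_small = V_small @ Lam.T @ np.linalg.inv(Psi) @ (y - mu)

print(np.allclose(V_big, V_small), np.allclose(m_big, m_small))   # both True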

Eric Xing 16

Geometric interpretation: inference is linear projection

The posterior is
$p(x \mid y) = \mathcal{N}(x;\, m_{1|2}, V_{1|2}), \qquad V_{1|2} = (I + \Lambda^T\Psi^{-1}\Lambda)^{-1}, \qquad m_{1|2} = V_{1|2}\,\Lambda^T\Psi^{-1}(y - \mu).$

The posterior covariance does not depend on the observed data y! Computing the posterior mean is just a linear operation.


Eric Xing 17

EM for Factor Analysis

Incomplete-data log likelihood function (marginal density of y):
$\ell(\theta; D) = -\tfrac{1}{2}\sum_n \log|\Lambda\Lambda^T + \Psi| - \tfrac{1}{2}\sum_n (y_n - \mu)^T(\Lambda\Lambda^T + \Psi)^{-1}(y_n - \mu)$
$\qquad = -\tfrac{N}{2}\log|\Lambda\Lambda^T + \Psi| - \tfrac{N}{2}\,\mathrm{tr}\!\left[(\Lambda\Lambda^T + \Psi)^{-1} S\right], \qquad \text{where } S = \tfrac{1}{N}\sum_n (y_n - \mu)(y_n - \mu)^T.$

Estimating μ is trivial: $\hat{\mu}_{ML} = \tfrac{1}{N}\sum_n y_n$. But the parameters Λ and Ψ are coupled nonlinearly in the log-likelihood.

Complete-data log likelihood:
$\ell_c(\theta; D) = \sum_n \log p(x_n, y_n) = \sum_n \log p(x_n) + \sum_n \log p(y_n \mid x_n)$
$\qquad = -\tfrac{N}{2}\log|I| - \tfrac{1}{2}\sum_n x_n^T x_n - \tfrac{N}{2}\log|\Psi| - \tfrac{1}{2}\sum_n (y_n - \Lambda x_n)^T\Psi^{-1}(y_n - \Lambda x_n)$
$\qquad = -\tfrac{1}{2}\sum_n x_n^T x_n - \tfrac{N}{2}\log|\Psi| - \tfrac{N}{2}\,\mathrm{tr}\!\left[S\Psi^{-1}\right], \qquad \text{where } S = \tfrac{1}{N}\sum_n (y_n - \Lambda x_n)(y_n - \Lambda x_n)^T.$

Eric Xing 18

E-step for Factor Analysis

Compute the expected complete-data log likelihood $\langle \ell_c(\theta; D) \rangle_{p(x|y)}$:
$\langle \ell_c(\theta; D) \rangle = -\tfrac{N}{2}\log|\Psi| - \tfrac{1}{2}\sum_n \langle X_n^T X_n \rangle - \tfrac{N}{2}\,\mathrm{tr}\!\left[\langle S \rangle \Psi^{-1}\right]$
where
$\langle S \rangle = \tfrac{1}{N}\sum_n \left( y_n y_n^T - y_n\langle X_n\rangle^T\Lambda^T - \Lambda\langle X_n\rangle y_n^T + \Lambda\langle X_n X_n^T\rangle\Lambda^T \right)$
$\langle X_n \rangle = E[X_n \mid y_n], \qquad \langle X_n X_n^T \rangle = \mathrm{Var}[X_n \mid y_n] + E[X_n \mid y_n]\,E[X_n \mid y_n]^T.$

Recall that we have derived:
$V_{1|2} = (I + \Lambda^T\Psi^{-1}\Lambda)^{-1}, \qquad m_{1|2} = V_{1|2}\,\Lambda^T\Psi^{-1}(y - \mu)$
$\Rightarrow \quad \langle X_n \rangle = m_{x_n|y_n} = V_{1|2}\,\Lambda^T\Psi^{-1}(y_n - \mu), \qquad \langle X_n X_n^T \rangle = V_{1|2} + m_{x_n|y_n} m_{x_n|y_n}^T.$


Eric Xing 19

M-step for Factor Analysis

Take the derivatives of the expected complete log likelihood with respect to the parameters, using the trace and determinant derivative rules:

$\frac{\partial \langle \ell_c \rangle}{\partial \Psi^{-1}} = \frac{\partial}{\partial \Psi^{-1}}\left( -\tfrac{N}{2}\log|\Psi| - \tfrac{N}{2}\,\mathrm{tr}[\langle S\rangle\Psi^{-1}] \right) = \tfrac{N}{2}\Psi - \tfrac{N}{2}\langle S\rangle \quad \Rightarrow \quad \Psi^{t+1} = \langle S \rangle$

$\frac{\partial \langle \ell_c \rangle}{\partial \Lambda} = -\tfrac{N}{2}\,\frac{\partial}{\partial \Lambda}\mathrm{tr}[\langle S\rangle\Psi^{-1}] = \Psi^{-1}\sum_n y_n\langle X_n\rangle^T - \Psi^{-1}\Lambda\sum_n \langle X_n X_n^T\rangle$
$\Rightarrow \quad \Lambda^{t+1} = \left( \sum_n y_n\langle X_n\rangle^T \right)\left( \sum_n \langle X_n X_n^T\rangle \right)^{-1}$
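Putting the E-step and M-step together, here is a minimal numpy sketch of the resulting EM loop (an added illustration; it assumes the data are centered by their mean and keeps only the diagonal of <S>, since Psi is constrained to be diagonal):

import numpy as np

def fa_em(Y, q, n_iter=200, seed=0):
    """EM for factor analysis (minimal sketch of the updates above).
    Y: (N, d) data, q: latent dimension. Returns mu, Lambda, Psi."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    mu = Y.mean(axis=0)
    Yc = Y - mu
    Lam = rng.normal(size=(d, q))
    psi = np.ones(d)                                               # diagonal of Psi
    for _ in range(n_iter):
        # E-step: posterior moments <X_n> and sum_n <X_n X_n^T>
        V = np.linalg.inv(np.eye(q) + Lam.T @ (Lam / psi[:, None]))   # (I + L^T Psi^-1 L)^-1
        M = (Yc / psi[None, :]) @ Lam @ V                          # row n is <X_n>^T = m_{x_n|y_n}^T
        Exx = N * V + M.T @ M                                      # sum_n <X_n X_n^T>
        # M-step: Lambda^{t+1} = (sum_n y_n <X_n>^T)(sum_n <X_n X_n^T>)^{-1}
        Lam = (Yc.T @ M) @ np.linalg.inv(Exx)
        # Psi^{t+1} = diag <S>; with the new Lambda this simplifies to
        # diag{ (1/N) sum_n [ y_n y_n^T - Lambda <X_n> y_n^T ] }
        psi = (np.einsum('nd,nd->d', Yc, Yc) - np.einsum('nd,nd->d', M @ Lam.T, Yc)) / N
    return mu, Lam, np.diag(psi)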

Eric Xing 20

Comparison of PCA and FA

FA: generative model $y_n = \mu + \Lambda x_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \Psi)$; inference and learning use the posterior moments
$\langle X_n \rangle = m_{x_n|y_n} = V_{1|2}\,\Lambda^T\Psi^{-1}(y_n - \mu), \qquad \langle X_n X_n^T\rangle = V_{1|2} + m_{x_n|y_n}m_{x_n|y_n}^T,$
$\Lambda^{t+1} = \left(\sum_n y_n\langle X_n\rangle^T\right)\left(\sum_n \langle X_n X_n^T\rangle\right)^{-1}.$

PCA: directions of greatest variation, $u_1 = \arg\max_u u^T\mathrm{Cov}(y)\,u$, projection $x_i = U_q^T y_i$, reconstruction $y_n = U x_n$.


Eric Xing 21

Comparison of PCA and FA

FA:
Invert a q×q matrix
Covariant under rescaling of the data: y → diag(α)y
Neither of the factors found by a two-factor model is necessarily the same as that found by a single-factor model, and ...

PCA:
SVD on a K×K matrix
Covariant under rotation of the data: y → Ay
Principal axes can be found incrementally

Eric Xing 22

Example: from the original data matrix (observations o1 ... on on variables v1 v2 ... v10) to the correlation matrix (correlation coefficients among v1 ... v10) to the factor matrix (factor loadings on factors F1, F2, F3).

Factor loadings (F1 = Spd, F2 = Str, F3 = End):

Event      F1 (Spd)   F2 (Str)   F3 (End)
100m         .87        .07        .14
Long         .77        .34        .13
High         .51        .27        .34
110m         .63        .32        .05
400m         .74        .06        .38
Discus       .22        .79        .06
Shot         .31        .82        .10
Javelin      .02        .70        .15
Pole         .24        .40        .50
1500m        .02        .12        .89

Decisions:
• Factoring method
• Number of factors to retain
• Factor rotation


Eric Xing 23

Decathlon example

[Figure: the ten decathlon events (100 Meters, 400 Meters, 100m Hurdles, Long Jump, High Jump, Shot Put, Discus, Javelin, Pole Vault, 1500 Meters) connected to the three latent factors "SPEED", "STRENGTH", and "ENDURANCE".]

Eric Xing 24

Model Invariance and Identifiability

There is degeneracy in the FA model. Since Λ only appears as the outer product ΛΛ^T, the model is invariant to rotations and axis flips of the latent space: we can replace Λ with ΛQ for any orthonormal matrix Q and the model remains the same, since (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T. This means there is no single "best" setting of the parameters: an infinite number of parameter settings all achieve the maximum likelihood score! Such models are called unidentifiable, since two people fitting ML parameters to identical data are not guaranteed to identify the same parameters.
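A three-line numpy check of this invariance (an added illustration with arbitrary sizes):

import numpy as np

rng = np.random.default_rng(2)
Lam = rng.normal(size=(5, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))                # random orthonormal matrix
print(np.allclose(Lam @ Lam.T, (Lam @ Q) @ (Lam @ Q).T))    # True: same Lambda Lambda^T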


Eric Xing 25

Why FA

Latent trajectories: each of the static latent-variable models below has a sequential counterpart in which the latent variable follows a chain x1 → x2 → ... → xN with emissions xt → yt.

Latent X     Observed Y     Static model                                    Sequential model
discrete     discrete       Mixture model (e.g., mixture of multinomials)   HMM (for discrete sequential data, e.g., text)
discrete     continuous     Mixture model (e.g., mixture of Gaussians)      HMM (for continuous sequential data, e.g., speech signal)
continuous   continuous     Factor analysis                                 State space model

Eric Xing 26

Independent Components Analysis (ICA)

ICA is similar to FA, except it assumes the latent sources have a non-Gaussian density. Hence ICA can exploit higher-order moments (not just second-order ones). It is commonly used to solve blind source separation (the cocktail party problem).

[Figure: FA has a single latent X generating Y; ICA has multiple independent latent sources X1, X2 generating observations Y1, ..., YK.]


Eric Xing 27

The simple "Cocktail Party" Problem

[Figure: sources s1, s2 are mixed by a mixing matrix A into observations x1, x2.]

x = As, with n sources and m = n observations.

We skip further details and next introduce a more interesting new algorithm!

Eric Xing 28

ICA versus PCA (and FA)

Similarity:
Feature extraction
Dimension reduction

Difference:
PCA uses up to second-order moments of the data to produce uncorrelated components.
ICA strives to generate components that are as independent as possible.


Eric Xing 29

Semi-supervised Metric Learning


Xing et al, NIPS 2003

Eric Xing 30

What is a good metric?

What is a good metric over the input space for learning and data mining?

How can metrics that are sensible to a human user (e.g., dividing traffic along highway lanes rather than between overpasses, or categorizing documents according to writing style rather than topic) be conveyed to a computer data miner through a systematic mechanism?


Eric Xing 31

Issues in learning a metric

The data distribution is self-informing (e.g., it lies on a sub-manifold). One can learn a metric by finding an embedding of the data in some space. Con: this does not reflect (changing) human subjectiveness.

An explicitly labeled dataset offers clues about the critical features: supervised learning. Con: needs sizable, homogeneous training sets.

What about side information? (E.g., "x and y look (or read) similar" ...)

Providing a small amount of qualitative, less structured side information is often much easier than explicitly stating a metric (what should the metric for writing style be?) or labeling a large set of training data.

Can we learn a distance metric more informative than Euclidean distance using a small amount of side information?

Eric Xing 32

Distance Metric Learning
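For reference, the metric considered in the cited Xing et al. (2003) work, stated here from the paper rather than from the slide, is a Mahalanobis distance parameterized by a positive semi-definite matrix A:

$d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A\,(x - y)}, \qquad A \succeq 0.$

When A is diagonal this amounts to re-weighting the features; a full A corresponds to re-weighting the features after a linear transformation of the inputs.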


Eric Xing 33

Optimal Distance Metric

Learning an optimal distance metric with respect to the side information leads to the following optimization problem (sketched below).

This optimization problem is convex, so local-minima-free algorithms exist. Xing et al. (2003) provided an efficient gradient descent plus iterative constraint-projection method.
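For completeness, the formulation in Xing et al. (2003), with S the set of similar pairs and D the set of dissimilar pairs supplied as side information (stated here from the paper):

$\min_{A}\; \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \qquad \text{s.t.}\quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, \qquad A \succeq 0.$

The constraint uses the unsquared distance so that a trivial rank-one solution is ruled out; the objective is linear in A and the feasible set is convex, which is why local-minima-free algorithms exist.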

Eric Xing 34

Examples of learned distance metrics

Distance metrics learned on three-cluster artificial data:


Eric Xing 35

Application to Clustering

Artificial Data I: a difficult two-class dataset

Eric Xing 36

Application to Clustering

Artificial Data II: two-class data with a strong irrelevant feature


Eric Xing 37

Application to Clustering

9 datasets from the UC Irvine repository

Eric Xing 38

Accuracy vs. amount of side-information

Two typical examples of how the quality of the clusters found increases with the amount of side-information.


Eric Xing 39

Take-home message

Distance metric learning is an important problem in machine learning and data mining.

A good distance metric can be learned from a small amount of side information, in the form of similarity and dissimilarity constraints on the data, by solving a convex optimization problem.

The learned distance metric can identify the most significant direction(s) in feature space that separate the data well, effectively performing implicit feature selection.

The learned distance metric can be used to improve clustering performance.

Eric Xing 40

Additional Details: Independent Components Analysis (ICA)

ICA is similar to FA, except it assumes the latent sources have a non-Gaussian density. Hence ICA can exploit higher-order moments (not just second-order ones). It is commonly used to solve blind source separation (the cocktail party problem).

[Figure: FA has a single latent X generating Y; ICA has multiple independent latent sources X1, X2 generating observations Y1, ..., YK.]


Eric Xing 41

The simple "Cocktail Party" Problem

[Figure: sources s1, s2 are mixed by a mixing matrix A into observations x1, x2.]

x = As, with n sources and m = n observations.

Eric Xing 42

Motivation

Two independent sources; mixture at two microphones:
$x_1(t) = a_{11}s_1(t) + a_{12}s_2(t)$
$x_2(t) = a_{21}s_1(t) + a_{22}s_2(t)$

The coefficients $a_{ij}$ depend on the distances of the microphones from the speakers.


Eric Xing 43

Motivation

Get the Independent Signals out of the Mixture

Eric Xing 44

Blind Source Separation

Suppose that there are k unknown independent sources
$\mathbf{s}(t) = [s_1(t), \dots, s_k(t)]^T, \qquad \text{with } E[s_i^2(t)] = 1.$

A data vector x(t) is observed at each time point t, such that
$\mathbf{x}(t) = A\,\mathbf{s}(t)$
where A is an n × k full-rank scalar matrix.


Eric Xing 45

Blind source separation

[Figure: blind sources (independent components) pass through a mixing process A to give the observed sequences, which pass through a de-mixing process W to give the recovered independent components.]

Eric Xing 46

ICA versus PCA (and FA)

Similarity:
Feature extraction
Dimension reduction

Difference:
PCA uses up to second-order moments of the data to produce uncorrelated components.
ICA strives to generate components that are as independent as possible.


Eric Xing 47

Problem formulation

The goal of ICA is to find a linear mapping W such that the unmixed sequences
$\mathbf{u}(t) = W\mathbf{x}(t) = WA\,\mathbf{s}(t)$
are maximally statistically independent, i.e., find some
$V = WA = PC$
where C is a diagonal matrix and P is a permutation matrix.

[Figure: the same mixing (A) / de-mixing (W) pipeline as on the previous slide.]

Eric Xing 48

Principle of ICA: Nongaussianity

The fundamental restriction in ICA is that the independent components must be non-Gaussian for ICA to be possible.

This is because Gaussianity is invariant under orthogonal transformations, which makes the matrix A unidentifiable for Gaussian independent components.


Eric Xing 49

Measures of nongaussianity (1)

Kurtosis: can be very sensitive to outliers when its value has to be estimated from a measured sample.

Mutual information

Negative entropy (negentropy)

(Standard definitions of the first and last measures are given below for reference.)
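For reference, the standard definitions behind the first and last measures (supplied here following the usual ICA literature, not copied from the slide):

$\mathrm{kurt}(y) = E[y^4] - 3\left(E[y^2]\right)^2$ (which is zero for a Gaussian variable), and the negentropy

$J(y) = H(y_{\mathrm{gauss}}) - H(y),$

where H denotes differential entropy and $y_{\mathrm{gauss}}$ is a Gaussian random variable with the same covariance as y; J(y) is non-negative and equals zero iff y is Gaussian.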

Eric Xing 50

FastICA - Preprocessing

Centering: make x a zero-mean variable (subtract its mean).

Whitening: transform the observed vector x linearly so that it has unit variance (identity covariance). One can show that
$\tilde{x} = E D^{-1/2} E^T x \quad \text{satisfies} \quad E[\tilde{x}\tilde{x}^T] = I,$
where $E[x x^T] = E D E^T$ is the eigendecomposition of the covariance matrix.
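A minimal numpy sketch of both preprocessing steps (an added illustration; it assumes the sample covariance is full rank):

import numpy as np

def center_and_whiten(X):
    """Centering and whitening for FastICA preprocessing.
    X: (N, d) data matrix. Returns whitened data with (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)                                    # centering
    d_vals, E = np.linalg.eigh(np.cov(Xc, rowvar=False))       # covariance = E D E^T
    W_white = E @ np.diag(1.0 / np.sqrt(d_vals)) @ E.T         # E D^{-1/2} E^T
    return Xc @ W_white                                        # whitened data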


Eric Xing 51

FastICA algorithm

Initialize the weight matrix W.

Iteration: update W by the FastICA fixed-point rule (a standard sketch of the update is given below). Repeat until convergence.

The ICAs are the components of u = Wx.
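The update rule itself is not spelled out in the text above; as an added illustration, here is a sketch of the standard one-unit FastICA fixed-point iteration with a tanh nonlinearity and deflation (Hyvarinen-style), which assumes the input has already been centered and whitened:

import numpy as np

def fastica(Xw, n_components, n_iter=200, tol=1e-6, seed=0):
    """Standard one-unit FastICA sketch (deflation, g = tanh).
    Xw: (N, d) centered and whitened data. Returns W (n_components x d);
    the estimated independent components are U = Xw @ W.T."""
    rng = np.random.default_rng(seed)
    N, d = Xw.shape
    W = np.zeros((n_components, d))
    for i in range(n_components):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = Xw @ w                                       # projections w^T x_n
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            # fixed-point step: w <- E[x g(w^T x)] - E[g'(w^T x)] w
            w_new = (Xw * g[:, None]).mean(axis=0) - g_prime.mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)                # deflation: decorrelate from earlier rows
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol       # converged up to sign
            w = w_new
            if converged:
                break
        W[i] = w
    return W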

Eric Xing 52

Summary

There has been wide discussion of the application of Independent Component Analysis (ICA) in signal processing, neural computation, and finance.

ICA was first introduced as a novel tool to separate blind sources in a mixed signal.

The basic idea of ICA is to reconstruct the hypothesized independent original sequences from the observed sequences.