
  • PCA, SVD and MLE: A Statistical View of Principal Components

    Naomi Altman (1,2) & Wei Luo (1) & Garvesh Raskutti (2)

    SAMSI 2013, February 14, 2013

    1. The Pennsylvania State University  2. SAMSI

  • Matrices vs Multivariate Statistics

    PCA and SVD are often presented as equivalent matrix decomposition methods. Today's talk will focus on the statistical interpretation. We will need to distinguish between

    population (model) parameters and
    observable (estimated) parameters.

    PCA has both a population and an observable version. SVD is a matrix decomposition and has only an observable version.


  • Random variables and samples

    We suppose X is a p-dimensional random (row) vector.
    E(X) = µ is a p-dimensional row vector (non-random).
    Let Y = X − µ.
    Var(X) = Σ = E[(X − µ)⊤(X − µ)], a p × p positive semi-definite matrix.

    We sample X_1, …, X_n i.i.d. vectors from the distribution of X.
    𝐗 (bold) denotes the n × p data matrix with rows X_i.
    X̄ is the p-vector of column averages (the sample mean).
    Y_i = X_i − X̄ is a centered "observation".
    𝐘 is the n × p matrix with rows Y_i.
    V̂ar(X) = 𝐘⊤𝐘/(n − 1) = Σ̂ is the sample variance.
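
    A minimal numpy sketch of these definitions; the sample size, dimension and Gaussian draws are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5                        # illustrative sample size and dimension
X = rng.normal(size=(n, p))          # data matrix with rows X_i

Xbar = X.mean(axis=0)                # sample mean: p-vector of column averages
Y = X - Xbar                         # centered data matrix with rows Y_i = X_i - Xbar
Sigma_hat = Y.T @ Y / (n - 1)        # sample variance Σ̂

# agrees with numpy's covariance (columns as variables, denominator n - 1)
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False))
```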


  • Random variables and samples

    We might think of 𝐗 as an observed matrix, in which case 𝐘, X̄ and Σ̂ are fixed arrays of the appropriate dimensions.

    We might also think of 𝐗 as a random matrix, in which case 𝐘 and Σ̂ are also random matrices, X̄ is a random vector, and we have

    E(X̄) = µ and E(Σ̂) = Σ.
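
    A quick Monte Carlo check of E(Σ̂) = Σ, with an assumed toy Σ and sample size (an illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 3, 50, 5000
A = rng.normal(size=(p, p))
Sigma = A @ A.T                                  # a fixed population variance Σ

avg = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X - X.mean(axis=0)
    avg += Y.T @ Y / (n - 1)                     # one draw of Σ̂
avg /= reps

print(np.max(np.abs(avg - Sigma)))               # small: Σ̂ averages out to Σ
```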


  • PCA - population version

    The variance matrix Σ has eigen-decomposition

    Σ = V ∆ V⊤    (1)

    where
    V is a p × p orthonormal matrix whose columns are the eigenvectors of Σ, and
    ∆ is a diagonal matrix with diagonal elements the eigenvalues δ_1 ≥ δ_2 ≥ · · · ≥ δ_p ≥ 0.

    V is unique (up to column signs) if the eigenvalues are distinct.

    The i-th principal component of X is V_i, the i-th column of V.


  • PCA - sample version

    The estimated variance matrix Σ̂ has eigen-decomposition

    Σ̂ = V̂ ∆̂ V̂⊤    (2)

    where
    V̂ is a p × p orthonormal matrix whose columns are the eigenvectors of Σ̂, and
    ∆̂ is a diagonal matrix with diagonal elements δ̂_1 ≥ δ̂_2 ≥ · · · ≥ δ̂_p ≥ 0.

    The i-th empirical principal component of X is V̂_i.
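
    One way to compute decomposition (2) with numpy; the correlated toy data are an assumed illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Y = X - X.mean(axis=0)
Sigma_hat = Y.T @ Y / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order; reverse so that δ̂_1 >= ... >= δ̂_p
evals, evecs = np.linalg.eigh(Sigma_hat)
delta_hat = evals[::-1]
V_hat = evecs[:, ::-1]               # columns = empirical principal components

assert np.allclose(V_hat @ np.diag(delta_hat) @ V_hat.T, Sigma_hat)
```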

  • SVD = PCA - sample version

    𝐘 has singular value decomposition

    𝐘 = u D v⊤    (3)

    where
    D is a diagonal matrix with diagonal elements d_1 ≥ d_2 ≥ · · · ≥ d_p ≥ 0,
    u is the n × p matrix of left singular vectors, and
    v is the p × p matrix of right singular vectors.

    Σ̂ = 𝐘⊤𝐘/(n − 1) = v D² v⊤/(n − 1), so V̂ = v and δ̂_i = d_i²/(n − 1).
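
    A small check, on assumed toy data, that the SVD of 𝐘 reproduces the sample PCA: V̂ = v (up to column signs) and δ̂_i = d_i²/(n − 1):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
Y = X - X.mean(axis=0)

u, d, vT = np.linalg.svd(Y, full_matrices=False)     # Y = u · diag(d) · v⊤

evals, evecs = np.linalg.eigh(Y.T @ Y / (n - 1))     # eigen-decomposition of Σ̂
delta_hat, V_hat = evals[::-1], evecs[:, ::-1]

# δ̂_i = d_i**2 / (n - 1), and v agrees with V̂ column-by-column up to sign
assert np.allclose(delta_hat, d**2 / (n - 1))
assert np.allclose(np.abs(vT), np.abs(V_hat.T), atol=1e-6)
```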


  • SVD - population and sample versions

    Note that Y is a random vector. It is not at all clear how one would define an SVD for a random quantity. However, Y and X do have principal components.

    By contrast, 𝐘 is a matrix and does have an SVD, which gives the empirical principal components.


  • SVD - population and sample versions

    Recall that the SVD has an important interpretation in terms of matrix closeness:

    Let M be an n × p matrix with n > p.
    For any matrix m, let m[•r] denote its first r columns.
    Let ‖M‖ be the L2 norm of the matrix.
    Suppose M = Û D̂ V̂⊤ is the SVD of M.

    Let W be a p × r matrix with orthonormal columns and A an n × r matrix. If rank(M) ≥ r, then a pair (W, A) minimizing ‖M − A W⊤‖ is W = V̂[•r] and A = (Û D̂)[•r]. In other words, the first r singular vectors provide the closest approximation to M of rank r or less.

    Note that the non-random vector closest to Y in the L2 sense is E(Y) = 0.
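
    A sketch of this closeness property, using the Frobenius norm as the matrix norm; the competing factorization is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 50, 8, 3
M = rng.normal(size=(n, p))

U, d, VT = np.linalg.svd(M, full_matrices=False)
W = VT[:r].T                       # V̂[•r]: first r right singular vectors
A = U[:, :r] * d[:r]               # (Û D̂)[•r]
M_r = A @ W.T                      # best approximation to M of rank <= r

# any other rank-r factorization is no closer (spot-check one with orthonormal W2)
W2, _ = np.linalg.qr(rng.normal(size=(p, r)))
A2 = M @ W2                        # least-squares A for this fixed W2
assert np.linalg.norm(M - M_r) <= np.linalg.norm(M - A2 @ W2.T)
```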


  • Directions of Greatest Variance

    Let X be any random p-vector with mean µ, variance Σ and principal components V_1, …, V_p.

    Var(X V_1) = V_1⊤ Σ V_1 = δ_1, so the projection of X onto the first principal component is the 1-dimensional projection with greatest variance.

    X V[•r] is the projection of X onto the r-dimensional subspace with greatest variance.

    Note that projections onto subspaces may not be very interesting if the support of X is a manifold with high curvature.
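
    A spot-check, on an assumed toy Σ, that no unit direction beats the first sample principal component in variance:

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[4.0, 1.5, 0.0],
                  [1.5, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)

evals, V = np.linalg.eigh(np.cov(X, rowvar=False))
V1 = V[:, -1]                                   # first (sample) principal component
var_along_V1 = np.var(X @ V1, ddof=1)

# no other unit direction has larger sample variance (spot-check a few)
for _ in range(5):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert np.var(X @ w, ddof=1) <= var_along_V1 + 1e-9
```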


  • Singular Variance Matrices

    Lemma: If x is a 1-dimensional random variable with Var(x) = 0, then x is constant a.s.

    Theorem: If Y is a random p-vector with mean 0 and variance Σ, and rank(Σ) = r < p, then there is a p × p orthonormal matrix M such that Var(Y M[•r]) has full rank r and, letting M2 = [M_{r+1} · · · M_p] be the matrix whose columns are the last p − r columns of M, Y M2 = 0 a.s.

    Proof: Let V be the ordered eigenvectors of Σ and take M = V. Then Var(Y V) = ∆, whose last p − r diagonal entries are 0, so by the Lemma each coordinate of Y M2 is constant a.s.; since it has mean 0, Y M2 = 0 a.s.

  • Projection Theorem

    Corollary: If Y is a random p-vector with mean 0 and variance Σ, and rank(Σ) = r < p, then Y = Y V[•r] V[•r]⊤ a.s.; that is, Y lies in Span(V[•r]) a.s., so projecting onto the first r principal components loses nothing.


  • Factor Analysis and PCA

    The Factor Analysis Model
    Suppose there are random factors C_1, …, C_r that are

    uncorrelated, with E(C_i) = 0 and Var(C_i) = σ²_C.

    Let X_i = µ_i + Σ_{j=1}^r q_{ij} C_j + E_i for i = 1, …, p (with p ≥ r), where
    the µ_i and q_{ij} are fixed constants, and
    the E_i are i.i.d. with mean 0 and variance σ²_E, independent of C_j for all j.

    i.e. X = µ + CQ + E, where C = (C_1, …, C_r) and Q is the r × p matrix with (j, i) entry q_{ij}.

    Then Var(X) = σ²_C Q⊤Q + σ²_E I_p, where I_p is the p × p identity matrix.
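
    A simulation sketch of this factor analysis model (assumed toy values of p, r, σ_C and σ_E), checking the variance formula:

```python
import numpy as np

rng = np.random.default_rng(6)
p, r, n = 6, 2, 200000
sigma_C, sigma_E = 1.5, 0.5
Q = rng.normal(size=(r, p))                       # fixed r × p loading matrix
mu = rng.normal(size=p)

C = sigma_C * rng.normal(size=(n, r))             # uncorrelated factors, Var = σ²_C
E = sigma_E * rng.normal(size=(n, p))             # noise, Var = σ²_E
X = mu + C @ Q + E

Var_model = sigma_C**2 * Q.T @ Q + sigma_E**2 * np.eye(p)
print(np.max(np.abs(np.cov(X, rowvar=False) - Var_model)))   # small for large n
```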

  • Factor Analysis and PCA, cont'd

    The Factor Analysis Model
    X = µ + CQ + E, with Var(X) = σ²_C Q⊤Q + σ²_E I_p.

    Let the SVD of Q be Q = U D V⊤.

    Note that the columns of V are the first r principal components of X.
    Note that Q is not identifiable, but Span(Q) (the span of its rows) = Span(V) is identifiable.
    So XV gives a basis for Span(C).
    Since only Span(C) is available, XV is often rotated to give an interpretable set of factors.
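
    A numerical check that, in this model, the first r principal components of Var(X) span the same subspace as the right singular vectors of Q (toy values assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
p, r = 6, 2
sigma_C, sigma_E = 1.5, 0.5
Q = rng.normal(size=(r, p))

Var_X = sigma_C**2 * Q.T @ Q + sigma_E**2 * np.eye(p)
evals, evecs = np.linalg.eigh(Var_X)
V_top = evecs[:, ::-1][:, :r]                     # first r principal components of X

_, _, VT = np.linalg.svd(Q, full_matrices=False)  # right singular vectors of Q
# the two r-dimensional subspaces coincide (compare projection matrices)
assert np.allclose(V_top @ V_top.T, VT.T @ VT)
```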


  • Hierarchical Mean model and PCA

    The Hierarchical Mean Model
    Suppose

    C is a q-dimensional random (row) vector with variance Var(C) = σ²_C I_q,
    W is a q × p fixed matrix of rank r, and
    X is a random p-vector with E(X | C) = CW and Var(X | C) = σ²I_p.

    Then E(X) = E(C)W and Var(X) = σ²_C W⊤W + σ²I_p.

    Let the SVD of W be W = U D V⊤.

    Note that the first r columns of V are the first r principal components of X.
    Note that W is not identifiable, but Span(W) (the span of its rows) = Span(V[•r]) is identifiable.


  • PCA for Normally Distributed RVs

    PCA is often used with multivariate Normal data.
    If X is a p-dimensional Normal vector with mean µ and (nonsingular) variance Σ then, up to an additive constant,

    L(X | µ, Σ) = −(1/2) ( log det(Σ) + Y Σ⁻¹ Y⊤ ) = −(1/2) Σ_{i=1}^p ( log δ_i + (Y V_i)²/δ_i )

    where Y = X − µ. We see that

    the likelihood depends on the data only through the projection of the centered data onto the principal components, and
    the level contours of the distribution are ellipsoids centered at µ with major axes described by the principal components.
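
    A small check of the two forms of the (constant-free) log-likelihood, on an assumed random µ and Σ:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)           # nonsingular variance
mu = rng.normal(size=p)
x = rng.multivariate_normal(mu, Sigma)

Y = x - mu
delta, V = np.linalg.eigh(Sigma)      # eigenvalues δ_i and eigenvectors V_i (columns)

# direct form of the constant-free log-likelihood
ll_direct = -0.5 * (np.log(np.linalg.det(Sigma)) + Y @ np.linalg.inv(Sigma) @ Y)
# spectral form: depends on the data only through the projections Y·V_i
ll_spectral = -0.5 * np.sum(np.log(delta) + (Y @ V) ** 2 / delta)

assert np.isclose(ll_direct, ll_spectral)
```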

  • PCA for Normally Distributed RVs

    Figure: from moserware.com

  • PCA and MLE

    Normal Factor Analysis Model
    If X = µ + CQ + E where

    µ is a p-vector,
    C ∼ MVN(0, σ²_C I_r),
    E ∼ MVN(0, σ²_E I_p), and
    Q is a fixed r × p matrix,

    then X ∼ MVN(µ, σ²_C Q⊤Q + σ²_E I_p).

    Tipping and Bishop (1999) show that Span(V̂[•r]) is the maximum likelihood estimator of Span(Q).
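
    A simulation sketch related to the Tipping and Bishop result: for an assumed toy model, the span of the first r empirical principal components approaches Span(Q) as n grows. This is only an illustration of consistency, not the likelihood argument itself:

```python
import numpy as np

rng = np.random.default_rng(9)
p, r, n = 8, 2, 50000
sigma_C, sigma_E = 2.0, 0.3
Q = rng.normal(size=(r, p))
mu = rng.normal(size=p)

X = mu + sigma_C * rng.normal(size=(n, r)) @ Q + sigma_E * rng.normal(size=(n, p))
Y = X - X.mean(axis=0)

_, _, VT = np.linalg.svd(Y, full_matrices=False)
V_hat_r = VT[:r].T                     # first r empirical principal components

_, _, QVT = np.linalg.svd(Q, full_matrices=False)
P_hat = V_hat_r @ V_hat_r.T            # projector onto Span(V̂[•r])
P_Q = QVT.T @ QVT                      # projector onto Span(Q) (row space of Q)
print(np.max(np.abs(P_hat - P_Q)))     # close to 0, and shrinks as n grows
```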


  • PCA and MLE

    Normal Hierarchical Mean Model
    In the hierarchical mean model, if C (q-dimensional) and X | C (p-dimensional) are both multivariate Normal and W has rank r ≤ q, then

    X ∼ MVN(E(C)W, σ²_C W⊤W + σ²I_p).

    Note that E(X) is in Span(W).

    The argument of Tipping and Bishop (1999) shows that Span(V̂[•r]) is the maximum likelihood estimator of Span(W).


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version

    We generate data by drawing C_i ∼ N_q(0, σ²_C I_q) and X_i | C_i ∼ N_p(C_i W, σ²I_p).

    This is equivalent to X_i = C_i W + E_i where E_i ∼ N_p(0, σ²I_p).

    Letting 𝐂 be the n × q (n > q, unobserved) matrix with rows C_i and 𝐄 be the n × p matrix with rows E_i,
    𝐗 = 𝐂W + 𝐄.
    Y_i = X_i − X̄, or 𝐘 = P𝐗 where P = I_n − 11⊤/n.
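
    A tiny check that P = I_n − 11⊤/n is the centering projection (toy sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 7, 3
X = rng.normal(size=(n, p))

P = np.eye(n) - np.ones((n, n)) / n            # P = I_n - 11⊤/n
assert np.allclose(P @ X, X - X.mean(axis=0))  # PX subtracts the column means
assert np.allclose(P @ P, P)                   # P is an orthogonal projection
```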


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version
    We have:

    𝐗 = 𝐂W + 𝐄 and 𝐘 = P𝐗 = P𝐂W + P𝐄.
    Let the SVD of W be W = U D V⊤. Then Span(W) = Span(V[•r]).

    Let 𝐘 = Û D̂ V̂⊤.
    We have already seen that Û[•r] Diag(d̂_1, …, d̂_r) V̂[•r]⊤ is a solution to the rank-r matrix closest to 𝐘,
    and V̂[•r] are the first r empirical principal components of X.

    Now suppose that we also consider the SVD 𝐗 = u d v⊤.

    Some questions:
    When is Span(𝐂W) = Span(P𝐂W) = Span(W)?
    𝐂 has full rank q a.s., so Span(𝐂W) = Span(W) a.s.
    Is P𝐂 full rank a.s.?
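
    A quick numerical look at these questions under assumed toy dimensions: the ranks of 𝐂 and P𝐂, and equality of the row spaces of 𝐂W, P𝐂W and W:

```python
import numpy as np

rng = np.random.default_rng(11)
n, q, p, r = 30, 3, 6, 2
C = rng.normal(size=(n, q))                               # unobserved factor matrix
W = rng.normal(size=(r, q)).T @ rng.normal(size=(r, p))   # q × p loading matrix of rank r
P = np.eye(n) - np.ones((n, n)) / n                       # centering matrix

print(np.linalg.matrix_rank(C), np.linalg.matrix_rank(P @ C))   # both q here

def row_space_projector(M, k):
    # projector onto the span of the top-k right singular vectors
    # (the row space of M when rank(M) = k)
    _, _, VT = np.linalg.svd(M)
    return VT[:k].T @ VT[:k]

# CW, PCW and W share the same row space
assert np.allclose(row_space_projector(C @ W, r), row_space_projector(W, r))
assert np.allclose(row_space_projector(P @ C @ W, r), row_space_projector(W, r))
```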


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version

    How does Span(V̂[•r]) compare with Span(v[•r])?

    I would like to argue that since 𝐄 is a random Normal matrix, most of the structure in 𝐗 and 𝐘 = P𝐗 comes from 𝐂W and P𝐂W. When is this true?


  • PCA and MLE

    Putting PCA in terms of maximum likelihood suggests at least two extensions:

    PCA can be robustified against contamination via robust M-estimation methods.
    PCA can be extended to non-Normal multivariate families via maximum likelihood estimates based on the factor analysis or hierarchical mean models.

    We tackle the second problem.


  • Extensions of PCA

    Elliptical families
    A p-dimensional random vector X has an elliptical distribution if

    there is a fixed p-vector µ and a fixed positive semi-definite matrix S such that
    the density function depends on X only through (X − µ) S (X − µ)⊤.

    The level probability contours of X are ellipsoids with center µ and major axes the eigenvectors of S.
    If Var(X) is finite, then E(X) = µ and S is (up to a scalar multiple) a generalized inverse of Var(X).


  • Extensions of PCA

    Elliptical families cont'd.
    Any elliptical distribution can be decomposed as

    X = µ + R U^(p) A

    where
    U^(p) is a random vector uniformly distributed on the unit sphere in ℝ^p,
    R is a non-negative random variable independent of U^(p), and
    A is a fixed matrix with A⊤A = Σ.

    Letting C = R U^(p) and Q = A, we see that this is just a factor analysis model (with r = p and no separate noise term).
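
    A simulation sketch of this decomposition with an assumed radius distribution (chi with 5 degrees of freedom) and an assumed A; it checks that Var(X) is proportional to A⊤A:

```python
import numpy as np

rng = np.random.default_rng(12)
p, n = 3, 200000
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
Sigma = A.T @ A

U = rng.normal(size=(n, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)      # uniform on the unit sphere
R = np.sqrt(rng.chisquare(df=5, size=n))           # a non-negative radius, independent of U
X = mu + (R[:, None] * U) @ A

scale = (R ** 2).mean() / p                        # Var(X) = (E R²/p) · A⊤A
print(np.max(np.abs(np.cov(X, rowvar=False) - scale * Sigma)))   # small for large n
```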

  • Extensions of PCA

    Elliptical families cont'd.
    Cambanis, Huang and Simons (1981) show that if rank(Σ) = r < p then

    X = µ + S U^(r) V[•r]⊤

    where
    S is a random variable independent of U^(r) (uniform on the unit sphere in ℝ^r), and
    V[•r] are the first r eigenvectors of Σ.
    The remaining eigenvectors are associated with the null space of A⊤A.

  • Extensions of PCA

    Elliptical families cont'd.
    Elliptical distributions are a special case of the factor analysis model.
    The span of the first r principal components is Span(A).
    This should be enough to show that the first r empirical principal components are the MLE of Span(A).

  • Extensions of PCA

    Linear Exponential Families
    A p-dimensional random vector X is in a linear exponential family in canonical form if

    there is a p-vector θ such that
    L(x | θ) = h(x) + x θ⊤ − G(θ),

    where h(x) and G(θ) are real-valued functions.


  • Extensions of PCA

    Linear Exponential Families cont'd

    e.g. Suppose that X ∼ Poisson(λ).

    f(x | λ) = e^{−λ} λ^x / x!
    L(x | λ) = −λ + x log(λ) − log(x!)

    Then
    θ = log(λ),
    h(x) = −log(x!),
    G(θ) = exp(θ) = λ,
    E(X | θ) = G′(θ) = exp(θ) = λ, and
    Var(X | θ) = G″(θ) = λ.

  • Extensions of PCA

    Linear Exponential Families cont'd
    We can turn this into a hierarchical means model by assuming

    E(X | C) = g(CW)

    where C is random and g is applied coordinatewise. Note that we are assuming that θ = CW (so g = G′).

    Note however that

    Var(X | C) = g′(CW), and
    Var(X) = Var(E(X | C)) + E(Var(X | C)) = Var(g(CW)) + E(g′(CW)),

    so ordinary PCA does NOT extract the space spanned by the singular vectors of W.
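
    A closed-form toy illustration of this point (one Poisson factor with assumed w = (1, 2) and C ∼ N(0, 1), not taken from the slides): the leading eigenvector of Var(X) is far from the direction spanned by W:

```python
import numpy as np

# Toy one-factor Poisson model: θ = C·w with C ~ N(0, 1) and w = (1, 2).
# Var(X) = Var(exp(Cw)) + E[diag(exp(Cw))] has a closed form via E[e^{tC}] = e^{t²/2}.
w = np.array([1.0, 2.0])

def mgf(t):                              # E[e^{tC}] for C ~ N(0, 1)
    return np.exp(t ** 2 / 2)

cov_signal = np.array([[mgf(wi + wj) - mgf(wi) * mgf(wj) for wj in w] for wi in w])
var_X = cov_signal + np.diag(mgf(w))     # add E[Var(X | C)] = E[exp(Cw)] on the diagonal

_, evecs = np.linalg.eigh(var_X)
print(np.abs(evecs[:, -1]))              # leading PC direction, roughly (0.03, 1.00)
print(w / np.linalg.norm(w))             # Span(W) direction, roughly (0.45, 0.89)
```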

  • Extensions of PCA

    Linear Exponential Families cont'd

    e.g. Suppose that X ∼ Poisson(λ). If θ = log(λ) = CW (for now W is 1 × p), then

    Var(X) = E(exp(CW)) + Var(exp(CW)).

  • Extensions of PCA

    Linear Exponential Families cont'd
    We propose to use MLE to find the space spanned by the singular vectors of W corresponding to non-zero singular values.