
  • PCA, SVD and MLE: A Statistical View of Principal Components

    Naomi Altman (1,2) & Wei Luo (1) & Garvesh Raskutti (2)

    SAMSI 2013, February 14, 2013

    1. The Pennsylvania State University  2. SAMSI

  • Matrices vs Multivariate Statistics

    PCA and SVD are often presented as equivalent matrix decomposition methods. Today's talk will focus on the statistical interpretation. We will need to distinguish between

    population (model) parameters and
    observable (estimated) parameters.

    PCA has both a population and an observable version. SVD is a matrix decomposition and has only an observable version.


  • Random variables and samples

    We suppose X is a p-dimensional random (row) vector.
    E(X) = µ is a p-dimensional row vector (non-random).
    Let Y = X − µ.
    Var(X) = Σ = E[(X − µ)⊤(X − µ)], a p × p positive semi-definite matrix.

    We sample X_1, …, X_n i.i.d. vectors from the distribution of X.
    𝐗 (bold) denotes the n × p data matrix with rows X_i.
    X̄ is the p-vector of column averages (the sample mean).
    Y_i = X_i − X̄ is a centered "observation".
    𝐘 is the n × p matrix with rows Y_i.
    V̂ar(X) = 𝐘⊤𝐘/(n − 1) = Σ̂ is the sample variance.
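
    A minimal numpy sketch of these definitions; the sample size, dimension and Gaussian draws are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5                        # illustrative sample size and dimension
X = rng.normal(size=(n, p))          # data matrix with rows X_i

Xbar = X.mean(axis=0)                # sample mean: p-vector of column averages
Y = X - Xbar                         # centered data matrix with rows Y_i = X_i - Xbar
Sigma_hat = Y.T @ Y / (n - 1)        # sample variance Σ̂

# agrees with numpy's covariance (columns as variables, denominator n - 1)
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False))
```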


  • Random variables and samples

    We might think of 𝐗 as an observed matrix, in which case 𝐘, X̄ and Σ̂ are fixed arrays of the appropriate dimensions.

    We might also think of 𝐗 as a random matrix, in which case 𝐘 and Σ̂ are also random matrices, X̄ is a random vector, and we have

    E(X̄) = µ and E(Σ̂) = Σ.
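
    A quick Monte Carlo check of E(Σ̂) = Σ, with an assumed toy Σ and sample size (an illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 3, 50, 5000
A = rng.normal(size=(p, p))
Sigma = A @ A.T                                  # a fixed population variance Σ

avg = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X - X.mean(axis=0)
    avg += Y.T @ Y / (n - 1)                     # one draw of Σ̂
avg /= reps

print(np.max(np.abs(avg - Sigma)))               # small: Σ̂ averages out to Σ
```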


  • PCA - population version

    The variance matrix Σ has eigen-decomposition

    Σ = V ∆ V⊤    (1)

    where
    V is a p × p orthonormal matrix whose columns are the eigenvectors of Σ, and
    ∆ is a diagonal matrix with diagonal elements the eigenvalues δ_1 ≥ δ_2 ≥ · · · ≥ δ_p ≥ 0.

    V is unique (up to column signs) if the eigenvalues are distinct.

    The i-th principal component of X is V_i, the i-th column of V.


  • PCA - sample version

    The estimated variance matrix Σ̂ has eigen-decomposition

    Σ̂ = V̂ ∆̂ V̂⊤    (2)

    where
    V̂ is a p × p orthonormal matrix whose columns are the eigenvectors of Σ̂, and
    ∆̂ is a diagonal matrix with diagonal elements δ̂_1 ≥ δ̂_2 ≥ · · · ≥ δ̂_p ≥ 0.

    The i-th empirical principal component of X is V̂_i.
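
    One way to compute decomposition (2) with numpy; the correlated toy data are an assumed illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Y = X - X.mean(axis=0)
Sigma_hat = Y.T @ Y / (X.shape[0] - 1)

# eigh returns eigenvalues in ascending order; reverse so that δ̂_1 >= ... >= δ̂_p
evals, evecs = np.linalg.eigh(Sigma_hat)
delta_hat = evals[::-1]
V_hat = evecs[:, ::-1]               # columns = empirical principal components

assert np.allclose(V_hat @ np.diag(delta_hat) @ V_hat.T, Sigma_hat)
```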

  • SVD = PCA - sample version

    𝐘 has singular value decomposition

    𝐘 = u D v⊤    (3)

    where
    D is a diagonal matrix with diagonal elements d_1 ≥ d_2 ≥ · · · ≥ d_p ≥ 0,
    u is the n × p matrix of left singular vectors, and
    v is the p × p matrix of right singular vectors.

    Σ̂ = 𝐘⊤𝐘/(n − 1) = v D² v⊤/(n − 1), so V̂ = v and δ̂_i = d_i²/(n − 1).
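
    A small check, on assumed toy data, that the SVD of 𝐘 reproduces the sample PCA: V̂ = v (up to column signs) and δ̂_i = d_i²/(n − 1):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
Y = X - X.mean(axis=0)

u, d, vT = np.linalg.svd(Y, full_matrices=False)     # Y = u · diag(d) · v⊤

evals, evecs = np.linalg.eigh(Y.T @ Y / (n - 1))     # eigen-decomposition of Σ̂
delta_hat, V_hat = evals[::-1], evecs[:, ::-1]

# δ̂_i = d_i**2 / (n - 1), and v agrees with V̂ column-by-column up to sign
assert np.allclose(delta_hat, d**2 / (n - 1))
assert np.allclose(np.abs(vT), np.abs(V_hat.T), atol=1e-6)
```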


  • SVD - population and sample versions

    Note that Y is a random vector. It is not at all clear how one would define an SVD for a random quantity. However, Y and X do have principal components.

    By contrast, 𝐘 is a matrix and does have an SVD, which gives the empirical principal components.


  • SVD - population and sample versions

    Recall that the SVD has an important interpretation in terms of matrix closeness:

    Let M be an n × p matrix with n > p.
    For any matrix m, let m[•r] denote its first r columns.
    Let ‖M‖ be the L2 norm of the matrix.
    Suppose M = Û D̂ V̂⊤ is the SVD of M.

    Let W be a p × r matrix with orthonormal columns and A an n × r matrix. If rank(M) ≥ r, then a pair (W, A) minimizing ‖M − A W⊤‖ is W = V̂[•r] and A = (Û D̂)[•r]. In other words, the first r singular vectors provide the closest approximation to M of rank r or less.

    Note that the non-random vector closest to Y in the L2 sense is E(Y) = 0.
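
    A sketch of this closeness property, using the Frobenius norm as the matrix norm; the competing factorization is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 50, 8, 3
M = rng.normal(size=(n, p))

U, d, VT = np.linalg.svd(M, full_matrices=False)
W = VT[:r].T                       # V̂[•r]: first r right singular vectors
A = U[:, :r] * d[:r]               # (Û D̂)[•r]
M_r = A @ W.T                      # best approximation to M of rank <= r

# any other rank-r factorization is no closer (spot-check one with orthonormal W2)
W2, _ = np.linalg.qr(rng.normal(size=(p, r)))
A2 = M @ W2                        # least-squares A for this fixed W2
assert np.linalg.norm(M - M_r) <= np.linalg.norm(M - A2 @ W2.T)
```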


  • Directions of Greatest Variance

    Let X be any random p-vector with mean µ, variance Σ and principal components V_1, …, V_p.

    Var(X V_1) = V_1⊤ Σ V_1 = δ_1, so the projection of X onto the first principal component is the 1-dimensional projection with greatest variance.

    X V[•r] is the projection of X onto the r-dimensional subspace with greatest variance.

    Note that projections onto subspaces may not be very interesting if the support of X is a manifold with high curvature.
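
    A spot-check, on an assumed toy Σ, that no unit direction beats the first sample principal component in variance:

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[4.0, 1.5, 0.0],
                  [1.5, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)

evals, V = np.linalg.eigh(np.cov(X, rowvar=False))
V1 = V[:, -1]                                   # first (sample) principal component
var_along_V1 = np.var(X @ V1, ddof=1)

# no other unit direction has larger sample variance (spot-check a few)
for _ in range(5):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert np.var(X @ w, ddof=1) <= var_along_V1 + 1e-9
```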


  • Singular Variance Matrices

    Lemma: If x is a 1-dimensional random variable with Var(x) = 0, then x is constant a.s.

    Theorem: If Y is a random p-vector with mean 0 and variance Σ, and rank(Σ) = r < p, then there is a p × p orthonormal matrix M such that Var(Y M[•r]) has full rank r and, letting M2 = [M_{r+1} · · · M_p] be the matrix whose columns are the last p − r columns of M, Y M2 = 0 a.s.

    Proof: Let V be the ordered eigenvectors of Σ and take M = V. Then Var(Y V) = ∆, whose last p − r diagonal entries are 0, so by the Lemma each coordinate of Y M2 is constant a.s.; since it has mean 0, Y M2 = 0 a.s.

  • Projection Theorem

    Corollary: If Y is a random p-vector with mean 0 and variance Σ, and rank(Σ) = r < p, then Y = Y V[•r] V[•r]⊤ a.s.; that is, Y lies in Span(V[•r]) a.s., so projecting onto the first r principal components loses nothing.


  • Factor Analysis and PCA

    The Factor Analysis Model
    Suppose there are random factors C_1, …, C_r that are

    uncorrelated, with E(C_i) = 0 and Var(C_i) = σ²_C.

    Let X_i = µ_i + Σ_{j=1}^r q_{ij} C_j + E_i for i = 1, …, p (with p ≥ r), where
    the µ_i and q_{ij} are fixed constants, and
    the E_i are i.i.d. with mean 0 and variance σ²_E, independent of C_j for all j.

    i.e. X = µ + CQ + E, where C = (C_1, …, C_r) and Q is the r × p matrix with (j, i) entry q_{ij}.

    Then Var(X) = σ²_C Q⊤Q + σ²_E I_p, where I_p is the p × p identity matrix.
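
    A simulation sketch of this factor analysis model (assumed toy values of p, r, σ_C and σ_E), checking the variance formula:

```python
import numpy as np

rng = np.random.default_rng(6)
p, r, n = 6, 2, 200000
sigma_C, sigma_E = 1.5, 0.5
Q = rng.normal(size=(r, p))                       # fixed r × p loading matrix
mu = rng.normal(size=p)

C = sigma_C * rng.normal(size=(n, r))             # uncorrelated factors, Var = σ²_C
E = sigma_E * rng.normal(size=(n, p))             # noise, Var = σ²_E
X = mu + C @ Q + E

Var_model = sigma_C**2 * Q.T @ Q + sigma_E**2 * np.eye(p)
print(np.max(np.abs(np.cov(X, rowvar=False) - Var_model)))   # small for large n
```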

  • Factor Analysis and PCA, cont'd

    The Factor Analysis Model
    X = µ + CQ + E, with Var(X) = σ²_C Q⊤Q + σ²_E I_p.

    Let the SVD of Q be Q = U D V⊤.

    Note that the columns of V are the first r principal components of X.
    Note that Q is not identifiable, but Span(Q) (the span of its rows) = Span(V) is identifiable.
    So XV gives a basis for Span(C).
    Since only Span(C) is available, XV is often rotated to give an interpretable set of factors.
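
    A numerical check that, in this model, the first r principal components of Var(X) span the same subspace as the right singular vectors of Q (toy values assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
p, r = 6, 2
sigma_C, sigma_E = 1.5, 0.5
Q = rng.normal(size=(r, p))

Var_X = sigma_C**2 * Q.T @ Q + sigma_E**2 * np.eye(p)
evals, evecs = np.linalg.eigh(Var_X)
V_top = evecs[:, ::-1][:, :r]                     # first r principal components of X

_, _, VT = np.linalg.svd(Q, full_matrices=False)  # right singular vectors of Q
# the two r-dimensional subspaces coincide (compare projection matrices)
assert np.allclose(V_top @ V_top.T, VT.T @ VT)
```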


  • Hierarchical Mean model and PCA

    The Hierarchical Mean Model
    Suppose

    C is a q-dimensional random (row) vector with variance Var(C) = σ²_C I_q,
    W is a q × p fixed matrix of rank r, and
    X is a random p-vector with E(X | C) = CW and Var(X | C) = σ²I_p.

    Then E(X) = E(C)W and Var(X) = σ²_C W⊤W + σ²I_p.

    Let the SVD of W be W = U D V⊤.

    Note that the first r columns of V are the first r principal components of X.
    Note that W is not identifiable, but Span(W) (the span of its rows) = Span(V[•r]) is identifiable.


  • PCA for Normally Distributed RVs

    PCA is often used with multivariate Normal data.
    If X is a p-dimensional Normal vector with mean µ and (nonsingular) variance Σ then, up to an additive constant,

    L(X | µ, Σ) = −(1/2) ( log det(Σ) + Y Σ⁻¹ Y⊤ ) = −(1/2) Σ_{i=1}^p ( log δ_i + (Y V_i)²/δ_i )

    where Y = X − µ. We see that

    the likelihood depends on the data only through the projection of the centered data onto the principal components, and
    the level contours of the distribution are ellipsoids centered at µ with major axes described by the principal components.
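
    A small check of the two forms of the (constant-free) log-likelihood, on an assumed random µ and Σ:

```python
import numpy as np

rng = np.random.default_rng(8)
p = 4
A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)           # nonsingular variance
mu = rng.normal(size=p)
x = rng.multivariate_normal(mu, Sigma)

Y = x - mu
delta, V = np.linalg.eigh(Sigma)      # eigenvalues δ_i and eigenvectors V_i (columns)

# direct form of the constant-free log-likelihood
ll_direct = -0.5 * (np.log(np.linalg.det(Sigma)) + Y @ np.linalg.inv(Sigma) @ Y)
# spectral form: depends on the data only through the projections Y·V_i
ll_spectral = -0.5 * np.sum(np.log(delta) + (Y @ V) ** 2 / delta)

assert np.isclose(ll_direct, ll_spectral)
```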

  • PCA for Normally Distributed RVs

    Figure: from moserware.com

  • PCA and MLE

    Normal Factor Analysis Model
    If X = µ + CQ + E where

    µ is a p-vector,
    C ∼ MVN(0, σ²_C I_r),
    E ∼ MVN(0, σ²_E I_p), and
    Q is a fixed r × p matrix,

    then X ∼ MVN(µ, σ²_C Q⊤Q + σ²_E I_p).

    Tipping and Bishop (1999) show that Span(V̂[•r]) is the maximum likelihood estimator of Span(Q).
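
    A simulation sketch related to the Tipping and Bishop result: for an assumed toy model, the span of the first r empirical principal components approaches Span(Q) as n grows. This is only an illustration of consistency, not the likelihood argument itself:

```python
import numpy as np

rng = np.random.default_rng(9)
p, r, n = 8, 2, 50000
sigma_C, sigma_E = 2.0, 0.3
Q = rng.normal(size=(r, p))
mu = rng.normal(size=p)

X = mu + sigma_C * rng.normal(size=(n, r)) @ Q + sigma_E * rng.normal(size=(n, p))
Y = X - X.mean(axis=0)

_, _, VT = np.linalg.svd(Y, full_matrices=False)
V_hat_r = VT[:r].T                     # first r empirical principal components

_, _, QVT = np.linalg.svd(Q, full_matrices=False)
P_hat = V_hat_r @ V_hat_r.T            # projector onto Span(V̂[•r])
P_Q = QVT.T @ QVT                      # projector onto Span(Q) (row space of Q)
print(np.max(np.abs(P_hat - P_Q)))     # close to 0, and shrinks as n grows
```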


  • PCA and MLE

    Normal Hierarchical Mean Model
    In the hierarchical mean model, if C (q-dimensional) and X | C (p-dimensional) are both multivariate Normal and W has rank r ≤ q, then

    X ∼ MVN(E(C)W, σ²_C W⊤W + σ²I_p).

    Note that E(X) is in Span(W).

    The argument of Tipping and Bishop (1999) shows that Span(V̂[•r]) is the maximum likelihood estimator of Span(W).


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version

    We generate data by drawing C_i ∼ N_q(0, σ²_C I_q) and X_i | C_i ∼ N_p(C_i W, σ²I_p).

    This is equivalent to X_i = C_i W + E_i where E_i ∼ N_p(0, σ²I_p).

    Letting 𝐂 be the n × q (n > q, unobserved) matrix with rows C_i and 𝐄 be the n × p matrix with rows E_i,
    𝐗 = 𝐂W + 𝐄.
    Y_i = X_i − X̄, or 𝐘 = P𝐗 where P = I_n − 11⊤/n.
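
    A tiny check that P = I_n − 11⊤/n is the centering projection (toy sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 7, 3
X = rng.normal(size=(n, p))

P = np.eye(n) - np.ones((n, n)) / n            # P = I_n - 11⊤/n
assert np.allclose(P @ X, X - X.mean(axis=0))  # PX subtracts the column means
assert np.allclose(P @ P, P)                   # P is an orthogonal projection
```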


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version
    We have:

    𝐗 = 𝐂W + 𝐄 and 𝐘 = P𝐗 = P𝐂W + P𝐄.
    Let the SVD of W be W = U D V⊤. Then Span(W) = Span(V[•r]).

    Let 𝐘 = Û D̂ V̂⊤.
    We have already seen that Û[•r] Diag(d̂_1, …, d̂_r) V̂[•r]⊤ is a solution to the rank-r matrix closest to 𝐘,
    and V̂[•r] are the first r empirical principal components of X.

    Now suppose that we also consider the SVD 𝐗 = u d v⊤.

    Some questions:
    When is Span(𝐂W) = Span(P𝐂W) = Span(W)?
    𝐂 has full rank q a.s., so Span(𝐂W) = Span(W) a.s.
    Is P𝐂 full rank a.s.?
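
    A quick numerical look at these questions under assumed toy dimensions: the ranks of 𝐂 and P𝐂, and equality of the row spaces of 𝐂W, P𝐂W and W:

```python
import numpy as np

rng = np.random.default_rng(11)
n, q, p, r = 30, 3, 6, 2
C = rng.normal(size=(n, q))                               # unobserved factor matrix
W = rng.normal(size=(r, q)).T @ rng.normal(size=(r, p))   # q × p loading matrix of rank r
P = np.eye(n) - np.ones((n, n)) / n                       # centering matrix

print(np.linalg.matrix_rank(C), np.linalg.matrix_rank(P @ C))   # both q here

def row_space_projector(M, k):
    # projector onto the span of the top-k right singular vectors
    # (the row space of M when rank(M) = k)
    _, _, VT = np.linalg.svd(M)
    return VT[:k].T @ VT[:k]

# CW, PCW and W share the same row space
assert np.allclose(row_space_projector(C @ W, r), row_space_projector(W, r))
assert np.allclose(row_space_projector(P @ C @ W, r), row_space_projector(W, r))
```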


  • PCA and MLE

    Normal Hierarchical Mean Model - Empirical version

    How does Span(V̂[•r]) compare with Span(v[•r])?

    I would like to argue that since 𝐄 is a random Normal matrix, most of the structure in 𝐗 and 𝐘 = P𝐗 comes from 𝐂W and P𝐂W. When is this true?


  • PCA and MLE

    Putting PCA in terms of maximum likelihood suggests at least two extensions:

    PCA can be robustified against contamination via robust M-estimation methods.
    PCA can be extended to non-Normal multivariate families via maximum likelihood estimates based on the factor analysis or hierarchical mean models.

    We tackle the second problem.


  • Extensions of PCA

    Elliptical families
    A p-dimensional random vector X has an elliptical distribution if

    there is a fixed p-vector µ and a fixed positive semi-definite matrix S such that
    the density function depends on X only through (X − µ) S (X − µ)⊤.

    The level probability contours of X are ellipsoids with center µ and major axes the eigenvectors of S.
    If Var(X) is finite, then E(X) = µ and S is (up to a scalar multiple) a generalized inverse of Var(X).


  • Extensions of PCA

    Elliptical families cont'd.
    Any elliptical distribution can be decomposed as

    X = µ + R U^(p) A

    where
    U^(p) is a random vector uniformly distributed on the unit sphere in ℝ^p,
    R is a non-negative random variable independent of U^(p), and
    A is a fixed matrix with A⊤A = Σ.

    Letting C = R U^(p) and Q = A, we see that this is just a factor analysis model (with r = p and no separate noise term).
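
    A simulation sketch of this decomposition with an assumed radius distribution (chi with 5 degrees of freedom) and an assumed A; it checks that Var(X) is proportional to A⊤A:

```python
import numpy as np

rng = np.random.default_rng(12)
p, n = 3, 200000
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
Sigma = A.T @ A

U = rng.normal(size=(n, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)      # uniform on the unit sphere
R = np.sqrt(rng.chisquare(df=5, size=n))           # a non-negative radius, independent of U
X = mu + (R[:, None] * U) @ A

scale = (R ** 2).mean() / p                        # Var(X) = (E R²/p) · A⊤A
print(np.max(np.abs(np.cov(X, rowvar=False) - scale * Sigma)))   # small for large n
```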

  • Extensions of PCA

    Elliptical families cont'd.
    Cambanis, Huang and Simons (1981) show that if rank(Σ) = r < p then

    X = µ + S U^(r) V[•r]⊤

    where
    S is a random variable independent of U^(r) (uniform on the unit sphere in ℝ^r), and
    V[•r] are the first r eigenvectors of Σ.
    The remaining eigenvectors are associated with the null space of A⊤A.

  • Extensions of PCA

    Elliptical families cont'd.
    Elliptical distributions are a special case of the factor analysis model.
    The span of the first r principal components is Span(A).
    This should be enough to show that the first r empirical principal components are the MLE of Span(A).

  • Extensions of PCA

    Linear Exponential Families
    A p-dimensional random vector X is in a linear exponential family in canonical form if

    there is a p-vector θ such that
    L(x | θ) = h(x) + x θ⊤ − G(θ),

    where h(x) and G(θ) are real-valued functions.


  • Extensions of PCA

    Linear Exponential Families cont'd

    e.g. Suppose that X ∼ Poisson(λ).

    f(x | λ) = e^{−λ} λ^x / x!
    L(x | λ) = −λ + x log(λ) − log(x!)

    Then
    θ = log(λ),
    h(x) = −log(x!),
    G(θ) = exp(θ) = λ,
    E(X | θ) = G′(θ) = exp(θ) = λ, and
    Var(X | θ) = G″(θ) = λ.

  • Extensions of PCA

    Linear Exponential Families cont'd
    We can turn this into a hierarchical means model by assuming

    E(X | C) = g(CW)

    where C is random and g is applied coordinatewise. Note that we are assuming that θ = CW (so g = G′).

    Note however that

    Var(X | C) = g′(CW), and
    Var(X) = Var(E(X | C)) + E(Var(X | C)) = Var(g(CW)) + E(g′(CW)),

    so ordinary PCA does NOT extract the space spanned by the singular vectors of W.
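
    A closed-form toy illustration of this point (one Poisson factor with assumed w = (1, 2) and C ∼ N(0, 1), not taken from the slides): the leading eigenvector of Var(X) is far from the direction spanned by W:

```python
import numpy as np

# Toy one-factor Poisson model: θ = C·w with C ~ N(0, 1) and w = (1, 2).
# Var(X) = Var(exp(Cw)) + E[diag(exp(Cw))] has a closed form via E[e^{tC}] = e^{t²/2}.
w = np.array([1.0, 2.0])

def mgf(t):                              # E[e^{tC}] for C ~ N(0, 1)
    return np.exp(t ** 2 / 2)

cov_signal = np.array([[mgf(wi + wj) - mgf(wi) * mgf(wj) for wj in w] for wi in w])
var_X = cov_signal + np.diag(mgf(w))     # add E[Var(X | C)] = E[exp(Cw)] on the diagonal

_, evecs = np.linalg.eigh(var_X)
print(np.abs(evecs[:, -1]))              # leading PC direction, roughly (0.03, 1.00)
print(w / np.linalg.norm(w))             # Span(W) direction, roughly (0.45, 0.89)
```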

  • Extensions of PCA

    Linear Exponential Families cont'd

    e.g. Suppose that X ∼ Poisson(λ). If θ = log(λ) = CW (for now W is 1 × p), then

    Var(X) = E(exp(CW)) + Var(exp(CW)).

  • Extensions of PCA

    Linear Exponential Families cont'd
    We propose to use MLE to find the space spanned by the singular vectors of W corresponding to non-zero singular values.