Generalized Principal Component Analysis: Dimensionality Reduction through the Projection of Natural Parameters

Yoonkyung Lee*
Department of Statistics, The Ohio State University
*joint work with Andrew Landgraf

November 10, 2015
Department of Statistics and Probability, Michigan State University
Dimensionality Reduction
From principal component analysis (PCA) to generalized PCA for non-Gaussian data
Hotelling, H. (1933), Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24(6), 417-441
Pearson, K. (1901), On Lines and Planes of Closest Fit to Systems of Points in Space, Philosophical Magazine 2(11), 559-572
Principal Component Analysis (PCA)
PCA is concerned with explaining the variance-covariance structure of a set of correlated variables through a few linear combinations of these variables.
[Three scatter plots of D.Humerus versus Humerus (both axes 1.2 to 2.4): the raw data, a random projection, and the first principal component (PC1).]

Figure: Data on the mineral content measurements (g/cm) of three bones (humerus, radius, and ulna) on the dominant and nondominant sides for 25 older women.
Variance Maximization
- Given $p$ correlated variables $X = (X_1, \ldots, X_p)^\top$, consider a linear combination of the $X_j$'s,
$$\sum_{j=1}^{p} a_j X_j = a^\top X$$
for $a = (a_1, \ldots, a_p)^\top \in \mathbb{R}^p$ with $\|a\|_2 = 1$.
- The first principal component direction is defined as the vector $a$ that gives the largest sample variance of $a^\top X$ among all unit vectors $a$:
$$\max_{a \in \mathbb{R}^p,\ \|a\|_2 = 1} a^\top S_n a,$$
where $S_n$ is the sample variance-covariance matrix of $X$.
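As a quick illustration (my own sketch, not part of the talk), the first principal component direction is the top eigenvector of $S_n$, which base R computes directly:

```r
# Simulated data: n = 100 observations of p = 3 correlated variables
set.seed(1)
n <- 100
X <- matrix(rnorm(3 * n), n, 3) %*% matrix(c(1, 0.8, 0.5,
                                             0, 0.6, 0.3,
                                             0, 0,   0.4), 3, 3)

Sn <- cov(X)                                    # sample variance-covariance matrix
v1 <- eigen(Sn, symmetric = TRUE)$vectors[, 1]  # first PC direction (unit vector)

# v1 attains the maximum of a' Sn a over unit vectors a; it agrees with
# prcomp() up to sign
all.equal(abs(v1), abs(unname(prcomp(X)$rotation[, 1])))
```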
Principal Components
- Let $S_n = \sum_{j=1}^{p} \lambda_j v_j v_j^\top$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$ and corresponding eigenvectors $v_1, \ldots, v_p$. Then the first principal component direction is given by $v_1$.
- The derived variable $Z_1 = v_1^\top X$ is called the first principal component.
- Similarly, the second principal component direction is defined as the vector $a$ with the largest sample variance of $a^\top X$ among all normalized $a$ subject to $a^\top X$ being uncorrelated with $v_1^\top X$. It is given by $v_2$.
- In general, the $j$th principal component direction is defined successively for $j = 1, \ldots, p$.
Pearson's Reconstruction Error Formulation
Pearson, K. (1901), On Lines and Planes of Closest Fit to Systems of Points in Space

- Given $x_1, \ldots, x_n \in \mathbb{R}^p$, consider the following data approximation:
$$x_i \approx \mu + v v^\top (x_i - \mu),$$
where $\mu \in \mathbb{R}^p$ and $v$ is a unit vector in $\mathbb{R}^p$, so that $v v^\top$ is a rank-one projection.
- What are the $\mu$ and $v \in \mathbb{R}^p$ (with $\|v\|_2 = 1$) that minimize the reconstruction error
$$\sum_{i=1}^{n} \|x_i - \mu - v v^\top (x_i - \mu)\|^2\,?$$
- $\hat{\mu} = \bar{x}$ and $\hat{v} = v_1$ minimize the error.
Minimization of Reconstruction Error
- More generally, consider a rank-$k$ ($k < p$) approximation:
$$x_i \approx \mu + V V^\top (x_i - \mu),$$
where $\mu \in \mathbb{R}^p$ and $V$ is a $p \times k$ matrix with orthonormal columns, so that $V V^\top$ is a rank-$k$ projection.
- We wish to minimize the reconstruction error
$$\sum_{i=1}^{n} \|x_i - \mu - V V^\top (x_i - \mu)\|^2 \quad \text{subject to } V^\top V = I_k.$$
- $\hat{\mu} = \bar{x}$ and $\hat{V} = [v_1, \ldots, v_k]$ provide the best $k$-dimensional reconstruction of the data.
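A numeric check of this result (again my own sketch): projecting the centered data onto the top $k$ eigenvectors of $S_n$ reproduces the rank-$k$ reconstruction obtained from the truncated singular value decomposition of the centered data matrix:

```r
set.seed(2)
X  <- matrix(rnorm(100 * 5), 100, 5)
k  <- 2
mu <- colMeans(X)
Xc <- sweep(X, 2, mu)                          # centered data

V    <- eigen(cov(X), symmetric = TRUE)$vectors[, 1:k]  # top-k PC directions
Xhat <- sweep(Xc %*% V %*% t(V), 2, mu, "+")   # mu + V V' (x_i - mu)

# Equivalent rank-k reconstruction from the truncated SVD of Xc
s <- svd(Xc)
Xhat_svd <- sweep(s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k]), 2, mu, "+")
all.equal(Xhat, Xhat_svd)
```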
PCA for Non-Gaussian Data?
- PCA finds a low-rank subspace by implicitly minimizing the reconstruction error under squared error loss, which is linked to the Gaussian distribution.
- Binary, count, and non-negative data abound in practice: e.g., images, term frequencies for documents, ratings for movies, click-through rates for online ads.
- How can we generalize PCA to non-Gaussian data?
Generalization of PCA
Collins et al. (2001), A generalization of principal components analysis to the exponential family

- Draws on ideas from the exponential family and generalized linear models.
- For Gaussian data, assume that $x_i \sim N_p(\theta_i, I_p)$ and that $\theta_i \in \mathbb{R}^p$ lies in a $k$-dimensional subspace: for a basis $\{b_\ell\}_{\ell=1}^{k}$,
$$\theta_i = \sum_{\ell=1}^{k} a_{i\ell} b_\ell = B_{(p \times k)}\, a_i.$$
- To find $\Theta = [\theta_{ij}]$, maximize the log-likelihood, or equivalently minimize the negative log-likelihood (or deviance):
$$\sum_{i=1}^{n} \|x_i - \theta_i\|^2 = \|X - \Theta\|_F^2 = \|X - A B^\top\|_F^2.$$
Generalization of PCA
- According to the Eckart–Young theorem, the best rank-$k$ approximation of $X = U_{n \times p} D_{p \times p} V_{p \times p}^\top$ is given by the rank-$k$ truncated singular value decomposition $\underbrace{U_k D_k}_{A} \underbrace{V_k^\top}_{B^\top}$.
- For exponential family data, factorize the matrix of natural parameter values $\Theta$ as $A B^\top$, with rank-$k$ matrices $A_{n \times k}$ and $B_{p \times k}$ (with orthonormal columns), by maximizing the log-likelihood.
- For binary data $X = [x_{ij}]$ with $P = [p_{ij}]$, "logistic PCA" looks for a factorization of $\Theta = \left[\log \frac{p_{ij}}{1 - p_{ij}}\right] = A B^\top$ that maximizes
$$\ell(X; \Theta) = \sum_{i,j} \left\{ x_{ij}\, a_i^\top b_{j\ast} - \log\left(1 + \exp(a_i^\top b_{j\ast})\right) \right\}$$
subject to $B^\top B = I_k$.
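To make the factorization objective concrete, here is a small base-R sketch (my own; `A` and `B` are hypothetical factor matrices, not output of any method from the talk) of the Bernoulli log-likelihood above:

```r
# Bernoulli log-likelihood of a natural-parameter factorization Theta = A B'
# X: n x p binary matrix; A: n x k scores; B: p x k loadings
logistic_svd_loglik <- function(X, A, B) {
  Theta <- A %*% t(B)                 # matrix of logits
  sum(X * Theta - log(1 + exp(Theta)))
}
```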
Drawbacks of the Matrix Factorization Formulation
- Involves estimation of both case-specific (row-specific) scores $A$ and variable-specific (column-specific) factors $B$: more an extension of SVD than of PCA.
- The number of parameters grows with the number of observations.
- Obtaining generalized PC scores for new data requires additional optimization, whereas PC scores for standard PCA are simple linear combinations of the data.
Alternative Interpretation of Standard PCA
- Assuming that the data are centered ($\mu = 0$), minimize
$$\sum_{i=1}^{n} \|x_i - V V^\top x_i\|^2 = \|X - X V V^\top\|_F^2$$
subject to $V^\top V = I_k$.
- $X V V^\top$ can be viewed as a rank-$k$ projection of the matrix of natural parameters ("means" in this case) of the saturated model $\tilde{\Theta}$ (the best possible fit) for Gaussian data.
- Standard PCA finds the best rank-$k$ projection of $\tilde{\Theta}$ by minimizing the deviance under the Gaussian distribution.
Natural Parameters of the Saturated Model
- For an exponential family distribution with natural parameter $\theta$ and pdf
$$f(x \mid \theta) = \exp\left(\theta x - b(\theta) + c(x)\right),$$
$E(X) = b'(\theta)$, and the canonical link function is the inverse of $b'$.

  Distribution      | $\theta$             | $b(\theta)$              | canonical link
  ------------------|----------------------|--------------------------|---------------
  $N(\mu, 1)$       | $\mu$                | $\theta^2/2$             | identity
  Bernoulli($p$)    | $\mathrm{logit}(p)$  | $\log(1 + \exp(\theta))$ | logit
  Poisson($\lambda$)| $\log(\lambda)$      | $\exp(\theta)$           | log

- Take $\tilde{\Theta} = [\text{canonical link}(x_{ij})]$.
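A small sketch (my own) of forming $\tilde{\Theta}$ under each family; note that for Bernoulli data $\mathrm{logit}(x_{ij})$ is $\pm\infty$, which motivates the finite approximation introduced on the next slide:

```r
# Natural parameters of the saturated model: theta_tilde = canonical link(x)
theta_tilde_gaussian  <- function(X) X            # identity link
theta_tilde_poisson   <- function(X) log(X)       # log link (requires x > 0)
theta_tilde_bernoulli <- function(X, m = 5) {
  m * (2 * X - 1)   # finite stand-in for logit(x) = +/- infinity
}
```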
New Formulation of Logistic PCA
Landgraf and Lee (2015), Dimensionality Reduction for Binary Data through the Projection of Natural Parameters

- Given $x_{ij} \sim \mathrm{Bernoulli}(p_{ij})$, the natural parameter ($\mathrm{logit}\, p_{ij}$) of the saturated model is
$$\tilde{\theta}_{ij} = \mathrm{logit}(x_{ij}) = \infty \times (2 x_{ij} - 1).$$
We approximate $\tilde{\theta}_{ij} \approx m \times (2 x_{ij} - 1)$ for large $m > 0$.
- Project $\tilde{\Theta}$ onto a $k$-dimensional subspace by using the deviance $D(X; \Theta) = -2\{\ell(X; \Theta) - \ell(X; \tilde{\Theta})\}$ as the loss:
$$\min_{V \in \mathbb{R}^{p \times k}} D(X; \underbrace{\tilde{\Theta} V V^\top}_{\hat{\Theta}}) = -2 \sum_{i,j} \left\{ x_{ij} \hat{\theta}_{ij} - \log(1 + \exp(\hat{\theta}_{ij})) \right\} \quad \text{subject to } V^\top V = I_k.$$
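Transcribing this objective directly (a sketch of my own; `lpca_deviance` is a hypothetical helper, and `V` is any $p \times k$ matrix with orthonormal columns):

```r
# Deviance loss for logistic PCA: project Theta_tilde onto span(V)
# X: n x p binary matrix; V: p x k with V'V = I_k; m: logit approximation
# (For Bernoulli data the saturated log-likelihood is 0, so the deviance
#  reduces to -2 times the log-likelihood at Theta_hat.)
lpca_deviance <- function(X, V, m = 5) {
  Theta_tilde <- m * (2 * X - 1)
  Theta_hat   <- Theta_tilde %*% V %*% t(V)
  -2 * sum(X * Theta_hat - log(1 + exp(Theta_hat)))
}
```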
Logistic PCA vs. Logistic SVD
- The previous logistic SVD (matrix factorization) gives an approximation of $\mathrm{logit}\, P$:
$$\hat{\Theta}_{\mathrm{LSVD}} = A B^\top.$$
- Alternatively, our logistic PCA gives
$$\hat{\Theta}_{\mathrm{LPCA}} = \underbrace{\tilde{\Theta} V}_{A}\, V^\top,$$
which has far fewer parameters.
- Computing PC scores for new data requires only matrix multiplication for logistic PCA, whereas logistic SVD requires fitting a $k$-dimensional logistic regression for each new observation (see the sketch below).
- Logistic SVD, with the additional $A$, is prone to overfitting.
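A sketch of this contrast for new data (my own illustration; `V` and `m` would come from a fitted logistic PCA):

```r
# Logistic PCA scores for new binary observations: one matrix product
lpca_scores <- function(X_new, V, m = 5) {
  (m * (2 * X_new - 1)) %*% V        # Theta_tilde_new V
}
# Logistic SVD would instead have to solve a k-dimensional logistic
# regression for each new row to estimate its score vector a_i given
# the fixed loadings B.
```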
Geometry of Logistic PCA
Figure: Logistic PCA projection in the natural parameter space with $m = 5$ (left) and in the probability space (right), compared with the PCA projection.
New Formulation of Generalized PCA
Landgraf and Lee (2015), Generalized PCA: Projection of Saturated Model Parameters

- The idea can be applied to any exponential family distribution (e.g., Poisson, multinomial).
- Find the best rank-$k$ projection of the matrix of natural parameters from the saturated model, $\tilde{\Theta}$, by minimizing the appropriate deviance for the data:
$$\min_{V \in \mathbb{R}^{p \times k}} D(X; \tilde{\Theta} V V^\top) \quad \text{subject to } V^\top V = I_k.$$
- If desired, main effects $\mu$ can be added to the approximation of $\Theta$:
$$\hat{\Theta} = \mathbf{1}\mu^\top + (\tilde{\Theta} - \mathbf{1}\mu^\top) V V^\top.$$
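The same template carries over to counts; here is a sketch for the Poisson case (my own, using a conventional finite stand-in $-m$ for $\log 0$ in the saturated parameters):

```r
# Generalized PCA deviance for Poisson data, with main effects mu
# Saturated natural parameter: log(x), with log(0) = -Inf capped at -m
gpca_poisson_deviance <- function(X, V, mu, m = 5) {
  Theta_tilde <- ifelse(X > 0, log(X), -m)
  M <- tcrossprod(rep(1, nrow(X)), mu)                  # 1 mu'
  Theta_hat <- M + (Theta_tilde - M) %*% V %*% t(V)
  # Poisson deviance: 2 * sum{ x (theta_tilde - theta_hat) - (x - exp(theta_hat)) }
  2 * sum(X * (Theta_tilde - Theta_hat) - (X - exp(Theta_hat)))
}
```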
First-Order Optimality Conditions
- For the orthonormality constraint $V^\top V = I_k$, consider the Lagrangian
$$L(V, \mu, \Lambda) = D(X; \mathbf{1}\mu^\top + (\tilde{\Theta} - \mathbf{1}\mu^\top) V V^\top) + \mathrm{tr}\left(\Lambda (V^\top V - I_k)\right),$$
where $\Lambda$ is a $k \times k$ symmetric matrix of Lagrange multipliers.
- Setting the gradient of $L$ with respect to $V$, $\mu$, and $\Lambda$ equal to 0, we obtain the first-order optimality conditions for logistic PCA:
$$\left[(X - \hat{P})^\top (\tilde{\Theta} - \mathbf{1}\mu^\top) + (\tilde{\Theta} - \mathbf{1}\mu^\top)^\top (X - \hat{P})\right] V = V \Lambda,$$
$$(I_p - V V^\top)(X - \hat{P})^\top \mathbf{1} = 0, \quad \text{and} \quad V^\top V = I_k.$$
MM Algorithm
- Majorize the objective function with a simpler function at each iterate, and minimize the majorizing function (Hunter and Lange, 2004).
- From the quadratic approximation of the deviance at the step-$t$ solution $\Theta^{(t)}$, and the fact that $p(1-p) \le 1/4$,
$$D(X; \mathbf{1}\mu^\top + (\tilde{\Theta} - \mathbf{1}\mu^\top) V V^\top) \le \frac{1}{4} \|\mathbf{1}\mu^\top + (\tilde{\Theta} - \mathbf{1}\mu^\top) V V^\top - Z^{(t+1)}\|_F^2 + C,$$
where $Z^{(t+1)} = \Theta^{(t)} + 4(X - \hat{P}^{(t)})$.
- Update $\Theta$ at step $t+1$: averaging for $\mu^{(t+1)}$ given $V^{(t)}$, and an eigenanalysis of a $p \times p$ matrix for $V^{(t+1)}$ given $\mu^{(t+1)}$.
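A minimal sketch of the MM iteration for the centered case ($\mu = 0$), assuming the $m$-approximation of $\tilde{\Theta}$ from earlier; this is my own illustration of the update, not the authors' implementation:

```r
# MM iterations for logistic PCA (mu = 0): each pass majorizes the deviance
# at Theta^(t) and minimizes the quadratic surrogate exactly
lpca_mm <- function(X, k, m = 5, iters = 100) {
  Theta_tilde <- m * (2 * X - 1)
  V <- svd(Theta_tilde)$v[, 1:k, drop = FALSE]   # initialize from SVD
  for (t in seq_len(iters)) {
    Theta <- Theta_tilde %*% V %*% t(V)          # current natural parameters
    P_hat <- plogis(Theta)                       # fitted probabilities
    Z     <- Theta + 4 * (X - P_hat)             # working variable Z^(t+1)
    # Minimizing ||Theta_tilde V V' - Z||_F^2 over V'V = I_k amounts to
    # taking the top-k eigenvectors of a p x p symmetric matrix:
    M <- crossprod(Theta_tilde, Z) + crossprod(Z, Theta_tilde) -
         crossprod(Theta_tilde)
    V <- eigen(M, symmetric = TRUE)$vectors[, 1:k, drop = FALSE]
  }
  V
}
```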
Medical Diagnosis Data
- Part of electronic health record data on 12,000 adult patients admitted to the intensive care units (ICU) at the Ohio State University Medical Center from 2007 to 2010 (provided by S. Hyun).
- Patients are classified as having one or more diseases out of over 800 disease categories from the International Classification of Diseases (ICD-9).
- We are interested in characterizing the comorbidity as latent factors, which can be used to define patient profiles for prediction of other clinical outcomes.
- The analysis is based on a sample of 1,000 patients, which reduced the number of disease categories to 584.
Patient-Diagnosis Matrix
Deviance Explained by Components
[Plot: cumulative (left) and marginal (right) percent of deviance explained versus number of principal components (0 to 30).]

Figure: Cumulative and marginal percent of deviance explained by principal components of LPCA, LSVD, and PCA.
Deviance Explained by Parameters
[Plot: percent of deviance explained versus number of free parameters (0k to 40k).]

Figure: Cumulative percent of deviance explained by principal components of LPCA, LSVD, and PCA versus the number of free parameters.
Predictive Deviance
[Plot: cumulative (left) and marginal (right) percent of predictive deviance versus number of principal components (0 to 30).]

Figure: Cumulative and marginal percent of predictive deviance over test data (1,000 patients) by the principal components of LPCA and PCA.
Interpretation of Loadings
[Loading plot for Components 1 and 2 (loadings roughly from -0.2 to 0.2). Labeled diagnoses include: 01: Gram-neg septicemia NEC; 03: Hyperpotassemia; 08: Acute respiratory failure; 10: Acute kidney failure NOS; 16: Speech disturbance NEC; 17: Asphyxiation/strangulation; 03: DMII wo cmp nt st uncntr; 03: Hyperlipidemia NEC/NOS; 04: Leukocytosis NOS; 07: Mal hy kid w cr kid I-IV; 07: Old myocardial infarct; 07: Crnry athrscl natve vssl; 07: Systolic hrt failure NOS; 09: Oral soft tissue dis NEC; 10: Chr kidney dis stage III; V: Gastrostomy status.]

Figure: The first component is characterized by common serious conditions that bring patients to the ICU, and the second component is dominated by diseases of the circulatory system (the 07 codes).
Concluding Remarks
- We generalize PCA via projections of the natural parameters of the saturated model, using the generalized linear model framework.
- We have extended generalized PCA to handle differential case weights, missing data, and variable normalization.
- Further extensions are possible with constraints other than rank, for desirable properties (e.g., sparsity) on the loadings, and with predictive formulations.
- The R package logisticPCA is available on CRAN, and generalizedPCA is currently under development.
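For reference, a typical call to the CRAN package might look like this (a sketch: the argument names `k` and `m` and the fitted-object fields follow my reading of the package documentation; check `?logisticPCA` for the authoritative interface):

```r
# install.packages("logisticPCA")
library(logisticPCA)

# X: an n x p binary (0/1) matrix, e.g., a patient-diagnosis matrix
fit <- logisticPCA(X, k = 2, m = 5)   # k components, logit approximation m
head(fit$PCs)                         # scores of the training observations
```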
Acknowledgments
Andrew Landgraf @ Battelle Memorial Institute
Sookyung Hyun and Cheryl Newton @ College of Nursing, OSU
DMS-1513566
References
A. J. Landgraf and Y. Lee. Dimensionality reduction for binary data through the projection of natural parameters. Technical Report 890, Department of Statistics, The Ohio State University, 2015. Also available as arXiv:1510.06112.
A. J. Landgraf and Y. Lee. Generalized principal component analysis: Projection of saturated model parameters. Technical Report 892, Department of Statistics, The Ohio State University, 2015.