

Submitted to the Annals of Statistics

arXiv: arXiv:0000.0000

HETEROSKEDASTIC PCA: ALGORITHM, OPTIMALITY, AND APPLICATIONS

By Anru Zhang*, T. Tony Cai†, and Yihong Wu‡

University of Wisconsin-Madison, University of Pennsylvania, and Yale University

Principal component analysis (PCA) and singular value decomposition (SVD) are widely used in statistics, machine learning, and applied mathematics. They have been well studied in the case of homoskedastic noise, where the noise levels of the contamination are homogeneous.

In this paper, we consider PCA and SVD in the presence of heteroskedastic noise, which arises naturally in a range of applications. We introduce a general framework for heteroskedastic PCA and propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries to remove the bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis of the singular subspace, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of applications, including heteroskedastic low-rank matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data.

1. Introduction. Principal component analysis (PCA) and spectral methods are ubiquitous tools in many fields, including statistics, machine learning, and applied mathematics. They have been extensively studied and used in a wide range of applications. Recent examples include matrix denoising (Donoho and Gavish, 2014; Shabalin and Nobel, 2013), community detection (Donath and Hoffman, 2003; Newman, 2013), ranking from pairwise comparisons (Negahban et al., 2012; Chen and Suh, 2015), matrix completion (Keshavan et al., 2010; Sun and Luo, 2016), high-dimensional clustering

*The research of Anru Zhang was supported in part by NSF Grant DMS-1811868 and NIH grant R01-GM131399-01.

†The research of Tony Cai was supported in part by NSF Grant DMS-1712735 and NIH grants R01-GM129781 and R01-GM123056.

‡The research of Yihong Wu was supported in part by NSF Grant CCF-1527105, an NSF CAREER award CCF-1651588, and an Alfred Sloan fellowship.

MSC 2010 subject classifications: Primary 62H12, 62H25; secondary 62C20

Keywords and phrases: heteroskedasticity, low-rank matrix denoising, principal component analysis, singular value decomposition, spectral method



(Jin et al., 2016), Markov processes and reinforcement learning (Zhang and Wang, 2018), multidimensional scaling (Aflalo and Kimmel, 2013), topic modeling (Ke and Wang, 2017), phase retrieval (Candes et al., 2015; Cai et al., 2016), and tensor PCA (Richard and Montanari, 2014; Zhang and Xia, 2018; Zhang and Han, 2018).

The central idea of PCA is to extract the hidden low-rank structure from the noisy observations. The following spiked covariance model has been well studied and used as a baseline for both methodological and theoretical developments (Johnstone, 2001; Baik and Silverstein, 2006; Paul, 2007). Under such a model, one observes Y_1, ..., Y_n ~ N(μ, Σ_0 + σ^2 I_p) i.i.d., where Σ_0 is a symmetric low-rank matrix. The spiked covariance model can be equivalently written as

(1.1)    Y_k = X_k + ε_k,   X_k ~ N(μ_0, Σ_0) i.i.d.,   ε_k ~ N(0, σ^2 I_p) i.i.d.,   k = 1, ..., n,

with Σ_0 = UΛU^⊤ being a low-rank matrix. The goal is either to recover the matrix Σ_0 or its principal components. Asymptotic properties of the eigenvalues and eigenvectors of the sample covariance matrix have been well established. Estimators based on the eigen-decomposition of the sample covariance matrix have been introduced and their theoretical properties have been extensively studied in the literature. A key assumption in the model (1.1) is that the errors are homoskedastic, in the sense that each ε_k is assumed to be a spherically symmetric Gaussian.

Besides the spiked covariance model, the additive noise model has also been well studied. In this case, one observes Y = X + Z, where X is the target low-rank matrix and Z is the random perturbation matrix with zero mean. It is natural to use the singular value decomposition (SVD) of Y to recover the leading singular subspaces of X. In the case of nearly homoskedastic noise, where the entries of Z have the same or similar levels of amplitude, the regular SVD has been theoretically justified under various settings (Benaych-Georges and Nadakuditi, 2012; Shabalin and Nobel, 2013; Donoho and Gavish, 2014; Cai and Zhang, 2018).

1.1. Heteroskedastic PCA. In many applications, the noise can be highly heteroskedastic, i.e., the magnitude of the perturbation varies significantly from entry to entry of the data matrix. For example, heteroskedastic noise naturally appears in datasets whose measurements are of different types. For various biological sequencing and photon imaging data, the observations are discrete counts that are commonly modeled by discrete distributions such as Poisson, multinomial, or negative binomial (Salmon et al., 2014; Cao et al., 2017), which are naturally heteroskedastic. In network analysis and recommender systems, the observations are usually binary or ordinal, which are hardly homoskedastic. PCA is also used in the analysis of spectrophotometric data to determine the number of linearly independent species in rapid scanning wavelength kinetics experiments (Cochran and Horne, 1977). Spectrophotometric data often contain heteroskedastic noise, as the measurements are averages over time intervals of varying length, which causes the noise level to vary over time.

Motivated by these applications, it is natural to relax the homoskedasticity assumption in (1.1) and consider the generalized spiked covariance model (Bai and Yao, 2012; Yao et al., 2015), where one observes an i.i.d. sample {Y_1, ..., Y_n} with

(1.2)    Y_k = X_k + ε_k,   X_k ~ N(μ_0, Σ_0) i.i.d.,   ε_k ~ N(0, diag(σ_1^2, ..., σ_p^2)) i.i.d.,   k = 1, ..., n.

Here, σ_1^2, ..., σ_p^2 are unknown and need not be identical. It turns out that in this case the usual PCA can easily lead to inconsistent estimates. To see this, note that performing PCA on {Y_1, ..., Y_n} amounts to applying the regular SVD to Y = [Y_1, ..., Y_n], i.e., estimating U by the leading left singular vectors of the centered sample matrix

    Y − \bar{Y} 1_n^⊤,   where   \bar{Y} = (1/n) Σ_{k=1}^n Y_k.

Moreover, the left singular vectors of Y − \bar{Y} 1_n^⊤ are identical to the eigenvectors of the sample covariance matrix

    \hat{Σ} = (1/(n−1)) Σ_{k=1}^n (Y_k − \bar{Y})(Y_k − \bar{Y})^⊤.

Note that E\hat{Σ} = Σ_0 + diag(σ_1^2, ..., σ_p^2). When σ_1^2, ..., σ_p^2 are all equal, the top eigenvectors of E\hat{Σ} and Σ_0 coincide; however, when σ_1^2, ..., σ_p^2 are not identical, the principal components of E\hat{Σ} and those of Σ_0 can differ significantly due to the bias of \hat{Σ} on the diagonal entries. This shows the inadequacy of the regular SVD in the case of heteroskedastic noise.
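To make the diagonal-bias phenomenon concrete, here is a tiny numerical illustration (not from the paper; the specific numbers are arbitrary) showing that the leading eigenvector of Σ_0 + D can be far from that of Σ_0 once the σ_i^2 are unequal:

import numpy as np

u = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
Sigma0 = 2.0 * np.outer(u, u)                    # rank-one signal covariance Sigma_0
D = np.diag([4.0, 0.5, 0.1])                     # strongly heteroskedastic noise variances

v_signal = np.linalg.eigh(Sigma0)[1][:, -1]      # leading eigenvector of Sigma_0 (equals u)
v_biased = np.linalg.eigh(Sigma0 + D)[1][:, -1]  # leading eigenvector of E[Sigma-hat]
print(abs(v_signal @ v_biased))                  # clearly below 1: regular PCA is biased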

This phenomenon similarly appears in heteroskedastic low-rank matrix denoising. Suppose one observes

(1.3)    Y = X + Z,

where X is the low-rank matrix of interest and the entries of the noise Z are independent and zero-mean, but need not have a common variance. The goal is to recover the singular subspace of X based on the noisy observation Y. The problem arises naturally in a range of applications, such as magnetic resonance imaging (MRI) and relaxometry (Candes et al., 2013; Shabalin and Nobel, 2013). This model can also be viewed as a prototype of various problems in high-dimensional statistics and machine learning, including Poisson PCA (Salmon et al., 2014), the bipartite stochastic block model (Florescu and Perkins, 2016), and exponential family PCA (Liu et al., 2016). Let the sample and population Gram matrices be N = YY^⊤ and M = XX^⊤, respectively. Then,

    (EN)_{ij} = M_{ij} for i ≠ j,   and   (EN)_{ii} = M_{ii} + Σ_{k=1}^{p_2} Var(Z_{ik}).

Thus, the entries of N are unbiased estimators only for the off-diagonal part of M. Under the heteroskedastic setting where the Var(Z_{ij}) are unequal, there can be significant differences between EN and M on the diagonal entries, which may lead to significant perturbation on the diagonal of N − M. Since the left singular vectors of Y and X are identical to those of N and M, respectively, the regular SVD can result in inconsistent estimates of the left singular subspace of X, due to the significant bias on the diagonal entries of N.

To better cope with the bias incurred on the diagonal, Florescu and Perkins (2016) introduced a method called the diagonal-deletion SVD in the context of the bipartite stochastic block model. The idea is to set the diagonal of the Gram matrix to zero and then perform the singular value decomposition. However, it is unclear whether zero diagonals are always the best choice. In fact, we can construct explicit examples where the diagonal-deletion SVD is inconsistent (cf. the forthcoming Proposition 2).

In this paper, we introduce a novel method, called HeteroPCA, for heteroskedastic principal component analysis. Instead of zeroing out the diagonal, the central idea is to iteratively update the diagonal entries based on the off-diagonal ones, so that the bias incurred on the diagonal is significantly reduced and the optimal estimation accuracy is achieved.

1.2. Optimality. The performance of the proposed procedure is studied both theoretically and numerically. The traditional technical tools, such as the Davis-Kahan and Wedin theorems (Davis and Kahan, 1970; Wedin, 1972), bound the singular subspace estimation error in terms of the overall perturbation (‖\hat{Σ} − Σ_0‖ or ‖N − M‖) and the spectral gap. Due to the significant bias on the diagonal entries of the Gram matrices (\hat{Σ} or N), these bounds are not suitable for analyzing heteroskedastic PCA. To overcome this difficulty, we develop a deterministic robust perturbation analysis for the singular subspace. Specifically, we establish a singular subspace perturbation bound for the setting where a significant portion of the perturbation has much larger amplitude. Different from previous works on robust PCA (e.g., Candes et al. (2011)), which typically assume that the corruptions have arbitrary amplitude but randomly selected sparse support, our robust perturbation analysis is fully deterministic. This new technical tool provides the key ingredient for analyzing the proposed HeteroPCA procedure and can be of independent interest.

By proving matching minimax lower bounds, we show that HeteroPCA achieves the optimal rate of convergence among a general class of settings for heteroskedastic PCA. In particular, the procedure outperforms the regular SVD and the diagonal-deletion SVD introduced by Florescu and Perkins (2016). The following informal statement summarizes the main results of this paper.

Theorem 1 (Heteroskedastic PCA, Informal). Suppose {Y_1, ..., Y_n} is an i.i.d. sample from the heteroskedastic PCA model (1.2), where Σ_0 = UΛU^⊤, U ∈ O_{p,r}, and Λ is an r-by-r diagonal matrix with non-negative entries. Denote σ^2 = Σ_{i=1}^p σ_i^2 and σ_*^2 = max_i σ_i^2. Under regularity conditions, the output \hat{U} of HeteroPCA (Algorithm 1) satisfies the following optimal rate of convergence,

    E‖sinΘ(\hat{U}, U)‖ ≍ (1/√n) ( (σ + r^{1/2} σ_*)/λ_r^{1/2}(Λ) + σσ_*/λ_r(Λ) ) ∧ 1.

Here, ‖sinΘ(·,·)‖ is the sinΘ distance defined in the forthcoming Section 2.1.

In contrast, the regular SVD and the diagonal-deletion SVD (see Section 3 for the formal definitions) yield the following suboptimal rates of convergence,

    E‖sinΘ(\hat{U}^{SVD}, U)‖ ≍ ( σ/(√n λ_r^{1/2}(Λ)) + σ_*^2/λ_r(Λ) ) ∧ 1,   and   E‖sinΘ(\hat{U}^{DD}, U)‖ ≍ 1.

1.3. Applications and Related Literature. In addition to PCA for the generalized spiked covariance model, the newly established perturbation bounds are applicable to a collection of problems in high-dimensional statistics where the errors are heteroskedastic. In this paper, we discuss the following applications in detail.

1. Heteroskedastic low-rank matrix denoising: For the additive noise model (1.3), we construct an estimator of the singular subspace by applying the HeteroPCA algorithm to the noisy matrix, and provide both theoretical guarantees and numerical results to demonstrate the advantage of the HeteroPCA procedure over the regular and diagonal-deletion SVDs.

2. Poisson PCA: Motivated by various applications in engineering and bioinformatics, e.g., photon-limited imaging analysis (Salmon et al., 2014) and genomic sequencing data in microbiome studies (Cao et al., 2017), Poisson PCA considers matrix factorization based on observed Poisson counts, which are intrinsically heteroskedastic. We apply the HeteroPCA algorithm to the Poisson data and prove the accuracy of the resulting estimator of the leading singular subspace.

3. SVD based on heteroskedastic and incomplete data: We also apply the proposed framework to SVD based on heteroskedastic and incomplete data. Motivated by recommender system design and the Netflix Prize, matrix completion has attracted much attention recently and has been extensively studied in the literature (Candes and Recht, 2009; Candes and Tao, 2010; Keshavan et al., 2010). Despite the remarkable progress on estimating the whole matrix in noiseless or homoskedastic noise settings, in real applications of recommender systems, (1) the rating matrix is usually binary or ordinal and heteroskedasticity naturally arises; (2) if the focus is on one feature of the recommender system (e.g., clustering of customers or movies), the target may shift from the whole rating matrix to the leading singular subspaces. We show that HeteroPCA achieves accurate estimation of the leading left singular subspace, even if most of the columns of the target matrix are completely missing. PCA with missing data is also discussed.

Moreover, our deterministic robust perturbation analysis is also useful in a range of other applications, such as heteroskedastic canonical correlation analysis, heteroskedastic tensor SVD, exponential family PCA, community detection in bipartite stochastic networks, and bipartite multidimensional scaling.

This paper also relates to several recent works on PCA for heteroskedastic data. Bai and Yao (2012) and Yao et al. (2015) extended the theory of the regular spiked covariance model to the generalized one and studied the limiting distribution of the eigenvalues of the sample covariance matrix. Hong et al. (2016, 2018) considered PCA with heteroskedastic noise in an alternative way. They introduced a model for heteroskedastic real- or complex-valued data, where the noise is non-uniform across different samples but uniform within each sample. They further studied the performance of the regular SVD estimator and established the asymptotic distributions of both eigenvalue and eigenvector estimators. Different from these previous results, this paper allows non-uniform noise within each sample; indeed, the noise variances can even be non-uniform across all entries of the dataset in the heteroskedastic low-rank matrix denoising setting (Section 4.1). Since the regular SVD no longer achieves good performance when the noise is non-uniform within samples (cf. the forthcoming Proposition 2), we instead propose the new procedure, HeteroPCA, and establish its optimality in this paper.

1.4. Organization of the Paper. The rest of the paper is organized as follows. After a brief introduction of notation and definitions, we present in detail the HeteroPCA algorithm and a deterministic robust perturbation analysis in Section 2, which serves as a key step toward heteroskedastic PCA. In Section 3, we focus on the generalized spiked covariance model and develop matching minimax upper and lower bounds for the proposed procedure. Applications to other problems in high-dimensional statistics, including heteroskedastic matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data, are discussed in Section 4. Numerical results are presented in Section 5 and other applications are briefly discussed in Section 6. The proofs of the main results are given in Section 7. The proofs of the other results and additional technical lemmas are provided in Section A of the supplementary materials (Zhang et al., 2018).

2. A Deterministic Robust Perturbation Analysis. We begin by reviewing notation and definitions in Section 2.1 that will be used throughout the paper and present a deterministic perturbation analysis for eigen-subspaces in Section 2.2. An optimal bound is then established in Section 2.3 to provide a theoretical guarantee for the proposed algorithm.

2.1. Notation and Preliminaries. We use lowercase letters, e.g., x, y, z, to represent scalars or vectors; we use uppercase letters, e.g., U, M, N, to denote matrices. For any two sequences of positive numbers {a_k} and {b_k}, we write a_k ≲ b_k and b_k ≳ a_k if there exists a uniform constant C > 0 such that a_k ≤ C b_k for all k. For any matrix M ∈ R^{p_1×p_2}, let σ_k(M) be its k-th largest singular value. Then one can write M = Σ_{k=1}^{p_1∧p_2} σ_k(M) u_k v_k^⊤ as the SVD. We also let SVD_r(M) = [u_1 ··· u_r] be the matrix composed of the leading r left singular vectors and QR(M) be the Q part of the QR orthogonalization of M. The matrix spectral norm ‖M‖ = sup_{‖u‖_2=1} ‖Mu‖_2 = σ_1(M) and the Frobenius norm ‖M‖_F = (Σ_{i,j} M_{ij}^2)^{1/2} = (Σ_k σ_k^2(M))^{1/2} will be used extensively throughout the paper. Let I_r denote the r-by-r identity matrix, let 0_{m×n} and 1_{m×n} denote the m × n zero and all-one matrices, and let 0_m and 1_m denote the m-dimensional zero and all-one column vectors. Let O_{p,r} = {U ∈ R^{p×r} : U^⊤U = I_r} be the set of all p-by-r matrices with orthonormal columns. For any U ∈ O_{p,r}, we write U_⊥ ∈ O_{p,p−r} for an orthogonal complement, so that [U U_⊥] is an orthogonal matrix.

Motivated by the incoherence condition that is widely used in the matrix completion literature (Candes and Recht, 2009), we define the incoherence constant of U ∈ O_{p,r} as

(2.1)    I(U) = (p/r) max_{i∈[p]} ‖e_i^⊤ U‖_2^2.

It is noteworthy that 1 ≤ I(U) ≤ p/r for any U ∈ O_{p,r}. The sinΘ distance is used to quantify the distance between singular subspaces. Specifically, for any U_1, U_2 ∈ O_{p,r}, define ‖sinΘ(U_1, U_2)‖ ≜ ‖U_{1⊥}^⊤ U_2‖ = ‖U_{2⊥}^⊤ U_1‖. Suppose G is a subset of [p_1] × [p_2] and A ∈ R^{p_1×p_2} is any matrix; we let Δ(A) be the matrix A with all entries in G set to zero and let G(A) = A − Δ(A). In particular, for any square matrix A, let Δ(A) be the matrix A with all diagonal entries set to zero and D(A) be the matrix A with all off-diagonal entries set to zero, so that A = Δ(A) + D(A). We use C, C_1, ..., c, c_1, ... to denote generic large and small constants, respectively, whose values may differ from line to line.
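As a concrete companion to these definitions, the sinΘ distance and the incoherence constant (2.1) can be computed as follows; this is an illustrative numpy sketch, not part of the paper's formal development.

import numpy as np

def sin_theta(U1, U2):
    """Spectral sin-Theta distance ||sin Theta(U1, U2)|| between the column
    spaces of two p x r orthonormal matrices U1 and U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    # ||U_{1,perp}^T U2|| = sqrt(1 - sigma_min^2(U1^T U2))
    return float(np.sqrt(max(0.0, 1.0 - s.min() ** 2)))

def incoherence(U):
    """Incoherence constant I(U) = (p / r) * max_i ||e_i^T U||_2^2, as in (2.1)."""
    p, r = U.shape
    return p / r * float(np.max(np.sum(U ** 2, axis=1)))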

2.2. Deterministic Eigen-subspace Perturbation Analysis. Next, we focus on the following deterministic low-rank eigen-subspace perturbation analysis. The method and theory developed here will play a key role in the heteroskedastic principal component analysis of Section 3. Suppose one observes

(2.2)    N = M + Z,

where M ∈ R^{m×m} is the true underlying matrix of interest and Z ∈ R^{m×m} is the perturbation. Suppose rank(M) = r and the eigenvectors of M form U ∈ O_{m,r}. To estimate U, i.e., the leading principal components of M, the most natural estimator is \hat{U} = SVD_r(N), i.e., the matrix of the leading r left singular vectors of N. This idea is also widely referred to as singular value thresholding (SVT) in the literature (Donoho and Gavish, 2014; Chatterjee, 2015). By the well-known Eckart-Young-Mirsky theorem (Golub et al., 1987), singular value thresholding, or the regular singular value decomposition, is equivalent to the following optimization problem,

(2.3)    \hat{U} = SVD_r(\hat{M}),   where   \hat{M} = argmin_{\tilde{M}: rank(\tilde{M}) ≤ r} ‖\tilde{M} − N‖.

In particular, Davis-Kahan's theorem (Davis and Kahan, 1970) yields

(2.4)    ‖sinΘ(\hat{U}, U)‖ ≲ (‖Z‖/σ_r(M)) ∧ 1.

Such a bound is sharp in the worst case.

However, in many scenarios, the perturbation need not be homogeneous, in the sense that a portion of Z, say Z_G indexed by some set G, may be significantly larger than the rest. For example, in image processing, a number of visible patches in images may be highly corrupted, and one may need to perform patch restoration before downstream analysis (Zoran and Weiss, 2011); before performing PCA on datasets with outliers, one needs to first apply a more robust procedure to detect and remove those anomalies (Jolliffe, 2002), as the regular SVD is known to be sensitive to outliers. Moreover, as discussed in the introduction, significant corruption arises in a wide class of problems in high-dimensional heteroskedastic data analysis.

To achieve a more robust estimation of U with provable guarantees, we consider a general framework for robust eigenspace perturbation analysis and propose the following simple and computationally feasible procedure for estimating the singular subspace under a deterministic mask. The key idea is an iterative update of the corrupted entries. We emphasize that our framework is distinct from the recent works on matrix completion and robust PCA, as the corrupted entries here are deterministic. To be specific, assuming the perturbation Z has higher amplitude on G ⊆ [m] × [m], we ignore the entries of Z_G in (2.3) and evaluate

(2.5)    \hat{U} = SVD_r(\hat{M}),   where   \hat{M} = argmin_{\hat{M}: rank(\hat{M}) ≤ r} ‖Δ(\hat{M} − N)‖.

Since (2.5) is non-convex in general, we instead consider the following procedure.

Step 1 Initialize by setting the entries of G in N to zero, i.e., N^{(0)} = Δ(N).

Step 2 For t = 0, 1, ..., compute the SVD of N^{(t)} and let \tilde{N}^{(t)} be its best rank-r approximation:

    N^{(t)} = U^{(t)} Σ^{(t)} (V^{(t)})^⊤ = Σ_i σ_i^{(t)} u_i^{(t)} (v_i^{(t)})^⊤,   σ_1^{(t)} ≥ ··· ≥ σ_m^{(t)} ≥ 0,

    \tilde{N}^{(t)} = Σ_{i=1}^r σ_i^{(t)} u_i^{(t)} (v_i^{(t)})^⊤.

Step 3 Update N^{(t+1)} = G(\tilde{N}^{(t)}) + Δ(N), i.e., replace the entries of N^{(t)} in G by those of \tilde{N}^{(t)}. In other words,

(2.6)    N_{ij}^{(t+1)} = N_{ij}^{(t)} = N_{ij} if (i,j) ∉ G;   N_{ij}^{(t+1)} = \tilde{N}_{ij}^{(t)} if (i,j) ∈ G.

Step 4 Repeat Steps 2-3 until convergence or until the maximum number of iterations is reached.

The pseudo-code of the proposed procedure is summarized as Algorithm 1.

Algorithm 1 Efficient Robust Eigen-subspace Estimation
1: Input: noisy matrix N, rank r, number of iterations T.
2: Set N^{(0)} = Δ(N).
3: for t = 0, 1, ..., T do
4:   Compute the SVD: N^{(t)} = Σ_i σ_i^{(t)} u_i^{(t)} (v_i^{(t)})^⊤, where σ_1^{(t)} ≥ σ_2^{(t)} ≥ ··· ≥ 0.
5:   Let \tilde{N}^{(t)} = Σ_{i=1}^r σ_i^{(t)} u_i^{(t)} (v_i^{(t)})^⊤.
6:   Update the corrupted entries: N^{(t+1)} = G(\tilde{N}^{(t)}) + Δ(N).
7: end for
8: Output: \hat{M} = \tilde{N}^{(T)}, \hat{U} = U^{(T)}.
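For concreteness, the following is a minimal Python (numpy) sketch of Algorithm 1 specialized to the diagonal-corruption case used throughout the paper (HeteroPCA); the function name, stopping rule, and tolerance are our own illustrative choices, not prescribed by the paper. For a general corruption set G, the diagonal indexing below would simply be replaced by the entries of G.

import numpy as np

def hetero_pca(N, r, T=50, tol=1e-10):
    """Algorithm 1 with G = diagonal (HeteroPCA).

    N : (m, m) symmetric noisy matrix (e.g., a sample Gram or covariance matrix)
    r : target rank
    Returns (U_hat, M_hat): the estimated rank-r eigen-subspace and the final
    rank-r surrogate for M.
    """
    N = np.asarray(N, dtype=float)
    N_t = N - np.diag(np.diag(N))            # N^(0) = Delta(N): zero out the diagonal
    U_hat, M_hat = None, None
    for _ in range(T):
        # SVD of the current iterate and its best rank-r approximation
        U, s, Vt = np.linalg.svd(N_t, full_matrices=False)
        M_hat = (U[:, :r] * s[:r]) @ Vt[:r, :]
        U_hat = U[:, :r]
        # replace only the corrupted (diagonal) entries by those of the rank-r fit;
        # the off-diagonal entries always remain equal to Delta(N)
        N_next = N_t.copy()
        np.fill_diagonal(N_next, np.diag(M_hat))
        if np.linalg.norm(N_next - N_t) <= tol * np.linalg.norm(N_t):
            N_t = N_next
            break
        N_t = N_next
    return U_hat, M_hat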

2.3. Optimal Bounds for Robust Eigen-Subspace Estimation. We have the following deterministic guarantee for the proposed Algorithm 1. When the significant corruption of Z lies on the diagonal, the following result holds; it will be a key technical tool in the subsequent heteroskedastic random perturbation analysis.

Theorem 2 (Robust sinΘ Theorem). Suppose N = M + Z ∈ R^{m×m}, where M is symmetric and rank-r, and suppose that the eigenvectors of M form U ∈ O_{m,r}. Then there exists a universal constant c_I > 0 such that if

(2.7)    I(U) ‖M‖ / σ_r(M) ≤ c_I m/r

(where I(U) = max_i (m/r) ‖e_i^⊤ U‖_2^2 is the incoherence constant defined in (2.1)), then the output \hat{U} of Algorithm 1 with T = Ω( log(σ_r(M)/‖Δ(Z)‖) ∨ 1 ) iterations satisfies

(2.8)    ‖sinΘ(\hat{U}, U)‖ ≲ (‖Δ(Z)‖/σ_r(M)) ∧ 1.

Remark 1. We introduce the incoherence condition in Theorem 2 mainly to avoid those M that are too "spiky". For example, consider M_1 = e_1 e_1^⊤ and M_2 = e_2 e_2^⊤. Then Δ(M_1) = Δ(M_2) and there is no way to distinguish these two spiky matrices if one only has reliable off-diagonal observations. Similar conditions, such as the "delocalized condition," appear in recent work on PCA from noisy and linearly reduced data (Dobriban et al., 2016).

The following lower bound shows that the bounds for both the incoherence condition (2.7) and the estimation error (2.8) are rate-optimal.

Proposition 1 (Robust sinΘ Theorem: Lower Bounds). Define the following collection of pairs of signal and perturbation matrices:

(2.9)    D_{m,r}(λ, γ, t) = { (M, Z) : M = UΛU^⊤, U ∈ O_{m,r}, I(U)‖M‖/σ_r(M) ≤ t, ‖Δ(Z)‖ ≤ γ, σ_r(M) ≥ λ }.

Suppose 1 ≤ r ≤ m/2 and t ≥ 4, and one observes N = M + Z ∈ R^{m×m}. Then

(2.10)    inf_{\hat{U}} sup_{(M,Z) ∈ D_{m,r}(λ,γ,t)} ‖sinΘ(\hat{U}, U)‖ ≥ c (γ/λ ∧ 1).

If the incoherence constraint is weak in the sense that t ≥ m/r, then

(2.11)    inf_{\hat{U}} sup_{(M,Z) ∈ D_{m,r}(λ,γ,t)} ‖sinΘ(\hat{U}, U)‖ ≥ 1/2.

If the corrupted entries lie in a general sparse set G ⊆ [m] × [m] rather than on the diagonal, we have the following theoretical result for the proposed procedure.

Theorem 3 (General Robust sinΘ Theorem). Assume G ⊆ [m] × [m] is b-sparse in the sense that

    max_i |{j : (i,j) ∈ G}| ∨ max_j |{i : (i,j) ∈ G}| ≤ b.

Define

(2.12)    η = max_{rank(X) ≤ 2r} ‖G(X)‖ / ‖X‖,

where G(X) is the matrix X with all entries but those in G set to zero. Suppose one observes the symmetric matrix N = M + Z, where rank(M) = r, Z is any perturbation, and the eigenvectors of M form U ∈ O_{m,r}. Let \hat{U} be the output of Algorithm 1 with T = Ω( log(σ_r(M)/(η‖Δ(Z)‖)) ∨ 1 ) iterations. There exists a universal constant c > 0 such that if the incoherence condition

(2.13)    I(U) ‖M‖ / σ_r(M) ≤ cm / ( η b √(b ∧ r) )

is satisfied and η‖Δ(Z)‖ ≤ c σ_r(M), then

(2.14)    ‖sinΘ(\hat{U}, U)‖ ≲ (‖Δ(Z)‖/σ_r(M)) ∧ 1.

Remark 2. The quantity η measures the maximum effect of the perturbations on the entries in G on the singular subspace. Although the exact characterization of η may be difficult for general G, Lemma 4 in the supplement shows that η ≤ √(b ∧ 2r) for all G.

3. Optimal Heteroskedastic Principal Component Analysis. We are now in a position to investigate heteroskedastic principal component analysis. Suppose one observes i.i.d. copies Y_1, ..., Y_n of Y from the following generalized spiked covariance model,

(3.1)    Y = X + ε ∈ R^p,   EX = μ,   Cov(X) = UΛU^⊤,
         Eε = 0,   Cov(ε) = D = diag(σ_1^2, ..., σ_p^2),
         ε = (ε_1, ..., ε_p)^⊤;   X, ε_1, ..., ε_p are independent.

Here, U ∈ O_{p,r} is the orthogonal loading matrix, Λ is the diagonal eigenvalue matrix, and D ∈ R^{p×p} is the non-negative diagonal matrix representing the heteroskedastic noise. Then Y_1, ..., Y_n satisfy

    EY_k = μ,   Cov(Y_k) = Σ,   where   Σ = UΛU^⊤ + D.

The proposed procedure better handles the heteroskedastic noise and provides a more accurate estimate of U than the regular SVD. In particular, we apply Algorithm 1 to the sample covariance matrix \hat{Σ},

    \hat{Σ} = (1/(n−1)) Σ_{k=1}^n (Y_k − \bar{Y})(Y_k − \bar{Y})^⊤,   \bar{Y} = (1/n) Σ_{k=1}^n Y_k.

Denote

(3.2)    σ_*^2 ≜ max_i σ_i^2,   σ^2 ≜ Σ_i σ_i^2.

We have the following theoretical guarantees.
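Before stating the guarantees, here is a concrete illustration of the estimation pipeline just described: simulate data from model (3.1), form the sample covariance matrix, and run Algorithm 1 on it. The sketch assumes the hetero_pca and sin_theta helpers sketched in Section 2; all dimensions and noise levels are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
p, n, r = 100, 2000, 3

# loading matrix U and eigenvalue matrix Lambda
U, _ = np.linalg.qr(rng.standard_normal((p, r)))
Lam = np.diag([5.0, 4.0, 3.0])

# heteroskedastic noise variances sigma_1^2, ..., sigma_p^2
sig2 = rng.uniform(0.1, 3.0, size=p)

X = rng.standard_normal((n, r)) @ np.sqrt(Lam) @ U.T      # Cov(X) = U Lam U^T
eps = rng.standard_normal((n, p)) * np.sqrt(sig2)          # Cov(eps) = diag(sig2)
Y = X + eps                                                # n samples from model (3.1)

Sigma_hat = np.cov(Y, rowvar=False)                        # sample covariance matrix
U_hat, _ = hetero_pca(Sigma_hat, r)                        # Algorithm 1 on Sigma_hat
print(sin_theta(U_hat, U))                                 # sin-Theta estimation error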

Theorem 4 (Heteroskedastic PCA: upper bound). Let Y = X + ε be drawn from the generalized spiked covariance model (3.1), where X and ε are sub-Gaussian in the sense that

    max_{q≥1, ‖v‖_2=1} q^{-1/2} ( E|v^⊤ Λ^{-1/2} U^⊤ X|^q )^{1/q} ≤ C,   max_{q≥1, 1≤i≤p} q^{-1/2} ( E|ε_i/σ_i|^q )^{1/q} ≤ C.

Let Y_1, ..., Y_n be i.i.d. samples from (3.1). Assume that n ≥ Cr, σ_*/λ_r(Λ) ≥ exp(−Cn), and ‖Λ‖/λ_r(Λ) ≤ C for some constants C, c > 0. There exists a constant c_I > 0 such that if the incoherence constant I(U) = max_i (p/r) ‖e_i^⊤ U‖_2^2 satisfies I(U) ≤ c_I p/r, we have the following theoretical guarantee for the output \hat{U} of Algorithm 1:

(3.3)    E‖sinΘ(\hat{U}, U)‖ ≲ (1/√n) ( (σ + r^{1/2} σ_*)/λ_r^{1/2}(Λ) + σσ_*/λ_r(Λ) ) ∧ 1.

Remark 3 (Interpretation of (3.3)). Let \bar{p} = σ^2/σ_*^2, which equals p in the homoskedastic case. The upper bound (3.3) can be rewritten as

(3.4)    E‖sinΘ(\hat{U}, U)‖ ≲ ( √((\bar{p} ∨ r)/n) · σ_*/λ_r^{1/2}(Λ) + √(\bar{p}/n) · σ_*^2/λ_r(Λ) ) ∧ 1.

Consider the regular PCA setting where D = σ_*^2 I. A special case of Theorem 4 yields

(3.5)    E‖sinΘ(\hat{U}, U)‖ ≲ √(p/n) ( σ_*/λ_r^{1/2}(Λ) + σ_*^2/λ_r(Λ) ) ∧ 1.

Comparing (3.4) with (3.5), we see that a weighted average between \bar{p} ∨ r and \bar{p} can be viewed as the "effective dimension" for heteroskedastic PCA.

Next, to establish the optimality of Theorem 4, we consider the following class of generalized spiked covariance matrices:

(3.6)    F_{p,n,r}(λ, σ_*, σ) = { Σ = UΛU^⊤ + D : D is non-negative diagonal, Σ_i D_{ii} ≤ σ^2, max_i D_{ii} ≤ σ_*^2, U ∈ O_{p,r}, I(U) ≤ c_I p/r, ‖Λ‖/λ_r(Λ) ≤ C, λ_r(Λ) ≥ λ }.

We establish the following minimax lower bound of heteroskedastic PCA over the covariance matrices in F_{p,n,r}(λ, σ_*, σ).


Theorem 5 (Heteroskedastic PCA: lower bound). Suppose √p σ_* ≥ σ ≥ σ_* > 0. There exists a constant C > 0 such that if p ≥ Cr, the following lower bound holds:

(3.7)    inf_{\hat{U}} sup_{Σ ∈ F_{p,n,r}(λ,σ_*,σ)} E‖sinΘ(\hat{U}, U)‖ ≳ (1/√n) ( (σ + r^{1/2} σ_*)/λ^{1/2} + σσ_*/λ ) ∧ 1.

Remark 4. Combining Theorems 4 and 5, the proposed Algorithm 1 achieves the following optimal rate for heteroskedastic PCA:

    inf_{\hat{U}} sup_{Σ ∈ F_{p,n,r}(λ,σ_*,σ)} E‖sinΘ(\hat{U}, U)‖ ≍ (1/√n) ( (σ + r^{1/2} σ_*)/λ^{1/2} + σσ_*/λ ) ∧ 1.

Remark 5 (Regular and Diagonal-deletion SVDs in Heteroskedastic PCA). Note that E\hat{Σ} = UΛU^⊤ + D and EΔ(\hat{Σ}) = Δ(UΛU^⊤). Hence both E\hat{Σ} and EΔ(\hat{Σ}) may possess singular subspaces different from U. Therefore, the regular SVD and the diagonal-deletion SVD,

    \hat{U}^{SVD} = SVD_r(\hat{Σ}),   \hat{U}^{DD} = SVD_r(Δ(\hat{Σ})),

may be inconsistent, even in the "fixed p, growing n" scenario. More specifically, one can show the following lower bounds for the regular and diagonal-deletion SVDs over the class of covariance matrices (3.6).

Proposition 2. There exists a constant C > 0 such that if n ≥ Cr, then

(3.8)    sup_{Σ ∈ F_{p,n,r}(λ,σ_*,σ)} E‖sinΘ(\hat{U}^{SVD}, U)‖ ≳ ( σ/(√n λ^{1/2}) + σ_*^2/λ ) ∧ 1,

(3.9)    sup_{Σ ∈ F_{p,n,r}(λ,σ_*,σ)} E‖sinΘ(\hat{U}^{DD}, U)‖ ≳ 1.

4. More Applications in High-dimensional Statistics. In this section, we apply the proposed framework to a number of additional applications in high-dimensional statistics, including heteroskedastic low-rank matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data.

4.1. Heteroskedastic Low-rank Matrix Denoising. Suppose one observes

(4.1)    Y = X + Z,

where X ∈ R^{p_1×p_2} is a fixed rank-r matrix and the noise matrix Z consists of independent entries Z_{ij} ~ N(0, σ_{ij}^2). Let X = UΛV^⊤ be the singular value decomposition, where U ∈ O_{p_1,r} and V ∈ O_{p_2,r}. We aim to estimate the leading singular vectors of X, i.e., U ∈ O_{p_1,r} or V ∈ O_{p_2,r}. Compared to the regular or diagonal-deletion SVD of Y, the proposed HeteroPCA provides a better estimator. We have the following theoretical guarantees.

Theorem 6 (Upper bound for Heteroskedastic Matrix Denoising). Consider the model (4.1) and suppose the left singular subspace of X is U ∈ O_{p_1,r}. Assume that the condition number of X is at most some absolute constant C, i.e., ‖X‖ ≤ C σ_r(X). Denote

(4.2)    σ_R^2 = max_i Σ_{j=1}^{p_2} σ_{ij}^2,   σ_C^2 = max_j Σ_{i=1}^{p_1} σ_{ij}^2,   σ_*^2 = max_{ij} σ_{ij}^2

as the rowwise, columnwise, and entrywise noise variances. Then there exists a constant c_I > 0 such that if U satisfies the incoherence condition I(U) ≤ c_I p_1/r, where I(U) = max_{1≤i≤p_1} (p_1/r) ‖e_i^⊤ U‖_2^2, Algorithm 1 applied to YY^⊤ outputs an estimator \hat{U} that satisfies

(4.3)    E‖sinΘ(\hat{U}, U)‖ ≲ ( (σ_C + √r σ_*)/σ_r(X) + ( σ_R σ_C + σ_R σ_* √(log(p_1 ∧ p_2)) + σ_*^2 log(p_1 ∧ p_2) )/σ_r^2(X) ) ∧ 1.

If, in addition, σ_* ≲ σ_C / max{√r, √(log(p_1 ∧ p_2))}, we further have

(4.4)    E‖sinΘ(\hat{U}, U)‖ ≲ ( σ_C/σ_r(X) + σ_R σ_C/σ_r^2(X) ) ∧ 1.

Remark 6. Instead of HeteroPCA, one can directly apply the regular SVD,

(4.5)    \hat{U}^{SVD} = SVD_r(Y)   and   \hat{V}^{SVD} = SVD_r(Y^⊤),

or the diagonal-deletion SVD recently proposed by Florescu and Perkins (2016):

(4.6)    \hat{U}^{DD} = SVD_r( Δ(YY^⊤) )   and   \hat{V}^{DD} = SVD_r( Δ(Y^⊤Y) ).


Following the proof of Proposition 2, one can establish lower bounds showing that the proposed HeteroPCA outperforms the regular and diagonal-deletion SVDs. In particular, if σ_r(X) ≳ σ_R ∨ σ_* √(log(p_1 ∧ p_2)) and σ_C/σ_* ≳ √r, one can show that

    sup_{X ∈ F_{p,λ}, Z ∈ S_p(σ_*,σ_C,σ_R)} E‖sinΘ(\hat{U}, U)‖ ≍ σ_C/λ,
    sup_{X ∈ F_{p,λ}, Z ∈ S_p(σ_*,σ_C,σ_R)} E‖sinΘ(\hat{U}^{SVD}, U)‖ ≳ (σ_C + σ_R)/λ,
    sup_{X ∈ F_{p,λ}, Z ∈ S_p(σ_*,σ_C,σ_R)} E‖sinΘ(\hat{U}^{DD}, U)‖ ≳ 1,

where F_{p,λ} and S_p(σ_*, σ_C, σ_R) denote classes of rank-r signal matrices with σ_r(X) ≥ λ and of noise matrices with variance profile (4.2), respectively. This clearly illustrates the advantage of HeteroPCA.

Remark 7. When σ_{ij} = σ_* for all 1 ≤ i ≤ p_1 and 1 ≤ j ≤ p_2, the upper bound (4.3) reduces to

    E‖sinΘ(\hat{U}, U)‖ ≲ ( √(p_1) σ_*/σ_r(X) + √(p_1 p_2) σ_*^2/σ_r^2(X) ) ∧ 1,

which matches the optimal rate for homoskedastic matrix denoising in the literature (Cai and Zhang, 2016, Theorems 3 and 4).

4.2. Poisson PCA. As mentioned in the introduction, Poisson PCA (Salmon et al., 2014) is an important problem with a range of applications, including photon-limited imaging and biological sequencing data analysis. Suppose we observe Y ∈ R^{p_1×p_2}, where Y_{ij} ~ Poisson(X_{ij}) independently and X ∈ R^{p_1×p_2} is rank-r. Let X = UΛV^⊤ be the singular value decomposition, where U ∈ O_{p_1,r} and V ∈ O_{p_2,r}. Due to the heteroskedasticity of the Poisson distribution, HeteroPCA is well suited to Poisson PCA, whose aim is to estimate the leading singular vectors of X, i.e., U or V. Although the aforementioned heteroskedastic low-rank matrix denoising can be seen as a prototype problem of Poisson PCA, Theorem 6 is not directly applicable and a more careful analysis is needed since the Poisson distribution has heavier tails than sub-Gaussian.

Theorem 7 (Poisson PCA). Suppose X ∈ R_+^{p_1×p_2}, rank(X) = r, σ_1(X)/σ_r(X) ≤ C, and X_{ij} ≥ c for some constant c > 0, and let U ∈ O_{p_1,r} be the left singular subspace of X. Denote

(4.7)    σ_R^2 = max_i Σ_{j=1}^{p_2} X_{ij},   σ_C^2 = max_j Σ_{i=1}^{p_1} X_{ij},   σ_*^2 = max_{i,j} X_{ij}.

Suppose one observes Y ∈ R^{p_1×p_2} with Y_{ij} ~ Poisson(X_{ij}) independently. Then there exists a constant c_I > 0 such that if U satisfies I(U) = max_i (p_1/r) ‖e_i^⊤ U‖_2^2 ≤ c_I p_1/r, the proposed HeteroPCA procedure (Algorithm 1) yields

(4.8)    E‖sinΘ(\hat{U}, U)‖ ≲ ( (σ_C + r σ_*)/σ_r(X) + ( { σ_R + σ_C + σ_* √(log(p_2)) log(p_1) }^2 − σ_R^2 )/σ_r^2(X) ) ∧ 1.

In addition, if σ_* ≤ σ_C / max{r, √(log(p_1) log(p_2))}, then

    E‖sinΘ(\hat{U}, U)‖ ≲ ( σ_C/σ_r(X) + σ_R σ_C/σ_r^2(X) ) ∧ 1.

Remark 8. Results similar to Proposition 2 and Remark 6 can be developed to show the advantage of HeteroPCA over the regular and diagonal-deletion SVDs.
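For illustration, Poisson PCA with HeteroPCA again amounts to applying Algorithm 1 to YY^⊤ formed from the observed counts; the sketch below uses an arbitrary rank-3 intensity matrix and the hetero_pca and sin_theta helpers sketched in Section 2.

import numpy as np

rng = np.random.default_rng(2)
p1, p2, r = 80, 300, 3

# rank-r nonnegative intensity matrix X with entries bounded away from zero
W = rng.uniform(0.5, 2.0, size=(p1, r))
H = rng.uniform(0.5, 2.0, size=(p2, r))
X = W @ H.T
U = np.linalg.svd(X, full_matrices=False)[0][:, :r]         # true left singular subspace

Y = rng.poisson(X).astype(float)                            # heteroskedastic Poisson counts
U_hat, _ = hetero_pca(Y @ Y.T, r)                           # Algorithm 1 on the Gram matrix
print(sin_theta(U_hat, U))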

4.3. SVD based on Heteroskedastic and Incomplete Data. Missing data problems arise frequently in high-dimensional statistics. The HeteroPCA algorithm can naturally be applied to SVD with heteroskedastic and incomplete data. This problem can be seen as a variation of noisy matrix completion, which has attracted much attention in computer science, applied mathematics, and statistics over the last decade. Let X ∈ R^{p_1×p_2} be a rank-r unknown matrix. Suppose only a small fraction of the entries of X, indexed by Ω ⊆ [p_1] × [p_2], are observed with random noise,

    Y_{ij} = X_{ij} + Z_{ij},   (i,j) ∈ Ω.

Here, each entry is observed or missing with probability θ or 1 − θ, respectively, for some 0 < θ < 1, and the Z_{ij}'s are independent, zero-mean, and possibly heteroskedastic. Let M ∈ R^{p_1×p_2} be the indicator of the observed entries:

    M_{ij} = 1 if (i,j) ∈ Ω,   M_{ij} = 0 if (i,j) ∉ Ω,

and M and Y are independent. Assume X = UΛV^⊤ is the singular value decomposition, where U ∈ O_{p_1,r} and V ∈ O_{p_2,r}. We specifically aim to estimate U based on {Y_{ij}, (i,j) ∈ Ω}. Denote by \tilde{Y} the entry-wise product of Y and M, i.e., \tilde{Y}_{ij} = Y_{ij} M_{ij} for all (i,j) ∈ [p_1] × [p_2]. Since E\tilde{Y}_{ij} = θ X_{ij} and the Var(\tilde{Y}_{ij}) are not necessarily identical, we can apply HeteroPCA to \tilde{Y}\tilde{Y}^⊤ to estimate U. The following theoretical guarantee holds.

Theorem 8. Let X be a p_1-by-p_2 rank-r matrix whose left singular subspace is U ∈ O_{p_1,r}. Assume that EY = X and that Y consists of sub-Gaussian entries in the sense that max_{ij} ‖Y_{ij}‖_{ψ_2} ≤ C, where ‖Y‖_{ψ_2} ≜ sup_{q≥1} q^{-1/2} (E|Y|^q)^{1/q} is the Orlicz ψ_2-norm of the random variable Y. Suppose 0 < θ ≤ 1 − c and max_j Σ_{i=1}^{p_1} EY_{ij}^4 ≥ c p_2 p^{-C_1} for constants c, C_1 > 0. There exists a constant c_I > 0 such that if U ∈ O_{p_1,r} satisfies I(U)‖X‖/σ_r(X) ≤ c_I p_1/r, the HeteroPCA estimator \hat{U} satisfies

(4.9)    ‖sinΘ(\hat{U}, U)‖ ≲ ( max{ √(p_2 (θ + θ^3 p_1^2) log(p_1)), θ p_1 log^2(p_1) } / (θ^2 σ_r^2(X)) ) ∧ 1

with probability at least 1 − p_1^{-C_1}.

Remark 9. In the special case where σ_1(X) ≤ C σ_r(X) and ‖X‖_F^2 ≍ p_1 p_2, the upper bound in Theorem 8 implies that as long as the expected sample size satisfies

(4.10)    E|Ω| ≥ max{ p_1^{1/3} p_2^{2/3} r^{2/3} log^{1/3}(p_1), p_1 r^2 log(p_1), p_1 r log^2(p_1) },

the HeteroPCA estimator is consistent. When p_2 ≫ p_1, the rate in (4.10) implies that the HeteroPCA estimator can still yield a consistent estimate of U even if most of the columns of the target matrix are completely missing. This requirement is weaker than the one in the classic matrix completion literature,

    |Ω| ≳ (p_1 + p_2) r · polylog(p),

where the goal is to estimate the whole matrix.

Remark 10. PCA based on heteroskedastic and incomplete data is a closely related problem. Although most of the existing literature on PCA with incomplete data focuses on regular SVD methods under homoskedastic noise (see, e.g., Lounici et al. (2014); Cai and Zhang (2016)), we are able to achieve better performance by applying the proposed HeteroPCA algorithm when the noise is heteroskedastic. To be specific, suppose one observes incomplete i.i.d. samples Y_1, ..., Y_n ∈ R^p from the generalized spiked covariance model,

    Y = X + ε ∈ R^p,   EX = μ,   Cov(X) = UΛU^⊤,
    Eε = 0,   Cov(ε) = D = diag(σ_1^2, ..., σ_p^2),
    ε = (ε_1, ..., ε_p)^⊤;   X, ε_1, ..., ε_p are independent;

    for k = 1, ..., n,   Y_k = (Y_{1k}, ..., Y_{pk})^⊤,   and Y_1, ..., Y_n are i.i.d. copies of Y;

    for 1 ≤ i ≤ p, 1 ≤ k ≤ n,   M_{ik} = 1 if Y_{ik} is observable,   M_{ik} = 0 if Y_{ik} is missing,

and {M_{ik}}_{1≤i≤p, 1≤k≤n} are independent of Y_1, ..., Y_n. To estimate the leading principal components, i.e., U ∈ O_{p,r}, we can first evaluate the generalized sample covariance matrix as in Cai and Zhang (2016),

    \hat{Σ}^* = (\hat{σ}^*_{ij})_{1≤i,j≤p},   with   \hat{σ}^*_{ij} = ( Σ_{k=1}^n (Y_{ik} − \bar{Y}^*_i)(Y_{jk} − \bar{Y}^*_j) M_{ik} M_{jk} ) / ( Σ_{k=1}^n M_{ik} M_{jk} )   and   \bar{Y}^*_i = ( Σ_{k=1}^n Y_{ik} M_{ik} ) / ( Σ_{k=1}^n M_{ik} ),

and then estimate U by applying Algorithm 1 to \hat{Σ}^*. An upper bound similar to Theorem 8 can be developed for this procedure.
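The generalized sample covariance matrix \hat{Σ}^* above can be assembled directly from the observed entries. Below is an illustrative numpy sketch (the function name and the convention that Y and M are p x n arrays are our own; missing entries of Y may hold arbitrary values since they are masked out), after which Algorithm 1 is applied exactly as before.

import numpy as np

def generalized_sample_cov(Y, M):
    """Sigma*-hat from incomplete samples. Y and M are p x n arrays; M is the
    0/1 observation indicator. Uses pairwise-complete means and covariances."""
    M = M.astype(float)
    obs = M.sum(axis=1)                                   # observations per coordinate
    Ybar = (Y * M).sum(axis=1) / np.maximum(obs, 1.0)     # Y*_i
    R = (Y - Ybar[:, None]) * M                           # masked, centered residuals
    counts = M @ M.T                                      # samples where both i and j observed
    return (R @ R.T) / np.maximum(counts, 1.0)

# U_hat, _ = hetero_pca(generalized_sample_cov(Y, M), r)  # then run Algorithm 1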

5. Numerical Results. In this section, we perform simulation studies to further illustrate the merit of the proposed procedure for singular subspace estimation in the presence of heteroskedastic noise. All simulation results below are based on averages over 1000 repeated independent experiments.

We first consider heteroskedastic PCA. For various values of p, n, and r, we generate a p-by-r random matrix U_0 with i.i.d. standard Gaussian entries, w_1, ..., w_p ~ Unif[0,1] i.i.d., and σ_1, ..., σ_p ~ Unif[0,1] i.i.d. The purpose of generating the uniform random variables w and σ is to introduce heteroskedasticity into the observations. Then we let U = QR(diag(w) U_0) ∈ O_{p,r} and Σ_0 = UU^⊤ ∈ R^{p×p}. We aim to recover U based on the i.i.d. observations {Y_k = X_k + ε_k}_{k=1}^n, where X_1, ..., X_n ~ N(0, Σ_0) and ε_1, ..., ε_n ~ N(0, diag(σ_1^2, ..., σ_p^2)). We implement the proposed HeteroPCA, the diagonal-deletion SVD, and the regular SVD, and plot the average estimation errors in sinΘ distance in Figure 1. The proposed HeteroPCA estimator significantly outperforms the other methods; the regular SVD yields larger estimation errors; and the diagonal-deletion estimator performs unstably across different settings. This matches the theoretical findings in Section 3.
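For reference, this first simulation setting can be reproduced with a short script along the following lines (illustrative only; hetero_pca and sin_theta are the helpers sketched in Section 2, and the regular-SVD and diagonal-deletion baselines are the estimators defined in Section 3).

import numpy as np

rng = np.random.default_rng(3)
p, n, r = 50, 2000, 3

w = rng.uniform(size=p)                                     # heterogeneity in the loadings
sig = rng.uniform(size=p)                                   # heteroskedastic noise sd's
U0 = rng.standard_normal((p, r))
U, _ = np.linalg.qr(U0 * w[:, None])                        # U = QR(diag(w) U0)

X = rng.standard_normal((n, r)) @ U.T                       # Cov(X) = U U^T = Sigma_0
Y = X + rng.standard_normal((n, p)) * sig                   # heteroskedastic observations

S = np.cov(Y, rowvar=False)
U_hetero, _ = hetero_pca(S, r)                              # HeteroPCA
U_svd = np.linalg.svd(S)[0][:, :r]                          # regular SVD
U_dd = np.linalg.svd(S - np.diag(np.diag(S)))[0][:, :r]     # diagonal-deletion SVD
print(sin_theta(U_hetero, U), sin_theta(U_svd, U), sin_theta(U_dd, U))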

Next, we study how the degree of heteroskedasticity affects the estimation errors of PCA in another setting. Let

    v_1, ..., v_p ~ Unif[0,1] i.i.d.,   σ_k^2 = 0.1 · p · v_k^α / Σ_{i=1}^p v_i^α,   k = 1, ..., p.

[Figure 1 here: four panels, (a) p = 50, r = 3; (b) p = 50, r = 10; (c) p = 100, r = 3; (d) p = 100, r = 10, each plotting the sinΘ distance against the sample size n for HeteroPCA, the regular SVD, and the diagonal-deletion SVD.]

Fig 1. The average sinΘ loss of heteroskedastic PCA versus the sample size n.

[Figure 2 here: two panels, (a) p = 50, n = 30, r = 5; (b) p = 200, n = 150, r = 5, each plotting the sinΘ distance against α for HeteroPCA, the regular SVD, and the diagonal-deletion SVD.]

Fig 2. The average loss in sinΘ distance for heteroskedastic PCA.

In this case, σ^2 = σ_1^2 + ··· + σ_p^2 always equals 0.1p, and α characterizes the degree of heteroskedasticity: a larger α results in a more imbalanced distribution of (σ_1, ..., σ_p), and if α = 0, then σ_1 = ··· = σ_p. We generate U, Σ_0, and {Y_k, X_k, ε_k}_{k=1}^n in the same way as in the previous setting. The average estimation errors for U are plotted in Figure 2. The results again suggest that the performance of the diagonal-deletion estimator is unstable across settings. When α = 0, i.e., the noise is homoskedastic, HeteroPCA and the regular SVD perform comparably; as α increases, the estimation error of HeteroPCA grows significantly more slowly than that of the regular SVD, which is consistent with the theoretical results in Theorem 4.

Next, we consider the problem of denoising a low-rank matrix with heteroskedastic noise, discussed in Section 4.1. Let U_0 ∈ R^{p_1×r} and V_0 ∈ R^{p_2×r} have i.i.d. standard Gaussian entries, for (p_1, p_2) = (50, 200), (200, 1000) and r = 3. To introduce heteroskedasticity, we also draw i.i.d. Unif[0,1]-distributed vectors w, v_1 ∈ R^{p_1} and v_2 ∈ R^{p_2}. Then we evaluate U = QR(diag(w)^4 U_0), V = QR(V_0), and construct the signal matrix X = (p_1 p_2)^{1/4} · UV^⊤. The noise matrix is drawn as Z_{ij} ~ N(0, σ_0^2 · σ_{ij}^2) independently, where σ_{ij} = (v_1)_i^4 · (v_2)_j^4, σ_0 varies from 0 to 2, 1 ≤ i ≤ p_1, and 1 ≤ j ≤ p_2. Based on the p_1-by-p_2 observation Y = X + Z, we implement the HeteroPCA, regular SVD, and diagonal-deletion methods to estimate U and V, and plot the average sinΘ distance errors in Figure 3(a)-(d). For each pair of estimators \hat{U} and \hat{V}, we also estimate X by \hat{X} = \hat{U}\hat{U}^⊤ Y \hat{V}\hat{V}^⊤ and plot the Frobenius norm error in Figure 3(e) and (f). As one can clearly see from Figure 3, the proposed HeteroPCA outperforms the other methods in the estimation of U, V, and X, and its advantage over the others becomes more significant as the noise level increases.

[Figure 3 here: six panels showing estimation errors against σ_0 for HeteroPCA, the regular SVD, and the diagonal-deletion SVD; (a)-(b) ‖sinΘ(\hat{U}, U)‖ for (p_1, p_2) = (50, 200) and (200, 1000); (c)-(d) ‖sinΘ(\hat{V}, V)‖ for the same dimensions; (e)-(f) ‖\hat{X} − X‖_F for the same dimensions.]

Fig 3. Estimation errors of \hat{U}, \hat{V}, and \hat{X} in heteroskedastic SVD.

[Figure 4 here: two panels plotting the sinΘ distance against p_2 for HeteroPCA, the regular SVD, the diagonal-deletion SVD, and SoftImpute.]

Fig 4. Average sinΘ distance error for SVD based on heteroskedastic and incomplete data. Here, p_1 = 50, r = 5, θ = .2 (left panel) and p_1 = 100, r = 3, θ = .2 (right panel); p_2 varies from 800 to 3200.

Finally, we study the problem of SVD based on heteroskedastic and incomplete data from Section 4.3 in the following experiments. We generate Y, X, Z ∈ R^{p_1×p_2} in the same way as in the previous heteroskedastic SVD setting, with p_1 = 50, 100, r = 3, 5, σ_0 = .2, and p_2 ranging from 800 to 3200. Each entry of Y is observed independently with probability θ = 0.1. We aim to estimate U based on {Y_{ij} : (i,j) ∈ Ω}. In addition to HeteroPCA, the regular SVD, and the diagonal-deletion SVD, we also consider the nuclear norm minimization estimator (Mazumder et al., 2010, Soft-Impute package),

    \hat{X}^* = argmin_{\hat{X} ∈ R^{p_1×p_2}} Σ_{(i,j)∈Ω} (Y_{ij} − \hat{X}_{ij})^2 + λ ‖\hat{X}‖_*,   \hat{U} = SVD_r(\hat{X}^*).

To avoid the cumbersome issue of choosing the tuning parameter λ, we evaluate the nuclear norm minimization estimator over a grid of values of λ and record the outcome with the minimum sinΘ distance error ‖sinΘ(\hat{U}, U)‖. From the results plotted in Figure 4, we see that HeteroPCA significantly outperforms the other methods.

6. Discussions. We considered PCA in the presence of heteroskedastic noise in this paper. To alleviate the significant bias incurred on the diagonal entries of the Gram matrix due to heteroskedastic noise, we introduced a new procedure named HeteroPCA that adaptively imputes the diagonal entries to remove the bias. The proposed procedure achieves the optimal rate of convergence in a range of settings. In addition, we discussed several applications of the proposed algorithm, including heteroskedastic low-rank matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data.


The proposed HeteroPCA procedure can also be applied to many other problems where the noise is heteroskedastic. First, exponential family PCA is a commonly used technique for dimension reduction on non-real-valued datasets (Collins et al., 2002; Mohamed et al., 2009). As discussed in the introduction, the exponential family distributions, e.g., exponential, binomial, and negative binomial, may be highly heteroskedastic. As in the case of Poisson PCA considered in Section 4.2, the proposed HeteroPCA algorithm can be applied to exponential family PCA.

In addition, community detection in social networks has attracted significant attention in the recent literature (Fortunato, 2010; Newman, 2013). Although most of the existing results focus on unipartite graphs, bipartite graphs, i.e., graphs whose edges all lie between two groups of nodes, often appear in practice (Melamed, 2014; Florescu and Perkins, 2016; Alzahrani and Horadam, 2016; Zhou and Amini, 2018). The proposed HeteroPCA can also be applied to community detection for the bipartite stochastic block model. Similarly to the analysis for heteroskedastic low-rank matrix denoising in Section 4.1, HeteroPCA can be shown to have advantages over other baseline methods.

The proposed framework is also applicable to the heteroskedastic tensor SVD problem, which aims to recover the low-rank structure from a tensorial observation corrupted by heteroskedastic noise. Suppose one observes Y = X + Z ∈ R^{p_1×p_2×p_3}, where X is a Tucker low-rank signal tensor and Z is the noise tensor with independent and zero-mean entries. If Z is homoskedastic, the higher-order orthogonal iteration (HOOI) (De Lathauwer et al., 2000) was shown to achieve the optimal performance for recovering X (Zhang and Xia, 2018). If Z is heteroskedastic, we can apply HeteroPCA instead of the regular SVD to obtain a better initialization for HOOI. Similarly to the arguments in this article, one can show that this modified HOOI yields more stable and accurate estimates than the regular HOOI.

Canonical correlation analysis (CCA) is one of the most important tools in multivariate analysis for exploring the relationship between two sets of vector samples (Hotelling, 1936). In the standard procedure of CCA, the core step is a regular SVD on the adjusted cross-covariance matrix between samples. When the observations contain heteroskedastic noise, one can replace the regular SVD procedure by HeteroPCA to achieve better performance.

7. Proofs. In this section, we prove the main results, namely, Theorems 2, 3, 4, and Proposition 2. For reasons of space, the remaining proofs are given in Section A of the supplementary materials (Zhang et al., 2018).

7.1. Proofs of Deterministic Robust Perturbation Analysis.


Proofs of Theorems 2 and 3. We first prove the more general statement of Theorem 3. To characterize how the proposed procedure refines the estimate through the initialization and the iterations, define T_0 = ‖Δ(N − M)‖ = ‖Δ(Z)‖ and K_t = ‖N^{(t)} − M‖ for t = 0, 1, .... The initial error satisfies

    K_0 = ‖N^{(0)} − M‖ = ‖Δ(N) − M‖ ≤ ‖Δ(N − M)‖ + ‖G(M)‖ = ‖Δ(Z)‖ + ‖G(P_U M P_U)‖
        ≤ ‖Δ(Z)‖ + (I(U) r b / m) ‖M‖ = T_0 + (I(U) r b / m) ‖M‖,

where the last inequality follows from Lemma 6. Provided that (I(U) r b / m) ‖M‖ ≤ σ_r(M)/(16η), which is guaranteed by the assumption, we have

(7.1)    K_0 ≤ T_0 + σ_r(M)/(16η).

By the definitions,

(7.2)    \tilde{N}^{(t−1)} = P_{U^{(t−1)}} N^{(t−1)},   Δ(N^{(t)}) = Δ(N^{(t−1)}) = ··· = Δ(N),   G(N^{(t)}) = G(\tilde{N}^{(t−1)}).

Then for all t ≥ 0,

(7.3)    ‖Δ(N^{(t)} − M)‖ = ‖Δ(N − M)‖ = ‖Δ(Z)‖ = T_0.

The analysis of ‖G(N^{(t)} − M)‖ is more involved. Recall that U^{(t−1)} contains the leading r principal components of N^{(t−1)}. Then,

(7.4)    ‖G(N^{(t)} − M)‖ = ‖G(\tilde{N}^{(t−1)} − M)‖ = ‖G(P_{U^{(t−1)}} N^{(t−1)} − M)‖
         = ‖G(P_{U^{(t−1)}} N^{(t−1)} − P_{U^{(t−1)}} M − P_{U^{(t−1)}_⊥} M)‖
         ≤ ‖G(P_{U^{(t−1)}} (N^{(t−1)} − M))‖ + ‖G(P_{U^{(t−1)}_⊥} M)‖
         ≤ ‖G(P_U (N^{(t−1)} − M))‖ + ‖G((P_{U^{(t−1)}} − P_U)(N^{(t−1)} − M))‖ + ‖G(P_{U^{(t−1)}_⊥} M)‖,

where the first equality uses (7.2). Next we bound these three terms separately.

• By Lemma 6,

(7.5)    ‖G(P_U (N^{(t−1)} − M))‖ ≤ √( I(U) r b (b ∧ r) / m ) ‖N^{(t−1)} − M‖ = √( I(U) r b (b ∧ r) / m ) K_{t−1}.

• Note that U

(t�1)(U (t�1))> and UU

> are both positive semi-definite andkU (t�1)(U (t�1))>k _ kUU

>k 1, we have kU (t�1)(U (t�1))> � UU

>k 1.By Lemma 1 in Cai and Zhang (2018),�

U

(t�1)(U (t�1))> � UU

>�

2k sin⇥(U (t�1)

, U)k ^ 1 = 2k(U (t�1)

? )>Uk ^ 1

2�

(U (t�1)

? )>UU

>M

· ��1

min

(U>M)⌘

^ 1

2�

(U (t�1)

? )>M�

· ��1

r

(M)⌘

^ 1

4kN (t�1) �Mk�

r

(M)

!

^ 1 =4K

t�1

r

(M)^ 1,

where the penultimate step follows from Lemma 7. Note that

rank((PU

(t�1) � P

U

)(N (t�1) �M)) rank(PU

(t�1) � P

U

)

rank(PU

(t�1)) + rank(PU

) 2r,

we have�

G

(PU

(t�1) � P

U

) · (N (t�1) �M)⌘

⌘ · kPU

(t�1) � P

U

k · kN (t�1) �Mk ⌘K

t�1

·✓

4Kt�1

r

(M)^ 1

.

(7.6)

• By Lemmas 6 and 7,

G(PU

(t�1)?

M)�

=�

G(PU

(t�1)?

MP

U

)�

r

I(U)rb(b ^ r)

m

P

U

(t�1)?

M

2

r

I(U)rb(b ^ r)

m

N

(t�1) �M

= 2

r

I(U)rb(b ^ r)

m

K

t�1

.

(7.7)

Combining (7.3)–(7.7), we have

K

t

k�(N (t) �M)k+ kG(N (t) �M)k

T

0

+ 3

r

I(U)rb(b ^ r)

m

K

t�1

+4⌘

r

(M)K

2

t�1

,(7.8)

for all t � 1.

Finally, we use induction to show that for all t � 0,

(7.9) K

t

2T0

+�

r

(M)

2�(t+4)

.

Page 27: Heteroskedastic PCA: Algorithm, Optimality, and Applicationstcai/paper/HeteroPCA.pdf · HETEROSKEDASTIC PCA 3 et al., 2014; Cao et al., 2017), which are naturally heteroskedastic

HETEROSKEDASTIC PCA 27

The base case of t = 0 is proved by (7.1). Next, suppose the statement (7.9)holds for t� 1. Then

K

t

(a)

T

0

+ 3

r

I(U)rb(b ^ r)

m

K

t�1

+4⌘

r

(M)K

2

t�1

(b)

T

0

+K

t�1

4+K

t�1

8⌘T0

r

(M)+

1

4

(c)

T

0

+K

t�1

2

(d)

T

0

+2T

0

+ �

r

(M) · (1/2)(t�1)+4

/⌘

2=2T

0

+ �

r

(M) · (1/2)t+4

/⌘,

where (a) is (7.8); (b) is due to the assumption 144I(U)rb(b^r) m and theinduction hypothesis; (c) follows from the assumption T

0

r

(M)/(64⌘);

(d) is again by the induction hypothesis. Therefore, for all t � ⌦(log �r(M)

T0⌘_

1) = ⌦(log �r(M)

⌘k�(Z)k _1), we have Kt

3T0

. Finally, the desired (2.14) followsfrom Davis-Kahan’s sin⇥ theorem, completing the proof of Theorem 3.

In the special case where the corruption set G is the diagonal, i.e., G ={(i, i) : 1 i m}, we have

b = maxi

{j : (i, j) 2 G} _maxj

{i : (i, j) 2 G} = 1,

⌘ = maxM

kD(M)k/kMk = maxM

maxi

|Mii

|kMk = 1.

Then Theorem 2 follows from Theorem 3.

Proof of Proposition 1. We first develop the lower bound with the incoherence constraint, assuming first that $\delta/\lambda \le 1/\sqrt 2$. Let $d = 2\lfloor m/(2r)\rfloor$ and let $\alpha, \beta \in \mathbb{R}^d$ be the unit vectors
\[
\alpha = \frac{1}{\sqrt d}(1, \ldots, 1), \qquad \beta = \frac{1}{\sqrt{d(1+\theta^2)}}(1+\theta, \ldots, 1+\theta, 1-\theta, \ldots, 1-\theta).
\]
Clearly, $f(\theta) := \|\alpha\alpha^\top - \beta\beta^\top\|$ is a continuous function of $\theta$ with $f(0) = 0$ and $f(1) = 1/\sqrt 2$, so there exists $0 \le \theta \le 1$ such that
\begin{equation}
\|\alpha\alpha^\top - \beta\beta^\top\| = \delta/\lambda. \tag{7.10}
\end{equation}
Based on (7.10), we additionally construct
\begin{equation}
U^{(1)} = \begin{bmatrix} \alpha_1 I_r\\ \vdots\\ \alpha_d I_r\\ 0_{(p-rd)\times r}\end{bmatrix}, \qquad U^{(2)} = \begin{bmatrix} \beta_1 I_r\\ \vdots\\ \beta_d I_r\\ 0_{(p-rd)\times r}\end{bmatrix}. \tag{7.11}
\end{equation}
Here, $\frac{1}{\sqrt d}I_r$ is repeated $d$ times in $U^{(1)}$; both $\frac{1+\theta}{\sqrt{d(1+\theta^2)}}I_r$ and $\frac{1-\theta}{\sqrt{d(1+\theta^2)}}I_r$ are repeated $d/2$ times in $U^{(2)}$. Let $M^{(1)} = \lambda U^{(1)}(U^{(1)})^\top$, $M^{(2)} = \lambda U^{(2)}(U^{(2)})^\top$, $Z^{(1)} = \frac{1}{2}(M^{(2)} - M^{(1)})$, and $Z^{(2)} = \frac{1}{2}(M^{(1)} - M^{(2)})$. By this construction,
\begin{gather*}
\sigma_r(M^{(1)}) = \sigma_r(M^{(2)}) = \lambda, \qquad \|M^{(1)}\|/\sigma_r(M^{(1)}) = \|M^{(2)}\|/\sigma_r(M^{(2)}) = 1,\\
I(U^{(1)}) = \frac{m}{r}\max_i \|e_i^\top U^{(1)}\|_2^2 \le \frac{m}{rd} = \frac{m}{r\cdot 2\lfloor m/(2r)\rfloor} < \frac{m}{r\cdot 2(m/(2r)-1)} \le 2,\\
I(U^{(2)}) = \frac{m}{r}\max_i \|e_i^\top U^{(2)}\|_2^2 \le \frac{m(1+\theta)^2}{r\cdot d(1+\theta^2)} \le \frac{m}{r}\cdot\frac{2}{d} = \frac{m}{r}\cdot\frac{1}{\lfloor m/(2r)\rfloor} \le
\begin{cases} 4\cdot 1 \le 4, & \text{if } 2r \le m \le 4r;\\[2pt] \frac{m}{r}\cdot\frac{1}{m/(2r)-1} = \frac{2m}{m-2r} \le 4, & \text{if } m \ge 4r+1,\end{cases}\\
\|\Delta(Z^{(1)})\| = \|\Delta(Z^{(2)})\| \overset{\text{Lemma 4}}{\le} 2\|Z^{(2)}\| \le 2\cdot\frac{1}{2}\big\|M^{(2)} - M^{(1)}\big\| = \lambda\|\alpha\alpha^\top - \beta\beta^\top\| = \delta,
\end{gather*}
which means $(M^{(1)}, Z^{(1)}), (M^{(2)}, Z^{(2)}) \in \mathcal{D}_{p,r}(\lambda, \delta, t)$ for $t \ge 4$. On the other hand, by (Cai and Zhang, 2018, Lemma 1),
\begin{align*}
\big\|\sin\Theta(U^{(1)}, U^{(2)})\big\| &\ge \frac{1}{2}\big\|U^{(1)}(U^{(1)})^\top - U^{(2)}(U^{(2)})^\top\big\|\\
&\overset{(7.11)}{=} \frac{1}{2}\left\|\begin{bmatrix}(\alpha_1^2 - \beta_1^2)I_r & \cdots & (\alpha_1\alpha_d - \beta_1\beta_d)I_r\\ \vdots & \ddots & \vdots\\ (\alpha_d\alpha_1 - \beta_d\beta_1)I_r & \cdots & (\alpha_d^2 - \beta_d^2)I_r\end{bmatrix}\right\| = \frac{1}{2}\|\alpha\alpha^\top - \beta\beta^\top\| = \frac{\delta}{2\lambda}.
\end{align*}
Given $M^{(1)} + Z^{(1)} = M^{(2)} + Z^{(2)}$, we have
\begin{align*}
\inf_{\hat U}\sup_{(M,Z)\in\mathcal{D}_{p,r}(\lambda,\delta,t)}\big\|\sin\Theta(\hat U, U)\big\| &\ge \inf_{\hat U}\sup_{(M,Z)\in\{(M^{(1)},Z^{(1)}),\,(M^{(2)},Z^{(2)})\}}\big\|\sin\Theta(\hat U, U)\big\|\\
&\ge \inf_{\hat U}\frac{1}{2}\Big(\big\|\sin\Theta(\hat U, U^{(1)})\big\| + \big\|\sin\Theta(\hat U, U^{(2)})\big\|\Big) \ge \frac{1}{2}\big\|\sin\Theta(U^{(1)}, U^{(2)})\big\| = \frac{\delta}{4\lambda}.
\end{align*}
Next, if $\delta/\lambda \ge \sqrt 2/2$, let $\delta_0 = \lambda\cdot\sqrt 2/2$. By the previous argument, one can show
\[
\inf_{\hat U}\sup_{(M,Z)\in\mathcal{D}_{p,r}(\lambda,\delta,t)}\big\|\sin\Theta(\hat U, U)\big\| \ge \frac{\delta_0}{4\lambda} = \frac{\sqrt 2}{8} \ge \frac{\sqrt 2}{8}\left(\frac{\delta}{\lambda}\wedge 1\right).
\]
In summary, we must have
\[
\inf_{\hat U}\sup_{(M,Z)\in\mathcal{D}_{p,r}(\lambda,\delta,t)}\big\|\sin\Theta(\hat U, U)\big\| \ge \frac{\sqrt 2}{8}\left(\frac{\delta}{\lambda}\wedge 1\right)
\]
in the first scenario that $t \ge 4$.

Then we consider the second part, where $t \ge m/r$. Let
\[
U^{(1)} = \begin{bmatrix} I_r\\ 0_{(m-r)\times r}\end{bmatrix}, \qquad U^{(2)} = \begin{bmatrix} 0_{r\times r}\\ I_r\\ 0_{(m-2r)\times r}\end{bmatrix}
\]
be two orthogonal matrices, and let $M^{(1)} = \lambda U^{(1)}(U^{(1)})^\top$, $M^{(2)} = \lambda U^{(2)}(U^{(2)})^\top$, $Z^{(1)} = -M^{(1)}$, $Z^{(2)} = -M^{(2)}$. Then clearly $M^{(1)} + Z^{(1)} = M^{(2)} + Z^{(2)}$,
\[
\sigma_r(M^{(1)}) = \sigma_r(M^{(2)}) \ge \lambda, \qquad \|\Delta(Z^{(1)})\| = \|\Delta(Z^{(2)})\| = 0, \qquad \big\|\sin\Theta(U^{(1)}, U^{(2)})\big\| = \big(1 - \sigma_r^2((U^{(1)})^\top U^{(2)})\big)^{1/2} = (1-0)^{1/2} = 1.
\]
Moreover, for any $t \ge m/r$,
\[
I(U^{(1)}) = \frac{m}{r}\max_i\|e_i^\top U^{(1)}\|_2^2 = \frac{m}{r} \le t, \qquad I(U^{(2)}) = \frac{m}{r}\max_i\|e_i^\top U^{(2)}\|_2^2 = \frac{m}{r} \le t.
\]
We thus have $(M^{(1)}, Z^{(1)}), (M^{(2)}, Z^{(2)}) \in \mathcal{D}_{p,r}(\lambda, \delta, t)$ if $t \ge m/r$. Given $M^{(1)} + Z^{(1)} = M^{(2)} + Z^{(2)}$, we have
\begin{align*}
\inf_{\hat U}\sup_{(M,Z)\in\mathcal{D}_{p,r}(\lambda,\delta,t)}\big\|\sin\Theta(\hat U, U)\big\| &\ge \inf_{\hat U}\sup_{(M,Z)\in\{(M^{(1)},Z^{(1)}),\,(M^{(2)},Z^{(2)})\}}\big\|\sin\Theta(\hat U, U)\big\|\\
&\ge \inf_{\hat U}\frac{1}{2}\Big(\big\|\sin\Theta(\hat U, U^{(1)})\big\| + \big\|\sin\Theta(\hat U, U^{(2)})\big\|\Big) \ge \frac{1}{2}\big\|\sin\Theta(U^{(1)}, U^{(2)})\big\| = \frac{1}{2},
\end{align*}
which completes the proof.

7.2. Proofs in Heteroskedastic PCA.

Proof of Theorem 4. Based on the generalized spiked covariance model, we introduce
\[
Z = [\varepsilon_1, \ldots, \varepsilon_n] \in \mathbb{R}^{p\times n}, \qquad \gamma_k = \Lambda^{-1/2}U^\top(X_k - \mu) \in \mathbb{R}^r, \qquad \Gamma = [\gamma_1, \ldots, \gamma_n] \in \mathbb{R}^{r\times n}.
\]
Then the observations can be written as
\[
Y_k = X_k + \varepsilon_k = \mu + U\Lambda^{1/2}\gamma_k + \varepsilon_k, \qquad \text{or} \qquad Y = \mu 1_n^\top + U\Lambda^{1/2}\Gamma + Z,
\]
where $\mu \in \mathbb{R}^p$ is a fixed vector, $\mathbb{E}\gamma_k = 0$, $\mathrm{Cov}(\gamma_k) = I_r$, $Z$ has independent entries, and $\Gamma$ has independent columns. Since $\hat\Sigma$ is invariant under any translation of $Y$, we can assume $\mu = 0$ without loss of generality. The rest of the proof is divided into three steps for the sake of presentation.

Step 1. We define $\Sigma_X = XX^\top/n - \bar X\bar X^\top$ and consider the following decomposition of $n(\hat\Sigma - \Sigma_X)$:
\begin{align}
n(\hat\Sigma - \Sigma_X) &= n\hat\Sigma - (XX^\top - n\bar X\bar X^\top) = \sum_{k=1}^n (Y_k - \bar Y)(Y_k - \bar Y)^\top - (XX^\top - n\bar X\bar X^\top)\notag\\
&= YY^\top - n\bar Y\bar Y^\top - (XX^\top - n\bar X\bar X^\top)\notag\\
&= (X+Z)(X+Z)^\top - (XX^\top - n\bar X\bar X^\top) - n\big(\bar X\bar X^\top + \bar X\bar Z^\top + \bar Z\bar X^\top + \bar Z\bar Z^\top\big)\notag\\
&= XZ^\top + ZX^\top + ZZ^\top - n\big(\bar X\bar Z^\top + \bar Z\bar X^\top + \bar Z\bar Z^\top\big). \tag{7.12}
\end{align}
We analyze each term of (7.12) separately as follows. Since $Z$ has independent entries and $\mathrm{Var}(Z_{ij}) = \sigma_i^2$, the rowwise structured heteroskedastic concentration inequality (c.f., Cai et al. (2018)) implies
\begin{equation}
\mathbb{E}\big\|ZZ^\top - \mathbb{E}ZZ^\top\big\| \lesssim \sqrt n\,\sigma\sigma_* + \sigma^2. \tag{7.13}
\end{equation}
By Lemma 2,
\begin{equation}
\mathbb{E}_Z\big\|ZX^\top - \mathbb{E}ZX^\top\big\| = \mathbb{E}_Z\big\|ZX^\top\big\| \lesssim \|X\|\big(\sigma + r^{1/4}\sqrt{\sigma\sigma_*} + \sqrt r\,\sigma_*\big) \lesssim \|X\|\big(\sigma + \sqrt r\,\sigma_*\big). \tag{7.14}
\end{equation}
Since $\mathbb{E}\|\bar Z\|_2^2 = \sum_{i=1}^p \mathbb{E}\bar Z_i^2 = \sum_{i=1}^p \sigma_i^2/n = \sigma^2/n$, we have
\begin{align}
\mathbb{E}_Z\big\|n\big(\bar X\bar Z^\top + \bar Z\bar X^\top + \bar Z\bar Z^\top\big)\big\| &\le \mathbb{E}_Z n\|\bar X\bar Z^\top\| + \mathbb{E}_Z n\|\bar Z\bar X^\top\| + \mathbb{E}_Z n\|\bar Z\bar Z^\top\|\notag\\
&\le 2n\|\bar X\|_2\cdot\mathbb{E}_Z\|\bar Z\|_2 + \mathbb{E}_Z n\|\bar Z\|_2^2 \le 2n\|\bar X\|_2\cdot\big(\mathbb{E}_Z\|\bar Z\|_2^2\big)^{1/2} + \mathbb{E}_Z n\|\bar Z\|_2^2 \le 2n^{1/2}\sigma\|\bar X\|_2 + \sigma^2. \tag{7.15}
\end{align}
Combining (7.13), (7.14), and (7.15), we have
\[
\mathbb{E}_Z\big\|n\hat\Sigma - n\Sigma_X - \mathbb{E}ZZ^\top\big\| \lesssim \sqrt n\,\sigma\sigma_* + \sigma^2 + \|X\|(\sigma + \sqrt r\,\sigma_*) + n^{1/2}\|\bar X\|_2\,\sigma.
\]
Noting that $\mathbb{E}ZZ^\top = nD$ and $\Delta(\cdot)$ is the operator that sets all diagonal entries to zero, we further have
\[
\mathbb{E}_Z\big\|\Delta(n\hat\Sigma - n\Sigma_X)\big\| \le \mathbb{E}_Z\big\|\Delta(n\hat\Sigma - n\Sigma_X - \mathbb{E}ZZ^\top)\big\| \overset{\text{Lemma 4}}{\le} 2\,\mathbb{E}_Z\big\|n\hat\Sigma - n\Sigma_X - \mathbb{E}ZZ^\top\big\| \lesssim \sqrt n\,\sigma\sigma_* + \sigma^2 + \|X\|(\sigma + \sqrt r\,\sigma_*) + n^{1/2}\|\bar X\|_2\,\sigma.
\]
Since $\mathrm{rank}(\Sigma_X) \le r$, the leading eigenvectors of $\Sigma_X$ are given by $U$, and $U$ satisfies the incoherence condition $I(U) \le c_I\,p/r$, the robust $\sin\Theta$ theorem (Theorem 2) yields
\begin{equation}
\mathbb{E}_Z\big\|\sin\Theta(\hat U, U)\big\| \lesssim \frac{\mathbb{E}_Z\|\Delta(n\hat\Sigma - n\Sigma_X)\|}{\sigma_r(n\Sigma_X)} \wedge 1 \lesssim \frac{\sqrt n\,\sigma\sigma_* + \sigma^2 + \|X\|(\sigma + \sqrt r\,\sigma_*) + n^{1/2}\|\bar X\|_2\,\sigma}{\sigma_r(n\Sigma_X)} \wedge 1. \tag{7.16}
\end{equation}

Step 2. Next, we study the expectation of the target quantity with respect to $X$; specifically, we need to control $\sigma_r(n\Sigma_X)$, $\|X\|$, and $\|\bar X\|_2$. Since $\Gamma \in \mathbb{R}^{r\times n}$ has independent columns and each column is isotropic sub-Gaussian, random matrix theory (Vershynin, 2010, Corollary 5.35) gives
\[
\mathbb{P}\Big(\sqrt n + C\sqrt r + t \ge \|\Gamma\| \ge \sigma_r(\Gamma) \ge \sqrt n - C\sqrt r - t\Big) \ge 1 - \exp(-Ct^2/2).
\]
In addition, $\sqrt n\,\bar\gamma \in \mathbb{R}^r$ is a sub-Gaussian vector satisfying
\[
\max_{q\ge 1}\max_{\|v\|_2\le 1}\, q^{-1/2}\big(\mathbb{E}|v^\top\cdot\sqrt n\,\bar\gamma|^q\big)^{1/q} \le C
\]
for any $v \in \mathbb{R}^r$. By the Bernstein-type concentration inequality (Vershynin, 2010, Proposition 5.16),
\[
\mathbb{P}\Big(\big\|\sqrt n\,\bar\gamma\big\|_2^2 \ge r + C\sqrt{rx} + Cx\Big) \le C\exp(-cx).
\]
If $n \ge Cr$ for some large constant $C > 0$, by setting $t = c\sqrt n$ and $x = cn$ in the previous two inequalities, we have
\begin{equation}
2\sqrt n \ge \|\Gamma\| \ge \sigma_r(\Gamma) \ge \sqrt n/2 \qquad \text{and} \qquad \big\|\sqrt n\,\bar\gamma\big\|_2 \le \sqrt n/3 \tag{7.17}
\end{equation}
with probability at least $1 - C\exp(-cn)$. When (7.17) holds,
\begin{align*}
\sigma_r(n\Sigma_X) &= \sigma_r\big(XX^\top - n\bar X\bar X^\top\big) = \sigma_r\big(U\Lambda^{1/2}(\Gamma\Gamma^\top - n\bar\gamma\bar\gamma^\top)\Lambda^{1/2}U^\top\big)\\
&\ge \sigma_r(\Lambda)\cdot\sigma_r\big(\Gamma\Gamma^\top - n\bar\gamma\bar\gamma^\top\big) \ge \sigma_r(\Lambda)\big(\sigma_r^2(\Gamma) - \|\sqrt n\,\bar\gamma\|_2^2\big) \overset{(7.17)}{\ge} \sigma_r(\Lambda)\,(n/4 - n/9) \gtrsim n\sigma_r(\Lambda);
\end{align*}
\begin{equation}
\|X\| \le \|U\Lambda^{1/2}\Gamma\| \le \|\Lambda^{1/2}\|\cdot\|\Gamma\| \overset{(7.17)}{\lesssim} \sqrt n\,\sigma_r^{1/2}(\Lambda); \qquad \|\bar X\|_2 = \|U\Lambda^{1/2}\bar\gamma\|_2 \le \|\Lambda^{1/2}\|\cdot\|\bar\gamma\|_2 \lesssim \sigma_r^{1/2}(\Lambda). \tag{7.18}
\end{equation}
By combining the previous three inequalities and (7.16), we know that when (7.17) holds,
\begin{align}
\mathbb{E}_Z\big\|\sin\Theta(\hat U, U)\big\| &\lesssim \frac{\sqrt n\,\sigma\sigma_* + \sigma^2 + (n\sigma_r(\Lambda))^{1/2}(\sigma + \sqrt r\,\sigma_*) + (n\sigma_r(\Lambda))^{1/2}\sigma}{n\sigma_r(\Lambda)} \wedge 1\notag\\
&\lesssim \left(\frac{\sigma + \sqrt r\,\sigma_*}{(n\sigma_r(\Lambda))^{1/2}} + \frac{\sqrt n\,\sigma\sigma_* + \sigma^2}{n\sigma_r(\Lambda)}\right) \wedge 1 \lesssim \left(\frac{\sigma + \sqrt r\,\sigma_*}{(n\sigma_r(\Lambda))^{1/2}} + \frac{\sigma\sigma_*}{n^{1/2}\sigma_r(\Lambda)}\right) \wedge 1. \tag{7.19}
\end{align}
Here, the last inequality is due to $\sigma^2/(n\sigma_r(\Lambda)) \wedge 1 \le \sigma/(n\sigma_r(\Lambda))^{1/2} \wedge 1$.

Step 3. Finally,
\begin{align*}
\mathbb{E}\big\|\sin\Theta(\hat U, U)\big\| &= \mathbb{E}\big\|\sin\Theta(\hat U, U)\big\|\,1_{\{(7.17)\text{ holds}\}} + \mathbb{E}\big\|\sin\Theta(\hat U, U)\big\|\,1_{\{(7.17)\text{ does not hold}\}}\\
&\lesssim \left(\frac{\sigma + \sqrt r\,\sigma_*}{(n\sigma_r(\Lambda))^{1/2}} + \frac{\sigma\sigma_*}{n^{1/2}\sigma_r(\Lambda)}\right)\wedge 1 + \mathbb{P}\big((7.17)\text{ does not hold}\big)\\
&\lesssim \left(\frac{\sigma + \sqrt r\,\sigma_*}{(n\sigma_r(\Lambda))^{1/2}} + \frac{\sigma\sigma_*}{n^{1/2}\sigma_r(\Lambda)}\right)\wedge 1 + C\exp(-cn) \lesssim \left(\frac{\sigma + \sqrt r\,\sigma_*}{(n\sigma_r(\Lambda))^{1/2}} + \frac{\sigma\sigma_*}{n^{1/2}\sigma_r(\Lambda)}\right)\wedge 1.
\end{align*}
The last inequality holds because $C\exp(-cn)$ is dominated by the preceding term under the assumption on $\sigma_r(\Lambda)$. This finishes the proof of Theorem 4.


References.

Aflalo, Y. and Kimmel, R. (2013). Spectral multidimensional scaling. Proceedings of the National Academy of Sciences, page 201308708.

Alzahrani, T. and Horadam, K. (2016). Community detection in bipartite networks: Algorithms and case studies. In Complex Systems and Networks, pages 25–50. Springer.

Bai, Z. and Yao, J. (2012). On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177.

Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408.

Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Cai, T. T., Li, X., and Ma, Z. (2016). Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221–2251.

Cai, T. T., Wu, Y., and Zhang, A. (2018). Heteroskedastic Wishart-type concentration inequalities. Preprint.

Cai, T. T. and Zhang, A. (2016). Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. Journal of Multivariate Analysis, 150:55–74.

Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):11.

Candes, E. J., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007.

Candes, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717.

Candes, E. J., Sing-Long, C. A., and Trzasko, J. D. (2013). Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Transactions on Signal Processing, 61(19):4643–4657.

Candes, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Cao, Y., Zhang, A., and Li, H. (2017). Multi-sample estimation of bacterial composition matrix in metagenomics data. arXiv preprint arXiv:1706.02380.

Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214.

Chen, Y. and Suh, C. (2015). Spectral MLE: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380.

Cochran, R. N. and Horne, F. H. (1977). Statistically weighted principal component analysis of rapid scanning wavelength kinetics experiments. Analytical Chemistry, 49(6):846–853.

Collins, M., Dasgupta, S., and Schapire, R. E. (2002). A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems, pages 617–624.

Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342.

Dobriban, E., Leeb, W., and Singer, A. (2016). PCA from noisy, linearly reduced data: the diagonal case. arXiv preprint arXiv:1611.10333.

Donath, W. E. and Hoffman, A. J. (2003). Lower bounds for the partitioning of graphs. In Selected Papers of Alan J. Hoffman: With Commentary, pages 437–442. World Scientific.

Donoho, D. and Gavish, M. (2014). Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440.

Florescu, L. and Perkins, W. (2016). Spectral thresholds in the bipartite stochastic block model. In Conference on Learning Theory, pages 943–959.

Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5):75–174.

Golub, G. H., Hoffman, A., and Stewart, G. W. (1987). A generalization of the Eckart-Young-Mirsky matrix approximation theorem. Linear Algebra and its Applications, 88:317–327.

Hao, B., Zhang, A., and Cheng, G. (2018). Sparse and low-rank tensor estimation via cubic sketchings. arXiv preprint arXiv:1801.09326.

Hong, D., Balzano, L., and Fessler, J. A. (2016). Towards a theoretical analysis of PCA for heteroscedastic data. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 496–503. IEEE.

Hong, D., Balzano, L., and Fessler, J. A. (2018). Asymptotic performance of PCA for high-dimensional heteroscedastic data. Journal of Multivariate Analysis.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4):321–377.

Jin, J., Wang, W., et al. (2016). Influential features PCA for high dimensional clustering. The Annals of Statistics, 44(6):2323–2359.

Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327.

Jolliffe, I. (2002). Principal Component Analysis. Springer, New York, 2nd edition.

Ke, Z. T. and Wang, M. (2017). A new SVD approach to optimal topic estimation. arXiv preprint arXiv:1704.07016.

Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078.

Koltchinskii, V., Lounici, K., and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329.

Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338.

Liu, L. T., Dobriban, E., and Singer, A. (2016). ePCA: High dimensional exponential family PCA. arXiv preprint arXiv:1611.05550.

Lounici, K. et al. (2014). High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058.

Massart, P. (2007). Concentration Inequalities and Model Selection.

Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322.

Melamed, D. (2014). Community structures in bipartite networks: A dual-projection approach. PloS One, 9(5):e97823.

Mohamed, S., Ghahramani, Z., and Heller, K. A. (2009). Bayesian exponential family PCA. In Advances in Neural Information Processing Systems, pages 1089–1096.

Negahban, S., Oh, S., and Shah, D. (2012). Rank centrality: Ranking from pairwise comparisons. arXiv preprint arXiv:1209.1688.

Newman, M. E. (2013). Spectral methods for community detection and graph partitioning. Physical Review E, 88(4):042822.

Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.

Richard, E. and Montanari, A. (2014). A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905.

Salmon, J., Harmany, Z., Deledalle, C.-A., and Willett, R. (2014). Poisson noise reduction with non-local PCA. Journal of Mathematical Imaging and Vision, 48(2):279–294.

Shabalin, A. A. and Nobel, A. B. (2013). Reconstruction of a low-rank matrix in the presence of Gaussian noise. Journal of Multivariate Analysis, 118:67–76.

Sun, R. and Luo, Z.-Q. (2016). Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Vershynin, R. (2011). Spectral norm of products of random and deterministic matrices. Probability Theory and Related Fields, 150(3-4):471–509.

Wedin, P.-A. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.

Yao, J., Zheng, S., and Bai, Z. (2015). Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge University Press.

Zhang, A., Cai, T. T., and Wu, Y. (2018). Supplement to "Heteroskedastic PCA: Algorithm, optimality, and applications". Technical report.

Zhang, A. and Han, R. (2018). Optimal sparse singular value decomposition for high-dimensional high-order data. arXiv preprint arXiv:1809.01796.

Zhang, A. and Wang, M. (2018). Spectral state compression of Markov processes. arXiv preprint arXiv:1802.02920.

Zhang, A. and Xia, D. (2018). Tensor SVD: Statistical and computational limits. IEEE Transactions on Information Theory, to appear.

Zhang, A. and Zhou, Y. (2018). A sharp and user-friendly reverse Chernoff-Cramér bound. Technical report.

Zhou, Z. and Amini, A. A. (2018). Optimal bipartite network clustering. arXiv preprint arXiv:1803.06031.

Zoran, D. and Weiss, Y. (2011). From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479–486. IEEE.