
Technical Report #2009-5, May 14, 2009

Statistical and Applied Mathematical Sciences Institute PO Box 14006

Research Triangle Park, NC 27709-4006
www.samsi.info

Sparse Bayes Inference by Annealing Entropy

Ryo Yoshida, Mike West

This material was based upon work partially supported by the National Science Foundation under Grant DMS-0635449 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Sparse Bayes Inference by Annealing Entropy

Ryo Yoshida1 Mike West2

[email protected] [email protected]

1 Institute of Statistical Mathematics, Research Organization of Information and Systems, Minato-ku, Tokyo 106-8569, Japan

2 Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA

Abstract

We introduce a novel Bayesian computational method for finding sparse estimates in a range of statistical models. Sparsity identification typically requires solving a hard combinatorial optimization over a large configuration space of sparsity patterns, and for many existing methods the presence of many improper local optima makes the problem difficult. The essence of our approach is optimization of the augmented posterior distribution, i.e. maximum a posteriori (MAP) estimation, jointly over sparsity configurations and model parameters. To make the MAP computation efficient, we impose an artificial regularizer on the posterior entropy of the sparsity configurations, where the degree of regularization is dynamically controlled by a meta-parameter called temperature. Our algorithm prescribes a schedule under which the temperature decays slowly to zero, and proceeds with iterative optimization over sparsity configurations and parameters while following the cooling schedule. In the zero-temperature limit we obtain an exact MAP estimate, which is expected to be the global or a near-optimal posterior mode. We detail the procedure for a latent factor model and discuss some intrinsic properties of the annealing.

1 Introduction

Over the past decades, a range of sparsity inference problems has been studied in statistical science, involving the use of latent factor models [Griffiths and Ghahramani, 2005, Lucas et al., 2006, Carvalho et al., 2008], principal component analysis (sparse PCA) [Jolliffe et al., 2003, Zou et al., 2006, d'Aspremont et al., 2007], structural learning for graphical models [Dobra et al., 2004, Jones et al., 2005], and variable selection in regression and classification [Tibshirani, 1996, Tipping, 2001, Zou and Hastie, 2005]. A fundamental issue in sparsity identification arises from a hard combinatorial optimization over a huge configuration space of possible sparsity patterns. Naive exploration of the most probable sparsification often gets stuck at an improper local optimum, and efficient computation is still to be developed.

Among the many sparse models, the class of latent factor models, including sparse PCA, has received intensive study because of its popularity throughout a variety of scientific fields. Although our framework is fully general, we develop most arguments around a simplified latent factor model. Given n samples, i ∈ {1, . . . , n}, the latent factor model linearly associates a vector of p observable feature variables, x_i ∈ R^p, with the k-dimensional latent factor λ_i ∈ R^k through the measurement equation

$$x_i = H\lambda_i + \nu_i \quad\text{with}\quad \lambda_i \sim N(\lambda_i|0, I) \;\text{ and }\; \nu_i \sim N(\nu_i|0, \Psi). \qquad (1)$$

The p × k factor loading matrix H defines the linear mapping from the lower-dimensional latent space R^k to the domain of the data, R^p. The noise terms are independent of the factors and follow the Gaussian distribution with diagonal covariance matrix Ψ = diag(ψ_1, . . . , ψ_p). The implied covariance matrix of the data, Σ = HH^T + Ψ, characterizes a parsimonious structure of the multivariate normal distribution. When the noise variance has the isotropic structure Ψ = ψI, (1) is referred to as probabilistic PCA [Bishop, 1999].
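As a concrete illustration, the following R fragment simulates data from model (1); the dimensions and parameter values are arbitrary placeholders chosen only for this sketch.

```r
## Minimal sketch: draw n samples from model (1) with illustrative values.
set.seed(1)
n <- 100; p <- 6; k <- 2
H      <- matrix(rnorm(p * k), p, k)                 # p x k factor loading matrix
Psi    <- diag(runif(p, 0.1, 0.5))                   # diagonal noise covariance
Lambda <- matrix(rnorm(n * k), n, k)                 # latent factors lambda_i ~ N(0, I)
Nu     <- matrix(rnorm(n * p), n, p) %*% chol(Psi)   # noise nu_i ~ N(0, Psi)
X      <- Lambda %*% t(H) + Nu                       # row i is x_i^T
Sigma  <- H %*% t(H) + Psi                           # implied covariance H H^T + Psi
```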

A sparse factor model arises when the loading matrix has some zero elements, so that the bipartite directed graph mapping elements of λ_i to those of x_i is not complete. The interest in sparsifying the factor loadings is to obtain an explicit feature mapping that identifies the subset of the p feature variables relevant to one or more blind sources in λ_i, yielding the data covariations. We now write the factor loading matrix as H → H ◦ Z. The loading matrix is broken into the elementwise product of H and the p × k binary matrix Z, which indicates a sparsity configuration with entries z_gj := (Z)_gj = 1 if the gth feature originates from the jth factor variable, and zero otherwise. The binary elements of Z are treated as discrete random variables that follow a distribution p(Z). Coupling p(Z) with the likelihood p(X|Θ, Z), where X = {x_i}_{1≤i≤n} and Θ stands for the set of unknown model parameters H and Ψ, induces the augmented posterior distribution p(Θ, Z|X), together with a prior distribution p(Θ|Z) if necessary. Our aim is to find the mode of the posterior distribution with respect to Θ and Z simultaneously. In essence, this MAP estimation amounts to simultaneous inference of the parameters and the model structure, i.e. the sparsity configuration. As will be seen, our setting encompasses some existing approaches, e.g. sparse PCA based on semidefinite programming [d'Aspremont et al., 2007] and L1-based shrinkage PCA [Jolliffe et al., 2003, Zou et al., 2006].

The MAP inference induced by the augmented posterior distribution involves a combinatorial optimization over all configurations of Z. The aim of our work is to present a new Bayesian computation that stably captures near-optimal MAP estimates. The key idea is the design of an artificial regularizer that controls the complexity of the sparsity configuration probabilities through the entropy of the posterior distribution of the binary variates, p(Z|Θ, X). The regularizer contains a meta-parameter T, called temperature, which governs the realized entropy of the configuration probabilities. We prescribe a cooling schedule under which the temperature decays to zero during the run of the optimization. Our algorithm iteratively improves the estimates of Z and Θ so as to monotonically increase the value of the posterior distribution while following the cooling schedule. In the zero-temperature limit, T → 0, the effect of the artificially incorporated regularizer is removed, and we obtain an exact MAP estimate. The essence of the annealing operation is a mechanism for avoiding entrapment in improper local optima, which bears some analogy to the behaviour of simulated annealing (SA) [Geman and Geman, 1984, Kirkpatrick et al., 1983]. Indeed, our annealing is obtained as a natural counterpart of SA. In light of this bridge to SA, we would in principle be forced to prescribe an exceedingly slow, possibly infeasible, cooling schedule in order to guarantee convergence to the global optimum. In practice, however, as the numerical evidence indicates, our approach is able to find near-optimal solutions even with linear or exponential decays of the temperature.

The outline of this paper is as follows. Section 2 describes a simple sparse latent factor model, called the graphical factor model (GFM), which is used throughout the paper for illustration, and presents some preliminary material on the subsequent Bayes inference. Section 3 introduces the basic principle of our computation, derived by annealing the entropy of the sparsity configuration probabilities, and draws a bridge to SA. Section 4 details the sparse Bayes computation for GFM inference and discusses some computational issues for large-scale problems involving large p. In Section 5, we address the control of the effective degree of sparseness from the Bayesian perspective. Section 6 highlights the inferential power and mechanisms of our annealing algorithm on artificial data sets. We then apply our framework to transcriptome analysis of breast cancer using gene expression profiles. Finally, Section 7 gives concluding remarks with a prospect towards more general settings, e.g. the use of state space models and variable selection in regression. Supporting information, including R code and supplementary materials for the transcriptome analysis, is accessible at http://daweb.ism.ac.jp/~yoshidar/anneals/.

2 Graphical Factor Models

2.1 Model Descriptions

In order to make the model mathematically identifiable, some constraints must be imposed on (1). Let R ∈ R^{k×k} be an arbitrarily chosen orthogonal matrix. Using such an R, we can generate an infinite number of equivalent models sharing the same covariance structure Σ = HH^T + Ψ by the rotation H → HR and λ_i → R^T λ_i. This rotational ambiguity sometimes causes a critical issue for sparsity inference. For instance, when a sparsity configuration of the loading matrix is specified in the light of data, we can rotate the estimate by H → HR in various ways, which typically destroys the structure of sparsity. This is always true as long as we conduct inference solely based on the likelihood criterion. Even with the inclusion of a diffuse prior distribution on sparsity configurations, we are subject to the risk that the identified sparsity is induced by an almost complete dominance of the likelihood criterion over the prior distribution.

To eliminate this lack of uniqueness in the original factor model, we specify the sparsity of H after a natural reparameterization:

$$x_i = \Psi^{1/2}\,\Phi(Z)\,\Delta^{1/2}\,\lambda_i + \nu_i.$$

The factor loading matrix of the original model is broken into three components: a p × p diagonal matrix of scaling factors, Ψ^{1/2}; a p × k sparse matrix with orthogonal columns, Φ(Z); and a k × k diagonal matrix of scaling factors with unequal elements, Δ^{1/2} = diag(δ_1^{1/2}, . . . , δ_k^{1/2}). The orthogonal loading matrix Φ(Z) is further decomposed into the elementwise product of Φ and Z, denoted by Φ(Z) = Φ ◦ Z. As illustrated above, the p × k binary matrix Z indicates the presence or absence of association between the p features and the k factors; z_gj = 1 if the gth feature x_gi originates from the jth factor λ_ij, and zero otherwise.

The factorization shown here is derived from the singular value decomposition of the scaled factor loading matrix, Ψ^{-1/2} H = Φ(Z) Δ^{1/2} R, and the removal of the k × k orthogonal matrix R. This parameterization implicitly imposes orthogonality on the scaled factor loading matrix, i.e. H^T Ψ^{-1} H = Δ. Hence, any subsequent rotation HR is required to preserve the diagonality of (HR)^T Ψ^{-1} (HR) = R^T Δ R with unequal positive elements. At this point, the family of admissible R is restricted to diagonal matrices of ±1. Specifying a sign condition on an arbitrarily chosen row of H then leads to the identity matrix R = I, since the corresponding row of HR must have the same sign as in the original H. As the assignment of the sign condition is arbitrary, this eliminates the lack of uniqueness.

The motivation for this parameterization also comes from another aspect beyond the problem of rotational ambiguity. Note that the latent factor model described here is characterized by the covariance matrix Σ and the precision matrix Σ^{-1} as follows:

$$\Sigma = \Psi^{1/2}\bigl(\Phi(Z)\Delta\Phi(Z)^T + I\bigr)\Psi^{1/2} \quad\text{and}\quad \Sigma^{-1} = \Psi^{-1/2}\bigl(I - \Phi(Z)D\Phi(Z)^T\bigr)\Psi^{-1/2} \qquad (2)$$

where D = diag(τ_1, . . . , τ_k) with positive diagonal elements τ_j = δ_j(1 + δ_j)^{-1} ∈ (0, 1). Sparsity in Φ(Z), corresponding to a sparse factor model, can therefore induce sparsity in Σ^{-1} simultaneously with that in Σ. In general, a sparse factor loading matrix induces some zero elements in the covariance matrix, but the configuration of zero elements is not preserved in the implied precision matrix, except in very sparse or block diagonal cases. For Gaussian random variables, a pattern of zeros in the precision matrix defines a conditional independence graph, or graphical model. In our model, the set of feature variables associated with one specific factor forms a clique in the induced graphical model, with sets of variables that have non-zero loadings on any two factors lying in the separating subgraph between the corresponding cliques. Here we have a natural and appealing framework in which sparse factor models and graphical models are reconciled and consistent. The reconciliation then allows induction of graphical models from sparse factor models. We refer to this model as the graphical factor model.
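The identities in (2) are easy to check numerically. The following R fragment builds a tiny GFM with a hypothetical sparsity pattern (disjoint column supports, so Φ(Z) has orthonormal columns) and verifies that the two expressions are indeed inverses of one another.

```r
## Sketch: verify Sigma and Sigma^{-1} in (2) on a small, hypothetical example.
set.seed(2)
p     <- 5
Z     <- cbind(c(1, 1, 1, 0, 0), c(0, 0, 0, 1, 1))   # sparsity configuration
PhiZ  <- apply(Z, 2, function(v) v / sqrt(sum(v)))   # Phi o Z with orthonormal columns
Delta <- diag(c(1.5, 0.8))
Psi   <- diag(runif(p, 0.1, 0.5))
D     <- diag(diag(Delta) / (1 + diag(Delta)))       # tau_j = delta_j / (1 + delta_j)
Sigma     <- sqrt(Psi) %*% (PhiZ %*% Delta %*% t(PhiZ) + diag(p)) %*% sqrt(Psi)
Sigma_inv <- solve(sqrt(Psi)) %*% (diag(p) - PhiZ %*% D %*% t(PhiZ)) %*% solve(sqrt(Psi))
max(abs(Sigma %*% Sigma_inv - diag(p)))              # numerically ~ 0
```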

2.2 Bayes Preliminary

The sparsity identification problem for the GFM consists of the estimation of the latent binary matrix and the set of unknown parameters, Θ = {Φ, Ψ, Δ}, given the p × n data matrix X. The distribution of X conditional on Z and Θ is then

$$p(X|Z,\Theta) \;\propto\; |\Psi|^{-n/2}\,\mathrm{etr}\!\Bigl(-\frac{S\Psi^{-1}}{2}\Bigr)\,|I-D|^{n/2}\,\mathrm{etr}\!\Bigl(\frac{\Psi^{-1/2}S\Psi^{-1/2}\,\Phi(Z)D\Phi(Z)^T}{2}\Bigr) \qquad (3)$$

where etr(A) = exp(trace(A)) for any square matrix A, and S is the sample sum-of-squares matrix S = XX^T with elements (S)_gh := s_gh. This representation of the likelihood can be derived from the matrix formula for Σ^{-1} shown in (2) and |Σ^{-1}| = |Ψ|^{-1}|I − D|. In (3), the loading values appear only in the last term and form the important statistic

$$\mathrm{trace}\bigl(\Psi^{-1/2}S\Psi^{-1/2}\,\Phi(Z)D\Phi(Z)^T\bigr) \;=\; \sum_{j=1}^{k}\tau_j\,\phi_j(z_j)^T\,\Psi^{-1/2}S\Psi^{-1/2}\,\phi_j(z_j)$$

where Φ(Z) := (φ_1(z_1), . . . , φ_k(z_k)). This quantity measures the retained variance of the projections {Φ(Z)^T Ψ^{-1/2} x_i}_{1≤i≤n}, which define the linear mapping of the n data points in R^p to the factor space R^k with the k orthogonal bases {φ_j(z_j)}_{1≤j≤k}. The weighting factors {τ_j}_{1≤j≤k} are applied when pooling the k projected variances.

The binary variates z_gj forming a configuration space Z are now treated as independent random variables which follow the prior distribution

$$p(Z|\zeta) = \prod_{g=1}^{p}\prod_{j=1}^{k} p(z_{gj}|\zeta_{gj}) = \prod_{g=1}^{p}\prod_{j=1}^{k} \frac{\exp(-z_{gj}\zeta_{gj}/2)}{1 + \exp(-\zeta_{gj}/2)} \qquad (4)$$

where ζ = {ζ_gj} stands for the set of p × k hyperparameters. This prior distribution regulates the degree of sparseness through the exponential loss factors exp(−z_gj ζ_gj/2), which impose a penalty proportional to exp(−ζ_gj) < 1 as more binary variates are switched on when ζ_gj > 0, and vice versa for ζ_gj < 0. Specification of the effective degree of sparseness is achieved by controlling the values of ζ at a proper level. Later, we will further place a lower-level prior distribution ζ ∼ p(ζ) on these hyperparameters to proceed with the automatic determination of ζ in the light of the given data. More specifically, we conduct the joint MAP estimation in which the set of parameters is augmented by Z, Θ and ζ. The inclusion of these additional 'many parameters' may seem to cause over-complexity of the modelling. However, as shown later, our modelling of the lowest-level hyperprior p(ζ) gives rise to a 'self-organization' of the p × k hyperparameters such that their posterior distributions concentrate on a larger or smaller value according to their relevance. The complete specification of p(ζ) is deferred until Section 6.

Although we could define a prior distribution p(Θ|Z) on the model parameters to reflect a priori knowledge where available, the design of such a prior should be discussed in a domain-specific context. In what follows, we employ the flat prior p(Θ|Z) ∝ constant in order to reveal the nature of our annealing more clearly.

2.3 Maximum A Posteriori and Related Methods

Our objective is to derive an efficient computation for seeking the mode of the augmented posterior distribution comprising the product of the likelihood p(X|Z, Θ) and the prior distributions p(Z|ζ), p(Θ|Z), and the additional prior p(ζ). For now, p(ζ) is omitted from the posterior, and we develop the arguments based on the reduced posterior p(Z, Θ|X, ζ) ∝ p(X, Z, Θ|ζ):

$$2\log p(X,Z,\Theta|\zeta) = 2\log p(\Theta|Z) - \sum_{g=1}^{p}\sum_{j=1}^{k} z_{gj}\zeta_{gj} - \sum_{g=1}^{p}\bigl(n\log\psi_g + s_{gg}\psi_g^{-1}\bigr) + \sum_{j=1}^{k}\Bigl(n\log(1-\tau_j) + \tau_j\,\phi_j(z_j)^T\Psi^{-1/2}S\Psi^{-1/2}\phi_j(z_j)\Bigr). \qquad (5)$$

The constant terms irrelevant to Z and Θ are all omitted. The criterion (5) is maximized with respect to Θ and Z for a given ζ. We rewrite the quadratic form in the last term as

$$\phi_j(z_j)^T\Psi^{-1/2}S\Psi^{-1/2}\phi_j(z_j) \;\rightarrow\; \phi_j^T S(z_j,\Psi)\,\phi_j \quad\text{with}\quad \bigl(S(z_j,\Psi)\bigr)_{gh} := z_{gj}z_{hj}\,s_{gh}\,\psi_g^{-1/2}\psi_h^{-1/2}.$$

Each φ_j(z_j) is broken into the elementwise product of φ_j and z_j, and z_j is then shifted into the scaled sample covariance matrix (proportional to 1/n), Ψ^{-1/2}SΨ^{-1/2}, so as to induce the sparsified matrix S(z_j, Ψ). This rearrangement is used for a concise representation of the optimization algorithm.
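In matrix form, S(z_j, Ψ) is simply the elementwise product of z_j z_j^T with the scaled matrix Ψ^{-1/2} S Ψ^{-1/2}. A small R illustration follows; all quantities below are hypothetical toy values.

```r
## Sketch: building S(z_j, Psi) as an elementwise (Hadamard) product.
set.seed(3)
p   <- 4
S   <- crossprod(matrix(rnorm(20 * p), 20, p))   # sample sum-of-squares matrix, p x p
psi <- runif(p, 0.2, 1)                          # diagonal of Psi
z_j <- c(1, 0, 1, 1)                             # one column of the binary matrix Z
S_scaled <- sweep(sweep(S, 1, sqrt(psi), "/"), 2, sqrt(psi), "/")  # Psi^{-1/2} S Psi^{-1/2}
S_zj     <- tcrossprod(z_j) * S_scaled           # (S(z_j,Psi))_{gh} = z_gj z_hj s_gh psi_g^{-1/2} psi_h^{-1/2}
```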

There is an obvious connection between traditional sparse PCA and the MAP framework described here. When Ψ = I and Δ = I are given, the likelihood (the last term) in (5) becomes the pooled variance of the projections, i.e. $\sum_{j=1}^{k}\phi_j^T S(z_j, I)\phi_j$, constructed from the k sparse loading vectors. This is the central statistic optimized in sparse PCA. Among existing sparse PCA algorithms, the differences come from the way the degree of sparseness is regulated. The direct sparse PCA of d'Aspremont et al. [2007] imposes an upper bound d > 0 on the cardinality of z_j (the number of non-zero elements), Card(z_j) < d. They derived a novel optimization algorithm with complexity $O(p^4\sqrt{\log(p)}/\epsilon)$ for an absolute accuracy ε, based on a relaxation of the original problem to semidefinite programming. Due to the high computational complexity, however, its applicability is limited to problems involving a small number of features. We can make an analogy between the MAP estimation and their approach: the cardinality constraints Card(z_j) < d, j ∈ {1, . . . , k}, can be regarded as an upper bound on the prior distribution, p(Z|ζ) < c for some positive value c.

Jolliffe et al. [2003] provided the SCoTLASS algorithm, which uses L1 regularization for the sparsification of the loading vectors; later, as an alternative to SCoTLASS, Zou et al. [2006] presented a more efficient and generalized algorithm, called SPCA, in which the shrinkage mechanism of the elastic net is incorporated. Setting a Laplace-like prior distribution on the loading vectors, e.g. p(φ_j(z_j)|z_j) ∝ exp(−β_j ||φ_j(z_j)||_1) with a scaling parameter β_j > 0, one obtains a counterpart of the L1-based penalization.

While all these methods originate from non-probabilistic arguments, the methodological foundation of our approach rests on a Bayesian principle, from which we gain considerable advantages. For example, one fundamental issue in sparse PCA is how to determine an appropriate dimension of the latent factor space in the light of the given data. Although some preceding studies provided heuristics for the choice of the factor dimension, our method has a built-in mechanism that prunes redundant factors automatically when we start with an excess number of factor variables. Moreover, we have much more flexibility in determining the effective degree of sparseness, which is simply translated into the inference of the additional 'model' parameter ζ. We can exploit well-established Bayesian statistics to handle these issues because our approach is fully model-based.


3 Annealing on Posterior Entropy

3.1 Basic Principle

The MAP estimation based on (5) involves many discrete variables z_gj, which are the cause of difficulty in the optimization. As a simple method, we could apply greedy hill-climbing as follows: one starts with an initial value of Θ, and the conditional maximization of (5) is applied with respect to a single binary element z_gj while all the others are fixed at their current values. After running this partial optimization over g ∈ {1, . . . , p} and j ∈ {1, . . . , k}, we proceed to the improvement of Θ under the new configuration Z. Repeating these steps until no further improvement can be found, the search reaches a local optimum of the augmented posterior distribution. Although such a naive algorithm saves computational effort to some extent, it is often trapped at improper local solutions because many local optima exist.

In order to ease this trapping in local optima, we construct a regularizer based on annealing of the posterior entropy, which is dynamic and artificial. Before proceeding to the body of the methodology, we begin with the prototype inference. Returning to the posterior distribution (5), let us take the marginal of p(X, Z, Θ|ζ) with respect to all the sparsity configurations Z. It then follows that

$$\log p(X,\Theta|\zeta) = \log\Bigl(\sum_{Z\in\mathcal{Z}} p(X,Z,\Theta|\zeta)\Bigr) = \sum_{Z\in\mathcal{Z}}\omega(Z)\log p(X,Z,\Theta|\zeta) - \sum_{Z\in\mathcal{Z}}\omega(Z)\log\omega(Z) \qquad (6)$$

where the posterior probabilities of the binary variates are denoted by

$$\omega(Z) = \prod_{j=1}^{k}\omega(z_j) := \prod_{j=1}^{k} p(z_j|X,\Theta,\zeta). \qquad (7)$$

In (6), the sums are expanded over all possible configurations of z_gj ∈ {0, 1}. The first term in (6) is the expectation of the logarithmic joint distribution of the augmented data, X, Θ and Z, with respect to the posterior configuration probabilities ω(Z). The second term collects the k entropies of the posterior distributions p(z_j|X, Θ, ζ), j ∈ {1, . . . , k}. We have presented the alternative representation of p(X, Θ|ζ) in the second line of (6) as a preliminary to our annealing algorithm. Note that in (7), the posterior distribution of the binary random matrix is decomposed into the product of the distributions corresponding to z_1, . . . , z_k. The independence of the k posterior distributions is proved by noting the decomposability of the joint distribution, $p(X,Z,\Theta|\zeta) = \prod_{j=1}^{k} p(X,\Theta|z_j,\zeta)\,p(z_j|\zeta)$, which is evident from its explicit form (5).

The concept of our annealing is inspired by (6). The objective function that we consider takes the form

$$G(\Theta,\omega;T) = \sum_{Z\in\mathcal{Z}}\omega(Z)\log p(X,Z,\Theta|\zeta) \;-\; T\sum_{Z\in\mathcal{Z}}\omega(Z)\log\omega(Z) \qquad (8)$$

$$\text{subject to}\quad \sum_{Z\in\mathcal{Z}}\omega(Z) = 1 \quad\text{and}\quad \omega(Z)\ge 0 \;\text{ for all } Z\in\mathcal{Z}.$$

This criterion generalizes (6) only through the inclusion of the meta-parameter T ≥ 0, termed temperature, which penalizes the entropy of ω(Z). The magnitude of the temperature controls the dominance of the entropy of ω(Z) over the other term. The optimization is applied simultaneously with respect to the parameter Θ and the 'unknown distribution' ω(Z), rather than Z itself (ω(Z) is no longer consistent with the exact posterior distribution, but with the 'tempered posterior distribution', as shown below).

The temperature T varies during the optimization. We prescribe a cooling schedule under which T decays monotonically to zero from some initial value. In the limit T → 0, the entropy terms vanish from G(Θ, ω; T), so the main body of the criterion (8) is only the first term. The entropy, as a function of ω(Z), achieves its maximum when all configurations receive equal probability ω(Z) = 1/|Z|, and becomes correspondingly smaller as ω(Z) becomes more unequal. At the beginning of the optimization, a high temperature forces all configuration probabilities to be close to 1/|Z|, and the probabilities become more unequal as T approaches zero. As shown below, it can be proved that the configuration probability achieving the optimum of G(Θ, ω; T) approaches the probability mass concentrated on the global optimum of the original MAP criterion (5) in the zero-temperature limit T → 0.

In order to clarify the solution path of the sparsity configuration probabilities during the annealing, we state the following proposition:


Proposition. For any given parameters and temperature, the maximizer of (8) with respect to ω(Z) coincides with the tempered posterior distribution up to the normalizing constant:

$$\omega(Z;T) \;\propto\; p(Z|X,\Theta,\zeta)^{1/T}. \qquad (9)$$

Proof 1. See the Appendix.

As the temperature is raised, the configuration probabilities in (9) become more equal. Conversely, reducing the temperature yields spikier probabilities. In particular, when T = 1 the solution (9) is the exact posterior distribution of the sparsity configurations. In the zero-temperature limit T → 0, all the configuration probability mass is concentrated at the configuration Ẑ achieving the global maximum of the posterior distribution p(Z|X, Θ, ζ) ∝ p(X, Z, Θ|ζ) for any given Θ.

The outline of the annealing procedure is as follows. Starting from a high temperature, we carry out iterative improvement of (8) alternately over Θ and ω(Z) while decreasing the temperature at a sufficiently slow rate. Ideally, the induced probability mass captures the global optimum:

$$\lim_{T\downarrow 0}\omega(Z;T) = \begin{cases} 1 & \text{if } Z = \hat{Z} := \arg\max_{Z'} p(X,Z',\Theta|\zeta) \\ 0 & \text{otherwise.} \end{cases}$$

We thus obtain an alternative representation of the MAP estimator of the sparsity configurations, i.e. the degenerate configuration probabilities a posteriori, induced via the annealing.
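The tempering in (9) is easy to visualize on a toy discrete distribution. The sketch below uses a hypothetical four-point posterior (the probabilities are arbitrary placeholders) and shows how the tempered probabilities flatten at high temperature and concentrate on the mode as T → 0.

```r
## Toy illustration of (9): omega(Z;T) proportional to p(Z|X,Theta,zeta)^(1/T).
post   <- c(0.40, 0.30, 0.20, 0.10)                 # hypothetical posterior over 4 configurations
temper <- function(p, T) { w <- p^(1 / T); w / sum(w) }
round(temper(post, T = 10),   3)   # high temperature: nearly uniform
round(temper(post, T = 1),    3)   # T = 1: the exact posterior
round(temper(post, T = 0.05), 3)   # T -> 0: mass concentrates on the posterior mode
```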

The purpose of the annealing operation is to realize a gradual move of the successively generated solutions for Θ and ω(Z; T), and to avoid trapping at local modes by smoothing away trivial stationary points of the objective function through the effect of tempering. Intuitively, we should prescribe a cooling schedule that decays slowly enough that the sequence of temperatures does not reach zero while the solution path is still far from the global optimum. This aspect becomes clearer when we note the analogy to SA:

$$\pi(\Theta,\omega;T) \;\propto\; \exp\Bigl(\frac{1}{T}G(\Theta,\omega;T)\Bigr) = \exp\Bigl(\frac{1}{T}\sum_{Z}\omega(Z)\log p(X,Z,\Theta|\zeta)\Bigr)\exp\Bigl(-\sum_{Z}\omega(Z)\log\omega(Z)\Bigr) \;\approx\; \exp\Bigl(\frac{1}{T}\sum_{Z}\omega(Z)\log p(X,Z,\Theta|\zeta)\Bigr). \qquad (10)$$

The target distribution π(Θ, ω; T) is defined by the Boltzmann factor in which the energy of the configurations of Θ and {ω | ω(Z) ∈ [0, 1], Z ∈ Z} is formed by the negative of the expected joint distribution, $-\sum_{Z}\omega(Z)\log p(X,Z,\Theta|\zeta)$, together with the entropy of ω(Z). Once the temperature is lowered sufficiently, the entropy of ω(Z) can be ignored, and we obtain the main body of the Boltzmann distribution shown in the last expression of (10). Note that, rather than over Z itself, the tempered distribution is defined over the continuum of configuration probabilities ω(Z) ∈ [0, 1] for each 'atom' Z.

SA aims to capture the global optimum of the negative energy by drawing particles sequentially while gradually reducing the temperature. At the limiting temperature, where the target distribution concentrates on the global optimum, the resulting draw becomes the probability mass concentrated on the global optimum. This operation is essentially equivalent to the maximization of our criterion (5). Through this bridge between SA and our annealing, we may gain insight into the convergence properties of our annealing, although as yet we do not know how to carry the theory over. Nevertheless, the obvious connection between the two approaches suggests that, for the most part, the convergence properties of SA, e.g. [Geman and Geman, 1984], would be preserved in our annealing algorithm.

3.2 Factorization of Configuration Probability

Note that we do not yet have a practical procedure, since the tempered posterior distribution (9), defined over the large configuration space Z, is still infeasible to compute. In order to realize the sparsification while preserving the prescribed annealing mechanism, we impose independence on the configuration probabilities:

$$\omega(Z) = \prod_{g=1}^{p}\prod_{j=1}^{k}\omega(z_{gj}) := \prod_{g=1}^{p}\prod_{j=1}^{k}\omega_{gj}^{z_{gj}}(1-\omega_{gj})^{1-z_{gj}}.$$


The configuration probability is broken into the product of the p × k Bernoulli distributions {ω(z_gj)}_gj. Under this relaxation, the objective function (8) simplifies to

$$G(\Theta,\omega;T) = \sum_{Z\in\mathcal{Z}}\prod_{g=1}^{p}\prod_{j=1}^{k}\omega(z_{gj})\,\log p(X,Z,\Theta|\zeta) \;-\; T\sum_{g=1}^{p}\sum_{j=1}^{k}\sum_{z_{gj}=0}^{1}\omega(z_{gj})\log\omega(z_{gj}). \qquad (11)$$

In the second term, the entropy is expanded over each marginal distribution ω(z_gj) individually. At the limiting temperature T → 0, the entropy terms disappear. Thus, the maximum of this modified criterion is still achieved by assigning the p × k probability masses ω(z_gj) concentrated at the configuration of Z yielding max_Z p(X, Z, Θ|ζ) for any given Θ. For most statistical models, including the GFM, the optimization of (11) is rather easy to handle, as the independence relaxation greatly simplifies the computation of the conditional expectation in the first term of (11).

The solution achieving the maximum of (11) can be obtained in a sequential manner that generates the optimizer successively with respect to a single component ω(z_gj) while conditioning on the others. Specifically, we have the solution for each probability ω_gj:

$$\omega_{gj} := \omega_{gj}(T) \;\propto\; \exp\Bigl(\frac{1}{T}\sum_{Z_{(h,l)\in Q}}\prod_{h\neq g}\prod_{l\neq j}\omega(z_{hl})\,\log p\bigl(z_{gj}=1\,\big|\,X,\{z_{hl}\}_{(h,l)\in Q},\Theta,\zeta\bigr)\Bigr) \qquad (12)$$

where Q is the set of index pairs (h, l) with (h, l) ≠ (g, j), and {z_hl}_{(h,l)∈Q} stands for the set of binary variates z_hl other than z_gj. In (12), the logarithm of the full conditional distribution of z_gj is summed over the possible configurations of the other binary variates, Z_{(h,l)∈Q}, weighted by $\prod_{h\neq g}\prod_{l\neq j}\omega(z_{hl})$, the product of the other configuration probabilities corresponding to (h, l) ∈ Q. This negative energy, together with the temperature, forms the Gibbsian form (12).

Again, starting with ω_gj(T) ≈ 1/2 when a large value is given to T, (12) gradually concentrates to a point mass as T decays slowly to zero:

$$\hat{z}_{gj} := \lim_{T\downarrow 0}\omega_{gj}(T) = \begin{cases} 1 & \text{if } \displaystyle\sum_{Z_{(h,l)\in Q}}\prod_{h\neq g}\prod_{l\neq j}\omega(z_{hl})\,\log\frac{p(z_{gj}=1, X, \{z_{hl}\}_{(h,l)\in Q}, \Theta, \zeta)}{p(z_{gj}=0, X, \{z_{hl}\}_{(h,l)\in Q}, \Theta, \zeta)} > 0 \\[2mm] 0 & \text{otherwise.} \end{cases}$$

The estimator decides on the presence or absence of z_gj according to the sign of the expected log-ratio between p(z_gj = m | X, {z_hl}_{(h,l)∈Q}, Θ, ζ), m = 0, 1. Applying this rule successively to the other configuration probabilities, one obtains the estimates of the binaries.

4 Sparse Learning of GFM

4.1 Configuration Probability of Sparsity

We now return to the GFM and illustrate the computation of the configuration probabilities of sparsity. Once we are given a value of Θ and the factorized configuration probabilities, the task is to find the maximizer of (11) over each ω_gj. Let us write down the first term of (11), the conditional expectation of the logarithmic joint distribution (5) of X, Z and Θ:

$$2\sum_{Z\in\mathcal{Z}}\prod_{g=1}^{p}\prod_{j=1}^{k}\omega(z_{gj})\,\log p(X,Z,\Theta|\zeta) = 2\,E_\omega\bigl[\log p(\Theta|Z)\bigr] - \sum_{g=1}^{p}\sum_{j=1}^{k}\omega_{gj}\zeta_{gj} - \sum_{g=1}^{p}\bigl(n\log\psi_g + s_{gg}\psi_g^{-1}\bigr) + \sum_{j=1}^{k}\Bigl(n\log(1-\tau_j) + \tau_j\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j\Bigr). \qquad (13)$$

Given the non-informative prior p(Θ|Z) ∝ constant, the conditional expectation involves only the prior distribution p(Z|ζ) and the quadratic form shown respectively in the second and last terms of (13). The latter corresponds to the expectation of the sparsified matrix S(z_j, Ψ):

$$E_\omega\bigl[S(z_j,\Psi)\bigr] = \Omega_j \circ \bigl(\Psi^{-1/2}S\Psi^{-1/2}\bigr) \quad\text{with}\quad (\Omega_j)_{gh} = \begin{cases} \omega_{gj} = E_\omega[z_{gj}^2] & \text{if } g = h \\ \omega_{gj}\omega_{hj} = E_\omega[z_{gj}z_{hj}] & \text{otherwise.} \end{cases}$$


Here Ω_j stands for the conditional expectation of z_j z_j^T.

We introduce the notation Ψ^{-1/2}SΨ^{-1/2} = (s_1(Ψ), . . . , s_p(Ψ)) to represent the p columns of the scaled sample sum-of-squares matrix. Furthermore, writing Ω_j = (ω_1, . . . , ω_p), we define the vector of partial derivatives of length p, ∂ω_g/∂ω_gj, with elements (∂ω_g/∂ω_gj)_g = 1 and (∂ω_g/∂ω_gj)_h = ω_hj for h ≠ g. Then the partial derivative of (11) with respect to ω_gj, conditional on Θ and the other configuration probabilities, yields the gradient equation

$$H_{gj}(\zeta_{gj}) := \tau_j\,\phi_{gj}\Bigl(\phi_j \circ \frac{\partial \omega_g}{\partial \omega_{gj}}\Bigr)^{T} s_g(\Psi) \;-\; \zeta_{gj} \;=\; T\log\Bigl(\frac{\omega_{gj}}{1-\omega_{gj}}\Bigr).$$

From this equation, the conditional maximum with respect to ω_gj is derived as

$$\omega_{gj}(T,\zeta_{gj}) = \frac{\exp\bigl(H_{gj}(\zeta_{gj})/T\bigr)}{1 + \exp\bigl(H_{gj}(\zeta_{gj})/T\bigr)} \quad\text{and}\quad 1 - \omega_{gj}(T,\zeta_{gj}) = \frac{1}{1 + \exp\bigl(H_{gj}(\zeta_{gj})/T\bigr)} \qquad (14)$$

for each g ∈ {1, . . . , p} and j ∈ {1, . . . , k}. The optimizer is the sigmoid of the tempered negative energy H_gj(ζ_gj)/T, and varies in response to changes in T. As the temperature is raised, the solution approaches uniform probabilities; conversely, the solution tends towards one or zero as T decreases. Specifically, the limiting temperatures give

$$\lim_{T\downarrow 0}\omega_{gj}(T,\zeta_{gj}) = \begin{cases} 1 & H_{gj}(\zeta_{gj}) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad\text{and}\qquad \lim_{T\uparrow\infty}\omega_{gj}(T,\zeta_{gj}) = \frac{1}{2}.$$

T → 0 induces a probability mass equal to one for each (g, j) whose quantity H_gj(ζ_gj) is greater than zero. It then produces a non-zero factor loading for the (g, j)th element, whereas those satisfying H_gj(ζ_gj) ≤ 0 become zero.
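The update (14) is a one-line computation. The following R sketch shows the sigmoid form and its limiting behaviour; the values of the negative energy H_gj are arbitrary placeholders.

```r
## Sketch of (14): omega_gj is the sigmoid of H_gj(zeta_gj)/T.
omega_update <- function(H, T) plogis(H / T)   # exp(H/T) / (1 + exp(H/T))
H <- c(-2, -0.1, 0.1, 2)                       # placeholder negative energies
omega_update(H, T = 5)      # high temperature: probabilities close to 1/2
omega_update(H, T = 0.01)   # near-zero temperature: hard 0/1 decisions by the sign of H
```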

The sign of H_gj(ζ_gj), particularly near the limit T ≈ 0, plays the essential role in the decision about presence or absence of the factor loadings, in conjunction with the hyperparameter ζ_gj. To obtain a physical interpretation of (14), return to the primary definition of the negative energy H_gj(ζ_gj):

$$H_{gj}(\zeta_{gj}) = \frac{\partial}{\partial \omega_{gj}}\,\tau_j\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j - \zeta_{gj} = -\tau_j\,\frac{\partial}{\partial \omega_{gj}}\sum_{i=1}^{n} E_\omega\Bigl[\bigl\|\Psi^{-1/2}x_i - \phi_j(z_j)\phi_j(z_j)^T\Psi^{-1/2}x_i\bigr\|^2\Bigr] - \zeta_{gj}.$$

The last expression, which is derived using the norm constraint on φ_j(z_j), contains in the bracket the reconstruction error of the data compressor φ_j(z_j)^T Ψ^{-1/2} x_i. This indicates that H_gj(ζ_gj) measures the 'sensitivity' of the averaged reconstruction error to variation of the inclusion probability ω_gj of the (g, j)th factor loading. A factor loading yielding better data compression raises the negative energy to a higher level. The value of the hyperparameter ζ_gj imposes a threshold on the gradient of the expected reconstruction error.

4.2 MAP Algorithm

Once a set of configuration probabilities and a temperature are given, we in turn proceed with the refinement of the model parameters Θ so as to improve (13). This operation is equivalent to improving (11), because only the first term of (11) depends on Θ. We alternate the two procedures, consisting of maximization steps over ω(Z) and Θ respectively, while reducing the temperature. The criterion (11) keeps being refined until convergence, at which point the temperature has reached zero. Our alternating optimization is summarized as follows:

1: Set a cooling schedule T = {T_1, . . . , T_d} of length d with T_d = 0;
2: Set ζ;
3: Initialize Θ;
4: Initialize ω(Z);
5: i ← 0;
6: while ({convergence is not attained} ∧ {i ≠ d})
7: i ← i + 1;
8: Compute the configuration probabilities ω_gj(T_i, ζ_gj) according to (14);
9: Refine (13) with respect to Φ under full conditioning;
10: Refine (13) with respect to Δ under full conditioning;
11: Refine (13) with respect to Ψ under full conditioning;
12: end while

To complete the description of the optimization algorithm, we need to give the procedures for the refinement of Θ, which make up lines 9 to 11 of the quasi-code.
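For orientation only, the quasi-code above can be transcribed into an R skeleton such as the following. Every helper (the initializer, the four conditional updates, and the convergence test) is passed in as a function argument; these are hypothetical placeholders standing for the updates detailed in the remainder of this section, not part of the released anneals_gfm() code.

```r
## Structural sketch of the MAP annealing loop (helpers are caller-supplied).
anneal_map_sketch <- function(X, zeta, schedule, init_theta, update_omega,
                              refine_Phi, refine_Delta, refine_Psi, converged) {
  theta <- init_theta(X)                        # line 3: initialize Theta = {Phi, Delta, Psi}
  omega <- matrix(0.5, nrow(X), ncol(zeta))     # line 4: non-informative start, p x k
  for (i in seq_along(schedule)) {              # line 6: loop over T_1, ..., T_d (T_d = 0)
    Ti    <- schedule[i]
    omega <- update_omega(X, theta, zeta, Ti)   # line 8: configuration probabilities via (14)
    theta <- refine_Phi(X, theta, omega)        # line 9
    theta <- refine_Delta(X, theta, omega)      # line 10
    theta <- refine_Psi(X, theta, omega)        # line 11
    if (converged(theta, omega)) break
  }
  list(theta = theta, omega = omega)
}
```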

Conditional Optimization over Φ

In (13), only the quadratic forms in the last term involve the k loading vectors in Φ. The problem is then to maximize the pooled variance of the projections while reflecting the required orthogonality condition Φ(Z)^T Φ(Z) = I. The original condition is not directly available, because the values of the binary matrix Z are still being explored, so we must rearrange it in an alternative way. Here we adopt the simplified alternative:

$$\underset{\Phi}{\text{maximize}} \;\; \sum_{j=1}^{k}\tau_j\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j \quad\text{subject to}\quad \|\phi_j\|^2 = 1 \;\text{ for } j\in\{1,\dots,k\} \;\text{ and }\; \langle\phi_m,\phi_j\rangle = 0 \;\text{ for } m\neq j. \qquad (15)$$

The rearranged constraints in this problem arise from removing the binary matrix from the original condition: (Φ ◦ Z)^T(Φ ◦ Z) = I → Φ^T Φ = I. Despite this compromise, solutions of the rearranged problem (15) recover the original condition automatically once the annealing algorithm converges. As shown in the statement below and detailed in the Appendix, this claim can be proved formally.

Before proceeding to the statement, note that in the limit T → 0 the configuration probabilities ω_gj(T, ζ_gj) become the probability masses ẑ_gj = ω_gj(0, ζ_gj), and correspondingly induce sparsity in the conditional expectation of the scaled covariance matrix: E_ω[S(z_j, Ψ)] = S(ẑ_j, Ψ). We then have the following:

Proposition. Let A_j = {g | g ∈ {1, . . . , p}, ẑ_gj = 1} be the subset of the p indices collecting all the variables with ẑ_gj = 1. The set A_j also defines S_{A_j}(z_j, Ψ), the |A_j| × |A_j| submatrix of S(z_j, Ψ) in which all pairs of rows and columns belonging to A_j are aggregated. Furthermore, for a given A_j and its complementary set A_j^c ({1, . . . , p} = A_j + A_j^c), define the corresponding partition of the loading vector φ_j into φ_{j,A_j} and φ_{j,A_j^c}. Then any solution of (15) satisfies φ_{j,A_j^c} = 0.

Proof 2. See the Appendix.

This statement ensures that, even with the relaxation of the orthogonality conditions in (15), the resulting loading vectors recover the original condition (Φ ◦ Ẑ)^T(Φ ◦ Ẑ) = I, because the optimal loading values for (15) corresponding to ẑ_gj = ω_gj(0, ζ_gj) = 0 are always zero, and hence Φ = Φ ◦ Ẑ holds at the limiting zero temperature.

The problem (15) looks almost like one to which the ordinary eigenvalue decomposition is applicable. However, some issues must be overcome. The intractability in solving (15) arises from the fact that the k symmetric matrices in the objective function, E_ω[S(z_j, Ψ)], differ across j ∈ {1, . . . , k}. Thus, the ordinary eigenvalue decomposition is no longer available for finding the optimal orthogonal bases.

To address this difficulty, after breaking (15) into sub-problems, we carry out the optimization sequentially:

$$\underset{\phi_j}{\text{maximize}} \;\; \phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j \quad\text{subject to}\quad \|\phi_j\|^2 = 1 \;\text{ and }\; \langle\phi_m,\phi_j\rangle = 0 \;\text{ for } m\neq j. \qquad (16)$$

The optimization is taken only over the jth coordinate while the others are fixed. Alternating the conditional optimization over j ∈ {1, . . . , k} until convergence, we obtain a solution of (15). In the light of this relaxation, we have the following procedure:


1: Set Φ_{(−j)} = {φ_m}_{m≠j} ∈ R^{p×(k−1)}, the matrix of the k − 1 loading vectors other than φ_j;
2: Compute the projection matrix N_j = (I − Φ_{(−j)}Φ_{(−j)}^T);
3: Find the eigenvector ϕ_j corresponding to the largest eigenvalue ρ of

$$N_j\,E_\omega\bigl[S(z_j,\Psi)\bigr]\,N_j\,\varphi_j - \rho\,\varphi_j = 0; \qquad (17)$$

4: Set φ_j ← N_j ϕ_j / ‖N_j ϕ_j‖.

This procedure solves (16) indirectly by replacing it with the following alternative:

$$\underset{\varphi_j}{\text{maximize}} \;\; \varphi_j^T N_j\,E_\omega\bigl[S(z_j,\Psi)\bigr]\,N_j\,\varphi_j \quad\text{subject to}\quad \|\varphi_j\|^2 = \text{constant}. \qquad (18)$$

The eigenvalue equation in step 3 arises from this converted problem. Note that the p × p matrix N_j projects onto the orthogonal complement of the currently obtained k − 1 loading vectors {φ_m}_{m≠j}, so we have N_j φ_m = 0 for any m ≠ j. The symmetric matrix in (18) is the projection of E_ω[S(z_j, Ψ)] onto the orthogonal complement of {φ_m}_{m≠j}. Therefore, the non-trivial eigenvectors of (18) also lie in that orthogonal complement. As shown in step 4, the final estimate is given by φ_j ∝ N_j ϕ_j, which always lies in the orthogonal complement of {φ_m}_{m≠j}, thereby yielding the orthogonality of the resulting k loading vectors. In the Appendix it is verified that this procedure does capture the solution of (16).

In order to complete line 9 of the MAP algorithm, it is not necessary to repeat this procedure until 'convergence'; a single pass across j ∈ {1, . . . , k} is more natural and is enough. Returning to the overall operations of the MAP algorithm, note that the method entails performing the eigenvalue decomposition many times. When p is fairly large, e.g. of order 10^3 or more, which is common in transcriptome analyses of gene expression, this algorithm may no longer be effective due to the increased computational cost. In Section 4.3 we discuss how to deal with high-dimensional data.
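Steps 1-4 for a single coordinate j can be written compactly. The sketch below assumes ES is the p × p matrix E_ω[S(z_j, Ψ)] and Phi_rest is the p × (k−1) matrix of the other loading vectors with orthonormal columns; both names are illustrative, not from the released code.

```r
## Sketch of the projected eigen-step (17)/(18) for one loading vector phi_j.
update_phi_j <- function(ES, Phi_rest) {
  p  <- nrow(ES)
  Nj <- diag(p) - Phi_rest %*% t(Phi_rest)    # step 2: projector onto the complement of the other loadings
  M  <- Nj %*% ES %*% Nj                      # symmetric matrix of problem (18)
  varphi <- eigen((M + t(M)) / 2, symmetric = TRUE)$vectors[, 1]  # step 3: leading eigenvector
  v <- drop(Nj %*% varphi)
  v / sqrt(sum(v^2))                          # step 4: phi_j <- N_j varphi / ||N_j varphi||
}
```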

Conditional Optimization over Δ

In (13), the k singular values of the loading components, δ_j^{1/2}, are involved in

$$\sum_{j=1}^{k} n\log\Bigl(\frac{1}{1+\delta_j}\Bigr) + \sum_{j=1}^{k}\frac{\delta_j}{1+\delta_j}\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j. \qquad (19)$$

The update formula for line 10 of the MAP algorithm is obtained by setting the derivative of (19) with respect to each δ_j to zero:

$$-\frac{n}{1+\delta_j} + \frac{1}{(1+\delta_j)^2}\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j = 0 \quad\Rightarrow\quad \delta_j = \frac{1}{n}\,\phi_j^T E_\omega\bigl[S(z_j,\Psi)\bigr]\phi_j - 1.$$

In words, the estimated value of δ_j is the projection of the expected sample covariance matrix onto the jth factor coordinate, minus one. Line 10 consists of computing this formula repeatedly over j ∈ {1, . . . , k}.

When the resulting value of δ_j is negative, the estimate is shifted to the boundary δ_j^{1/2} = 0. This thresholding operation can be verified from the shape of the objective function (19). Whenever this occurs, the jth factor is removed from the model together with its associated parameters. This intrinsic factor-pruning mechanism removes redundant factor dimensions cooperatively with the exploration of zero columns in the binary matrix Z, i.e. columns whose elements are all zero.
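The line-10 update is a scalar computation per factor. A sketch in R, where ES_list is a hypothetical list holding the k matrices E_ω[S(z_j, Ψ)] and Phi the current p × k loading matrix:

```r
## Sketch of the delta_j update with pruning at the boundary delta_j = 0.
update_delta <- function(ES_list, Phi, n) {
  sapply(seq_along(ES_list), function(j) {
    d <- drop(crossprod(Phi[, j], ES_list[[j]] %*% Phi[, j])) / n - 1
    max(d, 0)   # negative solutions are shifted to the boundary (factor pruned)
  })
}
```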

Conditional Optimization over Ψ

In (13), the diagonal noise covariance matrix Ψ appears in the following three terms:

$$-n\log|\Psi| \;-\; \mathrm{trace}(S\Psi^{-1}) \;+\; \sum_{j=1}^{k}\tau_j\,\mathrm{trace}\Bigl(\phi_j\phi_j^T\,\Psi^{-1/2}\bigl(\Omega_j\circ S\bigr)\Psi^{-1/2}\Bigr).$$

Differentiating this with respect to Ψ^{-1/2} yields the gradient equation

$$n\,\mathrm{diag}^{-1}(\Psi^{1/2}) - \mathrm{diag}^{-1}(S\Psi^{-1/2}) + \sum_{j=1}^{k}\tau_j\,\mathrm{diag}^{-1}\Bigl(\phi_j\phi_j^T\,\Psi^{-1/2}\bigl(\Omega_j\circ S\bigr)\Bigr) = 0, \qquad (20)$$

where diag^{-1}(A) denotes the vector of the diagonal elements of A. This equation is nonlinear in Ψ, and solving it directly would be somewhat difficult. While some numerical techniques, e.g. the Newton-Raphson method, are available for finding solutions, we instead exploit a simplified approach. Multiplying (20) by Ψ^{1/2} and rearranging, we have

$$\mathrm{diag}^{-1}(\Psi) = \frac{1}{n}\,\mathrm{diag}^{-1}\Bigl(\Bigl\{I - \sum_{j=1}^{k}\tau_j\,(\phi_j\phi_j^T)\circ\bigl(\Psi^{-1/2}\Omega_j\Psi^{1/2}\bigr)\Bigr\}S\Bigr). \qquad (21)$$

Whenever convergence is attained by successive computation of (21), we have a valid estimator in the sense that the resulting value is ensured to be a stationary point.

4.3 GFM for Large Size Problems

Across a variety of scientific fields, latent factor models, including classical PCA, have provided enormous benefits for 'massive data' processing. In the foregoing procedure, the operations of the iterative eigenvalue decomposition for solving (16) raise the computational cost as the number of features, p, increases. Our empirical evidence indicates that its usability is limited to around p ≈ 500. The implementation written in R (our code anneals_gfm() is accessible from the supporting information website) takes more than ten hours of real time to process p = 1500, while a few minutes suffice for a data set of dimension p = 500.

We therefore seek a solution based on a stochastic search operation that places Monte Carlo sampling of the binary matrix Z before the eigenvalue decomposition. For each j ∈ {1, . . . , k}, the optimization (16) over φ_j is replaced as follows:

1. Draw a set of binary values z_gj, g = 1, . . . , p, according to the current configuration probabilities ω_gj(T_i, ζ_gj);
2. Form the set of active features, A_j = {g | g ∈ {1, . . . , p}, z_gj = 1};
3. Solve the optimization conditional on A_j,

$$\underset{\phi_{j,A_j}}{\text{maximize}} \;\; \phi_{j,A_j}^T\, S_{A_j}(z_j,\Psi)\,\phi_{j,A_j} \quad\text{subject to}\quad \|\phi_{j,A_j}\|^2 = 1 \;\text{ and }\; \langle\phi_{m,A_j},\phi_{j,A_j}\rangle = 0 \;\text{ for } m\neq j;$$

4. Replace the elements of the loading vector corresponding to A_j by the obtained φ_{j,A_j}, while the other elements are set to zero.

All the indices with z_gj = 1, g = 1, . . . , p, are aggregated into the active feature set A_j. The symmetric matrix S_{A_j}(z_j, Ψ) stands for the |A_j| × |A_j| submatrix of S(z_j, Ψ) whose elements share the indices in A_j. The orthogonality conditions are likewise defined only on the sub-elements of the loading matrix corresponding to A_j. Shortly after the start of the annealing run, the configuration probabilities ω_gj(T_i, ζ_gj) stay close to 1/2 due to the regulation by high temperatures, so approximately half of the binary variates form the active feature set: for instance, if p = 1000, the optimization is processed by an eigenvalue decomposition of size p ≈ 500. As the annealing process evolves, part of the binary variates gradually settles down to zero. As the loading matrix is led to a higher degree of sparseness by the reduction of temperature, ideally we need to carry out the eigenvalue computations only on a small subset of the p features.

Following this replacement, we also need to alter the update operation for the configuration probabilities. The foregoing procedure updates them across all indices at each step. When using the stochastic search algorithm, on the other hand, we should update ω_gj(T, ζ_gj) only for the current indices g ∈ A_j, keeping the rest, corresponding to the complementary set of A_j, at their previously obtained values.

While the stochastic search algorithm reduces the computational burden to some extent, it remains impractical for exceedingly large p when the data possess a low degree of sparseness. Our numerical experience indicates that the stochastic search algorithm is effective at most when the number of feature variables ranges up to several thousand. Of course, some heuristics could be applied to overcome this limitation: when the degree of sparseness is known beforehand to be high, and its structure is known to some extent, we could run the annealing while retaining such knowledge in the sparsity learning. However, we conclude here that the computational issues for massive datasets remain unresolved.
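A minimal sketch of the stochastic step for a single factor j is given below; it samples the binary column, restricts the scaled matrix to the active set, and solves the small eigenproblem there. For brevity the orthogonality constraint against the other loading columns is omitted; S_scaled and omega_j are hypothetical inputs standing for Ψ^{-1/2}SΨ^{-1/2} and the current column of configuration probabilities.

```r
## Sketch of the stochastic search step for one factor j (orthogonality omitted).
stochastic_phi_j <- function(S_scaled, omega_j) {
  p   <- length(omega_j)
  z   <- rbinom(p, 1, omega_j)            # step 1: sample the binary column z_.j
  A   <- which(z == 1)                    # step 2: active feature set A_j
  phi <- numeric(p)                       # step 4: inactive elements stay at zero
  if (length(A) > 0) {
    sub    <- S_scaled[A, A, drop = FALSE]                 # step 3: |A_j| x |A_j| submatrix
    phi[A] <- eigen(sub, symmetric = TRUE)$vectors[, 1]    # leading eigenvector on A_j
  }
  phi
}
```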


Figure 1: Change in the estimation equation (22) for the hyperparameter ζ_gj over a range of μ and γ (horizontal axis: configuration probability ω_gj; vertical axis: hyperparameter ζ_gj; curves for (μ, γ) = (15, 10), (15, 20), (15, 30), (1, 100), (10, 100), (20, 100)).

5 Effective Degree of Sparseness

One of the most important issues in sparsity estimation is how to specify an effective degree of sparseness. The objective is now to set proper values for the hyperparameters ζ_gj. In the case of the GFM, as already seen in Section 4.1, the role of ζ_gj is to impose a threshold in the energy function H_gj(ζ_gj); it thus directly affects the sign of the energy function, which is the final decision maker about the presence or absence of a loading value at zero temperature. We address this issue from the Bayesian perspective.

The concept of our strategy partly follows the notion of hierarchical Bayes inference, but is not standard. Our aim is to estimate the hyperparameters ζ together with Θ and Z by maximizing the value of the augmented distribution p(X, Z, Θ|ζ)p(ζ), with a lower-level Gaussian prior ζ_gj ∼ p(ζ_gj) = N(ζ_gj|μ, γ) with mean μ and variance γ common to all ζ_gj. To realize adaptive sparsity control, we include the optimization step for ζ in the annealing process. The criterion to be maximized is then

$$E_\omega\bigl[\log p(\zeta)p(Z|\zeta)\bigr] = -\frac{1}{2}\sum_{g=1}^{p}\sum_{j=1}^{k}\Bigl(\omega_{gj}\zeta_{gj} + \log\bigl(1+\exp(-\zeta_{gj}/2)\bigr)\Bigr) - \frac{1}{2\gamma}\sum_{g=1}^{p}\sum_{j=1}^{k}(\zeta_{gj}-\mu)^2.$$

Differentiating this with respect to ζ_gj, we obtain the sigmoid-form equation

$$\omega_{gj} = \frac{\exp(-\zeta_{gj})}{1+\exp(-\zeta_{gj})} - \frac{\zeta_{gj}-\mu}{\gamma}. \qquad (22)$$

The configuration probability is thus associated with the sigmoid transformation of −ζ_gj plus a bias arising from the Gaussian prior, −γ^{-1}(ζ_gj − μ). The solution of this equation is uniquely determined and is found by a trivial computation. We complete our annealing algorithm by inserting the optimization over ζ_gj into the while loop of the MAP algorithm. Figure 1 shows the change in the estimation equation (22) for several pairs of μ and γ. The estimate of ζ_gj is defined by the intersection point of ζ_gj with ω_gj on the curve. Note that the estimates of ζ_gj are bounded above and below such that the value of the right-hand side of (22) lies in the region [0, 1]. When the value of ω_gj reaches zero or one, the estimate of ζ_gj is shifted to the corresponding boundary.
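Since the right-hand side of (22) is strictly decreasing in ζ_gj, the root can be found with a standard one-dimensional root finder. A sketch, using the prior settings μ = 3 and γ = 6 from Section 6 (the bracketing interval is an assumption chosen wide enough to contain the root for ω_gj strictly between 0 and 1):

```r
## Sketch: solve the fixed-point equation (22) for zeta_gj given omega_gj.
solve_zeta <- function(omega, mu = 3, gamma = 6) {
  f <- function(z) plogis(-z) - (z - mu) / gamma - omega   # RHS of (22) minus omega
  uniroot(f, lower = mu - 10 * gamma, upper = mu + 10 * gamma)$root
}
solve_zeta(0.2)   # low inclusion probability  -> larger zeta (stronger penalty)
solve_zeta(0.8)   # high inclusion probability -> smaller zeta
```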

6 Experimental Results

6.1 Snapshot of Algorithm

We begin with some artificial data sets to provide an overview of our annealing algorithm. The first data set has 100 data points drawn from the GFM with p = 30 and k_true = 4. The variance of the noise components and the k_true singular values of the factor loading matrix were set to Ψ = 0.05I and Δ = diag(1.5, 1.2, 1.0, 0.8), respectively. For the construction of the sparse loading matrix, the binary matrix Z was generated from the Bernoulli distribution with success probability Pr(z_gj = 1) = 0.3, yielding a degree of sparseness of around 70%. We then specified the loading components in Φ so as to preserve the orthogonality of the four columns of Φ ◦ Z. The monochrome pixel images in the left panel of Figure 2 display the generated binary matrix Z together with the induced sparse covariance matrix. A sample code written in R for the data generator is available at the supplementary website (sample code in anneals_gfm()).

Figure 2: Result of the annealing algorithm using the log-inverse rate cooling. (Left) Sparsified loading matrix used for generating the artificial data (p = 30, k_true = 4) and the induced precision matrix. The non-zero elements are colour-coded in black. (Right) Images of the estimated factor loading matrix (k = 8) and the sparsified precision matrix. The last four loading vectors were automatically adjusted to zero.

To test the sensitivity to changes in the cooling schedule, the following three patterns were prepared (an R sketch of these schedules is given after the list):

• (Log-inverse decay) T_i = 3/log_2(i + 1) for i = 1, . . . , 6999 and T_7000 = 0

• (Linear decay) T_i = 3 − 6 × 10^{-3} × (i − 1) for i = 1, . . . , 1999 and T_2000 = 0

• (Power decay) T_i = 3 × 0.99^{i−1} for i = 1, . . . , 1999 and T_2000 = 0.
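The schedules can be written out directly in R. The negative exponent in the linear decay and the truncation at zero are reconstructions where the original formatting was lost, so treat the exact constants as assumptions.

```r
## Sketch of the three cooling schedules (constants reconstructed, see note above).
T_log <- c(3 / log2((1:6999) + 1), 0)        # log-inverse decay, length 7000
T_lin <- c(pmax(3 - 6e-3 * (0:1998), 0), 0)  # linear decay, truncated at zero, length 2000
T_pow <- c(3 * 0.99^(0:1998), 0)             # power (exponential) decay, length 2000
```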

We then conducted numerical experiments on the discriminative power for the sparsity structure and on computational efficiency, under the specification of the redundant factor dimension k = 8.

Annealing with Fixed Hyperparameters

We first demonstrate a simplified usage of our annealing in which the p × k hyperparameters ζ_gj are fixed to a common value ζ. Figure 3 summarizes the evaluation of the receiver operating characteristics (ROC) for the three cooling schedules, drawn over a range of arbitrarily chosen ζ allocated on a number of grid points in the region ζ ∈ [0, 5]. The true positive rate (TPR) and false positive rate (FPR) were computed from the identification error of the sparsity configuration in the covariance matrix (for more details, see the caption of Figure 3). As a benchmark, we evaluated the ROC of standard PCA by extracting the four most dominant eigenvectors (k = 4). In the ROC evaluation of the PCA, we successively selected a loading value from the p × k elements in decreasing order of absolute value and set it to zero. The resulting ROC curve, shown in the left panel, is very close to the 45° line corresponding to a randomized discrimination rule. For the annealing of the GFM, every curve induced by the three cooling rates achieved a much larger area under the curve, while the differences among them were quite small. However, as illustrated later, the choice of cooling schedule sometimes has a great influence on the estimation results, especially when lower initial values are used in the temperature schedule.

Inference of Effective Degree of Sparseness

At the lowest level of the hierarchy, we specified the Gaussian prior distribution p(ζ_gj) = N(ζ_gj|μ, γ) with μ = 3 and γ = 6, and then carried out the MAP algorithm with the log-inverse cooling schedule. As shown in the right panel of Figure 2, the estimated hyperparameters realized a reasonable control of the FNR (false negative rate; 15.4%) and FPR (false positive rate; 0%), which induced a slightly less sparse solution than the true structure.


Figure 3: ROC for the standard PCA obtained by simple thresholding of the loading values (left) and for the annealing estimation under the three cooling schedules — linear, log-inverse and exponential decay — (right). TPR (vertical) and FPR (horizontal) were calculated as TP/P and FP/N, where P and N denote the numbers of non-zero and zero elements in the true loadings, and TP and FP are the numbers of true positives and false positives, respectively.

Figure 4: Snapshot of the annealing process (right: hyperparameters, left: configuration probabilities); one panel per factor 1-8, plotted against iteration number 0-2000.

Figure 5: Sensitivity test to changes in the cooling schedules, and comparison with sparse PCA [Zou and Hastie, 2005]. Panels: annealing from high temperature (left), annealing from low temperature (center), sparse PCA (right). For each panel, TPR (black) and FPR (blue) are plotted along the vertical axis across the 20 replications of the artificial data (horizontal axis: replication number of the dataset).


Also, the annealing estimate automatically pruned the four redundant factor components, as shown in the right panel of Figure 2. Figure 4 displays a snapshot of the evolving configuration probabilities and hyperparameters during the annealing operation. The right panel shows the change of the configuration probabilities, which start from fairly equal values (1/2) and move gradually towards zero or one as the temperature is lowered. At around T_i ≈ 0.45, all the configuration probabilities corresponding to the four redundant factors reached zero. The synchronization of the solution paths between the hyperparameters and the configuration probabilities is confirmed in the figure.

We also evaluated sensitivity to the choice of cooling schedule under joint estimation of the hyperparameters. In addition to the previous three cooling schedules, we prepared the following new ones (a sketch that generates them is given after this list):

• (Log-inverse decay) T_i = 0.7 / log2(i + 1) for i = 1, ..., 6999 and T_7000 = 0

• (Linear decay) T_i = 0.7 − 6 × 10^{-3} × (i − 1) for i = 1, ..., 1999 and T_2000 = 0

• (Power decay) T_i = 0.7 × 0.99^{i−1} for i = 1, ..., 1999 and T_2000 = 0.
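As a minimal sketch (not the authors' code), the three schedules above could be generated as follows; the linear decay rate of 6 × 10^{-3} and the clipping of negative temperatures at zero are our reading of the formulas above and should be treated as assumptions.

import numpy as np

def cooling_schedule(kind, T0=0.7, n=2000):
    """Return temperatures T_1, ..., T_n with the final temperature exactly zero."""
    i = np.arange(1, n)                      # steps 1, ..., n-1; the n-th step is set to 0
    if kind == "log-inverse":
        T = T0 / np.log2(i + 1)
    elif kind == "linear":
        T = T0 - 6e-3 * (i - 1)              # assumed decay rate (see text)
    elif kind == "power":
        T = T0 * 0.99 ** (i - 1)
    else:
        raise ValueError(kind)
    return np.append(np.clip(T, 0.0, None), 0.0)

schedules = {
    "log-inverse": cooling_schedule("log-inverse", n=7000),
    "linear": cooling_schedule("linear", n=2000),
    "power": cooling_schedule("power", n=2000),
}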

The initial temperature was reduced from 3 to 0.7. Figure 5 shows the variation of TPR and FPR under the six cooling schedules, evaluated over 20 replications of data drawn from different GFMs. The 20 test models were randomly created according to the procedure described previously, with the probability of success for the Bernoulli trial set to 0.3. The left and center panels indicate a significant dominance of the annealing started from the higher initial temperatures; in particular, the TPRs obtained with the lower initial temperatures were much worse than those obtained with the higher values. For the results corresponding to the higher starting values, the within-schedule differences were quite small, as in the foregoing experiments using fixed hyperparameters.

The right panel of Figure 5 shows TPR and FPR for the sparse PCA proposed by Zou and Hastie [2005], evaluated on the same 20 data sets using the R function spca() available at CRAN (http://cran.r-project.org/). With spca(), the number of non-zero elements (cardinality) in each column of the factor loading matrix can be specified beforehand. We executed spca() after assigning the true cardinality as well as the known factor dimension k_true = 4. The figure indicates that our annealing with the higher initial temperature performs better than the sparse PCA. We remark that the annealing was performed with the redundant factor dimension k = 8 and with no a priori knowledge of the degree of sparseness.

Computing Time for Big-Data Problems

Figure 6 shows the CPU times required to execute the annealing with self-tuning of the degree of sparseness. The CPU times are plotted across a range of data dimensions, p ∈ {100, 200, 300, 500, 700, 1000}. The data sets were generated from GFMs sharing almost the same degree of sparseness, 70%. Using the linear-decay cooling of length 2000, we ran the stochastic and non-stochastic annealing under the specification k = 8. We did not apply the non-stochastic algorithm to the higher-dimensional data sets (larger than p = 700), for which the CPU times grew drastically to an impractical level. The computational difficulty was eased to some extent by the stochastic search algorithm, which kept the computation times for p = 1000 within a few hours. It should be remarked that this benchmark does not carry over to data sets with a lower degree of sparseness, because the stochastic search algorithm is ineffective unless the data exhibit a fairly high degree of sparseness.

6.2 Transcriptome Profiling of Breast Cancer

Transcriptome profiling of cancer cells is one of the most fundamental tools in genomic biomarker discovery and in studies of the transcriptional regulation of tumor growth and metastasis. As the use of DNA microarray assays has spread through molecular biology and medical science, there has been a growing need for statistical technologies able to process massive data sets containing many thousands of features (genes). In particular, latent factor models have become indispensable tools in the primary processing of such data for discovering modules of genes involved in the underlying pathways of transcriptional regulation [Carvalho et al., 2008, Hirose et al., 2008, Lucas et al., 2009, Yoshida et al., 2004, 2006].

Microarray Experiments and Preprocessing

We applied the annealing estimation of the GFM to transcription profiling of breast cancer. Our data set is a collection of gene expression signatures of human breast tumors isolated from 138 different patients. In addition to the expression signatures, the tumor samples were examined by immunohistochemistry (IHC) testing,


Figure 6: (Left) Plot of CPU times (in seconds; Intel(R) Core(TM)2 Duo processor, 2.60 GHz) versus problem size p for the stochastic search algorithm; the results for the non-stochastic algorithm are omitted. (Right) Identified sparsity for the problem of size p = 1000 (panels: sparsity of the true loadings and identified sparsity), which achieved FPR = 12.0% and FNR = 18.4%.

Figure 7: Identified factor probes (left) and sparsity configuration (right; binary matrix). In each image of the left panel, expression signatures of the probes associated with each factor (Factors 1-19) are depicted across the 138 samples (ordered along the horizontal axis).


which measures the concentrations of two receptor proteins (signaling molecules): erbB-2 (Her2) and the estrogen receptor (ER). The expression status of these two proteins is known to be among the main discriminative factors for breast cancer phenotype and for the prediction of clinical outcomes in practical therapeutic strategies. The immunohistochemical assessment graded the status of each receptor as: ER negative (ER=0), ER positive with low/high-level expression (ER=1 and ER=2), Her2 negative (Her2=0), and Her2 positive with low/high-level expression (Her2=1 and Her2=2). The transcript levels of genes were measured using the GeneChip Human Genome U95 arrays (Affymetrix).

Experimental Setting

A practical interest in the application of latent factor models to transcriptome profiling lies in the identification and decomposition of blind sources yielding covariations in massive data sets, and in the association of many features with tumor heterogeneity and with transcription modules involved in biological pathways. We carried out the sparse estimation of the GFM with k = 25 under adaptive control of the degree of sparseness, using the lowest-level prior with μ = 7 and γ = 10. The cooling schedule was prescribed by a linearly decreasing sequence of 2000 temperatures in which the decay rate and initial temperature were set to 6 × 10^{-3} and 3, respectively.

Identified Factors and Enrichment Analysis of Functional Annotations

During the annealing, redundancy in the prescribed factor dimension was pruned from k = 25 to k = 19. The expression signatures of the probes associated with the surviving nineteen factors are displayed in Figure 7, together with the identified sparsity of the loading matrix. As shown in the figure, the probes aggregated into the same factor were highly co-expressed with each other. To obtain a better understanding of the underlying biology, we evaluated the enrichment of functional annotations of the selected genes in each factor through the Gene Ontology (GO). Specifically, we assessed the significance (quasi-p value) of each GO term registered in the GO biological process category in the following way: given that r probes out of the l identified probes in a factor possess a GO term annotation S, and that the p = 996 probes contain in total m probes belonging to S, we calculated the upper-tail probability, under the hypergeometric distribution, that r or more probes annotated by S are drawn by chance in l draws from the 996 probes without replacement (a minimal sketch of this calculation is given below). This secondary analysis revealed associations between some identified factors and GO biological processes; the complete tables of the GO enrichment analysis are available from the supporting information website. For instance, the first factor collected a large number of probes, among which some of the most significant annotations were positive regulation of lymphocyte proliferation and the BMP signaling pathway involved in remodeling of bone. The 4th factor clearly captured genes involved in apoptosis and related pathways. Among genes coding extracellular signaling molecules of apoptosis, the two tumor necrosis factors, TNFSF10 (tumor necrosis factor (ligand) superfamily) and LITAF (lipopolysaccharide-induced TNF factor), and interleukin IL8 were listed in the identified probe set. Besides, many negative regulators of caspase activity, i.e. repressors of apoptosis which enhance cancer metastasis, were captured, particularly those involved in the suppression of tumor necrosis factor-induced apoptosis by the I-κB kinase/NF-κB cascade. Genes relevant to apoptosis were also aggregated into the 14th factor, which showed significant over-representation of the terms positive regulation of caspase (activator protein of apoptosis) and tyrosine phosphorylation of STAT proteins in the JAK-STAT cascade. The 8th factor included many genes relevant to hormone metabolic process, especially humoral immune response mediated by circulating immunoglobulin. Hormonal response-related probes were also captured by the 12th factor, whose members are mainly involved in the C21-steroid hormone metabolic process. As for cell-cycle related genes, genes relevant to the G1/S transition of the mitotic cell cycle were included in the 15th and 16th factors, whereas the 18th factor showed enrichment of the M phase of the meiotic cell cycle.
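A minimal sketch of the upper-tail hypergeometric calculation described above, using scipy; the numbers in the example call are hypothetical and serve only to illustrate the arguments.

from scipy.stats import hypergeom

def go_enrichment_pvalue(r, l, m, p=996):
    """Quasi-p value: P(X >= r) for X ~ Hypergeometric(p, m, l).

    r : probes in the factor annotated with GO term S
    l : probes identified in the factor (number of draws)
    m : probes among the p analysed probes annotated with S
    p : total number of analysed probes (996 in this study)
    """
    return hypergeom.sf(r - 1, M=p, n=m, N=l)

# Hypothetical example: 12 of 40 factor probes carry a term annotating 60 of the 996 probes.
print(go_enrichment_pvalue(r=12, l=40, m=60))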

Association Study of Identified Factors with ER Pathways

Figure 8 shows the averages of the estimated factor scores computed separately for the three grades of ER and Her2 status. Many factors captured discriminators of ER status; in particular, we observed strong associations of ER status with the 8th (GO: hormone metabolic process), 9th (GO: glucose metabolic process, negative regulation of MAPK activity), 12th (GO: C21-steroid hormone metabolic process), 14th (GO: apoptotic program, positive regulation of caspase activity), 18th (GO: M phase of meiotic cell cycle), and 19th (GO: regulation of Rab protein signal transduction) factors. These clear discrepancies across ER status indicate a strong effect of estrogen receptor-induced signaling on the transcription of downstream target genes, and the diversity of pathways at work in breast cancer progression.


Figure 8: Boxplots of the mean factor scores (Factors 1-19) computed for the three levels of ER (left) and Her2 (right) status.

Estrogen Response Elements

To achieve deeper insight into the relevant factors in relation to ER status, we carried out an enrichment analysis of the identified probe set based on estrogen response elements (EREs). ERs are members of the ligand-dependent nuclear receptor family, which recognize and bind EREs on DNA strands and, coupled with several interactor proteins, modulate the transcription of downstream target genes. It is therefore natural to expect that some of the identified ER-relevant factors captured a number of ERE genes. To perform the association study, we accessed the ERE database [Bourdeau et al., 2004], in which 17353 ERE genes screened by in vitro DNA binding experiments are registered. Our selected 996 probes shared 276 genes with the registered EREs. Each set of factor probes C_j sharing the jth factor was further divided into the two subsets C_{j+} and C_{j−} according to the sign of the estimated loading values. We then computed the p-values for C_{j+} and C_{j−} separately, using the hypergeometric distribution. Of the foregoing factors judged to be ER discriminators, the 9th (C_{9+}), 14th (C_{14+}) and 19th (C_{19−}) showed relatively small p-values (0.1949, 0.107 and 0.107). A summary table of the evaluated significances is available at our supplementary website.

Her2 status and Oncogenomic Recombination Hotspot on 17q12

Regarding Her2 status, Figure 8 indicates a correlation with the 16th factor, which captured 7 probes as the relevant features, including STARD3, GRB7 and two probes on the locus of ERBB2 (which encodes Her2). Interestingly, these three genes are all located on the same locus on human chromosome 17q12, known as the PPP1R1B-STARD3-TCAP-PNMT-PERLD1-ERBB2-MGC14832-GRB7 locus. This locus has been reported in many studies, e.g. Katoh and Katoh [2004], as an oncogenomic recombination hotspot which is frequently amplified in breast tumor cells.

Comparison to Non-Sparse Analysis

We finally compare the proposed annealing estimation with the non-sparse PCA. Supplementary Figures 1 and 2 depict the averages of the estimated factor scores (principal components) corresponding to the 19 most dominant eigenvalues, for each grade of ER and Her2 status. The PCA clearly failed to reveal the Her2-relevant phenotype of the tumor samples, and the association with ER status became unclear compared with our result. The foregoing sparse inference identified the Her2-relevant factor through only 7 non-zero loadings. Indeed, our post-analysis found that our data set contains very few probes exhibiting a significant fold-change between the Her2 phenotypes. Hence, the non-sparse model captures many irrelevant features through redundant non-zero loadings. The failure of PCA highlights that sparsification of the loading values is an essential part of statistical learning for discovering a small number of relevant features among many feature variables.

7 Discussion

We have presented a new computational strategy for solving the hard combinatorial optimization underlying sparse model construction. The methodological foundation of our work lies in the use of the posterior entropy as a


measure to evaluate and regulate the complexity of the posterior configuration probabilities of sparsity patterns. We realized the MAP estimation of the sparsity configuration by incorporating into the optimization algorithm an annealing mechanism built upon the tempered posterior entropy.

This work was largely inspired by sparsity inference for the GFM, so most of the arguments in this paper have been developed with reference to the GFM. However, our methodology can be applied to more general model selection problems that involve hard combinatorial optimization. One of the most common issues in statistical science is variable selection in regression, which aims to discover an explicit association between output and input vector variables, x_out and x_in, typically through a linear regression function, E[x_out | x_in] = (Θ ◦ Z) x_in. Whenever the coefficient matrix (Θ ◦ Z) is sparsified by the inclusion of the binary matrix Z, our approach is directly applicable to computing the MAP estimates of Θ and Z; a minimal sketch of such a masked linear predictor is given below. Beyond regression analysis, the method has a wider field of application, including the structural inference of conditional independence graphs and the state space model, which can be viewed as a time-domain extension of the GFM.
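As a purely illustrative sketch of such a masked linear predictor (the variable names and dimensions are ours, not tied to any specific model in this paper):

import numpy as np

# E[x_out | x_in] = (Theta o Z) x_in, where "o" is the elementwise (Hadamard) product
# and Z is a binary sparsity configuration selecting the active coefficients.
rng = np.random.default_rng(1)
q_out, q_in = 3, 5
Theta = rng.standard_normal((q_out, q_in))   # real-valued coefficients
Z = rng.integers(0, 2, size=(q_out, q_in))   # binary sparsity pattern
x_in = rng.standard_normal(q_in)

x_out_mean = (Theta * Z) @ x_in              # only coefficients with Z = 1 contribute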

We also remark on a further generalization of our annealing method obtained by replacing the entropy term with alternatives. The central idea of the annealing optimization is to design the tempered criterion G(Θ, ω; T) so that it converges to the joint posterior distribution over the parameters and sparsity configurations at the limiting zero temperature. Our criterion was based on additive penalization by the posterior entropy. However, we have much more flexibility in the choice of penalization term as long as the criterion converges to the joint posterior distribution as the temperature reaches zero. Possibly, this point of view can be exploited to realize more efficient solution paths during the annealing process. Unfortunately, as yet, we have no theory of convergence and do not know how the optimization algorithm approaches the most probable solution. Our annealing arose from natural intuition, and its effectiveness has only been confirmed empirically. Further work is needed to formalize our annealing.

Appendix

Proof of Proposition 1

Replace the objective function in (8) by multiplying it by the inverse temperature 1/T:
\[
\frac{1}{T} G(\Theta, \omega; T) = \sum_{Z \in \mathcal{Z}} \omega(Z) \log p(X, Z, \Theta \mid \zeta)^{1/T} - \sum_{Z \in \mathcal{Z}} \omega(Z) \log \omega(Z).
\]
An upper bound of this modified criterion is derived as follows:
\[
\frac{1}{T} G(\Theta, \omega; T)
= \sum_{Z \in \mathcal{Z}} \omega(Z) \log \frac{p(Z \mid X, \Theta, \zeta)^{1/T}\, p(X, \Theta \mid \zeta)^{1/T}}{\omega(Z)}
= \sum_{Z \in \mathcal{Z}} \omega(Z) \log \frac{p(Z \mid X, \Theta, \zeta)^{1/T}}{\omega(Z) \sum_{Z' \in \mathcal{Z}} p(Z' \mid X, \Theta, \zeta)^{1/T}} + K_0
\le K_0.
\]
In the second equality, the terms irrelevant to ω(Z) are collected in K_0 = \log p(X, \Theta \mid \zeta)^{1/T} + \log \sum_{Z'} p(Z' \mid X, \Theta, \zeta)^{1/T}. The first term in the second expression is the negative of the Kullback-Leibler divergence between ω(Z) and the normalized tempered posterior distribution. The lower bound (zero) of the Kullback-Leibler divergence is attained if and only if
\[
\omega(Z) = \frac{p(Z \mid X, \Theta, \zeta)^{1/T}}{\sum_{Z' \in \mathcal{Z}} p(Z' \mid X, \Theta, \zeta)^{1/T}}.
\]
This proves the Proposition.
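The tempered update above is simply a softmax of the log posterior scaled by 1/T. The following minimal numerical sketch (our own toy example with three configurations, not the authors' code) shows how the configuration probabilities sharpen towards a single MAP configuration as T decreases:

import numpy as np

def tempered_config_probs(log_post, T):
    """omega(Z) proportional to p(Z | X, Theta, zeta)^(1/T), normalized over configurations."""
    scaled = np.asarray(log_post, dtype=float) / T
    scaled -= scaled.max()            # stabilize the exponentials
    w = np.exp(scaled)
    return w / w.sum()

log_post = np.log([0.2, 0.5, 0.3])    # toy unnormalized posterior over three configurations
for T in (3.0, 1.0, 0.3, 0.05):
    print(T, tempered_config_probs(log_post, T).round(3))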

Proof of Proposition 2

Consider the optimization of (15) with respect to the jth loading vector conditional on the others, under the k constraints \|\phi_j\|^2 = 1 and \langle \phi_j, \phi_m \rangle = 0 for m \neq j. Using \rho_j, j \in \{1, \ldots, k\}, to denote the Lagrange multipliers for the k conditions, we set the Lagrangian:
\[
\phi_j^{\rm T} S(z_j, \Psi) \phi_j - \rho_j \left( \|\phi_j\|^2 - 1 \right) - \sum_{m \neq j} \rho_m \langle \phi_m, \phi_j \rangle
\tag{23}
\]
\[
\Longrightarrow \quad
\phi_{j,\{A_j\}}^{\rm T} S_{\{A_j\}}(z_j, \Psi)\, \phi_{j,\{A_j\}} - \rho_j \left( \|\phi_j\|^2 - 1 \right) - \sum_{m \neq j} \rho_m \langle \phi_m, \phi_j \rangle.
\tag{24}
\]
Note that any estimate Z sparsifies S(z_j, Ψ) and thus translates the quadratic form in (23) into that shown in (24).

First, we solve (23) using no a priori knowledge about the sparsity of S(z_j, Ψ), i.e. about A_j. The derivative of (23) with respect to \phi_j is
\[
S(z_j, \Psi) \phi_j - \rho_j \phi_j - \sum_{m \neq j} \rho_m \phi_m = 0.
\tag{25}
\]
Multiplying (25) by \phi_j^{\rm T} from the left, the jth multiplier is expressed as
\[
\rho_j = \frac{\phi_j^{\rm T} S(z_j, \Psi) \phi_j}{\|\phi_j\|^2}
= \phi_{j,\{A_j\}}^{\rm T} S_{\{A_j\}}(z_j, \Psi)\, \phi_{j,\{A_j\}},
\tag{26}
\]
where the terms associated with the zero elements of S(z_j, Ψ) are removed in passing from the first to the second expression. By the norm constraint, the denominator of the first expression is always equal to one.

Aside from this, consider the problem of maximizing \phi_j^{\rm T} S(z_j, \Psi) \phi_j in which the factor loadings corresponding to A_j^c are set to zero, \phi_{j,\{A_j^c\}} = 0, while retaining the norm constraint \|\phi_j\|^2 = \|\phi_{j,\{A_j\}}\|^2 = 1 and the orthogonality. Then the Lagrangian becomes
\[
\phi_{j,\{A_j\}}^{\rm T} S_{\{A_j\}}(z_j, \Psi)\, \phi_{j,\{A_j\}} - \rho_j \left( \|\phi_{j,\{A_j\}}\|^2 - 1 \right) - \sum_{m \neq j} \rho_m \langle \phi_{m,\{A_j\}}, \phi_{j,\{A_j\}} \rangle.
\tag{27}
\]
Obviously, at any stationary point, denoted by \phi_{j,\{A_j\}}^*, we have
\[
\rho_j^* = \frac{\phi_{j,\{A_j\}}^{*{\rm T}} S_{\{A_j\}}(z_j, \Psi)\, \phi_{j,\{A_j\}}^*}{\|\phi_{j,\{A_j\}}^*\|^2}
= \phi_{j,\{A_j\}}^{*{\rm T}} S_{\{A_j\}}(z_j, \Psi)\, \phi_{j,\{A_j\}}^*.
\tag{28}
\]
For both (26) and (28), the quadratic forms of the same sparsified matrix are unaffected by A_j^c. Thus, without loss of generality, we can find a pair of solutions such that \rho_j - \rho_j^* = 0.

Define the eigenvalue decomposition S_{\{A_j\}}(z_j, \Psi) = U V U^{\rm T}, where U \in \mathbb{R}^{p \times q} is the matrix of the q eigenvectors associated with the non-zero eigenvalues V = {\rm diag}(v_1, \ldots, v_q). For non-zero \rho^* and \rho, both \phi_{j,\{A\}}^* and \phi_{j,\{A\}} must lie in the span of the q eigenvectors, because
\[
\rho^* = \phi_{j,\{A\}}^{*{\rm T}} U V U^{\rm T} \phi_{j,\{A\}}^* = 0 \quad \text{if } \phi_{j,\{A\}}^* \notin {\rm span}\{U\},
\qquad
\rho = \phi_{j,\{A\}}^{\rm T} U V U^{\rm T} \phi_{j,\{A\}} = 0 \quad \text{if } \phi_{j,\{A\}} \notin {\rm span}\{U\}.
\]
This implies that for any non-trivial solutions \rho_j^* = \rho_j \neq 0, the corresponding solutions are obtained by linear transformations of the eigenvectors, \phi_{j,\{A\}} = U\gamma and \phi_{j,\{A\}}^* = U\gamma^*, with coefficients \gamma = \{\gamma_i\}_{1 \le i \le q} \in \mathbb{R}^q and \gamma^* = \{\gamma_i^*\}_{1 \le i \le q} \in \mathbb{R}^q. So we have
\[
\rho_j - \rho_j^* = (U\gamma)^{\rm T} S_{\{A\}}(z_j, \Psi)(U\gamma) - (U\gamma^*)^{\rm T} S_{\{A\}}(z_j, \Psi)(U\gamma^*)
= \sum_{i=1}^{q} (\gamma_i^2 - \gamma_i^{*2}) v_i = 0.
\tag{29}
\]
Besides, without loss of generality, there exists an expression \phi_{j,\{A\}} - \phi_{j,\{A\}}^* = U\gamma^{**} with \gamma^{**} \in \mathbb{R}^q, which satisfies \rho_j - \rho_j^* = \sum_{i=1}^{q} \gamma_i^{**2} v_i and \gamma_i^2 - \gamma_i^{*2} = \gamma_i^{**2} \ge 0, thereby leading to \gamma_i^2 = \gamma_i^{*2} for each i. Noticing that the choice of sign for \gamma_i and \gamma_i^* is arbitrary, the Proposition now follows with \gamma_i = \gamma_i^*.
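As a purely illustrative numerical check of the support-restriction argument (our own toy construction, ignoring the orthogonality constraints), the leading eigenpair of a sparsified matrix coincides on the support A_j with that of the corresponding submatrix:

import numpy as np

rng = np.random.default_rng(2)
p = 8
A = np.array([0, 2, 5])                       # toy support set A_j
B = rng.standard_normal((len(A), len(A)))
S_sub = B @ B.T                               # dense submatrix S_{A_j}(z_j, Psi)
S = np.zeros((p, p))
S[np.ix_(A, A)] = S_sub                       # sparsified S(z_j, Psi)

vals, vecs = np.linalg.eigh(S)
vals_sub, vecs_sub = np.linalg.eigh(S_sub)
rho, phi = vals[-1], vecs[:, -1]              # leading eigenpair of the full problem
rho_sub, phi_sub = vals_sub[-1], vecs_sub[:, -1]
off = np.setdiff1d(np.arange(p), A)

print(np.isclose(rho, rho_sub))                       # identical multiplier (eigenvalue)
print(np.allclose(np.abs(phi[A]), np.abs(phi_sub)))   # identical loadings on the support, up to sign
print(np.allclose(phi[off], 0.0))                     # zero loadings off the support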

Page 22: Technical Report #2009-5 May 14, 2009 Statistical and Applied Mathematical Sciences Institute PO Box 14006 Research Triangle Park, NC …

Derivation: Optimization over Φ

Let \rho_j, j \in \{1, \ldots, k\}, be the Lagrange multipliers that enforce the restrictions in (16). We write down the Lagrange function:
\[
\phi_j^{\rm T} E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j - \rho_j \left( \|\phi_j\|^2 - 1 \right) - \sum_{m \neq j} \rho_m \langle \phi_m, \phi_j \rangle.
\tag{30}
\]
Differentiation of (30) with respect to \phi_j yields
\[
E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j - \rho_j \phi_j - \sum_{m \neq j} \rho_m \phi_m = 0.
\tag{31}
\]
To solve this equation, the first step is to find a closed-form solution for the vector of Lagrange multipliers \rho_{(-j)} = \{\rho_m\}_{m \neq j} \in \mathbb{R}^{k-1}. Multiplying (31) by each \phi_m^{\rm T}, m \neq j, from the left, we obtain the k - 1 equations
\[
\phi_m^{\rm T} E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j - \sum_{m' \neq j} \rho_{m'} \phi_m^{\rm T} \phi_{m'} = 0
\quad \text{for each } m \neq j.
\]
Using \Phi_{(-j)} = \{\phi_m\}_{m \neq j} \in \mathbb{R}^{p \times (k-1)} to denote the p \times (k-1) matrix whose columns consist of the k - 1 loading vectors other than \phi_j, the above equations are converted into the matrix representation
\[
\Phi_{(-j)}^{\rm T} E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j - \Phi_{(-j)}^{\rm T} \Phi_{(-j)} \rho_{(-j)} = 0,
\]
which in turn leads to the solution for \rho_{(-j)}:
\[
\rho_{(-j)} = \left( \Phi_{(-j)}^{\rm T} \Phi_{(-j)} \right)^{-1} \Phi_{(-j)}^{\rm T} E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j.
\]
Substituting this solution into the original equation (31), we obtain the eigenvalue equation
\[
N_j E_\omega\!\left[ S(z_j, \Psi) \right] \phi_j - \rho_j \phi_j = 0
\quad \text{with} \quad
N_j = I - \Phi_{(-j)} \Phi_{(-j)}^{\rm T}.
\tag{32}
\]
This eigenvalue equation still causes computational difficulty because it involves the non-symmetric matrix N_j E_\omega[S(z_j, \Psi)], which may be huge.

To avoid the non-symmetric eigenvalue problem, we translate it into the alternative form
\[
N_j E_\omega\!\left[ S(z_j, \Psi) \right] N_j \varphi_j - \rho_j \varphi_j = 0.
\tag{33}
\]
This symmetrized eigenvalue problem is no longer intractable. Noting that N_j is idempotent, i.e. N_j = N_j^b for any positive integer b, (33) can be rewritten as
\[
N_j E_\omega\!\left[ S(z_j, \Psi) \right] N_j \varphi_j - \rho_j \varphi_j = 0
\;\Longrightarrow\;
N_j E_\omega\!\left[ S(z_j, \Psi) \right] N_j \varphi_j - \rho_j N_j \varphi_j = 0.
\tag{34}
\]
Setting \phi_j \leftarrow N_j \varphi_j in the right-hand equation of (34), we see the equivalence between the primal equation (32) and (34), which is derived through the alternative formulation (33).
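A minimal numerical sketch of this conditional update (our own illustration; the expectation E_ω[S(z_j, Ψ)] is taken as given, and the columns of Φ are assumed approximately orthonormal):

import numpy as np

def update_loading(S_exp, Phi, j):
    """Update the j-th loading via the symmetrized eigenvalue problem (33)-(34)."""
    p, k = Phi.shape
    Phi_rest = np.delete(Phi, j, axis=1)      # Phi_(-j)
    N = np.eye(p) - Phi_rest @ Phi_rest.T     # projector N_j onto the orthogonal complement
    M = N @ S_exp @ N                         # symmetrized matrix in (33)
    vals, vecs = np.linalg.eigh(M)            # symmetric eigenvalue decomposition
    phi = N @ vecs[:, -1]                     # phi_j = N_j times the leading eigenvector
    return phi / np.linalg.norm(phi)

# Toy usage with a random symmetric matrix and a random orthonormal loading matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
S_exp = A @ A.T
Phi, _ = np.linalg.qr(rng.standard_normal((10, 3)))
Phi[:, 1] = update_loading(S_exp, Phi, 1)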

References

C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.

V. Bourdeau, J. Deschenes, R. Metivier, Y. Nagai, D. Nguyen, T. Hudson, F. Gannon, J. H. White, and S. Mader. Genome-wide identification of high-affinity estrogen response elements in human and mouse. Molecular Endocrinology, 18:1411–1427, 2004.

C. M. Carvalho, J. E. Lucas, Q. Wang, J. T. Chang, J. R. Nevins, and M. West. High-dimensional sparse factor modelling: Applications in gene expression genomics. Journal of the American Statistical Association, 103:1438–1456, 2008.

A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.

A. Dobra, B. Jones, C. Hans, J. R. Nevins, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90:196–212, 2004.

S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 20, pages 475–482, Cambridge, MA, 2005. MIT Press.

O. Hirose, R. Yoshida, S. Imoto, R. Yamaguchi, T. Higuchi, D. S. Charnock-Jones, C. Print, and S. Miyano. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics, 24(7):932–942, 2008.

I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

B. Jones, A. Dobra, C. M. Carvalho, C. Hans, C. Carter, and M. West. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20:388–400, 2005.

M. Katoh and M. Katoh. Evolutionary recombination hotspot around GSDML-GSDM locus is closely linked to the oncogenomic recombination hotspot around the PPP1R1B-ERBB2-GRB7 amplicon. International Journal of Oncology, 24:757–763, 2004.

S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

J. E. Lucas, C. M. Carvalho, Q. Wang, A. H. Bild, J. R. Nevins, and M. West. Sparse statistical modelling in gene expression genomics. In P. Muller, K. A. Do, and M. Vannucci, editors, Bayesian Inference for Gene Expression and Proteomics, pages 155–176. Cambridge University Press, 2006.

J. E. Lucas, C. M. Carvalho, J-T. A. Chi, and M. West. Cross-study projections of genomic biomarkers: An evaluation in cancer genomics. PLoS One, 4(2):e4523, 2009.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

R. Yoshida, T. Higuchi, and S. Imoto. A mixed factors model for dimension reduction and extraction of a group structure in gene expression data. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pages 161–172, 2004.

R. Yoshida, T. Higuchi, S. Imoto, and S. Miyano. ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles. Bioinformatics, 22(12):1538–1539, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.