
BIOINFORMATICS Vol. 18 no. 9 2002, Pages 1194–1206

Bayesian infinite mixture model based clustering of gene expression profiles

Mario Medvedovic 1,∗ and Siva Sivaganesan 2

1 Center for Genome Information, Department of Environmental Health, University of Cincinnati Medical Center, 3223 Eden Av. ML 56, Cincinnati, OH 45267-0056, USA and 2 Mathematical Sciences Department, University of Cincinnati, 4221 Fox Hollow Dr, Cincinnati, OH 45241, USA

Received on November 9, 2001; revised on January 22, 2002; March 4, 2002; accepted on March 12, 2002

ABSTRACT
Motivation: The biologic significance of results obtained through cluster analyses of gene expression data generated in microarray experiments has been demonstrated in many studies. In this article we focus on the development of a clustering procedure based on the concept of Bayesian model-averaging and a precise statistical model of expression data.
Results: We developed a clustering procedure based on the Bayesian infinite mixture model and applied it to clustering gene expression profiles. Clusters of genes with similar expression patterns are identified from the posterior distribution of clusterings defined implicitly by the stochastic data-generation model. The posterior distribution of clusterings is estimated by a Gibbs sampler. We summarized the posterior distribution of clusterings by calculating posterior pairwise probabilities of co-expression and used the complete linkage principle to create clusters. This approach has several advantages over usual clustering procedures. The analysis allows for incorporation of a reasonable probabilistic model for generating data. The method does not require specifying the number of clusters, and the resulting optimal clustering is obtained by averaging over models with all possible numbers of clusters. Expression profiles that are not similar to any other profile are automatically detected, the method incorporates experimental replicates, and it can be extended to accommodate missing data. This approach represents a qualitative shift in the model-based cluster analysis of expression data because it allows for incorporation of the uncertainties involved in the model selection into the final assessment of confidence in similarities of expression profiles. We also demonstrated the importance of incorporating the information on experimental variability into the clustering model.
Availability: The MS Windows™ based program implementing the Gibbs sampler and supplemental material are available at http://homepages.uc.edu/~medvedm/BioinformaticsSupplement.htm.
Contact: [email protected]

∗To whom correspondence should be addressed.


INTRODUCTION
The ability of DNA microarray technology to produce expression data on a large number of genes in a parallel fashion has resulted in new approaches to identifying individual genes as well as whole pathways involved in performing different biologic functions. One commonly used approach in drawing conclusions from microarray data is to identify groups of genes with similar expression patterns across different experimental conditions through a cluster analysis (D'haeseleer et al., 2000). The biologic significance of the results of such analyses has been demonstrated in numerous studies. Various traditional clustering procedures, ranging from simple agglomerative hierarchical methods (Eisen et al., 1998) to optimization-based global procedures (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999), to mention a few, have been applied to clustering gene expression profiles (in this paper we use the term profile to represent the set of observed expression values for a gene across several experiments). In identifying patterns of expression, such procedures depend on either a visual identification of patterns in a color-coded display (hierarchical clustering) or on the correct specification of the number of patterns present in the data prior to the analysis (k-means and self-organizing maps).

Expression data generated by DNA arrays incorporate different sources of variability present in the process of obtaining fluorescence intensity measurements. Choosing a statistically optimal experimental design and the appropriate statistical analysis of the produced data is an increasingly important aspect of such experiments (Kerr and Churchill, 2001; Baldi and Long, 2001).


When statistically significant differential expression needs to be detected, various modifications of traditional statistical approaches have been proposed (Kerr et al., 2000; Newton et al., 2001; Ideker et al., 2000; Wolfinger et al., 2001; Rocke and Dubin, 2001; Lonnstedt and Speed, 2001). A common goal of all these approaches is the precise treatment of different sources of variability, resulting in precise estimates of differential expression and their statistical significance. In this article we suggest a Bayesian hierarchical infinite mixture model as a general framework for incorporating the same level of precision into the cluster analysis of gene expression profiles.

Generally, in a model-based approach to clustering, the probability distribution of observed data is approximated by a statistical model. Parameters in such a model define clusters of similar observations, and the cluster analysis is performed by estimating these parameters from the data. In a Gaussian mixture model approach, similar individual profiles are assumed to have been generated by a common underlying 'pattern' represented by a multivariate Gaussian random variable. The confidence in the obtained patterns and the confidence in individual assignments to particular clusters are assessed by estimating the confidence in the corresponding parameter estimates. Recently, a Gaussian finite mixture model (McLachlan and Basford, 1987) based approach to clustering has been used to cluster expression profiles (Yeung et al., 2001a). Assuming that the number of mixture components is correctly specified, this approach offers reliable estimates of confidence in assigning individual profiles to particular clusters. When the number of clusters is not known, this approach relies on one's ability to identify the correct number of mixture components generating the data. All conclusions are then made assuming that the correct number of clusters is known. Consequently, estimates of the model parameters do not take into account the uncertainty in choosing the correct number of clusters.

We developed a statistical procedure based on the Bayesian infinite mixture (also called the Dirichlet process mixture) model (Ferguson, 1973; Neal, 2000; Rasmussen, 2000) in which conclusions about the probability that a set of gene expression profiles is generated by the same pattern are based on the posterior probability distribution of clusterings given the data. In addition to generating groups of similar expression profiles at a specified confidence, it identifies outlying expression profiles that are not similar to any other profile in the data set. In contrast to the finite mixture approach, this model does not require specifying the number of mixture components, and the obtained clustering is a result of averaging over all possible numbers of mixture components. Consequently, the uncertainty about the true number of components is incorporated in the uncertainties about the obtained clusters. Advantages of such an approach over the finite mixture approach, which relies on the specification of the 'correct' number of mixture components, are demonstrated by a direct comparison of the two approaches. We also extended the model to incorporate experimental replicates. In a simulation study, we demonstrated that such an extension is beneficial when the expression levels of different genes are measured with different precision.

STATISTICAL MODEL
Suppose that $T$ gene expression profiles were observed across $M$ experimental conditions. If $x_{im}$ is the expression level of the $i$th gene under the $m$th experimental condition, then $x_i = (x_{i1}, x_{i2}, \ldots, x_{iM})$ denotes the complete expression profile for the $i$th gene. Each gene expression profile can be viewed as being generated by one out of $Q$ different underlying expression patterns. Expression profiles generated by the same pattern form a cluster of similar expression profiles. If $c_i$ is the classification variable indicating the pattern that generates the $i$th expression profile ($c_i = j$ means that the $i$th expression profile was generated by the $j$th pattern), then a 'clustering' is defined by the set of classification variables for all expression profiles, $c = (c_1, c_2, \ldots, c_T)$. Values of the classification variables are meaningful only to the extent that all observed expression profiles having the same value of their classification variable were generated by the same pattern and form a cluster. In our probabilistic model, underlying patterns generating clusters of expression profiles are represented by multivariate Gaussian random variables. That is, profiles clustering together are assumed to be a random sample from the same multivariate Gaussian random variable.

The following hierarchical model defines the probability distribution generating the observed expression profiles. This model implicitly defines the posterior distribution of the classification set $c$ and consequently the number of clusters (patterns) in the data, $Q$.

Level 1: The distribution of the data given the parameters of the underlying patterns $[(\mu_1, \sigma_1^2), \ldots, (\mu_Q, \sigma_Q^2)]$ and the classification variables $c$ is given by

$$p(x_i \mid c_i = j, \mu_1, \ldots, \mu_Q, \sigma_1, \ldots, \sigma_Q) = f_N(x_i \mid \mu_j, \sigma_j^2 I) \quad \text{(I)}$$

where $p(x \mid \theta)$ denotes the marginal probability distribution function of $x$ given parameters $\theta$, and $f_N(\cdot \mid \mu, \sigma^2 I)$ is the probability density function of a multivariate Gaussian random variable with mean $\mu$ and variance–covariance matrix $\sigma^2 I$, where $I$ denotes the identity matrix.

Level 2: Prior distributions for $[(\mu_1, \sigma_1^2), \ldots, (\mu_Q, \sigma_Q^2)]$ and $(c_1, \ldots, c_T)$, given hyperparameters $\lambda$, $r$, $\beta$ and $w$, are given by

$$p(c_i \mid \pi_1, \ldots, \pi_Q) = \prod_{j=1}^{Q} \pi_j^{I(c_i = j)}$$

$$p(\mu_j \mid \lambda, r) = f_N(\mu_j \mid \lambda, r^{-1} I)$$

$$p(\sigma_j^{-2} \mid \beta, w) = f_G\left(\sigma_j^{-2} \,\middle|\, \frac{\beta}{2}, \frac{\beta w}{2}\right) \quad \text{(II)}$$

where $f_G(\cdot \mid \theta, \tau)$ is the probability density function of a Gamma random variable with shape parameter $\theta$ and scale parameter $\tau$, and $I(c_i = j) = 1$ whenever $c_i = j$ and zero otherwise.

Level 3: Prior distributions for $(\pi_1, \ldots, \pi_Q)$ and the hyperparameters $\lambda$, $r$, $\beta$ and $w$ are given by

$$p(\pi_1, \ldots, \pi_Q \mid \alpha, Q) = f_D\left(\pi_1, \ldots, \pi_Q \,\middle|\, \frac{\alpha}{Q}, \ldots, \frac{\alpha}{Q}\right)$$

$$p(w \mid \sigma_x^2) = f_G\left(w \,\middle|\, \frac{1}{2}, \frac{\sigma_x^{-2}}{2}\right)$$

$$p(\beta) = f_G\left(\beta \,\middle|\, \frac{1}{2}, \frac{1}{2}\right) \quad \text{(III)}$$

$$p(r \mid \sigma_x^2) = f_G\left(r \,\middle|\, \frac{1}{2}, \frac{\sigma_x^2}{2}\right)$$

$$p(\lambda \mid \mu_x, \sigma_x^2) = f_N(\lambda \mid \mu_x, \sigma_x^2 I) \quad \text{(IV)}$$

where $f_D(\cdot \mid \theta_1, \ldots, \theta_Q)$ represents the probability density function of a Dirichlet random variable with parameters $(\theta_1, \ldots, \theta_Q)$. $\mu_x$ is the average of all profiles analyzed, $\sigma_x^2$ is the average of gene-specific sample variances based on all profiles analyzed (see the Appendix), and $\alpha$ was set equal to 1.
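To make the generative structure concrete, here is a minimal simulation sketch of the finite-$Q$ version of this model. It is not the authors' code: the dimensions, the stand-in values for $\mu_x$ and $\sigma_x^2$, and the reading of $f_G(\cdot \mid \theta, \tau)$ as a shape/scale Gamma are assumptions made for the example.

```python
# Illustrative sketch (not the authors' code): simulate profiles from the
# finite-Q version of the hierarchical model (I)-(IV).
import numpy as np

rng = np.random.default_rng(0)
T, M, Q, alpha = 100, 8, 5, 1.0             # assumed profile count, length, components

mu_x, s2_x = 0.0, 1.0                       # stand-ins for the data-derived mu_x, sigma_x^2
lam = rng.normal(mu_x, np.sqrt(s2_x), M)    # lambda ~ f_N(mu_x, sigma_x^2 I)
r = rng.gamma(0.5, s2_x / 2)                # r ~ f_G(1/2, sigma_x^2 / 2), shape/scale reading
w = rng.gamma(0.5, 1 / (2 * s2_x))          # w ~ f_G(1/2, sigma_x^{-2} / 2)
beta = rng.gamma(0.5, 0.5)                  # beta ~ f_G(1/2, 1/2)

pi = rng.dirichlet(np.full(Q, alpha / Q))   # (pi_1, ..., pi_Q) ~ f_D(alpha/Q, ..., alpha/Q)
mu = rng.normal(lam, 1 / np.sqrt(r), (Q, M))         # pattern means mu_j ~ f_N(lambda, r^{-1} I)
inv_s2 = rng.gamma(beta / 2, beta * w / 2, Q)        # sigma_j^{-2} ~ f_G(beta/2, beta*w/2)
c = rng.choice(Q, T, p=pi)                           # classification variables c_i
x = rng.normal(mu[c], (1 / np.sqrt(inv_s2))[c, None])  # x_i ~ f_N(mu_{c_i}, sigma_{c_i}^2 I)
```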

Dependencies in this model can be represented using the usual directed acyclic graph (DAG) representation and the directed Markov assumption (Cowell et al., 1999):

[Figure: DAG over the nodes $\alpha$, $\lambda$, $r$, $w$, $\Pi$, $M$, $\Sigma$, $C$ and $X$]

where $\Pi = (\pi_1, \ldots, \pi_Q)$, $C = (c_1, \ldots, c_T)$, $X = (x_1, \ldots, x_T)$, $M = (\mu_1, \ldots, \mu_Q)$ and $\Sigma = (\sigma_1^2, \ldots, \sigma_Q^2)$.

As $Q$ approaches infinity, the model above is equivalent to the corresponding Dirichlet process prior mixture model (Neal, 2000). This model is similar to the models described by Rasmussen (2000) and Escobar and West (1995). Previously, we used a similar approach to model multinomial data (Medvedovic, 2000). This hierarchical model can be easily generalized to include more general covariance structures, to accommodate experimental replicates, more flexible or more specific priors, missing data, etc.

The cluster analysis based on this model proceeds by approximating the joint posterior distribution of classification vectors given the data, $p(c \mid x_1, \ldots, x_T)$. This distribution is implicitly specified by the hierarchical model but cannot be written in closed form. However, it can be shown (Neal, 2000) that the posterior marginal distribution of the classification variables given all model parameters and the data, when $Q$ approaches infinity, is fully specified by the following two equations:

$$p(c_i = j \mid c_{-i}, x_i, \mu_j, \sigma_j^2) = b\, \frac{n_{-i,j}}{T - 1 + \alpha}\, f_N(x_i \mid \mu_j, \sigma_j^2 I) \quad \text{(V)}$$

$$p(c_i \neq c_j,\ j \neq i \mid c_{-i}, x_i, \mu_x, \sigma_x^2) = b\, \frac{\alpha}{T - 1 + \alpha} \int f_N(x_i \mid \mu_j, \sigma_j^2 I)\, p(\mu_j, \sigma_j^2 \mid \lambda, r^{-1})\, d\mu_j\, d\sigma_j^2 \quad \text{(VI)}$$

where $n_{-i,j}$ is the number of expression profiles classified in class $j$, not counting the $i$th profile, $c_{-i}$ is the classification vector for all except the $i$th profile, and $b$ is a normalizing constant ensuring that all classification probabilities add up to one. Posterior conditional distributions for the other model parameters are given in the Appendix. Having all complete conditional distributions specified, we can employ a Gibbs sampler to estimate the joint posterior distribution of all model parameters. It is important to note that although the number of mixture components is technically assumed to be infinite, the number of nonempty components is finite and ranges between 1 and $T$.
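As a concrete illustration of how (V) and (VI) drive one sweep of the sampler, the sketch below resamples a single classification variable. It is not the authors' implementation: `mu` and `sig2` are assumed to be dicts keyed by cluster label, and the integral in (VI) is approximated by averaging the likelihood over draws of $(\mu, \sigma^2)$ from their priors (`prior_draws`), one of several strategies discussed by Neal (2000).

```python
# Hedged sketch of one classification-variable update per Equations (V)-(VI).
import numpy as np
from scipy.stats import multivariate_normal

def update_ci(i, x, c, mu, sig2, prior_draws, alpha=1.0, rng=np.random.default_rng()):
    """Resample c[i] given all other assignments; returns the new label for gene i."""
    T, M = x.shape
    labels = [j for j in np.unique(c) if np.sum(c == j) - (c[i] == j) > 0]
    probs = []
    for j in labels:                                   # Equation (V), up to b
        n_ij = np.sum(c == j) - (c[i] == j)
        lik = multivariate_normal.pdf(x[i], mean=mu[j], cov=sig2[j] * np.eye(M))
        probs.append(n_ij / (T - 1 + alpha) * lik)
    # Equation (VI): Monte Carlo estimate of the prior-predictive density
    new_lik = np.mean([multivariate_normal.pdf(x[i], mean=m, cov=s2 * np.eye(M))
                       for m, s2 in prior_draws])
    probs.append(alpha / (T - 1 + alpha) * new_lik)
    probs = np.asarray(probs) / np.sum(probs)          # b normalizes the probabilities
    k = rng.choice(len(probs), p=probs)
    return labels[k] if k < len(labels) else int(np.max(c)) + 1  # open a new cluster
```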

While the above expression for the marginal probability of the classification variables could be somewhat simplified, a closed form expression for the (unconditional) probability of the classification variables is not possible in light of the (non-conjugate) form of the priors we have chosen for the $\mu_j$s and $\sigma_j$s. In the current set-up, we did not find any problems with the mixing or the convergence of the Gibbs sampler. However, with the use of a conjugate prior and the resulting simplification, one may obtain a more efficient Gibbs sampler.

Prior knowledge was only used in the selection of the Dirichlet process precision parameter $\alpha$. In the cell cycle data analysis, this parameter was set to 1, representing the prior belief that six to seven clusters are expected (Escobar, 1994). The prior belief about the number of clusters is based on the fact that we expected to see six different clusters related to different cell cycle phases (Cho et al., 1998) and one cluster to allow for genes not related to the cell cycle. In the sensitivity analysis, we demonstrated that the posterior distribution of clusterings was not sensitive to doubling and halving of $\alpha$, which is approximately equivalent to doubling and halving the prior expected number of mixture components (Escobar, 1994). An alternative approach to specifying the prior belief about the number of clusters could be to specify a prior distribution for $\alpha$ and treat it like all other parameters in the model (Escobar and West, 1995; Rasmussen, 2000). All other hyperparameters in the model are given prior distributions that are centered around the overall profile mean and standard deviation. This, we believe, is a reasonable 'default' choice in the absence of any specific prior information about the cluster locations.

EXPERIMENTAL REPLICATES MODEL
When the signal in the data is drowned in the experimental variability, the most obvious way to improve the signal to noise ratio is to perform experimental replicates. The cluster analysis of such data can be performed by first averaging gene expression measurements observed in different experimental replicates and then applying a usual clustering procedure. Since the standard deviation of such average expression levels decreases with the square root of the number of replicates, it is likely that weaker associations will be detected than in the case when a single replicate is performed. However, such an approach does not use all the information present in the data: the information on the variability between experimental replicates is discarded and only the information about the mean expression level is utilized. When experimental variability varies from gene to gene, such information can be crucial for the proper assessment of confidence in the final clustering results. It is a well documented phenomenon that log-transformed expression measurements of low-expressed genes vary more than expression measurements of highly expressed genes; Baggerly et al. (2001) presented a probabilistic model of hybridization dynamics that explains this fact. Furthermore, it has been indicated that expression levels of some genes might be more variable than others (Hughes et al., 2000). In the context of the mixture model based approach, experimental replicates and different between-replicates variability can be accommodated by modifying Level 1 of our hierarchical model (I–IV):

$$p(x_{ik} \mid y_i, \psi_i) = f_N(x_{ik} \mid y_i, \psi_i^2 I),$$

$$p(y_i \mid c_i = j, \mu_j, \sigma_j) = f_N(y_i \mid \mu_j, \sigma_j^2 I), \quad \text{(VII)}$$

$$k = 1, \ldots, G, \quad i = 1, \ldots, T, \quad j = 1, \ldots, Q$$

In this new formulation, $x_{ik}$ represents the expression profile in the $k$th replicate for the $i$th gene and $y_i$ is the mean expression profile for the $i$th gene. $\psi_i^2$ represents the between-replicates variance for the $i$th gene, which is assumed to be homogeneous across all experimental conditions. After integrating out the mean expression profiles for individual genes (the $y_i$s), the joint probability distribution of all observations for a single gene is given by

$$p(x_{i1}, \ldots, x_{iG} \mid c_i = j, \mu_j, \sigma_j^2, \psi_i) = f_N\left(x_{i\bullet} \,\middle|\, \mu_j, \left(\sigma_j^2 + \frac{\psi_i^2}{G}\right) I\right),$$
$$i = 1, \ldots, T, \quad j = 1, \ldots, Q; \qquad x_{i\bullet} = \frac{\sum_k x_{ik}}{G}. \quad \text{(VIII)}$$

The prior distribution for the within-profile variance $\psi_i^2$ and the conditional posterior distributions for the parameters in this model are also given in the Appendix.
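Computationally, (VIII) simply inflates each cluster's spherical variance by $\psi_i^2 / G$; a minimal sketch (the function name and interface are our own assumptions):

```python
# Hedged sketch: log-likelihood of one gene's replicates under Equation (VIII).
import numpy as np

def replicate_log_lik(x_reps, mu_j, sig2_j, psi2_i):
    """x_reps: (G, M) array of replicate profiles for a single gene."""
    G, M = x_reps.shape
    x_bar = x_reps.mean(axis=0)            # the averaged profile x_i. from (VIII)
    var = sig2_j + psi2_i / G              # replicate-inflated spherical variance
    return -0.5 * (M * np.log(2 * np.pi * var) + np.sum((x_bar - mu_j) ** 2) / var)
```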

COMPUTING POSTERIOR PROBABILITIES AND CLUSTER FORMATION
Gibbs sampler
The Gibbs sampler (Gelfand and Smith, 1990) is a general procedure for sampling observations from a multivariate distribution. It proceeds by iteratively drawing observations from the complete conditional distributions of all components. Under mild conditions, the distribution of the generated multivariate observations converges to the target multivariate distribution. The Gibbs sampler for generating a sequence of clusterings $c^1, c^2, c^3, \ldots, c^S$ proceeds as follows.

Initialization phase: The algorithm is started by assuming that all profiles are clustered together. That is, $c^0$ is initialized as $c_1^0 = c_2^0 = \cdots = c_T^0 = 1$. Consequently, $Q^0$ is set to one. The corresponding pattern parameters $\mu_1$ and $\sigma_1^2$ are generated as random samples from their prior distribution.

Iterations: Given the parameters after the $k$th step, $(c^k, Q^k, \mu_1, \ldots, \mu_{Q^k}, \sigma_1, \ldots, \sigma_{Q^k})$, the $(k+1)$th set of parameters is generated by first updating the classification variables, that is, drawing $c^{k+1}$ according to their posterior conditional distributions given by (V) and (VI). This also determines the value of $Q^{k+1}$. New $\mu$s and $\sigma^2$s, as well as the hyperparameters $\lambda$, $r$, $w$ and $\beta$, are then generated according to their posterior conditional distributions (Appendix). Whenever the number of profiles in a cluster falls to zero, the cluster is removed from the list. A new cluster is created whenever a value $c_i \neq c_j$ for all $j \neq i$ is selected.


It can be shown (Neal, 2000) that this algorithm, in the limit, generates clusterings from the desired posterior distribution of clusterings. Therefore, it can be assumed that the empirical distribution of the generated clusterings $c^B, c^{B+1}, \ldots$, after $B$ 'burn-in' samples, approximates the true posterior distribution of clusterings. Groups of genes that had common assignments in a large proportion of the generated clusterings are likely to have been generated by the same underlying pattern. That is, the proportion of clusterings in which a group of genes had common assignments approximates the probability that they are generated by the same underlying pattern. In addition to the posterior distribution of clusterings averaged over all possible numbers of clusters, the results of this sampling procedure enable us to estimate the posterior distribution of the number of clusters present in the data.

Cluster Formation
Given the sequence of clusterings $(c^B, c^{B+1}, \ldots, c^S)$ generated by the Gibbs sampler after $B$ 'burn-in' cycles, the pairwise probability that two genes are generated by the same pattern is estimated as

$$P_{ij} = \frac{\#\ \text{of samples after burn-in for which } c_i = c_j}{S - B}.$$

Using these probabilities as similarity measures, pairwise 'distances' are created as

$$D_{ij} = 1 - P_{ij}.$$

Based on these distances, clusters of similar expression profiles are created by the complete linkage approach (Everitt, 1993), utilizing Proc Cluster and Proc Tree of the SAS® statistical software package (1999). It is important to note that using the posterior pairwise probabilities and the complete linkage approach is just one of many possible ways of processing the posterior distribution of clusterings generated by the Gibbs sampler. An alternative approach to identifying the optimal clustering would be to identify the most probable clustering using the traditional maximum a posteriori probability (MAP) approach, that is, identifying the clustering that occurred more times in the Gibbs sampler generated sequence than any other clustering. The problem with that type of approach is that many different clusterings have very similar yet very small posterior probabilities. Choosing the 'best' out of a slew of very similar clusterings, without summarizing the other almost equally probable alternatives, would likely offer an incomplete picture of the observed distribution of clusterings. Other approaches to summarizing the posterior distribution of clusterings are complicated by the complex nature of the clustering space. Finding an optimal approach to summarizing the posterior distribution of clusterings is a challenging task that needs further research.
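A minimal sketch of this post-processing step, with SciPy's hierarchical clustering standing in for SAS Proc Cluster and Proc Tree (the function names, and cutting the tree just below distance 1 to reproduce the maximum complete-linkage separation, are our own choices):

```python
# Hedged sketch: posterior pairwise probabilities and complete-linkage clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def pairwise_coclustering(samples):
    """samples: (S - B, T) integer cluster labels kept after burn-in."""
    n_kept, T = samples.shape
    P = np.zeros((T, T))
    for c in samples:                       # P_ij = #{samples with c_i == c_j}/(S - B)
        P += c[:, None] == c[None, :]
    return P / n_kept

def complete_linkage_clusters(P):
    D = 1.0 - P                             # D_ij = 1 - P_ij
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="complete")
    return fcluster(Z, t=1.0 - 1e-9, criterion="distance")
```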

DATA ANALYSIS
Yeast cell cycle data
The utility of this procedure is illustrated by the analysis of the cell-cycle data described and analyzed by Cho et al. (1998). This data set has been routinely used to demonstrate the effectiveness of various clustering approaches (Tamayo et al., 1999; Yeung et al., 2001a,b; Tavazoie et al., 1999; Lukashin and Fuchs, 2001). The complete data set was downloaded from http://genomics.stanford.edu/. Data were processed in a similar fashion as described by Tamayo et al. (1999); 893 genes that showed significant variation in both cell cycles were identified. Data were log-transformed and normalized separately for each cell cycle. Clusters of similar expression profiles were identified based on 100 000 Gibbs sampler generated clusterings after 100 000 'burn-in' cycles. The distribution of the posterior pairwise distances is shown in Figure 1a. Approximately 75% of all posterior pairwise distances were equal to 1. In other words, the majority of posterior probabilities of two profiles being generated by the same underlying Gaussian distribution were equal to zero.

Before we created clusters of similar profiles, we removed 'outlying' profiles, defined to be those profiles for which all posterior pairwise probabilities were less than 0.5. Thirty 'outlying' profiles were identified and removed from the analysis (the list and the plot are given at the supporting web page). Clusters were generated by separating profiles into groups with the maximum possible complete linkage distance. That is, for any pair of clusters formed, there was at least one profile in the first cluster that had a zero posterior pairwise probability with at least one profile in the second cluster. The twelve clusters in Figure 2, with a total of 302 expression profiles, were selected from the total of 43 clusters generated, based on their periodicity. An additional five outlying profiles were identified separately as correlating with the cell-cycle stages (see the additional material). These profiles represent five groups of cell cycle related genes identified by both Cho et al. (1998) and Tamayo et al. (1999). Further examination of the profiles identified in our analysis reveals that, out of the 389 genes identified by Cho et al. (1998) as peaking at a specific portion of the cell cycle, 251 also satisfied our selection criteria. Of these 251 genes, 201 were identified by our analysis to be associated with cell-cycle events. An additional nine genes out of the 50 that were not identified by our analysis showed a significant association with genes in Cluster 43 (see the supplemental material), meaning that their average pairwise posterior probability with genes in this cluster was greater than 0.1. Two other genes from this group had a significant association with one of the 12 clusters in Figure 2.


[Figure 1: four histograms of pairwise distance distributions; panels (a)–(d).]
Fig. 1. (a) The distribution of posterior pairwise distances. The vertical line represents the maximum posterior distance observed by analyzing the bootstrapped data set. (b) The distribution of posterior pairwise distances observed by analyzing the bootstrapped data set. (c) The distribution of pairwise Euclidean distances. The vertical line denotes the minimum pairwise Euclidean distance in the bootstrapped data set. (d) The distribution of pairwise Euclidean distances in the bootstrapped data set.

[Figure 2: expression level versus time (0–150 min, cell-cycle phases G1, S, G2 and M marked) for Clusters 1–12; cluster sizes N = 14, 3, 86, 37, 12, 23, 19, 22, 31, 9, 17 and 29.]
Fig. 2. Twelve cell-cycle associated clusters out of the total of 43 clusters identified.


[Figure 3: panels (a)–(c); number-of-clusters difference traces and a pairwise-probability scatter plot with R² = 0.99.]
Fig. 3. Mixing properties of the Gibbs sampler. (a) The path of the difference in the number of clusters in 200 000 generated samples for two independent runs of the Gibbs sampler. The first Gibbs sampler was initialized by assuming that each profile was generated by a separate mixture component and the second Gibbs sampler was initialized with the number of mixture components set to 1. (b) Same as (a) except that only the first 1000 samples are shown. (c) Scatter plot of posterior pairwise probabilities for these two independent runs of the Gibbs sampler (R² = 0.99).

Although Cluster 43 does exhibit cell-cycle associated behavior, due to the large variability it was not selected as being cell-cycle related. Adding this cluster would increase the number of genes implied to be cell-cycle related to a total of 355. While visual inspection of the other generated clusters reveals obvious similarities between expression profiles, they do not seem to be correlated with cell-cycle events. Plots of all 43 clusters can be viewed at the supporting web page. By a closer examination of the 106 genes that were identified by our analysis but not by the initial examination performed by Cho et al. (1998), we established that the majority of them are functionally classified, according to the MIPS database (Mewes et al., 1999), in categories previously shown to be associated with the cell cycle (Table 1).

This shows that our analysis is capable of providing additional insight not accessible by a visual inspection of hierarchically ordered data as performed by Cho et al. (1998). The complete list of these 106 genes, along with their functional classification, is included in the supplemental material.

Convergence of the Gibbs sampler
We assessed the appropriateness of our 'burn-in' period by comparing the posterior distributions obtained from two independent runs of the Gibbs sampler (Figure 3). The first Gibbs sampler was initialized with the number of mixture components set to one, while the second Gibbs sampler was initialized by assuming that each profile was generated by a separate mixture component. As can be seen from Figures 3a and b, the number-of-clusters parameter ($Q$) quickly converged to the same stationary distribution. Its posterior distribution, as well as the posterior distributions of the other model parameters generated by the two runs, were indistinguishable after the 'burn-in' period we used (100 000 samples) (data not shown). Furthermore, the posterior pairwise probabilities, our key parameters used for creating clusters, correlated extremely well between the two independent runs (Figure 3c). These results led us to believe that the 'burn-in' period we used was appropriate. In this paper we based our clustering on the data from a single run. An alternative approach would be to perform several independent runs and average the results. The potential benefits of running several shorter runs instead of a single long Gibbs sampler, as well as the advantages of using a potentially more efficient Gibbs sampling scheme (e.g. Jain and Neal, 2000; MacEachern, 1994), need to be further investigated.
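The between-run agreement summarized in Figure 3c amounts to a one-line computation; a sketch (our own helper, assuming `P1` and `P2` are the pairwise probability matrices from the two runs):

```python
# Hedged sketch: R^2 between pairwise probabilities from two independent runs.
import numpy as np

def runs_r_squared(P1, P2):
    iu = np.triu_indices_from(P1, k=1)      # each unordered pair counted once
    return np.corrcoef(P1[iu], P2[iu])[0, 1] ** 2
```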

Importance of Model Averaging
To demonstrate the importance of model-averaging in the process of identifying groups of similar profiles, we performed a finite mixture analysis of the cell cycle data. To do this we used the MCLUST software described by Fraley and Raftery (1999). In concordance with our infinite mixture model, we used the unequal-volume spherical model. We first identified the optimal number of clusters using the Bayesian Information Criterion (BIC) (Schwarz, 1978). A plot of BICs for models having between one and 100 clusters is shown in Figure 4a. The model with 23 clusters had the highest BIC and was assumed to be 'optimal'. In the finite mixture approach, the model parameters are first estimated by the maximum likelihood principle. Based on these estimates, clusters are then formed according to the posterior probabilities of individual profiles being generated by a particular mixture component (McLachlan and Basford, 1987; Fraley and Raftery, 1999; Yeung et al., 2001a). The results of this analysis were compared to the results of our infinite mixture model, the goal being to compare the reliability of the clusterings obtained by the two approaches. In this respect, we posed the following question: how robust are the probabilities of concluding that two expression profiles are generated by the same probability distribution (i.e. that they will cluster together) in the two approaches?
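A sketch of this model-selection step, using scikit-learn's spherical Gaussian mixture as a stand-in for MCLUST (our substitution, not the software used in the paper). Note that scikit-learn reports BIC on the $-2\log L$ scale, so minimizing it corresponds to maximizing the BIC in the convention used above.

```python
# Hedged sketch: choose the number of spherical mixture components by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

def best_bic_model(x, q_max=100, seed=0):
    fits = [GaussianMixture(n_components=q, covariance_type="spherical",
                            random_state=seed).fit(x) for q in range(1, q_max + 1)]
    bics = [f.bic(x) for f in fits]          # lower is better in this convention
    return fits[int(np.argmin(bics))], bics
```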


Table 1. Functional classification of the 106 genes implicated by our analysis but not implicated in the original paper (number of new genes), compared to the functional categorization of the 387 genes implicated in the original paper (number of old genes). Since some of the genes belong to multiple categories, the totals in the table are larger than the total numbers of genes

Functional category                                     Number of new genes   Number of old genes
Cell cycle and DNA processing                                  16                   158
Cell fate                                                      11                    70
Cell rescue, defense and virulence                              7                    14
Cellular communication/signal transduction mechanism            1                     5
Cellular transport and transport mechanisms                     5                    23
Classification not yet clear-cut                                2                    13
Control of cellular organization                                1                    16
Energy                                                          3                    15
Metabolism                                                     20                    60
Protein fate (folding, modification, destination)               6                    18
Protein synthesis                                               0                     4
Regulation of/interaction with cellular environment             1                    16
Subcellular localisation                                       28                   196
Transcription                                                  10                    44
Transport facilitation                                          5                    14
Unclassified protein                                           51                    99

To assess the robustness of the finite mixture approach, we repeated the analysis assuming that the number of clusters is one less or one more than the number indicated by the BIC. That is, we fitted the finite mixture models with 22 and 24 mixture components and calculated the pairwise posterior probabilities of two profiles being clustered together for all pairs of profiles and for all three finite mixture models. Suppose that $p_{ik}^q$ is the posterior probability that the $i$th profile is generated by the $k$th mixture component (McLachlan and Basford, 1987) in the model with $q$ mixture components. Then the pairwise posterior probability of two profiles (profile $i$ and profile $j$) being generated by the same mixture component can be estimated as

$$P_{ij}^q = \sum_{k=1}^{q} p_{ik}^q\, p_{jk}^q.$$
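In matrix form this is simply $P^q = p\,p^{\mathsf T}$ for the $T \times q$ membership matrix $p$; a sketch (the `predict_proba`-style input is our assumption):

```python
# Hedged sketch: pairwise co-clustering probabilities for a finite mixture.
import numpy as np

def finite_mixture_pairwise(p):
    """p: (T, q) posterior membership probabilities p^q_ik; returns (T, T) P^q."""
    return p @ p.T                           # P^q_ij = sum_k p^q_ik p^q_jk
```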

Scatter plots of the pairwise probabilities for the mixture models with 22 and 24 mixture components against the model with 23 mixture components, along with the corresponding coefficients of linear determination ($R^2$), are shown in Figures 4b and c. It is obvious from these plots that there is a strong dependence of these probabilities on the number of mixture components used in the analysis. A small deviation in the number of clusters can have a strong influence on the quality of the clusters as well as on the estimated confidence in the clustering. That is, the results from the finite mixture approach can be highly non-robust to the selected 'optimal' number of clusters. This means that the correct specification of the number of clusters is paramount in the finite mixture approach, which would typically be difficult in practice without a substantial amount of prior information to justify such a specification.

When the number of mixture components is averaged out, as it is in the infinite mixture approach, the posterior distribution of clusterings, and consequently the clusters created based on this distribution, is robust with respect to the specification of the prior number of clusters. Doubling or halving (setting $\alpha = 2$ or 0.5) the mean prior number of clusters (Escobar, 1994) does affect the posterior distribution of the number of clusters (Figure 4d) but has little effect on the posterior pairwise probabilities (Figures 4e and f).

Although the regularity conditions required for the BIC to be asymptotically correct are not satisfied in the finite mixture model, empirical studies have shown that the criterion works quite well in identifying the correct number of mixture components (Biernacki and Govaert, 1999). Our goal here was not to propose a better approach to obtaining the number of clusters in the data. However, we believe that there are shortcomings of the finite mixture approach that are remedied by the infinite mixture approach. In the cluster analysis based on the finite mixture model, the calculated confidence in a particular clustering does not take into account the uncertainties related to the choice of the right number of clusters based on the BIC criterion. Consequently, the results are valid only under the model where a specific number of clusters is assumed known. On the other hand, when there is uncertainty about the number of clusters, as is often the case in practice, we demonstrated that the ensuing results can be highly sensitive to the chosen number of clusters. These are compelling reasons, in our view, for circumventing the whole process of identifying the number of clusters altogether by integrating it out, as is done in the infinite mixture approach.


[Figure 4: panels (a)–(f); BIC curve and pairwise-probability scatter plots with R² = 0.74 (b), 0.59 (c), 0.97 (e) and 0.99 (f).]
Fig. 4. Effect of changes in the assumed number of mixture components (clusters) for finite and infinite mixture models. (a) BIC criterion for the number of clusters ranging from 1 to 100. (b) Scatter plot of pairwise probabilities of two profiles being clustered together for the 'optimal' model with 23 mixture components versus the model with 22 components. (c) Scatter plot of pairwise probabilities of two profiles being clustered together for the 'optimal' model with 23 mixture components versus the model with 24 components. (d) Posterior distribution of the number of mixture components for three different levels of α. (e) Scatter plot of posterior pairwise probabilities of two profiles being clustered together for α = 1 versus α = 0.5. (f) Scatter plot of posterior pairwise probabilities of two profiles being clustered together for α = 1 versus α = 2.


Simulated Replicates data
We performed a simulation study to assess the advantage of using the full replicates model (VII) in situations where the experimental variability varies from gene to gene. Data were simulated to represent a two-cluster situation with sixty observations from Gaussian distributions in each cluster. The number of replicates was two or ten, and 1000 data sets were simulated for each of the two- and ten-replicates situations. Following the structure of the model (VII), data were generated in two stages.

Stage 1: 120 mean expression profiles $(y_1, \ldots, y_{120})$ were generated. Sixty of them $(y_1, \ldots, y_{60})$ were randomly sampled from the Gaussian distribution with mean 0 and variance 0.005, and the other 60 $(y_{61}, \ldots, y_{120})$ from the Gaussian distribution with mean 1.5 and the same variance. The between-replicates variances for each mean expression profile $(\psi_1^2, \ldots, \psi_{120}^2)$ were randomly sampled from the Inverse Gamma distribution with shape parameter 2 and scale parameter 0.4 for the two-replicates scenario, and with shape parameter 2 and scale parameter 2 for the ten-replicates scenario. These parameters reflect a scenario in which the majority of the between-profiles variability within the same cluster is due to the experimental between-replicates variability, and the total variability in the averaged profiles ($x_{i\bullet}$) was comparable for the two- and ten-replicates situations.

Stage 2: For each mean expression profile $y_i$, $i = 1, \ldots, 120$, $G = 2$ or 10 expression profile replicates $(x_{i,1}, \ldots, x_{i,G})$ were randomly sampled from the Gaussian distribution with mean $y_i$ and variance $\psi_i^2$.
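This two-stage simulation is easy to reproduce; a sketch under our own assumptions (the profile length M is not stated in the text, so it is arbitrary here, and the inverse-gamma draws use the shape/scale convention):

```python
# Hedged sketch of the two-stage replicate simulation described above.
import numpy as np

def simulate_replicates(G=2, M=8, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: 120 mean profiles, 60 per cluster, centered at 0 and 1.5
    y = np.concatenate([rng.normal(0.0, np.sqrt(0.005), (60, M)),
                        rng.normal(1.5, np.sqrt(0.005), (60, M))])
    shape, scale = (2.0, 0.4) if G == 2 else (2.0, 2.0)
    psi2 = scale / rng.gamma(shape, 1.0, 120)   # Inverse Gamma(shape, scale) variances
    # Stage 2: G replicate profiles per gene around its mean profile
    x = rng.normal(y[:, None, :], np.sqrt(psi2)[:, None, None], (120, G, M))
    return x, psi2
```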

The clustering procedure based on the full model showed an improvement in the ability to correctly separate observations from different clusters. The difference is particularly evident when the between-replicates variability is high. The scatter plots in Figure 5 show the observed differences between the average posterior pairwise probabilities for a profile and the rest of the profiles in the same cluster and in the opposite cluster ($R_i$, $i = 1, \ldots, 120$) for all 1000 simulated samples.


[Figure 5: panels (a)–(f); scatter plots of R_i against profile position, with the cluster centers marked at 0 and 1.5.]
Fig. 5. Scatter plot of the differences between average posterior pairwise probabilities for a profile with the rest of the profiles in the same cluster and in the opposite cluster ($R_i$, $i = 1, \ldots, 120$) for all 1000 simulated data sets. Red dots represent the full model-based analysis and green dots represent the analysis based on the averaged profiles. Vertical lines denote the two cluster centers at 0 and 1.5. (a) Profiles with a below-median between-replicates standard deviation. (b) Profiles with a between-replicates standard deviation between the 50th and 95th percentiles. (c) Profiles with a between-replicates standard deviation above the 95th percentile. (d), (e) and (f) are the same as (a), (b) and (c), respectively, but for the two-replicates situation.

That is, the $R_i$ are defined as:

$$R_i = \sum_{j=1}^{60} \frac{P_{ij}}{60} - \sum_{j=61}^{120} \frac{P_{ij}}{60}, \quad \text{for } i = 1, \ldots, 60, \text{ and}$$

$$R_i = \sum_{j=61}^{120} \frac{P_{ij}}{60} - \sum_{j=1}^{60} \frac{P_{ij}}{60}, \quad \text{for } i = 61, \ldots, 120.$$
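Given the pairwise probability matrix $P$ for the 120 simulated genes, the scores $R_i$ follow directly; a sketch (the helper name is our own):

```python
# Hedged sketch: separation scores R_i for the two-cluster simulation design.
import numpy as np

def separation_scores(P):
    """P: (120, 120) posterior pairwise probabilities; returns R_1..R_120."""
    own = np.empty(120); other = np.empty(120)
    own[:60] = P[:60, :60].mean(axis=1)      # average P_ij within genes 1-60
    other[:60] = P[:60, 60:].mean(axis=1)    # average P_ij against genes 61-120
    own[60:] = P[60:, 60:].mean(axis=1)
    other[60:] = P[60:, :60].mean(axis=1)
    return own - other                       # R_i > 0 favors correct clustering
```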

The more positive this difference is, the more likely it is that the profile will be correctly clustered. The more negative this difference is, the more likely it is that the profile will be misclassified.

As the between-replicates variability increases, the difference in the performance of the full replicates model versus the simple model becomes more obvious. For the below-median variable profiles in the ten-replicates situation (Figure 5a), both methods perform similarly, with the full model showing somewhat higher confidence in the correct clustering. For variabilities between the median and the 95th percentile of all variabilities (Figures 5b and e), the full model still behaves as expected, while the probability of correctly clustering profiles that are relatively far from the cluster centers using the simple model is seriously diminished (circled observations). This trend is evident in the two-replicates situation even in the lowest variability group (Figure 5d). In the extreme variability situations (Figures 5c and f), in addition to this trend, there is also an increased chance of incorrectly classifying profiles that lie between or close to the two cluster centers. These observations suggest that, when experimental replicates are available, there is a clear advantage in using the full model instead of a simple model with averaged data.

DISCUSSION
Assuming a few basic principles, the Bayesian approach is the optimal approach to drawing uncertain conclusions from noisy data (Baldi and Brunak, 1998). The utility of the Bayesian approach in computational biology has been demonstrated in a multitude of situations, ranging from the analysis of biological sequences (Liu and Lawrence, 1999) to the identification of differentially expressed genes (Newton et al., 2001; Baldi and Long, 2001; Long et al., 2001; Lonnstedt and Speed, 2001). In this article we developed a model-based clustering procedure utilizing the Bayesian paradigm for knowledge discovery from noisy data. We postulated a reasonable probabilistic model for the generation of gene expression profiles and performed clustering by estimating the posterior distribution of clusterings given the data. The clustering and the measures of uncertainty about the clustering are assessed by model averaging across models with all possible numbers of clusters. In this respect, the model-averaging approach represents a qualitative shift in the problem of identifying groups of co-expressed genes.

Previously, Schmidler et al. (2000) used the Bayesian model-averaging approach to estimate probabilities of a particular secondary structure after averaging over all possible segmentations. Furthermore, Zhu et al. (1998) used a similar approach to estimate distances between a pair of sequences after averaging over all possible alignments. In both situations, taking into account the uncertainties in the model selection proved beneficial for the overall performance of the computational procedure. Precise quantification of the uncertainties related to the clustering of expression profiles is especially important when the obtained clustering is used as the starting point for identifying common regulatory motifs for genes that cluster together (e.g. Tavazoie et al., 1999). The inclusion in such an analysis of genes that have a low probability of actually belonging to a cluster is likely to seriously undermine one's ability to detect relevant commonalities in the regulatory regions of clustered genes. One appealing way of taking into account uncertainties about the created clusters when identifying common regulatory sequences is to simultaneously model both gene expression and corresponding promoter sequence data. Holmes and Bruno (2000) recently proposed a 'sequence-expression' model that incorporates both the Bayesian finite mixture model for expression data and the sequence model of Lawrence et al. (1993) for promoter sequence data. Extending this model to incorporate model-averaging over models with different numbers of clusters and to incorporate experimental replicates seems to be a logical next step in developing such models.

The procedure we used to create clusters based on the estimated posterior distribution of clusterings is the usual complete linkage nearest-neighbor approach, with pairwise posterior distances between two profiles as the distance measure (Everitt, 1993). However, in contrast to usual distance measures (e.g. the Euclidean distance) that are based only on two profiles at a time, our measure incorporates information from the whole data set in assessing the distance between any two profiles. The effect of such 'information pulling' is demonstrated by comparing the distribution of our posterior distances to the distribution of the commonly used Euclidean distances (e.g. Lukashin and Fuchs, 2001) calculated for the same data set (Figure 1c). The high concentration of posterior distances around the maximum distance (equal to 1) allows for an easy identification of unrelated profiles. On the other hand, the distribution of the corresponding Euclidean distances is rather smooth throughout the range and there is no obvious separation of clearly unrelated genes. This point is further emphasized by observing the distributions of these two distance measures for the structure-less data set obtained by bootstrapping (Efron and Tibshirani, 1998) from the original data set (Figures 1b and d). The distribution of posterior pairwise distances is drastically different for the bootstrapped data set when compared to the original data, and it indicates that all data in the bootstrapped set form a single cluster. In contrast, the difference between the distributions of Euclidean distances for the two data sets is rather subtle, and it is unclear what distance indicates a high-confidence separation between two profiles. Furthermore, posterior pairwise probabilities are themselves meaningful measures of uncertainty about the similarity of two expression profiles, and they can be used to assess confidence and uncertainty about clusters and the membership of individual profiles in a cluster. On the other hand, usual distance measures such as the Euclidean distance and similarity measures such as the correlation coefficient are meaningful only in comparison to their distribution under the no-structure assumption (Lukashin and Fuchs, 2001; Herrero et al., 2001).

The issue of performing experimental replicates to model various sources of experimental variability will likely play an increasingly important role in modeling microarray data. Currently, information about experimental variability is commonly used when the goal of the analysis is to identify differentially expressed genes. The results of our simulation study show that using this information when clustering expression data is likely to improve the results of the analysis. Failing to take into account differences in the precision with which expression levels of different genes are measured can result both in a failure to identify existing relationships and in inferring non-existing ones.

A major assumption of the mixture model, finite or infinite, is that, given the classification variables and the parameters of the individual mixture components, expression profiles for individual genes are independently distributed as multivariate Gaussian random variables. It has been previously shown (Yeung et al., 2001a) that the normality assumption is rather reasonable for properly transformed microarray data sets. In our own experience with both Affymetrix oligonucleotide arrays and two-color cDNA microarrays, after a proper normalization, the distribution of log-transformed data closely resembles Gaussian distributions with different variances for different genes (data not shown). The conditional independence assumption is more difficult to assess. Typically, in the process of normalizing microarray data, an attempt is made to remove systematic correlations that can occur due to the microarray-to-microarray variability that systematically affects all genes on the microarray (Yang et al., 2001). More subtle correlations due to, for example, cross-hybridizations resulting from sequence similarities are much more difficult to assess. Even if such correlations can be quantified, there is no obvious way to incorporate them in the traditional mixture model.

Finally, the model-based approach described here is a flexible platform for dealing with various other issues in the cluster analysis of expression profiles. In addition to the model-averaging and outlier detection that we demonstrated, the Gibbs sampler described here can be easily modified to handle incomplete profiles (profiles missing some data points). In such a situation, it is again important to take into account the additional source of variability introduced by imputing missing data points. At this point we are unaware of another clustering approach capable of dealing with all these problems at once.

ACKNOWLEDGEMENTS
This work was partially supported by NIEHS grants P30-ES06096 and P42-ES04908. We are also thankful for the very insightful comments and suggestions offered by the referees.

REFERENCES
Baggerly,K.A., Coombes,K.R., Hess,K.R., Stivers,D.N., Abruzzo,L.V. and Zhang,W. (2001) Identifying differentially expressed genes in cDNA microarray experiments. J. Comput. Biol., 8, 639–659.
Baldi,P. and Brunak,S. (1998) Bioinformatics: The Machine Learning Approach. The MIT Press, Cambridge, MA.
Baldi,P. and Long,A.D. (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17, 509–519.
Biernacki,C. and Govaert,G. (1999) Choosing models in model-based clustering and discriminant analysis. J. Statis. Comput. Simul., 64, 49–71.
Cho,R.J., Campbell,M.J., Winzeler,E.A., Steinmetz,L., Conway,A., Wodicka,L., Wolfsberg,T.G., Gabrielian,A.E., Landsman,D., Lockhart,D.J. and Davis,R.W. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2, 65–73.
Cowell,R.G., Dawid,P.A., Lauritzen,S.L. and Spiegelhalter,D.J. (1999) Probabilistic Networks and Expert Systems. Springer, New York.
D'haeseleer,P., Liang,S. and Somogyi,R. (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16, 707–726.
Efron,B. and Tibshirani,R.J. (1998) An Introduction to the Bootstrap. Chapman & Hall/CRC, Boca Raton.
Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
Escobar,M.D. (1994) Estimating normal means with a Dirichlet process prior. J. Amer. Stat. Assoc., 89, 268–277.
Escobar,M.D. and West,M. (1995) Bayesian density estimation and inference using mixtures. J. Amer. Stat. Assoc., 90, 577–588.
Everitt,B.S. (1993) Cluster Analysis. Edward Arnold, London.
Ferguson,T.S. (1973) A Bayesian analysis of some nonparametric problems. Ann. Stat., 1, 209–230.
Fraley,C. and Raftery,A.E. (1999) MCLUST: software for model-based cluster analysis. Journal of Classification, 16, 297–306.
Gelfand,A.E. and Smith,A.F.M. (1990) Sampling-based approaches to calculating marginal densities. J. Amer. Stat. Assoc., 85, 398–409.
Herrero,J., Valencia,A. and Dopazo,J. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126–136.
Holmes,I. and Bruno,W.J. (2000) Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 202–210.
Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D., Kidd,M.J., King,A.M., Meyer,M.R., Slade,D., Lum,P.Y., Stepaniants,S.B., Shoemaker,D.D., Gachotte,D., Chakraburtty,K., Simon,J., Bard,M. and Friend,S.H. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
Ideker,T., Thorsson,V., Siegel,A.F. and Hood,L.E. (2000) Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol., 7, 805–817.
Jain,S. and Neal,R. (2000) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Technical Report No. 2003, Department of Statistics, University of Toronto.
Kerr,K.M. and Churchill,G.A. (2001) Experimental design for gene expression microarrays. Biostatistics, 2, 183–201.
Kerr,K.M., Martin,M. and Churchill,G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837.
Lawrence,C.E., Altschul,S.F., Boguski,M.S., Liu,J.S., Neuwald,A.F. and Wootton,J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208–214.
Liu,J.S. and Lawrence,C.E. (1999) Bayesian inference on biopolymer models. Bioinformatics, 15, 38–52.
Long,A.D., Mangalam,H.J., Chan,B.Y., Tolleri,L., Hatfield,G.W. and Baldi,P. (2001) Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J. Biol. Chem., 276, 19937–19944.
Lonnstedt,I. and Speed,T.P. (2001) Replicated microarray data. Statistica Sinica, 12, 31–46.
Lukashin,A.V. and Fuchs,R. (2001) Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics, 17, 405–414.
MacEachern,S.N. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation, 23, 727–741.
McLachlan,G.J. and Basford,K.E. (1987) Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Medvedovic,M. (2000) Clustering multinomial observations via finite and infinite mixture models and MCMC algorithms. Proceedings of the Joint Statistical Meeting 2000: Statistical Computing Section, pp. 48–51.


Mewes,H.W., Heumann,K., Kaps,A., Mayer,K., Pfeiffer,F., Stocker,S. and Frishman,D. (1999) MIPS: a database for protein sequences and complete genomes. Nucleic Acids Res., 27, 44–48.

Neal,R.M. (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249–265.

Newton,M.A., Kendziorski,C.M., Richmond,C.S., Blattner,F.R. and Tsui,K.W. (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J. Comput. Biol., 8, 37–52.

Rasmussen,C.E. (2000) The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 12, 554–560.

Rocke,D.M. and Durbin,B. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol., 8, 557–569.

SAS/STAT User's Guide (1999) SAS Institute, Cary, NC.

Schmidler,S.C., Liu,J.S. and Brutlag,D.L. (2000) Bayesian segmentation of protein secondary structure. J. Comput. Biol., 7, 233–248.

Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.

Tavazoie,S., Hughes,J.D., Campbell,M.J., Cho,R.J. and Church,G.M. (1999) Systematic determination of genetic network architecture. Nature Genet., 22, 281–285.

Wolfinger,R.D., Gibson,G., Wolfinger,E.D., Bennett,L., Hamadeh,H., Bushel,P., Afshari,C. and Paules,R.S. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol., 8, 625–637.

Yang,Y.H., Dudoit,S., Luu,P. and Speed,T. (2001) Normalization for cDNA microarray data. SPIE BiOS 2001, San Jose, California.

Yeung,K.Y., Fraley,C., Murua,A., Raftery,A.E. and Ruzzo,W.L. (2001a) Model-based clustering and data transformations for gene expression data. Bioinformatics, 17, 977–987.

Yeung,K.Y., Haynor,D.R. and Ruzzo,W.L. (2001b) Validating clustering for gene expression data. Bioinformatics, 17, 309–318.

Zhu,J., Liu,J.S. and Lawrence,C.E. (1998) Bayesian adaptive sequence alignment algorithms. Bioinformatics, 14, 25–39.

APPENDIX
Posterior probability distributions for the single replicate case

\[
p(\mu_j \mid \mathbf{c}, \sigma_j^2, X, \lambda, r) =
f_N\left(\mu_j \,\middle|\, \frac{r^{-1}\bar{x}_{j\bullet} + \frac{\sigma_j^2}{n_j}\lambda}{r^{-1} + \frac{\sigma_j^2}{n_j}},\
\frac{r^{-1}\,\frac{\sigma_j^2}{n_j}}{r^{-1} + \frac{\sigma_j^2}{n_j}}\, I\right)
\]

\[
p(\sigma_j^{-2} \mid X, M, \beta, w) =
f_G\left(\sigma_j^{-2} \,\middle|\, \frac{M n_j + \beta}{2},\ \frac{s_j^2 + \beta w}{2}\right)
\]

\[
f(\lambda \mid \mu_1, \ldots, \mu_Q, r) =
f_N\left(\lambda \,\middle|\, \frac{\sigma_x^2\,\frac{\sum_i \mu_i}{Q} + \frac{r^{-1}}{Q}\,\mu_x}{\sigma_x^2 + \frac{r^{-1}}{Q}},\
\frac{\sigma_x^2\,\frac{r^{-1}}{Q}}{\sigma_x^2 + \frac{r^{-1}}{Q}}\, I\right)
\]

\[
f(r \mid \mu_1, \ldots, \mu_Q, \lambda) =
f_G\left(r \,\middle|\, \frac{M Q + 1}{2},\ \frac{\sum_i (\mu_i - \lambda)(\mu_i - \lambda)' + \sigma_x^2}{2}\right)
\]

where

\[
\mu_x = \frac{\sum_{i=1}^{T} x_i}{T}, \qquad
\sigma_x^2 = \frac{\sum_{i=1}^{T} (x_i - \mu_x)(x_i - \mu_x)'}{T M - 1},
\]
\[
\bar{x}_{j\bullet} = \frac{\sum_{c_i = j} x_i}{n_j}, \qquad
s_j^2 = \sum_{c_i = j} (x_i - \mu_j)(x_i - \mu_j)'.
\]

\[
f(w \mid \sigma_1^{-2}, \ldots, \sigma_Q^{-2}, \beta) =
f_G\left(w \,\middle|\, \frac{Q\beta + 1}{2},\ \frac{\beta \sum_j \sigma_j^{-2} + \sigma_x^{-2}}{2}\right)
\]

\[
f(\beta \mid \sigma_1^{-2}, \ldots, \sigma_Q^{-2}, w) \propto
\Gamma\left(\frac{\beta}{2}\right)^{-Q}
\left(\frac{\beta}{2}\right)^{\frac{Q\beta - 3}{2}}
\exp\left(-\frac{\beta^{-1}}{2}\right)
\prod_j \left(w\,\sigma_j^{-2}\right)^{\frac{\beta}{2}}
\exp\left(-\frac{\sigma_j^{-2} w \beta}{2}\right).
\]
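To make the notation concrete, here is a minimal sketch (ours, not the authors' implementation) of how the two cluster-level updates above translate into code, assuming the spherical Gaussian parameterization of the model; f_N and f_G correspond to numpy's normal and gamma generators (numpy's gamma takes a shape and a scale, i.e. the reciprocal of the rate used above):

```python
import numpy as np

def sample_mu_j(x_j, sigma2_j, lam, r, rng):
    """Draw mu_j from its Normal conditional. x_j is the (n_j x M) matrix
    of profiles currently assigned to cluster j, lam the common prior
    mean, and r the prior precision (prior variance 1/r per coordinate)."""
    n_j, M = x_j.shape
    x_bar = x_j.mean(axis=0)
    a, b = 1.0 / r, sigma2_j / n_j
    mean = (a * x_bar + b * lam) / (a + b)
    var = a * b / (a + b)                 # posterior covariance is var * I
    return rng.normal(mean, np.sqrt(var))

def sample_inv_sigma2_j(x_j, mu_j, beta, w, rng):
    """Draw sigma_j^{-2} from its Gamma conditional with shape
    (M*n_j + beta)/2 and rate (s2_j + beta*w)/2."""
    n_j, M = x_j.shape
    s2_j = np.sum((x_j - mu_j) ** 2)      # pooled squared deviations
    return rng.gamma(0.5 * (M * n_j + beta), 2.0 / (s2_j + beta * w))
```

The conditional for β, in contrast, is not of any standard form, so that update requires a generic univariate step (for example adaptive rejection or slice sampling of log β).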

Posterior conditional probability distributions of parameters in the multiple replicates model

The prior distribution for the within profile variance is given by

\[
p(\psi_i^{-2}) = f_G\left(\psi_i^{-2} \,\middle|\, \frac{1}{2},\ \frac{\sigma_w^2}{2}\right)
\]

where

\[
\sigma_w^2 = \frac{\sum_{i=1}^{T} \sum_{k=1}^{G} (x_{ik} - \bar{x}_{i\bullet})(x_{ik} - \bar{x}_{i\bullet})'}{(G - 1)\, T M}
\qquad \text{and} \qquad
\bar{x}_{i\bullet} = \frac{\sum_{k=1}^{G} x_{ik}}{G}.
\]

The posterior conditional distributions for the parameters in the model with experimental replicates are:

\[
f(y_i \mid c_i = j, \mu_j, \sigma_j^2, \psi_i^2, x_{i1}, \ldots, x_{iG}) =
f_N\left(y_i \,\middle|\, \frac{\sigma_j^2\, \bar{x}_{i\bullet} + \frac{\psi_i^2}{G}\,\mu_j}{\sigma_j^2 + \frac{\psi_i^2}{G}},\
\frac{\frac{\psi_i^2}{G}\,\sigma_j^2}{\sigma_j^2 + \frac{\psi_i^2}{G}}\, I\right)
\]

\[
f(\psi_i^{-2} \mid y_i, \sigma_w^2, x_{i1}, \ldots, x_{iG}) =
f_G\left(\psi_i^{-2} \,\middle|\, \frac{G M + 1}{2},\ \frac{s_{iw}^2 + \sigma_w^2}{2}\right)
\]

where

\[
s_{iw}^2 = \sum_{k=1}^{G} (x_{ik} - y_i)(x_{ik} - y_i)',
\]
\[
s_j^2 = \sum_{c_i = j} (y_i - \mu_j)(y_i - \mu_j)', \qquad
\bar{y}_{j\bullet} = \frac{\sum_{c_i = j} y_i}{n_j},
\]

\[
p(\mu_j \mid \mathbf{c}, \sigma_j^2, Y, \lambda, r) =
f_N\left(\mu_j \,\middle|\, \frac{r^{-1}\bar{y}_{j\bullet} + \frac{\sigma_j^2}{n_j}\lambda}{r^{-1} + \frac{\sigma_j^2}{n_j}},\
\frac{r^{-1}\,\frac{\sigma_j^2}{n_j}}{r^{-1} + \frac{\sigma_j^2}{n_j}}\, I\right)
\]

\[
p(\sigma_j^{-2} \mid Y, M, \beta, w) =
f_G\left(\sigma_j^{-2} \,\middle|\, \frac{M n_j + \beta}{2},\ \frac{s_j^2 + \beta w}{2}\right).
\]

Conditional posterior distributions for other parameters remain unchanged.
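A matching sketch (ours, with the same conventions as the single-replicate sketch above) of the two replicate-specific updates, drawing the latent true profile y_i and the within-profile precision ψ_i^{-2}:

```python
import numpy as np

def sample_y_i(x_i, mu_j, sigma2_j, psi2_i, rng):
    """Draw the latent true profile y_i given its G replicates (x_i is a
    G x M matrix), the parameters of its current cluster j, and the
    within-profile variance psi2_i, per the Normal conditional above."""
    G, M = x_i.shape
    x_bar = x_i.mean(axis=0)
    a, b = sigma2_j, psi2_i / G
    mean = (a * x_bar + b * mu_j) / (a + b)
    var = a * b / (a + b)                 # posterior covariance is var * I
    return rng.normal(mean, np.sqrt(var))

def sample_inv_psi2_i(x_i, y_i, sigma2_w, rng):
    """Draw psi_i^{-2} from its Gamma conditional with shape (G*M + 1)/2
    and rate (s2_iw + sigma2_w)/2."""
    G, M = x_i.shape
    s2_iw = np.sum((x_i - y_i) ** 2)
    return rng.gamma(0.5 * (G * M + 1), 2.0 / (s2_iw + sigma2_w))
```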
