
Publication IX

Janne Nikkilä, Christophe Roos, Eerika Savia, and Samuel Kaski. Exploratory Modeling of Yeast Stress Response and its Regulation with gCCA and Associative Clustering. International Journal of Neural Systems, Special Issue on Bioinformatics, vol. 15, no. 4, pages 237–246, 2005. © 2005 World Scientific, http://www.worldscinet.com/ijns/ijns.shtml. Reprinted with permission.

August 17, 2005 9:50 00022

International Journal of Neural Systems, Vol. 15, No. 4 (2005) 237–246. © World Scientific Publishing Company

EXPLORATORY MODELING OF YEAST STRESS RESPONSE AND ITS REGULATION WITH gCCA AND ASSOCIATIVE CLUSTERING

JANNE NIKKILÄ
Laboratory of Computer and Information Science, Helsinki University of Technology
P.O. Box 5400, FI-02015 HUT, Finland
[email protected]

CHRISTOPHE ROOS
Medicel Oy, Huopalahdentie 24, FI-00350 Helsinki, Finland
[email protected]

EERIKA SAVIA
Laboratory of Computer and Information Science, Helsinki University of Technology
P.O. Box 5400, FI-02015 HUT, Finland
[email protected]

SAMUEL KASKI∗
Department of Computer Science, University of Helsinki
P.O. Box 68, FI-00014 University of Helsinki, Finland
and
Laboratory of Computer and Information Science, Helsinki University of Technology
P.O. Box 5400, FI-02015 HUT, Finland
[email protected]

We model dependencies between m multivariate continuous-valued information sources by a combination of (i) a generalized canonical correlation analysis (gCCA) to reduce dimensionality while preserving dependencies in m−1 of them, and (ii) summarizing dependencies with the remaining one by associative clustering. This new combination of methods avoids multiway associative clustering, which would require a multiway contingency table and hence suffer from the curse of dimensionality of the table. The method is applied to summarizing properties of yeast stress by searching for dependencies (commonalities) between expression of genes of baker's yeast Saccharomyces cerevisiae in various stressful treatments, and summarizing stress regulation by finally adding data about transcription factor binding sites.

Keywords: Associative clustering; canonical correlation analysis (CCA); exploratory data analysis; gene expression; yeast stress.

1. Introduction

Integration of multiple information sources is becoming increasingly important in bioinformatics. It has become evident that most of the important questions in molecular and cell biology cannot be answered by studying a single data source, like DNA sequence or gene expression. Due to the complexity and noise of biological systems, either lots of prior knowledge or data are required to constrain models, and single sources of measurements then suffice only for producing general overviews. In contrast, approaches that combine several relevant data sources have shown potential to answer specific biological questions.4,22

∗Corresponding author.

However, the works integrating heterogeneous data sets presented so far tend to be tailored to specific tasks. Their application to other problems is usually possible only after tedious tailoring by both methodological and application area experts. Typical examples of these methods are Bayes nets7 and inductive logic programming.17

We investigate machine learning methods that can both integrate multiple information sources and at the same time are generally applicable to a set of problems of a certain kind. We assume that the multiple data sets are multivariate and continuous-valued. The problem is to find what is common in the data sets, commonality being defined as those properties of the data sets that are consistently (statistically) dependent. The question is then how to utilize this kind of multiple information in a principled way to search for dependencies with minimal assumptions about their nature. Specifically, we study the yeast stress response on the gene expression level, and its regulation by a set of proteins called transcription factors.21

The yeast stress response has been studied intensively during recent years.5,9 A group of genes appears to be consistently affected across various stress treatments, and this set has been called the common environmental response (CER) or environmental stress response (ESR) genes. Such a stress response is practically the only way for the simple yeast to respond to various adverse environmental conditions, and because it is so easy to elicit, it has been used as a paradigm to study gene regulatory networks. Even this response is far from being completely understood, however; different studies do not even agree on the sets of stress genes.

In practice we have available a set of gene expression profiles, that is, a time series of expression for each gene, for each stress treatment. The common environmental response should be visible as some properties that are unknown but common to all treatments. We assume here that the dependencies between our gene expression data sets are rather simple, perhaps even linear, since the data sets are relatively low-dimensional and all the treatments should induce rather similar expression patterns in yeast. Based on this, we assume that the interesting information between the expression data sets can be found by searching for global common variation between the data sets.

We will maximize the variation that is common to the stress treatment data sets and try to neglect all the other variation in the gene expression profiles. This problem can be solved by a form of generalized canonical correlation analysis.1 Its interpretation as a mutual information-maximizing method further justifies the use of that specific variant of generalized canonical correlation analysis.

To explore regulation of the stress response, we further search for commonalities with data about how likely different transcription factors (TFs), regulators of gene expression, are to bind to the promoter regions of the genes. If a set of genes has commonalities in their expression patterns across stress treatments, and furthermore commonalities in the binding patterns, they are likely to be stress response genes regulated in the same way.

This kind of dependency exploration can be done by maximizing the mutual information, or preferably its finite-data variant, between the data sets. We will use a previously introduced method, called associative clustering (AC),14 which clusters two data sets by maximizing the dependency between the clusterings, and hence should suit the task perfectly. In practice, a linear method scales better to multiple data sets, and hence we use generalized canonical correlations as a preprocessing method to reduce the number of data sets.

Clustering methods that maximize mutual information between the data sets have been formalized in the information bottleneck framework,8,25 directly applicable to discrete data, and extended to continuous vectorial data.3,13 Associative clustering can be viewed as an extension of the current information bottleneck algorithms to finite sets of continuous-valued data.

The methods of clustering with constraints are also close in spirit to AC; there the clustering is supervised by constraining pairs of samples to belong (or not to belong) to the same cluster.2,26 This will lead to similar results as IB-type clustering if samples from the same class are constrained to belong to the same cluster. Common to AC is that this can be interpreted as one data set (the constraints) supervising the other. In AC the two data sets supervise each other, in the sense that the goal is to find dependencies between them.


Graphical models of dependencies between variables are another related popular formalism. They have been applied to modeling the regulation of gene expression.7 The main practical difference from our clustering approach, which makes the models complementary, is that our clusterings are intended to be used as general-purpose, data-driven but not data-specific, exploratory tools. Associative clustering is a multivariate data analysis tool that can be used in the same way as standard clusterings to explore regularities in data. The findings can then be dressed as prior beliefs or first guesses in the more structured graphical models, hopefully helping to restrict the hopelessly large search space of possible model structures.

The technical difference from standard clusterings, as well as from standard graphical models, is that our objective function is maximization of the dependency between the data sets. Instead of modeling all variability in the data, the models focus on those aspects that are common to the different data sets, in the sense of being consistently dependent. This fits the present application perfectly.

2. Methods

2.1. Dependencies by mutual information

Having observed for one set of objects (genes) several multivariate variables V1, . . . , VM, X (stress treatments and TF-binding), forming several data sets, our aim is to find clusters of genes that maximize the mutual information I(V1; . . . ; VM; X), or its finite-data version, that is, the Bayes factor. In principle, AC could be extended to search for a multiway contingency table between the multiple data sets, but this would lead to severe estimation problems with a finite data set.

Noting that

$$I(V_1; \ldots; V_M; X) = I(V_1; \ldots; V_M) + I((V_1, \ldots, V_M); X),$$

we propose a sequential approximation: first approximate I(V1; . . . ; VM) by forming the optimal representation Y(V1, . . . , VM) with gCCA, then maximize I(Y; X) with AC. In this way we can reduce our problem to the AC of two variables and, in a sense, preserve dependencies between all the data sets in a computationally tractable way. Additionally, note that we are here specifically interested in clusters of genes, which justifies the use of AC instead of using only gCCA, which merely produces a projection of the data.
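The decomposition above can be checked numerically. The following sketch (toy discrete distribution, not the authors' code) computes the multi-information, the sum of marginal entropies minus the joint entropy, directly for both sides of the identity:

```python
# Numerical check of I(V1;...;VM;X) = I(V1;...;VM) + I((V1,...,VM);X)
# using the multi-information (total correlation) of a random discrete joint.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))          # toy joint over (V1, V2, X), all binary
p /= p.sum()

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

def total_correlation(q):
    # sum of one-axis marginal entropies minus the joint entropy
    axes = range(q.ndim)
    marg = [q.sum(axis=tuple(a for a in axes if a != i)) for i in axes]
    return sum(H(m) for m in marg) - H(q)

lhs = total_correlation(p)             # I(V1; V2; X)
i12 = total_correlation(p.sum(axis=2)) # I(V1; V2) from the (V1, V2) marginal
i_12_x = total_correlation(p.reshape(4, 2))  # I((V1,V2); X), pair as one variable
print(np.isclose(lhs, i12 + i_12_x))   # True
```

The identity is exact, so the check holds to floating-point precision for any joint distribution.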

2.2. Generalized canonical correlation analysis

We focus on the variation that is common to two or more of the M data sets and are willing to lose information that is due to variables within one data set only. For this purpose Canonical Correlation Analysis (CCA) is a natural choice. While Principal Component Analysis (PCA) works with a single random vector and maximizes the variance of projections of the data, CCA works with a pair of random vectors and maximizes the correlation between sets of projections. While PCA leads to an eigenvalue problem, CCA leads to a generalized eigenvalue problem.

There are several ways to generalize canonical correlation analysis to more than two sets of variables.1,16 Here we use a generalization of CCA (gCCA) that has a simple connection to mutual information.1 The gCCA problem can be written as a generalized eigenproblem

$$C\xi = \lambda D\xi, \qquad (1)$$

where C is the covariance matrix of the concatenated data and D is a block-diagonal matrix that consists of the within-set covariance matrices Ci of the individual data sets.

In effect, gCCA whitens the covariance matrices within each data set and then performs PCA to seek the largest variance in the between-data-set covariances.
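The equivalence between the generalized eigenproblem of Eq. (1) and whitening-plus-PCA can be demonstrated on made-up data (a sketch, not the authors' implementation; `scipy.linalg.eigh` solves the generalized problem):

```python
# Eq. (1), C xi = lambda D xi, versus whitening each set and running PCA.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
X1 = rng.standard_normal((500, 3))
X2 = X1 @ rng.standard_normal((3, 4)) + 0.5 * rng.standard_normal((500, 4))
X = np.hstack([X1, X2])                  # concatenated data, M = 2 sets

C = np.cov(X, rowvar=False)              # covariance of the concatenation
D = np.zeros_like(C)                     # block-diagonal within-set covariances
D[:3, :3] = np.cov(X1, rowvar=False)
D[3:, 3:] = np.cov(X2, rowvar=False)

# (i) the generalized eigenproblem of Eq. (1)
lam = eigh(C, D, eigvals_only=True)

# (ii) whiten within each set (multiply by D^{-1/2}), then ordinary PCA
dvals, dvecs = np.linalg.eigh(D)
W = dvecs @ np.diag(dvals ** -0.5) @ dvecs.T   # D^{-1/2}
lam2 = np.linalg.eigvalsh(W @ C @ W)

print(np.allclose(np.sort(lam), np.sort(lam2)))  # True
```

The two eigenvalue spectra coincide, which is the sense in which gCCA equals within-set whitening followed by PCA.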

2.3. Information-theoretic interpretation

The gCCA projection can also be interpreted from an information-theoretic point of view. Assuming that the variables are normally distributed, there is a simple connection between mutual information and CCA18:

$$I(V_1; V_2) = -\frac{1}{2}\ln\left(\frac{\det C}{\det C_1 \det C_2}\right) = -\frac{1}{2}\ln\prod_i \lambda_i = -\frac{1}{2}\ln\prod_i (1-\rho_i^2), \qquad (2)$$


where the λi are the eigenvalues of Eq. (1) and the ρi are the canonical correlations. For multiple data sets the equation generalizes to

$$I(V_1; \ldots; V_M) = \sum_{i=1}^{M} H(V_i) - H(V_1, \ldots, V_M) = -\frac{1}{2}\ln\frac{\det C}{\det C_1 \cdots \det C_M}. \qquad (3)$$
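Equation (2) can be verified numerically for a toy Gaussian covariance (a sketch, not from the paper; the pairing of generalized eigenvalues as 1 ± ρi is used to recover the canonical correlations):

```python
# Check: -1/2 ln(det C / (det C1 det C2)) equals -1/2 sum_i ln(1 - rho_i^2).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
C = A @ A.T + 6 * np.eye(6)        # full covariance, two sets of size 3 + 3
C1, C2 = C[:3, :3], C[3:, 3:]

mi_det = -0.5 * np.log(np.linalg.det(C) / (np.linalg.det(C1) * np.linalg.det(C2)))

# canonical correlations from the generalized eigenproblem C xi = lambda D xi;
# for two sets the eigenvalues come in pairs 1 + rho_i and 1 - rho_i
D = np.zeros_like(C)
D[:3, :3], D[3:, 3:] = C1, C2
lam = eigh(C, D, eigvals_only=True)
rho = np.sort(lam)[-3:] - 1.0      # the top three eigenvalues are 1 + rho_i
mi_cca = -0.5 * np.sum(np.log(1.0 - rho ** 2))

print(np.isclose(mi_det, mi_cca))  # True
```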

We now show how the whitening of the original data sets preserves the mutual information. Using the notation of Eq. (1), such a whitening can be written as

$$V' = D^{-1/2}V. \qquad (4)$$

The covariance matrix C′ of the transformed variable V′ is

$$C' = E[D^{-1/2}VV^T D^{-1/2}] = D^{-1/2}CD^{-1/2}, \qquad (5)$$

so the mutual information is preserved in the transformation:

$$I(V'_1; \ldots; V'_M) = -\frac{1}{2}\ln\det C' = -\frac{1}{2}\ln\frac{\det C}{\det C_1 \cdots \det C_M} = I(V_1; \ldots; V_M). \qquad (6)$$

Moreover, after data-set-wise whitening the mutual information depends only on the joint entropy, which can be shown as follows. In general, mutual information can be written as

$$I(V_1; \ldots; V_M) = \sum_{i=1}^{M} H(V_i) - H(V_1, \ldots, V_M). \qquad (7)$$

The entropy of an individual data set Vi with dimensionality di is in general

$$H(V_i) = -\int p(v_i)\ln p(v_i)\,dv_i = \frac{d_i}{2}\bigl(\ln(2\pi)+1\bigr) + \frac{1}{2}\ln\det C_i. \qquad (8)$$

After whitening of the within-data covariances the covariance-dependent term of the entropy vanishes,† giving $H(V'_i) = \frac{d_i}{2}(\ln(2\pi)+1)$. The mutual information $I(V'_1; \ldots; V'_M)$ is now

$$I(V'_1; \ldots; V'_M) = \text{const.} - H(V'_1, \ldots, V'_M), \qquad (9)$$

which intuitively means that all the mutual information is now represented by the joint entropy plus a constant.

Our goal is to make a dimensionality reduction that maximally preserves the mutual information $I(V'_1; \ldots; V'_M)$. Thus, according to Eq. (9) the best approximation is the one that maximally preserves the entropy $H(V'_1, \ldots, V'_M)$. The dimensionality is reduced by sequentially searching for the one-dimensional projection that best approximates the original d-dimensional variable. We thus seek the one-dimensional projection $V'' = a^T V'$, with $a^T a = 1$, that maximizes the entropy of the projection, $H(V'')$. The entropy of the projected variable $V''$ is

$$H(V'') = \frac{1}{2}\ln \operatorname{var} V'' + \text{const.}, \qquad (10)$$

where

$$\operatorname{var} V'' = E[a^T V'(V')^T a] = a^T C' a, \qquad (11)$$

and therefore the entropy of the projection will be

$$H(V'') = \frac{1}{2}\ln a^T C' a + \text{const.} \qquad (12)$$

Since $a^T C' a$ is the variance of V′ in the direction of a, it is maximized by choosing a to be the first principal component of C′. So, actually, the maximization of the entropy coincides with maximization of the variance, i.e., PCA for the transformed variable V′. This is equivalent to performing gCCA for the original variable V, since, as already noted in the previous section, gCCA produces the same result as whitening the within-data covariances of the data sets and then performing PCA.

2.4. Associative clustering

Having observed a set of paired data samples {(xk, yk)}k from two continuous, vector-valued random variables X and Y, we wish to find subsets of data that are informative of dependencies between X and Y. We use a previously introduced method, associative clustering (AC),14 that produces two sets of partitions, one for X and the other for Y. The aim of AC is to make the cross-partition contingency table represent as much of the dependencies between X and Y as possible. A Bayesian criterion for the dependency is detailed below.

†One should note that whitening (i.e., transforming the covariance matrix to the identity matrix) is not the only possibility to set $\frac{1}{2}\ln\det C_i = 0$, but it is the only one that implies equal contributions of the original variables within each data set. Because we have no prior information that any of the variables within a data set would be more important than the others, it seems reasonable to require equal variances.

Because the partitions for x and y define the margins of the contingency table, they are called margin partitions, and they split the data into margin clusters. The cells of the contingency table correspond to pairs of margin clusters, and they split data pairs (x, y) into clusters. Denote the count of data within the contingency table cell on row i and column j by $n_{ij}$, and sums with dots: $n_{i\cdot} = \sum_j n_{ij}$.

The AC optimizes the margin partitions to maximize the dependency between them, measured by the Bayes factor between two hypotheses: the margins are independent ($\bar{H}$) vs. dependent ($H$). The Bayes factor is (derivation omitted)

$$\frac{P(\{n_{ij}\}\,|\,H)}{P(\{n_{ij}\}\,|\,\bar{H})} \propto \frac{\prod_{ij}\Gamma(n_{ij}+\alpha_{ij})}{\prod_i \Gamma(n_{i\cdot}+\alpha_i)\,\prod_j \Gamma(n_{\cdot j}+\alpha_j)}, \qquad (13)$$

where the α come from priors, set equal to 1 in this work. The cell frequencies are computed from the training samples {(xk, yk)}k by mapping them to the closest margin cluster in each space. The clusters are parameterized by prototype vectors m.
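For a fixed contingency table, the (unnormalized) log Bayes factor of Eq. (13) is easy to evaluate; the following sketch uses all priors α = 1 as in the paper, and made-up counts:

```python
# Log of prod_ij Gamma(n_ij + 1) / (prod_i Gamma(n_i. + 1) prod_j Gamma(n_.j + 1)).
import numpy as np
from scipy.special import gammaln

def log_bayes_factor(n):
    """Unnormalized log Bayes factor of a count matrix n, alphas set to 1."""
    return (gammaln(n + 1).sum()
            - gammaln(n.sum(axis=1) + 1).sum()    # row sums n_i.
            - gammaln(n.sum(axis=0) + 1).sum())   # column sums n_.j

dependent = np.array([[20, 0], [0, 20]])     # counts concentrated on the diagonal
independent = np.array([[10, 10], [10, 10]]) # counts spread as the margins predict

print(log_bayes_factor(dependent) > log_bayes_factor(independent))  # True
```

Note that both values are negative here; what matters for the AC criterion is the relative ordering, and the dependent table scores higher.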

During optimization the partitions are smoothed to facilitate the application of gradient-based methods. As the final cost function to optimize, we have

$$\log \mathrm{BF}' = \sum_{ij}\log\Gamma\Bigl(\sum_k g_i^{(x)}(\mathbf{x}_k)\,g_j^{(y)}(\mathbf{y}_k) + \alpha_{ij}\Bigr) - \lambda^{(x)}\sum_i \log\Gamma\Bigl(\sum_k g_i^{(x)}(\mathbf{x}_k) + \alpha_i\Bigr) - \lambda^{(y)}\sum_j \log\Gamma\Bigl(\sum_k g_j^{(y)}(\mathbf{y}_k) + \alpha_j\Bigr), \qquad (14)$$

where

$$g_i^{(x)}(\mathbf{x}) \equiv Z^{(x)}(\mathbf{x})^{-1} \exp\bigl(-\|\mathbf{x}-\mathbf{m}_i^{(x)}\|^2/\sigma_{(x)}^2\bigr),$$

and similarly for $g^{(y)}$. The $g^{(\cdot)}$ are the smoothed Voronoi regions at the margins. The $Z^{(\cdot)}$ is set to normalize $\sum_i g_i^{(x)}(\mathbf{x}) = \sum_j g_j^{(y)}(\mathbf{y}) = 1$. The parameters σ control the degree of smoothing of the Voronoi regions and the λ are regularizing parameters: if set larger than one, they favour solutions with equal-sized margin clusters. AC is optimized with the conjugate gradient method. More details can be found in the previous publications.14
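The smoothed Voronoi memberships in Eq. (14) amount to a softmax over negative squared distances to the prototypes. A minimal sketch (toy data and prototypes, not the authors' code):

```python
# Smoothed Voronoi memberships g_i(x): softmax of -||x - m_i||^2 / sigma^2.
import numpy as np

def memberships(X, protos, sigma):
    """Rows: samples; columns: smoothed assignment to each prototype."""
    d2 = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)  # squared distances
    logits = -d2 / sigma ** 2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)       # Z normalizes rows to sum to 1

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))
protos = rng.standard_normal((4, 2))

g_soft = memberships(X, protos, sigma=1.0)
g_hard = memberships(X, protos, sigma=1e-3)   # sigma -> 0 approaches hard Voronoi
nearest = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

print(np.allclose(g_soft.sum(axis=1), 1.0))          # True
print(np.array_equal(g_hard.argmax(axis=1), nearest))  # True
```

Larger σ spreads each sample's mass over several margin clusters, which is what makes the cost differentiable for gradient-based optimization.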

2.5. Uncertainty in results

Our use of Bayes factors in AC is different from their traditional use in hypothesis testing.10 We do not test any hypotheses; rather, the Bayes factor is maximized to explicitly hunt for dependencies. However, for the current implementation of AC, this leaves the Bayes factor of AC conditioned on the clustering model. Together with the finiteness of real-world data and local minima in optimization, this results in uncertainty in the results.

We tackle the problem of uncertainty by using the bootstrap6 to produce several perturbed clusterings. There are analogous approaches in the literature, where a more traditional clustering method has been applied to gene expression data and bootstrapped.15 In our case, we wish to find cross-clusters (contingency table cells) that signify dependencies between the data sets, and that are reproducible.

Reproducibility of the clusters is estimated with the bootstrap, by sampling 100 bootstrap data sets from the original data set and clustering each with AC. Dependency of clusters in each bootstrap AC is estimated by generating several (1000) data sets of the same size as the original one from the marginals of the contingency table (i.e., under the null hypothesis of independence). Cross-clusters containing more observations than expected by chance given the independent margins (p < 0.01 with Bonferroni correction) are defined as dependent.

The two criteria of dependency and reproducibility are finally combined by evaluating, for every gene pair, how likely the genes are to occur within the same significantly dependent cross-cluster in clustering models computed on the different bootstrap data sets. This similarity matrix is then summarized by hierarchical clustering. Cutting the tree (arbitrarily) then gives the most dependent, robust subsets of the data.
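The reliability analysis can be sketched as follows: co-occurrence counts across bootstrap runs become a dissimilarity matrix, which is summarized by hierarchical clustering and cut at a fixed height (the cluster labels below are made up, standing in for real AC output):

```python
# Co-occurrence in significant cross-clusters -> dissimilarity -> dendrogram cut.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# labels[b, k] = id of the significant cross-cluster of gene k in bootstrap b;
# a sentinel (-1) marks genes outside any significant cross-cluster in that run
labels = np.array([[0, 0, 1, 1, -1],
                   [0, 0, 1, 1,  1],
                   [0, 0, 1, 1, -1]])
B, n = labels.shape

co = np.zeros((n, n))
for b in range(B):
    same = (labels[b][:, None] == labels[b][None, :]) & (labels[b] >= 0)[:, None]
    co += same & same.T              # both genes in the same significant cell

dissim = 100.0 * (1.0 - co / B)      # 0 = always together, 100 = never together
np.fill_diagonal(dissim, 0.0)

tree = linkage(squareform(dissim), method="average")
clusters = fcluster(tree, t=80.0, criterion="distance")   # cut the tree at 80
print(clusters[0] == clusters[1], clusters[0] != clusters[2])  # True True
```

Genes that always co-occur get dissimilarity 0 and end up in the same flat cluster; the cutoff of 80 matches the threshold used later in the paper.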

2.5.1. Contributions of the original variables in clusters

Finally, the obtained reliable clusters are to be interpreted in terms of the original variables. In other words, we investigate which transcription factors bind strongly in a specific cluster. Additionally, the binding tendency should of course be reliable.


We utilize here the localness of our clusters in each data space and compute the average profile of the original data for the final TF-binding clusters. Average profiles are compared to randomized average profiles, computed from the data vectors of 10000 random sets of the same size as the cluster. This offers a way to identify abnormally high and low values of TF-binding in the cluster.
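This randomization test can be sketched with toy data (not the paper's data; a strong binding signal is planted for one hypothetical TF to show how it stands out against the random-set null):

```python
# Compare a cluster's mean profile against means of random gene sets.
import numpy as np

rng = np.random.default_rng(4)
data = rng.standard_normal((1000, 5))        # genes x TF-binding values (toy)
cluster = np.arange(30)                      # indices of a hypothetical cluster
data[cluster, 2] += 2.0                      # plant strong binding of TF 2

observed = data[cluster].mean(axis=0)
null = np.array([data[rng.choice(len(data), size=len(cluster),
                                 replace=False)].mean(axis=0)
                 for _ in range(10000)])     # 10000 random sets, as in the paper

# two-sided empirical p-value per TF
p = (np.abs(null) >= np.abs(observed)).mean(axis=0)
print(p[2] < 0.01)                           # the planted TF stands out: True
```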

3. Yeast Gene Expression under Stress, and Its Regulators

3.1. Data

We used data from several experiments to analyze the dependencies between yeast stress genes and their regulators. The common stress response was sought from expression data of altogether 16 stress treatments5,9: heat (2), acid, alkali, peroxide, NaCl, sorbitol (2), H2O2, menadione, dtt (2), diamide, hypo-osmotic, amino acid starvation, and nitrogen depletion. A short time series had been measured for each, and in total we had 104 dimensions. For these genes we picked the TF-binding profiles of 113 transcription factors,20 to search for dependencies with the expression patterns. In total we ended up having 5998 yeast genes. All the values were normalized with respect to the zero point of each time series (or other control), and then the natural logarithm of these ratios was taken. Missing values were imputed by genewise averages in each data set.
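The preprocessing described above (ratio against the zero-point control, natural log, genewise-mean imputation) can be sketched on toy values:

```python
# Log-ratio normalization against the zero point, then genewise imputation.
import numpy as np

raw = np.array([[100., 200., np.nan],        # genes x time points (toy values)
                [ 50.,  25.,  10.]])
control = raw[:, [0]]                        # zero point of each time series

logratio = np.log(raw / control)             # ln(expression / control)
gene_means = np.nanmean(logratio, axis=1, keepdims=True)
filled = np.where(np.isnan(logratio), gene_means, logratio)

print(np.isnan(filled).any())                # False: all gaps imputed
```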

3.2. Dimensionality Reduction by gCCA

We started with the 16 separate stress gene expression data sets, in total 104-dimensional expression data. The number of gCCA components was chosen such that the same components could be found from left-out data reasonably well (measured with the angle between the components) in 20-fold cross-validation. This resulted in 12 generalized canonical components.
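The stability criterion, the angle between matching components estimated on different data splits, can be sketched as follows (toy data and a plain PCA direction stand in for the gCCA components; the split into two halves is a simplification of the 20-fold scheme):

```python
# Angle between the leading direction estimated on two random halves of the data.
import numpy as np

def angle_deg(a, b):
    c = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # sign-invariant
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

rng = np.random.default_rng(5)
X = rng.standard_normal((400, 6)) * np.array([5., 1., 1., 1., 1., 1.])

halves = np.array_split(rng.permutation(X), 2)
dirs = [np.linalg.svd(h - h.mean(axis=0), full_matrices=False)[2][0]
        for h in halves]

# the dominant direction is strong, so the two halves should agree closely
print(angle_deg(dirs[0], dirs[1]) < 10.0)
```

A component is deemed stable when this angle stays small across folds; the number of components is increased only while that holds.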

We then checked whether gCCA managed to produce meaningful components. This was verified by testing the association of the 12 first gCCA components to genes known to be affected by stress, namely the putative environmental stress response (ESR) genes found earlier.9 Of the 12 generalized canonical components, 9 showed a statistically significant association to ESR genes known to be either up-regulated or down-regulated (Wilcoxon rank-sum test; p < 0.05). gCCA thus managed to capture the variation relevant to ESR genes.

Fig. 1. Projection of the genes on the first two gCCA components, revealing how the known ESR genes are separated from the rest of the genes by gCCA. Triangle up: upregulated ESR genes; triangle down: downregulated ESR genes; dots: the rest of the genes.

Figure 1 demonstrates that the putative ESR genes are separated reasonably well from the gene mass even in a two-dimensional projection.

3.3. Associative clustering for stress expression and TF binding

We next analyzed with associative clustering the dependencies between the 12-dimensional expression data, resulting from preprocessing by gCCA, and the 113-dimensional TF-binding data. The goal was to find subsets of genes having maximal dependency between their TF-binding and expression under stress. The number of AC clusters was chosen to produce about 10 data points per cell of the cross-partition table (equivalently, contingency table) on average, resulting in a table with 30 × 20 cells.

To verify that there are dependencies the AC is able to detect, the contingency table produced by AC was compared with the contingency table produced by independent K-means clusterings in both data spaces, in a 10-fold cross-validation run. AC found a statistically significantly higher dependency between the data sets than K-means (p < 0.05; paired t-test), and the actual values of the log Bayes factors were −23.45 for AC and −48.96 for K-means. This confirmed that at least a subset of the genes has a non-random dependency between TF-binding and expression (discernible from these data sets), although the data globally does not show dependency (the log Bayes factor is negative).

Fig. 2. Dendrogram of the hierarchical clustering visualizing the similarities between all the genes clustered with 100 bootstrap AC models. The vertical axis represents the average dissimilarity of the genes: 100 means that a pair of genes never occurs in the same significantly dependent cross-cluster in the 100 bootstrap runs, and 0 that they always co-occur. Note that there is a mass of genes whose dissimilarity, or co-occurrence in the different cross-clusters, is over 80, which was the cutoff threshold used to produce the final clusters. Several very reliable clusters can also be seen, as downward protruding peaks.

After these preliminary checks we used the AC to search for salient dependencies between the stress expression data and the TF-binding data. A similarity matrix was produced by bootstrap analysis with 100 samples as described in Sec. 2.5, and summarized by hierarchical clustering. Figure 2 shows a few clear clusters interspersed within a background that shows no apparent dependencies. We cut the dendrogram at the height of 80. This defines a threshold on reliability: if genes occur together, within significant clusters, in more than 20 of the 100 bootstrap ACs on average, their association is defined as reliable.

We validated the clusters extracted from Fig. 2 by investigating the distribution of the earlier-found putative ESR genes within them. Since we had designed the models to hunt for regulation patterns of ESR genes, we expected some of our clusters to consist of ESR genes. Indeed, upregulated ESR genes were statistically significantly enriched in 14 out of the 51 clusters (hypergeometric distribution; p-value < 0.001), and downregulated ESR genes in 12 of them. This confirms that our method has succeeded in capturing stress-related genes in clusters.

3.4. Biological interpretation

For a more detailed interpretation, the clusters were analyzed with EASE11 to find significant enrichments of gene ontology classes. In total we found 14 statistically significantly enriched (Bonferroni-corrected p-value from Fisher's exact test < 0.05) GO slim classes‡ in our 51 clusters. Additionally, the enrichments of ESR genes as well as interesting, non-random TF bindings were used as indicators to select clusters for the analysis. We present here representative examples of the biological interpretations of the clusters.

Fig. 3. A gene cluster related to the cell cycle, revealing both how the cell-cycle machinery is driven down under stress, and the putative regulators for that set of genes. The upper panel represents the mean expression profile (bars) of the genes with their confidence intervals (lines, computed by random sampling), revealing how the genes are downregulated in practically every treatment, and thus conveying information about the shutdown of the cell-cycle machinery. The lower panel represents the mean TF-binding profile (with confidence intervals) of the genes, revealing several significant, strong TF bindings. The most interesting of these are analyzed in the text. (Axes: expression (log-ratio) across the 16 stress treatments; TF binding (log-ratio) across the transcription factors.)

The cluster in Fig. 3 is an example of a set of genes that are not specifically associated with stress, but obviously behave very homogeneously under stress. The cluster actually contains only two genes known to be ESR genes. Nevertheless, this relatively large cluster can be used to demonstrate two characteristic predictions obtained using AC and confirmed by biological observations. First, this cluster is highly enriched in genes involved in the process of cell cycle (12 out of 57; Bonferroni-corrected p-value 3.5e-5). This reflects the coordinated expression of other genes besides the ESR genes; in other words, coordination can be highlighted in most experimental conditions. Second, AC will propose a set of transcription factors involved in the regulation of the member genes of the cluster. This is of special value in this case, because although coordinated interactions between different signal transduction pathways are essential in biological systems, interpathway connections are difficult to identify. The two most prominent transcription factors of this cluster are coded by SWI4 (YER111C) and FKH2 (YNL068C), which are both known to be involved in cell cycle control. However, they operate on different parts of the process, as shown by Shapira and coworkers for the Forkhead factor.23

‡go_slim_mapping.tab at ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/.

The significant TF bindings in the same cluster also include ASH1, which is not directly related to the cell cycle process but rather to mating-type selection. However, mating-type switching in yeast is a multi-step programme, which enables Ash1p to localize asymmetrically to the daughter cell nucleus at the end of cell division in order to prevent the daughter cell from switching mating type. It is therefore interesting to see that AC has grouped ASH1 together with SWI4 and FKH2.

By far the most consistent cluster is the one in which all genes are down-regulated in every treatment and also classified as downregulated ESR genes.9 More than 90 of its 100 genes have a GO annotation referring to protein biosynthesis (p-value 1.2e-86). Now, these genes are well known to be strictly co-regulated, and therefore this finding is not in itself very special, although it confirms the efficiency of the clustering method. However, a closer look at the associated transcription factors is interesting, especially if one looks at factors such as SFP1 (YLR403W). This factor inhibits nuclear protein localization when present in multiple copies and is thereby a regulator of transcription factor activity. It has been associated with the process of cell size. Yeast establishes this balance by enforcing growth to a critical cell size prior to cell cycle commitment ('Start') in late G1 phase. Interestingly, Jorgensen and collaborators have shown that SFP1 is one of two potent negative regulators of 'Start'.12 SFP1 is shown to be an activator of the ribosomal protein (RP) and ribosome biogenesis (Ribi) regulons, the transcriptional programs that dictate the ribosome synthesis rate in accordance with environmental and intracellular conditions. This clearly shows that the prediction of associated transcription factors by the AC algorithm has the potential to produce meaningful regulator hypotheses.

Finally, we demonstrate some novel hypotheses obtained by associative clustering. The cluster in Fig. 4 contains 11 genes, of which 10 belong to the ESR as defined by Gasch et al.9 These genes comprise mainly (7) hypothetical open reading frames (as classified in the Stanford Genome Database). Two of them are annotated as responding to stress. Of the four better-known genes, two are involved in glutamate and glutathione catabolism, and they are known to be expressed mainly as a response to nitrogen starvation or oxidative stress, both typical stress inducers. The associated transcription factor DAL82 (YNL314W) is a positive regulator of allophanate-inducible genes. The other highly associated TF, STB1 (YNL309W), has to be associated with SWI6 to be activated, whereafter it is involved in the G1/S transition during the cell cycle.

Fig. 4. A gene cluster consisting almost entirely of ESR genes, which are still largely unknown. The upper panel represents the mean expression profile of the genes with their confidence intervals (computed by random sampling), revealing how the genes are upregulated in practically every treatment, confirming their earlier definition as upregulated ESR genes. The lower panel represents the mean TF-binding profile (with confidence intervals) of the genes, revealing two significant, strong TF bindings, which are now potential regulators for these genes. (Axes: expression (log-ratio) across the 16 stress treatments; TF binding (log-ratio) across the transcription factors.)

4. Discussion

We have demonstrated the effectiveness of exploratory dependency modeling for characterizing yeast gene expression under stress, and its regulation.

We applied generalized canonical correlation analysis (gCCA) in a novel way to multiple stress expression sets to produce a single representation that preserves the mutual information between the sets. This preprocessed data was then clustered with AC to maximize its dependency with the binding profiles of a set of regulators.
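As a rough illustration of the preprocessing step, the sketch below implements a MAXVAR-style gCCA: each data set is centered and whitened, the whitened sets are concatenated, and the top left singular vectors of the stacked matrix serve as the shared representation. This is a generic textbook formulation, not the authors' exact implementation, and the function name and toy data are illustrative:

```python
import numpy as np

def gcca_maxvar(datasets, n_components=2, eps=1e-8):
    """MAXVAR-style generalized CCA: whiten each set, concatenate, and
    take the top singular vectors of the stack as the shared representation."""
    whitened = []
    for X in datasets:
        Xc = X - X.mean(axis=0)                # center each variable
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        whitened.append(U[:, s > eps])          # orthonormal basis of column space
    M = np.hstack(whitened)                     # samples x (sum of ranks)
    G, _, _ = np.linalg.svd(M, full_matrices=False)
    return G[:, :n_components]                  # shared components per sample

# toy usage: three data sets driven by a common latent signal z
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
sets = [np.hstack([z + 0.1 * rng.normal(size=(100, 1)) for _ in range(5)])
        for _ in range(3)]
Z = gcca_maxvar(sets, n_components=2)
```

On data like this toy example, the first shared component recovers the common signal z (up to sign), which is the sense in which the joint representation preserves what the sets have in common.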

The biological relevance of the clusters was confirmed with several tests and database searches. We can conclude that our approach succeeded notably well both in confirming some known facts and in generating new hypotheses.

From the technical point of view, the main benefit of gCCA in this work was the reduction of the multiple data sets into one representation in a mutual information-preserving way. However, the resulting two-stage approach is naturally suboptimal, since combining the projection and clustering into one method should improve the results. This is one possible future research direction.

Another interesting future research direction is to use a kernel version of gCCA24,1 in combining the data sets. In principle, nonlinear kernel CCA is superior in finding the correlating components when the data are not normally distributed. However, further work will be required to study the regularization of kernel CCA, the choice of kernel function, its application in the integration of the data sets, and its interpretation. There additionally exist other promising kernel-based data integration methods in bioinformatics.19

From a biological perspective, the TF-binding data is problematic since it was measured under optimal environmental conditions, while the gene expression was measured under environmental stress. This implies that, for example, stress regulators such as MSN2p, which are known to bind to genes only under stress, cannot be found in this type

of analysis. The results are expected to improve when new binding data measured under stress become available. In spite of this deficiency in the data, we managed to extract many biologically meaningful clusters and hypotheses for their regulators. Moreover, this kind of combination of data sets can be seen as a study complementary to previous ones,5,9 with the potential to reveal both entirely new stress regulators and cell mechanisms that are regulated in concert both under stress and in normal growth conditions.

Acknowledgments

This work has been supported by the Academy of Finland, decisions #79017 and #207467.

References

1. F. R. Bach and M. I. Jordan, Kernel independent component analysis, J. Machine Learning Research 3 (2002) 1–48.

2. S. Basu, M. Bilenko and R. J. Mooney, A probabilistic framework for semi-supervised clustering, in Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD-2004) (2004), pp. 59–68.

3. S. Becker, Mutual information maximization: Models of cortical self-organization, Network: Computation in Neural Systems 7 (1996) 7–31.

4. M. A. Beer and S. Tavazoie, Predicting gene expression from sequence, Cell 117 (2004) 185–198.

5. H. C. Causton, B. Ren, S. S. Koh, C. T. Harbison, E. Kanin, E. G. Jennings, T. I. Lee, H. L. True, E. S. Lander and R. A. Young, Remodeling of yeast genome expression in response to environmental changes, Molecular Biology of the Cell 12 (2001) 323–337.

6. B. Efron and R. Tibshirani, An Introduction to the Bootstrap (Chapman & Hall, New York, 1993).

7. N. Friedman, Inferring cellular networks using probabilistic graphical models, Science 303 (2004) 799–805.

8. N. Friedman, O. Mosenzon, N. Slonim and N. Tishby, Multivariate information bottleneck, in Proc. UAI'01, The Seventeenth Conf. Uncertainty in Artificial Intelligence (Morgan Kaufmann Publishers, San Francisco, CA, 2001), pp. 152–161.

9. A. P. Gasch, P. T. Spellman, C. M. Kao, O. Carmel-Harel, M. B. Eisen, G. Storz, D. Botstein and P. O. Brown, Genomic expression programs in the response of yeast cells to environmental changes, Molecular Biology of the Cell 11 (2000) 4241–4257.

10. I. J. Good, On the application of symmetric Dirichlet distributions and their mixtures to contingency tables, Annals of Statistics 4(6) (1976) 1159–1189.


11. D. A. Hosack, G. Dennis Jr., B. T. Sherman, H. C. Lane and R. A. Lempicki, Identifying biological themes within lists of genes with EASE, Genome Biology 4(R70) (2003).

12. P. Jorgensen, I. Rupes, J. R. Sharom, J. R. Broach, L. Schneper and M. Tyers, A dynamic transcriptional network communicates growth potential to ribosome synthesis and critical cell size, Genes & Development (2004). Electronic publication ahead of print, PubMed ID 15466158.

13. S. Kaski, J. Sinkkonen and A. Klami, Discriminative clustering, Neurocomputing (2005), in press.

14. S. Kaski, J. Nikkila, J. Sinkkonen, L. Lahti, J. Knuuttila and C. Roos, Associative clustering for exploring dependencies between functional genomics data sets, IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press.

15. M. K. Kerr and G. A. Churchill, Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments, Proc. National Academy of Sciences 98 (2001) 8961–8965.

16. J. R. Kettenring, Canonical analysis of several sets of variables, Biometrika 58(3) (1971) 433–451.

17. R. D. King, Applying inductive logic programming to predicting gene function, AI Magazine 25 (2004) 57–68.

18. S. Kullback, Information Theory and Statistics (Wiley, New York, 1959).

19. G. R. G. Lanckriet, T. D. Bie, N. Cristianini, M. I. Jordan and W. S. Noble, A statistical framework for genomic data fusion, Bioinformatics 20(16) (2004) 2626–2635.

20. T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J.-B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford and R. A. Young, Transcriptional regulatory networks in Saccharomyces cerevisiae, Science 298 (2002) 799–804.

21. J. Nikkila, C. Roos and S. Kaski, Exploring dependencies between yeast stress genes and their regulators, in Proc. Int. Conf. Intelligent Data Engineering and Automated Learning (IDEAL 2004), eds. Z. R. Yang, R. Everson and H. Yin (Springer, 2004), pp. 92–98.

22. E. Segal, R. Yelensky and D. Koller, Genome-wide discovery of transcriptional modules from DNA sequence and gene expression, Bioinformatics 19(Suppl 1) (2003) 273–282.

23. M. Shapira, E. Segal, M. Shapira and D. Botstein, Disruption of yeast forkhead-associated cell cycle transcription by oxidative stress, Molecular Biology of the Cell (2004). Electronic publication ahead of print.

24. J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, 2004).

25. N. Tishby, F. C. Pereira and W. Bialek, The information bottleneck method, in Proc. 37th Annual Allerton Conf. Communication, Control, and Computing, eds. B. Hajek and R. S. Sreenivas (Univ. of Illinois, Urbana, 1999), pp. 368–377.

26. K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, Constrained k-means clustering with background knowledge, in Proc. Int. Conf. Machine Learning (2001), pp. 577–584.