
An Adaptive Strategy for Single- and Multi-Cluster Gene Assignment

Sanjeev Garg,†,| Marc F. Hansen,‡ David W. Rowe,§ and Luke E. K. Achenie*,†

Department of Chemical Engineering, University of Connecticut, Storrs, Connecticut 06269, Center for Molecular Medicine, University of Connecticut Health Center, Farmington, Connecticut 06030, and Department of Genetics and Developmental Biology, University of Connecticut Health Center, Farmington, Connecticut 06030

Strict assignment of genes to one class, dimensionality reduction, a priori specification of the number of classes, the need for a training set, nonunique solution, and complex learning mechanisms are some of the inadequacies of current clustering algorithms. Existing algorithms cluster genes on the basis of high positive correlations between their expression patterns. However, genes with strong negative correlations can also have similar functions and are most likely to have a role in the same pathways. To address some of these issues, we propose the adaptive centroid algorithm (ACA), which employs an analysis of variance (ANOVA)-based performance criterion. The ACA also uses Euclidean distances, the center-of-mass principle for heterogeneously distributed mass elements, and the given data set to give unique solutions. The proposed approach involves three stages. In the first stage a two-way ANOVA of the gene expression matrix is performed. The two factors in the ANOVA are gene expression and experimental condition. The residual mean squared error (MSE) from the ANOVA is used as a performance criterion in the ACA. Finally, correlated clusters are found based on the Pearson correlation coefficients. To validate the proposed approach, a two-way ANOVA is again performed on the discovered clusters. The results from this last step indicate that MSEs of the clusters are significantly lower compared to that of the fibroblast-serum gene expression matrix. The ACA is employed in this study for single- as well as multi-cluster gene assignments.

Introduction

Large amounts of gene expression data sets under varying experimental conditions of interest have been generated using cDNA microarray experiments (1, 2). The resulting gene expression matrix is an N×M matrix, where N (on the order of thousands) is the number of genes and M (on the order of tens) is the number of attributes or time points. A row of this matrix represents a gene expression vector that describes the expression of the gene under different conditions. The functional role of individual genes, as well as the interaction of these genes in the underlying genetic regulatory networks or cellular pathways, can be much better understood using this expression data in conjunction with the appropriate data analysis tools. Data reduction techniques such as principal component analysis (PCA) (3), multidimensional scaling plots (4), hierarchical clustering (5-8), self-organizing maps (9), knowledge-based support vector machines (10), and “gene-shaving” (11) are a few of these tools.

The above techniques have many desirable features as well as a few undesirable ones. For example, it is difficult to assign meaning to the linear or nonlinear principal components generated using PCA. Hierarchical clustering is best-suited for data that follow a hierarchical pattern; however, it is not well-suited for gene expression data analysis in general (9). Self-organizing maps require the a priori specification of the number of cluster centers, which might not be known in many instances. Moreover, these result in nonunique clusters or functional classes. Support vector machines need a training data set to learn the class information. For this to succeed, the training data has to be a true representation of the whole data set. All these techniques use some measure of correlation among gene expression vectors and cluster genes based on strong positive correlation. In addition, most of the techniques assume that each gene belongs to at most one class of genes. Shatkay and co-workers (12) caution that the underlying assumptions may be flawed and report that

“Genes that are functionally related may demonstrate strong anti-correlation in their expression levels... thus clustered into separate groups, blurring the relationship between them.”

“... simultaneously expressed genes do not always share a function. Moreover, genes that are expressed at different times may serve complementing roles of one unifying function.”

“Due to the interrelated nature of biological processes, genes may have more than a single function. ...potentially preventing the exposure of complex interrelationships between genes.”

* To whom correspondence should be addressed. Ph: (860) 486-2756. Fax: (860) 486-2959. Email: [email protected].

† Department of Chemical Engineering, University of Connecticut.

‡ Center for Molecular Medicine, University of Connecticut Health Center.

§ Department of Genetics and Developmental Biology, University of Connecticut Health Center.

| Current address: Department of Chemical Engineering, IIT Kanpur, India 208 016. Email: [email protected].

Biotechnol. Prog. 2003, 19, 1142-1148

10.1021/bp025648p CCC: $25.00 © 2003 American Chemical Society and American Institute of Chemical Engineers. Published on Web 06/19/2003

With the cautionary statements of Shatkay et al. in mind, in this paper we attempt to avoid some of the above flaws by proposing a new clustering tool based on the statistical analysis of variance (ANOVA). We refer to the new clustering tool as the adaptive centroid algorithm (ACA). Various steps involved in the ACA are described in detail in Algorithm and Implementation.

Systems and Methods

The proposed approach involves three stages. In the first stage a two-way ANOVA of the gene expression matrix is performed. Many public domain data sets report only the expression ratios for genes at different experimental conditions. Thus, we base our approach on a two-way ANOVA, the two factors being gene expression level and experimental conditions. If available, other factors, such as dye, assay, or slide, can also be considered in the ANOVA. In this work we propose the use of the residual mean squared error (MSE from the ANOVA) as the performance criterion in the ACA during the second stage. Unaccounted error in the ANOVA is assumed to affect the quality of the clustering. Therefore, by minimizing this error we will obtain good gene clusters based on the gene expression levels in the case of microarrays. We employ the unaccounted error in the ANOVA as the upper limit (a performance criterion) on the Euclidean radius of each cluster in the ACA. The F-test (in ANOVA) requires populations that are normal. If this assumption is not valid, then the conclusion drawn may not be valid unless one is working with large samples (greater than 30). In other words, better estimates are obtained for large sample sizes. In cDNA microarray data, the number of genes is usually in the hundreds or thousands. Thus, better estimates are obtained for array data profiling and the conclusions made are valid. The ACA also uses Euclidean distances and an analogy to the center-of-mass calculation for heterogeneously distributed mass elements. Note that some of the existing algorithms (for example SVMs) do not give unique solutions. This is not desired, as different researchers would have different results from a given set of gene expression experiments. In contrast the proposed approach results in unique solutions.
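As an illustration of the stage 1 computation, the following is a minimal Python sketch of the mean squares of a two-way ANOVA on an N × M expression matrix. It assumes one observation per gene and condition (no replication) and no additional factors such as dye or slide; the function name `two_way_anova_mse` is hypothetical, and the authors' own implementation may differ.

```python
def two_way_anova_mse(matrix):
    """Mean squares of a two-way ANOVA without replication (an assumption).

    Rows are genes, columns are experimental conditions.
    Returns (ms_gene, ms_condition, ms_residual).
    """
    n = len(matrix)        # number of genes, N
    m = len(matrix[0])     # number of conditions, M
    grand = sum(sum(row) for row in matrix) / (n * m)
    row_means = [sum(row) / m for row in matrix]
    col_means = [sum(matrix[i][k] for i in range(n)) / n for k in range(m)]

    # Sums of squares for the gene factor, the condition factor, and the residual.
    ss_gene = m * sum((r - grand) ** 2 for r in row_means)
    ss_cond = n * sum((c - grand) ** 2 for c in col_means)
    ss_res = sum(
        (matrix[i][k] - row_means[i] - col_means[k] + grand) ** 2
        for i in range(n)
        for k in range(m)
    )

    # Each sum of squares is divided by its degrees of freedom.
    ms_gene = ss_gene / (n - 1)
    ms_cond = ss_cond / (m - 1)
    ms_res = ss_res / ((n - 1) * (m - 1))
    return ms_gene, ms_cond, ms_res


# Toy matrix with purely additive gene and condition effects.
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
ms_gene, ms_cond, ms_res = two_way_anova_mse(toy)
```

In this sketch the third returned value plays the role of the residual MSE that the text uses as the performance criterion in stage 2.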

The ACA is employed in this study for single- as well as multi-cluster gene assignments. Finally, correlated clusters are found in the third stage based on the Pearson correlation coefficients. To validate the proposed approach, a two-way ANOVA is again performed on the discovered clusters. The results from this last step indicate that MSEs of the clusters (for the gene effect and for the residual) are significantly lower compared to those of the fibroblast-serum gene expression matrix. The third stage also helps in identifying anti-correlated genes, which may be part of the same cellular pathway or genetic network.

Algorithm and Implementation. The clustering algorithm is based on center-of-mass calculation for heterogeneously distributed mass elements in space. The calculated center-of-mass lies nearest to the location with maximum density of distributed mass elements (13). An analogy is made between this and the calculation of the cluster centers from a given gene expression matrix. The algorithm is implemented in C++ and is available on request from the authors.

In ACA, the centroid of unclassified genes ($C_{\mathrm{unclassified}}$) is calculated iteratively and is adapted for genes being classified. At each iteration, the nearest gene ($G_{\mathrm{nearest}}$) to the calculated centroid is located by calculating the Euclidean distances of all genes from the centroid. The squared minimum distance from the centroid to the nearest gene is denoted as min_dis. A local search in the neighborhood of the located nearest gene is performed to find the associated genes in the cluster. Note that in multi-cluster assignment, the nearest gene and the genes that belong to the new cluster may include genes that have already been assigned to other clusters. This is not the case for single-cluster assignments (see steps 2 and 3 below for details). To help in the local search, two binary indicator variables, $w_i$ and $y_{i,j}$, are defined. The variable $w_i$ is set to one if the ith gene is in the neighborhood of the gene nearest to the centroid (in other words, the ith gene is in the new cluster). Otherwise $w_i$ is set to zero. The indicator variable $y_{i,j}$ is set to one if the ith gene belongs to the jth cluster and is zero otherwise (see Figure 1).
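The centroid and nearest-gene calculations described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: expression vectors are plain lists of floats, `w` is the list of indicator values $w_i$, and the helper names (`sq_dist`, `centroid_unclassified`, `nearest_gene`) are hypothetical rather than taken from the authors' C++ code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_unclassified(gev, w):
    """Centroid of the genes whose indicator w[i] is 0."""
    free = [g for g, wi in zip(gev, w) if wi == 0]
    m = len(gev[0])
    return [sum(g[k] for g in free) / len(free) for k in range(m)]


def nearest_gene(gev, w, centroid, multi_cluster=False):
    """Return (index of the nearest gene, min_dis).

    Single-cluster assignment searches only genes with w[i] == 0;
    multi-cluster assignment searches all genes.
    """
    candidates = [i for i in range(len(gev)) if multi_cluster or w[i] == 0]
    best = min(candidates, key=lambda i: sq_dist(gev[i], centroid))
    return best, sq_dist(gev[best], centroid)


# Toy data: gene 3 is already classified (w = 1).
gev = [[0.0, 0.0], [4.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
w = [0, 0, 0, 1]
c = centroid_unclassified(gev, w)
idx, min_dis = nearest_gene(gev, w, c)
```

When two genes are equally close to the centroid, `min` returns the first one, which is one arbitrary way of resolving the tie case mentioned under Limitations.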

A new centroid for the genes located in the neighborhood ($C_{\mathrm{classified}}$) is calculated iteratively. The neighborhood for the nearest gene is expanded (by deltasq as explained below) until one of the two conditions, (a) centroid condition or (b) delta condition, is violated. The centroid condition uses the dynamic min_dis (adapted for each cluster) as an upper bound. The delta condition is based on the residual MSE, epssq, from the ANOVA. Note that the delta condition is an upper bound on deltasq. A new iteration is started to locate a new cluster center if either of these two conditions is violated. The whole process is repeated with updated values of the parameters until all the genes are classified (see Figure 1).
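The iteration just described can be sketched for the single-cluster case as follows; the control flow follows the numbered steps given in this section. This is one reading of those steps, not the authors' C++ implementation. The assumptions are: a single list `label` stands in for both $w_i$ and $y_{i,j}$, the classified centroid is taken over the members of the current cluster only, `deltasq0` must be positive, and all function and parameter names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    m = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(m)]


def aca_single_cluster(gev, epssq, deltasq0=0.01, step=0.01):
    """Sketch of ACA, single-cluster assignment. Returns a cluster label per gene."""
    n, m = len(gev), len(gev[0])
    label = [None] * n          # label[i] = j stands for w_i = 1 and y_{i,j} = 1
    cluster_count = 0
    deltasq = deltasq0
    while True:
        free = [i for i in range(n) if label[i] is None]
        # Step 1: centroid of the unclassified genes.
        c_unclassified = mean_vector([gev[i] for i in free])
        # Step 2: nearest unclassified gene and squared minimum distance.
        nearest = min(free, key=lambda i: sq_dist(gev[i], c_unclassified))
        min_dis = sq_dist(gev[nearest], c_unclassified)
        # Step 3: genes closer than deltasq to the nearest gene join the cluster.
        members = [i for i in free if sq_dist(gev[nearest], gev[i]) < deltasq]
        for i in members:
            label[i] = cluster_count
        # Step 4: centroid of the current cluster (an interpretation, see lead-in).
        c_classified = mean_vector([gev[i] for i in members])
        # Step 5: stop once every gene is classified.
        if all(v is not None for v in label):
            return label
        # Step 6: centroid condition.
        if sq_dist(gev[nearest], c_classified) < min_dis:
            deltasq += step
            # Step 7: delta condition; if it holds, undo and retry with larger deltasq.
            if deltasq < m * epssq:
                for i in members:
                    label[i] = None
                continue
        # Either condition was violated: accept the cluster and start a new one.
        deltasq = deltasq0
        cluster_count += 1


# Toy data: two well-separated groups of three genes each.
toy = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]]
labels = aca_single_cluster(toy, epssq=0.5, deltasq0=0.1, step=0.1)
```

On the toy data the group nearest to the overall centroid is found first and the other group second. A multi-cluster variant would search all genes in Steps 2 and 3 and would need a per-cluster membership structure instead of a single label per gene.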

In greater detail, the different steps of the algorithm used at stage 2 of the proposed approach are as follows.

Step 0. Input the gene expression matrix (N × M matrix, N genes, M experiments) and epssq. Initialize the parameters $w_i$, $y_{i,j}$, and cluster_count (all to zero), and deltasq and step (to low values).

Step 1. Calculate the centroid of unclassified ($w_i = 0$) genes, $C^{j}_{\mathrm{unclassified}}$, as

$$C^{j}_{\mathrm{unclassified}} = \frac{\sum_{i=1}^{N} \mathrm{GEV}_i \, (1 - w_i)}{N - \sum_{i=1}^{N} w_i}$$

where $\mathrm{GEV}_i$ is the gene expression vector for the ith gene (expression values of a gene at all time points or experimental attributes) and N is the total number of genes. Note that j = cluster_count.

Step 2. Find the nearest gene, $G^{j}_{\mathrm{nearest}}$, and the squared minimum distance, min_dis, as

Case I: single-cluster assignment

$$G^{j}_{\mathrm{nearest}} = \arg\min \lVert C^{j}_{\mathrm{unclassified}} - \mathrm{GEV}_i \rVert, \qquad \mathrm{min\_dis} = \left[ \min \lVert C^{j}_{\mathrm{unclassified}} - \mathrm{GEV}_i \rVert \right]^2, \quad \text{for } i \text{ such that } w_i = 0$$

Case II: multi-cluster assignment

$$G^{j}_{\mathrm{nearest}} = \arg\min \lVert C^{j}_{\mathrm{unclassified}} - \mathrm{GEV}_i \rVert, \qquad \mathrm{min\_dis} = \left[ \min \lVert C^{j}_{\mathrm{unclassified}} - \mathrm{GEV}_i \rVert \right]^2, \quad \text{for all } i$$

Step 3. Calculate the distance of genes and update the indicator variables as

Case I: single-cluster assignment

$$\mathrm{distance} = \left[ \lVert \mathrm{GEV}^{j}_{\mathrm{nearest}} - \mathrm{GEV}_i \rVert \right]^2, \quad \text{for } i \text{ such that } w_i = 0$$

Case II: multi-cluster assignment

$$\mathrm{distance} = \left[ \lVert \mathrm{GEV}^{j}_{\mathrm{nearest}} - \mathrm{GEV}_i \rVert \right]^2, \quad \text{for all } i$$

Case a: If distance is less than deltasq, then

$$w_i = 1 \quad \text{and} \quad y_{i,j} = 1$$

Case b: If distance is equal to or greater than deltasq, then continue.

Step 4. Calculate the centroid of classified ($w_i = 1$) genes, $C^{j}_{\mathrm{classified}}$, as

$$C^{j}_{\mathrm{classified}} = \frac{\sum_{i=1}^{N} \mathrm{GEV}_i \, w_i}{\sum_{i=1}^{N} w_i}$$

Step 5. Check if all genes are classified.

Case a: If yes, store results and STOP.

Case b: If no, continue.

Step 6. Centroid condition: check if $\left[ \lVert \mathrm{GEV}^{j}_{\mathrm{nearest}} - C^{j}_{\mathrm{classified}} \rVert \right]^2 < \mathrm{min\_dis}$

Case a: If yes, then update deltasq as deltasq = deltasq + step and continue.

Case b: If no, then set deltasq to initial value and update the cluster number as cluster_count = cluster_count + 1 and GOTO step 1.

Step 7. Delta condition: check if deltasq < M·epssq

Case a: If yes, then for $y_{i,j} = 1$ set $w_i = 0$ and $y_{i,j} = 0$. GOTO step 1.

Case b: If no, then set deltasq to initial value and update the cluster number as cluster_count = cluster_count + 1 and GOTO step 1.

Note that deltasq is adapted using step; thus, deltasq and step are initialized to small values. Large initial values of deltasq and step may result in large neighborhoods. However, if the values are too small, the computational load is high. Thus, deltasq and step should be chosen optimally, perhaps through a mathematical programming approach.

Figure 1. Flowchart for the adaptive centroid algorithm.

Results and Discussion

Key Features of Adaptive Centroid Algorithm. The adaptive centroid algorithm has the following key features: (a) no a priori specification of the number of clusters (as required in self-organizing maps (9) or in support vector machines (10)); (b) no dimensionality reduction (as done in principal component analysis (3) and multidimensional scaling plots (4)); (c) use of dynamic distance thresholds for each cluster (this is in contrast to one single constant distance threshold in single-pass clustering (14) or in QT_clust (15)); (d) no assumption about the data distribution; (e) unique solution; and (f) the ability to perform both single- and multi-cluster assignments.

Differences with Selected Algorithms. The various steps in ACA look similar to those in single-pass clustering (SPC) (14) and QT_clust (15). However, ACA has two main differences. First, the distance threshold in ACA is adaptive as opposed to the a priori fixed value in SPC or QT_clust. Second, at each iteration we use all (as opposed to one in SPC and QT_clust) unclassified gene expressions to calculate the centroid. ACA therefore does not depend on the order in which the genes are processed. As a result (as has been found in SPC), the clusters identified in ACA in the early stages tend to be larger than the clusters identified later. Heyer et al. (15) use “greatest jackknife correlation” to form a candidate cluster in QT_clust while ACA employs only Euclidean distances to form a candidate cluster.

Limitations. If more than one gene is equidistant from the centroid of unclassified genes, the new cluster centroid will not be unique. However, this case is not likely to occur in practice. This observation is based on the analysis of the two colon cancer data sets (17, 18) with 4200 genes each with two replicates. Also note that the initial selection of deltasq and min_dis values is empirical.

Application Results. The data set chosen is a subset of the human fibroblast response to serum data set (19). The authors reported the response of human fibroblasts to serum, using cDNA microarrays representing about 8600 distinct human genes. This data set was generated using cDNA microarray hybridization to measure the temporal changes in mRNA levels of 8613 human genes at 12 different times, ranging from 0 to 24 h. Cells growing exponentially were labeled “unsync” and included in their study. The authors reported that only 517 genes were observed to have significant changes. In this paper we employed the gene expression matrix for the same 517-gene subset (http://genome-www.stanford.edu/serum/data/Figure 2clusterdata.txt).

Stage 1 of the proposed approach results in the ANOVA of the gene expression matrix as shown in Table 1 (last row, cluster ALL). A high value of MSE for the gene factor (5.80) shows that the gene effect is significant and a clustering analysis could be useful. The residual MSE value of 0.49 is used in the performance criterion in stage 2.

The stage 2 algorithm (ACA) is used for both single- and multi-cluster gene assignments. The single-cluster assignment (SCA) for genes results in 42 clusters; 447 genes are in 11 of these clusters that have significant cardinalities (i.e., with more than 10 genes). The remaining clusters have low cardinalities (most of them have only a single gene). Sample genes in each cluster are reported in Table 2, and the ANOVA results are shown in Table 1. In Table 1, the MSEs for individual clusters (for the gene effect) are significantly lower (around 3%-12%) than the MSE for the given gene set (cluster ALL). Thus on the basis of the MSEs, the clustering algorithm has done very well. Single-cluster assignment results are shown in Figure 2. ACA converged in 8 CPU seconds (on a 550 MHz, 256 MB RAM PC) for single-cluster assignments.

The multi-cluster assignment (MCA) for genes, in stage 2 of our approach, also results in 42 clusters; 26 of these clusters have significant cardinalities (i.e., with more than 10 genes). It is interesting to note that the gene expression patterns for these clusters are quite similar to those for the single-cluster assignment case. As in the single-cluster assignment case, the remaining clusters have low cardinalities; most of them have only a single gene. Strong correlations are observed for small cardinality clusters with high cardinality clusters. We suspect that these small cardinality clusters are “noise clusters”. Even if we superimpose a random noise on all of the gene expression values, the initial clusters (high cardinality) remain nearly the same. This was tested by adding 5% randomly generated noise to one of the data sets. Thus the solutions appear to be robust to slight perturbations. Multiple cluster assignment results are shown in Figure 3. In general, there is a 10-30% overlap among clusters in MCA, although there are a few clusters where the overlap is much higher. ACA converged in nearly 35 CPU seconds (on a 550 MHz, 256 MB RAM PC) for multiple cluster assignments.

Sample genes in each cluster are reported in Table 2, and the ANOVA results are shown in Table 1. Note that in Table 2 MCA has all of the genes in the same cluster as in SCA. In addition, in Table 1, the MSEs (for the gene effect) for different clusters are significantly lower (around 4-20%) than the MSE for the entire gene set (Table 1, cluster ALL). The MSE values for MCA are a bit higher than the values for the SCA, which is a more strict assignment of genes. Thus, based on the MSEs, the clustering algorithm has done very well for the multi-cluster gene assignment case as well.

Table 1. ANOVA Results for Significant Cardinality Clusters (a, b)

                  MSE SCA                    MSE MCA
cluster    gene     time     error    gene     time     error
1          0.21     6.30     0.10     0.21     6.15     0.10
2          0.67     6.15     0.14     0.36     6.65     0.13
3          0.57     12.01    0.24     1.07     11.03    0.23
4          0.50     12.66    0.14     0.48     17.99    0.13
5          0.47     9.01     0.17     0.55     9.66     0.18
6          0.34     17.06    0.12     0.53     27.63    0.13
7          0.29     3.18     0.27     0.69     5.12     0.25
8          0.35     6.09     0.23     0.34     12.16    0.20
9          0.21     8.70     0.22     0.48     19.38    0.22
10         -        -        -        0.39     2.91     0.18
11         -        -        -        0.84     14.80    0.14
12         -        -        -        0.50     5.46     0.16
15         0.15     12.71    0.22     0.47     21.59    0.21
16         0.22     9.43     0.15     0.58     32.51    0.13
18         -        -        -        0.58     6.31     0.22
19         -        -        -        0.36     12.90    0.19
20         -        -        -        0.65     22.46    0.16
23         -        -        -        0.24     13.58    0.22
26         -        -        -        0.27     4.24     0.10
27         -        -        -        0.42     20.27    0.10
29         -        -        -        0.74     17.70    0.12
30         -        -        -        0.51     21.76    0.11
31         -        -        -        1.16     11.83    0.16
32         -        -        -        0.35     18.28    0.10
36         -        -        -        0.41     25.21    0.11
38         -        -        -        0.49     10.43    0.14
ALL        5.80     6.30     0.49     5.80     6.30     0.49

(a) MSE, mean squared sum of errors. SCA, single-cluster assignment. MCA, multi-cluster assignment. (b) Clusters with 10 or more genes are reported.
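The "around 3%-12%" range quoted for single-cluster assignment can be checked directly against Table 1. The short Python sketch below does this arithmetic; the values are copied from the table, on the reading that the eleven clusters with entries in both halves of the table are the SCA clusters, and the variable names are hypothetical.

```python
# Gene-effect MSE for the whole gene set (Table 1, cluster ALL).
mse_all = 5.80

# Gene-effect MSE per cluster, SCA columns of Table 1.
sca_gene_mse = {
    1: 0.21, 2: 0.67, 3: 0.57, 4: 0.50, 5: 0.47, 6: 0.34,
    7: 0.29, 8: 0.35, 9: 0.21, 15: 0.15, 16: 0.22,
}

# Each cluster MSE as a percentage of the MSE of the whole gene set.
ratios = {c: 100.0 * v / mse_all for c, v in sca_gene_mse.items()}
lowest, highest = min(ratios.values()), max(ratios.values())
print(f"{lowest:.1f}% to {highest:.1f}%")   # prints "2.6% to 11.6%"
```

The same calculation on the MCA gene column (0.21 to 1.16) gives roughly 4% to 20%, in line with the range stated for multi-cluster assignment.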

Table 2. Clusters and Sample Genes

Please see http://www.engr.uconn.edu/cheg/achenie/public/pdf/bp_sg_table 2.pdf

In stage 3, Pearson correlation coefficients are calculated on the basis of average gene expression patterns for different clusters. There are strong positive and strong negative correlations among average expression patterns of different clusters. It is interesting to note that many of the low cardinality clusters have strong correlations with high cardinality clusters. The low cardinality “noise clusters” are due to a larger Euclidean distance between the average gene expression patterns of the two clusters than the MSE calculated in the first stage. The negative correlations indicate that the genes are anti-correlated and could be part of the same cellular pathway. Different genes with known functionality are discussed in detail below.
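The stage 3 calculation can be illustrated with a minimal Python sketch: average the expression vectors within each cluster, then compute the Pearson correlation coefficient between two averaged patterns. The helper names `average_pattern` and `pearson` and the toy data are hypothetical, and clusters are assumed to be given as lists of gene indices.

```python
from math import sqrt


def average_pattern(gev, members):
    """Average expression pattern of the genes listed in `members`."""
    m = len(gev[0])
    return [sum(gev[i][k] for i in members) / len(members) for k in range(m)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy data: one cluster rises across conditions, the other falls.
gev = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [3.0, 2.0, 1.0]]
rising = average_pattern(gev, [0, 1])
falling = average_pattern(gev, [2])
r = pearson(rising, falling)
```

A coefficient near -1, as in this toy case, is the kind of value the text interprets as anti-correlated clusters that may belong to the same pathway.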

“Reprogramming” of Fibroblasts: Signal Transduction. Iyer and co-workers (Figure 4A in ref 19) showed that different genes involved with signal transduction in the reprogramming phase of the response have diverse gene expression patterns. In this study, we show that we are able to find clusters (with MCA) that are unique to signal transduction as well as clusters which also have genes with other functions.

Clusters 9, 20, and 25 with different average expression patterns are unique to signal transduction genes. The example genes are Cluster 9, ESTs highly similar to opioid binding protein, NET1; Cluster 20, ESTs highly similar to opioid binding protein, EDG1; Cluster 25, MKP1.

Clusters 2, 5, 7, 15, and 16 have a few genes involved with signal transduction as well as genes with other functions. The example genes are Cluster 2, Gem GTPase; Cluster 5, BMPR2; Cluster 7, SGK; Cluster 15, EDG1; Cluster 16, ROR1. It is also interesting to note both positive and negative Pearson correlations calculated in stage 3 for these clusters. For example, we observe a correlation value of 0.88 between clusters 7 and 25 and a value of -0.90 between clusters 16 and 9. The negative value indicates that there might be two different modes of signal transduction.

Figure 2. ACA results for single-cluster assignments (t1-t12 are the 12 experimental time points and “unsync” is the unsynchronized case (18)). Clusters with 10 or more genes are marked.

Figure 3. ACA results for multiple-cluster assignments (t1-t12 are the 12 experimental time points and “unsync” is the unsynchronized case (18)). Clusters with 10 or more genes are shown. Clusters 10, 12, 18, 23, 31, and 38 are marked but not numbered.

“Reprogramming” of Fibroblasts: Immediate-Early Transcription Factors. Iyer and co-workers also showed (Figure 4B in ref 19) that different genes involved with signal transduction in the reprogramming phase of the response have diverse gene expression patterns. In this work we observe clusters that are unique to immediate-early transcription factors and clusters that have genes with other functions.

Clusters 8, 14, 21, 22, and 33 are unique to immediate-early transcription factors. Example genes are Cluster 8, DEC1, HIV-1 enhancer binding protein-2; Cluster 14, ATF3; Cluster 21, MINOR, JUNB; Cluster 22, C-FOS, immediate-early response protein; Cluster 33, ID3.

Clusters 5 and 7 have genes with known roles as immediate-early transcription factors as well as genes with other roles. Example genes include Cluster 5, MYC, HIV-1 enhancer binding protein-2; Cluster 7, CPBP/EKLF. A few correlation values are 0.80 (clusters 7 and 21) and 0.82 (clusters 14, 22).

“Reprogramming” of Fibroblasts: Other Transcription Factors. Iyer and co-workers showed (Figure 4C in ref 19) that there are other transcription factors that peak later than the immediate-early transcription factor. In this study, we observe clusters that are unique to other transcription factors and clusters that also have genes with other known functionalities.

Clusters 1 and 6 are unique to other transcription factors. Example genes are Cluster 1, AHR, HSF Protein 2; Cluster 6, LDB1, AHR, DP2, ERF2, and FREAC-2.

Cluster 16 has a few genes with roles as other transcription factors as well as genes with other roles. Example genes are MEIS1, HSF Protein 2, DP2, ERF2. We observe a high value of Pearson correlation coefficient equal to 0.98 between clusters 1 and 6.

Cell Cycle and Proliferation. Clusters 30 and 33 uniquely include genes that are involved in mediating cell cycle progression. Examples of these genes are Cluster 30, PCNA, DNA topoisomerase II α, DNA topoisomerase II α subunit, Madp2, Cyclin A, Cyclin B1, CDC28, Cell division cycle 2 G1 to S and G2 to M; Cluster 33, ID3. Genes encoding for regulators of passage through the S phase and the transition from G2 to M phase, DNA topoisomerase II α (required for chromosome segmentation at mitosis), Madp2 (a component of the spindle checkpoint that prevents completion of mitosis if chromosomes are not attached to the spindle) all were grouped in Cluster 30.

Several other clusters have genes involved with cell cycle and proliferation besides having other functionality genes. Example genes are Cluster 3, PCNA, CENP-F, CDK7; Cluster 6, WEE1 like protein kinase (believed to inhibit mitosis by phosphorylation of CDC2); Cluster 15, PCNA, DNA topoisomerase II α, CENP-F, Cyclin A, Cell division cycle 2 G1 to S and G2 to M, Cyclin B1, CDC28, CDK7. The induction of CDK-7 along with CDC28 suggests a potential role in mediating M phase as suggested in Iyer et al. (19). These genes are grouped in different clusters as they have different gene expressions (as observed in Figure 5A in ref 19). We observe high correlation coefficients between these clusters [0.96 (Clusters 3, 15), 0.94 (Clusters 2, 16), 0.90 (Clusters 15, 30)].

Coagulation and Hemostasis. Genes involved with the process of coagulation and hemostasis are induced and grouped in different clusters, mainly in Cluster 18. Example genes for Cluster 18 include Factor III, Endothelin1, PAI1. Besides Cluster 18, Clusters 5, 7, 8, 9, 31, and 37 have genes encoding for coagulation and hemostasis besides having genes with other functions. Example genes are Cluster 5, Factor III; Cluster 7, Factor III; Cluster 8, THBD; Cluster 31, PAI2; Cluster 37, TFPI2. It is interesting to note that the proposed algorithm is able to differentiate between the subtypes of PAI encoding genes. A few correlation coefficients are 0.91 (Clusters 5, 18), 0.83 (Clusters 7, 18), 0.92 (Clusters 8, 9) and 0.92 (Clusters 9, 37). Cluster 7 is unique to coagulation and hemostasis gene clusters. The heterogeneity in the gene expression is also observed in the published study (Figure 5B in ref 19).

Inflammation. Genes involved with inflammationhave diverse expression patterns with peak expressionat different times after serum stimulation (Figure 5C inref 19). The range for peak expression patterns is from 2to 24 h and as such these genes are grouped in differentclusters. Genes known to be involved with inflammationare ICAM1 (Cluster 5), SDF 1 (Cluster 15), IL-1â (Cluster19), IL6 (Cluster 35.) and IL8 (Cluster 42). Few correla-tion coefficients in this case are 0.97 (Clusters 5, 19), 0.95(Clusters 5, 42), 0.94 (Clusters 19, 42). Cluster 35 isunique to inflammation gene clusters.

Angiogenesis. Genes encoding for products associated with angiogenesis have different expression patterns (as shown in Figure 5D in ref 19) and therefore are grouped in different clusters. Example genes are Cluster 8, VEGF; Cluster 9, FGF2, FGF7, Furin; Cluster 15, SDF1; Cluster 19, IL-1β; Cluster 20, SDF1; Cluster 42, IL8. High correlation coefficients are 0.92 (Clusters 8, 9), 0.93 (Clusters 8, 19), 0.90 (Clusters 8, 42), 0.83 (Clusters 9, 20), and 0.94 (Clusters 19, 42). Cluster 20 is unique to angiogenesis gene clusters.

Tissue Remodeling. Genes known for encoding factors involved in tissue remodeling are grouped in different clusters, but most of them are in Clusters 9 and 18. Example genes are Cluster 3, Alpha 1 Type 3 collagen, Aminopeptidase N, Elastin; Cluster 9, Furin, PLOD2, Aminopeptidase N, Elastin, PLAUR; Cluster 10, CTGF; Cluster 18, Furin, PLAUR, PAI1; Cluster 31, PAI2; Cluster 37, TFPI2. These genes have different expression patterns in the published study (Figure 5E in ref 19). A few of the correlation coefficients are 0.84 (Clusters 9, 18), 0.94 (Clusters 9, 31), 0.92 (Clusters 9, 37), and 0.94 (Clusters 31, 37).

Cytoskeleton Reorganization. Genes involved with cytoskeleton reorganization group in different clusters. Examples are Cluster 3, Desmoplakin I and II; Cluster 9, Vimentin, NET1; Cluster 15, EDG1. Different expression patterns were also observed for these genes in the published study (Figure 5F in ref 19). The correlation coefficient between Clusters 3 and 15 is 0.96.

Re-epithelialization. Genes encoding for re-epithelialization are mainly grouped in Cluster 19. The example genes in Cluster 19 are IL-1β, FGF7, FGF2, and Endothelin-1. The IL8-encoding gene is the only gene in Cluster 42. Besides Clusters 19 and 42, Clusters 9 and 18 also have genes with a known role in the re-epithelialization process (Figure 5G in ref 19). Example genes are Cluster 9, FGF7, FGF2; Cluster 18, Endothelin-1. A few correlation coefficients are 0.80 (Clusters 9, 42), 0.91 (Clusters 18, 19), 0.92 (Clusters 18, 42), and 0.94 (Clusters 19, 42).

Previously Unidentified Role. Several of the genes with unidentified roles in wound healing in Iyer et al. (19, Figure 5H) are clustered in this study with genes of known functionality, as shown in Table 2. Based on the homology assumption, similar roles can be assigned to these genes.
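The role assignment described above can be pictured with a short sketch. It assumes that a gene of unknown role inherits, as candidate roles, every role held by annotated genes that share a cluster with it. The function `propose_roles` and all gene and cluster names are hypothetical placeholders and do not reproduce Table 2.

```python
from collections import defaultdict

def propose_roles(clusters, known_roles):
    """Map each unannotated gene to the roles of annotated genes in its clusters."""
    proposals = defaultdict(set)
    for members in clusters.values():
        # Collect every role carried by an annotated member of this cluster.
        roles_here = set()
        for gene in members:
            roles_here |= known_roles.get(gene, set())
        # Offer those roles to the unannotated members of the same cluster.
        for gene in members:
            if gene not in known_roles and roles_here:
                proposals[gene] |= roles_here
    return dict(proposals)

# Hypothetical clusters; a gene may belong to more than one cluster.
clusters = {
    "cluster_A": ["gene_1", "gene_2", "unknown_1"],
    "cluster_B": ["gene_3", "unknown_1", "unknown_2"],
    "cluster_C": ["unknown_3"],
}
# Hypothetical annotations for the genes of known function.
known_roles = {
    "gene_1": {"inflammation"},
    "gene_2": {"inflammation", "angiogenesis"},
    "gene_3": {"tissue remodeling"},
}

proposals = propose_roles(clusters, known_roles)
```

A gene that sits in a cluster with no annotated members (here `unknown_3`) receives no proposal. Any proposal produced this way is a hypothesis to be checked experimentally, not a confirmed function.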

Cholesterol Biosynthesis. Iyer and co-workers (19) showed (Figure 4I in their paper) that different genes involved in cholesterol biosynthesis have similar expression patterns. In this study we are able to find most of these genes in one cluster (Cluster 38). The example genes in Cluster 38 are HMG CoA reductase, IPP isomerase, and squalene epoxidase. Besides Cluster 38, Clusters 16 and 29 also have genes involved with cholesterol biosynthesis. Example genes are Cluster 16, farnesyl-diphosphate farnesyltransferase; Cluster 29, IPP isomerase, squalene epoxidase. We observe a high correlation coefficient between Clusters 29 and 38 that signifies a similar role for their genes.

Single-cluster assignment benchmarking is done qualitatively by comparing the cluster associations of genes with similar functions with results from the open literature. For brevity, details of comparisons for cluster associations are not included in this study and are instead reported elsewhere (16). No benchmarking is done for multi-cluster assignment since we could not find results to compare with in the open literature. Finally, correlation coefficients between different clusters are also reported elsewhere (16).
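To make the distinction between single- and multi-cluster assignment concrete, the sketch below inverts a cluster-to-genes listing into a gene-to-clusters index. The memberships are a partial subset copied from the cell cycle examples quoted in this section. The dictionary layout and the function name `gene_to_clusters` are assumptions made for illustration.

```python
from collections import defaultdict

def gene_to_clusters(clusters):
    """Invert {cluster: [genes]} into {gene: sorted list of clusters}."""
    index = defaultdict(set)
    for cluster_id, genes in clusters.items():
        for gene in genes:
            index[gene].add(cluster_id)
    return {gene: sorted(ids) for gene, ids in index.items()}

# Partial memberships taken from the cell cycle examples in the text.
clusters = {
    3: ["PCNA", "CENP-F", "CDK7"],
    15: ["PCNA", "CENP-F", "Cyclin A", "Cyclin B1", "CDC28", "CDK7"],
    30: ["PCNA", "Madp2", "Cyclin A", "Cyclin B1", "CDC28"],
}

index = gene_to_clusters(clusters)

# Genes assigned to more than one cluster in this subset.
multi_cluster = {gene: ids for gene, ids in index.items() if len(ids) > 1}
# Genes assigned to exactly one cluster in this subset.
single_cluster = {gene: ids for gene, ids in index.items() if len(ids) == 1}
```

Within this subset PCNA appears in Clusters 3, 15, and 30, whereas Madp2 appears only in Cluster 30.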

Conclusions

We have shown that the proposed approach is able to perform single- as well as multi-cluster assignment of genes. We also showed that the approach is able to group genes in different clusters in which the variance due to gene factor has been significantly reduced compared to that of the given gene expression matrix. Different clusters correlated with “reprogramming” of fibroblasts, namely, signal transduction, immediate-early transcription factors, and other transcription factors, were found. Correlated average gene expression patterns were also found to indicate biologically significant functional gene classes, namely, (i) cell cycle and proliferation, (ii) coagulation and hemostasis, (iii) inflammation, (iv) angiogenesis, (v) tissue remodeling, (vi) cytoskeletal reorganization, (vii) re-epithelialization, and (viii) cholesterol biosynthesis.
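The variance comparison can be illustrated with a textbook two-way ANOVA without replication, in which rows are genes and columns are experimental conditions. This is a minimal sketch under that assumption and is not necessarily the exact computation used in this work. The function name `two_way_anova_ms` and the toy matrix are hypothetical.

```python
import numpy as np

def two_way_anova_ms(x):
    """Mean squares for a genes x conditions matrix with one value per cell."""
    x = np.asarray(x, dtype=float)
    n_genes, n_conds = x.shape  # both must be at least 2
    grand = x.mean()
    gene_means = x.mean(axis=1, keepdims=True)
    cond_means = x.mean(axis=0, keepdims=True)
    ss_gene = n_conds * np.sum((gene_means - grand) ** 2)
    ss_cond = n_genes * np.sum((cond_means - grand) ** 2)
    # Residual of the additive model: cell - row effect - column effect.
    resid = x - gene_means - cond_means + grand
    ss_resid = np.sum(resid ** 2)
    return {
        "ms_gene": float(ss_gene / (n_genes - 1)),
        "ms_condition": float(ss_cond / (n_conds - 1)),
        "mse": float(ss_resid / ((n_genes - 1) * (n_conds - 1))),
    }

# Toy matrix (assumption): two rising genes and two falling, offset genes.
full_matrix = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.1, 1.1, 2.0, 3.1],
    [8.0, 7.0, 6.0, 5.0],
    [8.1, 7.0, 5.9, 5.1],
])
cluster = full_matrix[:2]  # hypothetical cluster holding the two rising genes

full_ms = two_way_anova_ms(full_matrix)
cluster_ms = two_way_anova_ms(cluster)
```

In this toy example both the gene-factor mean square and the residual MSE of the two-gene cluster are far smaller than those of the full matrix, which is the kind of reduction described above.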

Moreover, the ACA does not have the following undesired features: (a) a priori specification of the number of centers (as required in self-organizing maps or in support vector machines); (b) dimensionality reduction (as done in principal component analysis and multidimensional scaling); (c) assumption about the data distribution; (d) a priori specified distance threshold; (e) only single-cluster assignments; and (f) nonunique solution. In conclusion, the proposed approach can effectively be used for multi-cluster gene assignments and identification of correlated clusters. Therefore, the approach can potentially be used as a preliminary step for providing basic insights into the genetic regulatory networks and cellular pathways.

References and Notes

(1) Fodor, S. P.; Rava, R. P.; Huang, X. C.; Pease, A. C.; Holmes, C. P.; Adams, C. L. Multiplexed biochemical assays with biological chips. Nature 1993, 364, 555-556.

(2) Schena, M.; Shalon, D.; Davis, R. W.; Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270, 467-470.

(3) Raychaudhuri, S.; Stuart, J. M.; Altman, R. B. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac. Symp. Biocomput. 2000, 5, 455-466.

(4) Bittner, M.; Meltzer, P.; Chen, Y.; Jiang, Y.; Seftor, E.; Hendrix, M.; Radmacher, M.; Simon, R.; Yakhini, Z.; Ben-Dor, A.; Sampas, N.; Dougherty, E.; Wang, E.; Marincola, F.; Gooden, C.; Lueders, J.; Glatfelter, A.; Pollock, P.; Carpten, J.; Gillanders, E.; Leja, D.; Dietrich, K.; Beaudry, C.; Berens, M.; Alberts, D.; Sondak, V.; Hayward, N.; Trent, J. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000, 406, 536-540.

(5) Roth, F. P.; Hughes, J. D.; Estep, P. W.; Church, G. M. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 1998, 16, 939-945.

(6) Eisen, M. B.; Spellman, P. T.; Brown, P. O.; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 14863-14868.

(7) Alon, U.; Barkai, N.; Notterman, D. A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A. J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 6745-6750.

(8) Tavazoie, S.; Hughes, J. D.; Campbell, M. J.; Cho, R. J.; Church, G. M. Systematic determination of genetic network architecture. Nat. Genet. 1999, 22, 281-285.

(9) Tamayo, P.; Slonim, D.; Mesirov, J.; Zhu, Q.; Kitareewan, S.; Dmitrovsky, E.; Lander, E. S.; Golub, T. R. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 2907-2912.

(10) Brown, M. P.; Grundy, W. N.; Lin, D.; Cristianini, N.; Sugnet, C. W.; Furey, T. S.; Ares, M., Jr.; Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 262-267.

(11) Hastie, T.; Tibshirani, R.; Eisen, M. B.; Alizadeh, A.; Levy, R.; Staudt, L.; Chan, W. C.; Botstein, D.; Brown, P. “Gene shaving” as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 2000, 1, 1-31.

(12) Shatkay, H.; Edwards, S.; Wilbur, W. J.; Boguski, M. Genes, themes, and microarray: using information retrieval for large-scale gene analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB); AAAI Press: Menlo Park, CA, 2000; pp 317-328.

(13) Halliday, D.; Resnick, R.; Walker, J. Fundamentals of Physics, 6th ed.; John Wiley & Sons Inc.: New York, 2000.

(14) Willett, P. Similarity and Clustering in Chemical Information Systems; Research Studies Press Ltd.: Letchworth, England, 1987.

(15) Heyer, L. J.; Kruglyak, S.; Yooseph, S. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 1999, 9, 1106-1115.

(16) Garg, S. Systems engineering approaches to computational biology. Ph.D. Thesis, University of Connecticut, 2002.

(17) Guda, K. K.; Cui, H.; Garg, S.; Achenie, L. E. K.; Nambiar, P.; Rosenberg, D. W. Multiscale gene expression profiling in a differentially susceptible mouse colon cancer model. Cancer Lett. 2003, 191, 17-25.

(18) Cui, H.; Guda, K. K.; Garg, S.; Mohler, P.; Achenie, L. E. K.; Rosenberg, D. W. Alterations of gene expression profiles in inflammatory mucosa, adenoma and adenocarcinomas in ulcerative colitis. Personal communication.

(19) Iyer, V. R.; Eisen, M. B.; Ross, D. T.; Schuler, G.; Moore, T.; Lee, J. C. F.; Trent, J. M.; Staudt, L. M.; Hudson, J., Jr.; Boguski, M. S.; Lashkari, D.; Shalon, D.; Botstein, D.; Brown, P. O. The transcriptional program in the response of human fibroblasts to serum. Science 1999, 283, 83-87.

Accepted for publication May 15, 2003.

BP025648P

1148 Biotechnol. Prog., 2003, Vol. 19, No. 4