

Page 1: [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances

Comparing Clustering Techniques for Real Microarray Data

Vilda Purutçuoğlu Gazi

Department of Statistics

Middle East Technical University

Ankara, Turkey

Email: [email protected]

Telephone: +90 (312) 210 5319

Elif Kayış

Institute of Applied Mathematics

Middle East Technical University

Ankara, Turkey

Email: [email protected]

Abstract—Clustering the genes detected as significant or differentially expressed gives biologists useful information about the functions and functional relationships of genes. Various clustering methods can be applied to genomic data; they fall mainly into two groups, namely, hierarchical and partitional methods. In this paper, as the novelty, we perform a detailed clustering analysis of the recently collected boron microarray dataset to obtain biologically more interesting results and to provide a basis for selecting the most effective method in the analysis of different microarray datasets. In the calculation, we implement agglomerative hierarchical clustering among the hierarchical techniques, use the k-means and PAMSAM methods among the partitional approaches, and finally use a recently improved method, called HIPAM, which is both a hierarchical and a partitional approach. In the assessment, we compare and discuss the significant genes of the boron data, whose estimated signals are found by the FGX normalization method.

I. INTRODUCTION

Microarray technology is a developing technique that measures gene expression levels via RNA under the conditions of interest. In this technology, thousands of genes are immobilized on the probes of small chips. In the preparation of a microarray, the RNA mixtures are extracted from the control and treatment cells of interest and coloured with dye molecules. They are then poured onto the chips for hybridization. The coloured RNA sequences find their target genes and attach to them. During the scanning of the chips, the dye molecules on the attached RNAs give signals. The signal intensities of each gene represent the expression level of that particular gene, i.e. the amount of RNA produced by the corresponding gene [1], [2].

Microarray studies can be divided into two main parts, namely, the normalization of signal intensities for the detection of differentially expressed genes under different conditions, and the clustering of the differentially expressed genes. In the following parts we present each of these steps in detail by using a real microarray dataset.


A. Normalization of Signal Intensities

Before detecting differentially expressed genes in a microarray study, the data are first purified of noisy signals by different types of normalization techniques so that the true signals hidden in the noisy data can be observed [?], [2]. The normalization process, which is composed of three steps for single-channel microarrays, has a specific order: i) spatial normalization, ii) background normalization, and iii) quantile normalization. In this study we are particularly interested in the background normalization, whose calculations are implemented by a novel approach, called the FGX (frequentist gene expression index) method [8].

In the analysis we use a dataset which describes the effects of boron on leaf cells [3]. These data can be downloaded freely from the Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) under the series GSE14521. The measurements cover 22840 genes and 9 arrays, of which 3 belong to the control group, 3 to the treatment-1 group, and the remaining 3 to the treatment-2 group. In the treatment-1 and treatment-2 groups, the barley leaves are subjected to 5 mM B(OH)3 and 10 mM B(OH)3 concentrations, respectively.

For this study, we only consider the measurements from the control and treatment-1 groups for the genes having 11 probe pairs, which total 22801 genes. Moreover, in the analysis, we initially select 1000 of the 22801 genes by taking every 22nd gene. Among these 1000 FGX-normalized genes, the differentially expressed ones are detected at the 0.05 significance level by using multiple hypothesis testing. From this detection, 93 of the 1000 genes are found to be differentially expressed [4]. Then we choose a suitable clustering approach among the alternatives for the analysis of the complete data of 22801 genes and interpret the results biologically.
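The selection and detection steps described above can be illustrated with a short sketch. The text does not state which multiple testing procedure is used, so the Benjamini-Hochberg step-up rule below is an assumption, and the function names `every_nth` and `benjamini_hochberg` as well as the gene labels are hypothetical. The p-values are taken as given inputs; how they are computed from the FGX-normalized signals is outside this sketch.

```python
def every_nth(items, step=22, limit=1000):
    """Systematic sample: take every `step`-th item, keeping at most `limit` items."""
    return items[step - 1::step][:limit]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    Assumption: the paper only says "multiple hypothesis testing";
    this particular correction is chosen for illustration.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical gene labels standing in for the 22801 probe sets.
genes = [f"gene_{i}" for i in range(1, 22802)]
subset = every_nth(genes)  # 1000 labels, starting with "gene_22"
```

Taking every 22nd of 22801 genes yields slightly more than 1000 candidates, so the sketch truncates the sample to the first 1000.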

B. Clustering Differentially Expressed Genes

Clustering is the division of the data into groups according to the similarity between the objects [5]. Accordingly, the objects in the same cluster are more homogeneous with one another than with the elements of other clusters. Its results therefore enable us to identify plausible genes having functional relationships. In microarray technology, clustering methods are applied for grouping both arrays and genes. When clustering is applied to the arrays, the effect of the treatment can be seen in that the control and treatment arrays form separate classes whenever the treatment has a clear effect on the overall gene expression pattern.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4799-2/12 $26.00 © 2012 IEEE

DOI 10.1109/ASONAM.2012.143

There are different types of clustering methods used in various areas. The hierarchical method, which defines a hierarchical structure among the clusters, is one of the oldest and most fundamental tools in gene clustering. The k-means is another basic and old method among the partitional clustering approaches [7]. On the other hand, PAMSAM (partitioning around medoids with Sammon mapping) [2] and HIPAM (hierarchical PAM) [2] are two recently improved methods: the former is a partitional approach based on PAM (partitioning around medoids) clustering and the Sammon mapping, and the latter shows both hierarchical and partitional features in the calculation.

In more detail, the hierarchical method takes each object as a distinct cluster and puts the closest objects together into one cluster [7]. Then, by repeatedly combining the closest branches, it gathers all objects into a single class. This method is therefore a bottom-up (agglomerative) hierarchical clustering approach. Moreover, it presents the final structure of the data as a tree-type graph, also called the dendrogram. In general, although the hierarchical method enables the user to interpret the data easily, in particular when the data have a natural hierarchical feature, it is not possible to go back and revise a branch of the tree if a mistake is observed in it. In such cases, the problematic branch may affect the following branches too and thereby the global results. The plot of the hierarchical clustering for the 93 differentially expressed genes is shown in Figure 1.

On the other hand, the k-means clustering algorithm chooses k clusters and calculates the centers of the clusters by taking the average of the objects. Then it assigns each object to the nearest cluster. This process continues until the clusters converge.
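The k-means iteration just described can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes squared Euclidean distance and user-supplied starting centers, and the name `kmeans` and its parameters are hypothetical.

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means: returns (labels, centers) from the given starting centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance, an assumption of this sketch).
        new_labels = [
            min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because the result depends on the starting centers passed in, running the sketch with different initial centers is a simple way to see the sensitivity to initialization.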

In this method, the initialization of the k clusters is determined by the user, and the performance of the clustering depends on this initialization. Furthermore, if the data have outliers, the final clustering can be unstable since the centers of the clusters are found by using the mean of the values within the cluster [2], [10]. In addition, the plot is not always as easily interpretable as those of the other methods.

In Figure 2 we present the k-means results for the boron dataset. The PAMSAM method partitions the objects into k clusters around medoids and improves the quality of the clusters by maximizing the average silhouette width (a.s.w.), which indicates how well an object is grouped. For visualization, it uses the Sammon mapping multidimensional scaling method [2]. The PAMSAM graph of our data is shown in Figure 3. In general, the PAMSAM method is preferable to the k-means in the sense that it is robust to outliers, as it uses medoids rather than centroids in the calculation. Additionally, since it maximizes the a.s.w. [2] as the objective function, it separates the clustered genes more distinctly than the other methods. From the PAMSAM plot in Figure 3, we observe that the clusters are more visible and interpretable than the ones in Figure 2. The PAMSAM plot also shows the distribution of the objects in each cluster.
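To make the medoid-based partitioning and the average silhouette width concrete, a small sketch is given below. It is a simplified greedy swap version of PAM operating on a precomputed dissimilarity matrix, not the PAMSAM implementation itself, and it leaves out the Sammon mapping used for visualization. The names `pam` and `average_silhouette_width` are hypothetical, and the silhouette function assumes at least two clusters.

```python
def pam(dist, k, max_iter=100):
    """Naive PAM: start from the first k objects, swap medoids while the total cost drops."""
    n = len(dist)
    medoids = list(range(k))

    def cost(meds):
        # Sum of dissimilarities between each object and its closest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


def average_silhouette_width(dist, labels):
    """Mean of s(i) = (b - a) / max(a, b); requires at least two clusters."""
    n = len(labels)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: s(i) is taken as 0
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = pam(dist, k=2)
asw = average_silhouette_width(dist, labels)
```

Since a medoid is always one of the observed objects, a single extreme value cannot drag the cluster representative away from the bulk of the data, which is the robustness property mentioned above.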

Fig. 1. Hierarchical clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes (cluster dendrogram). The vertical axis (height) shows the distance between branches and the horizontal axis presents the genes in each branch. The plot is based on the average linkage measure.
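The agglomerative procedure with average linkage that underlies Fig. 1 can be sketched as below. This is an illustrative, unoptimized version working on a precomputed dissimilarity matrix; the function name `average_linkage` and the toy data are hypothetical, and ties are broken by taking the first closest pair found.

```python
def average_linkage(dist, n_clusters=1):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    Returns the final clusters and the merge history as
    (members_a, members_b, height) tuples, which is the information
    a dendrogram displays.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise dissimilarities between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / (
                    len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


values = [0, 1, 2, 10, 11, 12]
dist = [[abs(x - y) for y in values] for x in values]
clusters, merges = average_linkage(dist, n_clusters=2)
```

Once two objects are merged they stay together for all later steps, which is why an early mistaken merge propagates to the rest of the tree.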

Fig. 2. k-means clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes. The axes show the sum of squares of each gene in 2-dimensional space.

Fig. 3. PAMSAM clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes. The axes show how clusters are different from each other in terms of correlation from -1 to 1.

Finally, the HIPAM method builds the hierarchy from the top down in such a way that, at each stage of the hierarchy, the current branches of the genes are further partitioned. The method continues the hierarchy until the maximum a.s.w. is reached. Compared with the other approaches, this clustering gives more stable global results than the hierarchical method, whereas the hierarchical method finds more stable local results. Moreover, the number of clusters is determined by the algorithm itself via the a.s.w. value: when the a.s.w. is maximized, the method finishes the clustering. The HIPAM results for the 93 differentially expressed genes are plotted in Figure 4.

In all the underlying analyses, we use correlation values as the dissimilarity measure, since our aim is to detect functionally related genes in boron-exposed cells. Among the various plant-specific reactions, it is observed that the reactions can be grouped under 4 categories, namely, jasmonic acid, glutathione S-transferase, pathogenesis-related, and senescence-associated genes [3]. Therefore, in all calculations, we set the number of classes to 4.
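A correlation-based dissimilarity of the kind referred to above can be sketched as follows. The text does not give the exact transformation from correlation to dissimilarity, so the common choice d = 1 - r (Pearson correlation) is an assumption here, and the names `pearson` and `correlation_dissimilarity` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_dissimilarity(profiles):
    """d(i, j) = 1 - r(i, j): near 0 for co-expressed genes, near 2 for opposite profiles."""
    n = len(profiles)
    return [[0.0 if i == j else 1.0 - pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]


# Three hypothetical expression profiles across three arrays.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
dist = correlation_dissimilarity(profiles)
```

With this choice, two genes whose expression rises and falls together are close even when their absolute levels differ, which suits the aim of detecting functionally related genes. A matrix built this way can be passed to any method that accepts precomputed dissimilarities.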

Fig. 4. HIPAM clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes. The graph shows the 2-dimensional representation of the hierarchical tree.

To sum up, as seen from the results of the different clustering approaches, the choice of the clustering method depends heavily on the question of interest. If it is known that there is a hierarchical structure in the data, the hierarchical and HIPAM methods can give more plausible outcomes; to obtain globally more stable results and clusters with clear differences, HIPAM is preferable to the hierarchical method. If, on the other hand, distinct clusters are required, the k-means and PAMSAM methods can be suggested, with PAMSAM outperforming the k-means in terms of clear visualization and robustness. Finally, to compare the findings of all these methods, we list the elements of each cluster and check their similarities. Here, we group the 93 differentially expressed genes under 4 classes. From the analyses we find that the hierarchical, k-means, and HIPAM approaches give very close findings, whereas the results from PAMSAM do not match them closely.

We then implement only the PAMSAM clustering to investigate the functional groups of the complete dataset, because there is no particular biological information that supports a natural hierarchical structure in the plants' reactions. Therefore we concentrate on the partitional clustering approaches and choose the more robust one among the alternatives. Accordingly, we take the 22801 genes, select the significantly differentially expressed ones, which total 1715 genes, and divide these significant genes into 4 functional groups. Some of the gene names from each group are presented in Table I as examples, and the PAMSAM plot of this new dataset is shown in Figure 5. The full names of the genes can be found at http://www.plexdb.org/modules/PD-probeset/annotation.php, and the complete list of genes in all classes is available as Supplementary Material upon request.
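The comparison of cluster memberships across methods mentioned above is described only as listing the elements of each cluster and checking their similarities. One possible way to quantify such agreement, shown here purely as an assumption and not as the procedure used in the study, is the Rand index, which counts the pairs of genes on which two partitions agree. The function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated alike by both partitions.

    A pair counts as an agreement when both partitions put the two
    objects together, or both put them apart. Cluster numbering does
    not matter, so relabelled but identical partitions score 1.0.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Identical groupings with swapped cluster numbers agree on every pair.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Values close to 1 would correspond to the "very close findings" reported for the hierarchical, k-means, and HIPAM results, while lower values would indicate a partition that differs from the others.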

Fig. 5. PAMSAM clustering of 1715 genes detected after FGX normalization of the complete 22801 genes. The axes show how clusters are different from each other in terms of correlation from -1 to 1.

From the clustered genes in Table I, we detect that Cluster 1, Cluster 2, Cluster 3, and Cluster 4 are related to the functional categories of jasmonic acid biosynthesis, pathogenesis-related, senescence, and glutathione-related genes, respectively. Among these, jasmonic acid is basically activated when the cell faces an environmental stress or pathogens, the latter referring to disease-related species. On the other hand, the senescence-related genes are responsible for the aging of the cell, and the glutathione genes protect the cells from oxidative tissue damage [3]. Although these functional groups are distinct in the non-hierarchical divisions, they form two categories under HIPAM clustering, which joins the pathogenesis-related genes with the senescence-associated genes and merges the jasmonic acid biosynthesis genes with the glutathione-related genes. We believe that the resulting clusters, particularly those from PAMSAM, can be helpful for reconstructing the barley leaf network with and without boron toxicity in more detailed research and for the molecular distinction of genes.

TABLE I. EXAMPLES OF SIGNIFICANT GENES FROM PAMSAM CLUSTERING UNDER 4 FUNCTIONAL CATEGORIES OF BARLEY LEAVES EXPOSED TO 5 mM BORON TOXICITY.

Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Contig101 at | AFFX-r2-Bs-dap-3 at | AFFX-r2-Bs-lys-3 at | 1200459 Reg 88-1740 at
HM05O08r at | 74797-75570.AF427791 x at | Contig333 M at | Contig140 at
Contig897 s at | Contig333 3 x at | U29604.1 at | Contig141 at
HV09P07u s at | 1289374 Reg 826-1545 at | Contig558 s at | Contig282 x at
Contig1512 at | AJ421947.1 s at | HA28L22r s at | Contig286 s at
rbags7l20 s at | Contig41 at | Contig172 s at | Contig682 s at
Contig1858 at | Contig41 x at | Contig545 s at | Contig661 at
Contig1865 at | HVSMEb0014C02r2 s at | Contig585 x at | Contig700 s at
Contig2462 at | Contig67 at | Contig785 x at | Contig1281 at
Contig2502 at | Contig86 at | Contig931 at | Contig1478 s at
... | ... | ... | ...
HV CEa0004A08r2 at | rbags12l21 at | HVSMEh0083J23r2 s at | 6 HD04M05u at
HVSMEl0006J08r2 s at | rbags13f02 at | HVSMEi0015K21r2 x at | HV CEb0024B09r2 at
HVSMEa0002A17r2 at | rbags16k21 at | rbaal20n01 s at | HVSMEm0022B10r2 at
HVSMEa0006I22r2 x at | rbags17k13 x at | rbaal31f01 at | HVSMEl0003I12r2 at
HVSMEb0001F21r2 x at | rbags18k24 s at | rbaal33f13 at | HVSMEa0016D02r2 at
rbaal1h20 at | rbags19k14 at | rbaal34e03 at | rbaal0c08 at
rbaal21c12 at | rbags19n07 at | rbaal38e14 x at | rbaal10k01 at
rbaal23d16 s at | rbags1c11 at | rbags11o22 at | rbaal11f18 at
rbaal30c10 at | rbags22n02 at | rbags18a16 s at | rbaal20h12 s at
rbags16d08 at | rbags24e18 at | rbags19e09 at | rbaal23n02 s at
Total number of genes: 249 | 675 | 407 | 384

II. CONCLUSION

From the results of the microarray analyses, we consider that if the aim of the study is to divide the genes into disjoint clusters so as to see their functional properties, the PAMSAM method can be suggested over other partitional methods such as k-means. But if the functional relationship between the clusters is also required, the HIPAM method can be preferred since it gives more stable global results than the hierarchical clustering approach. Therefore PAMSAM and HIPAM are our suggested approaches for non-hierarchical and hierarchical clustering, respectively. Here, however, we interpret the results of PAMSAM, since the biological knowledge about the cell supports a non-hierarchical structure in the barley pathway. Furthermore, from these two methods, we get more boron-affected genes in the majority of clusters under the complete microarray data. As future work, we plan to combine all these methods in the consensus clustering method [11], which enables us to get a unique merged cluster matrix weighted by the selected clustering approaches. We also plan to increase the stability of the k-means algorithm by the method of the minimal spanning tree [12].

ACKNOWLEDGMENT

The authors would like to thank Prof. Gerhard Wilhelm Weber for his helpful discussion.

REFERENCES

[1] M. Schena, Microarray Analysis, Hoboken, New Jersey, England: John Wiley and Sons, 2003.

[2] E. Wit and J. McClure, Statistics for Microarray Design, Analysis, and Inference, 1st ed. England: John Wiley and Sons, 2004.

[3] M. T. Oz, R. Yılmaz, F. Eyidogan, L. de Graaff, M. Yucel, and H. A. Oktem, Microarray analysis of late response to boron toxicity in barley (Hordeum vulgare L.) leaves, Turkish Journal of Agriculture and Forestry, Vol:33, 191-202, 2009.

[4] V. Purutcuoglu, E. Kayıs, and G. W. Weber, Survey of background normalizations for Affymetrix arrays and a case study, preprint: 1-22. Chapter in: Advances in Intelligent Modelling and Simulation: Simulation Tools and Applications. Editor: A. Byrski, Z. Oplatkova, M. Carvalho, and M. Kisiel-Dorohinicki. Springer, 2011.

[5] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, 4th ed., ARNOLD-A Member of the Hodder Headline Group, 2001.

[6] S. Knudsen, A Biologists Guide to Analysis of DNA Microarray Data, England: John Wiley and Sons, 2002.

[7] G. Gan, C. Ma, and J. Wu, Data Clustering Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics, 2007.

[8] V. Purutcuoglu and E. Wit, FGX: a frequentist gene expression index for Affymetrix arrays, Biostatistics, Vol:8, 433-37, 2007.

[9] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics, Vol:4, 249-64, 2003.

[10] S. Ben-David, D. Pal, and H. U. Simon, Stability of k-means clustering, In Proceedings of the 20th Annual Conference on Computational Learning Theory, 2034, 2007.

[11] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Machine Learning, Vol:52, 91-118, 2003.

[12] Z. Barzily, Z.V. Volkovich, B. Akteke-Ozturk, and G.-W. Weber, Cluster stability using minimal spanning trees, ISI Proceedings of EURO Mini Conference, Lithuania, 248-252, 2008.
