[ieee 2012 international conference on advances in social networks analysis and mining (asonam 2012)...
Comparing Clustering Techniques for Real Microarray Data
Vilda Purutcuoglu Gazi
Department of Statistics
Middle East Technical University
Ankara, Turkey
Email: [email protected]
Telephone: +90 (312) 210 5319
Elif Kayıs
Institute of Applied Mathematics
Middle East Technical University
Ankara, Turkey
Email: [email protected]
Abstract—The clustering of genes detected as significant or differentially expressed provides useful information to biologists about the functions and functional relationships of genes. Various types of clustering methods can be applied to genomic data. These are mainly divided into two groups, namely, hierarchical and partitional methods. In this paper, as the novelty, we perform a detailed clustering analysis of the recently collected boron microarray dataset to obtain biologically more interesting results and to construct a basis for the selection of the most effective method in the analysis of different microarray data. In the calculation, we implement agglomerative hierarchical clustering among the hierarchical techniques and use the k-means and PAMSAM methods among the partitional clustering approaches, and finally use a recently improved method, called HIPAM, which is both a hierarchical and a partitional approach. In the assessment, we compare and discuss the significant genes of the boron data whose estimated signals are found by the FGX normalization method.
I. INTRODUCTION
Microarray technology is a developing technique that
measures gene expression levels via RNA under conditions of
interest. In this technology, thousands of genes are
immobilized on probes of small chips. In the preparation of a
microarray, the RNA mixtures are extracted from the control
and treatment cells of interest and coloured by dye molecules.
Then they are poured on the chips for hybridization. The
coloured RNA sequences find their target genes and attach to
them. During the scanning of the chips, the dye molecules on
the attached RNAs give signals. The signal intensity of each
gene represents the expression level of that particular gene,
i.e. the amount of RNA produced by the corresponding gene
[1], [2].
Microarray studies can be divided into mainly two parts,
namely, the normalization of signal intensities for the detection
of differentially expressed genes under different conditions
and the clustering of the differentially expressed genes. In the
following parts we present each of these steps in detail by
using a real microarray dataset.
April 15, 2012
A. Normalization of Signal Intensities
Before detecting differentially expressed genes in a microarray
study, the data are first purified from noisy signals by
different types of normalization techniques so that the true
signals hidden in the noisy data can be observed [?], [2].
The normalization process, which is composed of three steps
for single channel microarrays, has a specific order: i) spatial
normalization, ii) background normalization, and iii) quantile
normalization. In this study we are particularly interested in
the background normalization, whose calculations are
implemented by a novel approach called the FGX (frequentist gene
expression index) method [8]. In the analysis we use a dataset
which describes the effects of boron on leaf cells [3]. These data
can be downloaded freely from the Gene Expression Omnibus
(GEO) (http://www.ncbi.nlm.nih.gov/geo/)
under the series GSE14521. The measurements cover 22840
genes and 9 arrays, of which 3 belong to the
control group, 3 to the treatment-1
group, and the remaining 3 to the treatment-2
group. In the treatment-1 and treatment-2 groups, the
barley leaves are subjected to 5 mM B(OH)3 and 10 mM
B(OH)3 concentrations, respectively. For this study, we only
consider the measurements from the control and treatment-1
groups for the genes having 11 probe pairs, which amount to
22801 genes in total. Moreover, in the analysis, we initially
select 1000 of the 22801 genes by arbitrarily choosing every
22nd gene. Among these 1000 FGX-normalized genes, the
differentially expressed ones are detected at the 0.05
significance level by multiple hypothesis testing. From this
detection, 93 of the 1000 genes are found to be differentially
expressed [4]. Then we choose a suitable clustering approach
among the alternatives for the analysis of the complete data
with 22801 genes and interpret the results biologically.
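The multiple testing correction behind the 0.05 level is not specified in this section. As a minimal Python sketch, assuming the Benjamini-Hochberg step-up procedure (an illustrative choice, not necessarily the correction used in [4]) and a hypothetical list of per-gene p-values, the selection of differentially expressed genes could look as follows. The function name and the toy p-values are assumptions made only for this illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False flag per p-value (True = declared significant).

    Assumption: Benjamini-Hochberg step-up control of the false discovery
    rate; the source only states "multiple hypothesis testing" at 0.05.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value
        # lies under its threshold alpha * rank / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for ten genes (illustration only).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(toy_p, alpha=0.05)
print(sum(flags), "of", len(toy_p), "toy genes flagged")
```

Because the rule is step-up, a gene can be flagged even when its own p-value misses its threshold, as long as a larger p-value further down the sorted list meets its threshold.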
B. Clustering Differentially Expressed Genes
Clustering is the division of the data into groups according
to the similarity between the objects [5]. Accordingly,
the objects in the same cluster are more similar to each other
than to the elements of the other clusters.
Therefore its results enable us to identify plausible genes
having functional relationships. In microarray technology, the
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4799-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ASONAM.2012.143
clustering methods are applied for grouping both arrays and
genes. When clustering is applied to the arrays, the effect of
the treatment condition can be seen in such a way that the
arrays of the control and treatment groups form separate
classes when the treatment condition has a clear
effect on the overall gene expression pattern.
There are different types of clustering methods that are used
in various areas. The hierarchical method, which defines a
hierarchical structure in the clusters, is one of the fundamental
and oldest tools in gene clustering. The k-means is another basic
and old method among the partitional clustering approaches [7].
On the other hand, PAMSAM (partitioning around medoids with
Sammon mapping) [2] and HIPAM (hierarchical PAM) [2] are
two recently improved methods: the former
is a partitional approach based on the PAM (partitioning
around medoids) clustering and the Sammon mapping, and
the latter shows both hierarchical and partitional features in
the calculation.
In more detail, the hierarchical method takes each object
as a distinct cluster and puts the closest objects together into
one cluster [7]. Then, by combining the closest branches, it
accumulates all objects into a single class. Hence this method
can be considered as a bottom-up (agglomerative) hierarchical
clustering approach. Moreover, this approach presents the
final structure of the data as a tree-type graph, also called
the dendrogram. In general, although the hierarchical method
enables the user to interpret the data easily, in particular
when the data have a natural hierarchical feature, it is not
possible to go back to earlier branches of the tree if a mistake
is observed in any branch. In such cases, the problematic
branch may affect the following branches too and thereby the
global results. The plot of the hierarchical clustering for the
93 differentially expressed genes is shown in Figure 1. On
the other hand, the k-means clustering algorithm chooses k
clusters and calculates the centers of the clusters by taking the
average of the objects. Then it assigns the objects to the nearest
cluster. This process continues until the clusters converge.
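The k-means iteration described above can be sketched in a few lines of Python. This is a minimal illustration with squared Euclidean distance and user-supplied starting centers. The function names, the toy points, and the distance choice are assumptions for illustration, not the implementation used for the boron data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: alternate assignment and mean update until stable."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data: two well separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centers = kmeans(toy_points, [(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)  # [0, 0, 0, 1, 1, 1]
```

The starting centers are passed in explicitly, so the result depends on the values the user supplies.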
In this method, the initialization of the k clusters is determined
by the user and the performance of the clustering depends
on this initialization. Furthermore, if the data have outliers,
the final clustering can be unstable since the centers of the
clusters are found by using the mean of the values within the
cluster [2], [10]. In addition, the resulting plot is
not always as easily interpretable as those of the other methods.
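The sensitivity of a mean-based center to outliers can be seen on a small example. The Python sketch below compares the centroid (mean) of a one-dimensional cluster with its medoid, i.e. the observed value with the smallest total distance to all other values, which is the kind of representative used by PAM-type methods. The toy values and function names are hypothetical and serve only as an illustration.

```python
def centroid(values):
    """Mean of the values: the cluster center used by k-means."""
    return sum(values) / len(values)


def medoid(values):
    """Observed value with the smallest total absolute distance to the rest."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


# Hypothetical cluster with one extreme value (illustration only).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(centroid(cluster))  # 22.0, pulled far towards the outlier
print(medoid(cluster))    # 3.0, stays inside the bulk of the data
```

A single extreme value moves the centroid well outside the bulk of the cluster, while the medoid remains one of the typical observations.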
In Figure 2 we present the k-means results for the boron
dataset. The PAMSAM method partitions the objects into k
clusters around medoids and improves the performance of the
clusters by maximizing the average silhouette width (a.s.w.)
value, which gives information about how well an object is
grouped. For the visualization, it uses the Sammon mapping
multidimensional scaling method [2]. The PAMSAM graph
of our data is shown in Figure 3. In general, the PAMSAM
method is preferable to the k-means in the sense that
the former is robust to outliers as it uses medoids, rather than
centroids, in the calculation. Additionally, since it maximizes
the a.s.w. [2] as the objective function, it shows the distinctions
among the clustered genes more clearly than the other methods.
From the plot of PAMSAM in Figure 3, we observe that the
clusters are more visible and interpretable than the ones in
Figure 2. In the PAMSAM plot, the distribution of the objects
within each cluster can also be seen.
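The average silhouette width mentioned above can be computed directly from a dissimilarity matrix and a set of cluster labels. The following Python sketch uses the standard silhouette definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean dissimilarity of object i to its own cluster and b(i) is the smallest mean dissimilarity to any other cluster. The function name and the toy data are assumptions for illustration and are not taken from the PAMSAM implementation.

```python
def average_silhouette_width(dist, labels):
    """Average silhouette width for a square dissimilarity matrix.

    dist[i][j] is the dissimilarity between objects i and j;
    labels[i] is the cluster label of object i.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) is taken as 0
        # a: mean dissimilarity to the other members of the own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean dissimilarity to the members of another cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Toy example: four points on a line, compared under two labelings.
toy = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(x - y) for y in toy] for x in toy]
print(average_silhouette_width(toy_dist, [0, 0, 1, 1]))  # close to 0.9
print(average_silhouette_width(toy_dist, [0, 1, 0, 1]))  # negative
```

Values near 1 indicate well separated clusters, while negative values indicate objects that sit closer to another cluster than to their own.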
Fig. 1. Hierarchical clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes (cluster dendrogram; linkage: average). The vertical axis (height) shows the distance between branches and the horizontal axis presents the genes in each branch. The plot is based on the average linkage measure.
Fig. 2. k-means clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes. The axes show the sum of squares of each gene in 2-dimensional space.
Fig. 3. PAMSAM clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes (clusters labelled 1 to 4). The axes show how clusters are different from each other in terms of correlation from -1 to 1.
Finally, HIPAM starts the hierarchical construction from
the top and continues downwards in such a way that, at each
stage of the hierarchy, the current branches of the genes are
further partitioned. The method continues the hierarchy until
the maximum a.s.w. is reached. Compared with the findings
of the alternative approaches, this clustering gives more
stable global results than the hierarchical method, whereas the
hierarchical method finds more stable local results. Moreover,
the number of clusters is determined by the algorithm
itself via the a.s.w. value: when the a.s.w. is maximized,
the method finishes the clustering. The HIPAM results for the
93 differentially expressed genes are plotted in Figure 4. In
all the underlying analyses, we use correlation values as the
dissimilarity measure since our aim is to
detect functionally related genes in boron exposed cells. In
various plant-specific reactions, it is observed that the reactions
can be grouped under 4 categories, namely, jasmonic acid,
glutathione s-transferase, pathogenesis related, and senescence
associated genes [3]. Therefore, in all calculations, we set the
number of classes to 4.
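The combination of a correlation-based dissimilarity with a fixed number of clusters can be sketched as follows in Python. The exact form of the dissimilarity is not stated above, so the sketch assumes d = 1 - r, with r the Pearson correlation between two expression profiles, and feeds it into a simple average linkage agglomerative procedure that stops at the requested number of clusters. All function names and the toy profiles are hypothetical, and the example stops at 2 clusters only because the toy data contain 4 profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_dissimilarity(profiles):
    """Matrix of 1 - r (assumed form of the correlation dissimilarity)."""
    n = len(profiles)
    return [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def average_linkage(dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean of all pairwise dissimilarities.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical expression profiles: two rising and two falling genes.
toy_profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 1]]
toy_dist = correlation_dissimilarity(toy_profiles)
print(average_linkage(toy_dist, n_clusters=2))  # [[0, 1], [2, 3]]
```

With this dissimilarity, genes whose profiles rise and fall together are close regardless of their absolute expression levels, which fits the aim of finding functionally related genes.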
Fig. 4. HIPAM clustering of 93 genes detected after FGX normalization of the small dataset with 1000 genes. The graph shows the 2-dimensional representation of the hierarchical tree.
To sum up, as seen from the results of the different clustering
approaches, the choice of the clustering method highly
depends on the question of interest. If it is known that there
is a hierarchical structure in the data, the hierarchical and
HIPAM methods can give more plausible outcomes. But in
order to get globally more stable results and clusters with
clear differences, HIPAM is preferable to the
hierarchical method. If, on the other hand, distinct clusters are
required, the k-means and PAMSAM methods can be suggested, and
PAMSAM outperforms the k-means in terms of clear
visualization and robustness. Finally, to compare the findings of
all these methods, we list the elements of each cluster and
check their similarities. Hereby, we group the 93 differentially
expressed genes under 4 classes. From the analyses we find
that the hierarchical, k-means, and HIPAM approaches give
very close findings, whereas the results from PAMSAM do
not match them closely. We then implement only
the PAMSAM clustering to investigate the functional
groups of the complete dataset, because there is
no particular biological information which supports a
natural hierarchical structure in the plants' reactions. Therefore
we concentrate on the partitional clustering approaches and choose
the more robust one among the alternatives. Hereby we take the
22801 genes, select the significantly differentially
expressed genes, which are 1715 in total, and divide
these significant genes into 4 functional groups. Some of
the gene names from each group are presented in Table I
as examples and the PAMSAM plot of this new dataset is
shown in Figure 5. The full names of the genes can be found at
http://www.plexdb.org/modules/PD-probeset/annotation.php and the complete list of genes in all
classes is available as Supplementary Material upon request.
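The comparison of cluster memberships across methods can also be summarized numerically. As a hedged illustration, and not the procedure used above (where the cluster elements are listed and compared directly), the Python sketch below computes the Rand index between two labelings: the fraction of object pairs on which both labelings agree, either by placing the pair in the same cluster or by separating it. The function name and the toy labelings are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are needed")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical labelings of four genes by two clustering methods.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, renamed
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 2 of 6 pairs agree
```

The index does not depend on how the clusters are numbered, so partitions from different methods can be compared without first matching their labels.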
Fig. 5. PAMSAM clustering of 1715 genes detected after FGX normalization of the complete 22801 genes (clusters labelled 1 to 4). The axes show how clusters are different from each other in terms of correlation from -1 to 1.
From the clustered genes in Table I, we detect that Cluster 1,
Cluster 2, Cluster 3, and Cluster 4 are related to the functional
categories of jasmonic acid biosynthesis, pathogenesis,
senescence, and glutathione related genes, respectively. Among
these activations, basically, jasmonic acid is activated when
the cell faces an environmental stress or pathogens, the latter
involving the disease related genes. On the other hand, the
senescence related genes are responsible for the aging of the cell
and, finally, the glutathione genes protect the cells from oxidative
tissue damage [3]. Indeed, although these functional groups
are distinct in non-hierarchical divisions, they imply two
categories under HIPAM clustering by joining the pathogenesis related
genes with the senescence associated genes and merging the jasmonic
acid biosynthesis genes with the glutathione related genes. We
believe that the resulting clusters, particularly those from PAMSAM,
can be helpful for the reconstruction of the barley leaf
network with and without boron toxicity in more
detailed research and for the molecular distinction of genes.
TABLE I
EXAMPLES OF SIGNIFICANT GENES FROM PAMSAM CLUSTERING UNDER 4 FUNCTIONAL CATEGORIES OF BARLEY LEAVES EXPOSED TO 5 mM BORON TOXICITY.
Cluster 1                   Cluster 2                   Cluster 3                   Cluster 4
Contig101_at                AFFX-r2-Bs-dap-3_at         AFFX-r2-Bs-lys-3_at         1200459_Reg_88-1740_at
HM05O08r_at                 74797-75570.AF427791_x_at   Contig333_M_at              Contig140_at
Contig897_s_at              Contig333_3_x_at            U29604.1_at                 Contig141_at
HV09P07u_s_at               1289374_Reg_826-1545_at     Contig558_s_at              Contig282_x_at
Contig1512_at               AJ421947.1_s_at             HA28L22r_s_at               Contig286_s_at
rbags7l20_s_at              Contig41_at                 Contig172_s_at              Contig682_s_at
Contig1858_at               Contig41_x_at               Contig545_s_at              Contig661_at
Contig1865_at               HVSMEb0014C02r2_s_at        Contig585_x_at              Contig700_s_at
Contig2462_at               Contig67_at                 Contig785_x_at              Contig1281_at
Contig2502_at               Contig86_at                 Contig931_at                Contig1478_s_at
...                         ...                         ...                         ...
HV_CEa0004A08r2_at          rbags12l21_at               HVSMEh0083J23r2_s_at        6 HD04M05u_at
HVSMEl0006J08r2_s_at        rbags13f02_at               HVSMEi0015K21r2_x_at        HV_CEb0024B09r2_at
HVSMEa0002A17r2_at          rbags16k21_at               rbaal20n01_s_at             HVSMEm0022B10r2_at
HVSMEa0006I22r2_x_at        rbags17k13_x_at             rbaal31f01_at               HVSMEl0003I12r2_at
HVSMEb0001F21r2_x_at        rbags18k24_s_at             rbaal33f13_at               HVSMEa0016D02r2_at
rbaal1h20_at                rbags19k14_at               rbaal34e03_at               rbaal0c08_at
rbaal21c12_at               rbags19n07_at               rbaal38e14_x_at             rbaal10k01_at
rbaal23d16_s_at             rbags1c11_at                rbags11o22_at               rbaal11f18_at
rbaal30c10_at               rbags22n02_at               rbags18a16_s_at             rbaal20h12_s_at
rbags16d08_at               rbags24e18_at               rbags19e09_at               rbaal23n02_s_at
Total number of genes:
249                         675                         407                         384
II. CONCLUSION
From the results of the microarray analyses, we consider that
if the aim of the study is to divide the genes into disjoint
clusters so as to see their functional properties, the PAMSAM
method can be suggested over other partitional methods like
k-means. But if the functional relationship between the clusters
is also required, the HIPAM method can be preferred since it
gives more stable global results than the hierarchical
clustering approach. Therefore PAMSAM and
HIPAM are our suggested approaches for non-hierarchical and
hierarchical clustering, respectively. Here, however, we interpret
the results of PAMSAM since the biological knowledge about
the cell supports a non-hierarchical structure in the barley
pathway. Furthermore, from these two methods, we get more
boron-affected genes in the majority of clusters under the complete
microarray data. As future work, we plan to combine
all these methods in the consensus clustering method [11],
which enables us to get a unique merged cluster matrix
weighted by the selected clustering approaches. We also
plan to increase the stability of the k-means algorithm
by the method of the minimal spanning tree [12].
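One way to picture a merged cluster matrix weighted by the selected clustering approaches is a weighted co-membership matrix. The Python sketch below is only an assumption about how such a combination could look and is not the resampling-based algorithm of [11]: each entry (i, j) is the weighted fraction of methods that place objects i and j in the same cluster. The function name, the weights, and the toy labelings are hypothetical.

```python
def weighted_consensus_matrix(labelings, weights=None):
    """Weighted fraction of labelings that put each pair in one cluster.

    labelings: list of label lists, one per clustering method.
    weights:   optional non-negative weight per method (default: equal).
    """
    n = len(labelings[0])
    if weights is None:
        weights = [1.0] * len(labelings)
    total_weight = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total_weight
    return matrix


# Hypothetical labelings of four genes by three clustering methods.
toy_labelings = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
consensus = weighted_consensus_matrix(toy_labelings)
print(consensus[0][1])  # genes 0 and 1 are grouped together by all methods
print(consensus[0][3])  # genes 0 and 3 are never grouped together
```

Entries close to 1 mark gene pairs on which the methods agree, and one minus the entry could in turn serve as a dissimilarity for a final clustering step.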
ACKNOWLEDGMENT
The authors would like to thank Prof. Gerhard Wilhelm
Weber for his helpful discussion.
REFERENCES
[1] M. Schena, Microarray Analysis, Hoboken, New Jersey, England: John Wiley and Sons, 2003.
[2] E. Wit and J. McClure, Statistics for Microarray Design, Analysis, and Inference, 1st ed. England: John Wiley and Sons, 2004.
[3] M. T. Oz, R. Yılmaz, F. Eyidogan, L. de Graaff, M. Yucel, and H. A. Oktem, Microarray analysis of late response to boron toxicity in barley (Hordeum vulgare L.) leaves, Turkish Journal of Agriculture and Forestry, 191–202, Vol:33, 2009.
[4] V. Purutcuoglu, E. Kayıs, and G. W. Weber, Survey of background normalizations for Affymetrix arrays and a case study, preprint: 1-22. Chapter in: Advances in Intelligent Modelling and Simulation: Simulation Tools and Applications. Editor: A. Byrski, Z. Oplatkova, M. Carvalho, and M. Kisiel-Dorohinicki. Springer, 2011.
[5] B. S. Everitt, S. Landau, and M. Leese, Cluster Analysis, 4th ed., ARNOLD-A Member of the Hodder Headline Group, 2001.
[6] S. Knudsen, A Biologist's Guide to Analysis of DNA Microarray Data, England: John Wiley and Sons, 2002.
[7] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics, 2007.
[8] V. Purutcuoglu and E. Wit, FGX: a frequentist gene expression index for Affymetrix arrays, Biostatistics, Vol:8, 433-37, 2007.
[9] R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis, U. Scherf, and T. P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics, Vol:4, 249-64, 2003.
[10] S. Ben-David, D. Pal, and H. U. Simon, Stability of k-means clustering, In Proceedings of the 20th Annual Conference on Computational Learning Theory, 20-34, 2007.
[11] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Machine Learning, Vol:52, 91-118, 2003.
[12] Z. Barzily, Z. V. Volkovich, B. Akteke-Ozturk, and G.-W. Weber, Cluster stability using minimal spanning trees, ISI Proceedings of EURO Mini Conference, Lithuania, 248-252, 2008.