
Unravelling regulatory modules involved in Amyotrophic Lateral Sclerosis

Mafalda Ruas Gonçalves
Under supervision of Sara Alexandra Cordeiro Madeira and Alexandre Paulo Lourenço Francisco

Dep. Bioengineering, IST, Lisbon, Portugal

November 2012

Abstract

Amyotrophic Lateral Sclerosis (ALS) is a devastating disease whose pathogenesis is still not fully understood. In the literature, evidence regarding the genetic framework of the disease abounds. In the present work, two different approaches are combined in an attempt to unravel novel disturbed biological pathways in sporadic cases of ALS. First, standard unsupervised data mining techniques, hierarchical clustering and K-Means clustering, are applied to samples and genes. Then, Weighted Gene Co-expression Network Analysis (WGCNA), which is based on network concepts, is used to obtain modules of highly correlated genes. Several network concepts are integrated to guide the algorithm and assist the selection of the most cohesive modules. The purpose of using both procedures is to identify genes involved in abnormal biological processes. Toward this end, an overlap study of the resulting clusters and modules was performed, followed by functional enrichment with Gene Ontology and KEGG terms. Both approaches identified significant groups of genes, which should be analysed in depth in future work. The application of WGCNA provided a more straightforward identification of enriched modules. However, clustering techniques also led to results highly correlated with the disease. As future work, it is suggested to compare these different approaches and, to improve the confidence of the results, to consider a larger dataset.

Keywords: Amyotrophic Lateral Sclerosis, K-Means, Hierarchical Clustering, Weighted Gene Co-expression Network Analysis, Regulatory Gene Networks, Transcriptomics

1. Amyotrophic Lateral Sclerosis

Amyotrophic Lateral Sclerosis (ALS), also colloquially known as Lou Gehrig's disease [3], is the most frequent motor neuron disorder with adult onset. It is characterized by the progressive degeneration of the upper motor neurons of the corticospinal tract and the lower motor neurons of the anterior horns of the spinal cord [5, 22]. It leads to progressive weakness and atrophy of muscles, paralysis and ultimately death [30].

The pathogenesis of ALS is still unknown [22]. Regarding the genetic origin of the disease, 5-10% of ALS cases are of familial origin (fALS), whilst the majority are sporadic (sALS) [30]. The two types present indistinguishable clinical manifestations [30]. It is now considered that there is enough evidence of a genetic contribution in both [3].

The list of genes associated with the familial form keeps increasing, and the most consensually identified genes are SOD1, TARDBP, fused in sarcoma (FUS) and optineurin (OPTN) [5]. The multifactorial pathogenesis of this disease has been associated with perturbation of non-neuronal cells, especially astrocytes and microglia, neuroinflammatory processes, protein aggregation or inclusions, oxidative stress and abnormal axonal transport or axonopathy [30, 5, 3].

2. Microarray Technology

Microarray technology allows the parallel assessment of thousands of gene expression profiles and is nowadays an established tool in the study of the behaviour of genes under different conditions [11]. In the past decades, the evolution of this technique was accompanied by the advent of database technology [26], and it is now possible to combine datasets from different studies to increase the confidence of results. The applications of microarrays are vast in clinical practice and genetic research, as they offer the possibility to register the gene expression response to external factors or to differences in the phenotype [8, 23].

3. Microarray Data Analysis

Several approaches have been proposed over the years to uncover biological meaning from microarray data, but many questions remain unanswered [29]. Besides, the results obtained with this technique would be more robust if the analysis were made in a standardized way [8, 23].

Usually, microarray studies start with low-level operations, such as filtering and normalisation, and finish with high-level ones, like clustering techniques or other pattern recognition algorithms [4].

In these studies, genes may be considered individually, in a procedure known as differential expression analysis [8, 23, 9], which focuses on the identification of genes (or pathways) that present significantly different expression profiles between two classes of samples. Another approach is to consider genes in a more global manner, using clustering/biclustering and classification techniques or network inference [27, 23]. There appears to be no consensus as to which method is best; rather, different techniques may explore different aspects of the data.

Supervised approaches are chosen to classify the samples into known classes, which requires prior knowledge of the data structure [4, 29]. The most interesting application would be to identify types and sub-types of diseases in diagnosis by finding a list of relevant genes that allow this classification [29, 23]. However, in most biological problems targeted by microarrays, the class labels required by supervised methods are not provided. Therefore, in microarray data analysis, unsupervised methods are more frequently used, as they attempt to unravel novel or unexpected hidden patterns in the data [4, 23].

Clustering algorithms are a data mining concept applicable to different contexts and may be divided into hierarchical and partitioning (or non-hierarchical) methods [27]. The three most commonly used clustering algorithms are hierarchical clustering and two partitioning techniques, k-means and self-organizing maps [23]. Over the years, several techniques, some of which derive from these three, have been widely applied to microarray data [27].

In 2005, Zhang and Horvath [28] demonstrated that gene co-expression networks follow an approximately scale-free topology and developed a pipeline to determine gene co-expression networks, known as Weighted Gene Co-expression Network Analysis (WGCNA) [28]. This procedure was implemented as a package of functions [14] in the R project [2]. The method has been extensively applied, for example to study the human transcriptome and the preservation of gene co-expression modules between different human brain cells [19], and even the differences between human and mouse brain cells [18]. There are also studies more focused on the properties of the method, such as the study of the essentiality of hub genes using yeast microarray data [6]. In the context of ALS, this method has been successfully applied to identify 5 co-expression modules, 2 of which were highly enriched with differentially expressed genes [22].

4. Methods

4.1. Data

Data used in this work, as well as the previous data pre-processing steps, are described in detail in [22]. The complete dataset is divided into three subsets of ALS patients and their matching controls. However, in this work only the first two are used, which correspond to 30 patients and corresponding controls and are named dataset 1 (ALS1 and C1) and dataset 2 (ALS2 and C2). These datasets were designed to be similar regarding the proportion of female/male patients, the average age and the proportion of spinal/bulbar onset in patients.

4.2. Data Analysis

Similarity Measures The similarity measures may be divided into metric and semi-metric measures [27]. The standard metric distance is the Euclidean distance [27], which is a generalization of the Pythagorean theorem. Initially defined in 3-dimensional space, it may be extended to higher-dimensional spaces:

$$d = \lVert x - y \rVert = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

Other examples of metric distances are the squared version of the Euclidean distance and the Manhattan distance [7]. The most used semi-metric measure is the Pearson correlation coefficient (or centred Pearson correlation coefficient) [27], r, which is given by:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (2)$$

where $\bar{x}$ and $\bar{y}$ are, respectively, the mean values of the objects X and Y. This correlation ranges between −1 and 1, taking the value zero when the two vectors are uncorrelated, that is, when the centred vectors are orthogonal. The distance based on the Pearson correlation is given by $D_{ij} = (1 - r_{ij})/2$ [27].
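As an illustration of Equations (1) and (2) and of the Pearson-based distance, a minimal Python sketch is given below. The function names are hypothetical and chosen only for this example; the original analysis was carried out in R, so this is not the code used in the work.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles (Eq. 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Centred Pearson correlation coefficient r (Eq. 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

def pearson_distance(x, y):
    """Distance D = (1 - r) / 2, which lies between 0 and 1."""
    return (1.0 - pearson(x, y)) / 2.0
```

With this definition, two perfectly correlated profiles are at distance 0 and two perfectly anti-correlated profiles are at distance 1.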

Topological Overlap This measure was initially proposed to quantify the relatedness of the substrates forming a metabolic network [21]. However, since gene and protein domain networks also present an approximately scale-free topology, this framework was extended to these types of networks [28]. For a network with adjacency $a_{ij}$ between nodes i and j, the topological overlap $w_{ij}$ is:

$$w_{ij} = \frac{\sum_{u} a_{iu} a_{uj} + a_{ij}}{\min\left\{\sum_{u} a_{iu}, \sum_{u} a_{ju}\right\} + 1 - a_{ij}} \qquad (3)$$
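Equation (3) can be sketched in Python as follows. The function name is hypothetical, and two conventions are assumptions of this sketch rather than statements of the text: the sums over u skip the nodes i and j themselves, and the diagonal of the result is set to 1.

```python
def topological_overlap(adj):
    """Topological overlap matrix (Eq. 3) from a symmetric adjacency matrix.

    adj is a list of lists with entries in [0, 1].
    Assumption: sums over u exclude i and j, and w_ii is set to 1.
    """
    n = len(adj)
    # Connectivity of each node, ignoring the diagonal.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom
```

For an unweighted chain 0-1-2, nodes 0 and 2 are not directly connected but share neighbour 1, which gives them a topological overlap of 0.5.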


Hierarchical Clustering Hierarchical clustering depends on the computation of a similarity matrix between all objects being clustered, which is then used to guide the algorithm. The method may be agglomerative or divisive. The agglomerative procedure is the most commonly used and is therefore the one used in this work [23].

The hierarchical clustering algorithm is as follows [25]. Given an object matrix, a user-defined similarity measure and an inter-cluster distance:

1. Compute the similarity matrix by determining the distances between all pairs of objects;

2. Merge the two closest objects r and s;

3. Replace r with the new cluster and delete s;

4. Repeat steps 1 to 3 until the number of clusters equals 1.

The most commonly used inter-cluster distance measure is average linkage, which generally works well with standardized microarray data [27]. With average linkage, the distance between two clusters is defined as the average of the distances between all possible pairs of objects, one from each cluster. However, other types are also available [20].
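The agglomerative procedure with average linkage can be sketched as below. This is a naive illustration with a hypothetical function name, not the R implementation used in this work; it recomputes linkage distances at every step, which is only practical for small inputs.

```python
def average_linkage(dist, n_clusters=1):
    """Naive agglomerative clustering with average linkage.

    dist is a symmetric matrix of pairwise distances between objects.
    Returns the final clusters (lists of object indices) and the merge
    history as (cluster_r, cluster_s, linkage_distance) tuples.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def linkage(a, b):
        # Average distance over all pairs with one object in each cluster.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d = linkage(clusters[r], clusters[s])
                if best is None or d < best[0]:
                    best = (d, r, s)
        d, r, s = best
        merges.append((sorted(clusters[r]), sorted(clusters[s]), d))
        clusters[r] = clusters[r] + clusters[s]  # replace r with the merged cluster
        del clusters[s]                          # delete s
    return clusters, merges
```

The merge history holds the heights at which a dendrogram would join the clusters; stopping at `n_clusters` greater than 1 corresponds to cutting the tree at a fixed number of clusters.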

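K-Means Given $X = \{X_1, X_2, \ldots, X_n\}$ as the set of genes, the goal of K-Means is to divide these n objects into k clusters (k a positive integer). Each object is represented by $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$, its values across the m experimental conditions. The algorithm is guided by the minimization of a commonly used cost function [12, 25]:

$$E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{il}\, d(X_i, Q_l) \qquad (4)$$

where $y_{il}$ equals 1 if object $X_i$ is assigned to cluster l and 0 otherwise, $Q_l$ is the mean of cluster l, and d is the chosen dissimilarity measure.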

The K-Means algorithm is as follows [13]:

1. Random selection of the initial k means for k clusters;

2. Computation of the dissimilarity between an object and the mean of a cluster;

3. Mapping of objects to the clusters whose mean is nearest to the object;

4. Re-calculation of the mean of a cluster from the objects allocated to it;

5. Repeat steps 2 to 4 until convergence.
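The five steps above can be sketched in Python as follows. The function names, the fixed random seed and the use of the squared Euclidean distance as dissimilarity d are assumptions made for this illustration; convergence is taken to mean that the cluster assignment no longer changes.

```python
import random

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(objects, k, max_iter=100, seed=0):
    """Basic K-Means; returns (labels, means, cost)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects as the initial means.
    means = [list(o) for o in rng.sample(objects, k)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda l: squared_euclidean(o, means[l]))
                      for o in objects]
        if new_labels == labels:  # Step 5: assignments stable, so converged.
            break
        labels = new_labels
        # Step 4: recompute each mean from the objects allocated to it.
        for l in range(k):
            members = [o for o, lab in zip(objects, labels) if lab == l]
            if members:
                means[l] = [sum(col) / len(members) for col in zip(*members)]
    # Cost function of Eq. (4) with d taken as the squared Euclidean distance.
    cost = sum(squared_euclidean(o, means[lab]) for o, lab in zip(objects, labels))
    return labels, means, cost
```

Because the initial means are chosen at random, different seeds may lead to different partitions; running the algorithm several times and keeping the partition with the lowest cost is a common precaution.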

WGCNA The Weighted Gene Co-expression Network Analysis (WGCNA) [28] method builds a weighted network with an approximately scale-free topology, in which nodes represent genes that are connected if there is significant evidence of their co-expression.

First, a similarity matrix $S = [s_{ij}]$ is built, using the absolute Pearson correlation as the pairwise similarity between genes i and j. A soft threshold is then applied to this matrix using a power function whose parameter β is determined by the scale-free topology criterion proposed in [28]:

$$a_{ij} = \lvert \mathrm{cor}(x_i, x_j) \rvert^{\beta} \qquad (5)$$
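Equation (5) can be illustrated with the short Python sketch below. The function names are hypothetical; the default β = 6 is the soft thresholding power reported later in this work, and the selection of β through the scale-free topology criterion is not shown.

```python
import math

def pearson(x, y):
    """Centred Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

def soft_threshold_adjacency(profiles, beta=6):
    """Weighted adjacency a_ij = |cor(x_i, x_j)|^beta (Eq. 5).

    profiles is a list of gene expression profiles (genes x samples).
    """
    n = len(profiles)
    return [[abs(pearson(profiles[i], profiles[j])) ** beta for j in range(n)]
            for i in range(n)]
```

Raising the absolute correlation to a power keeps strong correlations close to 1 while pushing weak ones towards 0: a correlation of 0.8 becomes an adjacency of about 0.26 with β = 6.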

Average linkage hierarchical clustering is then applied to cluster genes based on the topological overlap dissimilarity measure, $1 - w_{ij}$, where $w_{ij}$ is the topological overlap similarity.

The modules are obtained by cutting the dendrogram resulting from the hierarchical clustering algorithm and correspond to groups of nodes with high topological overlap [21]. A dynamic iterative cutting algorithm, the dynamic hybrid cut [16, 14], is used to cut the dendrogram. Two parameters of this function were varied, deepSplit and minimumclustersize. The first controls the relative sensitivity of cluster splitting, whilst the second sets a minimum cluster size.

Module Eigengene The singular value decomposition ($X = UDV^T$) is used to determine the first principal component of each cluster, the so-called module eigengene [29].
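A minimal sketch of a module eigengene computation is shown below. Instead of a full singular value decomposition it uses power iteration to approximate the first right singular vector, which is a substitution made here to keep the example dependency-free. The function names, the standardization of each gene and the sign convention (aligning the eigengene with the average standardized expression) are assumptions of this sketch.

```python
import math

def standardize(row):
    """Centre a profile and scale it to unit sample standard deviation."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def module_eigengene(expr, n_iter=200):
    """Approximate first right singular vector of a module (genes x samples)."""
    X = [standardize(g) for g in expr]
    m = len(X[0])
    # Start from the first gene profile: a constant vector would be
    # orthogonal to every centred row and the iteration would collapse.
    norm = math.sqrt(sum(c * c for c in X[0]))
    v = [c / norm for c in X[0]]
    for _ in range(n_iter):
        u = [sum(a * b for a, b in zip(row, v)) for row in X]          # u = X v
        w = [sum(u[g] * X[g][s] for g in range(len(X))) for s in range(m)]  # w = X^T u
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Orient the eigengene along the average standardized expression.
    avg = [sum(col) / len(X) for col in zip(*X)]
    if sum(a * b for a, b in zip(avg, v)) < 0:
        v = [-c for c in v]
    return v
```

The returned vector has one value per sample and unit length; for a cohesive module it correlates strongly with the profile of each member gene.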

4.3. Network Concepts

The key concept of networks is the node connectivity, or degree, as it measures the relative importance of the node in the network [29] and is the most widely used measure to distinguish network nodes [20]. It has been found relevant in biological applications, for example to identify significant genes in cancer and primate brain development [10]. The higher this value, the higher the importance of the gene in the network. Genes that are highly connected are named 'hub' genes and are thought to play an important role in the structure of these biological networks [10].

$$k_i = \sum_{j} a_{ij} = \sum_{j} a_{ji} \qquad (6)$$

An intramodular connectivity measure with biological significance may be defined as the Fuzzy Module Membership Measure or Eigengene-based Connectivity [22, 19, 17], which measures the correlation between the i-th gene and the q-th module eigengene: $MM_q(i) = \mathrm{cor}(x_i, ME_q)$.


Line density, or mean (off-diagonal) adjacency, is closely related to the mean connectivity and is defined as [10]:

$$\mathrm{Density} = \frac{\sum_{i} \sum_{j \neq i} a_{ij}}{n(n-1)} = \frac{\mathrm{mean}(k)}{n-1} \qquad (7)$$

Centralization has been used to describe structural differences of metabolic networks [10]:

$$\mathrm{Centralization} = \frac{n}{n-2}\left(\frac{\max(k)}{n-1} - \mathrm{Density}\right) \qquad (8)$$

The heterogeneity measures the variation of the connectivity in the network [10] and may be defined as:

$$\mathrm{Heterogeneity} = \frac{\sqrt{\mathrm{variance}(k)}}{\mathrm{mean}(k)} \qquad (9)$$
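The connectivity, density, centralization and heterogeneity of Equations (6) to (9) can be computed together, as in the sketch below. The function name is hypothetical, the network is assumed to have more than two nodes, and the use of the population variance (division by n) is an assumption, since the text does not state which variance estimator is meant.

```python
import math

def network_concepts(adj):
    """Connectivity, density, centralization and heterogeneity (Eqs. 6-9).

    adj is a symmetric adjacency matrix; diagonal entries are ignored.
    Assumption: variance(k) is the population variance.
    """
    n = len(adj)
    k = [sum(adj[i][j] for j in range(n) if j != i) for i in range(n)]
    mean_k = sum(k) / n
    density = mean_k / (n - 1)
    centralization = n / (n - 2) * (max(k) / (n - 1) - density)
    variance = sum((v - mean_k) ** 2 for v in k) / n
    heterogeneity = math.sqrt(variance) / mean_k
    return {"connectivity": k, "density": density,
            "centralization": centralization, "heterogeneity": heterogeneity}
```

A star network is the extreme case of centralization (value 1), while a fully connected network with equal weights has centralization and heterogeneity equal to 0.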

Clustering coefficients are useful to study how cohesive the modules resulting from clustering are, that is, the strength of the connections between the neighbours of a node [20]. The coefficient varies from zero to one ($0 \leq C_i \leq 1$), taking the value one when node i is at the centre of a fully interlinked cluster and zero when its neighbours are not connected at all. The average clustering coefficient may be used to assess whether a network presents a modular organization [28].

$$\mathrm{ClusterCoef}_i = \frac{\sum_{l \neq i} \sum_{m \neq i,l} a_{il}\, a_{lm}\, a_{mi}}{\left(\sum_{l \neq i} a_{il}\right)^2 - \sum_{l \neq i} (a_{il})^2} \qquad (10)$$
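A direct transcription of Equation (10) into Python is sketched below. The function name is hypothetical. Two choices are assumptions of this sketch: terms with m = l are skipped (they vanish anyway when the adjacency matrix has a zero diagonal), and the coefficient is reported as 0 when the denominator is 0, which happens for nodes with fewer than two neighbours.

```python
def clustering_coefficient(adj, i):
    """Weighted clustering coefficient of node i (Eq. 10).

    adj is a symmetric adjacency matrix with entries in [0, 1].
    Assumption: returns 0.0 when the denominator is zero.
    """
    others = [l for l in range(len(adj)) if l != i]
    numerator = sum(adj[i][l] * adj[l][m] * adj[m][i]
                    for l in others for m in others if m != l)
    total = sum(adj[i][l] for l in others)
    total_sq = sum(adj[i][l] ** 2 for l in others)
    denominator = total ** 2 - total_sq
    return numerator / denominator if denominator > 0 else 0.0
```

Averaging this value over all nodes of a module gives a mean clustering coefficient of the kind used to compare module cohesiveness.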

4.4. Sample Network

Oldham et al. [20] proposed a method to explore the network relationships between samples. It builds a signed network of samples and computes for each of them the most frequently used network concepts, connectivity and clustering coefficient, in their standardized forms (Z.K and Z.C, respectively) [20]. A scatter plot of Z.K against Z.C, named the standardized C(k) curve, makes it possible to compare the behaviour of different networks.

4.5. Module Preservation

Cross-tabulation The standard and intuitive procedure to study module preservation is to perform a cross-tabulation of module membership between the two networks. The most commonly used table reports the number of genes shared between two modules and uses Fisher's exact test to obtain p-values as a measure of significance, making overlaps easier to identify [15].
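For a single pair of modules, the overlap test can be sketched as below. The function name and its arguments are hypothetical. The sketch computes the one-sided Fisher exact p-value, which is the upper tail of the hypergeometric distribution; whether the original analysis used a one-sided or a two-sided test is not stated, so the one-sided form is an assumption.

```python
from math import comb

def overlap_p_value(n_total, size_a, size_b, n_shared):
    """One-sided Fisher exact test for the overlap of two gene sets.

    n_total:  number of genes in the common background
    size_a:   size of the module in the first network
    size_b:   size of the module in the second network
    n_shared: number of genes the two modules have in common
    Returns P(overlap >= n_shared) under random assignment.
    """
    upper = min(size_a, size_b)
    total = comb(n_total, size_b)
    tail = sum(comb(size_a, x) * comb(n_total - size_a, size_b - x)
               for x in range(n_shared, upper + 1))
    return tail / total
```

Applying the function to every pair of modules fills a table of p-values that mirrors the cross-tabulation of gene counts.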

Network Based Statistics Langfelder et al. [15] proposed network based statistics to study module preservation between two networks. The summary statistic Zsummary, which combines several preservation statistics [15], gives strong evidence of preservation if Zsummary > 10, weak to moderate evidence if 2 < Zsummary < 10 and no evidence if Zsummary < 2.
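The interpretation of Zsummary can be written as a small helper, sketched below with a hypothetical function name. The text leaves the boundary values 2 and 10 unassigned; placing both in the "weak to moderate" class is an assumption of this sketch.

```python
def preservation_evidence(z_summary):
    """Classify a Zsummary value using the thresholds 2 and 10.

    Assumption: the boundary values 2 and 10 count as weak to moderate.
    """
    if z_summary > 10:
        return "strong"
    if z_summary >= 2:
        return "weak to moderate"
    return "none"
```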

5. Results and Discussion

Data preprocessing

The data was imported into R and no missing values were identified using the R function goodSamplesGenes() (part of the WGCNA package). The probes were converted to gene symbols with an online tool named Array Information Library Universal Navigator (AILUN) [1]. Probes representing the same gene were averaged, resulting in a total of 6777 unique gene symbols from the original 8,000 probes.
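The averaging of probes that map to the same gene can be sketched as follows. The function name, the dictionary-based data layout and the decision to drop probes without a gene symbol are assumptions made for this illustration; the original step was performed in R.

```python
from collections import defaultdict

def collapse_probes(expr_by_probe, probe_to_gene):
    """Average the expression profiles of probes mapping to the same gene.

    expr_by_probe: dict mapping probe id -> list of expression values
    probe_to_gene: dict mapping probe id -> gene symbol
    Assumption: probes without a gene symbol are dropped.
    """
    groups = defaultdict(list)
    for probe, values in expr_by_probe.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            groups[gene].append(values)
    # Column-wise mean over the probes of each gene.
    return {gene: [sum(col) / len(rows) for col in zip(*rows)]
            for gene, rows in groups.items()}
```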

Data was pre-processed according to the standard procedures before using clustering techniques [11, 24]. As we were interested in discriminating between ALS samples and controls, the normalization chosen was the one that, using the Euclidean distance, best exposes the relevant differences. Therefore, normalization was performed first by sample and then by gene, and the results were visualized using average linkage hierarchical clustering with Euclidean distance and the k-means algorithm (Figure 1).

However, the second method applied to the expression data, WGCNA, makes use of the Pearson correlation. Therefore a recent tool, the Sample Network, was applied; it takes into consideration the intra-array distances given by the Pearson correlation, as well as some network properties, including connectivity and clustering coefficient. Dataset 1 proved to present more significant network properties, as it separates samples of different phenotypes (Figure 2).

5.1. Clustering Genes

The purpose of this clustering analysis was to identify clusters of genes whose expression in ALS samples is significantly different from that in controls (for instance, up-regulation in ALS and down-regulation in controls), and thus a combination of datasets was used.

Hierarchical Clustering In dataset 1, two clusters were considered to contain genes differentially expressed between ALS and controls (Figures 3 and 4), as they clearly present a different color (representing the expression value) in the two classes. In the second dataset (ALS2 with C2), no relevant clusters were identified, and some clusters suggested the presence of sample outliers.


[Figure 1 dendrogram. Title: "Sample clustering to detect outliers"; vertical axis: Height (0 to 150); leaves: ALS and CON sample labels.]

Figure 1: Average linkage hierarchical clustering based on the Euclidean distance between expression values of the four datasets after sample normalization followed by gene normalization. The ALS and control samples are named, respectively, ALS and CON followed by their identification number.

Figure 2: Sample Network study performed on the four datasets: ALS dataset 1 (black) and control (red). (1 - top) Dendrogram resulting from average linkage hierarchical clustering using 1 - ISA (intersample adjacency) for a subset of samples. (2 - bottom) Comparison of standardized sample connectivities (Z.K) against standardized sample clustering coefficients (Z.C).

Figure 3: The two clusters considered to contain genes with significantly different expression between ALS1 and C1, identified with average linkage hierarchical clustering based on the Euclidean distance. The first (left) and second (right) clusters are color labelled blue and turquoise, respectively. The expression matrix colors encode up- and down-regulated genes, corresponding respectively to red and green. Dataset order from left to right: ALS1, C1 (each one corresponds to 1/2 of the image's width).

K-Means K-Means was applied to the three combinations of datasets previously described (ALL, ALS1 with C1, ALS2 with C2), using the Euclidean distance and 4 values of k: 10, 20, 50 and 100. Only k = 20 and k = 50 in dataset 1 (ALS1 with C1) produced clusters of interest for the current problem, that is, clusters in which the expression in ALS samples is significantly different from that in controls (examples in Figures 5 and 6).

Figure 4: Centroid view (left) and mean expression pattern (right) of the two clusters considered to contain genes significantly differently expressed between ALS1 and C1, identified with average linkage hierarchical clustering based on the Euclidean distance. The blue cluster is represented in the top images and the turquoise cluster in the bottom images. Dataset order from left to right: ALS1, C1 (each one corresponds to 1/2 of the image's width).

Figure 5: Expression profile of two interesting clusters using k = 20 with Euclidean distance when clustering genes from the samples of dataset 1. These clusters were color named turquoise and blue, respectively. The expression matrix colors encode up- and down-regulated genes, corresponding respectively to red and green. Dataset order from left to right: ALS1, C1 (each one corresponds to 1/2 of the image's width).

Figure 6: Expression profile of two interesting clusters using k = 50 with Euclidean distance when clustering genes from the samples of dataset 1. The expression matrix colors encode up- and down-regulated genes, corresponding respectively to red and green. These clusters were color named turquoise and blue, respectively. Dataset order from left to right: ALS1, C1 (each one corresponds to 1/2 of the image's width).

WGCNA The WGCNA method was applied to each dataset using a soft thresholding power of 6. To decide the best set of modules in each network, the clustering coefficient was used: the set chosen was the one that generally produced modules with higher clustering coefficients (Table 1). Accordingly, the two parameters of the hybrid branch cutting method, deepSplit and minimumclustersize, chosen for each dataset were: for ALS1, deepSplit = 1 and minimumclustersize = 4 (Figure 7); for C1, deepSplit = 0 and minimumclustersize = 4 (Figure 8); for ALS2, deepSplit = 2 and minimumclustersize = 8; for C2, deepSplit = 3 and minimumclustersize = 32.

[Figure 7 dendrogram. Vertical axis: Height (0.5 to 1.0); clustering: hclust (*, "average").]

Figure 7: Average linkage hierarchical clustering dendrogram applied to dataset ALS1. The module assignment given by applying dynamic hybrid branch cutting with deepSplit = 1 and minimumclustersize = 4 is depicted by the row color immediately below the dendrogram, with grey representing unassigned genes.

[Figure 8 dendrogram. Vertical axis: Height (0.5 to 1.0); clustering: hclust (*, "average").]

Figure 8: Average linkage hierarchical clustering dendrogram applied to dataset C1. The module assignment given by applying dynamic hybrid branch cutting with deepSplit = 0 and minimumclustersize = 4 is depicted by the row color immediately below the dendrogram, with grey representing unassigned genes.

6. Post-Processing Results

6.1. Module Preservation in WGCNA networks

First, module preservation between networks of the same phenotype was studied, to verify whether the same modules were significantly present in networks of the same phenotype. Two approaches were used to determine this preservation: cross-tabulation and network based statistics.

Table 1: Network properties of the four original datasets, compared with the average (and standard deviation) over the modules in the selected set of modules (the grey module is not taken into consideration). The selected measures include size, density, centralization, heterogeneity, mean cluster coefficient and mean connectivity within cluster.

| Network | Mean Size | Mean Density | Mean Centralization | Mean Heterogeneity | Mean Cluster Coefficient | Mean Connectivity |
|---|---|---|---|---|---|---|
| original ALS 1 | 6777 | 0.378 | 0.182 | 0.293 | 0.448 | 2558 |
| selected ALS1 cut | 250.4 ± 425.8 | 0.246 ± 0.12 | 0.157 ± 0.03 | 0.378 ± 0.13 | 0.31 ± 0.13 | 52.0 ± 92.87 |
| original ALS 2 | 6777 | 0.446 | 0.183 | 0.330 | 0.544 | 3023 |
| selected ALS2 cut | 1660.4 ± 737.5 | 0.21 ± 0.11 | 0.16 ± 0.03 | 0.479 ± 0.14 | 0.303 ± 0.10 | 124.38 ± 290.7 |
| original Control 1 | 6777 | 0.276 | 0.135 | 0.219 | 0.316 | 1871 |
| selected Control 1 cut | 363.52 ± 353.3 | 0.147 ± 0.11 | 0.139 ± 0.04 | 0.520 ± 0.16 | 0.229 ± 0.13 | 85.46 ± 228.79 |
| original Control 2 | 6777 | 0.431 | 0.174 | 0.312 | 0.521 | 2921 |
| selected Control 2 cut | 957.99 ± 366.4 | 0.227 ± 0.13 | 0.158 ± 0.03 | 0.436 ± 0.16 | 0.312 ± 0.12 | 31.17 ± 33.44 |

Using cross-tabulation to study the overlap between modules proved difficult to interpret in this context. Therefore, the network based statistics were applied for a pairwise comparison of modules of the same phenotype, that is, ALS1 with ALS2 (Figure 9) and C1 with C2 (Figure 10). Using the thresholds recommended by the authors [15] gave enough confidence in module preservation between networks, although preservation is more evident when comparing controls (Figure 10). Therefore, modules from dataset 1 were used for further analysis, as there is high confidence in their presence in dataset 2.

Analysing WGCNA modules For each module in the WGCNA networks, the corresponding module eigengene was computed; the correlation between module eigengenes is presented in Figure 11. The eigengene was used to compute the Eigengene-based Connectivity, MMq(i), and to rank the genes of each module according to their absolute connectivity value. A threshold of 0.85 was then applied to select the most important genes in each module, and their expression profiles in ALS and control samples were studied. Six modules were identified as containing genes with a different expression profile between ALS and control samples (Figure 12).
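The ranking and selection step can be sketched in Python as below. The function names and the dictionary layout are hypothetical, and treating the 0.85 threshold as inclusive is an assumption; the eigengene is taken as given.

```python
import math

def pearson(x, y):
    """Centred Pearson correlation between two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

def select_hub_genes(module_expr, eigengene, threshold=0.85):
    """Rank genes by |MM_q(i)| and keep those at or above the threshold.

    module_expr: dict mapping gene symbol -> expression profile
    eigengene:   module eigengene, one value per sample
    Returns (gene, module membership) pairs, highest |MM| first.
    """
    membership = {gene: pearson(profile, eigengene)
                  for gene, profile in module_expr.items()}
    ranked = sorted(membership.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(gene, mm) for gene, mm in ranked if abs(mm) >= threshold]
```

Ranking by the absolute value keeps genes that are strongly anti-correlated with the eigengene, which matches the use of the absolute connectivity value described above.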

Figure 9: Network based statistics summary according to Langfelder et al. [15], using dataset ALS1 as the reference network and ALS2 as the test set. The measure Zsummary summarizes the general behaviour of these network statistics. The green line marks the empirically determined significance threshold (= 10) obtained in [15], above which a module is considered preserved. The blue line is the lower limit of significance (= 2); below this line there is not enough evidence that the module is preserved in both networks.

Figure 10: Network based statistics summary according to Langfelder et al. [15], using dataset C1 as the reference network and C2 as the test set. The measure Zsummary summarizes the general behaviour of these network statistics. The green line marks the empirically determined significance threshold (= 10) obtained in [15], above which a module is considered preserved. The blue line is the lower limit of significance (= 2); below this line there is not enough evidence that the module is preserved in both networks.

[Figure 11 panels. Module eigengene labels, ALS1 (left): MEblue, MEbrown, MEgreenyellow, MEyellow, MEmagenta, MEblack, MEpink, MEpurple, MEsalmon, MEtan, MEmidnightblue, MEgreen, MEturquoise, MEcyan, MEred. Module eigengene labels, C1 (right): MEred, MEgreen, MEyellow, MEturquoise, MEpink, MEblack, MEbrown, MEgreenyellow, MEpurple, MEblue, MEmagenta, MEtan. Heat map color scale: 0 to 1.]

Figure 11: Study of the module eigengene relationships in the WGCNA modules of ALS1 (left) and C1 (right): (top) hierarchical clustering with average linkage, using 1 - (Pearson correlation) as distance, and a heat map showing the Pearson correlation between the module eigengenes of the same network.

Figure 12: Centroid view of example clusters obtained for ALS1 with WGCNA: brown (left) and pink (right). Dataset order from left to right: ALS1 and C1.

6.2. Comparing Partitions

A comparison of the partitions produced by the two clustering methods was performed (Figure 13). One k-means cluster (k = 50) contains unassigned genes, but both HCL clusters correspond to k-means clusters. The higher number of clusters obtained with the k-means algorithm means that the interesting HCL clusters are partitioned into more than one k-means cluster, a consequence of forcing the algorithm to divide the genes into 50 clusters.

Cross-tabulation was also applied to compare the clusters obtained with hierarchical clustering and the modules obtained with WGCNA. In the ALS1 network (Figure 14), several modules match the hierarchical clusters, although most of them are split between the two clusters. These modules are: blue, brown, greenyellow, grey, pink, red and yellow (WGCNA label names).

Fewer modules of the WGCNA control network match the hierarchical clusters (Figure 15). In fact, from a set of 12 modules, only 3 overlap significantly with the HCL clusters: green, grey and yellow (WGCNA label names). Moreover, no module from this network significantly intersects the turquoise HCL cluster, whereas in the ALS case this cluster significantly intersects 4 modules.

Interestingly, the ALS1 WGCNA network modules overlap better with clusters containing genes that show significant expression differences between ALS1 and C1 samples; this dataset will therefore be explored further.
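The cross-tabulations of Figures 13 to 15 pair each overlap count with a Fisher exact test p value. The following Python sketch shows one way such a table could be computed, assuming the one-sided (over-representation) form of the test, which is equivalent to the upper tail of a hypergeometric distribution. The function names and the toy partitions are hypothetical and are not taken from this study.

```python
from collections import Counter
from math import comb


def overlap_pvalue(n_overlap, size_a, size_b, n_total):
    """P(overlap >= n_overlap) under the hypergeometric null (one-sided Fisher test)."""
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(n_total - size_a, size_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, size_b)


def cross_tabulate(partition_a, partition_b):
    """Map (label_a, label_b) -> (shared gene count, p value) for two partitions."""
    genes = sorted(set(partition_a) & set(partition_b))
    size_a = Counter(partition_a[g] for g in genes)
    size_b = Counter(partition_b[g] for g in genes)
    shared = Counter((partition_a[g], partition_b[g]) for g in genes)
    return {
        (a, b): (
            shared[(a, b)],
            overlap_pvalue(shared[(a, b)], size_a[a], size_b[b], len(genes)),
        )
        for a in size_a
        for b in size_b
    }


# Toy partitions of 12 hypothetical genes (labels chosen only for illustration).
module_of = {f"g{i}": ("blue" if i < 6 else "brown") for i in range(12)}
cluster_of = {f"g{i}": ("turquoise" if i < 5 else "grey") for i in range(12)}
table = cross_tabulate(module_of, cluster_of)
for (module, cluster), (count, p) in sorted(table.items()):
    print(f"{module:>6} vs {cluster:<9}: {count:2d} genes, p = {p:.3g}")
```

With a one-sided test, pairs that share few or no genes receive p values close to 1, the same pattern seen in the sparsely populated cells of the figures.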

6.3. Functional annotation of clusters
Functional annotation of clusters was then performed by comparing each of them against the entire human genome. The KEGG terms identified with each method are presented in Table 2. Several GO terms were also identified in each of these groups of genes, but further analysis is required.

7. Conclusions and Future work
Hierarchical clustering of samples using Euclidean distance usually works well as an outlier-screening method before applying clustering techniques that use this measure as similarity. On the other hand, the sample network method is more appropriate before WGCNA because, in addition to using Pearson correlation, it also takes different network properties into consideration.

When clustering genes, both average-linkage hierarchical clustering and k-means (using k = 20, 50) were able to identify clusters with significant differences in gene expression profiles between ALS and control samples, most of which presented interesting functional annotation enrichment.

On the other hand, the WGCNA method makes it possible to use network concepts that are well studied and widely applied to complex networks, both in several steps of the procedure and in understanding the structure of the results. A comparison of the modules identified with this method and the clusters obtained with the standard clustering techniques was then performed. The only set of genes that significantly intersects all other sets is the set of WGCNA modules obtained for the ALS samples in dataset 1.

C. k-means 50 modules (rows) vs. Hierarchical Clusters (columns)
Each cell: number of shared genes (Fisher exact test p value). Color bar, -log(p): 0 to 50.

k-means cluster    HCL blue: 703      HCL grey: 5730     HCL turquoise: 344
blue: 288          250 (2.6e-220)     38 (1)             0 (1)
brown: 203         16 (0.91)          187 (0.00084)      0 (1)
green: 92          0 (1)              91 (3.2e-06)       1 (0.99)
grey: 5841         228 (1)            5332 (3.3e-241)    281 (0.99)
red: 84            0 (1)              22 (1)             62 (7.7e-64)
turquoise: 130     119 (5.9e-107)     11 (1)             0 (1)
yellow: 139        90 (1.3e-55)       49 (1)             0 (1)

Figure 13: Cross-tabulation of the clusters obtained with the k-means algorithm (k = 20) (rows) and the clusters obtained by hierarchical clustering (columns), where the colors follow the defined color code. In the table, the numbers give the count of genes shared by the corresponding row and column modules. The color of each cell encodes the Fisher exact test p value, −log(p), according to the color bar on the right side of the table.

C. WGCNA ALS1 modules (rows) vs. HCL modules (columns)
Each cell: number of shared genes (Fisher exact test p value). Color bar, -log(p): 0 to 50.

WGCNA module       HCL blue: 703      HCL grey: 5730     HCL turquoise: 344
black: 165         12 (0.93)          153 (0.0011)       0 (1)
blue: 285          121 (1.4e-47)      87 (1)             77 (5.9e-37)
brown: 281         58 (1.5e-07)       223 (0.99)         0 (1)
cyan: 23           0 (1)              23 (0.021)         0 (1)
green: 203         0 (1)              203 (9.1e-16)      0 (1)
greenyellow: 66    6 (0.69)           35 (1)             25 (2.9e-16)
grey: 3021         217 (1)            2605 (0.00033)     199 (2.7e-07)
magenta: 134       0 (1)              131 (3.9e-07)      3 (0.97)
midnightblue: 16   0 (1)              16 (0.068)         0 (1)
pink: 150          107 (3.3e-73)      43 (1)             0 (1)
purple: 105        2 (1)              102 (2.7e-05)      1 (1)
red: 200           1 (1)              176 (0.099)        23 (0.00018)
salmon: 34         0 (1)              34 (0.0033)        0 (1)
tan: 65            1 (1)              64 (0.00023)       0 (1)
turquoise: 1753    1 (1)              1744 (2.2e-132)    8 (1)
yellow: 276        177 (4.3e-111)     91 (1)             8 (0.97)

Figure 14: Cross-tabulation of the ALS1 WGCNA modules (rows) and the clusters obtained by hierarchical clustering (columns), where the colors follow the defined color code. In the table, the numbers give the count of genes shared by the corresponding row and column modules. The color of each cell encodes the Fisher exact test p value, −log(p), according to the color bar on the right side of the table.

C. WGCNA C1 modules (rows) vs. HCL modules (columns)
Each cell: number of shared genes (Fisher exact test p value). Color bar, -log(p): 0 to 50.

WGCNA module       HCL blue: 703      HCL grey: 5730     HCL turquoise: 344
black: 183         1 (1)              182 (1e-12)        0 (1)
blue: 887          102 (0.13)         769 (0.031)        16 (1)
brown: 651         15 (1)             636 (4.1e-31)      0 (1)
green: 339         92 (2.8e-19)       243 (1)            4 (1)
greenyellow: 64    0 (1)              62 (0.0017)        2 (0.84)
grey: 2537         235 (0.99)         2056 (1)           246 (1.4e-39)
magenta: 73        2 (1)              71 (0.00046)       0 (1)
pink: 155          16 (0.55)          134 (0.3)          5 (0.9)
purple: 65         1 (1)              64 (0.00023)       0 (1)
red: 274           31 (0.33)          237 (0.21)         6 (1)
tan: 10            0 (1)              10 (0.19)          0 (1)
turquoise: 1154    58 (1)             1040 (1.2e-09)     56 (0.67)
yellow: 385        150 (1.2e-53)      226 (1)            9 (1)

Figure 15: Cross-tabulation of the C1 WGCNA modules (rows) and the clusters obtained by hierarchical clustering (columns), where the colors follow the defined color code. In the table, the numbers give the count of genes shared by the corresponding row and column modules. The color of each cell encodes the Fisher exact test p value, −log(p), according to the color bar on the right side of the table.

Type of cluster         Term
K-means, HCL, WGCNA     RNA degradation (hsa03018)
HCL                     RNA polymerase (hsa03020)
K-means, WGCNA          Spliceosome (hsa03040)
K-means, WGCNA          Cardiac muscle contraction (hsa04260)
K-means, WGCNA          Ribosome (hsa03010)
K-means, WGCNA          Oxidative phosphorylation (hsa00190)
K-means, WGCNA          Alzheimer’s disease (hsa05010)
K-means, WGCNA          Parkinson’s disease (hsa05012)
K-means, WGCNA, HCL     Huntington’s disease (hsa05016)
K-means                 Protein export (hsa03060)
K-means                 Proteasome (hsa03050)
K-means                 Neurotrophin signaling pathway (hsa04722)
K-means                 Graft-versus-host disease (hsa05332)
K-means, HCL            Natural killer cell mediated cytotoxicity (hsa04650)
K-means                 Acute myeloid leukemia (hsa05221)
K-means                 Toll-like receptor signaling pathway (hsa04620)
K-means                 B cell receptor signaling pathway (hsa04662)
HCL                     T cell receptor signaling pathway (hsa04660)
K-means                 Insulin signaling pathway (hsa04910)
HCL                     Phosphatidylinositol signaling system (hsa04070)
HCL                     Glycolysis / Gluconeogenesis (hsa00010)
HCL                     O-Mannosyl glycan biosynthesis (hsa00514)
HCL                     Long-term potentiation (hsa04720)
HCL                     Ubiquitin mediated proteolysis (hsa04120)

Table 2: KEGG terms significantly associated with the clusters obtained with the WGCNA method, hierarchical clustering (HCL) and the K-means algorithm.

As future work, further study of the functionally annotated terms in the resulting clusters and modules may yield interesting contributions to the understanding of this disorder. Further comparison of these methods, as well as improved confidence in the results, would be possible if a larger number of samples were considered.

References
[1] Ailun: Platform Annotation.

[2] The R Project for Statistical Computing.

[3] A. Al-Chalabi, A. Jones, C. Troakes, A. King, S. Al-Sarraj, and L. H. van den Berg. The genetics and neuropathology of amyotrophic lateral sclerosis. Acta neuropathologica, 124(3):339–52, Sept. 2012.

[4] P. C. Boutros and A. B. Okey. Unsupervised pattern recognition: An introduction to the whys and wherefores of clustering microarray data. 6(4):331–344, 2005.

[5] J. Brettschneider, J. B. Toledo, V. M. Van Deerlin, L. Elman, L. McCluskey, V. M.-Y. Lee, and J. Q. Trojanowski. Microglial activation correlates with disease progression and upper motor neuron clinical symptoms in amyotrophic lateral sclerosis. PloS one, 7(6):e39216, Jan. 2012.

[6] M. R. J. Carlson, B. Zhang, Z. Fang, P. S. Mischel, S. Horvath, and S. F. Nelson. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC genomics, 7:40, Jan. 2006.

[7] H. Chipman, T. J. Hastie, and R. Tibshirani. Clustering Microarray Data. pages 161–204, 2001.

[8] M. R. Dalman, A. Deeter, G. Nimishakavi, and Z.-H. Duan. Fold change and p-value cutoffs significantly alter microarray interpretations. Jan. 2012.

[9] A. de la Fuente. From ’differential expression’ to ’differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends in genetics : TIG, 26(7):326–33, July 2010.

[10] J. Dong and S. Horvath. Understanding network concepts in modules. BMC systems biology, 1(1):24, Jan. 2007.

[11] E. Freyhult, M. Landfors, J. Onskog, T. R. Hvidsten, and P. Ryden. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC bioinformatics, 11(1):503, Jan. 2010.

[12] Z. Huang. Clustering large data sets with mixed numerical and categorical values. Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining. World Scientific.

[13] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.

[14] P. Langfelder and S. Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics, 9:559, Jan. 2008.

[15] P. Langfelder, R. Luo, M. C. Oldham, and S. Horvath. Is my network module preserved and reproducible? PLoS computational biology, 7(1):e1001057, Jan. 2011.

[16] P. Langfelder, B. Zhang, and S. Horvath. Dynamic Tree Cut: in-depth description, tests and applications. pages 1–11, 2009.

[17] M. J. Mason, G. Fan, K. Plath, Q. Zhou, and S. Horvath. Signed weighted gene co-expression network analysis of transcriptional regulation in murine embryonic stem cells. BMC genomics, 10:327, Jan. 2009.

[18] J. A. Miller, S. Horvath, and D. H. Geschwind. Divergence of human and mouse brain transcriptome highlights Alzheimer disease pathways. Proceedings of the National Academy of Sciences, 107(28):12698–12703, June 2010.

[19] M. C. Oldham, G. Konopka, K. Iwamoto, P. Langfelder, T. Kato, S. Horvath, and D. H. Geschwind. Functional organization of the transcriptome in human brain. Nature neuroscience, 11(11):1271–82, Nov. 2008.

[20] M. C. Oldham, P. Langfelder, and S. Horvath. Network methods for describing sample relationships in genomic datasets: application to Huntington’s disease. BMC systems biology, 6(1):63, Jan. 2012.

[21] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi. Hierarchical organization of modularity in metabolic networks. Science (New York, N.Y.), 297(5586):1551–5, Aug. 2002.

[22] C. G. J. Saris, S. Horvath, P. W. J. van Vught, M. A. van Es, H. M. Blauw, T. F. Fuller, P. Langfelder, J. DeYoung, J. H. J. Wokke, J. H. Veldink, L. H. van den Berg, and R. A. Ophoff. Weighted gene co-expression network analysis of the peripheral blood from Amyotrophic Lateral Sclerosis patients. BMC genomics, 10:405, Jan. 2009.

[23] S. Selvaraj and J. Natarajan. Microarray data analysis and mining tools. Bioinformation, 6(3):95–9, Jan. 2011.

[24] W. Shannon, R. Culverhouse, and J. Duncan. Analyzing microarray data using cluster analysis. Pharmacogenomics, 4(1):41–52, Jan. 2003.

[25] A. Sturn. Cluster Analysis for Large Scale Gene Expression Studies. PhD thesis, 2000.

[26] G. C. Tseng, D. Ghosh, and E. Feingold. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic acids research, 40(9):3785–99, May 2012.

[27] P. Valarmathie. Survey on Clustering Algorithms for Microarray Gene Expression Data. 69(1):5–20, 2012.

[28] B. Zhang and S. Horvath. A General Framework for Weighted Gene Co-Expression Network Analysis. 4(1), 2005.

[29] W. Zhao, P. Langfelder, T. Fuller, J. Dong, A. Li, and S. Horvath. Weighted gene coexpression network analysis: state of the art. Journal of Biopharmaceutical Statistics, 20(2):281–300, 2010.

[30] L. Zinman and M. Cudkowicz. Emerging targets and treatments in amyotrophic lateral sclerosis. The Lancet Neurology, 10(5):481–490, 2011.
