
— Exploratory Data Analysis —

Clustering gene expression data

Jörg Rahnenführer and Florian Markowetz

Practical DNA Microarray Analysis, Heidelberg, 2006 November 27–30
http://compdiag.molgen.mpg.de/ngfn/pma2006nov.shtml

Abstract. This is the tutorial for the clustering exercises on day 2 of the course on Practical DNA Microarray Analysis. Topics will be hierarchical clustering, k-means clustering, partitioning around medoids, selecting the number of clusters, reliability of results, identifying relevant subclusters, and pitfalls of clustering.

We consider expression data from patients with acute lymphoblastic leukemia (ALL) that were investigated using HGU95AV2 Affymetrix GeneChip arrays [Chiaretti et al., 2004]. The data were normalized using quantile normalization and expression estimates were computed using RMA. The preprocessed version of the data is available in the Bioconductor data package ALL from http://www.bioconductor.org. Type in the following code chunks to reproduce our analysis. At each ">" a new command starts; the "+" indicates that one command spans more than one line.

> library(ALL)

> data(ALL)

> help(ALL)

> cl <- substr(as.character(ALL$BT), 1, 1)

> dat <- exprs(ALL)

The most pronounced distinction in the data is between B- and T-cells. You find these true labels in the vector cl. In this session we try to rediscover the labels in an unsupervised fashion. The data consist of 128 samples with 12625 genes.
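As a quick sanity check (a small addition to the original protocol), confirm the dimensions of the expression matrix and the label counts:

> dim(dat)
> table(cl)

You should see 12625 genes, 128 samples, and the two labels B and T.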

1 Hierarchical clustering of samples

The first step in hierarchical clustering is to compute a matrix containing all distances between the objects that are to be clustered. For the ALL data set, the distance matrix is of size 128 × 128. From this we compute a dendrogram using complete linkage hierarchical clustering:

> d <- dist(t(dat))

> image(as.matrix(d))

> hc <- hclust(d, method = "complete")

> plot(hc, labels = cl)

We now split the dendrogram into two clusters (using the function cutree) and compare the resulting clusters with the true classes. We first compute a contingency table and then apply an independence test.

> groups <- cutree(hc, k = 2)

> table(groups, cl)



Figure 1: Dendrogram of complete linkage hierarchical clustering. The two cell types are not well separated.

      cl
groups  B  T
     1 73 31
     2 22  2

> fisher.test(groups, cl)$p.value

[1] 0.03718791

The null hypothesis of the Fisher test is "the true labels and the cluster results are independent". The p-value shows that we can reject this hypothesis at a significance level of 5%. But this is just statistics: What is your impression of the quality of this clustering? Did we already succeed in reconstructing the true classes?

Exercise: In the lecture you learned that hierarchical clustering is an agglomerative bottom-up process of iteratively joining single samples and clusters. Plot hc again, using the parameter labels=1:128. Now read the help text for hclust: hc$merge describes the merging of clusters. Read the details carefully. Use the information in hc$merge and the cluster plot to explain the clustering process and to retrace the algorithm.
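A possible starting point for this exercise (a sketch, not a full solution):

> plot(hc, labels = 1:128)
> head(hc$merge)
> head(hc$height)

Negative entries in hc$merge refer to single samples, positive entries to clusters formed in earlier steps; hc$height gives the distance at which each merge occurs.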

Exercise: Cluster the data again using single linkage and average linkage. Do the dendrogram plots change?
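A minimal sketch for this exercise:

> hc.single <- hclust(d, method = "single")
> plot(hc.single, labels = cl, main = "single linkage")
> hc.average <- hclust(d, method = "average")
> plot(hc.average, labels = cl, main = "average linkage")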

2 Gene selection before clustering samples

The dendrogram plot suggests that the clustering can still be improved. And the p-value is not that low. To improve our results, we should try to avoid using genes that contribute just noise and no information. A simple approach is to exclude genes that show little variance across the samples; here we keep the 100 genes with the highest variance. Then we repeat the analysis from the last section.

> genes.var <- apply(dat, 1, var)

> genes.var.select <- order(genes.var, decreasing = T)[1:100]

> dat.s <- dat[genes.var.select, ]


> d.s <- dist(t(dat.s))

> hc.s <- hclust(d.s, method = "complete")

> plot(hc.s, labels = cl)


Figure 2: Dendrogram of hierarchical clustering on the 100 genes with highest variance across samples. The two classes are clearly separated.

Plot the distance matrix d.s. Compare it to d. Then split the tree again and analyze the result by a contingency table and an independence test:
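For the comparison of the two distance matrices, a minimal sketch:

> image(as.matrix(d.s))
> image(as.matrix(d))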

> groups.s <- cutree(hc.s, k = 2)

> table(groups.s, cl)

        cl
groups.s  B  T
       1 95  0
       2  0 33

> fisher.test(groups.s, cl)$p.value

[1] 2.326082e-31

It seems that reducing the number of genes was a good idea. But where does the effect come from: Is it just dimension reduction (from 12625 to 100) or do high-variance genes carry more information than other genes? We investigate by comparing our results to a clustering on 100 randomly selected genes:

> genes.random.select <- sample(nrow(dat), 100)

> dat.r <- dat[genes.random.select, ]

> d.r <- dist(t(dat.r))

> image(as.matrix(d.r))

> hc.r <- hclust(d.r, method = "complete")

> plot(hc.r, labels = cl)

> groups.r <- cutree(hc.r, k = 2)

> table(groups.r, cl)

> fisher.test(groups.r, cl)$p.value



Figure 3: Distance matrices for 128 samples, using all genes (left), 100 random genes (middle), and 100 high-variance genes (right).

Plot the distance matrix d.r and compare it against the other two distance matrices d and d.s. Repeat the random selection of genes a few times. How do the dendrogram, the distance matrix, and the p-value change? How do you interpret the result?
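One way to repeat the random selection in a loop (a sketch; the p-values will differ from run to run):

> for (i in 1:5) {
+ d.r <- dist(t(dat[sample(nrow(dat), 100), ]))
+ groups.r <- cutree(hclust(d.r, method = "complete"), k = 2)
+ print(fisher.test(groups.r, cl)$p.value)
+ }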

Finally, we compare the results on high-variance and on random genes in a cluster plot in two dimensions:

> library(cluster)

> par(mfrow = c(2, 1))

> clusplot(t(dat.r), groups.r, main = "100 random genes", color = T)

> clusplot(t(dat.s), groups.s, main = "100 high-variance genes", color = T)

Figure 4: You see the projection of the data onto the plane spanned by the first two principal components (= the directions in the data which explain the most variance). The group labels result from hierarchical clustering on 100 genes. In the upper plot the genes are selected randomly (the two components explain 30.39% of the point variability), in the lower plot according to highest variance across samples (44.08%). The red and blue ellipses indicate class boundaries. Do we need both principal components to separate the data using high-variance genes?


3 Partitioning methods

In hierarchical methods, the samples are arranged in a hierarchy according to a dissimilarity measure. Extracting clusters means cutting the tree-like hierarchy at a certain level. Increasing the number of clusters just subdivides existing clusters – clusters are nested.

In partitioning methods this is different. Before the clustering process you decide on a fixed number k of clusters, and samples are then assigned to the cluster where they fit best. This is more flexible than hierarchical clustering, but leaves you with the problem of selecting k. Increasing the number of clusters does not subdivide existing clusters but can result in a totally new allocation of the samples. We discuss two examples of partitioning methods: k-means clustering and PAM (Partitioning Around Medoids).

3.1 k-means

First look at help(kmeans) to become familiar with the function. Set the number of clusters to k=2. Then perform k-means clustering for all samples and all genes. k-means uses a random starting solution and thus different runs can lead to different results. Run k-means 10 times and use the result with the smallest within-cluster variance.

> k <- 2

> withinss <- Inf

> for (i in 1:10) {

+ kmeans.run <- kmeans(t(dat), k)

+ print(sum(kmeans.run$withinss))

+ print(table(kmeans.run$cluster, cl))

+ cat("----\n")

+ if (sum(kmeans.run$withinss) < withinss) {

+ result <- kmeans.run

+ withinss <- sum(result$withinss)

+ }

+ }

> table(result$cluster, cl)

The last result is the statistically best out of 10 tries – but does it reflect the biology? Now run k-means again using the 100 top-variance genes. Compare the results.

> kmeans.s <- kmeans(t(dat.s), k)

> table(kmeans.s$cluster, cl)

3.2 PAM: Partitioning Around Medoids

As explained in today's lecture, PAM is a generalization of k-means. Look at the help page of the R function. Then apply pam for clustering the samples using all genes:

> result <- pam(t(dat), k)

> table(result$clustering, cl)

Now use the k=2:50 top-variance genes for clustering and calculate the number of misclassifications in each step. Plot the number of genes versus the number of misclassifications. What is the minimal number of genes needed to obtain 0 misclassifications?


> ngenes <- 2:50

> o <- order(genes.var, decreasing = T)

> miscl <- NULL

> for (k in ngenes) {

+ dat.s2 <- dat[o[1:k], ]

+ pam.result <- pam(t(dat.s2), k = 2)

+ ct <- table(pam.result$clustering, cl)

+ miscl[k] <- min(ct[1, 2] + ct[2, 1], ct[1, 1] + ct[2, 2])

+ }

> xl = "# genes"

> yl = "# misclassification"

> plot(ngenes, miscl[ngenes], type = "l", xlab = xl, ylab = yl)
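To read off the answer directly (a one-line sketch, assuming zero misclassifications are reached somewhere in 2:50):

> min(ngenes[miscl[ngenes] == 0])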

4 How many clusters are in the data?

Both kmeans and pam assume that the number of clusters is known. The methods will produce a result no matter which value of k you choose. But how can we decide which choice is a good one?

4.1 The objective function

Partitioning methods try to optimize an objective function. The objective function of k-means is the sum of all within-cluster sums of squares. Run k-means with k=2:20 clusters and the 100 top-variance genes. For k=1 calculate the total sum of squares. Plot the obtained values of the objective function.

> totalsum <- sum(diag((ncol(dat.s) - 1) * cov(t(dat.s))))

> withinss <- numeric()

> withinss[1] <- totalsum

> for (k in 2:20) {

+ withinss[k] <- sum(kmeans(t(dat.s), k)$withinss)

+ }

> plot(1:20, withinss, xlab = "# clusters", ylab = "objective function", type = "b")

Why is the objective function not necessarily decreasing? Why is k=2 a good choice?
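The jumps occur because kmeans starts from a random solution and can get stuck in a local optimum. One way to smooth the curve (a sketch using the nstart argument of kmeans, which repeats the run and keeps the best solution):

> withinss.best <- numeric()
> withinss.best[1] <- totalsum
> for (k in 2:20) {
+ withinss.best[k] <- sum(kmeans(t(dat.s), k, nstart = 10)$withinss)
+ }
> plot(1:20, withinss.best, xlab = "# clusters", ylab = "objective function", type = "b")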

4.2 The silhouette score

First read the Details section of help(silhouette) for a definition of the silhouette score. Then compute silhouette values for PAM clustering results with k=2:20 clusters. Plot the silhouette widths. Choose an optimal k according to the maximal average silhouette width. Compare silhouette plots for k=2 and k=3. Why is k=2 optimal? Which observations are misclassified? Which cluster is more compact?

> asw <- numeric()

> for (k in 2:20) {

+ asw[k] <- pam(t(dat.s), k)$silinfo$avg.width

+ }

> plot(1:20, asw, xlab = "# clusters", ylab = "average silhouette width", type = "b")

> plot(silhouette(pam(t(dat.s), 2)))


Figure 5: Silhouette plots for k=2 and k=3. For k=2 the average silhouette width is 0.32 (cluster 1: n=95, average width 0.28; cluster 2: n=33, average width 0.41); for k=3 it drops to 0.17 (cluster 1: n=48, 0.13; cluster 2: n=47, 0.06; cluster 3: n=33, 0.38).

5 How to check significance of clustering results

Randomly permute the class labels 20 times. For k=2, calculate the average silhouette width (asw) for the pam clustering results with the true labels and for all random labelings. Generate a box plot of the asw values obtained from the random labelings. Compare it with the asw value obtained from the true labels.

> d.s <- dist(t(dat.s))

> cl.true <- as.numeric(cl == "B")

> asw <- mean(silhouette(cl.true, d.s)[, 3])

> asw.random <- rep(0, 20)

> for (sim in 1:20) {

+ cl.random = sample(cl.true)

+ asw.random[sim] = mean(silhouette(cl.random, d.s)[, 3])

+ }

> symbols(1, asw, circles = 0.01, ylim = c(-0.1, 0.4), inches = F, bg = "red")

> text(1.2, asw, "Average silhouette value\n for true labels")

> boxplot(data.frame(asw.random), col = "blue", add = T)

You will see: The average silhouette value of the pam clustering result lies well outside the range of values achieved by random clusterings.
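If you want a single number summarizing this comparison (a small addition: an empirical permutation p-value, necessarily coarse with only 20 permutations):

> mean(asw.random >= asw)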

6 Clustering of genes

Cluster the 100 top-variance genes with hierarchical clustering. You can get the gene names from the annotation package hgu95av2.

> library(hgu95av2)

> gene.names <- sapply(geneNames(ALL), get, hgu95av2SYMBOL)

> gene.names <- gene.names[genes.var.select]

> d.g = dist(dat.s)

> hc = hclust(d.g, method = "complete")

> plot(hc, labels = gene.names)



Figure 6: Hierarchical clustering of 100 top variance genes

Now plot the average silhouette widths for k=2:20 clusters using pam and have a look at the silhouette plot for k=2. What do you think: Did we find a structure in the genes?

> asw = numeric()

> for (k in 2:20) {

+ asw[k] = pam(dat.s, k)$silinfo$avg.width

+ }

> plot(1:20, asw, xlab = "# clusters", ylab = "average silhouette width", type = "b")

> plot(silhouette(pam(dat.s, 2)))

Figure 7: Left: Average silhouette width of the 100 top-variance genes for k=2:20 clusters (values range from about 0.14 to 0.24). Right: Silhouette plot for k=2; the average silhouette width is 0.24 (cluster 1: n=57, average width 0.22; cluster 2: n=43, average width 0.27).

7 Presenting results

In the last section we saw that a substructure in the genes is not easy to find. When interpreting heatmap plots like the following, it is important to keep in mind: Any clustering method will eventually produce a clustering; before submitting a manuscript to Nature you should check the validity and stability of the groups you found.

> help(heatmap)

> heatmap(dat.s)


Figure 8: Heatmap representation of all samples and the 100 selected high-variance genes. Rows and columns are hierarchically clustered.

8 How to fake clusterings

For publication of your results it is advantageous to show nice-looking cluster plots supporting your claims. Here we demonstrate how to produce almost any clustering you like in a situation with few samples but many genes.

First reduce the data set to the first 5 B-cell and the first 5 T-cell samples. Then permute the class labels randomly. The permuted labels do not correspond to biological entities. Nevertheless we will demonstrate how to get a perfect clustering result.

The strategy is easy: just select 100 genes according to differential expression between the groups defined by the random labels, then perform hierarchical clustering of the samples with these genes only.

> select = c(which(cl == "B")[1:5], which(cl == "T")[1:5])

> dat.small = dat[, select]

> cl.random = sample(cl[select])

> library(multtest)

> tstat = mt.teststat(dat.small, classlabel = (cl.random == "B"), test = "t")

> genes = order(abs(tstat), decreasing = T)[1:100]

> dat.split = dat.small[genes, ]

> d = dist(t(dat.split))

9

Page 10: | Exploratory Data Analysis | Clustering gene expression datacompdiag.molgen.mpg.de/ngfn/docs/2006/nov/cluster-exercises.pdf · | Exploratory Data Analysis | Clustering gene expression

> hc = hclust(d, method = "complete")

> par(mfrow = c(1, 2))

> plot(hc, labels = cl.random, main = "Labels used for gene selection", xlab = "")

> plot(hc, labels = cl[select], main = "True labels", xlab = "")

Repeat this experiment a few times and compare the two plots. You should also have a look at the heatmap representation of dat.split.
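For the heatmap, a minimal sketch (the labCol argument just displays the true labels along the columns):

> heatmap(dat.split, labCol = cl[select])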


Figure 9: Selecting only highly differential genes leads to well-separated groups of samples — even if these groups are defined by random assignments of labels without any biological meaning.

9 Ranking is no clustering!

A typical situation in microarray analysis is the following: You are interested in one specific gene and want to identify genes with a similar function. A reasonable solution to this problem is to calculate the distance of all other genes to the gene of interest and then analyze the list of the closest (top-ranking) genes.

We choose the fifth gene out of our 100 high-variance genes and print the distances and the names of the 10 genes that are closest to it.

> list.distances = as.matrix(d.g)[5, ]

> o = order(list.distances)

> list.distances[o][1:10]

> gene.names[o][1:10]

You can see that the original gene of interest is ranked first with a distance of 0 to itself, and the second and the seventh gene are replicates of this gene. This is a plausible result.

Another popular approach in this situation is to first cluster all genes and then analyze the genes that are in the same cluster as the original gene of interest. In general this strategy is not suitable, since even two objects that are very close to each other can belong to two different clusters if they are close to a cluster boundary.

In our example, look at the result of the hierarchical clustering of genes obtained above. Define three clusters based on this result and print the clusters of the 10 closest genes identified above.

> hc = hclust(d.g, method = "complete")

> cutree(hc, 3)[o][1:10]

To which clusters do the original gene of interest and the replicates belong? You can see that even on this low clustering level the replicates are separated. Thus remember: If you are interested in detecting genes that are similar to a gene of your choice, you should always prefer a ranked list to a clustering approach.


10 ISIS: Identifying splits with clear separation

At the end of this exercise we present a method that addresses two problems of clustering samples:

1. Only a subset of genes may be important to distinguish one group of samples from another.

2. Different sets of genes may determine different splits of the sample set.

The method ISIS (von Heydebreck et al., 2001) searches for binary class distinctions in the set of samples that show clear separation in the expression levels of specific subsets of genes. Several mutually independent class distinctions may be found, which is difficult to obtain from the most commonly used clustering algorithms. Each class distinction can be biologically interpreted in terms of its supporting genes.

> library(isis)

> help(isis)

Read the help text and find out about the parameters of the function isis. Now we use it on the 1000 top-variance genes:

> dat.isis = dat[order(apply(dat,1,var),decreasing=T)[1:1000], ]

> splits = isis(dat.isis)

The matrix splits contains in each row a binary class separation of the samples and in the last column the corresponding score.

Exercise: Use table() to compare the splits found with the vector cl. You will see: One of the high-scoring splits perfectly recovers the difference between B- and T-cells.
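A possible sketch for this exercise, assuming that each row of splits holds one binary class assignment with the score in the last column, as described above:

> for (i in 1:nrow(splits)) {
+ print(table(splits[i, -ncol(splits)], cl))
+ }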

Exercise: Find out what the parameter p.off does. Run isis again with p.off=10. How does the result change?
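A sketch for this second exercise:

> splits.off <- isis(dat.isis, p.off = 10)
> nrow(splits.off)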


11 Summary

Scan through this paper once again and identify in each section the main message you have learned. Some of the most important points are:

• You learned about hierarchical clustering, k-means, partitioning around medoids, and silhouette scores.

• Selecting high-variance genes before clustering improves the result.

• Selecting only highly differential genes leads to well-separated groups of samples — even if these groups are defined by random assignments of labels without any biological meaning.

If you have any comments, criticisms or suggestions on our lectures, please contact us: [email protected] and [email protected]

12 References

Chiaretti S, Li X, Gentleman R, Vitale A, Vignetti M, Mandelli F, Ritz J, Foa R. Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood. 2004 Apr 1;103(7):2771-8.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer, 2001.

von Heydebreck A, Huber W, Poustka A, Vingron M. Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics. 2001;17(Suppl 1):S107-S114.
