
Page 1

Microarray Data Analysis

Class discovery and Class prediction: Clustering and Discrimination

Page 2

Gene expression profiles

Many genes show definite changes of expression between conditions

These patterns are called gene profiles

Page 3

Motivation (1): The problem of finding patterns

It is common to have hybridizations where conditions reflect temporal or spatial aspects:
Yeast cell cycle data
Tumor data: evolution after chemotherapy
CNS data from different parts of the brain

Interesting genes may be those showing patterns associated with these changes.

The problem is distinguishing interesting, real patterns from meaningless variation, at the level of the gene.

Page 4

Finding patterns: Two approaches

If patterns already exist: profile comparison (distance analysis)
Find the genes whose expression fits specific, predefined patterns.
Find the genes whose expression follows the pattern of a predefined gene or set of genes.

If we wish to discover new patterns: cluster analysis (class discovery)
Carry out some kind of exploratory analysis to see what expression patterns emerge.

Page 5

Motivation (2): Tumor classification

A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer.

Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables.

In spite of recent progress, there are still uncertainties in diagnosis. It is also likely that the existing classes are heterogeneous.

DNA microarrays may be used to characterize the molecular variation among tumors by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumors.

Page 6

Tumor classification (cont.)

There are three main types of statistical problems associated with tumor classification:

1. The identification of new/unknown tumor classes using gene expression profiles - cluster analysis;

2. The classification of malignancies into known classes - discriminant analysis;

3. The identification of “marker” genes that characterize the different tumor classes - variable selection.

Page 7

Cluster and Discriminant analysis

These techniques group, or equivalently classify, observational units on the basis of measurements.

They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping:
In cluster analysis (unsupervised learning, class discovery), there are no predefined groups or labels for the observations.
Discriminant analysis (supervised learning, class prediction) is based on the existence of predefined groups (labels).

Page 8

Clustering microarray data

Clustering can be applied to genes (rows), mRNA samples (columns), or both at once.
Cluster samples to identify new cell or tumor subtypes.
Cluster genes (rows) to identify groups of co-regulated genes.
We can also cluster genes to reduce redundancy, e.g. for variable selection in predictive models.

Page 9

Advantages of clustering

Clustering leads to readily interpretable figures.
Clustering strengthens the signal when averages are taken within clusters of genes (Eisen).
Clustering can be helpful for identifying patterns in time or space.
Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc.).

Page 10

Applications of clustering

Alizadeh et al. (2000), "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling."

Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).

The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

Page 11

Clusters on both genes and arrays

Taken from Nature, February 2000. Paper by Alizadeh, A. et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling."

Page 12

Discovering tumor subclasses

DLBCL is clinically heterogeneous.

Specimens were clustered based on their expression profiles of GC B-cell associated genes.

Two subgroups were discovered:
GC B-like DLBCL
Activated B-like DLBCL

Page 13

Basic principles of clustering

Aim: to group observations that are “similar” based on predefined criteria.

Issues: Which genes / arrays to use?

Which similarity or dissimilarity measure?

Which clustering algorithm?

It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context specific; see the later example.

Page 14

Measures of dissimilarity

Correlation

Distance:
Manhattan
Euclidean
Mahalanobis distance
Many more …
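As a quick illustration, here is a minimal R sketch on a small simulated matrix (the data and names are ours, not from the original slides); note that 1 - cor(...) turns a correlation into a dissimilarity:

set.seed(1)
x <- matrix(rnorm(50), nrow = 10)            # 10 observations, 5 variables
d.euc <- dist(x, method = "euclidean")       # Euclidean distance
d.man <- dist(x, method = "manhattan")       # Manhattan (city-block) distance
d.cor <- as.dist(1 - cor(t(x)))              # correlation-based dissimilarity
d.mah <- mahalanobis(x, colMeans(x), cov(x)) # squared Mahalanobis distance to the centroid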

Page 15

Two basic types of methods

Partitioning
Hierarchical

Page 16

Partitioning methods

Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.

Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimizing the within-cluster sums of squares.

Examples: k-means, self-organizing maps (SOM), PAM, etc.
Fuzzy partitioning needs a stochastic model, e.g. Gaussian mixtures.
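A minimal k-means sketch in R (simulated data; the choice k = 3 is arbitrary), showing the reallocate-until-converged idea:

set.seed(1)
x <- matrix(rnorm(200), nrow = 40)         # 40 observations, 5 variables
km <- kmeans(x, centers = 3, nstart = 25)  # nstart restarts guard against bad starting points
table(km$cluster)                          # cluster sizes
km$tot.withinss                            # the within-cluster SS criterion being minimized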

Page 17

Hierarchical methods

Hierarchical clustering methods produce a tree or dendrogram.

They avoid specifying how many clusters are appropriate: a partition for each k is obtained by cutting the tree at some level.

The tree can be built in two distinct ways:
bottom-up: agglomerative clustering;
top-down: divisive clustering.
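A minimal R sketch of the cutting idea (simulated data): the tree is built once, and cutree recovers a partition for any k without re-running the clustering.

set.seed(1)
x <- matrix(rnorm(200), nrow = 40)
hc <- hclust(dist(x), method = "average")  # build the dendrogram once
cutree(hc, k = 2:4)                        # partitions for k = 2, 3, 4: one column per k
plot(hc, cex = 0.6)                        # inspect the tree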

Page 18

Agglomerative methods

Start with n clusters. At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters.

Between-cluster dissimilarity measures:
Mean-link: average of pairwise dissimilarities.
Single-link: minimum of pairwise dissimilarities.
Complete-link: maximum of pairwise dissimilarities.
Distance between centroids.
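In R's hclust these measures correspond to method = "average", "single", "complete" and "centroid" (the centroid method expects squared Euclidean distances); a small comparison sketch on simulated data:

set.seed(1)
x <- matrix(rnorm(200), nrow = 40)
d <- dist(x)
hc.mean   <- hclust(d, method = "average")     # mean-link
hc.single <- hclust(d, method = "single")      # single-link
hc.compl  <- hclust(d, method = "complete")    # complete-link
hc.cent   <- hclust(d^2, method = "centroid")  # centroid; note the squared distances
table(cutree(hc.single, 3), cutree(hc.compl, 3))  # different linkages can disagree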

Page 19

[Figure: schematic comparison of between-cluster dissimilarities: distance between centroids, single-link, complete-link, and mean-link.]

Page 20

Divisive methods

Start with only one cluster. At each step, split a cluster into two parts, choosing the split that gives the greatest distance between the two new clusters.

Advantages:
Obtains the main structure of the data, i.e. focuses on the upper levels of the dendrogram.

Disadvantages:
Computational difficulties when considering all possible divisions into two groups.
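One standard divisive implementation is diana in the cluster package; it splits heuristically (always dividing the cluster with the largest diameter) rather than examining all possible two-group divisions. A minimal sketch on simulated data:

library(cluster)
set.seed(1)
x <- matrix(rnorm(200), nrow = 40)
dv <- diana(x)                # divisive hierarchical clustering
plot(dv, which.plots = 2)     # dendrogram
cutree(as.hclust(dv), k = 3)  # cut the divisive tree into 3 groups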

Page 21

[Figure: illustration of agglomerative clustering on five points in two-dimensional space; the merges proceed {1,5}, then {1,2,5}, then {3,4}, then {1,2,3,4,5}, shown as a dendrogram over the leaf ordering 1 5 2 3 4.]

Page 22

[Figure: the same five-point agglomerative example, raising the question of tree re-ordering: the dendrogram's leaves can be displayed in different orders (e.g. 1 5 2 3 4) by flipping branches, without changing the tree itself.]

Page 23

Partitioning or Hierarchical?

Partitioning:
Advantages:
Optimal for certain criteria.
Genes automatically assigned to clusters.
Disadvantages:
Need the initial k.
Often requires long computation times.
All genes are forced into a cluster.

Hierarchical:
Advantages:
Faster computation.
Visual.
Disadvantages:
Unrelated genes are eventually joined.
Rigid: cannot correct later for erroneous decisions made earlier.
Hard to define clusters.

Page 24

Hybrid Methods

Mix elements of partitioning and hierarchical methods:
Bagging: Dudoit & Fridlyand (2002)
HOPACH: van der Laan & Pollard (2001)

Page 25

Three generic clustering problems

Three important (and generic) tasks are:

1. Estimating the number of clusters;
2. Assigning each observation to a cluster;
3. Assessing the strength/confidence of the cluster assignments for individual observations.

These tasks are not equally important in every problem.

Page 26

Estimating the number of clusters using the silhouette

Define the silhouette width of an observation as

s = (b - a) / max(a, b)

where a is the average dissimilarity of the observation to all other points in its own cluster, and b is the smallest average dissimilarity to the points of any other cluster.

Intuitively, objects with large s are well clustered, while those with small s tend to lie between clusters.

How many clusters? Perform the clustering for a sequence of numbers of clusters k and choose the k corresponding to the largest average silhouette width.

The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
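A minimal sketch of this rule in R using pam from the cluster package, whose output carries the average silhouette width directly (simulated data; the dissimilarity x3 from the later example could be used instead):

library(cluster)
set.seed(1)
x <- matrix(rnorm(200), nrow = 40)
avg.sil <- sapply(2:6, function(k) pam(x, k)$silinfo$avg.width)
names(avg.sil) <- 2:6
avg.sil                                # average silhouette width for each k
as.numeric(names(which.max(avg.sil)))  # k with the largest average silhouette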

Page 27

Estimating the number of clusters using the bootstrap

There are resampling-based rules (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1985) and Dudoit and Fridlyand (2002)).

The bottom line is that none of them works very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.

It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.
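One readily available resampling-based rule (not covered on this slide) is the gap statistic of Tibshirani, Walther and Hastie (2001), implemented as clusGap in the cluster package; a minimal sketch on simulated data:

library(cluster)
set.seed(1)
x <- matrix(rnorm(200), nrow = 40)
gap <- clusGap(x, FUNcluster = kmeans, nstart = 20, K.max = 6, B = 50)  # B reference datasets
print(gap, method = "firstSEmax")  # suggested number of clusters under a 1-SE rule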

Page 28

Limitations

Cluster analyses:
usually lie outside the normal framework of statistical inference;
are less appropriate when only a few genes are likely to change;
need lots of experiments;
will always produce clusters, even if there is nothing going on;
are useful for learning about the data, but do not provide biological truth.

Page 29

Examples

Page 30

The data

We will use the dataset presented in van't Veer et al. (2002), which is available at http://www.rii.com/publications/2002/vantveer.htm.

These data come from a study of gene expression in 97 node-negative breast cancer tumors.

The training set consisted of 78 tumors; 44 did not recur within 5 years of diagnosis and 34 did.

Among the 19 test samples, 7 did not recur within 5 years and 12 did.

The data were collected on two-color Agilent oligo arrays containing about 24K genes.

Page 31

Preprocessing

The data have been filtered using the procedures described in the original manuscript: only genes showing 2-fold differential expression, with a p-value for the gene being expressed < 0.01, in more than 5 samples are retained. There are 4,348 such genes.

Missing values were imputed using a k-nearest-neighbors imputation procedure (Troyanskaya et al., 2001): for each gene containing at least one missing value, we find the 5 genes most highly correlated with it and take the average of their values in the sample where the value of the given gene is missing. The missing value is replaced with this average.
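A minimal sketch of the correlation-based scheme just described, on a toy genes-by-samples matrix of our own making (the standard Troyanskaya et al. implementation is impute.knn in the Bioconductor impute package):

set.seed(1)
expr <- matrix(rnorm(500), nrow = 50)              # 50 genes x 10 samples
expr[sample(length(expr), 20)] <- NA               # sprinkle in missing values
cc <- cor(t(expr), use = "pairwise.complete.obs")  # gene-gene correlations
diag(cc) <- NA                                     # a gene is not its own neighbor
for (g in which(rowSums(is.na(expr)) > 0)) {
  nbrs <- order(cc[g, ], decreasing = TRUE)[1:5]   # 5 most correlated genes
  miss <- which(is.na(expr[g, ]))
  expr[g, miss] <- colMeans(expr[nbrs, miss, drop = FALSE], na.rm = TRUE)
}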

Page 32

R data

The filtered gene expression levels are stored in a 4348 × 97 matrix named bcdata, with rows corresponding to genes and columns to mRNA samples.

Additionally, the labels are contained in the 97-element vector surv.resp with 0 indicating good outcome (no recurrence within 5 years after diagnosis) and 1 indicating bad outcome (recurrence within 5 years after diagnosis).

Page 33

Hierarchical clustering (1)

Start by performing a hierarchical clustering on the mRNA samples, using correlation as the similarity function and complete-linkage agglomeration.

library(stats)
x1 <- as.dist(1 - cor(bcdata))   # correlation between samples, turned into a dissimilarity
clust.cor <- hclust(x1, method = "complete")
plot(clust.cor, cex = 0.6)

Page 34

Hierarchical clustering (1)

Page 35

Hierarchical clustering (2)

Perform a hierarchical clustering on the mRNA samples using Euclidean distance and average-linkage agglomeration. The results can differ significantly.

x2 <- dist(t(bcdata))   # Euclidean distance between samples (columns)
clust.euclid <- hclust(x2, method = "average")
plot(clust.euclid, cex = 0.6)

Page 36

Hierarchical clustering (2)

[Figure: Cluster Dendrogram of the 97 samples from hclust(*, "average") on dist(t(bcdata)); height runs from about 10 to 50, and each leaf is labeled sample:outcome (e.g. 54:bad, 37:good, 73:bad, ...).]

Page 37

Comparison between orderings

Sample    CORR.GROUP  EUC.GROUP
1:good    1           1
2:good    1           1
6:good    1           1
7:good    1           1
14:good   1           1
19:good   1           1
21:good   1           1
23:good   1           1
30:good   1           1
39:good   1           1
41:good   1           1
42:good   1           1
49:bad    1           1
58:bad    1           1
61:bad    1           1
70:bad    1           1
72:bad    1           1
88:good   1           1
92:bad    1           1
3:good    2           1
9:good    2           1
13:good   2           1
47:bad    2           1
83:good   2           1
4:good    3           1
11:good   3           1
16:good   3           1
40:good   3           1
52:bad    3           1
63:bad    3           1

In this case we observe that:
Clustering based on correlation and complete linkage distributes the samples more uniformly between groups.
The Euclidean/average-linkage combination yields one huge group and many small ones.
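A sketch of how a comparison like this can be produced from the two trees built above (the choice of 5 groups here is our assumption):

corr.group <- cutree(clust.cor, k = 5)       # groups from the correlation/complete tree
euc.group  <- cutree(clust.euclid, k = 5)    # groups from the Euclidean/average tree
head(data.frame(corr.group, euc.group), 10)  # side-by-side group labels per sample
table(corr.group, euc.group)                 # cross-tabulation of the two partitions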

Page 38

Partitioning around medoids

If we assume a fixed number of clusters, we can use a partitioning method such as PAM, a robust version of k-means which clusters the data around medoids.

library(cluster)
x3 <- as.dist(1 - cor(bcdata))         # same correlation dissimilarity as before
clust.pam <- pam(x3, 5, diss = TRUE)   # partition the samples into k = 5 clusters
clusplot(clust.pam, labels = 5, col.p = clust.pam$clustering)
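Tying this back to the silhouette slide: the PAM object carries its own silhouette information, so a sketch of checking the quality of the k = 5 partition is:

clust.pam$silinfo$avg.width       # average silhouette width of the PAM partition
plot(clust.pam, which.plots = 2)  # silhouette plot of the partition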

Page 39

PAM clustering (2)

[Figure: clusplot(pam(x = x3, k = 5, diss = TRUE)); Component 1 vs. Component 2, which together explain 24.97 % of the point variability; the five clusters are labeled 1 to 5.]