Microarray Data Analysis: Clustering and Validation Measures


04/19/23 Raffaele Giancarlo 1

Microarray Data Analysis: Clustering and Validation Measures

Raffaele Giancarlo
Dipartimento di Matematica, Università di Palermo, Italy


What we want (typically)

[Figure: gene expression matrix; rows are genes, columns are expression levels]

• Group functionally related genes together

• Basic Axiom of Computational Biology: Guilt by Association. A high similarity among objects, as measured by mathematical functions, is a strong indication of functional relatedness… Not always

• Clustering


What we want (typically)

Clustering Solution


Limitations in the Analysis Process


Limitations: Microarray Technology

• "MIAME, we have a problem", Robert Shields, Trends in Genetics, 2006

– …no amount of statistical or algorithmic knowledge can compensate for limitations of the technology itself

– A large proportion of the transcriptome is beyond the reach of current technology, i.e., the signal is too weak


Limitations: Visualization Tools

• One of these two clusters is random noise… Which one?


Limitations: Statistics

• "Towards sound epistemological foundations of statistical methods for high-dimensional biology", T. Mehta et al., Nature Genetics, 2004

– Many papers in omic research describe the development or application of statistical methods; many of those are questionable


Overview Of Remaining Part

• Clustering as a three-step process
• Internal Validation Techniques
• External Validation Techniques
• Experiments
• One-stop-shop software systems
• Some Issues I Really Had to Talk About


Cluster Analysis as a Three Step Process


What is clustering?

• Group similar objects together

          E1   E2   E3   E4
Gene 1    -2   +2   +2   -1
Gene 2    +8   +3    0   +4
Gene 3    -4   +5   +4   -2
Gene 4    -1   +4   +3   -1

Clustering genes (rows); clustering experiments (columns)


What is Clustering?

• Goal: partition the observations {x_i} so that

– C(i) = C(j) if x_i and x_j are “similar”

– C(i) ≠ C(j) if x_i and x_j are “dissimilar”

• Natural questions:
– What is a cluster?
– How do I choose a good similarity function?
– How do I choose a good algorithm?

• APPLICATION and DATA DEPENDENT

– How many clusters are REALLY present in the data?


What’s a Cluster?

• No rigorous definition
• Subjective
• Scale/Resolution dependent (e.g. hierarchy)


Step One

• Choose a good similarity function

– Euclidean Distance
  • Captures magnitude and pattern of expression, i.e., direction

– Correlation functions
  • Capture pattern of expression, i.e., direction

– Etc…
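
To make the difference between the two families of functions concrete, here is a minimal sketch in plain Python. It reuses the four gene profiles from the "What is clustering?" table; the function names `euclidean` and `pearson` are illustrative choices, and Pearson correlation is assumed as one representative correlation function.

```python
import math

# Expression profiles (E1..E4) taken from the "What is clustering?" table.
genes = {
    "Gene 1": [-2, 2, 2, -1],
    "Gene 2": [8, 3, 0, 4],
    "Gene 3": [-4, 5, 4, -2],
    "Gene 4": [-1, 4, 3, -1],
}

def euclidean(x, y):
    """Euclidean distance: sensitive to both magnitude and pattern."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: sensitive to the pattern (direction) only."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

g1, g3 = genes["Gene 1"], genes["Gene 3"]
d13 = euclidean(g1, g3)   # the two profiles differ in magnitude
r13 = pearson(g1, g3)     # but follow almost the same pattern

# Doubling a profile changes its magnitude but not its pattern.
g1_scaled = [2 * v for v in g1]
d_scaled = euclidean(g1, g1_scaled)
r_scaled = pearson(g1, g1_scaled)

print(f"Gene 1 vs Gene 3: distance={d13:.3f}, correlation={r13:.3f}")
print(f"Gene 1 vs 2*Gene 1: distance={d_scaled:.3f}, correlation={r_scaled:.3f}")
```

Gene 1 and Gene 3 are far apart in Euclidean terms yet almost perfectly correlated, which is the kind of case where the choice of similarity function changes the clustering.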


Step Two

• Choose a good clustering algorithm. Algorithms may be broadly classified according to the objective function they optimize

– Compactness: intra-cluster variation is small
  • They like well-separated or spherical clusters but fail on more complex cluster shapes
  • K-means, Average Link Hierarchical Clustering

– Connectedness: neighboring items should share the same cluster
  • Robust with respect to cluster shapes, but fail when separation in the data is poor
  • Single Link Hierarchical Clustering, CAST, CLICK

– Spatial Separation: poor performer by itself, usually coupled with other criteria
  • Simulated Annealing, Tabu Search


Step Three

• An index that tells us how many clusters are really present in the data: Consistency/Uniformity

[Figure] more likely to be 2 than 3

[Figure] more likely to be 2 than 36? (depends, what if each circle represents 1000 objects?)


Step Three

• An index that tells us: Separability

increasing confidence to be 2


Step Three

• An index that is
– independent of cluster “volume”?
– independent of cluster size?
– independent of cluster shape?
– sensitive to outliers?
– etc…

• Theoretically Sound: Gap Statistics
• Data Driven and Validated: Many


Internal Validation Measures

• How many clusters are really present in the data
• Assess Cluster Quality
• Internal: No external knowledge about the dataset is given


The Basic Scheme

• Given an index F, a function of a clustering solution

• A black box producing clustering solutions C_2, …, C_m with k = 2, …, m clusters

• Compute F(C_k) to decide which k is best


Internal Validation Measures

• Within-Cluster Sum of Squares [Folklore]

• Gap Statistics [Tibshirani, Walther, Hastie 2001]

• FOM [Yeung, Haynor, Ruzzo 2001]

• Consensus Clustering [Monti et al., 2003]

• Etc…


Within-Cluster Sum of Squares

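$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2$$

[Figure: two points x_i and x_j in the same cluster C_r]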


Within-Cluster Sum of Squares

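$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \lVert x_i - x_j \rVert^2 = 2 n_r \sum_{i \in C_r} \lVert x_i - \bar{x}_r \rVert^2$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$

Here n_r is the number of points in cluster C_r and \bar{x}_r is its centroid.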

Measure of compactness of clusters
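
As a small numerical check of the definition, the sketch below computes W_k in two ways on a toy two-cluster solution: from the pairwise sums D_r, and from squared distances to the cluster centroids. The toy points and the function names (`pairwise_d`, `w_k`, `w_k_centroid`) are illustrative assumptions, not part of the original slides.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def pairwise_d(cluster):
    """D_r: squared distances summed over all ordered pairs (i, j) in the cluster."""
    return sum(math.dist(x, y) ** 2 for x in cluster for y in cluster)

def w_k(clusters):
    """W_k = sum over clusters of D_r / (2 * n_r)."""
    return sum(pairwise_d(c) / (2 * len(c)) for c in clusters)

def w_k_centroid(clusters):
    """Equivalent form: sum of squared distances to each cluster centroid."""
    return sum(math.dist(x, centroid(c)) ** 2 for c in clusters for x in c)

# Toy clustering solution with k = 2 (made-up points).
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(5.0, 5.0), (5.0, 7.0), (8.0, 5.0)],
]

print(w_k(clusters), w_k_centroid(clusters))
```

Both forms give the same value, so in practice the centroid form is the cheaper one to evaluate.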


Using Wk to determine # clusters

Idea of the L-Curve Method: use the k corresponding to the “elbow” (the most significant increase in goodness-of-fit)


Example

• Yeast Cell Cycle Dataset, 698 genes and 72 conditions

• Five functional classes: the gold solution

• Algorithm: K-means with Average Link input and Euclidean distance

• We want to know how many clusters are predicted by Wk, with K-means as an “oracle”


Example


Problems with Use of Wk

• No reference clustering solution to compare against, i.e., no model

• The values of Wk are not normalized and therefore cannot be compared

• In a nutshell: we get values of Wk but we do not quite know how far we are from randomness

• Gap Statistics takes care of those problems


The Gap Statistics

• Based on solid statistical work for the 1-D case, i.e., when the objects to be clustered are scalars. It takes care of the problems outlined for Wk

• Extended to work in higher dimensions – No Theory

• Validated experimentally


Sample Uniformly and at Random

1. Align with feature axes (data-geometry independent)

[Figure: the observations, their bounding box (aligned with feature axes), and the Monte Carlo simulations drawn inside it]


Computation of the Gap Statistic

for b = 1 to B
    Compute Monte Carlo sample X_{1b}, X_{2b}, …, X_{nb} (n is # obs.)

for k = 1 to K
    Cluster the observations into k groups and compute log W_k
    for b = 1 to B
        Cluster the M.C. sample into k groups and compute log W_{kb}
    Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the s.d. of {log W_{kb}}, b = 1, …, B
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\;\mathrm{sd}(k)$

Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$
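
The procedure on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the experiments: the clustering "oracle" is a bare-bones Lloyd-style k-means with a few random restarts, the reference samples are drawn uniformly from the axis-aligned bounding box, and the toy data (two Gaussian blobs), `B=10`, and all function names are made up for the example.

```python
import math
import random

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd iterations; returns the non-empty clusters as lists of points."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [cl for cl in clusters if cl]

def log_wk(points, k, rng, restarts=5):
    """log W_k, keeping the most compact solution over a few restarts."""
    best = float("inf")
    for _ in range(restarts):
        w = 0.0
        for cl in kmeans(points, k, rng):
            center = tuple(sum(col) / len(cl) for col in zip(*cl))
            w += sum(math.dist(p, center) ** 2 for p in cl)
        best = min(best, w)
    return math.log(best)

def gap_statistic(points, k_max, B=10, seed=0):
    """Return (chosen k, [Gap(1), ..., Gap(k_max)])."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    # B Monte Carlo samples, uniform in the bounding box of the data.
    refs = [
        [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
        for _ in range(B)
    ]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref_logs = [log_wk(ref, k, rng) for ref in refs]
        mean_ref = sum(ref_logs) / B
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / B)
        gaps.append(mean_ref - log_wk(points, k, rng))
        s.append(sd * math.sqrt(1 + 1 / B))
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1} (lists are 0-indexed).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Toy data: two tight, well-separated blobs.
data_rng = random.Random(1)
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
    for cx in (0.0, 10.0)
    for _ in range(30)
]
best_k, gaps = gap_statistic(points, k_max=4)
print(best_k, [round(g, 2) for g in gaps])
```

The cost is dominated by the B reference clusterings per value of k, which is consistent with the Gap statistic being the slowest measure in the timing table later in the talk.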


Example

• The same experimental setting as for the Within-Cluster Sum of Squares

• We want to know whether the Gap Statistics predicts 5 clusters, with K-means as an “oracle”


Example


Figure of Merit

• A purely experimental approach, designed and validated specifically for microarray data


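FOM

[Figure: data matrix with n genes (rows) and m experiments (columns); experiment e is singled out, the genes are grouped into clusters C_1, …, C_i, …, C_k, and R(g,e) is the entry for gene g in experiment e]

$$\mathrm{FOM}(e,k) = \sqrt{\frac{1}{n} \sum_{i=1}^{k} \sum_{g \in C_i} \left( R(g,e) - \mu_{C_i}(e) \right)^2}$$

$$\mathrm{FOM}(k) = \sum_{e=1}^{m} \mathrm{FOM}(e,k)$$

$$\text{adjusted } \mathrm{FOM}(e,k) = \frac{\mathrm{FOM}(e,k)}{\sqrt{(n-k)/n}}$$

Here μ_{C_i}(e) is the average expression level, in experiment e, of the genes in cluster C_i.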
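
A minimal sketch of the FOM computation in plain Python, following the formulation of Yeung, Haynor and Ruzzo. The data are the four gene profiles from the "What is clustering?" table, and the clustering passed in is hypothetical: it is assumed to have been obtained from the other experiments, with experiment e left out. Function names are illustrative.

```python
import math

# Gene -> expression levels over experiments E1..E4 (from the earlier table).
data = {
    "Gene 1": [-2, 2, 2, -1],
    "Gene 2": [8, 3, 0, 4],
    "Gene 3": [-4, 5, 4, -2],
    "Gene 4": [-1, 4, 3, -1],
}

def fom_e(data, clusters, e):
    """FOM(e, k): root mean squared deviation, in experiment e, of each gene
    from the mean of its cluster. `clusters` is a list of lists of gene names,
    assumed to come from clustering the data with experiment e left out."""
    n = len(data)
    total = 0.0
    for cluster in clusters:
        mean = sum(data[g][e] for g in cluster) / len(cluster)
        total += sum((data[g][e] - mean) ** 2 for g in cluster)
    return math.sqrt(total / n)

def adjusted_fom_e(data, clusters, e):
    """FOM(e, k) divided by sqrt((n - k) / n) to compensate for larger k."""
    n, k = len(data), len(clusters)
    return fom_e(data, clusters, e) / math.sqrt((n - k) / n)

def fom_total(data, clusterings):
    """FOM(k): sum of FOM(e, k) over experiments; clusterings[e] is the
    solution obtained with experiment e left out."""
    return sum(fom_e(data, cl, e) for e, cl in enumerate(clusterings))

# Hypothetical k = 2 solution, evaluated on the left-out experiment E1 (index 0).
clusters = [["Gene 1", "Gene 3", "Gene 4"], ["Gene 2"]]
fom_value = fom_e(data, clusters, 0)
adjusted_value = adjusted_fom_e(data, clusters, 0)
print(round(fom_value, 4), round(adjusted_value, 4))
```

A low value means the clusters found without experiment e still predict the expression levels in e well.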


FOM


Example

• Same experimental setting as in the Within Sum of Squares

• We want to know whether FOM indicates 5 clusters in the data set, with K-means as an “oracle”

• Hint: look for the elbow in the FOM plot, exactly as for the Wk curve.


Example


External Validation Measures

• Given two partitions of the same dataset, how close are they?

• Assess the quality of a partition against a given gold standard

• External: the gold standard, i.e., the reference partition, must be given and trusted. In the case of Biology, the elements in a cluster must be biologically correlated, i.e., belong to the same functional group of genes


Some External Validation Measures

• The two partitions must have the same number of classes
– Jaccard Index
– Minkowski score
– Rand Index [Rand 71]

• The two partitions can have a different number of classes
– The Adjusted Rand Index [Hubert and Arabie 85]
– The F measure [van Rijsbergen 79]


Some External Validation Measures

• Problem with the mentioned indexes:
– What is their expected value?

• In very intuitive terms, if one blindly picks two partitions among the possible partitions of the data, what value of the index should we expect? This is the same problem we had with Wk, the one the Gap Statistics takes care of.


The Adjusted Rand Index

• It takes as input two partitions, not necessarily having the same number of classes.

– Value 1, its maximum, means perfect agreement

– The expected value of the index, i.e., its value on two partitions that agree only by chance, is zero

• Note 1: the index may take negative values

• Note 2: the same property is not shared by the other mentioned indexes, including its relative, the Rand Index

– The index must be maximized

– We will see some of its uses later


Adjusted Rand index

• Compare clusters to classes
• Consider # pairs of objects

                   Same cluster   Different cluster
Same class              a                 c
Different class         b                 d


Example (Adjusted Rand)

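               c#1(4)   c#2(5)   c#3(7)   c#4(4)
class#1(2)       2        0        0        0
class#2(3)       0        0        0        3
class#3(5)       1        4        0        0
class#4(10)      1        1        7        1

$$a = \binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{7}{2} = 31$$

$$b = \binom{4}{2} + \binom{5}{2} + \binom{7}{2} + \binom{4}{2} - a = 43 - 31 = 12$$

$$c = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} + \binom{10}{2} - a = 59 - 31 = 28$$

$$d = \binom{20}{2} - a - b - c = 119$$

$$\text{Rand: } R = \frac{a + d}{a + b + c + d} = 0.789$$

$$\text{Adjusted Rand} = \frac{R - E(R)}{1 - E(R)} = 0.469$$

Closed form in the paper by Handl et al. (supplementary material)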
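
The numbers in this example can be reproduced with a short Python sketch. It counts the pairs a, b, c, d from the class-by-cluster contingency table and evaluates the adjusted index in the Hubert and Arabie form, (index − expected) / (maximum − expected), which is algebraically the same as (R − E(R)) / (1 − E(R)). Function names are illustrative.

```python
from math import comb

# Rows are classes 1..4, columns are clusters 1..4 (table from the example).
table = [
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 4, 0, 0],
    [1, 1, 7, 1],
]

def pair_counts(table):
    """Return (a, b, c, d) as defined in the pair-counting table."""
    n = sum(map(sum, table))
    same_both = sum(comb(v, 2) for row in table for v in row)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))
    a = same_both                 # same class, same cluster
    b = same_cluster - a          # different class, same cluster
    c = same_class - a            # same class, different cluster
    d = comb(n, 2) - a - b - c    # different class, different cluster
    return a, b, c, d

def rand_index(table):
    a, b, c, d = pair_counts(table)
    return (a + d) / (a + b + c + d)

def adjusted_rand_index(table):
    a, b, c, d = pair_counts(table)
    total = a + b + c + d
    expected = (a + b) * (a + c) / total   # expected value of a under chance
    maximum = ((a + b) + (a + c)) / 2      # largest attainable value of a
    return (a - expected) / (maximum - expected)

print(pair_counts(table))                    # (31, 12, 28, 119)
print(round(rand_index(table), 3))           # 0.789
print(round(adjusted_rand_index(table), 3))  # 0.469
```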


Some Experiments, or On the Need for Benchmark Data Sets


How Do I Pick:

• Distance and Similarity Functions, given algorithm and data set

• algorithm, given data set

• Internal Validation Measures, given data set


Different Distances, Same Algorithm and Implementation (k-means)


Same Distance, Two Different Implementations of the Same Algorithm: not all k-means are equal


Performance of Different Algorithms: Precision

Method                Clusters   Adjusted Rand
Max K-means Random       5          0.44
Min K-means Random       5          0.49
Cast                     5          0.529
K-means Avlink           5          0.508
Avlink                   5          0.559
Click                    8          0.51


Performance of Different Indexes-Precision


Performance of Different Indexes-Precision


Performance of Different Indexes: Time

Measure    Time in ms
Wk             157672
FOM           3695437
Gap MC       28082500
Gap P        26468125


Performance Evaluation

• Which conclusions can one draw from the experiments shown?

– Some indication of which distance, algorithm and measure to pick

• A much more extensive analysis is needed, with well-designed benchmark datasets


Performance Evaluation

• Benchmark data sets

– Hard to design, in particular for Microarrays

– Worth the trouble (see Tompa et al, Nature Biotechnology, 2005)


One-Stop-Shop Systems for the Analysis of Microarray Data


MIDAS and MEV

• Filtering and data normalization tools

• Clustering Algorithms (K-means, Cast)

• Validation Measures (FOM)

• Statistical Analysis tools


Click and Expander

• Data Normalization and Filtering

• Clustering Algorithm (In particular Click)

• Biclustering algorithms

• Validation Methods

• Statistical and Visualization Tools


Visualization Methods for Statistical Analysis of Microarray Data

• A system that combines statistical methods and data visualization

• Synoptic views and limited navigation on the data are supported


Some Issues I Should Have Talked About

• Issue 25: Over-expression and Under-expression of genes

– Problem: one gene subject to “normal” conditions; same gene subject to “different” conditions.

– Question: Are the measured expression levels different ?

– Sensitivity Analysis in Microarray Data: quite a bit of work; see for instance http://www-stat.stanford.edu/~tibs/SAM/


Advertisement

• Second Lipari International Summer School in Bioinformatics and Computational Biology

• Where and When: Lipari Island, Italy, June 14-21, 2008

• Theme: Biological Networks: Evolution, Interaction and Computation

• More Info at http://lipari.cs.unict.it/LipariSchool/Bio/index.php


Conclusions

• Data analysis for microarrays (and not only for them) is a complicated interactive process with no clear-cut recipe

• Reliable tools or knowledge of their limitations is a must

GOOD LUCK!!!