cluster validation for gene expression data · gene expression data ka yee yeung 1 david r. haynor...

Cluster Validation forGene Expression Data

Ka Yee Yeung 1

David R. Haynor 2

Walter L. Ruzzo 1

1 Department of Computer Science & Engineering2 Department of Radiology

University of Washington

Bioinformatics v17 #4, (2001) pp 309-318

14

Eisen’s ClusterSoftware

(PNAS 1998)

• Centroid-link hierarchicalclustering algorithm

• Reorder for display

• Decide on your own cluster!

Gene

s

15

Why Validate clusters?

• All clustering algorithms find “clusters”:– Are they real?

– Are they good?conditions

gene

s

A cluster from Eisenet al. (1998) on ayeast data set

A simulated data setwith no intrinsicclusters.

gen

es

conditions

16

Approaches to ClusterValidation: External Criteria• Agreement with an external “gold

standard” answer (rarely available)• Uniformity of clusters w.r.t. related

external information, e.g. GeneOntology or MIPS categories

• Either is quantifiable in various ways --Jaccard, Hubert, adjusted Rand indices,relative entropy, hypergeometric, …

17

Approaches to ClusterValidation: Internal Criteria

• “Compactness” & “separation”

• E.g. residual sum of squares to clustercenters vs sums of squares betweencenters

• E.g. Silhouette - average distance topoints in same cluster vs nearest othercluster

18

Silhouette

a(i) = avg to same

b(i) = avg to neighbor

s(i) = (b(i)-a(i))/ max(b(i),a(i))

19

Approaches to ClusterValidation: Model-based

• Given (statistical) model of data, howwell does model fit

• E.g. look at likelihood ratio that datacould have been generated by onemodel vs alternative

• More on this topic later in the quarter

20

Our Methodology forAlgorithm Comparison

• A form of “Leave Out One Cross Validation”– Cluster genes based on all but one condition.

– Use left-out condition to check cohesiveness ofclusters.

• I.e., within each cluster, how uniform are expressionlevels in the left-out condition?

• Meaningful clusters should be more uniform that chanceaggregations

– Repeat for each condition

• Compare algorithms based on performance.

21

Figure of Merit (FOM)

• FOM measures uniformity of gene expression levelsin each cluster in the left-out experiment(basically mean squared error)

• Low FOM => High predictive power• Leave out each experiment in turn

Cluster Ci

Cluster C1

Cluster Ck

Experiments1 p

gene

s

1

n

e

g

22

“Figure of Merit”

Â Â= Œ

-=k

i Cg i

eegRn 1

2C ))(),((

1k)FOM(e,

im

Â-

=

=1

0

),()(m

e

keFOMkFOM

)k-n/(nFOM(k) adjFOM(k) ⋅=

Experimental conditions0 m-1

gene

s

0

n-1

e

Cluster C1

Cluster Ci

Cluster Ck

g

R(g,e)

FOM(e,k) = mean squareddeviation of expression level from cluster mean:

In clusters formed, howuniform are expression levelsin the left-out condition?

23

Other approaches

• S. Datta & S. Datta ‘03 -- look atagreement between clusterings with alldata & leaving out different conditions

24

Three Successes

• We can distinguish clustered from non-clustered data

• We can tell algorithms apart

• Better FOM generally signals betterclusters

25

Are there clusters?

0

2

4

6

8

10

12

14

16

0 5 10 15 20 25 30 35 40 45 50

number of clusters

adju

sted

FO

Msingle-linkk-means

CASTrandom

0

2

4

6

8

10

12

0 5 10 15 20 25 30 35 40 45 50

number of clusters

adju

sted

F

OM

single-link

k-means

CAST

random

A simulateddata set with 5clusters

A simulateddata set withno clusters

26

Gene expression data sets

• Ovarian cancer data set(Michel Schummer, Institute of Systems Biology)

– Subset of data: 235 clones

24 experiments (cancer/normal tissue samples)

– 235 clones correspond to 4 genes

• Yeast cell cycle data (Cho et al 1998)

– 17 time points

– Subset of 384 genes associated with 5 phases of cellcycle

27

Results: ovary data

• CAST, k-means and complete-link : bestperformance

1500

2000

2500

3000

3500

4000

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

number of clusters

adju

sted

FO

M CASTrandomaverage-linksingle-linkcomplete-linkkmeanscentroid-link

28

Results: yeast cell cycle data

CAST, k-means: best performance

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

number of clusters

adju

sted

FO

M

CASTrandomavglinkcomplete-linksingle-linkk-meanscentroid-link

30

Rat CNS data

0

1

2

3

4

5

0 10 20 30 40 50 60 70 80 90 100 110 120number of clusters

FOM

k-meansCASTrandomsingle-linkaverage-link

Full range, non-adjusted FOM

31

Rat CNS data

0

1

2

3

4

5

0 10 20 30 40 50 60 70 80 90 100

number of clusters

adju

sted

FO

M

random

average-link

single-link

k-means

CAST

Full range, adjusted FOM

32

FOM on the Barrett’s data

6

6.5

7

7.5

8

8.5

9

9.5

0 5 10 15 20 25 30number of clusters

adju

sted

FO

M

k-means

CAST

random

average-link

single-link

complete-link

k-means (init with avglink)

33

FOM ª Cluster Quality

• On ovary data:– Lowest FOM clusters in good agreement with the

right answer

– Next lowest incorrectly split/merged true classes

• On Barrett’s data, 10 clusters:– the lowest FOM clusters (CAST & k-means

initialized with average-link) correctly grouped the20 cytokeratins that passed the variation filter

– the next lowest FOM (average-link) did NOT

34


• To show: lower FOM fi higheragreement with known gene categories.

• Ex: Wen et al.categorizedgenes in the ratCNS data setinto 4 families

35


• To show: lower FOM fi higheragreement with known gene categories.

• Ex: Wen et al.categorizedgenes in the ratCNS data setinto 4 families

-0.05

0

0.05

0.1

0.15

0.2

0.2 0.25 0.3 0.35 0.4

adjusted FOM

Hu

rbe

rt s

co

re

k-means run#1

k-means run#2

k-means run#3

k-means run#4

k-means run#5

random

36

FOM Summary

• Simple quantitative methodology tocompare different clustering algorithmson any data set without using anyexternal knowledge

• Reduced FOM generally signalsimproved clusters

• Omitting one condition doesn't destroycluster quality

37

FOM Summary, cont.

• All clustering algorithms not createdequal

• Some algorithm comparisons (on this data):– CAST and k-means produce higher quality

clusters than the hierarchical algorithms

– Single-link has the worst performanceamong the hierarchical algorithms

38

Acknowledgements• Ka Yee Yeung

• David Haynor

• Michael Barrett

• Michèl Schummer

More Infohttp://www.cs.washington.edu/homes/{kayee,ruzzo}

UW CSE Computational Biology Group

42

Adjusted Rand Examplec#1(4) c#2(5) c#3(7) c#4(4)

class#1(2) 2 0 0 0class#2(3) 0 0 0 3class#3(5) 1 4 0 0class#4(10) 1 1 7 1

1192

20

2831592

10

2

5

2

3

2

2

1231432

4

2

7

2

5

2

4

312

7

2

4

2

3

2

2

=---˜̃¯

ˆÁÁË

Ê=

=-=-˜̃¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê=

=-=-˜̃¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê=

=˜̃¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê+˜̃

¯

ˆÁÁË

Ê=

cbad

ac

ab

a

469.0)(1

)( Rand Adjusted

789.0,

=-

-=

=+++

+=

RE

RERdcda

daRRand

cluster validation for gene expression data · gene expression data ka yee yeung 1 david r. haynor...

Documents