Cluster Validation for Gene Expression Data
Ka Yee Yeung 1
David R. Haynor 2
Walter L. Ruzzo 1
1 Department of Computer Science & Engineering
2 Department of Radiology
University of Washington
Bioinformatics 17(4), 2001, pp. 309-318

Eisen’s Cluster Software
(PNAS 1998)
• Centroid-link hierarchical clustering algorithm
• Reorder for display
• Decide on your own cluster!
[Figure: heat map of expression data; rows are genes.]

Why Validate clusters?
• All clustering algorithms find “clusters”:
– Are they real?
– Are they good?
[Figure: two heat maps, genes (rows) by conditions (columns). One shows a cluster from Eisen et al. (1998) on a yeast data set; the other shows a simulated data set with no intrinsic clusters.]

Approaches to Cluster Validation: External Criteria
• Agreement with an external “gold standard” answer (rarely available)
• Uniformity of clusters w.r.t. related external information, e.g. Gene Ontology or MIPS categories
• Either is quantifiable in various ways: Jaccard, Hubert, adjusted Rand indices, relative entropy, hypergeometric, …
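
As one illustration of the hypergeometric option listed above, here is a minimal sketch of an enrichment p-value for a single cluster against a single external category. It assumes N annotated genes in total, K of them in the category, and a cluster of n genes containing x from that category. The function name and the numbers in the usage note are hypothetical, not taken from the slides.

```python
from math import comb

def hypergeom_tail(N, K, n, x):
    """P(X >= x) when n genes are drawn without replacement from N genes,
    K of which carry the category label (assumed setup, see lead-in)."""
    upper = min(K, n)            # cannot see more labeled genes than exist or are drawn
    total = comb(N, n)           # number of equally likely clusters of size n
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return hits / total
```

For example, `hypergeom_tail(10, 3, 2, 2)` is the chance that a random 2-gene cluster drawn from 10 genes contains both of its genes from a 3-gene category. A small value suggests the cluster is more uniform with respect to the category than chance would explain.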

Approaches to Cluster Validation: Internal Criteria
• “Compactness” & “separation”
• E.g. residual sum of squares to cluster centers vs sums of squares between centers
• E.g. Silhouette - average distance to points in same cluster vs nearest other cluster
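
The compactness vs separation comparison can be sketched as below. This is a minimal illustration, assuming each cluster is a list of equal-length numeric vectors and taking "sums of squares between centers" to mean the size-weighted squared distance of each cluster center from the grand mean (one common convention; the slides do not say which is meant). All function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def within_between_ss(clusters):
    """Return (within, between) sums of squares for a list of clusters."""
    centers = [centroid(c) for c in clusters]
    # Compactness: residual sum of squares of points to their own center.
    within = sum(sq_dist(p, ctr) for c, ctr in zip(clusters, centers) for p in c)
    # Separation: size-weighted squared distance of centers from the grand mean.
    grand = centroid([p for c in clusters for p in c])
    between = sum(len(c) * sq_dist(ctr, grand) for c, ctr in zip(clusters, centers))
    return within, between
```

A compact, well-separated clustering gives a small first value relative to the second.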
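
Silhouette
a(i) = average distance from point i to the other points in its own cluster
b(i) = average distance from point i to the points in the nearest neighboring cluster
s(i) = (b(i) - a(i)) / max(a(i), b(i))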
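
A minimal sketch of the silhouette computation, assuming Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets s(i) = 0 (that convention is an assumption, not something stated on the slide). The function name is hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def silhouette(points, labels):
    """Return s(i) for every point, given one cluster label per point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # singleton cluster (assumed convention)
            scores.append(0.0)
            continue
        # a(i): average distance to the other points in the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average distance to the points of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores
```

Values near 1 indicate a point that sits well inside its cluster; negative values indicate a point that is closer on average to a neighboring cluster.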

Approaches to Cluster Validation: Model-based
• Given a (statistical) model of the data, how well does the model fit?
• E.g. look at the likelihood ratio that the data could have been generated by one model vs an alternative
• More on this topic later in the quarter

Our Methodology for Algorithm Comparison
• A form of “Leave Out One Cross Validation”
– Cluster genes based on all but one condition.
– Use left-out condition to check cohesiveness of clusters.
• I.e., within each cluster, how uniform are expression levels in the left-out condition?
• Meaningful clusters should be more uniform than chance aggregations
– Repeat for each condition
• Compare algorithms based on performance.

Figure of Merit (FOM)
• FOM measures uniformity of gene expression levels in each cluster in the left-out experiment (basically mean squared error)
• Low FOM => High predictive power
• Leave out each experiment in turn
[Figure: expression matrix with genes 1..n as rows and experiments 1..p as columns; the rows are grouped into clusters C1, ..., Ci, ..., Ck, with one gene g and one left-out experiment e marked.]
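
“Figure of Merit”
[Figure: expression matrix R with genes 0..n-1 as rows and experimental conditions 0..m-1 as columns; the rows are grouped into clusters C1, ..., Ci, ..., Ck, and R(g,e) is the expression level of gene g in the left-out condition e.]
In clusters formed, how uniform are expression levels in the left-out condition?
FOM(e,k) = root mean squared deviation of expression level from cluster mean:
\[ \mathrm{FOM}(e,k) = \sqrt{\frac{1}{n}\sum_{i=1}^{k}\sum_{g\in C_i}\bigl(R(g,e)-\mu_{C_i}(e)\bigr)^2} \]
\[ \mathrm{FOM}(k) = \sum_{e=0}^{m-1}\mathrm{FOM}(e,k) \]
\[ \mathrm{adjFOM}(k) = \mathrm{FOM}(k)\cdot\sqrt{n/(n-k)} \]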
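
The leave-one-out procedure can be sketched as follows. This is a minimal illustration, assuming R is an n x m list of lists (genes by conditions), that FOM(e,k) is the root mean squared deviation from the cluster mean in the left-out condition, and that the adjustment divides by sqrt((n-k)/n) as in the published paper. `cluster_fn` is a hypothetical stand-in for any clustering algorithm: it takes the reduced matrix and k and returns k non-empty lists of gene (row) indices.

```python
from math import sqrt

def fom_e(R, clusters, e):
    """FOM(e, k) for clusters that were formed WITHOUT using column e of R."""
    n = sum(len(c) for c in clusters)
    ss = 0.0
    for c in clusters:
        mean = sum(R[g][e] for g in c) / len(c)      # cluster mean in condition e
        ss += sum((R[g][e] - mean) ** 2 for g in c)  # squared deviations from it
    return sqrt(ss / n)

def adjusted_fom(R, cluster_fn, k):
    """Sum FOM(e, k) over every left-out condition e, then adjust for k."""
    n, m = len(R), len(R[0])
    total = 0.0
    for e in range(m):
        reduced = [row[:e] + row[e + 1:] for row in R]  # drop condition e
        clusters = cluster_fn(reduced, k)               # cluster on the rest
        total += fom_e(R, clusters, e)                  # score on condition e
    return total / sqrt((n - k) / n)
```

Running `adjusted_fom` for a range of k and for several choices of `cluster_fn` gives curves of adjusted FOM against number of clusters, which is how the algorithms are compared in the results that follow.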

Other approaches
• S. Datta & S. Datta ’03: look at agreement between clusterings with all data & leaving out different conditions

Three Successes
• We can distinguish clustered from non-clustered data
• We can tell algorithms apart
• Better FOM generally signals better clusters

Are there clusters?
[Figure: two plots of adjusted FOM against number of clusters (0 to 50) for single-link, k-means, CAST, and random clusterings. One panel is a simulated data set with 5 clusters; the other is a simulated data set with no clusters.]

Gene expression data sets
• Ovarian cancer data set (Michel Schummer, Institute for Systems Biology)
– Subset of data: 235 clones, 24 experiments (cancer/normal tissue samples)
– 235 clones correspond to 4 genes
• Yeast cell cycle data (Cho et al 1998)
– 17 time points
– Subset of 384 genes associated with 5 phases of cell cycle

Results: ovary data
• CAST, k-means and complete-link: best performance
[Figure: adjusted FOM against number of clusters (0 to 30) for CAST, random, average-link, single-link, complete-link, k-means, and centroid-link.]

Results: yeast cell cycle data
• CAST, k-means: best performance
[Figure: adjusted FOM against number of clusters (0 to 30) for CAST, random, average-link, complete-link, single-link, k-means, and centroid-link.]

Rat CNS data
Full range, non-adjusted FOM
[Figure: FOM against number of clusters (0 to 120) for k-means, CAST, random, single-link, and average-link.]

Rat CNS data
Full range, adjusted FOM
[Figure: adjusted FOM against number of clusters (0 to 100) for random, average-link, single-link, k-means, and CAST.]

FOM on the Barrett’s data
[Figure: adjusted FOM against number of clusters (0 to 30) for k-means, CAST, random, average-link, single-link, complete-link, and k-means initialized with average-link.]

FOM ≈ Cluster Quality
• On ovary data:
– Lowest FOM clusters in good agreement with the right answer
– Next lowest incorrectly split/merged true classes
• On Barrett’s data, 10 clusters:
– the lowest FOM clusters (CAST & k-means initialized with average-link) correctly grouped the 20 cytokeratins that passed the variation filter
– the next lowest FOM (average-link) did NOT

FOM ≈ Cluster Quality
• To show: lower FOM ⇒ higher agreement with known gene categories.
• Ex: Wen et al. categorized genes in the rat CNS data set into 4 families
[Figure: scatter plot of Hubert score (y-axis, -0.05 to 0.2) against adjusted FOM (x-axis, 0.2 to 0.4) for k-means runs #1 to #5 and random clusterings.]

FOM Summary
• Simple quantitative methodology to compare different clustering algorithms on any data set without using any external knowledge
• Reduced FOM generally signals improved clusters
• Omitting one condition doesn't destroy cluster quality

FOM Summary, cont.
• Not all clustering algorithms are created equal
• Some algorithm comparisons (on this data):
– CAST and k-means produce higher quality clusters than the hierarchical algorithms
– Single-link has the worst performance among the hierarchical algorithms

Acknowledgements
• Ka Yee Yeung
• David Haynor
• Michael Barrett
• Michèl Schummer
More Info
http://www.cs.washington.edu/homes/{kayee,ruzzo}
UW CSE Computational Biology Group

Adjusted Rand Example
Rows are the true classes and columns are the clusters, with sizes in parentheses:

             c#1(4)  c#2(5)  c#3(7)  c#4(4)
class#1(2)     2       0       0       0
class#2(3)     0       0       0       3
class#3(5)     1       4       0       0
class#4(10)    1       1       7       1

\[ a = \binom{2}{2}+\binom{3}{2}+\binom{4}{2}+\binom{7}{2} = 31 \]
\[ b = \binom{4}{2}+\binom{5}{2}+\binom{7}{2}+\binom{4}{2} - a = 43 - 31 = 12 \]
\[ c = \binom{2}{2}+\binom{3}{2}+\binom{5}{2}+\binom{10}{2} - a = 59 - 31 = 28 \]
\[ d = \binom{20}{2} - a - b - c = 119 \]
\[ \text{Rand: } R = \frac{a+d}{a+b+c+d} = 0.789 \]
\[ \text{Adjusted Rand} = \frac{R - E(R)}{1 - E(R)} = 0.469 \]
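
The arithmetic of this example can be checked with a short sketch. It assumes the contingency table is given as a list of rows (classes) of counts per cluster, and it computes the adjusted index in the Hubert and Arabie form, which is algebraically the same as (R - E(R)) / (1 - E(R)). The function and variable names are hypothetical.

```python
from math import comb

def rand_indices(table):
    """table[i][j] = number of objects in class i and cluster j.
    Returns (Rand index, adjusted Rand index)."""
    n = sum(sum(row) for row in table)
    a = sum(comb(x, 2) for row in table for x in row)         # pairs together in both
    same_class = sum(comb(sum(row), 2) for row in table)      # pairs sharing a class
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))  # pairs sharing a cluster
    total = comb(n, 2)
    d = total - same_class - same_cluster + a                  # pairs apart in both
    rand = (a + d) / total
    expected_a = same_class * same_cluster / total             # E[a] under random labels
    adjusted = (a - expected_a) / (0.5 * (same_class + same_cluster) - expected_a)
    return rand, adjusted

table = [
    [2, 0, 0, 0],
    [0, 0, 0, 3],
    [1, 4, 0, 0],
    [1, 1, 7, 1],
]
rand, adjusted = rand_indices(table)
print(round(rand, 3), round(adjusted, 3))  # prints: 0.789 0.469
```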