General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset Moira Regelson, Ph.D. September 15, 2005


Page 1: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Moira Regelson, Ph.D.

September 15, 2005

Page 2: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Motivation

• k-means (or k-medoids) clustering will happily divide any dataset into k clusters, whether or not that is appropriate.

[Figure: "Elongated Clusters" scatter plot of two elongated clusters; both axes span 8 to 22]

Page 3: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Overview

• Review of previous methods

• Re-formulation and extension of Tibshirani’s prediction strength method

• Contrast results for different cluster configurations

• Application to gene co-expression network

Page 4: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Different Methods for Deciding Number of Clusters

• Methods based on internal indices – depend on the between- and within-cluster sums of squared error (BSS and WSS)

• Methods based on external indices – depend on comparisons between different partitionings

• Evaluate the indices for different values of k and decide which is “best”

Page 5: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Internal Index Methods

Page 6: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Internal Indices

• Calinski & Harabasz

• Hartigan

• Krzanowski & Lai

• n=number of samples

• p=dimension of samples

Page 7: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Calinski and Harabasz (1974)

• For each number of clusters k ≥ 2, define the index

• The estimated number of clusters is the k which maximizes the above.

I_k = \frac{\mathrm{trace}(BSS_k)/(k-1)}{\mathrm{trace}(WSS_k)/(n-k)}
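Not part of the original deck: a minimal Python sketch of the Calinski-Harabasz index, computing the BSS and WSS traces directly from data and labels (the function names are my own; the index is defined for k ≥ 2).

```python
def wss_trace(X, labels):
    """trace(WSS): total within-cluster sum of squared distances to cluster means."""
    total = 0.0
    for c in set(labels):
        pts = [x for x, l in zip(X, labels) if l == c]
        mean = [sum(col) / len(pts) for col in zip(*pts)]
        total += sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)) for x in pts)
    return total

def calinski_harabasz(X, labels):
    """I_k = [trace(BSS)/(k-1)] / [trace(WSS)/(n-k)], for k >= 2 clusters."""
    n, k = len(X), len(set(labels))
    grand = [sum(col) / n for col in zip(*X)]
    bss = 0.0
    for c in set(labels):
        pts = [x for x, l in zip(X, labels) if l == c]
        mean = [sum(col) / len(pts) for col in zip(*pts)]
        bss += len(pts) * sum((mi - gi) ** 2 for mi, gi in zip(mean, grand))
    return (bss / (k - 1)) / (wss_trace(X, labels) / (n - k))
```

Evaluating the index over k = 2, ..., kmax and taking the maximizing k gives the estimate.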

Page 8: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Hartigan

• For each number of clusters k ≥ 1, define the index

• The estimated number of clusters is the smallest k ≥ 1 such that Ik ≤ 10.

I_k = \left( \frac{\mathrm{trace}(WSS_k)}{\mathrm{trace}(WSS_{k+1})} - 1 \right) (n - k - 1)
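Not from the deck: a short sketch of Hartigan's stopping rule, assuming the trace(WSS_k) values have already been computed (the function name is hypothetical).

```python
def hartigan_rule(wss_by_k, n, threshold=10.0):
    """Return the smallest k with
    I_k = (trace(WSS_k)/trace(WSS_{k+1}) - 1) * (n - k - 1) <= threshold.
    wss_by_k: dict mapping k -> trace(WSS_k); n: number of samples."""
    for k in sorted(wss_by_k):
        if k + 1 not in wss_by_k:
            break
        ik = (wss_by_k[k] / wss_by_k[k + 1] - 1.0) * (n - k - 1)
        if ik <= threshold:
            return k
    return None  # no k satisfied the rule within the range supplied
```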

Page 9: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Krzanowski and Lai (1985)

• For each number of clusters k ≥ 2, define the indices

• The estimated number of clusters is the k which maximizes Ik.

d_k = (k-1)^{2/p}\,\mathrm{trace}(WSS_{k-1}) - k^{2/p}\,\mathrm{trace}(WSS_k), \quad
I_k = \left| \frac{d_k}{d_{k+1}} \right|

Page 10: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

The silhouette width method (Kaufman and Rousseeuw, 1990)

• Silhouettes use the average dissimilarity between observation i and the other observations in the same cluster.

• a_i = average dissimilarity of observation i to the other observations in its own cluster

• b_i = smallest average dissimilarity of observation i to the observations of any other cluster

• The silhouette width of observation i is

I_{ik} = \frac{b_i - a_i}{\max(a_i, b_i)}

Page 11: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

The silhouette width method (cont.)

• Overall silhouette width is the average over all observations:

• The estimated number of clusters is the k for which Ik is maximized.

I_k = \frac{1}{n} \sum_i I_{ik}
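Not from the deck: a minimal sketch of the overall silhouette width, computed from a precomputed dissimilarity matrix using the a_i and b_i defined on the previous slide (the function name is my own).

```python
def silhouette_width(D, labels):
    """Overall silhouette width I_k = mean over i of (b_i - a_i) / max(a_i, b_i).
    D: symmetric dissimilarity matrix (list of lists, zero diagonal);
    labels: cluster label of each observation."""
    n = len(D)
    clusters = set(labels)
    widths = []
    for i in range(n):
        # a_i: average dissimilarity to the other members of i's own cluster
        own = [D[i][j] for j in range(n) if labels[j] == labels[i] and j != i]
        a_i = sum(own) / len(own)
        # b_i: smallest average dissimilarity to any other cluster
        b_i = min(
            sum(D[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return sum(widths) / n
```

Well-separated clusters give widths near 1; the maximizing k is the estimate.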

Page 12: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Gap (uniform) or Gap(pc) (Tibshirani et al., 2000)

• For each number of clusters k,

• B reference datasets generated under null distribution.

I_k = \frac{1}{B} \sum_b \log(\mathrm{trace}(WSS_k^b)) - \log(\mathrm{trace}(WSS_k))

Page 13: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Gap statistic (cont.)

• The estimated number of clusters is the smallest k ≥ 1 satisfying

gap_k \ge gap_{k+1} - s_{k+1}

• s_k = standard deviation over the reference datasets.

• The uniform gap statistic samples from a uniform distribution.
• The “pc” (principal component) statistic samples from a uniform box aligned with the principal components of the dataset (Sarle, 1983).
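Not from the deck: a sketch of the gap computation and selection rule, assuming the trace(WSS_k) values for the data and for the B reference datasets are supplied (function names are hypothetical).

```python
import math

def gap_value(wss_obs, wss_refs):
    """gap_k = (1/B) * sum_b log(trace(WSS_k^b)) - log(trace(WSS_k))."""
    return sum(math.log(w) for w in wss_refs) / len(wss_refs) - math.log(wss_obs)

def estimate_k_from_gaps(gap, s):
    """Selection rule: smallest k with gap[k] >= gap[k+1] - s[k+1].
    gap, s: dicts mapping k to the gap statistic and the reference-set
    spread s_k.  Falls back to the largest k if the rule never fires."""
    for k in sorted(gap):
        if k + 1 not in gap or gap[k] >= gap[k + 1] - s[k + 1]:
            return k
```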

Page 14: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

External Index Methods

Page 15: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

External Indices/Approaches

• Comparing Partitionings

• Rand Index

• Tibshirani

• Clest

• General Prediction Strength

Page 16: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Comparing Partitionings: The Contingency Table

• Partitionings U = {u1, ..., uR} and V = {v1, ..., vS} of n objects into R and S clusters

U/V   v1    v2    ...   vS
u1    n11   n12   ...   n1S
u2    n21   n22   ...   n2S
...   ...   ...   ...   ...
uR    nR1   nR2   ...   nRS

Page 17: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Comparing Partitionings: The Contingency Table

• nrs = number of objects in both ur and vs.

U/V   v1    v2    ...   vS
u1    n11   n12   ...   n1S
u2    n21   n22   ...   n2S
...   ...   ...   ...   ...
uR    nR1   nR2   ...   nRS

Page 18: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Comparing Partitionings: The Contingency Table

• n_{r.} = \sum_{s=1}^{S} n_{rs} = total points in cluster u_r

• n_{.s} = \sum_{r=1}^{R} n_{rs} = total points in cluster v_s

U/V   v1    v2    ...   vS
u1    n11   n12   ...   n1S
u2    n21   n22   ...   n2S
...   ...   ...   ...   ...
uR    nR1   nR2   ...   nRS

Page 19: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Rand Index (Rand, 1971, Hubert and Arabie, 1985)

• Rand index and adjusted Rand index (m=2)

Rand = \frac{\binom{n}{m} + 2\sum_{r,s}\binom{n_{rs}}{m} - \sum_r \binom{n_{r.}}{m} - \sum_s \binom{n_{.s}}{m}}{\binom{n}{m}}

Adj.\ Rand = \frac{\sum_{r,s}\binom{n_{rs}}{m} - \sum_r \binom{n_{r.}}{m} \sum_s \binom{n_{.s}}{m} \Big/ \binom{n}{m}}
{\frac{1}{2}\left[\sum_r \binom{n_{r.}}{m} + \sum_s \binom{n_{.s}}{m}\right] - \sum_r \binom{n_{r.}}{m} \sum_s \binom{n_{.s}}{m} \Big/ \binom{n}{m}}
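Not from the deck: a Python sketch of both indices from a contingency table, written with binomial coefficients C(·, m) as above; m = 2 gives the classical Rand and Hubert-Arabie adjusted Rand values (the function name is my own).

```python
from math import comb

def rand_indices(table, m=2):
    """Rand and adjusted Rand indices from a contingency table (list of rows).
    With m=2 these are the classical pair-counting indices."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    sum_cells = sum(comb(x, m) for row in table for x in row)
    sum_rows = sum(comb(x, m) for x in row_sums)
    sum_cols = sum(comb(x, m) for x in col_sums)
    total = comb(n, m)
    rand = (total + 2 * sum_cells - sum_rows - sum_cols) / total
    expected = sum_rows * sum_cols / total  # chance agreement
    adj = (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
    return rand, adj
```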

Page 20: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Clustering as a supervised classification problem

• Input data split repeatedly into a training and a test set for a given choice of k (number of clusters)

• Clustering method applied to the two sets to arrive at k “observed” training and test set clusters.

• Use the training data to construct a classifier for predicting the training set cluster labels.

• Apply classifier to test set data -> predicted test set clusters.

• Measure of agreement calculated based on the comparison of predicted to observed test set clusters (external index).

Page 21: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Predicting the number of clusters

• Use cluster reproducibility measures for different k to estimate the true number of clusters in the data set.

• Assumes that choosing the correct number of clusters leads to less random assignment of samples to clusters and hence to greater cluster reproducibility.

Page 22: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani Prediction Strength (Tibshirani et al., 2001)

• Specify kmax and the maximum number of iterations, B.
• For k in {2, ..., kmax}, repeat B times:

1. Split the data set into a training set and a test set.
2. Apply the clustering procedure to partition the training set into k clusters; record the cluster labels as outcomes.
3. Construct a classifier using the training set and its cluster labels.
4. Apply the resulting classifier to the test set -> “predicted” labels.
5. Apply the clustering procedure to the test set to arrive at the “observed” labels.
6. Compute a measure of agreement (external index) ps1(k,b) comparing the sets of labels obtained in steps 4 and 5.

Page 23: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani PS (cont.)

7. Switch the roles of the test and training sets to arrive at another estimate of the index, ps2(k,b).
8. Use ps1(k,b) and ps2(k,b) to compute the mean, ps(k,b) = (ps1(k,b) + ps2(k,b))/2, and standard error, se(k,b) = |ps1(k,b) - ps2(k,b)|/2. These define pse(k,b) = ps(k,b) + se(k,b) = max(ps1(k,b), ps2(k,b)).
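Not from the deck: steps 1-8 sketched as one iteration of a Python loop. `cluster_fn` and `agree_fn` are caller-supplied stand-ins for the clustering procedure (e.g. PAM) and the external index, a nearest-center classifier stands in for the classifier of step 3, and all function names are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def pse_one_split(X, k, cluster_fn, agree_fn, rng):
    """One iteration of steps 1-8: random split, cluster both halves,
    cross-predict test labels from the nearest training center, and compare
    predicted to observed test labels with an external index.
    cluster_fn(X, k) -> (labels, centers); agree_fn(a, b) -> value in [0, 1]."""
    idx = list(range(len(X)))
    rng.shuffle(idx)                                   # step 1: random split
    half = len(X) // 2
    A = [X[i] for i in idx[:half]]
    B = [X[i] for i in idx[half:]]

    def one_direction(train, test):
        _, centers = cluster_fn(train, k)              # steps 2-3
        predicted = [min(range(k), key=lambda r: dist2(centers[r], x))
                     for x in test]                    # step 4
        observed, _ = cluster_fn(test, k)              # step 5
        return agree_fn(predicted, observed)           # step 6

    ps1 = one_direction(A, B)
    ps2 = one_direction(B, A)                          # step 7
    ps = (ps1 + ps2) / 2                               # step 8
    se = abs(ps1 - ps2) / 2
    return ps + se                                     # pse = max(ps1, ps2)
```

Repeating this B times per k and taking the median gives pse(k).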

Page 24: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani PS (cont.)

• pse(k) = median of pse(k,b) over all random splits

• Values of pse(k) used to estimate the number of clusters in the dataset using a threshold rule

[Figure: PSE(k, m=2) plotted against k = 2, ..., 10; vertical axis from 0.0 to 1.0]

Page 25: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Clest (Dudoit and Fridlyand, 2002)

• Step “A” is identical to steps 1-6 of Tibshirani PS. Denote the external indices computed in step A.6 by (sk,1, sk,2, ..., sk,B). Then

B. Let tk = median(sk,1, ..., sk,B) denote the observed similarity statistic for the k-cluster partition of the data.

C. Generate B0 datasets under the null hypothesis of k = 1. Briefly, for each reference dataset, repeat the procedure described in steps A and B above, to obtain B0 similarity statistics tk,1, ..., tk,B0.

• Let tk0 denote the average of the B0 statistics.

• Let dk = tk - tk0 denote the difference between the observed similarity statistic and its estimated expected value under the null hypothesis of k = 1.

Page 26: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

General Prediction Strength

• Re-formulation of Tibshirani PS

• Extension to m-tuplets

Page 27: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani re-formulation

• Originally, the measure of agreement was formulated as

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \ne x_{i'} \in v_s} I\left( D[U(X_{te})]_{ii'} = 1 \right)

U(X_{te})_j = \arg\min_r \mathrm{dist}(\mathrm{medoid}(u_r), x_j \in X_{te})

D[U]_{ii'} = \begin{cases} 1 & x_i, x_{i'} \in u_r \text{ for some cluster } u_r \in U \\ 0 & \text{otherwise} \end{cases}

• Note: partitioning around medoids (PAM) clustering is used.

Page 28: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

General Prediction Strength

• Re-formulation of Tibshirani PS & extension to m-tuplets:

PS(k, m) = \min_{1 \le s \le k} p(s), \quad \text{where } p(s) = \frac{\sum_{r=1}^{k} \binom{n_{rs}}{m}}{\binom{n_{.s}}{m}}

• Add a standard error in cross-validation: PSE(k,m) used with a threshold.

• Intuitive interpretation: the fraction of m-tuplets in a test-set cluster that co-cluster in the training set.
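Not from the deck: a direct Python transcription of PS(k, m), assuming the contingency table is supplied with rows = training-set clusters u_r and columns = test-set clusters v_s, each column containing at least m points (the function name is my own).

```python
from math import comb

def prediction_strength(table, m=2):
    """PS(k, m) = min over test-set clusters v_s (columns) of
    sum_r C(n_rs, m) / C(n_.s, m): the fraction of m-tuplets drawn from
    within v_s whose members all fall in a single training cluster u_r.
    Assumes every column sum is at least m."""
    col_sums = [sum(col) for col in zip(*table)]
    return min(
        sum(comb(row[s], m) for row in table) / comb(col_sums[s], m)
        for s in range(len(table[0]))
    )
```

A perfectly reproduced partition gives PS = 1; splitting every test cluster evenly across training clusters drives PS toward 0.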

Page 29: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Asymmetry in PS

• Difference between Rand index and PSE(k,m):

• Rand = 0.95, adjusted Rand = 0.86.

• PSE(k,m=2) = 0.17, but when role of U and V reversed, PSE(k,m=2) = 0.83.

U/V   v1   v2   v3   v4   v5
u1     5    0    0    0    0
u2     5   50    0    0    0
u3     5    0   50    0    0
u4     5    0    0   50    0
u5     5    0    0    0   50
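Not from the deck: the asymmetry can be checked numerically. Computing the PS-style co-clustering fraction for this table in both directions reproduces the 0.17 and 0.83 quoted above (the function name is my own).

```python
from math import comb

def co_cluster_fraction(table, m=2):
    """min over columns s of sum_r C(n_rs, m) / C(n_.s, m)."""
    cols = [sum(c) for c in zip(*table)]
    return min(sum(row[s] * 0 + comb(row[s], m) for row in table) / comb(cols[s], m)
               for s in range(len(table[0])))

# Contingency table from the slide: rows = clusters of U, columns = clusters of V.
T = [[5, 0, 0, 0, 0],
     [5, 50, 0, 0, 0],
     [5, 0, 50, 0, 0],
     [5, 0, 0, 50, 0],
     [5, 0, 0, 0, 50]]

forward = co_cluster_fraction(T)                                   # ~0.17
reversed_roles = co_cluster_fraction([list(r) for r in zip(*T)])   # ~0.83
```

The driver of the asymmetry is column v1: its 25 points are spread evenly over five U clusters, while every U cluster except u1 is dominated by a single V cluster.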

Page 30: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tests on Simulated Data

Page 31: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Simulations

1. A single cluster containing 200 points uniformly distributed from (-1,1) in 10-d.  

2. Three normally distributed clusters in 2-d with centers at (0,0), (0,5), and (5,-3) and 25, 25, and 50 observations in each respective cluster.

3. Four normally distributed clusters in 3-d with centers randomly chosen from N(0, 5*I) and cluster size randomly chosen from {25, 50}.

4. Four normally distributed clusters in 10-d with centers randomly chosen from N(0, 1.9*I) and cluster size randomly chosen from {25, 50}.

In 3 & 4, simulations in which the minimum distance between clusters was less than one unit were discarded.

Page 32: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Simulations

5. Two elongated clusters in 3-d. Generated by choosing equally spaced points between (-0.5, 0.5) and adding normal noise with sd 0.1 to each feature. Then add 10 to each feature of the points in the second cluster.

(a) 100 points per cluster
(b) 200 points per cluster (to illustrate the effect of an increased number of observations)

[Figure: "Elongated Clusters" scatter plot of the two elongated clusters; both axes span 8 to 22]

Page 33: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

[Table: number of the 50 simulation runs on which each method (Hartigan; Calinski; Krzanowski-Lai; silhouette; Gap (uniform); Gap (pc); Clest; Prediction Strength with cutoff 0.8 and m = 2, 3, 5, 10) predicted each number of clusters (no prediction, 1-10), for sim1 (1 cluster, 10-d), sim2 (3 clusters, 2-d), and sim3 (4 clusters, 3-d). The column alignment was lost in transcription; in sim1, Prediction Strength and both Gap variants predicted a single cluster in all 50 runs.]

Page 34: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

[Table continued, same layout: predicted numbers of clusters over 50 runs for sim4 (4 clusters, 10-d), sim5a (2 elongated clusters, 3-d, 100 points/cluster), and sim5b (2 elongated clusters, 3-d, 200 points/cluster). The column alignment was lost in transcription.]

Page 35: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Results of Simulations

• PS performs consistently well
• Not all values of m perform equally well in all the simulations (m=3 and m=5 do best overall)
• Performance is especially noticeable on the elongated-cluster simulation
• Clest performs comparably to PS
• Of the internal index methods, Hartigan seems least robust
• The Calinski and Krzanowski-Lai indices and the silhouette width method cannot predict a single cluster

Page 36: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Application to Gene Co-Expression Networks

Page 37: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

DNA Microarrays

• Expression level of thousands of genes at once

• Lots of processing and normalization

Page 38: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Use of Microarrays

• Within an experiment, different cell types are used, e.g. “normal” and “diseased”.

• Arrays are generally examined for differences in expression levels between cell types.

• Look for genes that characteristically vary with disease.

Page 39: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Gene co-expression network

• Use DNA microarray data to annotate genes by clustering them on the basis of their expression profiles across several microarrays.

• Studying co-expression patterns can provide insight into the underlying cellular processes (Eisen et al., 1998, Tamayo et al., 1999).

Page 40: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Building a network

• Use Pearson correlation coefficient as a co-expression measure.

• Threshold correlation coefficient to arrive at gene co-expression network.

Page 41: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Building a network (cont.)

• Node corresponds to expression profile of a given gene.

• Nodes connected (aij=1) if they have significant pairwise expression profile association across perturbations (cell or tissue samples).

Page 42: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Topological Overlap Matrix

• TOM is given by (Ravasz et al., 2002)

\omega_{ij} = \frac{\sum_u a_{iu} a_{uj} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}

• k_i is the connectivity of node i:

k_i = \sum_{j=1}^{n} a_{ij}
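Not from the deck: a direct Python sketch of the topological overlap matrix for an unweighted adjacency matrix, following the formula above (the function name is my own).

```python
def topological_overlap(adj):
    """TOM: omega_ij = (sum_u a_iu*a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)
    for a symmetric 0/1 adjacency matrix with zero diagonal."""
    n = len(adj)
    k = [sum(row) for row in adj]          # connectivity of each node
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0            # a node fully overlaps itself
                continue
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom
```

The associated dissimilarity for clustering is d_ij = 1 - omega_ij.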

Page 43: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

TOM (cont.)

• \omega_{ij} = 1 if the node with fewer links satisfies two conditions:
– all of its neighbors are also neighbors of the other node, and
– it is linked to the other node.

• In contrast, \omega_{ij} = 0 if nodes i and j are unlinked and the two nodes have no common neighbors.

• \omega is a similarity measure; the associated dissimilarity measure is d_{ij} = 1 - \omega_{ij}.

Page 44: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Gene Co-Expression Network for Brain Tumor Data

• Brain tumor (glioblastoma) microarray data (previously described in Freije et al (2004), Mischel et al (2005), and Zhang and Horvath (2005)).

• Network constructed using correlation threshold of 0.7 and the 1800 most highly connected genes.

Page 45: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Gene Co-Expression Network for Brain Tumor Data (cont.)

• Used PS method (with PAM) on TOM with m=(2,5,10)

• m=2, m=5 -> 5 clusters
– Same as Mischel et al.

• m=10 -> 4 clusters
– Reasonable interpretation?

[Figure: PSE(k, m) plotted against k = 1, ..., 8 for m = 2, 5, 10; vertical axis from 0.0 to 1.0]

Page 46: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Classical Multi-Dimensional Scaling

• Used to visualize the abstract TOM dissimilarity
• “Principal component analysis”

[Figure: two 3-d classical MDS plots of the TOM dissimilarity, coordinates cmd1, cmd2, cmd3]

Page 47: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Inspection of Heatmap

• Red for highly expressed genes

• Green for low expression

• Consistent expression across genes (rows) in clusters

=> Either 4 or 5 clusters justified

[Heatmap: gene expression with genes as rows, red = high expression, green = low; gene identifiers and cluster bars garbled in transcription]

Page 48: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Conclusion

• There are several indices for evaluating clusterings
– External indices compare different partitionings; internal indices do not
• Indices can be used to predict the number of clusters
• The Prediction Strength index method works across different cluster configurations
• Fairly simple and intuitive
• Effective on elongated clusters
• Results of varying m reflect hierarchical structure in the data

Page 49: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Acknowledgements

• Steve Horvath
• Meghna Kamath
• Fred Fox and the Tumor Cell Biology Training Grant (USHHS Institutional National Research Service Award #T32 CA09056)

• Stan Nelson and the UCLA Microarray Core Facility

• NIH Program Project grant #1U19AI063603-01.

Page 50: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

References

• http://www.genetics.ucla.edu/labs/horvath/GeneralPredictionStrength
• CALINSKI, R. & HARABASZ, J. (1974). A dendrite method for cluster analysis. Communications in Statistics 3, 1-27.
• DUDOIT, S. & FRIDLYAND, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3, RESEARCH0036.
• EISEN, M. B., SPELLMAN, P. T., BROWN, P. O. & BOTSTEIN, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863-8.
• FREIJE, W. A., CASTRO-VARGAS, F. E., FANG, Z., HORVATH, S. et al. (2004). Gene expression profiling of gliomas strongly predicts survival. Cancer Res 64, 6503-10.
• HARTIGAN, J. A. (1985). Statistical theory in clustering. Journal of Classification 2, 63-76.
• HUBERT, L. & ARABIE, P. (1985). Comparing partitions. Journal of Classification 2, 193-218.
• KAUFMAN, L. & ROUSSEEUW, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
• KRZANOWSKI, W. & LAI, Y. (1985). A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics, 23-34.
• MISCHEL, P., ZHANG, B., CARLSON, M., FANG, Z., FREIJE, W., CASTRO, E., SCHECK, A., LIAU, L., KORNBLUM, H., GESCHWIND, D., CLOUGHESY, T., HORVATH, S. & NELSON, S. (2005). Hub genes predict survival for brain cancer patients.
• RAND, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846-850.
• RAVASZ, E., SOMERA, A. L., MONGRU, D. A., OLTVAI, Z. N. & BARABASI, A. L. (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551-5.
• SARLE, W. (1983). Cubic Clustering Criterion. SAS Institute, Inc.
• TAMAYO, P., SLONIM, D., MESIROV, J., ZHU, Q., KITAREEWAN, S., DMITROVSKY, E., LANDER, E. S. & GOLUB, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907-12.
• TIBSHIRANI, R., WALTHER, G., BOTSTEIN, D. & BROWN, P. (2001). Cluster validation by prediction strength. Technical report, Stanford University.
• TIBSHIRANI, R., WALTHER, G. & HASTIE, T. (2000). Estimating the number of clusters in a dataset via the gap statistic. Technical report, Department of Biostatistics, Stanford University.
• YEUNG, K. Y., HAYNOR, D. R. & RUZZO, W. L. (2001). Validating clustering for gene expression data. Bioinformatics 17, 309-18.
• ZHANG, B. & HORVATH, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4, Article 17.

Page 51: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Extra Slides

Page 52: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

WSS and BSS

BSS = \sum_g n_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})^T

WSS = \sum_g \sum_{i \in g} (x_i - \bar{x}^{(g)})(x_i - \bar{x}^{(g)})^T

TSS = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T = WSS + BSS
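Not from the deck: a short Python check of the traces of these matrices and of the decomposition trace(TSS) = trace(WSS) + trace(BSS) (the function name is my own).

```python
def scatter_traces(X, labels):
    """Return (trace(WSS), trace(BSS), trace(TSS)) for data X with cluster
    labels; by the decomposition above, TSS = WSS + BSS."""
    n = len(X)
    grand = [sum(c) / n for c in zip(*X)]
    tss = sum(sum((xi - gi) ** 2 for xi, gi in zip(x, grand)) for x in X)
    wss = bss = 0.0
    for g in set(labels):
        pts = [x for x, l in zip(X, labels) if l == g]
        mean = [sum(c) / len(pts) for c in zip(*pts)]
        wss += sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mean)) for x in pts)
        bss += len(pts) * sum((mi - gi) ** 2 for mi, gi in zip(mean, grand))
    return wss, bss, tss
```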

Page 53: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani re-formulation

• Originally, the measure of agreement was formulated as

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \ne x_{i'} \in v_s} I\left( D[U(X_{te})]_{ii'} = 1 \right)

U(X_{te})_j = \arg\min_r \mathrm{dist}(\mathrm{medoid}(u_r), x_j \in X_{te})

D[U]_{ii'} = \begin{cases} 1 & x_i, x_{i'} \in u_r \text{ for some cluster } u_r \in U \\ 0 & \text{otherwise} \end{cases}

• Note: partitioning around medoids (PAM) clustering is used.

Page 54: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani re-formulation detail

• Since the matrix D[U] is already an indicator matrix, this can be re-written as:

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{x_i \ne x_{i'} \in v_s} D[U(X_{te})]_{ii'}

• Now we can partition the observations x_{i'} in cluster v_s by the cluster u_r to which they belong:

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} \sum_{\substack{x_{i'} \in v_s \\ x_{i'} \in u_r}} \sum_{\substack{x_i \in v_s \\ x_i \ne x_{i'}}} D[U(X_{te})]_{ii'}

• We can also divide the observations x_i in cluster v_s by whether or not they belong to cluster u_r:

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} \sum_{\substack{x_{i'} \in v_s \\ x_{i'} \in u_r}} \left[ \sum_{\substack{x_i \in v_s,\ x_i \in u_r \\ x_i \ne x_{i'}}} D[U(X_{te})]_{ii'} + \sum_{\substack{x_i \in v_s \\ x_i \notin u_r}} D[U(X_{te})]_{ii'} \right]

Page 55: General Prediction Strength Methods for Estimating the Number of Clusters in a Dataset

Tibshirani re-formulation detail (cont.)

• By the definition of D, if x_i \in u_r and x_{i'} \in u_r, then D[U(X_{te})]_{ii'} = 1, and if x_i \notin u_r and x_{i'} \in u_r, then D[U(X_{te})]_{ii'} = 0. Therefore,

ps(k) = \min_{1 \le s \le k} \frac{1}{n_{.s}(n_{.s}-1)} \sum_{r=1}^{k} n_{rs}(n_{rs}-1) = \min_{1 \le s \le k} \frac{\sum_{r=1}^{k} \binom{n_{rs}}{2}}{\binom{n_{.s}}{2}}

• Thus, ps(k) corresponds to PS(k, m=2):

PS(k, m) = \min_{1 \le s \le k} p(s), \quad \text{where } p(s) = \frac{\sum_{r=1}^{k} \binom{n_{rs}}{m}}{\binom{n_{.s}}{m}}
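Not from the deck: the identity ps(k) = PS(k, m=2) can be checked numerically on a small contingency table by computing the pair-counting form and the binomial-coefficient form side by side (the function name and the example table are hypothetical).

```python
from math import comb

def ps_pairs_check(table):
    """Compute min over columns s in two equivalent ways:
    (i) directly, sum_r n_rs(n_rs - 1) / [n_.s(n_.s - 1)], counting ordered
        within-column pairs that co-cluster across rows, and
    (ii) via binomial coefficients, sum_r C(n_rs, 2) / C(n_.s, 2)."""
    cols = [sum(c) for c in zip(*table)]
    S = range(len(table[0]))
    direct = min(sum(row[s] * (row[s] - 1) for row in table)
                 / (cols[s] * (cols[s] - 1)) for s in S)
    binom = min(sum(comb(row[s], 2) for row in table) / comb(cols[s], 2)
                for s in S)
    return direct, binom
```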