Clustering (Part II)

10/07/09


Outline

• Affinity propagation

• Quality evaluation


Affinity propagation: main idea

• Data points can be exemplar (cluster center) or non-exemplar (other data points).

• Messages are passed between candidate exemplars and non-exemplar data points.

• The total number of clusters will be automatically found by the algorithm.


Responsibility r(j,k)

• Data point j tells each candidate exemplar k how well suited k is to serve as the exemplar for j.

[Diagram labels: candidate exemplar k, data point j]


Availability a(j,k)

• Candidate exemplar k tells each data point j how appropriate it would be for j to choose k as its exemplar.

[Diagram labels: candidate exemplar k, data point j]


Self-availability a(k,k)

• A candidate exemplar k evaluates whether it is itself a good exemplar.

[Diagram labels: candidate exemplar k, data point j]


An iterative procedure

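• Update r(j, k)

[Diagram labels: candidate exemplar k, data point j, r(j,k), a(j,k’)]

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}$$

where i is the data point (labeled j in the diagram) and s(i,k) is the similarity between i and k:

$$s(i,k) = -\lVert x_i - x_k \rVert^2$$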
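
The responsibility update can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the three toy points are hypothetical, and the diagonal s(k,k) is simply left at the value the distance formula gives (0), whereas Frey and Dueck set it to a separate "preference" value.

```python
def similarity(points):
    """s(i,k) = -||x_i - x_k||^2 for every pair of points."""
    return [[-sum((a - b) ** 2 for a, b in zip(xi, xk)) for xk in points]
            for xi in points]


def update_responsibility(s, a):
    """r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]."""
    n = len(s)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Strongest competing candidate exemplar for point i.
            best_rival = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
            r[i][k] = s[i][k] - best_rival
    return r


# Toy data (hypothetical): three points on a line.
points = [(0.0,), (1.0,), (5.0,)]
s = similarity(points)
a = [[0.0] * 3 for _ in range(3)]  # availabilities start at zero
r = update_responsibility(s, a)
```

With all availabilities at zero, the first pass reduces to comparing similarities: r(0,1) is the similarity of point 0 to point 1 minus the best similarity point 0 has to any other candidate.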


An iterative procedure

• Update a(j, k)

[Diagram labels: candidate exemplar k, data point j, r(j’,k), a(j,k)]

$$a(i,k) \leftarrow \min\left\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \right\}$$


An iterative procedure

• Update a(k, k)

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0,\, r(i',k)\}$$
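
The two availability updates, a(i,k) = min{0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k))} and a(k,k) = sum over i' != k of max(0, r(i',k)), can be sketched together. This is an illustrative sketch only: the function name and the responsibility values in the example are made up, and damping of the messages (commonly used in practice) is omitted.

```python
def update_availability(r):
    """Compute a(i,k) and a(k,k) from a responsibility matrix r."""
    n = len(r)
    a = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Positive responsibilities sent to k by points other than k itself.
        pos = [max(0.0, r[ip][k]) if ip != k else 0.0 for ip in range(n)]
        total = sum(pos)
        for i in range(n):
            if i == k:
                # Self-availability: sum over i' != k.
                a[k][k] = total
            else:
                # Sum over i' not in {i, k}: remove point i's own contribution.
                a[i][k] = min(0.0, r[k][k] + total - pos[i])
    return a


# Hypothetical responsibility values, chosen only to exercise the formulas.
r = [[-2.0, 1.0, -5.0],
     [3.0, 1.5, -4.0],
     [-1.0, 2.0, 0.5]]
a = update_availability(r)
```

Only positive incoming responsibilities count as support for a candidate, which is why each term is clipped at zero before summing.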


Step-by-step affinity propagation


Applications

Multi-exon gene detection in mouse.

Expression levels at different exons within a gene are co-regulated across tissue types.

37 mouse tissues involved. 12 tiling arrays.

(Frey et al. 2005)


“Algorithms for unsupervised classification or cluster analysis abound. Unfortunately however, algorithm development seems to be a preferred activity to algorithm evaluation among methodologists.

……

No consensus or clear guidelines exist to guide these decisions. Cluster analysis always produces clustering, but whether a pattern observed in the sample data characterizes a pattern present in the population remains an open question. Resampling-based methods can address this last point, but results indicate that most clusterings in microarray data sets are unlikely to reflect reproducible patterns or patterns in the overall population.”

-Allison et al. (2006)


Stability of a cluster

Motivation: Real clusters should be reproducible under perturbation: adding noise, omission of data, etc.

Procedure:

• Perturb observed data by adding noise.

• Apply clustering procedure to cluster the perturbed data.

• Repeat the above procedures to generate a sample of clusters.

• Global test

• Cluster-specific tests: R-index, D-index.

(McShane et al. 2002)
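
The perturb-and-recluster loop can be sketched as below. Everything named here is an assumption made for illustration: the toy clustering function (cut one-dimensional data at its widest gap), the Gaussian noise model, and the parameters `sigma`, `n_rep`, and `seed`. Any real clustering routine could be passed in place of the toy one.

```python
import random


def split_at_largest_gap(values):
    """Toy 1-D clustering: cut the sorted values at the widest gap (2 clusters)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gaps = [values[order[j + 1]] - values[order[j]] for j in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = 0 if rank <= cut else 1
    return labels


def perturbed_clusterings(values, cluster_fn, sigma, n_rep, seed=0):
    """Add Gaussian noise n_rep times and re-cluster each perturbed copy."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_rep):
        noisy = [v + rng.gauss(0.0, sigma) for v in values]
        results.append(cluster_fn(noisy))
    return results


data = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0]
original = split_at_largest_gap(data)
sample = perturbed_clusterings(data, split_at_largest_gap, sigma=0.05, n_rep=20)
```

The resulting `sample` of label vectors is the raw material for cluster-specific measures such as the R-index and D-index.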


Where is the “truth”?

“In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inference drawn from the output of most unsupervised learning algorithms. One must often resort to heuristic arguments not only for motivating the algorithm, but also for judgments as to the quality of results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly.”

Hastie et al. 2001; ESL


Global test

• Null hypothesis: Data come from a multivariate Gaussian distribution.

Procedure:

• Consider the subspace spanned by the top principal components.

• Estimate the distribution of “nearest neighbor” distances.

• Compare the distribution for the observed data with that for data simulated under the null hypothesis.


R-index

• If cluster i contains n_i objects, then it contains m_i = n_i(n_i – 1)/2 pairs of objects.

• Let c_i be the number of those pairs that still fall in the same cluster when the perturbed data are re-clustered.

• r_i = c_i/m_i measures the robustness of cluster i.

• R-index = Σ_i c_i / Σ_i m_i measures overall stability of a clustering algorithm.
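
A short sketch of the pair-counting behind r_i and the R-index, for a single perturbed re-clustering. The function name and the example label vectors are hypothetical; in practice the values would be averaged over many perturbed data sets.

```python
from itertools import combinations


def r_index(original, perturbed):
    """Return (per-cluster r_i, overall R-index) for one perturbed re-clustering."""
    clusters = {}
    for obj, lab in enumerate(original):
        clusters.setdefault(lab, []).append(obj)

    r, c_total, m_total = {}, 0, 0
    for lab, members in clusters.items():
        pairs = list(combinations(members, 2))  # m_i = n_i (n_i - 1) / 2 pairs
        kept = sum(perturbed[x] == perturbed[y] for x, y in pairs)  # c_i
        r[lab] = kept / len(pairs) if pairs else float("nan")
        c_total += kept
        m_total += len(pairs)
    return r, c_total / m_total


# Hypothetical labels: object 2 and object 6 switch clusters after perturbation.
original = [0, 0, 0, 1, 1, 1, 1]
perturbed = [0, 0, 1, 1, 1, 1, 0]
per_cluster, overall = r_index(original, perturbed)
```

Here cluster 0 keeps 1 of its 3 pairs and cluster 1 keeps 3 of its 6, so the overall index is 4/9. Cluster labels in the perturbed result do not need to match the original labels, because only co-membership of pairs is compared.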


D-index

• For each original cluster, determine the closest cluster in the clustering of the perturbed data.

• Calculate the average discrepancy between the clusters for the original and perturbed data, counting both omissions and additions.

• D-index is the sum of all cluster-specific discrepancies.
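
The discrepancy count can be sketched for one perturbed re-clustering as follows. Two assumptions are made here that the slide does not spell out: "closest" is taken to mean the perturbed cluster sharing the most objects with the original cluster, and the averaging over repeated perturbations is left to the caller. The function name and labels are hypothetical.

```python
def d_index(original, perturbed):
    """Return (per-cluster discrepancy, total) for one perturbed re-clustering."""
    def groups(labels):
        out = {}
        for obj, lab in enumerate(labels):
            out.setdefault(lab, set()).add(obj)
        return out

    orig, pert = groups(original), groups(perturbed)
    per_cluster = {}
    for lab, members in orig.items():
        # Assumption: closest = perturbed cluster with the largest overlap.
        closest = max(pert.values(), key=lambda g: len(g & members))
        omissions = len(members - closest)  # objects lost from the original cluster
        additions = len(closest - members)  # objects gained from other clusters
        per_cluster[lab] = omissions + additions
    return per_cluster, sum(per_cluster.values())


original = [0, 0, 0, 1, 1, 1, 1]
perturbed = [0, 0, 1, 1, 1, 1, 0]
per_cluster, total = d_index(original, perturbed)
```

In this example each original cluster loses one object and gains one, giving a discrepancy of 2 per cluster and 4 in total.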


Applications

• 16 prostate cancer samples; 9 benign tumor samples

• 6500 genes

• Use hierarchical clustering to obtain 2, 3, and 4 clusters.

• Question: are these clusters reliable?


Issues with calculating R and D indices

• How large should the perturbation be?

• How should the significance level be quantified?

• What about nested consistency?


Biclustering


Motivation

[Figure: gene expression matrix, axes labeled “genes” and “conditions”]

1D-approach:

To identify condition clusters, all genes are used.

But probably only a few genes are differentially expressed.


Motivation

[Figure: gene expression matrix, axes labeled “genes” and “conditions”]

1D-approach:

To identify gene clusters, all conditions are used.

But a set of genes may only be expressed under a few conditions.


Motivation

[Figure: gene expression matrix, axes labeled “genes” and “conditions”]

Bi-clustering

Objective: To isolate genes that are co-expressed under a specific set of conditions.


Coupled Two-Way Clustering

• An iterative procedure involving the following two steps.

– Within a cluster of conditions, search for gene clusters.

– Using features from a cluster of genes, search for condition clusters.

(Getz et al. 2001)


SAMBA – A bipartite graph model

V = Genes, U = Conditions

E = “respond” = differential expression

Cluster = subgraph (U’, V’, E’) = subset of co-regulated genes V’ in conditions U’

Tanay et al. 2002


SAMBA -- algorithm

Goal: Find the “heaviest” subgraphs.

H = (U’, V’, E’)

Tanay et al. 2002


SAMBA -- algorithm

Goal: Find the “heavy” subgraphs.

missing edge

H = (U’, V’, E’)

Tanay et al. 2002


SAMBA -- algorithm

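p_{u,v} -- probability of edge (u,v) expected at random

p_c -- probability of an edge within a cluster

Compute a weight score for H = (U’, V’, E’):

$$\log L(H) = \sum_{(u,v) \in E'} \log \frac{p_c}{p_{u,v}} + \sum_{(u,v) \in \bar{E}'} \log \frac{1 - p_c}{1 - p_{u,v}}$$

where \bar{E}' is the set of missing edges: pairs (u,v) with u in U’ and v in V’ that are not in E’.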

Tanay et al. 2002
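
The weight score is a log-likelihood ratio: each edge present in the subgraph contributes log(p_c / p_{u,v}) and each missing edge contributes log((1 - p_c) / (1 - p_{u,v})). A minimal sketch follows; the function name, the tiny graph, and the probability values are hypothetical and serve only to show the arithmetic.

```python
import math


def log_likelihood_ratio(U_sub, V_sub, edges, p_uv, p_c):
    """log L(H) for the subgraph induced by conditions U_sub and genes V_sub."""
    total = 0.0
    for u in U_sub:
        for v in V_sub:
            if (u, v) in edges:
                # Edge present: reward, more so when the edge is unlikely at random.
                total += math.log(p_c / p_uv[(u, v)])
            else:
                # Missing edge: penalty.
                total += math.log((1 - p_c) / (1 - p_uv[(u, v)]))
    return total


# Hypothetical example: 2 conditions x 2 genes, one missing edge.
U_sub = ["c1", "c2"]
V_sub = ["g1", "g2"]
edges = {("c1", "g1"), ("c1", "g2"), ("c2", "g1")}
p_uv = {(u, v): 0.2 for u in U_sub for v in V_sub}
score = log_likelihood_ratio(U_sub, V_sub, edges, p_uv, p_c=0.9)
```

With these numbers the three present edges each add log(4.5) and the single missing edge adds log(0.125), so a dense subgraph scores well above zero while each missing edge pulls the score down.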


SAMBA -- algorithm

Finding the heaviest subgraph is an NP-hard problem.

Use a polynomial-time heuristic to search for heavy subgraphs efficiently.

H = (U’, V’, E’)

Tanay et al. 2002


Significance of weight

• Let H = (U’, V’, E’) be a subgraph.

• Fix U’ and randomly select a new V” with the same size as V’. The weights of the new subgraphs (U’, V”, E”) over repeated draws give a background distribution.

• Estimate the p-value by comparing log L(H) with the background distribution.
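
The resampling scheme can be sketched as an empirical p-value. This is an illustration under stated assumptions: the function and parameter names are hypothetical, the stand-in weight function simply counts edges rather than computing log L(H), and the +1 correction in the estimate is a common convention that the slide does not specify.

```python
import random


def empirical_p_value(observed, weight_fn, U_sub, V_all, size, n_rep=1000, seed=0):
    """Share of random gene sets V'' (same size as V') scoring at least `observed`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        V_rand = rng.sample(V_all, size)  # random V'' with |V''| = |V'|
        if weight_fn(U_sub, V_rand) >= observed:
            hits += 1
    return (hits + 1) / (n_rep + 1)  # assumed +1 correction avoids p = 0


# Hypothetical data: 20 genes, only g0..g3 respond under conditions c0 and c1.
V_all = ["g%d" % i for i in range(20)]
U_sub = ["c0", "c1"]
edges = {(u, v) for u in U_sub for v in V_all[:4]}


def edge_count(U, V):
    """Stand-in weight: number of edges inside the subgraph."""
    return sum((u, v) in edges for u in U for v in V)


observed = edge_count(U_sub, V_all[:4])
p = empirical_p_value(observed, edge_count, U_sub, V_all, size=4)
```

Because the conditions U’ are held fixed, the background distribution reflects how unusual the chosen genes are for those particular conditions.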


Model evaluation

• The p-value distribution for the top candidate clusters.

• If biological classification data are available, evaluate the purity of class membership within each bicluster.
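
One simple way to score purity, assumed here since the slide does not define it, is the fraction of a bicluster's genes that belong to its most common class. The function name and the class labels below are hypothetical.

```python
from collections import Counter


def bicluster_purity(class_labels):
    """Fraction of members that belong to the most common class."""
    counts = Counter(class_labels)
    return counts.most_common(1)[0][1] / len(class_labels)


# Hypothetical functional annotations for the genes in two biclusters.
biclusters = {
    "B1": ["kinase", "kinase", "kinase", "transporter"],
    "B2": ["ribosomal", "ribosomal"],
}
purities = {name: bicluster_purity(labels) for name, labels in biclusters.items()}
```

A purity near 1 means the bicluster is dominated by a single known class; values near 1 divided by the number of classes suggest no enrichment.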


Reading List

• Frey and Dueck 2007 – Affinity propagation

• McShane et al. 2002 – Clustering model evaluation

• Tanay et al. 2002 – SAMBA for biclustering