Cluster Analysis I (9/28/2012)


Page 1: Cluster Analysis I

Cluster Analysis I
9/28/2012

Page 2: Cluster Analysis I

Outline
Introduction
Distance and similarity measures for individual data points
A few widely used methods: hierarchical clustering, K-means, model-based clustering

Page 3: Cluster Analysis I

Introduction
To group or segment a collection of objects into subsets or “clusters”, such that those within each cluster are more closely related to one another than objects assigned to different clusters.

Sometimes, the goal is to arrange the clusters into a natural hierarchy.

Cluster genes: similar expression pattern implies co-regulation.

Cluster samples: identify potential sub-classes of disease.

Page 4: Cluster Analysis I

Introduction
Assigning subjects into groups.
Estimating the number of clusters.
Assessing the strength/confidence of cluster assignments for individual objects.

Page 5: Cluster Analysis I

Proximity Matrix
An N×N matrix D (N = number of objects); each element records the proximity (distance) between objects i and i′.

Most often, we have measurements on p attributes for each object. Then we can define

$$D(x_i, x_{i'}) = \sum_{j=1}^{p} d_j(x_{ij}, x_{i'j}),$$

where $x_{ij}$ is the measurement of attribute $j$ on subject $i$.
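As a hedged illustration (not part of the slides), a proximity matrix of this kind can be built with SciPy; the small array X of N = 4 objects on p = 3 attributes is made up for the example.

```python
# Minimal sketch: an N x N Euclidean proximity matrix; toy data only.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0, 0.5],
              [0.9, 2.1, 0.4],
              [3.0, 0.5, 2.2],
              [2.8, 0.7, 2.0]])

# pdist returns the condensed upper triangle; squareform expands it to N x N.
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))
```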

Page 6: Cluster Analysis I

Dissimilarity Measures
• Two main classes of distance for continuous variables:
◦ Distance metric (scale-dependent)
◦ 1 − correlation coefficient (scale-invariant)

Page 7: Cluster Analysis I

Minkowski distance
For vectors $x_{gi}$ and $x_{gj}$ of length S, the Minkowski family of distance measures is defined as

$$d(x_{gi}, x_{gj}) = \left( \sum_{s=1}^{S} \lvert x_{gis} - x_{gjs} \rvert^{k} \right)^{1/k}.$$

Page 8: Cluster Analysis I

Two commonly used special cases
Manhattan distance (a.k.a. city-block distance, k = 1):

$$d(x_{gi}, x_{gj}) = \sum_{s=1}^{S} \lvert x_{gis} - x_{gjs} \rvert$$

Euclidean distance (k = 2):

$$d(x_{gi}, x_{gj}) = \left( \sum_{s=1}^{S} (x_{gis} - x_{gjs})^{2} \right)^{1/2}$$
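A brief sketch of these special cases, assuming SciPy's distance helpers; the two vectors are made-up examples.

```python
# Manhattan (k=1), Euclidean (k=2) and a general Minkowski distance.
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, minkowski

x_gi = np.array([0.2, 1.5, -0.7, 2.0])
x_gj = np.array([0.0, 1.1, -0.2, 2.4])

print(cityblock(x_gi, x_gj))        # sum of absolute differences (k = 1)
print(euclidean(x_gi, x_gj))        # root of sum of squared differences (k = 2)
print(minkowski(x_gi, x_gj, p=3))   # general Minkowski member with k = 3
```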

Page 9: Cluster Analysis I
Page 10: Cluster Analysis I

Mahalanobis distance
Takes the correlation structure into account.

When an identity covariance matrix is assumed, it is the same as the Euclidean distance.

$$d(x_{gi}, x_{gj}) = \sqrt{(x_{gi} - x_{gj})^{\top} \Sigma^{-1} (x_{gi} - x_{gj})}$$
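A hedged sketch with SciPy: the covariance matrix is estimated from a made-up sample and inverted, since scipy expects the inverse covariance (VI).

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 3))                  # illustrative data for Sigma
VI = np.linalg.inv(np.cov(sample, rowvar=False))    # inverse covariance

x_gi = np.array([0.5, -0.2, 1.0])
x_gj = np.array([0.1, 0.3, 0.7])
print(mahalanobis(x_gi, x_gj, VI))

# With VI = identity, the value reduces to the ordinary Euclidean distance.
print(mahalanobis(x_gi, x_gj, np.eye(3)))
```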

Page 11: Cluster Analysis I

Pearson correlation and inner product
Pearson correlation:

$$r(x_{gi}, x_{gj}) = \frac{\operatorname{Cov}(x_{gi}, x_{gj})}{\sqrt{\operatorname{Var}(x_{gi})\,\operatorname{Var}(x_{gj})}} = \frac{\sum_{s=1}^{S}(x_{gis} - \bar{x}_{gi})(x_{gjs} - \bar{x}_{gj})}{\sqrt{\sum_{s=1}^{S}(x_{gis} - \bar{x}_{gi})^{2}\sum_{s=1}^{S}(x_{gjs} - \bar{x}_{gj})^{2}}}$$

After standardization it reduces to the inner product:

$$r(x_{gi}, x_{gj}) = \frac{1}{S}\sum_{s=1}^{S} x_{gis}\, x_{gjs}$$

Sensitive to outliers.
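A small sketch of the 1 − Pearson-correlation dissimilarity for two toy gene profiles (values are illustrative only).

```python
import numpy as np

x_gi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_gj = np.array([2.1, 3.9, 6.2, 8.0, 9.9])   # roughly 2 * x_gi, so r is near 1

r = np.corrcoef(x_gi, x_gj)[0, 1]   # Pearson correlation
d = 1.0 - r                          # scale-invariant dissimilarity
print(r, d)
```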

Page 12: Cluster Analysis I

Spearman correlation
Calculate using the ranks of the two vectors (note: the sum of the ranks is n(n+1)/2).

$$r_s(x_{gi}, x_{gj}) = \frac{\operatorname{Cov}\!\left(\operatorname{rank}(x_{gi}),\, \operatorname{rank}(x_{gj})\right)}{\sqrt{\operatorname{Var}\!\left(\operatorname{rank}(x_{gi})\right)\operatorname{Var}\!\left(\operatorname{rank}(x_{gj})\right)}}$$

i.e., the Pearson correlation applied to the ranked data.

Page 13: Cluster Analysis I

Spearman correlation
When there are no tied observations,

$$r_s(x_{gi}, x_{gj}) = 1 - \frac{6\sum_{s=1}^{n} d_s^{2}}{n(n^{2}-1)}, \quad \text{where } d_s = \operatorname{rank}(x_{gis}) - \operatorname{rank}(x_{gjs}).$$

Robust to outliers since it is based on the ranks of the data.
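A hedged illustration of the robustness point: on toy data with a single outlier, the rank-based Spearman correlation is barely affected while Pearson is pulled down.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x_gi = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # last value is an outlier
x_gj = np.array([1.2, 1.9, 3.1, 4.2, 5.0])

print(pearsonr(x_gi, x_gj)[0])    # pulled well below 1 by the outlier
print(spearmanr(x_gi, x_gj)[0])   # rank-based; equals 1.0 here
```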

Page 14: Cluster Analysis I
Page 15: Cluster Analysis I

Standardization of the data
Standardize gene rows to mean 0 and standard deviation 1.
Advantage: makes Euclidean distance and correlation equivalent. Many useful methods require the data to be in Euclidean space.
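A numerical check of this equivalence (my own sketch, not from the slides): after standardizing each vector to mean 0 and population standard deviation 1, the squared Euclidean distance satisfies ||x − y||² = 2S(1 − r), so it is a monotone function of the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.5, size=50)
S = x.size

def standardize(v):
    return (v - v.mean()) / v.std()          # ddof = 0, i.e. divide by S

xs, ys = standardize(x), standardize(y)
r = np.corrcoef(x, y)[0, 1]
print(np.sum((xs - ys) ** 2))                # squared Euclidean distance
print(2 * S * (1 - r))                       # matches up to rounding error
```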

Page 16: Cluster Analysis I

Clustering methods
Clustering algorithms come in two flavors: hierarchical methods and partitioning methods.

Page 17: Cluster Analysis I

Hierarchical clustering
Produce a tree or dendrogram.
They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level.

The tree can be built in two distinct ways:
◦ Bottom-up: agglomerative clustering (most used).
◦ Top-down: divisive clustering.

Page 18: Cluster Analysis I

Agglomerative Methods
The most popular hierarchical clustering method.
Start with n clusters.
At each step, merge the two closest clusters using a measure of between-cluster dissimilarity.
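A minimal agglomerative-clustering sketch with SciPy; the random two-group data and the choice of average linkage are illustrative, not part of the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])

Z = linkage(X, method="average", metric="euclidean")   # merge history (n - 1 steps)
labels = fcluster(Z, t=2, criterion="maxclust")         # cut the tree at K = 2
print(labels)
# dendrogram(Z) would draw the tree when matplotlib is available.
```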

Page 19: Cluster Analysis I
Page 20: Cluster Analysis I

Compute group similarities

Page 21: Cluster Analysis I

Choice of linkage

Page 22: Cluster Analysis I

Comparison of the three methods
Single-link:
◦ Elongated clusters
◦ Individual decision, sensitive to outliers

Complete-link:
◦ Compact clusters
◦ Individual decision, sensitive to outliers

Average-link or centroid:
◦ “In between”
◦ Group decision, insensitive to outliers

Page 23: Cluster Analysis I
Page 24: Cluster Analysis I
Page 25: Cluster Analysis I

Divisive Methods
Begin with the entire data set as a single cluster, and recursively divide one of the existing clusters into two daughter clusters.

Continue until each cluster has only one object, or all members of a cluster overlap with each other.

Not as popular as agglomerative methods.

Page 26: Cluster Analysis I

Divisive Algorithms
At each division, another method, e.g. K-means with K = 2, could be used.
Smith et al. 1965 proposed a method that does not involve another clustering method (see the sketch after this list):
◦ Start with one cluster G; assign the object that is the furthest from the others (with the highest average pairwise distance) to a new cluster H.
◦ At each remaining iteration, move to H the one object in G that is the closest to H (maximum difference between its average pairwise distance to objects in G and its average pairwise distance to objects in H).
◦ Continue until all objects in G are closer, on average, to each other than to the objects in H.
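A hedged sketch of this splinter-group split (one division of a cluster into G and H), written around a precomputed distance matrix D; the toy data and helper name are my own.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def split_cluster(D):
    """Split indices {0..n-1} into (G, H) using average pairwise distances."""
    n = D.shape[0]
    G = list(range(n))
    # Seed H with the object having the largest average distance to the rest.
    avg_all = D.sum(axis=1) / (n - 1)
    seed = int(np.argmax(avg_all))
    H = [seed]
    G.remove(seed)
    while len(G) > 1:
        # For each object left in G: average distance to G minus to H.
        gains = []
        for i in G:
            d_G = np.mean([D[i, j] for j in G if j != i])
            d_H = np.mean([D[i, j] for j in H])
            gains.append(d_G - d_H)
        best = int(np.argmax(gains))
        if gains[best] <= 0:      # everyone left in G is closer to G than to H
            break
        H.append(G.pop(best))
    return G, H

X = np.vstack([np.random.default_rng(3).normal(0, 0.4, (6, 2)),
               np.random.default_rng(4).normal(4, 0.4, (6, 2))])
D = squareform(pdist(X))
print(split_cluster(D))
```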

Page 27: Cluster Analysis I

Hierarchical clustering
The most overused statistical method in gene expression.
Gives us pretty pictures.
Results tend to be unstable, sensitive to small changes.

Page 28: Cluster Analysis I

Partitioning method
Partition the data (size N) into a pre-specified number K of mutually exclusive and exhaustive groups: a many-to-one mapping, or encoder k = C(i), that assigns the ith observation to the kth cluster.

Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimization of a specific loss function.

Page 29: Cluster Analysis I

Partitioning method
A natural loss function would be the within-cluster point scatter:

$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k} d(x_i, x_{i'})$$

The total point scatter is

$$T = W(C) + B(C),$$

where $B(C)$ is the between-cluster point scatter. Since T is fixed given the data, minimizing $W(C)$ is equivalent to maximizing $B(C)$.
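A small numerical sketch of this decomposition, using squared Euclidean dissimilarity and a made-up labeling (the encoder C).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]])
labels = np.array([0, 0, 1, 1])              # the encoder k = C(i)

D = squareform(pdist(X, metric="sqeuclidean"))
same = labels[:, None] == labels[None, :]

T = 0.5 * D.sum()                 # total point scatter (fixed for the data)
W = 0.5 * D[same].sum()           # within-cluster point scatter
B = 0.5 * D[~same].sum()          # between-cluster point scatter
print(T, W + B)                   # T = W(C) + B(C)
```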

Page 30: Cluster Analysis I

Partitioning method
In principle, we simply need to minimize W or maximize B over all possible assignments of N objects to K clusters.

However, the number of distinct assignments grows rapidly as N and K grow large.

Page 31: Cluster Analysis I

Partitioning method
In practice, we can only examine a small fraction of all possible encoders.

Such feasible strategies are based on iterative greedy descent:
◦ An initial partition is specified.
◦ At each iterative step, the cluster assignments are changed in such a way that the value of the criterion is improved from its previous value.

Page 32: Cluster Analysis I

K-means
Choose the squared Euclidean distance as the dissimilarity measure:

$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2} = \lVert x_i - x_{i'} \rVert^{2}.$$

Minimize the within-cluster point scatter:

$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k} \lVert x_i - x_{i'} \rVert^{2} = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^{2},$$

where $\bar{x}_k = (\bar{x}_{1k}, \ldots, \bar{x}_{pk})$ is the mean vector of cluster k and $N_k$ is the number of objects in cluster k.

Page 33: Cluster Analysis I

K-means Algorithm (closely related to the EM algorithm for estimating a certain Gaussian mixture model)
1. Choose K centroids at random.
2. Make an initial partition of the objects into K clusters by assigning each object to its closest centroid.
3. E step: calculate the centroid (mean) of each of the K clusters.
4. M step: reassign objects to the closest centroids.
5. Repeat 3 and 4 until no reallocations occur.
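A minimal NumPy sketch following steps 1-5 above; the random initialization, toy data and `kmeans` helper are my own (in practice scikit-learn's KMeans would normally be used).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]    # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each object to its closest centroid (steps 2 and 4).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):                   # step 5: stop
            break
        labels = new_labels
        # Recompute the centroid (mean) of each cluster (step 3).
        # Note: no empty-cluster guard in this sketch.
        centroids = np.vstack([X[labels == k].mean(axis=0) for k in range(K)])
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(2, 0.3, (20, 2))])
print(kmeans(X, K=2)[0])
```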

Page 34: Cluster Analysis I

K-means example

Page 35: Cluster Analysis I

K-means: local minimum problem
[Figure: different initial values for K-means; the run marked “x” falls into a local minimum.]

Page 36: Cluster Analysis I

K-means: discussion
Advantages:
◦ Fast and easy.
◦ Nice relationship with the Gaussian mixture model.

Disadvantages:
◦ Can run into a local minimum (should start from multiple initial values).
◦ Need to know the number of clusters (estimation of the number of clusters).
◦ Does not allow scattered objects (tight clustering).

Page 37: Cluster Analysis I

K-medoids
$$W(C) = \sum_{k=1}^{K} \sum_{C(i)=k} \lVert x_i - m_k \rVert^{2}, \quad m_k = \text{the medoid of cluster } k.$$

Medoid = the object with the smallest average dissimilarity to all other subjects in the cluster.
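A short sketch of finding the medoid of one cluster, i.e. the member with the smallest average dissimilarity to the other members; the toy cluster is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

cluster = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0]])
D = squareform(pdist(cluster))
avg_dissim = D.sum(axis=1) / (len(cluster) - 1)   # average distance to the others
medoid_index = int(np.argmin(avg_dissim))
print(medoid_index, cluster[medoid_index])
```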

Page 38: Cluster Analysis I

Mixture model for clustering

Page 39: Cluster Analysis I

Model based clustering
Fraley and Raftery (1998) applied a Gaussian mixture model with component parameters $(\mu_k, \Sigma_k)$.

The parameters can be estimated by the EM algorithm.

The cluster membership is decided by the posterior probability of each object belonging to cluster k.
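A hedged sketch of model-based clustering fitted by EM with scikit-learn's GaussianMixture; the data, K = 2 and the full covariance choice are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.4, size=(30, 2)),
               rng.normal([3, 1], 0.6, size=(30, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
post = gm.predict_proba(X)          # posterior probability of each cluster
labels = post.argmax(axis=1)        # membership = highest posterior probability
print(labels)
```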

Page 40: Cluster Analysis I

Review of EM algorithm
It is widely used in solving missing-data problems. Here our missing data are the cluster memberships.

Let us review the EM algorithm with a simple example.

Page 41: Cluster Analysis I

[Figure: the E step and M step of the simple example.]

Page 42: Cluster Analysis I

The CML approach
The indicators identifying the mixture-component origin of each observation are treated as unknown parameters.

Two CML (classification maximum likelihood) criteria have been proposed, according to the sampling scheme.

Page 43: Cluster Analysis I

Two CMLs
Random sample within each cluster:

$$L(X \mid \theta, C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log f(x_i \mid \theta_j)$$

Random sample from a population with a mixture density.

Page 44: Cluster Analysis I

-- Classification likelihood:

$$L(X \mid \theta, C) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \log f(x_i \mid \theta_j)$$

-- Mixture likelihood:

$$L(X \mid \theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} p_j\, f(x_i \mid \theta_j)$$

-- Gaussian assumption: each component density $f(\cdot \mid \theta_j)$ is multivariate normal with parameters $(\mu_j, \Sigma_j)$.

Page 45: Cluster Analysis I

The Classification EM algorithm
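The body of this slide did not survive extraction. As a hedged sketch only, a Classification-EM-style iteration (Celeux and Govaert 1992) for spherical Gaussians with a common variance, alternating parameter estimation with a hard classification step, might look like the following; all data, names and settings are illustrative.

```python
import numpy as np

def cem(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))            # random initial partition
    for _ in range(n_iter):
        # M step: means, mixing proportions, and a common spherical variance.
        # (No empty-cluster guard in this sketch.)
        mus = np.vstack([X[labels == k].mean(axis=0) for k in range(K)])
        p = np.array([(labels == k).mean() for k in range(K)])
        var = np.mean((X - mus[labels]) ** 2)
        # E + C step: hard-assign each object to its most probable component.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        logpost = np.log(p)[None, :] - d2 / (2 * var)
        new_labels = logpost.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (25, 2)),
               np.random.default_rng(2).normal(3, 0.5, (25, 2))])
print(cem(X, K=2))
```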

Page 46: Cluster Analysis I

Related to K-means
When f(x) is assumed to be Gaussian and the covariance matrix is identical and spherical across all clusters, i.e. $\Sigma_k = \sigma^2 I$ for all k, maximizing the C1-CML criterion is equivalent to minimizing W.

Page 47: Cluster Analysis I

Model-based methods
Advantages:
◦ Flexibility in the cluster covariance structure.
◦ Rigorous statistical inference with a full model.

Disadvantages:
◦ Model selection is usually difficult; data may not fit the Gaussian model.
◦ Too many parameters to estimate with a complex covariance structure.
◦ Local minimum problem.

Page 48: Cluster Analysis I

References
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning (2nd ed.), New York: Springer. http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011), Cluster Analysis (5th ed.), West Sussex, UK: John Wiley & Sons Ltd.

Celeux, G., and Govaert, G. (1992), "A Classification EM Algorithm for Clustering and Two Stochastic Versions," Computational Statistics & Data Analysis, 14, 315-332.