Unsupervised Learning & Cluster Analysis: Basic Concepts and Algorithms


Page 1: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Unsupervised Learning & Cluster Analysis: Basic Concepts and Algorithms

Assaf Gottlieb

Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar

Page 2: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

What is unsupervised learning & cluster analysis? Learning without a priori knowledge about the classification of samples; learning without a teacher.

Kohonen (1995), “Self-Organizing Maps”

“Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”

B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”

Page 3: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

What do we cluster?

Samples/Instances

Features/Variables

Page 4: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Applications of Cluster Analysis

Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Data exploration: get insight into the data distribution; understand patterns in the data.

Summarization: reduce the size of large data sets; a preprocessing step.

Page 5: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Objectives of Cluster Analysis

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Two competing objectives: inter-cluster distances are maximized while intra-cluster distances are minimized.

Page 6: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Notion of a Cluster can be Ambiguous

How many clusters?

[Figure: the same set of points grouped as two clusters, four clusters, or six clusters.]

Depends on "resolution"!

Page 7: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Prerequisites

Understand the nature of your problem, the type of features, etc.

The metric that you choose for similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.

Page 8: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Similarity/Distance measures

Euclidean distance:

$$d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$$

Highly depends on the scale of the features; may require normalization.

City block (Manhattan) distance:

$$D_M(x, w) = \sum_{i=1}^{d} |x_i - w_i|$$
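As a minimal sketch of the two measures (NumPy assumed; the vectors are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # d(x, y)
city_block = np.sum(np.abs(x - y))         # D_M(x, y)

# Scale sensitivity: z-scoring each feature first is a common normalization.
x_norm = (x - x.mean()) / x.std()
```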

Page 9: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

[Figure: three pairs of expression profiles with d_euc = 0.5846, d_euc = 1.1345, and d_euc = 2.6115.]

These examples of Euclidean distance match our intuition of dissimilarity pretty well…

Page 10: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

[Figure: two pairs of expression profiles with d_euc = 1.41 and d_euc = 1.22.]

…But what about these? What might be going on with the expression profiles on the left? On the right?

Page 11: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Similarity/Distance measures

Cosine:

$$C_{\text{cosine}}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$$

Pearson correlation:

$$C_{\text{pearson}}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right]\left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$$

Both are invariant to scaling (Pearson also to addition). Use Spearman correlation for ranks.
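A small sketch of these measures (NumPy and SciPy assumed; the data is illustrative):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1            # same "shape" as x, different scale and offset

cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # scale-invariant
r, _ = pearsonr(x, y)    # 1.0: invariant to scaling and addition
rho, _ = spearmanr(x, y) # 1.0: depends only on the ranks
```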

Page 12: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Similarity/Distance measures

Jaccard similarity, when interested in intersection size:

$$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

[Figure: Venn diagram of X, Y, X ∩ Y, and X ∪ Y.]
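A minimal implementation (the sets are illustrative):

```python
def jaccard(X: set, Y: set) -> float:
    """|X intersect Y| / |X union Y|, taken here as 0.0 for two empty sets."""
    union = X | Y
    return len(X & Y) / len(union) if union else 0.0

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2/4 = 0.5
```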

Page 13: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Types of Clusterings

An important distinction is between hierarchical and partitional sets of clusters.

Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Page 14: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Partitional Clustering

Original Points A Partitional Clustering

Page 15: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering

[Figure: points p1–p4 shown as two different nested groupings, with the two corresponding dendrograms, Dendrogram 1 and Dendrogram 2.]

Page 16: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Other Distinctions Between Sets of Clustering Methods

- Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
- Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1.
- Partial versus complete: in some cases, we only want to cluster some of the data.
- Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.

Page 17: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Clustering Algorithms

- Hierarchical clustering
- K-means
- Bi-clustering

Page 18: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: six points (1–6) with nested clusters, and the corresponding dendrogram with merge heights between 0 and 0.2.]

Page 19: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.

The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …).

Page 20: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering

Two main types of hierarchical clustering:

- Agglomerative (bottom up): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
- Divisive (top down): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Page 21: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique. The basic algorithm is straightforward (see the sketch after this list):

1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
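A minimal sketch of this algorithm via SciPy (assumed available; the points are made up); the `method` argument selects the inter-cluster proximity definition discussed on the following slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.9], [5.0, 5.0]])

# linkage() computes the full merge sequence from the proximities;
# method = 'single' (MIN), 'complete' (MAX), 'average' (group average), 'ward'
Z = linkage(X, method='average', metric='euclidean')

labels = fcluster(Z, t=2, criterion='maxclust')  # 'cut' the tree at 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib.
```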

Page 22: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: points p1, p2, …, p12 and the initial proximity matrix, one row and column per point.]

Page 23: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the updated proximity matrix, one row and column per cluster.]

Page 24: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 with C2 and C5 marked for merging, alongside the proximity matrix.]

Page 25: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

After Merging

The question is "How do we update the proximity matrix?"

[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked "?".]

Page 26: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 next to the proximity matrix; which entries define the similarity between the clusters?]

- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function: Ward's method (not discussed) uses squared error


Page 31: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

Page 32: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: MIN

[Figure: nested clusters over points 1–6 and the single-link dendrogram (leaf order 3, 6, 2, 5, 4, 1; merge heights up to about 0.2).]

Page 33: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Strength of MIN

[Figure: original points vs. the two clusters found.]

• Can handle non-elliptical shapes

Page 34: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Limitations of MIN

[Figure: original points vs. the two clusters found.]

• Sensitive to noise and outliers

Page 35: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters. (Same similarity matrix as above.)

Page 36: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: MAX

[Figure: nested clusters over points 1–6 and the complete-link dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to about 0.4).]

Page 37: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Strength of MAX

[Figure: original points vs. the two clusters found.]

• Less susceptible to noise and outliers

Page 38: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Limitations of MAX

[Figure: original points vs. the two clusters found.]

• Tends to break large clusters
• Biased towards globular clusters

Page 39: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

$$\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\displaystyle\sum_{p_i \in \text{Cluster}_i,\; p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \times |\text{Cluster}_j|}$$

Need to use average connectivity for scalability, since total proximity favors large clusters. (Same similarity matrix as above.)

Page 40: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: Group Average

[Figure: nested clusters over points 1–6 and the group-average dendrogram (leaf order 3, 6, 4, 1, 2, 5; merge heights up to about 0.25).]

Page 41: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: Group Average

A compromise between single and complete link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Page 42: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: Comparison

[Figure: the same six points clustered with MIN, MAX, and Group Average, side by side; each linkage produces a different nested-cluster structure.]

Page 43: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.

Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling different sized clusters and convex shapes; breaking large clusters (divisive).

The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right.

They impose structure on the data, instead of revealing structure in the data.

How many clusters? (some suggestions later)

Page 44: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

K-means Clustering

Partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.

Page 45: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

K-means Clustering – Details

Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured mostly by Euclidean distance (the typical choice), cosine similarity, correlation, etc.

K-means will converge for the common similarity measures mentioned above, and most of the convergence happens in the first few iterations; often the stopping condition is changed to 'until relatively few points change clusters'.

Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
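A minimal K-means run with scikit-learn (assumed installed; the three blobs are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([3, 0], 0.3, (50, 2)),
               rng.normal([0, 3], 0.3, (50, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignment of each point
print(km.cluster_centers_)   # the K centroids (cluster means)
```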

Page 46: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster; to get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$

where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster.

Given two clusterings, we can choose the one with the smallest error. One easy way to reduce SSE is to increase K, the number of clusters, but a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
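Continuing the earlier scikit-learn sketch, the SSE can be computed by hand and compared against the library's `inertia_` attribute, which reports exactly this quantity:

```python
import numpy as np

def sse(X, labels, centers):
    # sum over clusters of squared Euclidean distances to the cluster mean
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centers))

print(sse(X, km.labels_, km.cluster_centers_))  # matches km.inertia_
```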

Page 47: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Issues and Limitations for K-means

How to choose initial centers? How to choose K? How to handle outliers? Clusters differing in shape, density, and size.

K-means assumes clusters are spherical in vector space and is sensitive to coordinate changes.

Page 48: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Two Different K-means Clusterings

[Figure: the same original points partitioned two ways: an optimal clustering and a sub-optimal clustering.]

Page 49: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Importance of Choosing Initial Centroids

[Figure: one K-means run shown over iterations 1–6; the centroids move each iteration until the assignments stabilize.]

Page 50: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Importance of Choosing Initial Centroids

[Figure: the same run, iterations 1–6, shown as separate panels.]

Page 51: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Importance of Choosing Initial Centroids …

[Figure: a different initialization shown over iterations 1–5, converging to a different final clustering.]

Page 52: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Importance of Choosing Initial Centroids …

[Figure: the same run, iterations 1–5, shown as separate panels.]

Page 53: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Solutions to Initial Centroids Problem

- Multiple runs
- Sample the data and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids (e.g., the most widely separated)
- Bisecting K-means: not as susceptible to initialization issues
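A sketch of two of these remedies with scikit-learn (assumed): multiple restarts, plus the k-means++ seeding strategy, which is not named on the slide but follows the same idea of selecting widely separated initial centroids:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3,
            init='k-means++',  # spread the initial centroids apart
            n_init=25,         # 25 restarts; keep the run with the lowest SSE
            random_state=0).fit(X)  # X as in the earlier example
print(km.inertia_)             # SSE of the best run
```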

Page 54: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Bisecting K-means

The bisecting K-means algorithm is a variant of K-means that can produce a partitional or a hierarchical clustering.
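A minimal sketch of the idea (assumptions: scikit-learn's KMeans for the two-way splits, and "split the cluster with the largest SSE" as the selection rule; other rules, such as splitting the largest cluster, are also used):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]   # start with one all-inclusive cluster
    while len(clusters) < k:
        # pick the cluster with the largest SSE around its own mean
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        # bisect it with a 2-means run
        half = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        clusters += [idx[half == 0], idx[half == 1]]
    return clusters  # index arrays, one per cluster; the split order gives a hierarchy
```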

Page 55: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Bisecting K-means Example

Page 56: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Issues and Limitations for K-means

How to choose initial centers? How to choose K? Depends on the problem (plus some suggestions later). How to handle outliers? Preprocessing. Clusters differing in shape, density, and size.

Page 57: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Issues and Limitations for K-means

How to choose initial centers? How to choose K? How to handle outliers? Clusters differing in shape, density, and size.

Page 58: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Limitations of K-means: Differing Sizes

[Figure: original points vs. the K-means result (3 clusters).]

Page 59: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Limitations of K-means: Differing Density

[Figure: original points vs. the K-means result (3 clusters).]

Page 60: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. the K-means result (2 clusters).]

Page 61: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Overcoming K-means Limitations

[Figure: original points vs. the K-means result with many clusters.]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which need to be put back together.

Page 62: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Overcoming K-means Limitations

[Figure: original points vs. the K-means result with many clusters.]

Page 63: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Overcoming K-means Limitations

[Figure: original points vs. the K-means result with many clusters.]

Page 64: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

K-means

Pros: simple; fast for low-dimensional data; it can find pure sub-clusters if a large number of clusters is specified.

Cons: K-means cannot handle non-globular data of different sizes and densities; K-means will not identify outliers; K-means is restricted to data which has the notion of a center (centroid).

Page 65: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Biclustering/Co-clustering

Two genes can have similar expression patterns only under some conditions. Similarly, in two related conditions, some genes may exhibit different expression patterns.

[Figure: an expression matrix of N genes × M conditions.]

Page 66: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Biclustering

As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a "checkerboard" structure:

[Figure: a gene × condition matrix with a checkerboard pattern of biclusters.]

Page 67: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Biclustering

In general a hard task (NP-hard). Heuristic algorithms described briefly:

- Cheng & Church: deletion of rows and columns; biclusters are discovered one at a time
- Order-Preserving Submatrices (OPSM), Ben-Dor et al.
- Coupled Two-Way Clustering (CTWC), Getz et al.
- Spectral Co-clustering

Page 68: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Cheng and Church

Objective function for the heuristic method (to minimize), the mean squared residue:

$$H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\; j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2$$

where a_iJ and a_Ij are the row and column means of the bicluster and a_IJ is its overall mean.

Greedy method:
- Initialization: the bicluster contains all rows and columns.
- Iteration: (1) compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse; (2) remove the row or column that gives the maximum decrease of H.
- Termination: when no action will decrease H or H ≤ δ. Mask this bicluster and continue.

Problem: removing "trivial" biclusters.
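A sketch of the mean squared residue (NumPy assumed; A stands for the expression matrix already restricted to rows I and columns J):

```python
import numpy as np

def mean_squared_residue(A):
    a_iJ = A.mean(axis=1, keepdims=True)  # row means    a_iJ
    a_Ij = A.mean(axis=0, keepdims=True)  # column means a_Ij
    a_IJ = A.mean()                       # overall mean a_IJ
    return np.mean((A - a_iJ - a_Ij + a_IJ) ** 2)

A = np.array([[1., 2., 3.], [2., 3., 4.], [3., 4., 5.]])  # additive pattern
print(mean_squared_residue(A))  # 0.0: a perfect (here "trivial") bicluster
```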

Page 69: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Ben-Dor et al. (OPSM)

Model: for a condition set T and a gene g, the conditions in T can be ordered so that the expression values are sorted in ascending order (suppose the values are all unique). A submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order.

Idea of the algorithm: grow partial models until they become complete models.

        t1   t2   t3   t4   t5
  g1     7   13   19    2   50
  g2    19   23   39    6   42
  g3     4    6    8    2   10

Induced permutation: 2 3 4 1 5
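A small sketch of how one row induces this permutation (NumPy assumed):

```python
import numpy as np

row = np.array([7, 13, 19, 2, 50])   # gene g1 across t1..t5
ranks = row.argsort().argsort() + 1  # 1-based rank of each condition
print(ranks)                         # [2 3 4 1 5]
# every gene in an OPSM bicluster induces this same permutation
```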

Page 70: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Ben-Dor et al. (OPSM)

Page 71: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Getz et al. (CTWC)

Idea: repeatedly perform one-way clustering on genes/conditions. Stable clusters of genes are used as the attributes for condition clustering, and vice versa.

Page 72: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Spectral Co-clustering

Main idea: normalize the two dimensions, form a matrix of size m+n (using SVD), and use k-means to cluster both types of data.

http://adios.tau.ac.il/SpectralCoClustering/
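scikit-learn ships an implementation of spectral co-clustering that can stand in for the tool above (an assumption: the slides point to the TAU tool, not this library); a minimal run on a planted block:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
A = rng.random((20, 15))
A[:10, :8] += 2.0  # plant a bicluster: 10 "genes" x 8 "conditions"

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A)
print(model.row_labels_)     # co-cluster of each row (gene)
print(model.column_labels_)  # co-cluster of each column (condition)
```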

Page 73: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Evaluating cluster quality

Use known classes (pairwise F-measure, best-class F-measure). Clusters can be evaluated with "internal" as well as "external" measures:

- Internal measures are related to the inter/intra cluster distance.
- External measures are related to how representative the current clusters are of the "true" classes.

Page 74: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Inter/Intra Cluster Distances

Intra-cluster distance: the (sum/min/max/avg of the) (absolute/squared) distance between
- all pairs of points in the cluster, OR
- the centroid and all points in the cluster, OR
- the "medoid" and all points in the cluster.

Inter-cluster distance: sum the (squared) distance between all pairs of clusters, where the distance between two clusters is defined as
- the distance between their centroids/medoids (spherical clusters), OR
- the distance between the closest pair of points belonging to the clusters (chain-shaped clusters).

Page 75: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Davies-Bouldin index

A function of the ratio of the sum of within-cluster (i.e. intra-cluster) scatter to between-cluster (i.e. inter-cluster) separation. Lower values indicate a better clustering.

Let C = {C1, …, Ck} be a clustering of a set of N objects:

$$DB = \frac{1}{k} \sum_{i=1}^{k} R_i$$

with

$$R_i = \max_{j=1..k,\; j \neq i} R_{ij}$$

and

$$R_{ij} = \frac{\mathrm{var}(C_i) + \mathrm{var}(C_j)}{\|c_i - c_j\|}$$

where C_i is the i-th cluster and c_i is the centroid of cluster i.
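A sketch of the index exactly as defined above, for clusters of one-dimensional values (NumPy assumed; note that scikit-learn's `davies_bouldin_score` uses mean distances rather than variances, so its values differ slightly):

```python
import numpy as np

def davies_bouldin(clusters):
    # clusters: list of 1-D arrays of values, one array per cluster
    var = [np.var(c, ddof=1) if len(c) > 1 else 0.0 for c in clusters]  # sample variance
    cen = [float(np.mean(c)) for c in clusters]
    k = len(clusters)
    R = [max((var[i] + var[j]) / abs(cen[i] - cen[j])
             for j in range(k) if j != i)
         for i in range(k)]
    return sum(R) / k
```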

Page 76: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Davies-Bouldin index example

For example, for the clusters shown [figure not reproduced], compute the variances and centroids: var(C1) = 0, var(C2) = 5.5, var(C3) = 2.33; the centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33.

So R12 = 1, R13 = 0.152, R23 = 0.797.

Now compute R_i = max_{j ≠ i} R_ij: R1 = 1 (max of R12 and R13); R2 = 1 (max of R21 and R23); R3 = 0.797 (max of R31 and R32).

Finally, DB = (1 + 1 + 0.797)/3 = 0.932.

Page 77: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Davies-Bouldin index example (ctd)

For example, for the clusters shown [figure not reproduced], with only 2 clusters here: var(C1) = 12.33 and var(C2) = 2.33; c1 = 6.67 and c2 = 18.33.

So R12 = (12.33 + 2.33)/|6.67 − 18.33| = 1.26. Since we have only 2 clusters, R1 = R12 = 1.26 and R2 = R21 = 1.26. Finally, DB = 1.26.

Page 78: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Other criteria

Dunn method:

$$V = \min_{1 \le i \le c} \left\{ \min_{\substack{1 \le j \le c \\ j \neq i}} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right\} \right\}$$

where δ(X_i, X_j) is the inter-cluster distance between clusters X_i and X_j, and Δ(X_k) is the intra-cluster distance of cluster X_k.

Silhouette method: also useful for identifying outliers.

C-index: compare the sum of distances S over all pairs from the same cluster against the same number of smallest and largest pairs.
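As a sketch, the silhouette method applied across candidate numbers of clusters via scikit-learn (assumed; X as in the earlier examples; higher scores are better):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for c in range(2, 7):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    print(c, silhouette_score(X, labels))
```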

Page 79: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Example dataset: AML/ALL (Golub et al.)

[Figure: expression matrix, samples × genes/features.]

Leukemia: 72 patients (samples), 7129 genes, 4 groups. Two major types, ALL & AML; T & B cells within ALL; with/without treatment within AML.

Page 80: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

AML/ALL dataset

Davies-Bouldin index: C = 4. Dunn method: C = 2. Silhouette method: C = 2.

Validity index   c = 2   c = 3   c = 4   c = 5   c = 6
V11              0.18    0.25    0.20    0.19    0.19
V21              1.21    0.69    0.58    0.49    0.47
V31              0.68    0.37    0.39    0.32    0.33
V41              0.55    0.22    0.27    0.21    0.23
V51              0.61    0.32    0.33    0.27    0.28
V61              0.93    0.52    0.53    0.42    0.40
V12              0.62    0.56    0.63    0.59    0.61
V22              4.27    2.20    1.87    1.56    1.49
V32              2.40    1.18    1.24    1.03    1.06
V42              1.95    0.70    0.85    0.66    0.73
V52              2.15    1.01    1.07    0.86    0.90
V62              3.26    1.66    1.69    1.36    1.28
V13              0.23    0.20    0.23    0.22    0.23
V23              1.55    0.80    0.69    0.57    0.55
V33              0.87    0.43    0.46    0.37    0.39
V43              0.71    0.26    0.31    0.24    0.27
V53              0.78    0.32    0.39    0.31    0.33
V63              1.18    0.61    0.63    0.50    0.47
Average          1.34    0.68    0.69    0.57    0.57

Page 81: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Visual evaluation - coherency

Page 82: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Cluster quality example: do you see clusters?

[Figure: two datasets with silhouette scores for C = 2..10 clusters.]

  C   Silhouette        C   Silhouette
  2   0.4922            2   0.4863
  3   0.5739            3   0.5762
  4   0.4773            4   0.5957
  5   0.4991            5   0.5351
  6   0.5404            6   0.5701
  7   0.541             7   0.5487
  8   0.5171            8   0.5083
  9   0.5956            9   0.5311
 10   0.6446           10   0.5229

Page 83: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Dimensionality Reduction

Map points in high-dimensional space to a lower number of dimensions, preserving structure (pairwise distances, etc.). Useful for further processing: less computation, fewer parameters; easier to understand and visualize.

Page 84: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Dimensionality Reduction: Feature Selection vs. Feature Extraction

Feature selection: select important features.

Pros: meaningful features; less work acquiring them.

Unsupervised criteria: variance, fold, UFF.

Page 85: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Dimensionality Reduction: Feature Extraction

Transforms the entire feature set to a lower dimension.

Pros: uses an objective function to select the best projection; sometimes single features are not good enough.

Unsupervised examples: PCA, SVD.

Page 86: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Principal Components Analysis (PCA)

Approximating a high-dimensional data set with a lower-dimensional linear subspace.

[Figure: a cloud of data points with the original axes and the first and second principal components drawn through it.]

Page 87: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Singular Value Decomposition

Page 88: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Principal Components Analysis (PCA)

Rules of thumb for selecting the number of components: the "knee" in the scree plot; cumulative percentage variance.

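A sketch of both rules of thumb with scikit-learn's PCA (assumed installed; the correlated data is synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))  # correlated features

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

plt.plot(range(1, len(ratios) + 1), ratios, marker='o')  # scree plot: find the "knee"
print(np.cumsum(ratios))  # cumulative percentage variance; e.g., keep ~90%
```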

Page 89: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Tools for clustering

Matlab – COMPACT: http://adios.tau.ac.il/compact/


Page 91: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Tools for clustering

Cluster + TreeView (Eisen et al.): http://rana.lbl.gov/eisen/?page_id=42

Page 92: Unsupervised learning & Cluster Analysis: Basic Concepts  and Algorithms

Summary

Clustering is ill-defined and considered an "art". In practice, this means you need to understand your data beforehand and know how to interpret the clusters afterwards.

The problem determines the best solution (which measure, which clustering algorithm) – try to experiment with different options.