Lecture 15: Cluster analysis

Species  Sequence
P.sym    A A A T G C C T G A C G T G G G A A A T C T T T A G G G C T A A G G T T T T T A T T T C G T A T G C T A T G T A G C T T A A G G G T A C T G A C G G T A G
P.xan    A A A T G C C T G A C G T G G G A A A T C T T T A G G G C T A A G G T T A A T A T T C C G T A T G C T A T G T A G C T T A A G G G T A C T G A C G G T A G
P.pola   A A A T G C C T G A C G T G G G A A A T C T T T A G G G C T A A G G T T T T T A T T C C G T A T G C T A T G T A G C T G G A G G G T A C T G A C G G T A G
C.plat   A A A T G C C T G A C G T G G G A A A T C A A T A G G G C T A A G G A A T T T A T T T C G T A T G C T A T G T A G C T T A A G G G T A C T G A T T T T A G
C.grad   A A A T G C C T G A C G T G G G A A A T C A A T A G G G C T A A G G A A T T T A T T T C G T A T G C T A T G T A G C T T C C G G G T A C T G A T T T T A G
D.sym    T T A T G C G A G A C G T G A A A A A T C T T T A G G G C T A A G G T G A T T A T T T C G G T T G C T A T G T A G A G G A A G G G T A C T G A C G G T A G

Linkage algorithm

Distance metric

A cluster analysis is a two-step process that includes the choice of a) a distance metric and b) a linkage algorithm.

Between clusters

Within clusters

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.


The distance metric

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym        0      2       3       7       9     13
P.xan        2      0       4      11      11     15
P.pola       3      4       0      10      10     12
C.plat       7     11      10       0       2     19
C.grad       9     11      10       2       0     19
D.sym       13     15      12      19      19      0

In the simplest case, a distance matrix counts the number of differences between two data sets.
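The counting can be sketched in a few lines, assuming aligned sequences of equal length. The function names and the short toy sequences are illustrative and are not taken from the lecture data.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise difference counts (dict of dicts)."""
    names = list(sequences)
    dist = {x: {y: 0 for y in names} for x in names}
    for x, y in combinations(names, 2):
        d = count_differences(sequences[x], sequences[y])
        dist[x][y] = dist[y][x] = d
    return dist


# Hypothetical toy sequences, chosen only to keep the example short.
toy = {
    "sp1": "AAATGC",
    "sp2": "AAATGG",
    "sp3": "TTATGC",
}
D = distance_matrix(toy)
```

The same two functions could be applied to the full species sequences after removing the spaces between the bases.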

         Site 1  Site 2  Site 3  Site 4
P.sym         1       0       1       1
P.xan         1       0       0       1
P.pola        0       1       0       1
C.plat        0       1       1       1
C.grad        1       0       0       0
D.sym         1       0       1       1
Sum           4       2       3       5

Species presence-absence matrix A

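         Site 1  Site 2  Site 3  Site 4
Site 1        4       0       2       3
Site 2        0       2       1       2
Site 3        2       1       3       3
Site 4        3       2       3       5

Distance matrix D = A^T A

         Site 1    Site 2    Site 3    Site 4
Site 1   1         0         0.571429  0.666667
Site 2   0         1         0.4       0.571429
Site 3   0.571429  0.4       1         0.75
Site 4   0.666667  0.571429  0.75      1

Soerensen index

$$S_{\text{Soerensen}} = \frac{2 \cdot \text{joint}}{\text{Site A} + \text{Site B}}$$

Jaccard index

$$S_{\text{Jaccard}} = \frac{\text{joint}}{\text{Site A} + \text{Site B} - \text{joint}}$$

Here joint is the number of species shared by both sites, and Site A and Site B are the species numbers of the two sites. For sites 1 and 2:

$$S_{1,2}^{\text{Soerensen}} = \frac{2 \cdot 0}{4 + 2} = 0$$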
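Both indices can be computed directly from the presence-absence matrix A. The following is a minimal sketch in plain Python; the function names are illustrative assumptions, and the data are copied from the table above.

```python
# Presence-absence matrix A (rows: species, columns: sites 1 to 4).
A = [
    [1, 0, 1, 1],  # P.sym
    [1, 0, 0, 1],  # P.xan
    [0, 1, 0, 1],  # P.pola
    [0, 1, 1, 1],  # C.plat
    [1, 0, 0, 0],  # C.grad
    [1, 0, 1, 1],  # D.sym
]


def shared_species(matrix):
    """Return A^T A: entry (i, j) counts the species present at both site i and site j."""
    n_sites = len(matrix[0])
    return [
        [sum(row[i] * row[j] for row in matrix) for j in range(n_sites)]
        for i in range(n_sites)
    ]


def soerensen(joint, rich_a, rich_b):
    """Twice the joint species divided by the summed species numbers of both sites."""
    return 2 * joint / (rich_a + rich_b)


def jaccard(joint, rich_a, rich_b):
    """Joint species divided by the total number of species found at either site."""
    return joint / (rich_a + rich_b - joint)


D = shared_species(A)
n = len(D)
# The diagonal of D holds the species number of each site.
S_soerensen = [[soerensen(D[i][j], D[i][i], D[j][j]) for j in range(n)] for i in range(n)]
S_jaccard = [[jaccard(D[i][j], D[i][i], D[j][j]) for j in range(n)] for i in range(n)]
```

The diagonal of A^T A supplies the species numbers per site, so no separate column sums are needed.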

         Site 1  Site 2  Site 3  Site 4
P.sym      0.31    0.12    0.24    0.05
P.xan      0.20    0.65    0.54    0.44
P.pola     0.38    0.81    0.28    0.52
C.plat     0.35    0.69    0.86    0.30
C.grad     0.07    0.99    0.64    0.84
D.sym      0.43    0.78    0.73    0.21
Sum        1.75    4.04    3.30    2.36

Abundance data

Euclidean distance

$$D_{ij} = \sqrt{\sum_{k=1}^{n} (a_{ik} - a_{jk})^2}$$

Manhattan distance

$$D_{ij} = \sum_{k=1}^{n} |a_{ik} - a_{jk}|$$

Correlation distance

$$D_{ij} = r_{ij}$$

         Site 1    Site 2    Site 3    Site 4
Site 1   1         -0.27534  -0.04805  -0.71587
Site 2   -0.27534  1         0.519139  0.807173
Site 3   -0.04805  0.519139  1         0.157251
Site 4   -0.71587  0.807173  0.157251  1

Correlation distance matrix

Bray-Curtis distance

$$D_{ij} = \frac{\sum_{k=1}^{n} |a_{ik} - a_{jk}|}{\sum_{k=1}^{n} a_{ik} + \sum_{k=1}^{n} a_{jk}}$$

Due to squaring, Euclidean distances put particular weight on outliers. They need a linear scale.

The Manhattan distance needs linear scales. Despite a large distance the metric might be zero.

Correlations are sensitive to non-linearities in the data.

The Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data (it equals 1 minus the Soerensen index). It suffers from the same shortcoming as the Manhattan distance.
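The Euclidean, Manhattan and Bray-Curtis distances between two sites can be sketched as follows. This is plain Python under the assumption that each site is a column of the abundance table; the helper names are illustrative.

```python
import math

# Abundance table (rows: species, columns: sites 1 to 4).
abundance = [
    [0.31, 0.12, 0.24, 0.05],  # P.sym
    [0.20, 0.65, 0.54, 0.44],  # P.xan
    [0.38, 0.81, 0.28, 0.52],  # P.pola
    [0.35, 0.69, 0.86, 0.30],  # C.plat
    [0.07, 0.99, 0.64, 0.84],  # C.grad
    [0.43, 0.78, 0.73, 0.21],  # D.sym
]


def site(matrix, i):
    """Column i of the table: the abundances of all species at one site."""
    return [row[i] for row in matrix]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    return manhattan(a, b) / (sum(a) + sum(b))


s1, s2 = site(abundance, 0), site(abundance, 1)
print(euclidean(s1, s2), manhattan(s1, s2), bray_curtis(s1, s2))

# Presence-absence columns of sites 3 and 4: they share 3 species and hold 3 and 5 species.
p3 = [1, 0, 0, 1, 0, 1]
p4 = [1, 1, 1, 1, 0, 1]
print(bray_curtis(p3, p4))  # equals 1 minus the Soerensen index of these two sites
```

The last call illustrates the link to the Soerensen index for presence-absence data.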

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym        0      2       3       7       9     13
P.xan        2      0       4      11      11     15
P.pola       3      4       0      10      10     12
C.plat       7     11      10       0       2     19
C.grad       9     11      10       2       0     19
D.sym       13     15      12      19      19      0


Linkage algorithm

We first combine the species that are nearest to each other to form an inner cluster.

In the next step we look for a species or a cluster that is closest to the average distance of the initial cluster.

We continue this procedure until all species are grouped.

The single linkage algorithm tends to produce many small clusters.
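The stepwise procedure can be sketched as a naive agglomerative loop over the species distance matrix. The text mentions both the average distance and single linkage, so the linkage rule is left as a parameter here. This is an illustrative sketch, not the implementation used for the lecture figures, and ties are broken by taking the first pair found.

```python
from itertools import combinations

names = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
dist = [
    [0, 2, 3, 7, 9, 13],
    [2, 0, 4, 11, 11, 15],
    [3, 4, 0, 10, 10, 12],
    [7, 11, 10, 0, 2, 19],
    [9, 11, 10, 2, 0, 19],
    [13, 15, 12, 19, 19, 0],
]


def single(values):
    """Single linkage: the smallest distance between members of two clusters."""
    return min(values)


def average(values):
    """Average linkage: the mean of all distances between members of two clusters."""
    return sum(values) / len(values)


def agglomerate(dist, linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = linkage([dist[i][j] for i in a for j in b])
            if best is None or d < best[0]:
                best = (d, a, b)  # ties: the first pair found is kept
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((d, a + b))
    return history


single_history = agglomerate(dist, single)
average_history = agglomerate(dist, average)
for height, members in single_history:
    print(height, [names[i] for i in members])
```

Each history entry holds the distance at which two clusters were joined and the members of the new cluster, which is the information a dendrogram displays.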

[Figure: dendrogram of P.sym, P.xan, P.pola, C.plat, C.grad and D.sym]

Sequential versus simultaneous algorithms: In simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomeration versus division algorithms: Agglomerative procedures operate bottom up, division procedures top down.

Monothetic versus polythetic algorithms: Polythetic procedures use several descriptors of linkage, monothetic procedures use the same at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms: Hierarchical methods proceed in a non-overlapping way. During the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimization of within-group homogeneity. Hence they might include members not contained in higher order clusters.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance. It favours chains of small clusters.

Average linkage uses average distances between clusters. It frequently gives larger clusters. The most often used average linkage algorithm is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum. The method often gives clusters of rather equal size.

Median clustering tries to minimize within-cluster variance.
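The Ward criterion can be made concrete with a small sketch that measures how much the total sum of squared deviations grows when two clusters are merged. The function names and the two-dimensional items are hypothetical; Ward's method would join the pair of clusters with the smallest increase.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def sum_of_squares(points):
    """Total squared deviation of the points from their cluster mean."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))


def ward_cost(cluster_a, cluster_b):
    """Increase in the total sum of squares caused by merging two clusters."""
    merged = cluster_a + cluster_b
    return sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)


# Hypothetical two-dimensional items.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 1.0)]
c = [(1.0, 1.0)]

print(ward_cost(a, b), ward_cost(a, c))
```

In this toy case the increase is smaller for joining a with c than for joining a with b, so a and c would be merged first.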

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers.

Which clusters to accept?


Different cluster algorithms give different results.

We accept those clusters that are stable irrespective of algorithm.

In the case of our random numbers clustering is very unstable.

Two methods detected the clusters OP and ABC.

All other items are not clearly separated.

The position of item F remains unclear.

Clustering using a predefined number of clusters: K-means

[Figure: K-means grouping of the items A to P]

K-means clustering starts from a predefined number of clusters and then arranges the items in a way that the distances between clusters are maximized with respect to the distances within the clusters.

Technically, the algorithm first randomly assigns cluster means and then places items (each time calculating new cluster means) until a locally optimal solution (convergence) has been reached. K-means always uses Euclidean distances.
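A minimal sketch of the procedure is given below. It uses the common batch variant (Lloyd's algorithm), which reassigns all items before recomputing the means, rather than updating the means after every single placement. The function name, the seed and the toy items are assumptions made for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with squared Euclidean distances; returns (means, labels)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # start from k randomly chosen items
    labels = None
    for _ in range(iterations):
        # Assignment step: each item joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return means, labels


# Hypothetical items forming two well-separated groups.
items = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
means, labels = kmeans(items, k=2)
print(labels)
```

Because the start is random, different seeds may converge to different solutions on less clearly separated data, which is why the result is only locally optimal.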

Neighbour joining

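[Figure: three stages of neighbour joining for the elements A to F: a star tree around the root, the tree after inserting node X, and the tree after inserting nodes X and Y]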

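Neighbour joining is particularly used to generate phylogenetic trees.

Dissimilarities: You need dissimilarities (phylogenetic distances) $\delta(X,Y)$ between all $n$ elements X and Y.

Calculate

$$\Delta(X) = \sum_{i=1}^{n} \delta(X, Y_i)$$

$$Q(X,Y) = (n-2)\,\delta(X,Y) - \Delta(X) - \Delta(Y)$$

Select the pair (A, B) with the lowest value of Q and join it in a new node $U_{AB}$.

Calculate new dissimilarities

$$\delta(X, U_{AB}) = \frac{\delta(X,A) + \delta(X,B) - \delta(A,B)}{2}$$

Calculate the distances from the new node

$$\delta(A,U) = \frac{(n-2)\,\delta(A,B) + \Delta(A) - \Delta(B)}{2(n-2)}$$

$$\delta(B,U) = \frac{(n-2)\,\delta(A,B) - \Delta(A) + \Delta(B)}{2(n-2)}$$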

Distance matrix
            Mouse  Raven  Octopus  Lumbricus
Mouse       0      0.2    0.6      0.7
Raven       0.2    0      0.6      0.8
Octopus     0.6    0.6    0        0.5
Lumbricus   0.7    0.8    0.5      0

Delta values  1.5  1.6  1.7  2

Q-values
Mouse/Raven        -2.7
Mouse/Octopus      -2
Mouse/Lumbricus    -2.1
Raven/Octopus      -2.1
Raven/Lumbricus    -2
Octopus/Lumbricus  -2.7

Distance matrix
             Mouse  Raven  Protostomia
Mouse        0      0.2    0.4
Raven        0.2    0      0.45
Protostomia  0.4    0.45   0

Delta values  0.6  0.65  0.85

Q-values (n = 3)
Mouse/Raven        -1.05
Mouse/Protostomia  -1.05
Raven/Protostomia  -1.05

With only three elements all Q-values are equal. Mouse and Raven are joined to Vertebrata.

Distance matrix
             Vertebrata  Protostomia
Vertebrata   0           0.325
Protostomia  0.325       0
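One neighbour-joining step on the four-taxon example can be sketched as follows. The function names are illustrative assumptions, and the formulas are the standard neighbour-joining ones: Q is (n - 2) times the dissimilarity minus the two row sums, and the dissimilarity to a new node is half of the two old dissimilarities minus the dissimilarity of the joined pair.

```python
from itertools import combinations

names = ["Mouse", "Raven", "Octopus", "Lumbricus"]
delta = {
    ("Mouse", "Raven"): 0.2, ("Mouse", "Octopus"): 0.6, ("Mouse", "Lumbricus"): 0.7,
    ("Raven", "Octopus"): 0.6, ("Raven", "Lumbricus"): 0.8, ("Octopus", "Lumbricus"): 0.5,
}


def d(x, y):
    """Symmetric lookup of the dissimilarity between two taxa."""
    return 0.0 if x == y else delta.get((x, y), delta.get((y, x)))


def row_sums(taxa):
    """Delta value of each taxon: the sum of its dissimilarities to all others."""
    return {x: sum(d(x, y) for y in taxa) for x in taxa}


def q_values(taxa):
    n, sums = len(taxa), row_sums(taxa)
    return {(x, y): (n - 2) * d(x, y) - sums[x] - sums[y] for x, y in combinations(taxa, 2)}


def distances_to_new_node(taxa, a, b):
    """Dissimilarity of every remaining taxon to the node that joins a and b."""
    return {x: (d(x, a) + d(x, b) - d(a, b)) / 2 for x in taxa if x not in (a, b)}


def branch_lengths(taxa, a, b):
    """Distances from the joined taxa a and b to their new node."""
    n, sums = len(taxa), row_sums(taxa)
    to_a = ((n - 2) * d(a, b) + sums[a] - sums[b]) / (2 * (n - 2))
    return to_a, d(a, b) - to_a


sums = row_sums(names)
Q = q_values(names)
new_d = distances_to_new_node(names, "Octopus", "Lumbricus")
print(sums, Q, new_d, sep="\n")
```

In the first step Mouse/Raven and Octopus/Lumbricus tie at the lowest Q. Joining Octopus and Lumbricus, as in the example, gives the dissimilarities of Mouse and Raven to the new node Protostomia.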


Homework and literature

Refresh:

• Distance metrics
• Euclidean distance
• Manhattan distance
• UPGMA
• Ward clustering
• Neighbour joining
• K-means clustering

Literature:

http://en.wikipedia.org/wiki/Cluster_analysis

http://statsoft.com/textbook/
