Lecture 15: Cluster analysis


Upload: eshe

Post on 23-Feb-2016


Page 1: Lecture  15 Cluster analysis

Lecture 15: Cluster analysis

Species  Sequence
P.sym    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xan    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.pola   AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.plat   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.grad   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.sym    TTATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

A cluster analysis is a two-step process that requires the choice of a) a distance metric and b) a linkage algorithm.

Page 2: Lecture  15 Cluster analysis

[Figure: distances within clusters and between clusters]

Cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.

Page 3: Lecture  15 Cluster analysis


The distance metric

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym      0      2      3       7       9      13
P.xan      2      0      4      11      11      15
P.pola     3      4      0      10      10      12
C.plat     7     11     10       0       2      19
C.grad     9     11     10       2       0      19
D.sym     13     15     12      19      19       0

In the simplest case, a distance matrix counts the number of differences between each pair of data sets (here, between each pair of sequences).
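Counting the positions at which two aligned sequences differ is the Hamming distance. The sketch below builds such a distance matrix. The three short sequences and the names `hamming`, `sequences` and `matrix` are hypothetical and chosen only for illustration; they are not the sequences from the slide.

```python
def hamming(seq_a, seq_b):
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical toy sequences (assumption: already aligned, no gaps).
sequences = {"sp1": "AAATGC", "sp2": "AAATGG", "sp3": "TTATGC"}
names = list(sequences)

# Symmetric matrix of pairwise difference counts, zeros on the diagonal.
matrix = [[hamming(sequences[p], sequences[q]) for q in names] for p in names]

for name, row in zip(names, matrix):
    print(name, row)
```

The same function can be applied to the full-length sequences of the slide, provided they are entered without the spacing introduced by the transcript.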

Page 4: Lecture  15 Cluster analysis

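Species presence-absence matrix A

         Site 1  Site 2  Site 3  Site 4
P.sym      1       0       1       1
P.xan      1       0       0       1
P.pola     0       1       0       1
C.plat     0       1       1       1
C.grad     1       0       0       0
D.sym      1       0       1       1
Sum        4       2       3       5

Distance matrix $D = A^{T}A$

         Site 1  Site 2  Site 3  Site 4
Site 1     4       0       2       3
Site 2     0       2       1       2
Site 3     2       1       3       3
Site 4     3       2       3       5

The off-diagonal entries of D give the number of species shared by two sites (joint); the diagonal entries give the number of species at each site.

Soerensen index

$$S = \frac{2\,\mathrm{joint}}{\mathrm{Site\ A} + \mathrm{Site\ B}}$$

         Site 1    Site 2    Site 3    Site 4
Site 1   1         0         0.571429  0.666667
Site 2   0         1         0.4       0.571429
Site 3   0.571429  0.4       1         0.75
Site 4   0.666667  0.571429  0.75      1

Example (Soerensen):

$$S_{1,2} = \frac{2 \cdot 0}{4 + 2} = 0$$

Jaccard index

$$S = \frac{\mathrm{joint}}{\mathrm{Site\ A} + \mathrm{Site\ B} - \mathrm{joint}}$$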
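The Soerensen and Jaccard matrices can be derived from the presence-absence matrix in a few lines. The following is a minimal sketch using the slide's matrix A; the variable names (`joint`, `richness`, `soerensen`, `jaccard`) are assumptions made for this example.

```python
# Presence-absence matrix A from the slide: rows are species, columns are sites.
A = [
    [1, 0, 1, 1],  # P.sym
    [1, 0, 0, 1],  # P.xan
    [0, 1, 0, 1],  # P.pola
    [0, 1, 1, 1],  # C.plat
    [1, 0, 0, 0],  # C.grad
    [1, 0, 1, 1],  # D.sym
]
n_sites = len(A[0])
sites = range(n_sites)

# A^T A: entry (i, j) counts the species present at both site i and site j.
joint = [[sum(row[i] * row[j] for row in A) for j in sites] for i in sites]

# The diagonal of A^T A is the number of species at each site.
richness = [joint[i][i] for i in sites]

soerensen = [[2 * joint[i][j] / (richness[i] + richness[j]) for j in sites]
             for i in sites]
jaccard = [[joint[i][j] / (richness[i] + richness[j] - joint[i][j]) for j in sites]
           for i in sites]

for row in soerensen:
    print([round(value, 6) for value in row])
```

Both indices are similarities (1 for identical sites), so a dissimilarity for clustering would be one minus the index.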

Page 5: Lecture  15 Cluster analysis

Abundance data

         Site 1  Site 2  Site 3  Site 4
P.sym    0.31    0.12    0.24    0.05
P.xan    0.20    0.65    0.54    0.44
P.pola   0.38    0.81    0.28    0.52
C.plat   0.35    0.69    0.86    0.30
C.grad   0.07    0.99    0.64    0.84
D.sym    0.43    0.78    0.73    0.21
Sum      1.75    4.04    3.30    2.36

Euclidean distance

$$D_{ij} = \sqrt{\sum_{k=1}^{n} (a_{ik} - a_{jk})^{2}}$$

Manhattan distance

$$D_{ij} = \sum_{k=1}^{n} |a_{ik} - a_{jk}|$$

Correlation distance

$$D_{ij} = r_{ij}$$

Correlation distance matrix

         Site 1     Site 2     Site 3     Site 4
Site 1    1         -0.27534   -0.04805   -0.71587
Site 2   -0.27534    1          0.519139   0.807173
Site 3   -0.04805    0.519139   1          0.157251
Site 4   -0.71587    0.807173   0.157251   1

Bray-Curtis distance

$$D_{ij} = \frac{\sum_{k=1}^{n} |a_{ik} - a_{jk}|}{\sum_{k=1}^{n} a_{ik} + \sum_{k=1}^{n} a_{jk}}$$

Because differences are squared, Euclidean distances put particular weight on outliers. They need a linear scale. The Manhattan distance also needs linear scales. Despite a large distance, the metric might be zero.

Correlations are sensitive to non-linearities in the data. The Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data. It suffers from the same shortcoming as the Manhattan distance.
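The four measures can be compared on the abundance table with a short sketch. It assumes that the index k runs over species and that distances are taken between sites, which is what the site-by-site correlation matrix suggests; the function names are chosen for this example only. The input values are the rounded table entries, so results may differ slightly from the slide.

```python
import math

# Abundance table from the slide, one list per site (species in row order).
sites = {
    "Site 1": [0.31, 0.20, 0.38, 0.35, 0.07, 0.43],
    "Site 2": [0.12, 0.65, 0.81, 0.69, 0.99, 0.78],
    "Site 3": [0.24, 0.54, 0.28, 0.86, 0.64, 0.73],
    "Site 4": [0.05, 0.44, 0.52, 0.30, 0.84, 0.21],
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def bray_curtis(x, y):
    # Sum of absolute differences divided by the total abundance of both sites.
    return manhattan(x, y) / (sum(x) + sum(y))


def pearson(x, y):
    # Pearson correlation r; the slide tabulates r itself, while 1 - r is a
    # common way to turn it into a dissimilarity.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x, y = sites["Site 1"], sites["Site 4"]
for measure in (euclidean, manhattan, bray_curtis, pearson):
    print(measure.__name__, round(measure(x, y), 3))
```

For presence-absence vectors, `bray_curtis` returns one minus the Soerensen index, which is the equivalence mentioned above.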

Page 6: Lecture  15 Cluster analysis

P.sym P.xan P.pola C.plat C.grad D.sym

P.sym 0 2 3 7 9 13

P.xan 2 0 4 11 11 15

P.pola 3 4 0 10 10 12

C.plat 7 11 10 0 2 19

C.grad 9 11 10 2 0 19

D.sym 13 15 12 19 19 0


Linkage algorithm

We first combine the species that are nearest to each other to form an inner cluster.

In the next step we look for a species or a cluster that is closest to the average distance of the initial cluster.

We continue this procedure until all species are grouped.

The single linkage algorithm tends to produce many small clusters.

[Dendrogram with leaves in the order P.sym, P.xan, P.pola, C.plat, C.grad, D.sym]
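The stepwise procedure can be sketched as a small agglomerative loop over the distance matrix shown on this slide. This is a minimal illustration, not the routine used to draw the dendrogram; the names `cluster_distance` and `agglomerate` and the `linkage` parameter are assumptions for this example.

```python
from itertools import combinations

labels = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
D = [
    [0, 2, 3, 7, 9, 13],
    [2, 0, 4, 11, 11, 15],
    [3, 4, 0, 10, 10, 12],
    [7, 11, 10, 0, 2, 19],
    [9, 11, 10, 2, 0, 19],
    [13, 15, 12, 19, 19, 0],
]


def cluster_distance(dist, c1, c2, linkage):
    """Distance between two clusters (tuples of item indices)."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":
        return min(pairs)             # nearest members
    return sum(pairs) / len(pairs)    # average over all member pairs


def agglomerate(dist, linkage="average"):
    """Merge the two closest clusters until one remains; return the merge list."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(dist, p[0], p[1], linkage))
        height = cluster_distance(dist, a, b, linkage)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


for a, b, height in agglomerate(D, "average"):
    print([labels[i] for i in a], "+", [labels[i] for i in b], round(height, 3))
```

Ties (here the two pairs at distance 2) are broken by order of appearance, which is an arbitrary choice of this sketch.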

Page 7: Lecture  15 Cluster analysis

Sequential versus simultaneous algorithms: In simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomeration versus division algorithms: Agglomerative procedures operate bottom up, division procedures top down.

Monothetic versus polythetic algorithms: Polythetic procedures use several descriptors of linkage; monothetic procedures use the same descriptor at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms: Hierarchical methods proceed in a non-overlapping way. During the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity. Hence they might include members not contained in higher-order clusters.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance. It favours chains of small clusters.

The average linkage uses average distances between clusters. It frequently gives larger clusters. The most often used average linkage algorithm is the Unweighted Pair-Groups Method Average (UPGMA).

The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum. The method often gives clusters of rather equal size.

Median clustering tries to minimize within-cluster variance.
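The Ward criterion can be illustrated by computing how much the within-cluster sum of squared deviations grows when two clusters are merged; at each step the merge with the smallest increase is preferred. The points and the names `sum_of_squares` and `ward_cost` are hypothetical and serve only as a sketch.

```python
def sum_of_squares(points):
    """Sum of squared deviations of the points from their cluster mean."""
    n, dims = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - mean[d]) ** 2 for p in points for d in range(dims))


def ward_cost(cluster_a, cluster_b):
    """Increase in the total within-cluster sum of squares caused by a merge."""
    merged = cluster_a + cluster_b
    return sum_of_squares(merged) - sum_of_squares(cluster_a) - sum_of_squares(cluster_b)


# Hypothetical two-dimensional items.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(4.0, 0.0), (4.0, 2.0)]
near = [(1.0, 1.0)]

print(ward_cost(left, right))  # merging two distant clusters is expensive
print(ward_cost(left, near))   # merging with a nearby item is cheap
```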

Page 8: Lecture  15 Cluster analysis

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers.

Which clusters to accept?

Page 9: Lecture  15 Cluster analysis

Which clusters to accept?

Different cluster algorithms give different results.

We accept those clusters that are stable irrespective of algorithm.

In the case of our random numbers clustering is very unstable.
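One way to make "stable irrespective of algorithm" concrete is to cut the trees from several linkage rules at the same number of clusters and keep only the groups that every rule produces. The sketch below does this for random points; the function `cluster`, the choice of three linkage rules, the seed and the cut at four clusters are all assumptions for illustration.

```python
import random
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def cluster(points, k, linkage):
    """Agglomerate until k clusters remain; return a set of frozensets of indices.

    linkage is a function reducing the member-to-member distances of two
    clusters to one number: min (single), max (complete) or mean (average).
    """
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    clusters = [frozenset([i]) for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(dist(i, j) for i in pair[0] for j in pair[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(12)]

partitions = [cluster(points, 4, rule) for rule in (min, max, mean)]
stable = set.intersection(*partitions)  # groups found by every linkage rule

print("stable clusters:", [sorted(group) for group in stable])
```

With random input, few or no groups are expected to survive the intersection, which matches the instability noted above.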

Page 10: Lecture  15 Cluster analysis

Two methods detected the clusters OP and ABC. All other items are not clearly separated. The position of item F remains unclear.

Page 11: Lecture  15 Cluster analysis

Clustering using a predefined number of clusters: K-means

[Figure: items A to P arranged into clusters]

K-means clustering starts from a predefined number of clusters and then arranges the items in a way that the distances between clusters are maximized with respect to the distances within the clusters.

Technically, the algorithm first randomly assigns cluster means and then places items (each time calculating new cluster means) until an optimal solution (convergence) has been reached.

K-means always uses Euclidean distances.
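A minimal K-means sketch is given below. It uses the common batch variant, in which all items are assigned to the nearest mean before the means are recomputed, rather than updating the means after each single item as described above. The data points, the seed and the function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignment, centers) after convergence or max_iter rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random items serve as the initial means
    assignment = None
    for _ in range(max_iter):
        # Place every item in the cluster with the nearest mean.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # convergence: no item changed its cluster
        assignment = new_assignment
        # Recalculate each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster became empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers


# Hypothetical items forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, k=2)
print(assignment)
print(centers)
```

The result depends on the random start, so in practice the procedure is usually repeated with several seeds.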

Page 12: Lecture  15 Cluster analysis

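Neighbour joining

[Figure: three panels of a tree with leaves A to F and a root; the internal node X appears in the second panel, X and Y in the third]

Neighbour joining is particularly used to generate phylogenetic trees.

Dissimilarities

You need dissimilarities (phylogenetic distances) $\delta(X,Y)$ between all elements X and Y.

Calculate

$$\Delta(X) = \sum_{i=1}^{n} \delta(X, Y_i)$$

$$Q(X,Y) = (n-2)\,\delta(X,Y) - \Delta(X) - \Delta(Y)$$

Select the pair with the lowest value of Q.

Calculate new dissimilarities

$$\delta(X, U_{AB}) = \frac{\delta(X,A) + \delta(X,B) - \delta(A,B)}{2}$$

Calculate the distances from the new node

$$\delta(A,U) = \frac{(n-2)\,\delta(A,B) + \Delta(A) - \Delta(B)}{2(n-2)}$$

$$\delta(B,U) = \frac{(n-2)\,\delta(A,B) - \Delta(A) + \Delta(B)}{2(n-2)}$$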

Page 13: Lecture  15 Cluster analysis

Distance matrix

              Mouse  Raven  Octopus  Lumbricus
Mouse         0      0.2    0.6      0.7
Raven         0.2    0      0.6      0.8
Octopus       0.6    0.6    0        0.5
Lumbricus     0.7    0.8    0.5      0

Delta values  1.5    1.6    1.7      2

Q-values
Mouse/Raven        -2.7
Mouse/Octopus      -2
Mouse/Lumbricus    -2.1
Raven/Octopus      -2.1
Raven/Lumbricus    -2
Octopus/Lumbricus  -2.7

Distance matrix

              Mouse  Raven  Protostomia
Mouse         0      0.2    0.4
Raven         0.2    0      0.45
Protostomia   0.4    0.45   0

Delta values  0.6    0.65   0.85

Q-values
Mouse/Raven        -1.25
Mouse/Protostomia  -1.05
Raven/Protostomia  -0.6

Distance matrix

              Vertebrata  Protostomia
Vertebrata    0           0.075
Protostomia   0.075       0

$$\Delta(X) = \sum_{i=1}^{n} \delta(X, Y_i)$$

$$Q(X,Y) = (n-2)\,\delta(X,Y) - \Delta(X) - \Delta(Y)$$

$$\delta(X, U_{AB}) = \frac{\delta(X,A) + \delta(X,B) - \delta(A,B)}{2}$$
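The first joining step of this example can be retraced with a short sketch that applies the Delta, Q and new-dissimilarity formulas to the four-taxon matrix. Mouse/Raven and Octopus/Lumbricus tie for the lowest Q; the slide joins Octopus and Lumbricus into Protostomia, and the sketch follows that choice explicitly instead of relying on tie-breaking. Variable names are assumptions for this example.

```python
from itertools import combinations

names = ["Mouse", "Raven", "Octopus", "Lumbricus"]
rows = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.6, 0.8],
    [0.6, 0.6, 0.0, 0.5],
    [0.7, 0.8, 0.5, 0.0],
]
# d[x][y] is the dissimilarity between taxa x and y.
d = {x: dict(zip(names, row)) for x, row in zip(names, rows)}
n = len(names)

# Delta(X): sum of the dissimilarities from X to all other taxa.
delta = {x: sum(d[x].values()) for x in names}

# Q(X, Y) = (n - 2) * d(X, Y) - Delta(X) - Delta(Y)
q = {(x, y): (n - 2) * d[x][y] - delta[x] - delta[y]
     for x, y in combinations(names, 2)}

# Pair joined on the slide (one of the two pairs with the lowest Q).
a, b = "Octopus", "Lumbricus"

# Dissimilarities between the remaining taxa and the new node U.
to_new_node = {x: (d[x][a] + d[x][b] - d[a][b]) / 2
               for x in names if x not in (a, b)}

# Distances from the joined taxa to the new node.
branch_a = ((n - 2) * d[a][b] + delta[a] - delta[b]) / (2 * (n - 2))
branch_b = ((n - 2) * d[a][b] - delta[a] + delta[b]) / (2 * (n - 2))

print({pair: round(value, 2) for pair, value in q.items()})
print({x: round(value, 2) for x, value in to_new_node.items()})
print(round(branch_a, 3), round(branch_b, 3))
```

The two distances from the new node add up to the original Octopus/Lumbricus dissimilarity of 0.5.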

Page 14: Lecture  15 Cluster analysis
Page 15: Lecture  15 Cluster analysis

Homework and literature

Refresh:

• Distance metrics
• Euclidean distance
• Manhattan distance
• UPGMA
• Ward clustering
• Neighbor joining
• K-means cluster

Literature:

http://en.wikipedia.org/wiki/Cluster_analysis

http://statsoft.com/textbook/