lecture 4 cluster analysis species sequence p.syma...


Post on 20-Dec-2015


TRANSCRIPT

Page 1: Lecture 4 Cluster analysis Species Sequence P.symA AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG P.xanA AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG

Lecture 4: Cluster analysis

Species  Sequence
P.sym    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xan    AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.pola   AAATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.plat   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.grad   AAATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.sym    TTATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

Linkage algorithm

Distance metric

A cluster analysis is a two-step process that requires the choice of a) a distance metric and b) a linkage algorithm.

Page 2:

Between clusters

Within clusters

Cluster analysis tries to minimize within cluster distances and to maximize between cluster distances.

Page 3:


The distance metric

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym        0      2       3       7       9     13
P.xan        2      0       4      11      11     15
P.pola       3      4       0      10      10     12
C.plat       7     11      10       0       2     19
C.grad       9     11      10       2       0     19
D.sym       13     15      12      19      19      0

In the simplest case, a distance matrix counts the number of differences between each pair of data sets.
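The counting of differences can be sketched in a few lines of Python. The sequences below are hypothetical toy strings, not the lecture's sequences, and `hamming` is a helper name introduced here only for illustration. It assumes that all sequences are aligned and of equal length.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy sequences (assumption), not the sequences from the lecture.
seqs = {"sp1": "AAATGC", "sp2": "AAATGG", "sp3": "TTATGC"}
names = list(seqs)

# Symmetric distance matrix with zeros on the diagonal.
matrix = [[hamming(seqs[r], seqs[c]) for c in names] for r in names]

for name, row in zip(names, matrix):
    print(name, row)
```

Other ways of counting (for instance treating a block of adjacent changes as one difference) would give other matrices, so the counting rule is part of the choice of the distance metric.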

Page 4:

Species presence-absence matrix A

         Site 1  Site 2  Site 3  Site 4
P.sym         1       0       1       1
P.xan         1       0       0       1
P.pola        0       1       0       1
C.plat        0       1       1       1
C.grad        1       0       0       0
D.sym         1       0       1       1
Sum           4       2       3       5

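Distance matrix D = A^T A

         Site 1  Site 2  Site 3  Site 4
Site 1        4       0       2       3
Site 2        0       2       1       2
Site 3        2       1       3       3
Site 4        3       2       3       5

Soerensen index

         Site 1    Site 2    Site 3    Site 4
Site 1   1         0         0.571429  0.666667
Site 2   0         1         0.4       0.571429
Site 3   0.571429  0.4       1         0.75
Site 4   0.666667  0.571429  0.75      1

Soerensen index:

$$S = \frac{2\,\text{joint}}{\text{Site A} + \text{Site B}}$$

Jaccard index:

$$S = \frac{\text{joint}}{\text{Site A} + \text{Site B} - \text{joint}}$$

Example for Site 1 and Site 2:

$$S_{1,2}^{\text{Soerensen}} = \frac{2 \cdot 0}{4 + 2} = 0$$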
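As a minimal sketch, the joint occurrences and both indices can be computed directly from the presence-absence matrix A. The function names `soerensen` and `jaccard` are chosen here for illustration. The entry (i, j) of A^T A is the number of species shared by sites i and j, and the diagonal holds the species number of each site.

```python
# Presence-absence matrix A from the slide: rows are species, columns are sites.
A = [
    [1, 0, 1, 1],  # P.sym
    [1, 0, 0, 1],  # P.xan
    [0, 1, 0, 1],  # P.pola
    [0, 1, 1, 1],  # C.plat
    [1, 0, 0, 0],  # C.grad
    [1, 0, 1, 1],  # D.sym
]
n_sites = len(A[0])

# A^T A: joint occurrences of species for each pair of sites.
joint = [
    [sum(row[i] * row[j] for row in A) for j in range(n_sites)]
    for i in range(n_sites)
]


def soerensen(i: int, j: int) -> float:
    """2 * joint / (species at site i + species at site j)."""
    return 2 * joint[i][j] / (joint[i][i] + joint[j][j])


def jaccard(i: int, j: int) -> float:
    """joint / (species at site i + species at site j - joint)."""
    return joint[i][j] / (joint[i][i] + joint[j][j] - joint[i][j])


for i in range(n_sites):
    print([round(soerensen(i, j), 6) for j in range(n_sites)])
```

Indices are zero-based in the code, so `soerensen(0, 2)` compares Site 1 with Site 3.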

Page 5:

Abundance data

         Site 1  Site 2  Site 3  Site 4
P.sym      0.31    0.12    0.24    0.05
P.xan      0.20    0.65    0.54    0.44
P.pola     0.38    0.81    0.28    0.52
C.plat     0.35    0.69    0.86    0.30
C.grad     0.07    0.99    0.64    0.84
D.sym      0.43    0.78    0.73    0.21
Sum        1.75    4.04    3.30    2.36

Euclidean distance:

$$D_{ij} = \sqrt{\sum_{k=1}^{n} (a_{ik} - a_{jk})^2}$$

Manhattan distance:

$$D_{ij} = \sum_{k=1}^{n} |a_{ik} - a_{jk}|$$

Correlation distance:

$$D_{ij} = r_{ij}$$

Correlation distance matrix

         Site 1    Site 2    Site 3    Site 4
Site 1   1         -0.27534  -0.04805  -0.71587
Site 2   -0.27534  1         0.519139  0.807173
Site 3   -0.04805  0.519139  1         0.157251
Site 4   -0.71587  0.807173  0.157251  1

Bray-Curtis distance:

$$D_{ij} = \frac{\sum_{k=1}^{n} |a_{ik} - a_{jk}|}{\sum_{k=1}^{n} a_{ik} + \sum_{k=1}^{n} a_{jk}}$$

Due to squaring, Euclidean distances put particular weight on outliers. They need a linear scale.

The Manhattan distance needs linear scales. Despite a large distance, the metric might be zero.

Correlations are sensitive to non-linearities in the data.

The Bray-Curtis distance is equivalent to the Soerensen index for presence-absence data. It suffers from the same shortcoming as the Manhattan distance.
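The four metrics can be sketched as plain Python functions. The function names are chosen here for illustration, and `pearson` returns the ordinary Pearson correlation coefficient. The two abundance columns are the rounded values shown for Site 2 and Site 4, so results may differ slightly from values computed with unrounded data.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    """Manhattan distance scaled by the total of both data sets."""
    return manhattan(a, b) / (sum(a) + sum(b))


def pearson(a, b):
    """Pearson correlation coefficient r between two data sets."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Rounded abundances of the six species at Site 2 and Site 4.
site2 = [0.12, 0.65, 0.81, 0.69, 0.99, 0.78]
site4 = [0.05, 0.44, 0.52, 0.30, 0.84, 0.21]

print(euclidean(site2, site4), manhattan(site2, site4))
print(bray_curtis(site2, site4), pearson(site2, site4))

# Presence-absence columns of Site 1 and Site 3: here the Bray-Curtis
# distance equals 1 minus the Soerensen index (1 - 0.571429).
print(bray_curtis([1, 1, 0, 0, 1, 1], [1, 0, 0, 1, 0, 1]))
```

The last line illustrates the link to the Soerensen index: for presence-absence data the Bray-Curtis distance is the complement of the Soerensen similarity.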

Page 6:

         P.sym  P.xan  P.pola  C.plat  C.grad  D.sym
P.sym        0      2       3       7       9     13
P.xan        2      0       4      11      11     15
P.pola       3      4       0      10      10     12
C.plat       7     11      10       0       2     19
C.grad       9     11      10       2       0     19
D.sym       13     15      12      19      19      0


Linkage algorithm

We first combine the species that are nearest to each other to form an inner cluster.

In the next step we look for a species or a cluster that is closest to the average distance of the initial cluster.

We continue this procedure until all species are grouped.

The single linkage algorithm tends to produce many small clusters.

[Figure: dendrogram with the leaves P.sym, P.xan, P.pola, C.plat, C.grad and D.sym]
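The stepwise procedure can be sketched as follows for the species distance matrix above. This is a minimal illustration, not the software used in the lecture. The function name `agglomerate` and its `linkage` argument are assumptions introduced here: "single" takes the minimum distance between the members of two clusters, any other value takes the average.

```python
from itertools import combinations

labels = ["P.sym", "P.xan", "P.pola", "C.plat", "C.grad", "D.sym"]
D = [
    [0, 2, 3, 7, 9, 13],
    [2, 0, 4, 11, 11, 15],
    [3, 4, 0, 10, 10, 12],
    [7, 11, 10, 0, 2, 19],
    [9, 11, 10, 2, 0, 19],
    [13, 15, 12, 19, 19, 0],
]


def agglomerate(D, labels, linkage="single"):
    """Merge the two closest clusters until a single cluster is left.

    Returns a list of (cluster_a, cluster_b, distance) in merge order.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pair_d = [D[i][j] for i in a for j in b]
            d = min(pair_d) if linkage == "single" else sum(pair_d) / len(pair_d)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two merged clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append(
            (tuple(labels[i] for i in a), tuple(labels[i] for i in b), d)
        )
    return merges


for a, b, d in agglomerate(D, labels, "single"):
    print(a, "+", b, "at distance", d)
```

With both settings P.sym and P.xan as well as C.plat and C.grad are joined first, P.pola then joins the first pair, and D.sym is attached last. Only the distances at which the later merges happen differ between the two linkage rules.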

Page 7:

Sequential versus simultaneous algorithms
In simultaneous algorithms the final solution is obtained in a single step and not stepwise as in the single linkage above.

Agglomeration versus division algorithms
Agglomerative procedures operate bottom up, division procedures top down.

Monothetic versus polythetic algorithms
Polythetic procedures use several descriptors of linkage, monothetic procedures use the same descriptor at each step (for instance maximum association).

Hierarchical versus non-hierarchical algorithms
Hierarchical methods proceed in a non-overlapping way. During the linkage process all members of lower clusters are members of the next higher cluster. Non-hierarchical methods proceed by optimizing within-group homogeneity. Hence they might include members not contained in the higher-order cluster.

The single linkage algorithm uses the minimum distance between the members of two clusters as the measure of cluster distance. It favours chains of small clusters.

The average linkage uses average distances between clusters. It frequently gives larger clusters. The most often used average linkage algorithm is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

The Ward algorithm calculates the total sum of squared deviations from the mean of a cluster and assigns members so as to minimize this sum. The method often gives clusters of rather equal size.

Median clustering tries to minimize within-cluster variance.
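The Ward criterion can be illustrated with a small sketch. The points and the helper names `ess` and `ward_cost` are hypothetical and only serve the example. The cost of a merge is taken as the increase in the summed squared deviations from the cluster mean, and the pair with the smallest increase would be merged first.

```python
def centroid(points):
    """Mean of a cluster, coordinate by coordinate."""
    return [sum(coord) / len(points) for coord in zip(*points)]


def ess(points):
    """Sum of squared deviations of all members from the cluster mean."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase of the total sum of squares when clusters a and b merge."""
    return ess(a + b) - ess(a) - ess(b)


# Hypothetical clusters in two dimensions (assumption for illustration).
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(4, 0), (4, 2)]
cluster_c = [(10, 0)]

print("a+b:", ward_cost(cluster_a, cluster_b))
print("b+c:", ward_cost(cluster_b, cluster_c))
print("a+c:", ward_cost(cluster_a, cluster_c))
```

In this toy case merging the two clusters of equal size is the cheapest step, although the single point lies closer to cluster b than to cluster a.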

Page 8:

To check the performance of different cluster algorithms and distance metrics we use a matrix of random numbers.

Which clusters to accept?

Page 9:

Which clusters to accept?

Different cluster algorithms give different results.

We accept those clusters that are stable irrespective of algorithm.

In the case of our random numbers clustering is very unstable.
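One way to look for stable clusters is to run two linkage rules on the same data and keep only the groups that both produce. The sketch below does this for a small matrix of random numbers. The function name `clusters_formed`, the seed and the matrix size are assumptions made for the illustration, and Euclidean distances between rows are used.

```python
import math
import random
from itertools import combinations


def clusters_formed(points, linkage):
    """Return every cluster (as a frozenset of row indices) created
    while agglomerating the rows of `points`."""

    def between(a, b):
        ds = [math.dist(points[i], points[j]) for i in a for j in b]
        return min(ds) if linkage == "single" else sum(ds) / len(ds)

    active = [frozenset([i]) for i in range(len(points))]
    formed = set()
    while len(active) > 1:
        a, b = min(combinations(active, 2), key=lambda pair: between(*pair))
        active = [c for c in active if c not in (a, b)] + [a | b]
        formed.add(a | b)
    return formed


rng = random.Random(1)  # fixed seed so that the run is repeatable
points = [(rng.random(), rng.random(), rng.random()) for _ in range(8)]

single = clusters_formed(points, "single")
average = clusters_formed(points, "average")

# Clusters found by both rules; the cluster of all items is always among them.
stable = single & average
for cluster in sorted(stable, key=len):
    print(sorted(cluster))
```

For random numbers one would expect only few shared clusters beyond the first pair and the final cluster of all items, which is the instability described above.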

Page 10:

Two methods detected the clusters OP and ABC

All other items are not clearly separated.

The position of item F remains unclear

Page 11:

Clustering using a predefined number of clusters: K-means

[Figure: the items A to P arranged into clusters]

K-means clustering starts from a predefined number of clusters and then arranges the items in a way that the distances between clusters are maximized with respect to the distances within the clusters.

Technically, the algorithm first randomly assigns cluster means and then places items (each time calculating new cluster means) until an optimal solution (convergence) has been reached.

K-means always uses Euclidean distances.
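The iteration described above can be sketched as follows. This is a minimal version that recalculates the means after each complete assignment pass. The points, the seed and the function name `kmeans` are assumptions for illustration. The initial means are drawn at random from the items themselves, which is one common choice among several.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Assign each point to the nearest of k means and update the means
    until the assignment no longer changes."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # random start: k of the items
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])),
            )
            for p in points
        ]
        if new_labels == labels:  # convergence: nothing moved
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old mean if a cluster ran empty
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, means


# Hypothetical items forming two well separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = kmeans(points, k=2)
print(labels)
print(means)
```

The squared Euclidean distance is used for the assignment, in line with the statement that K-means always uses Euclidean distances. With less clearly separated data the result can depend on the random start.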

Page 12:

Neighbour joining

[Figure: three stages of a rooted tree with the leaves A, B, C, D, E and F; the nodes X and Y are added in the second and third stage]

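Neighbour joining is particularly used to generate phylogenetic trees.

You need dissimilarities (phylogenetic distances) d(X,Y) between all elements X and Y.

Calculate the sum of the dissimilarities of each element:

$$\Delta(X) = \sum_{i=1}^{n} d(X, Y_i)$$

Calculate

$$Q(X,Y) = (n-2)\,d(X,Y) - \Delta(X) - \Delta(Y)$$

Select the pair with the lowest value of Q.

Calculate the distances of the selected pair A and B from the new node U:

$$d(A,U) = \frac{(n-2)\,d(A,B) + \Delta(A) - \Delta(B)}{2(n-2)}$$

$$d(B,U) = \frac{(n-2)\,d(A,B) - \Delta(A) + \Delta(B)}{2(n-2)}$$

Calculate new dissimilarities between the new node and all remaining elements X:

$$d(X, U_{AB}) = \frac{d(X,A) + d(X,B) - d(A,B)}{2}$$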

Page 13:

Distance matrix

              Mouse  Raven  Octopus  Lumbricus
Mouse             0    0.2      0.6        0.7
Raven           0.2      0      0.6        0.8
Octopus         0.6    0.6        0        0.5
Lumbricus       0.7    0.8      0.5          0

Delta values    1.5    1.6      1.7          2

Q-values
Mouse/Raven        -2.7
Mouse/Octopus      -2
Mouse/Lumbricus    -2.1
Raven/Octopus      -2.1
Raven/Lumbricus    -2
Octopus/Lumbricus  -2.7

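Distance matrix

              Mouse  Raven  Protostomia
Mouse             0    0.2          0.4
Raven           0.2      0         0.45
Protostomia     0.4   0.45            0

Delta values    0.6   0.65         0.85

Q-values (with n = 3 the factor n - 2 equals 1 and all pairs obtain the same value)
Mouse/Raven        -1.05
Mouse/Protostomia  -1.05
Raven/Protostomia  -1.05

Distance matrix

              Vertebrata  Protostomia
Vertebrata             0        0.325
Protostomia        0.325            0

Joining Mouse and Raven into Vertebrata gives d(Protostomia, Vertebrata) = (0.4 + 0.45 - 0.2) / 2 = 0.325. The distance from Mouse to the new Vertebrata node is (0.2 + 0.6 - 0.65) / 2 = 0.075.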
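The first step of the worked example can be retraced with a short sketch. The function names `delta_values`, `q_values` and `join` are chosen here for illustration. Mouse/Raven and Octopus/Lumbricus share the lowest Q-value, so the pair to be joined is passed in explicitly, and the example joins Octopus and Lumbricus into Protostomia as in the lecture.

```python
from itertools import combinations

taxa = ["Mouse", "Raven", "Octopus", "Lumbricus"]
d = {
    ("Mouse", "Raven"): 0.2,
    ("Mouse", "Octopus"): 0.6,
    ("Mouse", "Lumbricus"): 0.7,
    ("Raven", "Octopus"): 0.6,
    ("Raven", "Lumbricus"): 0.8,
    ("Octopus", "Lumbricus"): 0.5,
}


def dist(x, y):
    """Symmetric lookup of the dissimilarity d(X,Y)."""
    if x == y:
        return 0.0
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def delta_values(taxa):
    """Delta(X): sum of the dissimilarities of X to all elements."""
    return {x: sum(dist(x, y) for y in taxa) for x in taxa}


def q_values(taxa):
    """Q(X,Y) = (n - 2) d(X,Y) - Delta(X) - Delta(Y) for every pair."""
    n = len(taxa)
    delta = delta_values(taxa)
    return {
        (a, b): (n - 2) * dist(a, b) - delta[a] - delta[b]
        for a, b in combinations(taxa, 2)
    }


def join(taxa, a, b):
    """Dissimilarities of the remaining elements to the new node U_AB."""
    return {
        x: (dist(x, a) + dist(x, b) - dist(a, b)) / 2
        for x in taxa
        if x not in (a, b)
    }


q = q_values(taxa)
for pair, value in q.items():
    print(pair, round(value, 6))

to_protostomia = join(taxa, "Octopus", "Lumbricus")
print({x: round(v, 6) for x, v in to_protostomia.items()})
```

Values are rounded for printing because the sums are computed in floating point.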
