disk-covering method

1

Disk-Covering Method

Based on the paper by D. Huson, S. Nettles, and T. Warnow

Presented by Galiya S., Eduard S.

Orangutan Gorilla Chimpanzee Human

From the Tree of Life website, University of Arizona

2

Phylogenetic Tree

A phylogenetic tree is a tree showing the evolutionary interrelationships among various species.

From Desert Vista High School, Phoenix, Arizona

3

Jukes-Cantor model

Definition 1: Let T be a fixed rooted tree with leaves labeled 1,…,n. The Jukes-Cantor model makes the following assumptions:

1. The possible states for each site are A, C, T, G.

2. The sequence length is an input parameter, and for each site the state at the root is drawn from a distribution (typically uniform).

3. The sites evolve independently and identically (i.i.d.) down the tree from the root.

[Figure: the sequence AGACTT, with one site marked, above the sequences GGACTT and AGGCCT.]

4

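Jukes-Cantor model (cont.)

4. For each edge $e = (u,v) \in E(T)$ with u the parent of v, if the state of a site at v differs from its state at u, then v is equally likely to have any of the three remaining states.

Substitution matrix (rows and columns ordered A, G, C, T):

$$
\begin{pmatrix}
1-3a & a & a & a \\
a & 1-3a & a & a \\
a & a & 1-3a & a \\
a & a & a & 1-3a
\end{pmatrix}
$$

[Figure: tree whose nodes are labeled with the sequences AGACTT, GGACTT, AGGCCT, GGGCAT, AGCCCT and GCACTT; an edge e = (u,v) is marked.]

The example above is based on a CIPRES presentation (ppt), University of Texas at Austin.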

5

Jukes-Cantor model (cont.)

5. To each edge e in the tree T we associate a Poisson random variable $X_e$ for the number of mutations of a randomly selected site on that edge.

6. Each edge e has an expectation $\lambda_e$, where $\lambda_e = E(X_e)$.

Multiple changes at a single site (hidden changes):

seq1 AGTCAG
seq2 AGTCAC

Number of changes at the fifth site: seq1 T → G → C → A (3 changes); seq2 T → A (1 change).

[Figure: the sequence AGTCTG joined by edges $e_1$ and $e_2$ to the sequences AGTCAC and AGTCAG.]

6

Definition 2 (split): Removing an edge e from an unrooted phylogenetic tree T partitions the leaf set S of the tree into two non-empty sets. We denote this split by $\pi_e$.

Example: for the tree T on S = {1,2,3,4,5} whose edge e separates leaves 1, 2 from leaves 3, 4, 5, we have $\pi_e$ = {1,2} | {3,4,5}.

Definition 2 (cont.): Let T be the unrooted true tree and T' the unrooted inferred tree, both with leaves labeled 1,…,n, and let e denote an internal edge. Define:

$$C(T) = \{\pi_e \mid e \in E(T)\}, \qquad C(T') = \{\pi_e \mid e \in E(T')\}$$

7

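Definition 2 (cont.)

Any split in $C(T) \setminus C(T')$ is called a false negative (FN).

Any split in $C(T') \setminus C(T)$ is called a false positive (FP).

An edge $e \in E(T)$ is recovered in T' if the split $\pi_e$ (the split induced by e) appears in $C(T')$.

Example:

T: $C(T) = \{\pi_{e_1}, \pi_{e_2}\}$ with $\pi_{e_1}$ = {1,2} | {3,4,5} and $\pi_{e_2}$ = {1,2,3} | {4,5}.

T': $C(T') = \{\pi_{e_1}, \pi_{e_2}\}$ with $\pi_{e_1}$ = {1,2} | {3,4,5} and $\pi_{e_2}$ = {1,2,4} | {3,5}.

The split {1,2,3} | {4,5} of T is a false negative (FN); the split {1,2,4} | {3,5} of T' is a false positive (FP).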

8

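Definition 2 (cont.)

FN rate = (number of false negatives) / (number of internal edges in T)

FP rate = (number of false positives) / (number of internal edges in T')

Example: in the trees T and T' above, each tree has two internal edges; T has one false negative and T' has one false positive.

FN rate = 1/2 = 50%

FP rate = 1/2 = 50%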
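The two rates can be computed directly from the split sets. The following is a minimal sketch, assuming each tree is given as a list of the splits of its internal edges; the function names `canonical` and `fn_fp_rates` are illustrative and do not come from the paper.

```python
def canonical(split):
    """Represent a split A|B as an unordered pair of frozensets."""
    a, b = split
    return frozenset([frozenset(a), frozenset(b)])


def fn_fp_rates(true_splits, inferred_splits):
    """Return (FN rate, FP rate) for two lists of internal-edge splits."""
    c_true = {canonical(s) for s in true_splits}
    c_inferred = {canonical(s) for s in inferred_splits}
    # Splits of the true tree that are missing from the inferred tree.
    fn_rate = len(c_true - c_inferred) / len(c_true)
    # Splits of the inferred tree that are not in the true tree.
    fp_rate = len(c_inferred - c_true) / len(c_inferred)
    return fn_rate, fp_rate


# Internal-edge splits of the example trees T and T'.
T = [({1, 2}, {3, 4, 5}), ({1, 2, 3}, {4, 5})]
T_inferred = [({1, 2}, {3, 4, 5}), ({1, 2, 4}, {3, 5})]
print(fn_fp_rates(T, T_inferred))
```

Storing each split as an unordered pair means {1,2} | {3,4,5} and {3,4,5} | {1,2} compare equal.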

9

Additive matrix

A dissimilarity matrix is a symmetric matrix that is 0 on the diagonal.

Definition 3: A matrix D is called additive if there exists a tree T with positive edge weighting w such that

$$D_{ij} = \sum_{e \in P_{ij}} w(e)$$

where $P_{ij}$ is the path in T between leaves i and j.

Given an additive matrix D, the tree T can be uniquely reconstructed in $O(n^2)$ time.

10

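True distance

Reminder: Let T be the unrooted true tree, and let $P_{ij}$ be the path in T between leaves i and j. We represent the evolutionary process by a set of Poisson processes $\{X_e \mid e \in E(T)\}$.

Let $X_{ij} = \sum_{e \in P_{ij}} X_e$ and let $\lambda_{ij} = E[X_{ij}]$. Then:

$$\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e \quad \text{where } \lambda_e = E[X_e]$$

$\lambda_{ij}$ is called the true distance between i and j.

$\lambda$ is an additive matrix.

[Figure: the path between leaves i and j passes through three edges with variables $X_{e_1}$, $X_{e_2}$, $X_{e_3}$, so $X_{ij} = X_{e_1} + X_{e_2} + X_{e_3}$.]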

11

Hamming Distance

H(i,j) is the number of sites at which sequences i and j differ. It is called the Hamming distance.

k is the sequence length.

$h_{ij} = H(i,j) / k$ is the normalized Hamming distance.

Example:

s1 CAACCCCGGT
s2 TAATTTCGGT

H(s1, s2) = 4, k = 10

h(s1, s2) = 4/10 = 0.4
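The example can be checked with a few lines of code. This is a minimal sketch, assuming both sequences are plain strings of equal length; the function names are illustrative.

```python
def hamming(s, t):
    """Number of sites at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))


def normalized_hamming(s, t):
    """Hamming distance divided by the sequence length k."""
    return hamming(s, t) / len(s)


s1 = "CAACCCCGGT"
s2 = "TAATTTCGGT"
print(hamming(s1, s2), normalized_hamming(s1, s2))
```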

12

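Distance correction

The Jukes-Cantor distance correction for each two leaves i, j is:

If $h_{ij} < \frac{3}{4}$: $d_{ij} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} h_{ij}\right)$; otherwise $d_{ij}$ is undefined.

Afterwards, compute the maximum defined Jukes-Cantor distance, multiply that value by the number n of leaves, and replace all undefined values with the result.

Example:

The 4 leaves are: 1 TTGCC, 2 TGGCC, 3 TCAAG, 4 TTGGA.

Leaf 3 differs from every other leaf at 4 of the 5 sites (h = 0.8 ≥ 3/4), so its distances are undefined (*). The normalized Hamming distances of the other pairs are $h_{12} = 0.2$, $h_{14} = 0.4$, $h_{24} = 0.6$.

The matrix d is:

     1    2      3    4
1    0    0.233  *    0.572
2         0      *    1.207
3                0    *
4                     0

Replace * with 1.207 × 4 = 4.828:

     1    2      3      4
1    0    0.233  4.828  0.572
2         0      4.828  1.207
3                0      4.828
4                       0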
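The correction and the replacement of undefined entries can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the standard Jukes-Cantor form with the natural logarithm, $-\frac{3}{4}\ln(1 - \frac{4}{3}h)$, and the names `jc_distance` and `corrected_matrix` are hypothetical.

```python
import math


def jc_distance(h):
    """Jukes-Cantor corrected distance; None when h >= 3/4 (undefined)."""
    if h >= 0.75:
        return None
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * h)


def corrected_matrix(seqs):
    """Pairwise corrected distances, with undefined entries replaced
    by n times the largest defined distance."""
    n = len(seqs)
    k = len(seqs[0])
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            h = sum(a != b for a, b in zip(seqs[i], seqs[j])) / k
            d[i][j] = d[j][i] = jc_distance(h)
    defined = [x for row in d for x in row if x is not None]
    fill = max(defined) * n
    return [[fill if x is None else x for x in row] for row in d]


leaves = ["TTGCC", "TGGCC", "TCAAG", "TTGGA"]  # leaves 1..4 of the example
for row in corrected_matrix(leaves):
    print([round(x, 3) for x in row])
```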

13

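Definition 7: Let $q > 0$ be a real number, let d be the estimated distance matrix and $\lambda$ the true distance matrix. Then:

$$e_{ij} = |d_{ij} - \lambda_{ij}|$$

and

$$\varepsilon(q) = \max\{\, e_{ij} \mid \min(d_{ij}, \lambda_{ij}) \le q \,\}$$

Example: q = 3.2

$\lambda$:

     1    2    3    4
1    0    1    3    4
2         0    3.4  4
3              0    1.5
4                   0

d:

     1    2    3    4
1    0    1.2  2.8  4.3
2         0    3.1  3.8
3              0    1.1
4                   0

e:

     1    2    3    4
1    0    0.2  0.2  0.3
2         0    0.3  0.2
3              0    0.4
4                   0

The error: the pairs with $\min(d_{ij}, \lambda_{ij}) \le q$ are (1,2), (1,3), (2,3) and (3,4), with errors 0.2, 0.2, 0.3 and 0.4, so $\varepsilon(q) = 0.4$.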
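The error bound can be computed as the largest absolute difference between estimated and true distances, taken over the pairs whose smaller value is at most q. The sketch below assumes both matrices are full symmetric lists of lists; the names `epsilon`, `lam` and `est` are illustrative.

```python
def epsilon(d, lam, q):
    """Largest |d_ij - lam_ij| over pairs with min(d_ij, lam_ij) <= q."""
    n = len(d)
    errors = [
        abs(d[i][j] - lam[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if min(d[i][j], lam[i][j]) <= q
    ]
    return max(errors, default=0.0)


# True distances (lam) and estimated distances (est) from the example.
lam = [[0, 1, 3, 4], [1, 0, 3.4, 4], [3, 3.4, 0, 1.5], [4, 4, 1.5, 0]]
est = [[0, 1.2, 2.8, 4.3], [1.2, 0, 3.1, 3.8], [2.8, 3.1, 0, 1.1], [4.3, 3.8, 1.1, 0]]
print(round(epsilon(est, lam, 3.2), 3))
```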

14

Threshold Graph

Let d be an $n \times n$ dissimilarity matrix and let $q > 0$ be any real number.

The threshold graph Thresh(d,q) is defined as follows:

The vertex set is {1,2,…,n}.

(i,j) is an edge if and only if $d_{ij} \le q$.

For example, with q = 4.5:

d:

     1    2    3    4
1    0    2    4    6
2         0    7    5
3              0    1
4                   0

Thresh(d,4.5) has the edges (1,2), (1,3) and (3,4), with weights 2, 4 and 1.
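Building the threshold graph is a single pass over the matrix. A minimal sketch, assuming d is a full symmetric list of lists and that vertices are numbered from 1 as on the slide; `threshold_graph` is an illustrative name.

```python
def threshold_graph(d, q):
    """Edge set of Thresh(d, q): pairs (i, j), i < j, with d_ij <= q."""
    n = len(d)
    return {
        (i + 1, j + 1)
        for i in range(n)
        for j in range(i + 1, n)
        if d[i][j] <= q
    }


d = [
    [0, 2, 4, 6],
    [2, 0, 7, 5],
    [4, 7, 0, 1],
    [6, 5, 1, 0],
]
print(sorted(threshold_graph(d, 4.5)))
```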

15

Triangulated graph

Definition: A graph is triangulated if no subset of its nodes induces a cycle of size four or more.

Taken from Wikipedia

16

Disk Covering Method

A generic disk-covering method has four steps:

1. Decomposition: Compute a decomposition of the dataset into overlapping subsets.

2. Solution: Construct trees on the subsets using a base method.

3. Merge: Use a supertree method to merge the trees on the subsets into a tree on the full dataset.

4. Refinement: Compute the asymmetric median tree of all possible supertrees.

The example above is based on a CIPRES presentation (ppt), University of Texas at Austin.

17

Simplicial elimination order

Lemma: A simplicial elimination order is an ordering $v_1, v_2, \ldots, v_n$ of the vertices of G such that, for each i, the set

$$X_i = \{ v_j : j > i,\ (v_i, v_j) \in E(G) \}$$

forms a clique. Every triangulated graph G has a simplicial elimination ordering.

The maximal cliques in G are of the form $\{v_i\} \cup X_i$. This ordering can be found in $O(n^2)$ time, so the maximal cliques of G can be found in $O(n^2)$ time.

Example: [Figure: a triangulated graph on the vertices $\{v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8\}$.]
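The definition can be turned into a direct check. The sketch below is not the $O(n^2)$ procedure mentioned on the slide: it is a straightforward, slower check that a given ordering is simplicial, followed by reading the maximal cliques off the sets $\{v_i\} \cup X_i$. All function names are illustrative.

```python
def later_neighbours(order, edges):
    """For each vertex v, the neighbours that come after v in the order."""
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    x = {v: {u for u in adj[v] if pos[u] > pos[v]} for v in order}
    return x, adj


def is_simplicial_order(order, edges):
    """True if every set X_i of later neighbours forms a clique."""
    x, adj = later_neighbours(order, edges)
    return all(
        b in adj[a] for v in order for a in x[v] for b in x[v] if a != b
    )


def maximal_cliques(order, edges):
    """Maximal sets among the candidates {v_i} | X_i."""
    x, _ = later_neighbours(order, edges)
    candidates = [frozenset({v} | x[v]) for v in order]
    return {c for c in candidates if not any(c < other for other in candidates)}


# A small triangulated graph: a triangle 1-2-3 with a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(is_simplicial_order([1, 2, 3, 4], edges))
print(maximal_cliques([1, 2, 3, 4], edges))
```

Not every candidate $\{v_i\} \cup X_i$ is maximal, which is why the last function filters out candidates contained in a larger one.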

18

Constructing Tq

Input: a dissimilarity matrix d and a real number q > 0.

Output: the reconstructed tree Tq.

1. Compute Thresh(d,q).

2. Triangulate Thresh(d,q). Polynomial complexity.

3. Compute Buneman trees for all maximal cliques in the triangulated Thresh(d,q).

4. Merge the subtrees into a supertree.

Overall complexity: polynomial.

[Annotations on the slide: $O(n^2)$, three times, attached to individual steps.]

19

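Intersection graph

An intersection graph is an undirected graph formed from a family of sets of vertices

$$S = \{S_1, S_2, \ldots, S_m\}, \qquad S_i = \{v_{i1}, v_{i2}, \ldots, v_{ik}\}$$

by choosing one vertex $v_i$ for each set $S_i$ and connecting two vertices $v_i, v_j$ when the corresponding sets $S_i$ and $S_j$ have a non-empty intersection.

[Figure: two overlapping sets $S_i$ and $S_j$ and the edge between their vertices $v_i$ and $v_j$.]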


20

Triangulating Thresh(d,q) Complexity

Lemma: If d is an additive matrix, then Thresh(d,q) is triangulated.

Proof: Let d be an arbitrary additive matrix, and let (T,w) be the edge-weighted tree uniquely associated with d. Let q > 0. Add intermediate vertices to the edges of T and re-weight the edges so that the path lengths between leaf pairs are unchanged, but for every pair of leaves u and v in T, if $d_{u,v} \ge q/2$ then there is a node x in the enlarged tree T' such that

$$d_{T'}(u,x) = q/2 \quad \text{and} \quad d_{T'}(x,v) = d_{T'}(u,v) - q/2$$

[Figure: the tree T' with leaves u and v, the node x on the path between them, and the subtree $X_u$ of T' around u.]

21

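Triangulating Thresh(d,q) Complexity

Now let $X_u$ denote the subtree of T' within distance at most q/2 of u. Note that $X_u \cap X_v \ne \emptyset$ if and only if $d_{u,v} \le q$, and so Thresh(d,q) is identical to the intersection graph of the $X_u$ as u ranges over the leaves of T. The intersection graph of a family of subtrees of a tree is always triangulated; consequently Thresh(d,q) is triangulated.

[Figure: the tree T' with the overlapping subtrees $X_u$ and $X_v$ meeting at x, next to the corresponding edge (u,v) in Thresh(d,q), the intersection graph.]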


22

Supertree Construction Algorithm (SCA)

Step 1: First obtain a simplicial elimination ordering $v_1, \ldots, v_n$ for G. Compute $C_i = \{v_i\} \cup X_i$, where

$$X_i = \{ v_j : j > i,\ (v_i, v_j) \in E(G) \}$$

For each $C_i$, find a maximal clique C containing $C_i$ and compute a tree $t_i$ for $C_i$ by deleting the leaves in $C - C_i$ from $T_C$.

Step 2: Construct the tree: for i = n-3, n-4, …, 1, compute the tree $T_i$ formed by merging $t_i$ and $T_{i+1}$ using the Strict Consensus Subtree Merger method.

Example:

C = {1,2,3,4}

$C_2$ = {2,3,4}

$C - C_2$ = {1}

Deleting leaf 1 from $T_C$ leaves a tree on {2,3,4}.

23

Strict Consensus Subtree Merger

This method contracts a minimal set of edges in each tree so that the two trees become identical on the subtree induced by their common leaves. We denote that subtree by X and call it the backbone.

Merging two trees is done by attaching the pieces of each tree appropriately to the different edges of the backbone.

The situation in which some piece of each tree attaches onto the same edge of the backbone is called a collision.

[Figure: two trees sharing the leaves 1, 2, 3, 4 (one with the extra leaves 5 and 6, the other with the extra leaf 7), their backbone on {1,2,3,4}, and the merged tree on leaves 1 to 7.]

24

Short Quartet Definition

Let (T,w) be a binary tree, edge-weighted by $w : E(T) \to \mathbb{R}$ and leaf-labeled by the set of species $S = \{1, 2, \ldots, n\}$. Let e be an edge in T that is not incident to a leaf of T. Around e there are four subtrees A, B, C, D. Let a, b, c, d be the leaves of the subtrees A, B, C, D, respectively, that are closest to e, where the distance between leaves i and j is measured as $\sum_{e \in P_{ij}} w(e)$. We call {a,b,c,d} a short quartet around e, and the collection of all short quartets around internal edges of T is denoted by $Q_{short}(T)$.

[Figure: an internal edge e with the four subtrees A, B, C, D around it and their closest leaves a, b, c, d.]

25

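Gsq Definition

Consider the additive distance matrix associated to T. The graph $G_{sq}$ on the vertex set S = {1,2,…,n} is defined by: $(i,j) \in G_{sq}$ if i and j are in the same short quartet.

Examples: [Figure: a tree T with leaves i and j in a common short quartet, and the corresponding edge (i,j) in $G_{sq}$.]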

26

Proof of Tq correctness

Theorem: Let T be a leaf-labeled tree. Let G be a triangulated graph such that $G_{sq} \subseteq G$. Let $\{T_A\}$ be the collection of Buneman trees computed on the maximal cliques of G, and assume this collection reconstructs the correct subtrees. Let T* be the tree obtained by applying SCA to $(G, \{T_A\})$. Then T* = T.

Proof: We will show that under these conditions, $T_i$ and T restricted to the same vertices are identical, and that no collision occurs.

Part I: Let T be a tree whose leaves are labeled by $S = \{v_1, v_2, \ldots, v_n\}$. Let G be a triangulated graph on S, and let $\{T_A\}$ be a collection in which $T_A$ is a tree on leaf set A for every maximal clique A in G. Let $v_1, v_2, \ldots, v_n$ be a simplicial elimination ordering of G. We show that for every i,

$$T_i = T|\{v_i, v_{i+1}, \ldots, v_n\}$$

Base: $T_{n-3} = T|\{v_{n-3}, v_{n-2}, v_{n-1}, v_n\}$. This is true since we assumed that all Buneman trees are correct.

27

Proof of Tq correctness (cont.)

Assume $T_i = T|\{v_i, v_{i+1}, \ldots, v_n\}$ for some $i \in \{2, \ldots, n-3\}$. $X_{i-1}$ forms the leaf set of the backbone of the strict consensus merger of $t_{i-1}$ and $T_i$, so we get $T|X_{i-1} = T_i|X_{i-1} = t_{i-1}|X_{i-1}$. Consequently there is no edge contraction when we compute the backbone.

Part II: There can be a collision only if the backbone contains an edge onto which both $v_{i-1}$ and some other $v_j \notin X_{i-1}$ attach; denote this edge by e. Thus some subtree t' of $T_i$ is attached onto e. The leaf set of t' is contained in $Y = \{v_i, v_{i+1}, \ldots, v_n\} \setminus X_{i-1}$. Let P be the path in T corresponding to the edge e, and let its endpoints be a and b. Let $T_0$ be the subtree of T obtained by deleting all the nodes of T that are separated from a by the deletion of b, and vice versa. Let $A_{a,b}$ be the leaves of $T_0$. The following are true:

1. $v_{i-1} \in A_{a,b}$, and all leaves in t' are also in $A_{a,b}$.

2. $G_{sq}$ restricted to $A_{a,b}$ is path connected.

3. $X_{i-1} \cap A_{a,b} = \emptyset$.

28

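Proof of Tq correctness (cont.)

Now let P' be a path lying in $G_{sq}|A_{a,b}$ from $v_{i-1}$ to some node in Y, and let y be the first node in Y on the path P'. By (3), the part of P' before y lies entirely in $\{v_1, v_2, \ldots, v_{i-1}\}$. Since $G_{sq} \subseteq G$ and the ordering is simplicial, the intermediate vertex of lowest index can be removed from the path repeatedly (its two neighbours on the path come later in the ordering and are therefore adjacent in G), so $(v_{i-1}, y) \in E(G)$ with $y \in \{v_i, v_{i+1}, \ldots, v_n\}$. Consequently $y \in X_{i-1}$. But this contradicts the earlier assumption that $y \notin X_{i-1}$.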

29

Experimental Results - Buneman

The FN rate of DCM-Buneman is lower than that of Buneman for every sequence length.

The FP rate of DCM-Buneman is slightly higher than that of Buneman (3% and 0%, respectively).

The FN rate of DCM-Buneman reaches 5% at a sequence length of 10,000; Buneman does not reach this value.

30

Experimental Results - NJ

The FN and FP rates of DCM-NJ are significantly lower than those of NJ.

The DCM-NJ rates drop below 5% at a sequence length of 250.

DCM-NJ can reconstruct the true tree at sequence lengths beyond 900.

31

Distance Methods

The goal is a phylogenetic tree T such that the distances between species in T approximate the distances in D.

A distance matrix D is a symmetric, non-negative matrix with zero diagonal.

We now describe some distance methods.

32

Buneman

Input: a dissimilarity matrix d. Output: tree T.

1. The topology on every four-leaf subset is inferred using the Four-Point Method (FPM):

Input: a 4×4 dissimilarity matrix on i, j, k, l.

Output:

If $d_{ij} + d_{kl} < \min\{d_{ik} + d_{jl},\ d_{il} + d_{jk}\}$, then the topology ij | kl (i, j are separated from k, l by an edge) is returned.

If $d_{ij} + d_{kl} = \min\{d_{ik} + d_{jl},\ d_{il} + d_{jk}\}$, then a star tree is returned.

[Figure: the star tree on i, j, k, l, and the quartet tree ij | kl with internal edge e.]
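The Four-Point Method can be sketched as a comparison of the three pairwise sums. This is an illustrative sketch, assuming d is a full symmetric matrix indexed from 0; the name `four_point` and the convention of returning `None` for a star tree are assumptions made here. It uses exact comparison, so on noisy floating-point data a tolerance would be needed.

```python
def four_point(d, i, j, k, l):
    """Return the split ((a, b), (c, d)) whose sum d_ab + d_cd is strictly
    smallest among the three pairings, or None (star tree) on a tie."""
    sums = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    ordered = sorted(sums.items(), key=lambda item: item[1])
    if ordered[0][1] < ordered[1][1]:
        return ordered[0][0]
    return None


# Additive distances of the quartet tree 01 | 23 with all edge weights 1.
quartet = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# All pairwise distances equal: no pairing is strictly smallest.
star = [[0 if a == b else 2 for b in range(4)] for a in range(4)]
print(four_point(quartet, 0, 1, 2, 3), four_point(star, 0, 1, 2, 3))
```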

33

Buneman (cont.)

Let Q be the set of four-leaf trees defined by the FPM. The Buneman tree is the maximally resolved tree T satisfying: for all quartets i, j, k, l, if T restricted to i, j, k, l induces a binary tree, then the tree in Q on i, j, k, l is the same binary tree.

Lemma 1: Let d be an input dissimilarity matrix, and let T be the Buneman tree defined by d. Then C(T) is the set of splits (A, B) such that:

for all $\{a, a'\} \subseteq A$ and $\{b, b'\} \subseteq B$, the tree $a,a' \mid b,b' \in Q$.

Complexity: polynomial time.

Example: A = {1,2,3}, B = {4,5}. Q contains the quartet trees 1,2 | 4,5, 1,3 | 4,5 and 2,3 | 4,5, so C(T) = {(A,B)}.

34

Neighbor - Joining

Input: a distance matrix d.

Output: unrooted binary tree T.

Algorithm Description:

For every two species, the algorithm determines a score based on the distance matrix.

At each step the algorithm joins the pair with the minimum score:

it makes a subtree whose root replaces the two chosen species in the matrix.

The distances to this new node are recalculated.

This is repeated until only two nodes remain.

Finally, it connects the remaining two vertices with an edge.

Complexity: polynomial time, $O(n^3)$
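The joining loop can be sketched as follows. The slide does not state the score, so this sketch assumes the standard Neighbor-Joining criterion $Q(a,b) = (n-2)\,d_{ab} - \sum_c d_{ac} - \sum_c d_{bc}$ and the standard distance update to the new node; edge lengths are omitted, and the function name and the nested-parenthesis node labels are illustrative.

```python
def neighbor_joining(matrix, names):
    """Return the pairs joined, in order; the last pair is the final edge."""
    d = {a: {b: matrix[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        pairs = ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:])
        # Join the pair with the minimum score Q(a, b).
        a, b = min(
            pairs,
            key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        new = f"({a},{b})"
        d[new] = {new: 0.0}
        # Recalculate the distances from every remaining node to the new node.
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b))
    joins.append(tuple(nodes))  # connect the remaining two nodes
    return joins


dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(dist, ["a", "b", "c", "d", "e"]))
```

Each pass scans all pairs, which is quadratic work per join and cubic overall.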

35

THE END!
