The Algorithm for Constructing Phylogenetic Trees

---by MYZ

What is a phylogenetic tree?

[Figure: The Evolutionary Tree for Some Primates, with leaves siamang, gibbon (Hylobatidae), orangutan, human, and chimpanzee; an internal node is labelled "common ancestor".]

A phylogenetic tree expresses the evolutionary relationships among species.

Generally speaking, a phylogenetic tree is a binary tree.

The value of the research

1> infer evolutionary history

2> estimate the evolution time of the existing species

3> exploit molecular information to make up for the shortage of fossils

The common methods

1> maximum parsimony

2> maximum likelihood

3> distance matrix

The framework of this presentation

1> introduce maximum parsimony

2> introduce maximum likelihood, and show how a heuristic algorithm can cooperate with maximum likelihood

3> explain the distance matrix method in detail

4> transform the problem to TSP, so that heuristic and approximation algorithms can be used to construct the phylogenetic tree

The max parsimony

Basic principle:

construct the phylogenetic tree that requires the minimum number of substitutions (amino acid or, as in the example below, nucleotide)

e.g.:

a: G A A A T T G C
b: G A A C T T G T
c: G C A C T T G T
d: G C C C T T G T
e: G C C A T T G T

[Figure: a parsimony tree for the example, with nodes labelled GAAATTGC, GAACTTGT, GCACTTGT, GCCCTTGT, and GCCATTGT, and the number of substitutions marked on each branch.]
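The substitution count of a candidate tree can be sketched with Fitch's small-parsimony algorithm. This is a minimal illustration, not the method used on the slide: the two tree shapes `TREE_1` and `TREE_2` are hypothetical, chosen only to show that different topologies need different numbers of substitutions for sequences a to e.

```python
def fitch_site(tree, site):
    """Return (possible states, substitutions) for one site on a rooted binary tree.

    `tree` is a leaf name or a pair (left, right); `site` maps leaf name -> base.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: one substitution is needed on this node's branches.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the per-site Fitch costs over all aligned positions."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[pos] for name, s in seqs.items()})[1]
        for pos in range(length)
    )


# Sequences a to e from the example above.
SEQS = {
    "a": "GAAATTGC",
    "b": "GAACTTGT",
    "c": "GCACTTGT",
    "d": "GCCCTTGT",
    "e": "GCCATTGT",
}

# Two hypothetical candidate topologies (assumptions for illustration).
TREE_1 = (("a", "b"), ("c", ("d", "e")))
TREE_2 = (("a", "e"), ("b", ("c", "d")))

print(parsimony_score(TREE_1, SEQS))  # 5
print(parsimony_score(TREE_2, SEQS))  # 6
```

Under maximum parsimony, `TREE_1` would be preferred over `TREE_2` because it explains the same data with fewer substitutions.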

The max likelihood

Basic principle:

compute the probability of a particular set of sequences on a given tree, and maximize this probability over all trees.

Input: a set of sequences and a given tree pattern

Output: the likelihood value of the tree

Target: the tree structure with the maximum likelihood value

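[Figure: the example tree T. Root 0 has children 6 and 8; node 6 has children 7 and 3; node 7 has leaves 1 and 2; node 8 has leaves 4 and 5. The branch leading to node k has length t_k (t1 to t8).]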

How to compute the likelihood value

The probability of a given set of data arising on a given tree can be computed site by site:

$$\log L(T \mid S) = \sum_{i} \log L(T \mid S[i])$$

where S[i] is the i-th site of the sequences S.


1: ATCGGGTGTGTGCAGTGCTG
2: ATGCCTTGTGTGCAGTGCTG
3: ATGCCTTACTGTGCAGTGCT
4: GTCAAATCGTGATCGATAGCT
5: ATGCTAGTTGCTAGCATAGAT

L(T | S[1]), L(T | S[2]), …, L(T | S[n])

The L(T | S[i])

$$L(T \mid S[i]) = \pi_{x_0} P_{x_0 x_6}(t_6)\, P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2)\, P_{x_6 x_3}(t_3)\, P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5)$$

where $x_k$ is the base at node k, and $P_{ij}(t)$, with i and j ranging over the four bases A, T, G, C, is the probability that a lineage which is initially in state i will be in state j after t units of time have elapsed.

$\pi_{x_0}$ is the prior probability of base $x_0$ at the root.

But in the formula for L(T | S[i]), $x_0$, $x_6$, $x_7$, $x_8$ are unknown variables, so we sum over all of their values:

$$L(T \mid S[i]) = \sum_{x_0} \sum_{x_6} \sum_{x_7} \sum_{x_8} \pi_{x_0} P_{x_0 x_6}(t_6)\, P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2)\, P_{x_6 x_3}(t_3)\, P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5)$$

This expression has 256 terms. In general, a tree with n leaves has n-1 internal nodes, so the sum has 4^(n-1) terms.

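The L(T | S[i])

Moving each summation as far to the right as possible gives

$$L(T \mid S[i]) = \sum_{x_0} \pi_{x_0} \left\{ \sum_{x_6} P_{x_0 x_6}(t_6) \left[ \sum_{x_7} P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2) \right] P_{x_6 x_3}(t_3) \right\} \left\{ \sum_{x_8} P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5) \right\}$$

Notice that the pattern of parentheses exactly describes the topology of the tree.

$$\log L(T \mid S) = \sum_{i} \log L(T \mid S[i])$$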
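The nested form of the sum can be evaluated bottom-up, one node at a time, instead of expanding all 256 terms. The sketch below is an illustration under stated assumptions: it uses the Jukes-Cantor model for $P_{ij}(t)$ with a uniform prior of 1/4, sets every branch length to 0.1, and takes only the first four columns of the five sequences so that all rows have the same length. None of these choices come from the slides.

```python
import math
from itertools import product

BASES = "ACGT"


def p_jc(i, j, t):
    """Jukes-Cantor probability of going from base i to base j in time t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


# Topology read off the formula: parent -> [(child, branch length)].
# Every branch length is 0.1, purely for illustration.
CHILDREN = {
    0: [(6, 0.1), (8, 0.1)],
    6: [(7, 0.1), (3, 0.1)],
    7: [(1, 0.1), (2, 0.1)],
    8: [(4, 0.1), (5, 0.1)],
}


def conditional(node, leaf_base):
    """Map each base b to P(data below `node` | node is in state b)."""
    if node not in CHILDREN:  # leaf: only the observed base is possible
        return {b: 1.0 if b == leaf_base[node] else 0.0 for b in BASES}
    cond = {b: 1.0 for b in BASES}
    for child, t in CHILDREN[node]:
        below = conditional(child, leaf_base)
        for b in BASES:
            cond[b] *= sum(p_jc(b, c, t) * below[c] for c in BASES)
    return cond


def site_likelihood(leaf_base):
    """Nested (bottom-up) evaluation of L(T | S[i])."""
    root = conditional(0, leaf_base)
    return sum(0.25 * root[b] for b in BASES)


def site_likelihood_brute_force(leaf_base):
    """Direct evaluation of the 4^4 = 256-term sum, for comparison."""
    internal = sorted(CHILDREN)
    total = 0.0
    for states in product(BASES, repeat=len(internal)):
        x = {**leaf_base, **dict(zip(internal, states))}
        term = 0.25  # prior of the root state
        for parent, kids in CHILDREN.items():
            for child, t in kids:
                term *= p_jc(x[parent], x[child], t)
        total += term
    return total


def log_likelihood(seqs):
    """Sum of the per-site log-likelihoods."""
    length = len(next(iter(seqs.values())))
    return sum(
        math.log(site_likelihood({k: s[i] for k, s in seqs.items()}))
        for i in range(length)
    )


# First four columns of sequences 1 to 5 (truncated for illustration).
SEQS = {1: "ATCG", 2: "ATGC", 3: "ATGC", 4: "GTCA", 5: "ATGC"}
print(log_likelihood(SEQS))
```

Both evaluations give the same value for every site; the nested one visits each branch once per pair of bases rather than once per term of the full sum.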

Heuristic algorithm

L(T | S) is the fitness function.

The structure of the tree is the solution.

Our target is to find the tree structure with the maximum likelihood value.

The number of tree structures is

$$T(n) = \prod_{i=3}^{n} (2i - 3)$$

[Flowchart: initialize; compute L(T|S); record the max value; check the requirement; if no, move to the neighbour solutions and compute again; if yes, output.]
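To see why exhaustive search is impractical, the product can be evaluated for a few values of n. This is a small sketch; the function name `count_tree_structures` is made up for the example. The product equals (2n-3)!!, the count usually quoted for rooted binary trees with n labelled leaves.

```python
def count_tree_structures(n):
    """Evaluate T(n) = product of (2i - 3) for i = 3..n."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count


for n in (3, 4, 5, 10):
    print(n, count_tree_structures(n))
# 3 3
# 4 15
# 5 105
# 10 34459425
```

Even ten species already give more than 34 million structures, which is why a heuristic search over neighbouring structures is used instead of enumeration.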

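Distance matrix: Neighbour joining

Neighbour joining seeks to build a tree that minimizes the sum of all branch lengths.

[Figure: left, a star tree with leaves 1 to 8 joined at a central node X; right, leaves 1 and 2 joined at a new node Y, which is connected to X together with leaves 3 to 8.]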

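Step 1: obtain a table of the distance between each pair of sequences

Distance matrix of five sequences:

       1      2      3      4      5
1      0  0.015  0.045  0.143  0.198
2             0   0.03  0.126  0.179
3                    0  0.092  0.179
4                           0  0.179
5                                  0

Jukes-Cantor single-parameter model:

$$d = -\frac{3}{4} \ln\left(\frac{4q - 1}{3}\right), \qquad q = \frac{\text{number of sites with the same base in both sequences}}{\text{length of the two sequences}}$$

Kimura two-parameter model:

$$d = -\frac{1}{2} \ln\left[(1 - 2p_1 - p_2)\sqrt{1 - 2p_2}\right]$$

$$p_1 = \frac{\text{number of type I transformations}}{\text{length of the sequence}}, \qquad p_2 = \frac{\text{number of type II transformations}}{\text{length of the sequence}}$$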
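Both distance formulas can be sketched directly from two aligned sequences of equal length. One assumption is made here that the slide does not spell out: type I transformations are taken to be transitions (purine to purine, or pyrimidine to pyrimidine) and type II to be transversions, which is the usual reading of the two-parameter formula. The function names are illustrative.

```python
import math

PURINES = {"A", "G"}


def jukes_cantor(s1, s2):
    """d = -(3/4) ln((4q - 1) / 3), with q the fraction of identical sites."""
    q = sum(a == b for a, b in zip(s1, s2)) / len(s1)
    return -0.75 * math.log((4 * q - 1) / 3)


def kimura_two_parameter(s1, s2):
    """d = -(1/2) ln[(1 - 2 p1 - p2) sqrt(1 - 2 p2)]."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # assumed to be "type I"
        else:
            transversions += 1  # assumed to be "type II"
    p1 = transitions / len(s1)
    p2 = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * p1 - p2) * math.sqrt(1 - 2 * p2))


# Sequences b and c from the parsimony example differ at one site (A vs C).
print(jukes_cantor("GAACTTGT", "GCACTTGT"))
print(kimura_two_parameter("GAACTTGT", "GCACTTGT"))
```

Both functions fail with a math domain error when the sequences are too divergent for the logarithm to be defined, which mirrors the limits of the models themselves.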

Step 2: select the minimum distance and merge the nodes

Distance matrix of five sequences:

       1      2      3      4      5
1      0  0.015  0.045  0.143  0.198
2             0   0.03  0.126  0.179
3                    0  0.092  0.179
4                           0  0.179
5                                  0

The minimum distance is between nodes 1 and 2, so select nodes 1 and 2 as a branch. In the meantime, create a new node 6 as the parent of 1 and 2, add it to the structure, and compute the distance from 6 to each of the remaining nodes.

[Figure: nodes 1 and 2 joined under the new node 6, alongside nodes 3, 4, and 5.]

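Step 3: the distance from the new node to the remaining nodes

If we select the two nodes i and j with the minimum distance and create a new node x as their parent, we compute the distance from x to every other node k with the following formula:

$$D_{xk} = \frac{D_{ik} + D_{jk}}{2}, \qquad k = 1, 2, \dots, n,\ k \ne i, j$$

We should also set the distances from i and j to x, which are the branch lengths:

$$L_{ix} = \frac{D_{ij} + D_{iz} - D_{jz}}{2}$$

$$L_{jx} = \frac{D_{ij} + D_{jz} - D_{iz}}{2}$$

where z stands for all nodes except i and j ($D_{iz}$ is the mean distance from i to those nodes).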

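Use the rate-corrected distance (1)

$$S_{ij} = \frac{1}{2(N-2)} \sum_{k \ne i,j} (D_{ik} + D_{jk}) + \frac{1}{2} D_{ij} + \frac{1}{N-2} \sum_{m<n;\ m,n \ne i,j} D_{mn}$$

With

$$A_i = \sum_{j=1,\ j \ne i}^{N} D_{ij}$$

this becomes

$$S_{ij} = \frac{1}{2(N-2)} \left( \sum_{k=1}^{N} A_k - A_i - A_j \right) + \frac{1}{2} D_{ij}$$

$$M_{ij} = D_{ij} - \frac{A_i + A_j}{N - 2}$$

Since $\sum_k A_k$ is the same for every pair, $S_{ij} = \frac{1}{2} M_{ij} + \text{const}$, so the pair that minimizes $M_{ij}$ also minimizes $S_{ij}$.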

       1      2      3      4      5      A
1      0  0.015  0.045  0.143  0.198  0.401
2  0.015      0   0.03  0.126  0.179   0.35
3  0.045   0.03      0  0.092  0.179  0.346
4  0.143  0.126  0.092      0  0.179  0.540
5  0.198  0.179  0.179  0.179      0  0.735

Use the rate-corrected distance (2)

$$M_{ij} = D_{ij} - \frac{A_i + A_j}{N - 2}$$

Table of the distance matrix:

       1      2      3      4      5
1      0  0.015  0.045  0.143  0.198
2             0   0.03  0.126  0.179
3                    0  0.092  0.179
4                           0  0.179
5                                  0

Table of the rate-corrected distance:

       1       2       3       4       5
1      0   -0.21  -0.179  -0.139  -0.143
2              0  -0.179  -0.141  -0.147
3                      0  -0.174  -0.145
4                              0  -0.214
5                                      0

Summarize the process

[Figure: schematic of joining nodes i and j, shown with a remaining node k.]

The pseudo code of the NJ algorithm

while N > 2 do
    compute $A_i$ for every node i according to $A_i = \sum_{j \ne i} D_{ij}$
    for each pair of nodes i < j do
        compute $M_{ij}$ according to $M_{ij} = D_{ij} - (A_i + A_j)/(N - 2)$
    select the minimum $M_{ij}$ and cluster i and j into a new node x
    compute $D_{xk}$ according to $D_{xk} = (D_{ik} + D_{jk})/2$, for every $k \ne i, j$
    set the branch lengths from i and j to x according to $L_{ix} = (D_{ij} + D_{iz} - D_{jz})/2$ and $L_{jx} = (D_{ij} + D_{jz} - D_{iz})/2$
    delete i and j from the table and add x to the table
    N = N - 1
end while
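The loop can be sketched in Python as follows. Two points are assumptions rather than the slide's own text. First, the update for the new node uses the commonly used variant $D_{xk} = (D_{ik} + D_{jk} - D_{ij})/2$ instead of the plain average written on the slide, so that the distances to x are measured from the new internal node. Second, the branch length is written as $L_{ix} = D_{ij}/2 + (A_i - A_j)/(2(N-2))$, which is the same quantity as $(D_{ij} + D_{iz} - D_{jz})/2$ when $D_{iz}$ is the mean distance from i to the other nodes. The names `neighbour_joining`, `x1`, `x2` are made up for the example.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Return a list of (node, node, branch length) edges.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    created = 0
    while len(nodes) > 2:
        n = len(nodes)
        # A_i: sum of the distances from i to every other current node.
        a = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair with the smallest rate-corrected distance M_ij.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: d[frozenset(p)] - (a[p[0]] + a[p[1]]) / (n - 2),
        )
        d_ij = d[frozenset((i, j))]
        created += 1
        x = f"x{created}"
        # Branch lengths from i and j to the new parent x.
        l_ix = d_ij / 2 + (a[i] - a[j]) / (2 * (n - 2))
        edges.append((i, x, l_ix))
        edges.append((j, x, d_ij - l_ix))
        # Distances from x to the remaining nodes (variant update, see lead-in).
        for k in nodes:
            if k not in (i, j):
                d[frozenset((x, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [x]
    u, v = nodes
    edges.append((u, v, d[frozenset((u, v))]))
    return edges


# The four-leaf distance matrix used later in the TSP part of the slides.
NAMES = ["A", "B", "C", "D"]
MATRIX = [[0, 3, 4, 6], [3, 0, 5, 7], [4, 5, 0, 6], [6, 7, 6, 0]]
DIST = {
    frozenset((NAMES[r], NAMES[c])): MATRIX[r][c]
    for r in range(4)
    for c in range(r + 1, 4)
}

for edge in neighbour_joining(NAMES, DIST):
    print(edge)
```

On this matrix the sketch joins A and B first (branch lengths 1 and 2), then C and D (2 and 4), and links the two new nodes by a branch of length 1. Ties between equally good pairs are broken by taking the first pair in iteration order.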

Merit and demerit

Maximum parsimony

1> makes full use of the nucleotide information

2> with few species, MP finds the globally optimal tree

3> with plenty of species, its performance is restricted

Maximum likelihood

1> makes full use of the nucleotide information

2> highly dependent on the nucleotide substitution model

3> its performance is the worst of the three

Neighbour joining

1> the fastest algorithm of all

2> sometimes gets the wrong topology

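Transform the problem to TSP

[Figure: a tree with leaves A, B, C, D and internal nodes x, y, z. Branch lengths: A-y = 1, B-y = 2, y-x = 1, C-x = 2, x-z = 2, D-z = 2.]

[Figure: a walk around the tree, z x y A y B y x C x z D and back to z, with the branch length marked on each step.]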

Add up the edges between consecutive unshaded nodes (the leaves):

[Figure: the circuit A, B, C, D with edge lengths A-B = 3, B-C = 5, C-D = 6, D-A = 6.]

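   A  B  C  D
A  0  3  4  6
B  3  0  5  7
C  4  5  0  6
D  6  7  6  0

[Figure: the complete graph on A, B, C, D with edge weights A-B = 3, A-C = 4, A-D = 6, B-C = 5, B-D = 7, C-D = 6.]

The circle A, B, C, D is one of the Hamiltonian circuits of the complete graph.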


Now assume that we have a Hamiltonian circuit: can we construct the phylogenetic tree from it?

[Figure: the tree is rebuilt from the circuit A, B, C, D step by step, adding the internal nodes y, x, and z in turn.]


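So the question is transformed into seeking the minimum Hamiltonian circuit of a given complete graph.

[Figure: the complete graph, the circuit A, B, C, D with edge lengths 3, 5, 6, 6, and the tree with internal nodes x, y, z built from it.]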
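For the four-leaf example the minimum Hamiltonian circuit can be found by trying every ordering. This is only a sketch for tiny inputs, since the number of orderings grows factorially; the function names are illustrative.

```python
from itertools import permutations

DIST = {
    "A": {"A": 0, "B": 3, "C": 4, "D": 6},
    "B": {"A": 3, "B": 0, "C": 5, "D": 7},
    "C": {"A": 4, "B": 5, "C": 0, "D": 6},
    "D": {"A": 6, "B": 7, "C": 6, "D": 0},
}


def circuit_length(order, dist):
    """Length of the closed tour that visits `order` and returns to the start."""
    return sum(dist[a][b] for a, b in zip(order, order[1:] + order[:1]))


def shortest_circuit(dist):
    """Try every ordering with the first city fixed and keep the shortest."""
    first, *rest = sorted(dist)
    best = min(
        ([first] + list(p) for p in permutations(rest)),
        key=lambda order: circuit_length(order, dist),
    )
    return best, circuit_length(best, dist)


print(shortest_circuit(DIST))                        # (['A', 'B', 'C', 'D'], 20)
print(circuit_length(["A", "C", "B", "D"], DIST))    # 22
```

The circuit A, B, C, D has length 20, which is twice the total branch length of the tree (10), while an ordering that does not follow the tree, such as A, C, B, D, is longer.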

step 1: distance matrix

step 2: TSP

step 3: construct tree

Algorithms:

1> ant colony optimization (ACO)

2> particle swarm optimization (PSO)

3> genetic algorithm (GA)

4> simulated annealing (SA)

5> artificial bee colony algorithm (ABC)

6> approximation algorithms: NN, ShortestLink, Insert heuristic
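As one concrete instance of the approximation algorithms listed, the nearest-neighbour (NN) heuristic can be sketched in a few lines. It is a greedy rule with no optimality guarantee in general; the function name and the tie-breaking by alphabetical order are choices made for this example.

```python
DIST = {
    "A": {"A": 0, "B": 3, "C": 4, "D": 6},
    "B": {"A": 3, "B": 0, "C": 5, "D": 7},
    "C": {"A": 4, "B": 5, "C": 0, "D": 6},
    "D": {"A": 6, "B": 7, "C": 6, "D": 0},
}


def nearest_neighbour_tour(dist, start):
    """Repeatedly move to the closest city that has not been visited yet."""
    tour = [start]
    unvisited = set(dist) - {start}
    while unvisited:
        here = tour[-1]
        # sorted() makes ties deterministic (alphabetical order).
        nxt = min(sorted(unvisited), key=lambda city: dist[here][city])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


tour = nearest_neighbour_tour(DIST, "A")
length = sum(DIST[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
print(tour, length)  # ['A', 'B', 'C', 'D'] 20
```

On this small matrix the greedy tour happens to coincide with the circuit A, B, C, D; on larger inputs the result depends on the start city and can be longer than the optimum.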

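Transform the problem to TSP

   A  B  C  D
A  0  3  4  6
B  3  0  5  7
C  4  5  0  6
D  6  7  6  0

Each distance is the sum of the branch lengths on the path between the two leaves:

$$\begin{aligned}
\mathrm{Ay} + \mathrm{By} &= 3 \\
\mathrm{Ay} + \mathrm{xy} + \mathrm{xC} &= 4 \\
\mathrm{Ay} + \mathrm{xy} + \mathrm{xz} + \mathrm{zD} &= 6 \\
\mathrm{By} + \mathrm{xy} + \mathrm{xC} &= 5 \\
\mathrm{By} + \mathrm{xy} + \mathrm{xz} + \mathrm{zD} &= 7 \\
\mathrm{xC} + \mathrm{xz} + \mathrm{zD} &= 6
\end{aligned}$$

Branch lengths: zD = 2, xz = 2, xC = 2, xy = 1, By = 2, Ay = 1

[Figure: the resulting tree with leaves A, B, C, D, internal nodes x, y, z, and these branch lengths.]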
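The branch lengths can be checked against the distance matrix by walking the tree. The sketch below stores the six branches as a small dictionary and sums the weights along the unique path between two leaves; the helper names are illustrative.

```python
EDGES = {
    ("A", "y"): 1, ("B", "y"): 2, ("y", "x"): 1,
    ("C", "x"): 2, ("x", "z"): 2, ("D", "z"): 2,
}


def adjacency(edges):
    """Build an undirected adjacency list from the weighted edges."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def path_length(adj, src, dst, came_from=None):
    """Length of the unique path from src to dst in a tree (None if absent)."""
    if src == dst:
        return 0
    for nxt, w in adj[src]:
        if nxt != came_from:
            rest = path_length(adj, nxt, dst, src)
            if rest is not None:
                return w + rest
    return None


ADJ = adjacency(EDGES)
LEAVES = ["A", "B", "C", "D"]
MATRIX = {(a, b): path_length(ADJ, a, b) for a in LEAVES for b in LEAVES}
CIRCUIT = sum(MATRIX[a, b] for a, b in zip(LEAVES, LEAVES[1:] + LEAVES[:1]))

print(MATRIX["A", "B"], MATRIX["A", "C"], MATRIX["A", "D"])  # 3 4 6
print(CIRCUIT, sum(EDGES.values()))                          # 20 10
```

The path sums reproduce every entry of the matrix, and the circuit A, B, C, D traverses each branch exactly twice, so its length is twice the total branch length.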

Thank you!

maoyaozong
