The Algorithm for Constructing Phylogenetic Tree
---by MYZ
What is a phylogenetic tree?
A phylogenetic tree expresses the evolutionary relationships among species.
[Figure: The Evolutionary Tree for Some Primates. Leaves: siamang, gibbons (Hylobatidae), orangutan, human, chimpanzee; the common ancestor is marked.]
Generally speaking, a phylogenetic tree is a binary tree.
The value of the research
1> Infer evolutionary history
2> Estimate the evolution time of existing species
3> Exploit molecular information to offset the shortage of fossils
The common methods
1> maximum parsimony
2> maximum likelihood
3> distance matrix
The framework of this presentation
1> Introduction to maximum parsimony
2> Introduction to maximum likelihood, and how a heuristic algorithm can cooperate with maximum likelihood
3> The distance matrix method, explained in detail
4> Transforming the problem to TSP, so that heuristic and approximation algorithms can be used to construct the phylogenetic tree
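The max parsimony
Basic principle: construct the phylogenetic tree that requires the minimum number of substitutions (amino acid substitutions or, as in the example below, nucleotide substitutions).
e.g.:
a : G A A A T T G C
b : G A A C T T G T
c : G C A C T T G T
d : G C C C T T G T
e : G C C A T T G T
[Figure: a maximum parsimony tree for these sequences. Internal nodes are labelled with inferred sequences (GAACTTGT, GCACTTGT, GCCCTTGT) and branches with the number of substitutions.]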
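The number of substitutions a given tree needs can be counted site by site. The sketch below uses Fitch's algorithm, which the slides do not name, and a hypothetical topology ((a,b),(c,(d,e))) chosen only for illustration; the sequences are the ones from the example.

```python
# Fitch's algorithm for the substitution count of a fixed tree (a sketch).
# A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).

SEQUENCES = {
    "a": "GAAATTGC",
    "b": "GAACTTGT",
    "c": "GCACTTGT",
    "d": "GCCCTTGT",
    "e": "GCCATTGT",
}

def fitch_site(node, column):
    """Return (candidate states at node, substitutions below node) for one site."""
    if isinstance(node, str):                  # leaf: the observed base
        return {column[node]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, column)
    right_states, right_cost = fitch_site(right, column)
    common = left_states & right_states
    if common:                                 # children agree: no substitution
        return common, left_cost + right_cost
    # children disagree: one substitution is needed on this pair of branches
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, sequences):
    """Sum the per-site Fitch costs over all sites."""
    length = len(next(iter(sequences.values())))
    total = 0
    for site in range(length):
        column = {name: seq[site] for name, seq in sequences.items()}
        total += fitch_site(tree, column)[1]
    return total

# Hypothetical topologies, not taken from the slides.
TREE_1 = (("a", "b"), ("c", ("d", "e")))
TREE_2 = (("a", "e"), ("c", ("b", "d")))
```

Maximum parsimony would prefer whichever candidate topology yields the smaller score, here `TREE_1`.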
The max likelihood
Basic principle: compute the probability of a particular set of sequences on a given tree, and maximize this probability over all trees.
Input: a set of sequences and a given pattern tree
Output: the likelihood value of the tree
Target: the tree structure with the maximum likelihood value
[Figure: pattern tree with nodes 0 to 8 (leaves 1 to 5) and branch lengths t1 to t8.]
How to compute the likelihood value
The probability of a given set of data arising on a given tree can be computed site by site. The likelihood of the whole data set S is the product of the per-site likelihoods L(T | S[1]) L(T | S[2]) … L(T | S[n]), so

$$\log L(T \mid S) = \sum_{i=1}^{n} \log L(T \mid S[i])$$

1: ATCGGGTGTGTGCAGTGCTG
2: ATGCCTTGTGTGCAGTGCTG
3: ATGCCTTACTGTGCAGTGCT
4: GTCAAATCGTGATCGATAGCT
5: ATGCTAGTTGCTAGCATAGAT
[Figure: pattern tree with nodes 0 to 8 (leaves 1 to 5) and branch lengths t1 to t8.]
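The L(T | S[i])

$$L(T \mid S[i]) = \pi_{x_0} P_{x_0 x_6}(t_6)\, P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2)\, P_{x_6 x_3}(t_3)\, P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5)$$

[Figure: pattern tree. Root 0 has children 6 and 8; node 6 has children 7 and 3; node 7 has children 1 and 2; node 8 has children 4 and 5. The branch above node k has length t_k.]
$P_{ij}(t)$ is the probability that a lineage which is initially in state i will be in state j after t units of time have elapsed, where i and j correspond to the four bases A, T, G, C.
$\pi_{x_0}$ is the prior probability of state $x_0$ at node 0.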
But in the formula, x0, x6, x7 and x8 are unknown variables, so we sum over all of their possible values:

$$L(T \mid S[i]) = \sum_{x_0} \sum_{x_6} \sum_{x_7} \sum_{x_8} \pi_{x_0} P_{x_0 x_6}(t_6)\, P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2)\, P_{x_6 x_3}(t_3)\, P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5)$$

This expression has 256 terms. In general, a tree with n leaves has n-1 internal nodes, so the sum has 4^(n-1) terms.
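Each summation can be moved inward so that it only covers the factors that depend on it:

$$L(T \mid S[i]) = \sum_{x_0} \pi_{x_0} \left\{ \sum_{x_6} P_{x_0 x_6}(t_6) \left[ \sum_{x_7} P_{x_6 x_7}(t_7)\, P_{x_7 x_1}(t_1)\, P_{x_7 x_2}(t_2) \right] P_{x_6 x_3}(t_3) \right\} \left\{ \sum_{x_8} P_{x_0 x_8}(t_8)\, P_{x_8 x_4}(t_4)\, P_{x_8 x_5}(t_5) \right\}$$

[Figure: pattern tree with nodes 0 to 8 (leaves 1 to 5) and branch lengths t1 to t8.]

$$\log L(T \mid S) = \sum_{i} \log L(T \mid S[i])$$

Notice that the pattern of parentheses exactly describes the topology of the tree.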
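The nested form can be evaluated bottom-up, one node at a time. The sketch below does this for the tree in the formula and compares it with the direct 256-term sum. It assumes a Jukes-Cantor model for P_ij(t), a uniform prior of 1/4 at node 0, and made-up branch lengths; none of these choices are given in the slides.

```python
import math
from itertools import product

BASES = "ATGC"

# Tree from the formula: node -> (children). Leaves are 1..5.
CHILDREN = {0: (6, 8), 6: (7, 3), 7: (1, 2), 8: (4, 5)}
# Hypothetical branch lengths t1..t8 (length of the branch above each node).
T = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.1, 5: 0.2, 6: 0.05, 7: 0.1, 8: 0.15}

def p(i, j, t):
    """Jukes-Cantor transition probability P_ij(t) (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, leaves):
    """Map each base b to P(observed leaves below node | node is in state b)."""
    if node not in CHILDREN:                       # leaf
        return {b: 1.0 if b == leaves[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child in CHILDREN[node]:
        below = partial(child, leaves)
        for b in BASES:
            result[b] *= sum(p(b, c, T[child]) * below[c] for c in BASES)
    return result

def site_likelihood(leaves):
    """L(T | S[i]) via the nested sums, with a uniform prior at node 0."""
    root = partial(0, leaves)
    return sum(0.25 * root[b] for b in BASES)

def site_likelihood_direct(leaves):
    """The same value from the 256-term sum over x0, x6, x7, x8."""
    total = 0.0
    for x0, x6, x7, x8 in product(BASES, repeat=4):
        total += (0.25 * p(x0, x6, T[6]) * p(x6, x7, T[7])
                  * p(x7, leaves[1], T[1]) * p(x7, leaves[2], T[2])
                  * p(x6, leaves[3], T[3]) * p(x0, x8, T[8])
                  * p(x8, leaves[4], T[4]) * p(x8, leaves[5], T[5]))
    return total

def log_likelihood(columns):
    """log L(T | S) as the sum of per-site log-likelihoods."""
    return sum(math.log(site_likelihood(col)) for col in columns)

# First site of the five example sequences: A, A, A, G, A.
FIRST_SITE = {1: "A", 2: "A", 3: "A", 4: "G", 5: "A"}
```

Each internal node is visited once per site, so the work grows linearly with the number of nodes rather than as 4^(n-1).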
Heuristic algorithm
L(T | S) serves as the fitness function.
The structure of the tree is the solution.
Our target is to get the tree structure with the maximum likelihood value.
The number of tree structures for n species is

$$T(n) = \prod_{i=3}^{n} (2i - 3)$$
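To get a feeling for how quickly the search space grows, the product can be evaluated directly. This is a minimal sketch; the function name `count_trees` is only illustrative.

```python
def count_trees(n):
    """Number of tree structures T(n) = product of (2i - 3) for i = 3..n."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count

for n in (3, 4, 5, 10):
    print(n, count_trees(n))
```

Already for 10 species there are more than 34 million structures, which is why exhaustive search is replaced by a heuristic.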
[Flowchart: initialize, then compute L(T|S) and record the max value. If the requirement is met (yes), output the result; if not (no), generate neighbour solutions and compute L(T|S) again.]
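The loop in the flowchart can be sketched as a generic local search. The slides do not say which heuristic is used, so plain hill climbing is assumed here, and the functions `neighbours` and `fitness` are placeholders that a real implementation would replace with tree rearrangements and L(T | S).

```python
def local_search(initial, neighbours, fitness, max_steps=1000):
    """Hill climbing: move to the best neighbour until nothing improves."""
    best, best_value = initial, fitness(initial)        # initialize
    for _ in range(max_steps):
        candidates = [(fitness(c), c) for c in neighbours(best)]
        if not candidates:
            break
        value, candidate = max(candidates, key=lambda pair: pair[0])
        if value <= best_value:                          # requirement met: stop
            break
        best, best_value = candidate, value              # record max value
    return best, best_value                              # output

# Toy usage on integers instead of trees, just to show the control flow.
result = local_search(0, lambda x: [x - 1, x + 1], lambda x: -(x - 7) ** 2)
```

Hill climbing can stop in a local optimum, which is one reason methods such as simulated annealing or genetic algorithms are often used instead.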
Distance matrix: Neighbour joining
Neighbour joining seeks to build a tree which minimizes the sum of all branch lengths.
[Figure: a star tree with leaves 1 to 8 around a central node X, and the tree after one joining step, with internal nodes X and Y.]
Step 1: obtain a distance table for each pair of sequences

Distance matrix of five sequences
     1      2      3      4      5
1    0      0.015  0.045  0.143  0.198
2           0      0.03   0.126  0.179
3                  0      0.092  0.179
4                         0      0.179
5                                0

Jukes-Cantor single-parameter model:

$$d = -\frac{3}{4} \ln\left(\frac{4q - 1}{3}\right), \qquad q = \frac{\text{number of sites with the same base}}{\text{length of the two sequences}}$$

Kimura two-parameter model:

$$d = -\frac{1}{2} \ln\left[(1 - 2p_1 - p_2)\sqrt{1 - 2p_2}\right]$$

$$p_1 = \frac{\text{number of type I transformations (transitions)}}{\text{length of the sequence}}, \qquad p_2 = \frac{\text{number of type II transformations (transversions)}}{\text{length of the sequence}}$$
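Both distance formulas are easy to evaluate for a pair of aligned sequences. The sketch below assumes equal-length sequences without gaps; the function names are illustrative, and the two input sequences are a and b from the parsimony example.

```python
import math

PURINES = {"A", "G"}   # the other two bases, C and T, are pyrimidines

def jukes_cantor(s1, s2):
    """d = -3/4 ln((4q - 1) / 3), q = fraction of sites with the same base."""
    q = sum(a == b for a, b in zip(s1, s2)) / len(s1)
    return -0.75 * math.log((4 * q - 1) / 3)

def kimura(s1, s2):
    """d = -1/2 ln[(1 - 2 p1 - p2) sqrt(1 - 2 p2)]."""
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    p1 = transitions / len(s1)
    p2 = transversions / len(s1)
    return -0.5 * math.log((1 - 2 * p1 - p2) * math.sqrt(1 - 2 * p2))

d_jc = jukes_cantor("GAAATTGC", "GAACTTGT")
d_k2p = kimura("GAAATTGC", "GAACTTGT")
```

The two sequences differ at two of eight sites (one transition, one transversion), so both models give a distance a little above the raw difference of 0.25.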
Step 2: select the minimum distance and merge nodes

Distance matrix of five sequences
     1      2      3      4      5
1    0      0.015  0.045  0.143  0.198
2           0      0.03   0.126  0.179
3                  0      0.092  0.179
4                         0      0.179
5                                0

The minimum distance is between nodes 1 and 2, so select nodes 1 and 2 as a branch. At the same time, create a new node 6 as the parent of 1 and 2, add 6 to the structure, and compute the distance from 6 to each of the remaining nodes.
[Figure: nodes 1 to 5, with the new node 6 as the parent of 1 and 2.]
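Step 3: the distance from the new node to the remaining nodes
If we select two nodes i and j with minimum distance and create a new node x as their parent, the distance from x to every other node k is computed with the following formula:

$$D_{xk} = \frac{D_{ik} + D_{jk}}{2}, \qquad k = 1, 2, \dots, n,\ k \neq i, j$$

We should also set the distances from i and j to x, which are the lengths of the two branches:

$$L_{ix} = \frac{D_{ij} + D_{iz} - D_{jz}}{2}, \qquad L_{jx} = \frac{D_{ij} + D_{jz} - D_{iz}}{2}$$

Here z stands for all nodes except i and j, so $D_{iz}$ is the average distance from i to those nodes.
[Figure: nodes 1 to 5, with the new node 6 as the parent of 1 and 2.]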
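As a worked example, the formulas can be applied to the merge of nodes 1 and 2 from step 2. This is a sketch only; `merge_pair` and the dictionary layout are illustrative choices, and the new node is called 6 as in the slides.

```python
# Pairwise distances from the five-sequence table.
PAIRS = {
    ("1", "2"): 0.015, ("1", "3"): 0.045, ("1", "4"): 0.143, ("1", "5"): 0.198,
    ("2", "3"): 0.03, ("2", "4"): 0.126, ("2", "5"): 0.179,
    ("3", "4"): 0.092, ("3", "5"): 0.179, ("4", "5"): 0.179,
}
NODES = ["1", "2", "3", "4", "5"]

def as_matrix(nodes, pairs):
    """Expand the upper-triangular table into a symmetric dict of dicts."""
    D = {a: {} for a in nodes}
    for (a, b), value in pairs.items():
        D[a][b] = D[b][a] = value
    return D

def merge_pair(D, nodes, i, j):
    """Return (distances from the new node x, L_ix, L_jx) for merging i and j."""
    others = [k for k in nodes if k not in (i, j)]
    d_iz = sum(D[i][k] for k in others) / len(others)   # average over z
    d_jz = sum(D[j][k] for k in others) / len(others)
    l_ix = (D[i][j] + d_iz - d_jz) / 2
    l_jx = (D[i][j] + d_jz - d_iz) / 2
    d_x = {k: (D[i][k] + D[j][k]) / 2 for k in others}  # D_xk
    return d_x, l_ix, l_jx

D = as_matrix(NODES, PAIRS)
d_6, l_1, l_2 = merge_pair(D, NODES, "1", "2")
```

The two branch lengths always add up to D_ij, here 0.015. The value of `d_6["3"]` is (0.045 + 0.03) / 2 = 0.0375.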
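Use the rate-corrected distance (1)
The sum of branch lengths of the tree in which i and j are joined is

$$S_{ij} = \frac{1}{2(N-2)} \sum_{k \neq i,j} (D_{ik} + D_{jk}) + \frac{1}{2} D_{ij} + \frac{1}{N-2} \sum_{\substack{m < n \\ m,n \neq i,j}} D_{mn}$$

With the row sums

$$A_i = \sum_{j=1,\ j \neq i}^{N} D_{ij}$$

this becomes

$$S_{ij} = \frac{1}{2(N-2)} \left( \sum_{k=1}^{N} A_k - A_i - A_j \right) + \frac{1}{2} D_{ij}$$

The rate-corrected distance is

$$M_{ij} = D_{ij} - \frac{A_i + A_j}{N-2}$$

Since $S_{ij}$ equals $M_{ij}/2$ plus a term that does not depend on i or j, the pair with the minimum $M_{ij}$ is also the pair with the minimum $S_{ij}$.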
Table of the distance matrix, with the row sums A
     1      2      3      4      5      A
1    0      0.015  0.045  0.143  0.198  0.401
2    0.015  0      0.03   0.126  0.179  0.35
3    0.045  0.03   0      0.092  0.179  0.346
4    0.143  0.126  0.092  0      0.179  0.540
5    0.198  0.179  0.179  0.179  0      0.735

Use the rate-corrected distance (2)

$$M_{ij} = D_{ij} - \frac{A_i + A_j}{N-2}$$

Table of rate-corrected distances
     1      2       3       4       5
1    0      -0.21   -0.179  -0.139  -0.143
2           0       -0.179  -0.141  -0.147
3                   0       -0.174  -0.145
4                           0       -0.214
5                                   0
Summarize the process
[Figure: nodes i and j are joined, with k as one of the remaining nodes.]
The pseudo code of the NJ algorithm

compute A_i according to $A_i = \sum_{j \neq i} D_{ij}$
while N > 2 do
    for i = 0 to m-1 do
        for j = i+1 to m do
            compute M_ij according to $M_{ij} = D_{ij} - \frac{A_i + A_j}{N-2}$
    select the min M_ij, cluster i and j into a new node x
    compute D_xk according to $D_{xk} = \frac{D_{ik} + D_{jk}}{2}$, $k \neq i, j$
    set the branch lengths from i and j to x according to $L_{ix} = \frac{D_{ij} + D_{iz} - D_{jz}}{2}$ and $L_{jx} = \frac{D_{ij} + D_{jz} - D_{iz}}{2}$
    delete i and j from the table, add x to the table
    N = N-1
end of while
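The pseudo code translates fairly directly into a runnable sketch. The version below follows the update rule D_xk = (D_ik + D_jk)/2 as written on the slides (a commonly used variant also subtracts D_ij) and recomputes the row sums A in every round. The names `neighbour_joining` and `x0`, `x1`, ... for new nodes are illustrative, not from the slides.

```python
PAIRS = {
    ("1", "2"): 0.015, ("1", "3"): 0.045, ("1", "4"): 0.143, ("1", "5"): 0.198,
    ("2", "3"): 0.03, ("2", "4"): 0.126, ("2", "5"): 0.179,
    ("3", "4"): 0.092, ("3", "5"): 0.179, ("4", "5"): 0.179,
}
NAMES = ["1", "2", "3", "4", "5"]

def as_matrix(names, pairs):
    """Expand the upper-triangular table into a symmetric dict of dicts."""
    D = {a: {} for a in names}
    for (a, b), value in pairs.items():
        D[a][b] = D[b][a] = value
    return D

def neighbour_joining(names, D):
    """Return the tree as a list of (node, neighbour, branch_length) edges."""
    D = {a: dict(row) for a, row in D.items()}       # work on a copy
    nodes = list(names)
    edges = []
    created = 0
    while len(nodes) > 2:
        n = len(nodes)
        A = {i: sum(D[i][k] for k in nodes if k != i) for i in nodes}
        pairs = [(nodes[a], nodes[b]) for a in range(n) for b in range(a + 1, n)]
        # select the pair with the minimum rate-corrected distance M_ij
        i, j = min(pairs, key=lambda p: D[p[0]][p[1]] - (A[p[0]] + A[p[1]]) / (n - 2))
        others = [k for k in nodes if k not in (i, j)]
        d_iz = sum(D[i][k] for k in others) / len(others)
        d_jz = sum(D[j][k] for k in others) / len(others)
        x = "x%d" % created
        created += 1
        edges.append((i, x, (D[i][j] + d_iz - d_jz) / 2))   # L_ix
        edges.append((j, x, (D[i][j] + d_jz - d_iz) / 2))   # L_jx
        D[x] = {}
        for k in others:                                     # D_xk
            D[x][k] = D[k][x] = (D[i][k] + D[j][k]) / 2
        nodes = others + [x]                                 # N = N - 1
    i, j = nodes
    edges.append((i, j, D[i][j]))                            # last remaining branch
    return edges

edges = neighbour_joining(NAMES, as_matrix(NAMES, PAIRS))
```

For five sequences the result has seven edges, the number of branches in an unrooted binary tree with five leaves.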
Merit and demerit
Maximum parsimony: makes full use of the nucleotide information. When there are few species, MP will find the globally optimal tree; when there are many species, its performance is restricted.
Maximum likelihood: makes full use of the nucleotide information, but is highly dependent on the nucleotide substitution model, and its performance is the worst.
Neighbour joining: the fastest algorithm of all, but it sometimes gets the wrong topology.
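Transform the problem to TSP
[Figure: a tree with leaves A, B, C, D and internal nodes x, y, z; the six branches have lengths 1, 1, 2, 2, 2, 2. Walking around the tree visits the nodes in the order z x y A y B y x C x z D.]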
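[Figure: the walk z x y A y B y x C x z D around the tree; the leaves A, B, C, D are the unshaded nodes.]
Add up the edges between consecutive unshaded nodes. This gives the circuit A, B, C, D with edge lengths A-B = 3, B-C = 5, C-D = 6, D-A = 6.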
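[Figure: the circuit A, B, C, D with edge lengths 3, 5, 6, 6.]
     A   B   C   D
A    0   3   4   6
B    3   0   5   7
C    4   5   0   6
D    6   7   6   0
[Figure: the complete graph on A, B, C, D with edge weights AB = 3, AC = 4, AD = 6, BC = 5, BD = 7, CD = 6.]
The circuit is one of the Hamiltonian circuits of the complete graph.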
Now assume that we have a Hamiltonian circuit. Can we construct the phylogenetic tree from it?
[Figure: the circuit A, B, C, D (edge lengths 3, 5, 6, 6) is converted step by step into a tree by introducing the internal nodes y, x and z in turn.]
So the question is transformed into finding the minimum Hamiltonian circuit of a given complete graph.
[Figure: the complete graph on A, B, C, D, the circuit A, B, C, D selected from it, and the tree with internal nodes x, y, z built from the circuit.]
Step 1: distance matrix
Step 2: TSP
Step 3: construct tree
Algorithms for solving the TSP:
1 ant colony optimization, ACO
2 particle swarm optimization, PSO
3 genetic algorithm, GA
4 simulated annealing, SA
5 artificial bee colony algorithm, ABC
6 approximation algorithms: NN, ShortestLink, Insert heuristic
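For the four-leaf example, step 2 is small enough to solve exactly. The sketch below compares an exhaustive search with the nearest-neighbour (NN) heuristic from the list; the function names are illustrative, and the distances are those of the A, B, C, D matrix.

```python
from itertools import permutations

DIST = {
    ("A", "B"): 3, ("A", "C"): 4, ("A", "D"): 6,
    ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 6,
}

def d(u, v):
    """Symmetric lookup in the distance matrix."""
    return DIST[(u, v)] if (u, v) in DIST else DIST[(v, u)]

def tour_length(order):
    """Length of the closed circuit that visits `order` and returns to the start."""
    return sum(d(order[k], order[(k + 1) % len(order)]) for k in range(len(order)))

def brute_force_tour(cities):
    """Try every circuit that starts at the first city; only feasible for few cities."""
    first, rest = cities[0], cities[1:]
    return min(((first,) + perm for perm in permutations(rest)), key=tour_length)

def nearest_neighbour_tour(cities, start):
    """NN heuristic: always move to the closest city not yet visited."""
    tour = [start]
    remaining = set(cities) - {start}
    while remaining:
        nxt = min(sorted(remaining), key=lambda c: d(tour[-1], c))
        tour.append(nxt)
        remaining.remove(nxt)
    return tuple(tour)

CITIES = ("A", "B", "C", "D")
best = brute_force_tour(CITIES)
greedy = nearest_neighbour_tour(CITIES, "A")
```

Here both approaches return the circuit A, B, C, D of length 3 + 5 + 6 + 6 = 20. On larger inputs NN is only an approximation and can miss the minimum circuit.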
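[Figure: the tree with leaves A, B, C, D and internal nodes x, y, z, whose branch lengths are to be determined.]
     A   B   C   D
A    0   3   4   6
B    3   0   5   7
C    4   5   0   6
D    6   7   6   0

Each entry of the matrix is the sum of the branch lengths on the path between two leaves:
zD + xz + xC = 6
zD + xz + xy + By = 7
xC + xy + By = 5
zD + xz + xy + Ay = 6
xC + xy + Ay = 4
By + Ay = 3

These equations are satisfied by:
zD = 2
xz = 2
xC = 2
xy = 1
By = 2
Ay = 1
[Figure: the resulting tree with leaves A, B, C, D, internal nodes x, y, z, and these branch lengths.]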
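The branch lengths can be checked by recomputing every leaf-to-leaf path in the tree and comparing it with the matrix. This is a small verification sketch; the edge list encodes the tree from the figure, and the helper name `path_length` is illustrative.

```python
# Branch lengths of the tree: A-y, B-y, x-y, C-x, x-z, D-z.
EDGES = {
    ("A", "y"): 1, ("B", "y"): 2, ("x", "y"): 1,
    ("C", "x"): 2, ("x", "z"): 2, ("D", "z"): 2,
}
MATRIX = {
    ("A", "B"): 3, ("A", "C"): 4, ("A", "D"): 6,
    ("B", "C"): 5, ("B", "D"): 7, ("C", "D"): 6,
}

def path_length(start, goal):
    """Sum of branch lengths on the unique path between two nodes of the tree."""
    adjacency = {}
    for (u, v), w in EDGES.items():
        adjacency.setdefault(u, []).append((v, w))
        adjacency.setdefault(v, []).append((u, w))
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adjacency[node]:
            if nxt != parent:            # a tree has no cycles, so this suffices
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")

mismatches = [pair for pair, value in MATRIX.items() if path_length(*pair) != value]
total_tree_length = sum(EDGES.values())
```

All six paths match the matrix, and the circuit A, B, C, D of length 20 is exactly twice the total tree length of 10, because the walk around the tree uses every branch twice.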
Thank you !
maoyaozong