An Improved Algorithm for the Macro-evolutionary Phylogeny Problem
Ilia Flax 015600042
Lecture Outline
1. Biological Motivation
2. Introduction.
3. Description of the model and a formal definition of the problem.
4. Combinatorial properties of the optimal answers which will be useful for our algorithm design.
5. Improved algorithm for solving the macro-evolutionary phylogeny problem.
6. Improvement in the case of the unit cost duplication and loss events.
7. Conclusion.
Biological Motivation
Nucleotide
A group of organic compounds. Nucleotides appear in all living beings and make up the genetic material.
Biological Motivation
Nucleic Acid
Composed of a chain of nucleotides.
DNA - Deoxyribonucleic Acid
A huge nucleic acid molecule that contains all of the information for the construction of all the proteins in an organic cell.
Biological Motivation
Protein
A group of organic compounds. Proteins are found in the cells of all living beings, in almost every structure of the cell. The characters of an organism are a direct result of protein activity.
Biological Motivation
Gene
A data unit that is passed from an organism to its offspring through the DNA. The genes contain the "manufacturing instructions" for tens of thousands of cell proteins. Physically, the genes are segments of the DNA.
Biological Motivation
Evolution
The process of genetic change in a
population of organisms over time.
Using the process of evolution we can
explain:
The biological diversity that exists in
morphological, physical and behavioral
characters.
The complexity of life.
The creation of new taxonomic species.
Evolution researchers struggle to
understand the likelihood of evolutionary
processes and to know which of them
actually took place.
Biological Motivation
Human Evolution
Biological Motivation
Phylogenies are evolutionary histories
Systematics is “an analytic approach to understanding the diversity of relationships of organisms” p. 491, Campbell & Reece (2005)
“Taxonomy is the ordered division of organisms into categories based on… similarities and differences.” p. 495, Campbell & Reece (2005)
Shown is a phylogenetic tree
Biological Motivation
Micro-Evolution
The occurrence of small-scale changes in
a population, over a few generations
These changes may be due to several
processes: mutation, natural selection,
gene flow and genetic drift, the
randomness of mating within populations.
For example bacterial strains that have
antibiotic resistance.
Biological Motivation
Macro-Evolution
The large-scale patterns, trends, and rates of change (i.e. gene frequencies) among living organisms over long periods of time, consisting of extended microevolution.
The difference is largely one of
approach.
The evidence comes from 2 main
sources: fossils and comparisons
between living organisms.
Morphological Divergence
Change from the body form of a common ancestor
Produces homologous structures
[Figure: homologous structures, with parts numbered 1 to 5, compared across an early reptile, pterosaur, chicken, bat, porpoise, penguin, and human]
Biological Motivation
Fossils
Developmental Patterns
[Figure: developmental patterns compared for FISH, REPTILE, BIRD, MAMMAL]
Introduction
The goal of evolutionary biology:
reconstruction of the evolutionary history.
Evolutionary history tree - phylogenetic tree.
The internal nodes - ancestral species
leaves - current species.
DNA sequences are used as characters to estimate phylogenetic trees.
Related genes evolve through the process of gene duplication.
Copies of a duplicated gene can evolve distinct variations: paralogues.
Introduction
A single species may contain none, one, or several copies of what was a single gene in an ancestor.
Building a phylogenetic tree requires knowing which copies of a gene are to be compared.
gene tree
Evolution of a set of genes
Species tree
Evolution of species
Determination of evolutionary history:
micro-evolutionary events
macro-evolutionary events.
[Figure: tree of life spanning ARCHAEBACTERIA, EUBACTERIA, PROTISTANS, FUNGI, PLANTS, and ANIMALS, from the molecular origin of life to the crown of eukaryotes (rapid divergences)]
[Figure: Gene Tree, with a DNA sequence at each leaf]
Outline
2005: Durand
An algorithm to find the tree that optimizes a macro-evolutionary criterion.
Complexity of the algorithm is $O(nm^2)$, where
n - number of species
m - maximum number of copies of the gene.
Our algorithm:
$O(nm)$.
$O(n)$ for unit costs for loss and duplication of a gene.
Durand’s Algorithm
Two-phase approach to gene tree reconstruction: first sequence evolution, then gene duplication (or loss), for the reconstruction of phylogenies.
Dynamic programming algorithm: finds phylogenies using a macro-evolutionary model of gene duplication and loss.
Input: number of genes in each species
Output: A tree with the fewest duplications and losses.
Durand’s Algorithm
1. A tree is constructed using the sequence evolution operations considering micro-evolutionary events only.
2. Some regions of the tree are refined with respect to a macro-evolutionary model.
Macro-evolutionary events are used only for explaining the areas where the sequence data cannot resolve the topology
The total search space is reduced.
The presented algorithm
Improvement by a factor of $O(m)$.
This is important: a gene family can have hundreds of duplicates.
Uses combinatorial properties of the structure of optimal histories.
Applies these properties to the macro-evolutionary phylogeny problem.
Problem Description
A single current species may have zero, one, or several copies of what was a single gene in an ancestor.
Macro-evolutionary phylogeny problem:
Explain the different multiplicities by duplications and losses.
Infinitely many different histories can generate a phylogenetic tree.
Interesting histories:
Those with the fewest losses and duplications.
Problem Description
The optimization criterion is based on the cost of duplication and loss.
$c_\delta$: cost of a duplication.
$c_\lambda$: cost of a loss.
The score of a gene tree is
$\text{D/L Score} = c_\lambda L + c_\delta D$
$D$: number of duplications
$L$: number of losses
[Figure: Species Tree over Human, Chimp, Gorilla, and Iliagutan, annotated with the number of gene copies at each node]
Problem Description
A history is presented as a species tree.
Each node is annotated with its multiplicity.
The multiplicity of the root is one.
The multiplicities of the leaves, $m_1, \dots, m_s$, are specified in the input.
Duplication
increases the number of gene copies in the species by one.
Loss
decreases the number of gene copies in the species associated with a node by one.
Problem Description
i gene copies exist in species x and it passes j
copies to each of its children.
Explaining the data:
If j > i, j −i duplications.
If j < i, i−j losses.
If i = j, no loss or duplication.
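The rule above can be written as a small helper. This is a minimal sketch: the function names `events` and `event_cost` and the parameter names `c_dup` and `c_loss` (standing for the duplication and loss costs) are illustrative and do not come from the slides.

```python
from typing import Tuple


def events(i: int, j: int) -> Tuple[int, int]:
    """Return (duplications, losses) needed to go from i to j gene copies."""
    if j > i:
        return (j - i, 0)
    if j < i:
        return (0, i - j)
    return (0, 0)


def event_cost(i: int, j: int, c_dup: float, c_loss: float) -> float:
    """Weighted cost of the events between i entering and j passed copies."""
    dups, losses = events(i, j)
    return c_dup * dups + c_loss * losses


print(events(1, 3))                 # (2, 0): two duplications
print(events(4, 2))                 # (0, 2): two losses
print(event_cost(1, 3, 2.0, 1.0))   # 4.0
```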
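Problem Description
[Figure: three candidate histories, a), b), and c), for a species tree with leaves frog, mouse, and human; the leaf multiplicities shown are m=2, m=1, and m=2, and every node is annotated with its number of gene copies (1 or 2)]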
Macro-evolutionary Phylogeny Problem
Input: A rooted species tree $T_S$ with $s$ leaves;
a list of multiplicities $m_1, \dots, m_s$, where $m_l$ is the number of gene family members found in species $l$;
weights $c_\lambda$ and $c_\delta$.
Output: The set of all rooted gene trees $\{T_G\}$ such that the D/L Score of $T_G$ is minimal.
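Macro-evolutionary Phylogeny Problem
[Figure: the input, a species tree $T_S$ with leaf multiplicities $m_1, m_2, \dots, m_s$ and the costs, and the output, the set $\{T_G\}$ of trees with minimal D/L Score, each with root multiplicity 1 and a copy count at every internal node]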
Problem Description
Durand’s dynamic programming algorithm:
1. Fill a table Cost[v, i, j]:
   a. it holds the minimum D/L score for the sub-tree rooted at v,
   b. where v has i entering copies
   c. and v passes j gene copies to its children;
   d. 1 ≤ i, j ≤ m, where m is the maximum multiplicity for a leaf of the tree.
2. Reconstruct the gene trees using the table.
3. The complexity is $O(nm^2)$ for giving one optimal history and $O(nm^2 + nmk)$ for reporting k optimal histories.
Properties of Optimal Histories
Definition 1.
Let $T$ be a tree with a given list of multiplicities for its leaves.
$g(x,T)$ is the minimum D/L score for a duplication/loss history of tree $T$, where
x is the number of gene copies entering the root of the tree.
In an optimal history, at a given node there is either no event, or a first duplication (or loss) event followed by an optimal history:
$$g(x,T) = \min \begin{cases} g(x+1,T) + c_\delta & (1) \\ g(x-1,T) + c_\lambda & (2) \\ g(x,T_L) + g(x,T_R) & (3) \end{cases} \quad (\#)$$
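Properties of Optimal Histories
$$g(x,T) = \min \begin{cases} g(x+1,T) + c_\delta & (1) \\ g(x-1,T) + c_\lambda & (2) \\ g(x,T_L) + g(x,T_R) & (3) \end{cases} \quad (\#)$$
[Figure: a tree whose root C has the two leaves A and B; the leaf multiplicities are 4 and 2]
For every vertex we save a vector holding the value of the g function for i entering gene copies (here i = 1, ..., 4).
$c_\lambda = c_\delta = 1$
Leaf with multiplicity 2: g={1,0,1,2}
Leaf with multiplicity 4: g={3,2,1,0}
g(1,c)=min(g(2,c)+1,g(1,a)+g(1,b))
g(2,c)=min(g(1,c)+1,g(3,c)+1,g(2,a)+g(2,b))
g(3,c)=min(g(2,c)+1,g(4,c)+1,g(3,a)+g(3,b))
g(4,c)=min(g(3,c)+1,g(4,a)+g(4,b))
Evaluating these gives g={3,2,2,2} for C.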
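The recurrences for the parent node in the example can be checked mechanically. The sketch below assumes unit costs and takes the two leaf vectors shown on the slide as given. It starts from the sums of the children's values and repeatedly applies the duplication and loss options until nothing changes. The function name `relax` is illustrative and not from the slides.

```python
from typing import List


def relax(g_left: List[int], g_right: List[int],
          c_dup: int = 1, c_loss: int = 1) -> List[int]:
    """Solve the recurrence for the parent on x = 1..m by repeated relaxation."""
    m = len(g_left)
    # Option (3): pass x copies unchanged to both children.
    g = [a + b for a, b in zip(g_left, g_right)]
    changed = True
    while changed:
        changed = False
        for k in range(m):  # k is the 0-based index of x = k + 1
            best = g[k]
            if k + 1 < m:   # option (1): duplicate, continue with x + 1 copies
                best = min(best, g[k + 1] + c_dup)
            if k > 0:       # option (2): lose a copy, continue with x - 1 copies
                best = min(best, g[k - 1] + c_loss)
            if best < g[k]:
                g[k] = best
                changed = True
    return g


g_four = [3, 2, 1, 0]  # leaf with multiplicity 4, as on the slide
g_two = [1, 0, 1, 2]   # leaf with multiplicity 2, as on the slide
print(relax(g_four, g_two))  # [3, 2, 2, 2]
```

Only the entry for one entering copy improves on the plain sum (from 4 to 3): duplicating once at the parent is cheaper than passing a single copy to both leaves.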
Properties of Optimal Histories
1. $c_\lambda, c_\delta > 0$.
2. For a sufficiently large $N$, $g(x,T_L) + g(x,T_R) < g(x+N,T) + Nc_\delta$ and $g(x,T_L) + g(x,T_R) < g(x-N,T) + Nc_\lambda$.
   The recurrences are finite for any tree $T$ with sub-trees $T_L$ and $T_R$.
3. $g(x,T) \le g(x+k,T) + kc_\delta$ and $g(x,T) \le g(x-k,T) + kc_\lambda$.
Properties of Optimal Histories
Definition 2. For a given tree T and a given list of multiplicities for its leaves, $OPT(T)$ is defined as the set of all integers x such that for any integer $x'$, $g(x,T) \le g(x',T)$. This optimal cost itself is denoted by $opt(T)$.
Lemma 1. Let $T$ be a binary tree and $T_L$ ($T_R$) be its left (right) sub-tree. Then
$opt(T) \ge opt(T_L) + opt(T_R)$.
The optimal history of $T$ includes the optimal histories of $T_L$ and $T_R$.
We will prove that $OPT(T)$ is an integer interval for any tree $T$.
Properties of Optimal Histories
Proposition 1.
For any tree T with given input multiplicities for the leaves, $OPT(T)$ is an integer interval.
By Proposition 1, the function $g(x,T)$ is minimum in an integer interval, which is denoted by $OPT(T) = [x_1, x_2]$.
We will show that the function $g(x,T)$ is strictly decreasing for $x < x_1$ and strictly increasing for $x > x_2$.
Properties of Optimal Histories
Proposition 2.
Let T be a binary tree with given multiplicities for leaves and let $OPT(T)$ be $[x_1, x_2]$. The function $g(x,T)$ is strictly decreasing for all x smaller than $x_1$ and strictly increasing for all x larger than $x_2$.
Proposition 3.
Let T be a binary tree with given multiplicities for leaves. $g(x,T)$ is a convex function.
Properties of Optimal Histories
Optimal interval
Properties of Optimal Histories
The general structure of $g(x,T)$:
1. $g(x,T)$ is first strictly decreasing, then it takes its minimum on an interval, and then it is strictly increasing.
2. The function is convex; its slope $g'(x,T)$ is increasing.
3. For large values of x, $g'(x,T) = c_\lambda$.
4. For small values of x, $g'(x,T) = -c_\delta$.
It would be convenient to extend the definition of the function $g(x,T)$ to negative values of x.
Rather than proving the above three propositions separately, we prove them all together by induction on the size of the tree.
Properties of Optimal Histories
Proof
The proof is done by induction on the size of the tree.
1. As the base step, let us consider a tree T which has one leaf with multiplicity p. In this case it is easy to verify that
$$g(x,T) = \begin{cases} 0 & \text{if } x = p \\ (p-x)\,c_\delta & \text{if } x < p \\ (x-p)\,c_\lambda & \text{if } x > p \end{cases}$$
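The base case is simple enough to evaluate directly. The following sketch tabulates g for a single leaf and checks convexity by comparing successive differences. The names `leaf_g` and `is_convex` and the cost values 2 and 5 are illustrative assumptions, not taken from the slides.

```python
from typing import List


def leaf_g(x: int, p: int, c_dup: float, c_loss: float) -> float:
    """g(x, T) for a tree T that is a single leaf with multiplicity p."""
    if x < p:
        return (p - x) * c_dup   # p - x duplications are needed
    if x > p:
        return (x - p) * c_loss  # x - p losses are needed
    return 0


def is_convex(values: List[float]) -> bool:
    """A sequence is convex when its successive differences never decrease."""
    slopes = [b - a for a, b in zip(values, values[1:])]
    return all(s1 <= s2 for s1, s2 in zip(slopes, slopes[1:]))


values = [leaf_g(x, 3, c_dup=2, c_loss=5) for x in range(0, 7)]
print(values)             # [6, 4, 2, 0, 5, 10, 15]
print(is_convex(values))  # True
```

The differences are -2 to the left of p and +5 to the right of it, so the function decreases strictly, reaches its minimum only at x = p, and then increases strictly.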
Properties of Optimal Histories
2. All three propositions are true for this function.
3. Now suppose that the three propositions are true for any tree with strictly fewer than k leaves (k > 1), and consider a tree T with k leaves.
4. Both the left and right sub-trees of T, which are denoted by $T_L$ and $T_R$, have fewer than k leaves, so the three propositions are true for them by the induction hypothesis.
Properties of Optimal Histories
5. Consider an interval I of integers where $g(x,T_L)$ is non-decreasing and $g(x,T_R)$ is non-increasing.
6. The main part of the proof is that we show:
$$\forall x \in I:\; g(x,T) = g(x,T_L) + g(x,T_R) \quad (*)$$
a. Let us prove this by contradiction. Suppose (*) is not true.
b. Then there exists x in I such that
$$g(x,T) < g(x,T_L) + g(x,T_R)$$
Properties of Optimal Histories
c. By (#), w.l.o.g. we suppose that $g(x,T) = g(x+1,T) + c_\delta$. The number of consecutive duplications at a node in an optimal generation is finite.
d. So there exists u such that:
$$g(x,T) = g(x+u,T) + uc_\delta$$
$$g(x+u,T) = g(x+u,T_L) + g(x+u,T_R)$$
Properties of Optimal Histories
$$g(x+u,T) + uc_\delta < g(x,T_L) + g(x,T_R)$$
$$g(x+u,T_L) + g(x+u,T_R) + uc_\delta < g(x,T_L) + g(x,T_R)$$
$$uc_\delta < \underbrace{g(x,T_L) - g(x+u,T_L)}_{\le 0} + \underbrace{g(x,T_R) - g(x+u,T_R)}_{\le uc_\delta}$$
$$uc_\delta < uc_\delta$$
Contradiction!
Properties of Optimal Histories
If $g(x,T) = g(x-1,T) + c_\lambda < g(x,T_L) + g(x,T_R)$, a similar contradiction can be obtained.
Symmetrically, if $g(x,T_L)$ is decreasing and $g(x,T_R)$ is increasing in an interval, the equality is correct.
As a consequence of this equality, $g(x,T)$ is convex in the interval: the sum of two convex functions is convex.
Note that if the optimal generating intervals for $T_L$ and $T_R$ are $[l_1, l_2]$ and $[r_1, r_2]$ respectively, then in the interval $[\min\{l_1,r_1\}, \max\{l_2,r_2\}]$ the equality (*) is correct.
Properties of Optimal Histories
Now let us consider the interval $[\max\{l_2, r_2\}+1, \infty)$.
In this interval both $g(x,T_L)$ and $g(x,T_R)$ are strictly increasing by the induction hypothesis.
It is easy to verify that $g(x,T) < g(x+1,T) + c_\delta$, and so $g(x,T)$ is equal to the minimum of $g(x-1,T) + c_\lambda$ and $g(x,T_L) + g(x,T_R)$.
If for all values of x in this interval we have $g(x,T) = g(x,T_L) + g(x,T_R)$, then $g(x,T)$ becomes strictly increasing and convex as the sum of two strictly increasing and convex functions.
Properties of Optimal Histories
Otherwise, consider the first value $x_0$ in the interval such that
$$g(x_0,T) = g(x_0-1,T) + c_\lambda < g(x_0,T_L) + g(x_0,T_R)$$
By the way we defined $x_0$, we have
$$g(x_0-1,T) = g(x_0-1,T_L) + g(x_0-1,T_R)$$
Consequently, we have
$$g'(x_0-1,T_L) + g'(x_0-1,T_R) > c_\lambda$$
On the other hand $g(x,T_L)$ and $g(x,T_R)$ are convex and so $g'(x,T_L)$ and $g'(x,T_R)$ are increasing.
So, $\forall x \ge x_0:\; g(x,T) = g(x-1,T) + c_\lambda$. The function $g(x,T)$ is strictly increasing and convex for any $x \ge x_0$, so $g(x,T)$ is convex and strictly increasing for any x in the interval.
Properties of Optimal Histories
Similarly, we can show that in an interval where both $g(x,T_L)$ and $g(x,T_R)$ are strictly decreasing, $g(x,T)$ is strictly decreasing and convex.
In order to complete the proof we need to consider the different possible configurations of $g(x,T_L)$ and $g(x,T_R)$.
The next figure shows the three possible arrangements of the optimal intervals of the two functions.
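Properties of Optimal Histories
[Figure: the three possible arrangements, a), b), and c), of the optimal intervals of the two functions, with the x-axis divided into intervals numbered 1 to 5]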
Properties of Optimal Histories
In all three cases intervals 1 and 5 refer to the two outer intervals in the proof, where both functions are strictly decreasing and strictly increasing, respectively.
Interval 2 and interval (3,4) refer to the interval I in the proof.
The proof of convexity at the exchange points of these intervals is easy and is omitted.
It is also easy to show (by Lemma 1) that in cases (a) and (b) the optimal interval of T is interval 3.
In case (c) the optimal interval is an interval which is included in interval 3.
Algorithm
We present an algorithm for computing the optimal D/L score histories for a given tree.
The algorithm fills the table $g[x,T]$ for any $1 \le x \le m$ and for all sub-trees of T.
$g(x,T)$ is not computed for non-positive values of x because it is not biologically meaningful.
On the other hand, an optimal solution never has more than m genes present in a species, so there is no need to compute $g(x,T)$ for $x > m$.
Algorithm
Algorithm GenCost(tree T)
1. if T is a leaf then
   1.1 for i ← 1 to m do
       1.1.1 if i ≥ label(T) then g[i, T] ← (i − label(T)) × cλ
       1.1.2 if i < label(T) then g[i, T] ← (label(T) − i) × cδ
   1.2 exit
2. GenCost(TL); GenCost(TR);
3. [l1, l2] ← OPT(TL); [r1, r2] ← OPT(TR)
4. t1 ← min{l1, r1}; t2 ← max{l2, r2}
5. for i ← t1 to t2 do g[i, T] ← g[i, TL] + g[i, TR]
6. for i ← t2+1 to m do g[i, T] ← min{g[i−1, T] + cλ, g[i, TL] + g[i, TR]}
7. for i ← t1−1 downto 1 do g[i, T] ← min{g[i+1, T] + cδ, g[i, TL] + g[i, TR]}
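The pseudocode translates fairly directly into Python. The sketch below is one possible rendering under some assumptions that are not in the slides: a tree is encoded as either an integer (the multiplicity of a leaf) or a pair `(left, right)`, every leaf multiplicity lies between 1 and m, and the function returns only the vector for the root rather than a table for all sub-trees. The names `gen_cost`, `opt_interval`, `c_dup`, and `c_loss` are illustrative.

```python
from typing import List, Tuple, Union

Tree = Union[int, tuple]  # leaf multiplicity, or (left_subtree, right_subtree)


def opt_interval(g: List[float]) -> Tuple[int, int]:
    """First and last 1-based position where the vector g is minimal."""
    best = min(g)
    positions = [i + 1 for i, value in enumerate(g) if value == best]
    return positions[0], positions[-1]


def gen_cost(tree: Tree, m: int, c_dup: float, c_loss: float) -> List[float]:
    """Return [g(1, T), ..., g(m, T)] following the steps of GenCost."""
    if isinstance(tree, int):  # step 1: leaf with multiplicity `tree`
        p = tree
        return [(i - p) * c_loss if i >= p else (p - i) * c_dup
                for i in range(1, m + 1)]
    left, right = tree
    gl = gen_cost(left, m, c_dup, c_loss)    # step 2
    gr = gen_cost(right, m, c_dup, c_loss)
    l1, l2 = opt_interval(gl)                # step 3
    r1, r2 = opt_interval(gr)
    t1, t2 = min(l1, r1), max(l2, r2)        # step 4
    g = [0.0] * (m + 1)                      # index 0 is unused
    for i in range(t1, t2 + 1):              # step 5: plain sum
        g[i] = gl[i - 1] + gr[i - 1]
    for i in range(t2 + 1, m + 1):           # step 6: a loss may be cheaper
        g[i] = min(g[i - 1] + c_loss, gl[i - 1] + gr[i - 1])
    for i in range(t1 - 1, 0, -1):           # step 7: a duplication may be cheaper
        g[i] = min(g[i + 1] + c_dup, gl[i - 1] + gr[i - 1])
    return g[1:]


# Two leaves with multiplicities 4 and 2, unit costs.
print(gen_cost((4, 2), 4, 1, 1))                      # [3, 2, 2, 2]
# Four leaves 3, 1, 4, 2 with duplication cost 1.0 and loss cost 0.75.
print(gen_cost(((3, 1), (4, 2)), 4, 1.0, 0.75))       # [4.75, 3.75, 3.25, 3.75]
```

Each internal node does O(m) work, which matches the O(mn) bound stated for the algorithm.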
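Algorithm Example
[Figure: tree with root A, entered with 1 gene copy, whose children are B and C; B has the leaves D and E, and C has the leaves F and G. Leaf multiplicities: D=3, E=1, F=4, G=2]
Loss: cλ = X
Duplication: cδ = Y

g(i,T) | A    | B     | C     | D  | E  | F  | G
1      | 4Y+X | 2Y    | 3Y    | 2Y | 0  | 3Y | 1Y
2      | 3Y+X | 1Y+1X | 2Y    | 1Y | 1X | 2Y | 0
3      | 1Y+3X| 2X    | 1Y+1X | 0  | 2X | 1Y | 1X
4      | 5X   | 3X    | 2X    | 1X | 3X | 0  | 2X

Optimal intervals of the leaves: D {3,3}, E {1,1}, F {4,4}, G {2,2}
Ranges [t1, t2] filled by summation: B {1,3}, C {2,4}
With 0.5Y<X<Y: optimal intervals B {3,3}, C {4,4}; range [t1, t2] for A: {3,4}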
Algorithm
The correctness of the algorithm is a result of the proof of the propositions.
The complexity of the algorithm is $O(mn)$.
Once table g has been determined by the algorithm, finding an optimal D/L score history is easy.
Starting with $g[1,T]$, at each step one checks how g is minimized.
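The traceback step can be sketched as follows. The code assumes the table g is already filled and stored per node as a list indexed by x - 1. The dictionaries `children`, `mult`, and `g`, the node names, and the assignment of multiplicity 4 to leaf A and 2 to leaf B are hypothetical choices made for this illustration. Exact equality tests are used, which is safe for integer costs but would need a tolerance for arbitrary floating-point costs.

```python
from typing import Dict, List, Optional, Tuple


def traceback(node: str, x: int, g: Dict[str, List[int]],
              children: Dict[str, Optional[Tuple[str, str]]],
              mult: Dict[str, int], c_dup: int, c_loss: int,
              events: List[Tuple[str, str]]) -> None:
    """Append the events of one optimal history with x copies entering node."""
    if children[node] is None:  # leaf: adjust x to the observed multiplicity
        p = mult[node]
        events += [(node, "dup")] * max(p - x, 0)
        events += [(node, "loss")] * max(x - p, 0)
        return
    left, right = children[node]
    m = len(g[node])
    while True:
        here = g[node][x - 1]
        if here == g[left][x - 1] + g[right][x - 1]:
            break                                    # pass x copies to the children
        if x < m and here == g[node][x] + c_dup:
            events.append((node, "dup"))             # first event is a duplication
            x += 1
        elif x > 1 and here == g[node][x - 2] + c_loss:
            events.append((node, "loss"))            # first event is a loss
            x -= 1
        else:
            raise ValueError("table g is inconsistent with the recurrence")
    traceback(left, x, g, children, mult, c_dup, c_loss, events)
    traceback(right, x, g, children, mult, c_dup, c_loss, events)


children = {"C": ("A", "B"), "A": None, "B": None}
mult = {"A": 4, "B": 2}
g = {"A": [3, 2, 1, 0], "B": [1, 0, 1, 2], "C": [3, 2, 2, 2]}
history: List[Tuple[str, str]] = []
traceback("C", 1, g, children, mult, 1, 1, history)
print(history)  # [('C', 'dup'), ('A', 'dup'), ('A', 'dup')]
```

The history contains three unit-cost events, which agrees with the value 3 stored for one copy entering the root.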
Algorithm
The total complexity will be $O(mn + nk)$ for computing k optimal answers.
Note that multiple optimal histories correspond to the nodes and values of x such that g is minimized by two lines of the recurrence.
$$g(x,T) = \min \begin{cases} g(x+1,T) + c_\delta & (1) \\ g(x-1,T) + c_\lambda & (2) \\ g(x,T_L) + g(x,T_R) & (3) \end{cases} \quad (\#)$$
Algorithm
In practice, depending on the values of $c_\lambda$ and $c_\delta$, this complexity can be improved.
The function $g(x,T)$ is convex.
$g'(x,T)$ is increasing.
$-c_\delta \le g'(x,T) \le c_\lambda$.
Store and update the function g just by computing the points at which $g'$ changes its value from x to x + 1.
This will reduce the complexity of the algorithm to $O(cn)$, where $c = c_\lambda + c_\delta$.
The complexity of the algorithm for generating k optimal histories is $O(n(\min\{m, c\} + k))$.
Algorithm
Unit Loss/Duplication Costs. When $c_\lambda = c_\delta = 1$, the function $g(x,T)$ becomes very simple.
If $OPT(T) = [k_1, k_2]$, $g(x,T)$ is constant in $[k_1, k_2]$, increasing with step 1 for $x > k_2$ and decreasing with step -1 for $x < k_1$.
The complexity of the algorithm will become $O(nk)$ for finding k optimal trees.
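With unit costs, the shape described above means that each node can be summarised by its optimal interval and its optimal cost alone. The sketch below is one way this could be exploited. The rule for merging two intervals is derived here from that shape and is an assumption of this example, not something stated on the slides. If the children's intervals overlap, the parent takes their intersection at no extra cost. If they are disjoint, the parent takes the gap between them and pays its length. A tree is again encoded as an integer leaf multiplicity or a pair `(left, right)`, and the function names are illustrative.

```python
from typing import Tuple, Union

Tree = Union[int, tuple]  # leaf multiplicity, or (left_subtree, right_subtree)


def unit_cost_opt(tree: Tree) -> Tuple[int, int, int]:
    """Return (k1, k2, opt) with OPT(T) = [k1, k2], assuming unit costs."""
    if isinstance(tree, int):
        return tree, tree, 0
    l1, l2, opt_left = unit_cost_opt(tree[0])
    r1, r2, opt_right = unit_cost_opt(tree[1])
    a, b = max(l1, r1), min(l2, r2)
    if a <= b:  # the intervals overlap: keep the intersection
        return a, b, opt_left + opt_right
    # disjoint intervals: any x in the gap is optimal, at the price of the gap
    return b, a, opt_left + opt_right + (a - b)


def unit_cost_g(x: int, tree: Tree) -> int:
    """g(x, T): the optimum plus the distance from x to the optimal interval."""
    k1, k2, opt = unit_cost_opt(tree)
    return opt + max(k1 - x, 0, x - k2)


print(unit_cost_opt((4, 2)))                          # (2, 4, 2)
print([unit_cost_g(x, (4, 2)) for x in range(1, 5)])  # [3, 2, 2, 2]
print(unit_cost_opt(((3, 1), (4, 2))))                # (2, 3, 4)
```

Each node is handled in constant time, so a single bottom-up pass over the n nodes suffices to obtain the optimal score.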
Conclusion
Combinatorial properties of the optimal D/L histories for a given species tree.
An improved algorithm for finding the optimal histories, faster by a factor of $O(m)$.
The improvement is $O(m^2)$ for the unit cost duplication/loss function.
The macro-evolutionary phylogeny problem has been shown to be useful and interesting for building phylogenies based on both macro- and micro-evolutionary processes.