an improved algorithm for the macro-evolutionary …michaluz/seminar/ilia_parsimony.pdf ·...

1 An Improved Algorithm for the Macro-evolutionary Phylogeny Problem Ilia Flax 015600042

Upload: truongkhuong

Post on 27-Jul-2018


An Improved

Algorithm for the

Macro-evolutionary

Phylogeny Problem

Ilia Flax 015600042


Lecture Outline

1. Biological Motivation.

2. Introduction.

3. Description of the model and a formal definition of the problem.

4. Combinatorial properties of the optimal answers which will be useful for our algorithm design.

5. Improved algorithm for solving the macro-evolutionary phylogeny problem.

6. Improvement in the case of the unit cost duplication and loss events.

7. Conclusion.


Biological Motivation

Nucleotide

A group of organic compounds. Nucleotides appear in all living beings and make up the genetic material.


Biological Motivation

Nucleic Acid

Composed of a chain of nucleotides.

DNA - Deoxyribonucleic Acid

A huge nucleic acid molecule that contains all of the information for the construction of all the proteins in an organic cell.


Biological Motivation

Protein

A group of organic compounds found in the cells of all living beings, in almost every structure of the cell. The characters of an organism are a direct result of protein activity.


Biological Motivation

Gene

A data unit that is passed from an organism to its offspring through the DNA. The genes contain the "manufacturing instructions" for tens of thousands of cell proteins. Physically, the genes are segments of the DNA.


Biological Motivation

Evolution

The process of genetic change in a population of organisms over time.

Using the process of evolution we can

explain:

The biological diversity that exists in

morphological, physical and behavioral

characters.

The complexity of life.

The creation of new taxonomic species.

Evolution researchers struggle to

understand the likelihood of evolutionary

processes and to know which of them

actually took place.


Biological Motivation

Human Evolution


Biological Motivation

Phylogenies are evolutionary histories.

Systematics is “an analytic approach to understanding the diversity of relationships of organisms” (p. 491, Campbell & Reece (2005)).

“Taxonomy is the ordered division of organisms into categories based on… similarities and differences.” (p. 495, Campbell & Reece (2005))

Shown is a phylogenetic tree.


Biological Motivation

Micro-Evolution

The occurrence of small-scale changes in

a population, over a few generations

These changes may be due to several

processes: mutation, natural selection,

gene flow and genetic drift, the

randomness of mating within populations.

For example bacterial strains that have

antibiotic resistance.


Biological Motivation

Macro-Evolution

The large-scale patterns, trends, and rates of change (i.e. gene frequencies) among living organisms over long periods of time.

It consists of extended microevolution. The difference is largely one of approach.

The evidence comes from two main sources: fossils and comparisons between living organisms.


Morphological Divergence

Change from the body form of a common ancestor.

Produces homologous structures.

[Figure: homologous structures, with corresponding parts numbered 1 to 5, in pterosaur, chicken, bat, porpoise, penguin, and human, compared with an early reptile]


Biological Motivation

Fossils


Developmental Patterns

FISH REPTILE BIRD MAMMAL


Introduction

The goal of evolutionary biology:

reconstruction of the evolutionary history.

Evolutionary history tree - phylogenetic tree.

The internal nodes - ancestral species

leaves - current species.

DNA sequences used as characters to

estimate phylogenetic trees.

Related genes evolve through the process of

gene duplication.

Copies of a duplicated gene can evolve into distinct variants, called paralogues.


Introduction

A single species may contain none, one, or several copies of what was a single gene in an ancestor.

Building a phylogenetic tree requires knowing which copies of a gene are to be compared.

gene tree

Evolution of a set of genes

Species tree

Evolution of species

Determination of evolutionary history:

micro-evolutionary events

macro-evolutionary events.


[Figure: Gene Tree. A tree of life rooted at the molecular origin of life, with the major groups EUBACTERIA, ARCHAEBACTERIA, PROTISTANS, FUNGI, PLANTS, and ANIMALS and a crown of eukaryotes (rapid divergences); there is a DNA sequence at each leaf.]


Outline

2005: Durand

An algorithm to find the tree that optimizes a macro-evolutionary criterion.

The complexity of the algorithm is $O(nm^2)$, where

n - the number of species

m - the maximum number of copies of the gene.

Our algorithm:

$O(nm)$.

$O(n)$ for unit costs for loss and duplication of a gene.


Durand’s Algorithm

Two-phase approach to gene tree reconstruction: first sequence evolution, then gene duplication (or loss), for the reconstruction of phylogenies.

Dynamic programming algorithm: finds phylogenies.

Uses a macro-evolutionary model of gene duplication and loss.

Input: number of genes in each species

Output: A tree with fewest duplications and losses.


Durand’s Algorithm

1. A tree is constructed using the sequence evolution operations considering micro-evolutionary events only.

2. Some regions of the tree are refined with respect to a macro-evolutionary model.

Macro-evolutionary events are used only for explaining the areas where the sequence data cannot resolve the topology

The total search space is reduced.


The presented algorithm

Improvement of $O(m)$.

This is important: a gene family can have hundreds of duplicates.

Use combinatorial properties of the structure of optimal histories.

Use these properties for the macro-evolutionary phylogeny problem.


Problem Description

A single current species may have zero, one,

or several copies of what was a single gene in

an ancestor.

Macro-evolutionary phylogeny problem:

Explain the different multiplicities by duplications and losses.

There are infinitely many different histories that generate a phylogenetic tree.

The interesting histories are those with the fewest losses and duplications.


Problem Description

The optimization criterion is based on the cost

of duplication and loss.

$c_\delta$: cost of a duplication.

$c_\lambda$: cost of a loss.

The score of a gene tree is

$$\text{Score}_{D/L} = c_\lambda L + c_\delta D$$

where $D$ is the number of duplications and $L$ is the number of losses.
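As a worked instance of the score, write $c_\delta$ for the cost of a duplication, $c_\lambda$ for the cost of a loss, $D$ for the number of duplications and $L$ for the number of losses. The numbers below are purely illustrative assumptions and do not come from the slides: $c_\delta = 2$, $c_\lambda = 1$, $D = 3$, $L = 2$.

```latex
\[
\text{Score}_{D/L} = c_\lambda L + c_\delta D = 1 \cdot 2 + 2 \cdot 3 = 8
\]
```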


[Figure: Species Tree over Human, Chimp, Gorilla, and Iliagutan; each node is labeled with its number of gene copies]


Problem Description

History is presented as a species tree

Each node is annotated with its multiplicity

The multiplicity of the root is one

The multiplicities of the leaves, $m_1, \ldots, m_s$, are specified in the input.

Duplication

increases the number of gene copies in the species by one.

Loss

decreases the number of gene copies of the species associated with a node.


Problem Description

i gene copies exist in species x and it passes j

copies to each of its children.

Explaining the data:

If j > i, j −i duplications.

If j < i, i−j losses.

If i = j, no loss or duplication.
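The three cases above can be sketched in code. This is a minimal illustration, not part of the original slides; the function names `edge_events` and `edge_cost` and the parameter names `c_dup` and `c_loss` are hypothetical.

```python
from typing import Tuple


def edge_events(i: int, j: int) -> Tuple[int, int]:
    """Return (duplications, losses) needed to go from i copies to j copies."""
    if j > i:
        return j - i, 0   # j - i duplications
    if j < i:
        return 0, i - j   # i - j losses
    return 0, 0           # i == j: no event


def edge_cost(i: int, j: int, c_dup: float, c_loss: float) -> float:
    """Weighted cost of the events on one node (c_dup, c_loss are assumed weights)."""
    dups, losses = edge_events(i, j)
    return c_dup * dups + c_loss * losses
```

For example, a node that receives 1 copy and passes on 3 needs two duplications, so `edge_events(1, 3)` returns `(2, 0)`.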


Problem Description

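[Figure: three histories, a), b), and c), for a species tree over frog, mouse, and human with leaf multiplicities of 1 and 2; each node is labeled with its number of gene copies]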


Macro-evolutionary Phylogeny Problem

Input: A rooted species tree $T_S$ with $s$ leaves;

a list of multiplicities $m_1, \ldots, m_s$, where $m_l$ is the number of gene family members found in species $l$;

weights $c_\delta$ and $c_\lambda$.

Output: The set of all rooted gene trees $\{T_G\}$ such that the D/L Score of $T_G$ is minimal.


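Macro-evolutionary Phylogeny Problem

[Figure: the input, a species tree $T_S$ with leaf multiplicities $m_1, m_2, \ldots, m_s$ and the two costs, and the output, the set $\{T_G\}$ of trees with minimal D/L Score; each output tree has root multiplicity 1 and annotated internal nodes]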


Problem Description

Durand’s dynamic programming algorithm:

1. Fill a table Cost[v, i, j]:

   1. it holds the minimum D/L score for the sub-tree rooted at v,

   2. where v has i entering copies

   3. and v passes j gene copies to its children;

   4. 1 ≤ i, j ≤ m, where m is the maximum multiplicity of a leaf of the tree.

2. Reconstruct the gene trees using the table.

3. The complexity is $O(nm^2)$ for giving one optimal history and $O(nm^2 + nmk)$ for reporting k optimal histories.


Properties of Optimal

Histories

Definition 1.

Given a tree $T$ and a list of multiplicities for its leaves, $g(x, T)$ is the minimum D/L score for a duplication/loss history of tree $T$, where $x$ is the number of entering copies of genes at the root of the tree.

In an optimal history, at a given node either there is no event, or a first duplication (or loss) event is followed by an optimal history:

$$g(x, T) = \min \begin{cases} g(x+1, T) + c_\delta & (1) \\ g(x-1, T) + c_\lambda & (2) \\ g(x, T_L) + g(x, T_R) & (3) \end{cases} \qquad (\#)$$


Properties of Optimal Histories

Example of recurrence (#): a node C whose children are the leaves A and B, with multiplicities 4 and 2 and unit costs $c_\lambda = c_\delta = 1$.

For every vertex we save a vector representing the value of the function g for i entering gene copies, i = 1, ..., 4.

Leaf with multiplicity 4: g={3,2,1,0}

Leaf with multiplicity 2: g={1,0,1,2}

g(1,c)=min(g(2,c)+1,g(1,a)+g(1,b))

g(2,c)=min(g(1,c)+1,g(3,c)+1,g(2,a)+g(2,b))

g(3,c)=min(g(2,c)+1,g(4,c)+1,g(3,a)+g(3,b))

g(4,c)=min(g(3,c)+1,g(4,a)+g(4,b))

Result for C: g={3,2,2,2}
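The two-leaf example with multiplicities 4 and 2 can be checked by applying the three-way recurrence directly. The sketch below is not the algorithm from the slides: it simply relaxes the recurrence until nothing changes, restricted to 1 ≤ x ≤ m. The names `leaf_g` and `combine` are hypothetical, and list index 0 stands for x = 1.

```python
from typing import List


def leaf_g(p: int, m: int, c_loss: int = 1, c_dup: int = 1) -> List[int]:
    """g(x, leaf) for x = 1..m, for a leaf with multiplicity p."""
    return [(x - p) * c_loss if x >= p else (p - x) * c_dup
            for x in range(1, m + 1)]


def combine(gl: List[int], gr: List[int],
            c_loss: int = 1, c_dup: int = 1) -> List[int]:
    """Apply the recurrence at an internal node by repeated relaxation."""
    # Start from "no event at this node": g(x, T_L) + g(x, T_R).
    g = [a + b for a, b in zip(gl, gr)]
    changed = True
    while changed:  # terminates because costs are positive and values only drop
        changed = False
        for x in range(len(g)):
            best = g[x]
            if x + 1 < len(g):            # a duplication first
                best = min(best, g[x + 1] + c_dup)
            if x > 0:                     # a loss first
                best = min(best, g[x - 1] + c_loss)
            if best < g[x]:
                g[x] = best
                changed = True
    return g


print(combine(leaf_g(4, 4), leaf_g(2, 4)))  # prints [3, 2, 2, 2]
```

The relaxation is deliberately naive; it only serves to confirm the vector for the parent node before the faster algorithm is introduced.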


Properties of Optimal

Histories

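1. $c_\lambda, c_\delta > 0$.

2. For a tree $T$ with sub-trees $T_L$ and $T_R$, applying line (1) or line (2) of (#) $k$ times gives

$$g(x, T) \le g(x+k, T) + k c_\delta$$

$$g(x, T) \le g(x-k, T) + k c_\lambda$$

3. For a sufficiently large $N$,

$$g(x, T_L) + g(x, T_R) < g(x+N, T) + N c_\delta \quad \text{and} \quad g(x, T_L) + g(x, T_R) < g(x-N, T) + N c_\lambda$$

The recurrences are finite for any tree $T$.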


Properties of Optimal

Histories

Definition 2. For a given tree T and a given list of multiplicities for its leaves, $OPT(T)$ is defined as the set of all integers x such that for any integer $x'$, $g(x, T) \le g(x', T)$. This optimal cost itself is denoted by $opt(T)$.

Lemma 1. Let $T$ be a binary tree and $T_L$ ($T_R$) be its left (right) sub-tree. Then

$$opt(T) \ge opt(T_L) + opt(T_R)$$

The optimal history of $T$ includes the optimal histories of $T_L$ and $T_R$.

We will prove that $OPT(T)$ is an integer interval for any tree $T$.


Properties of Optimal

Histories

Proposition 1.

For any tree T with given input multiplicities for the leaves, $OPT(T)$ is an integer interval.

By Proposition 1, the function $g(x, T)$ is minimum in an integer interval, which is denoted by $OPT(T) = [x_1, x_2]$.

We will show that the function $g(x, T)$ is strictly decreasing for $x < x_1$ and strictly increasing for $x > x_2$.


Properties of Optimal

Histories

Proposition 2.

Let T be a binary tree with given multiplicities for leaves and let $OPT(T)$ be $[x_1, x_2]$. The function $g(x, T)$ is strictly decreasing for all x smaller than $x_1$ and strictly increasing for all x larger than $x_2$.

Proposition 3.

Let T be a binary tree with given multiplicities for leaves. $g(x, T)$ is a convex function.


Properties of Optimal

Histories

Optimal interval


Properties of Optimal

Histories

The general structure of $g(x, T)$:

1. $g(x, T)$ is first strictly decreasing, then it takes its minimum on an interval, and then it is strictly increasing.

2. The function is convex; $\Delta g$ is increasing (here $\Delta g(x, T)$ is taken to mean $g(x+1, T) - g(x, T)$).

3. For large values of x, $\Delta g(x, T) = c_\lambda$.

4. For small values of x, $\Delta g(x, T) = -c_\delta$.

It would be convenient to extend the definition of the function $g(x, T)$ to negative values of x.

Rather than proving the above three propositions separately, we prove them all together by induction on the size of the tree.


Properties of Optimal

Histories

Proof

The proof is done by induction on the size

of the tree.

1. As the base step, let us consider a tree T which has one leaf with multiplicity p. In this case it is easy to verify that

$$g(x, T) = \begin{cases} 0 & \text{if } x = p \\ (p - x)\, c_\delta & \text{if } x < p \\ (x - p)\, c_\lambda & \text{if } x > p \end{cases}$$


Properties of Optimal

Histories

2. All the three propositions are true for this

function.

3. Now suppose that the three propositions

are true for any tree with strictly less than

k leaves (k > 1), and consider a tree T with

k leaves.

4. Both the left and right sub-trees of T, which are denoted by $T_L$ and $T_R$, have fewer than k leaves, so the three propositions are true for them by the induction hypothesis.


Properties of Optimal

Histories

5. Consider an interval I of integers where $g(x, T_L)$ is non-decreasing and $g(x, T_R)$ is non-increasing.

6. The main part of the proof is that we show:

$$\forall x \in I: \; g(x, T) = g(x, T_L) + g(x, T_R) \qquad (*)$$

a. Let us prove this by contradiction. Suppose (*) is not true.

b. There exists x in I such that

$$g(x, T) < g(x, T_L) + g(x, T_R)$$


Properties of Optimal

Histories

c. By (#), w.l.o.g. we suppose that $g(x, T) = g(x+1, T) + c_\delta$. The number of consecutive duplications in a node in an optimal generation is finite.

d. There exists u such that:

$$g(x, T) = g(x+u, T) + u c_\delta$$

$$g(x+u, T) = g(x+u, T_L) + g(x+u, T_R)$$


Properties of Optimal

Histories

$$g(x+u, T) + u c_\delta < g(x, T_L) + g(x, T_R)$$

$$g(x+u, T_L) + g(x+u, T_R) + u c_\delta < g(x, T_L) + g(x, T_R)$$

$$u c_\delta < \underbrace{g(x, T_L) - g(x+u, T_L)}_{\le 0} + \underbrace{g(x, T_R) - g(x+u, T_R)}_{\le u c_\delta}$$

$$u c_\delta < u c_\delta$$

Contradiction!


Properties of Optimal

Histories

If $g(x, T) = g(x-1, T) + c_\lambda < g(x, T_L) + g(x, T_R)$, a similar contradiction can be obtained.

Symmetrically, if $g(x, T_L)$ is decreasing and $g(x, T_R)$ is increasing in an interval, the equality is correct.

As a consequence of this equality, $g(x, T)$ is convex in the interval: the sum of two convex functions is convex.

Note that if the optimal generating intervals for $T_L$ and $T_R$ are $[l_1, l_2]$ and $[r_1, r_2]$ respectively, then in the interval $[\min\{l_1, r_1\}, \max\{l_2, r_2\}]$ the equality (*) is correct.


Properties of Optimal

Histories

Now let us consider the interval $[\max\{l_2, r_2\} + 1, \infty)$.

In this interval both $g(x, T_L)$ and $g(x, T_R)$ are strictly increasing by the induction hypothesis.

It is easy to verify that $g(x, T) \ne g(x+1, T) + c_\delta$, and so $g(x, T)$ is equal to the minimum of $g(x-1, T) + c_\lambda$ and $g(x, T_L) + g(x, T_R)$.

If for all values of x in this interval we have $g(x, T) = g(x, T_L) + g(x, T_R)$, then $g(x, T)$ becomes strictly increasing and convex, as the sum of two strictly increasing and convex functions.


Properties of Optimal

Histories

Otherwise, consider the first value $x_0$ in the interval such that

$$g(x_0, T) = g(x_0 - 1, T) + c_\lambda < g(x_0, T_L) + g(x_0, T_R)$$

By the way we defined $x_0$, we have

$$g(x_0 - 1, T) = g(x_0 - 1, T_L) + g(x_0 - 1, T_R)$$

Consequently, we have

$$\Delta g(x_0 - 1, T_L) + \Delta g(x_0 - 1, T_R) > c_\lambda$$

On the other hand, $g(x, T_L)$ and $g(x, T_R)$ are convex and so $\Delta g(x, T_L)$ and $\Delta g(x, T_R)$ are increasing.

So, $\forall x \ge x_0: \; g(x, T) = g(x-1, T) + c_\lambda$. The function $g(x, T)$ is strictly increasing and convex for any $x \ge x_0$, so $g(x, T)$ is convex and strictly increasing for any x in the interval.


Properties of Optimal

Histories

Similarly, we can show that in an interval where both $g(x, T_L)$ and $g(x, T_R)$ are strictly decreasing, $g(x, T)$ is strictly decreasing and convex.

In order to complete the proof we need to consider the different possible configurations of $g(x, T_L)$ and $g(x, T_R)$.

The next figure shows the three possible arrangements of the optimal intervals of the two functions.


Properties of Optimal

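Histories

[Figure: the three possible arrangements, a), b), and c), of the optimal intervals of the two functions; in each case the x-axis is divided into intervals numbered 1 to 5]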


Properties of Optimal

Histories

In all three cases, intervals 1 and 5 refer to the intervals in the proof where both functions are strictly decreasing and strictly increasing, respectively.

Interval 2 and interval (3,4) refer to the interval I in the proof.

The proof of the convexity at the exchange points of these intervals is easy and is omitted.

It is also easy to show (by Lemma 1) that in cases (a) and (b) the optimal interval of T is interval 3.

In case (c) the optimal interval is an interval which is included in interval 3.


Algorithm

We present an algorithm for the computation of the optimal D/L score histories for a given tree.

The algorithm fills the table $g[x, T]$ for any $1 \le x \le m$ and for all sub-trees of T.

$g(x, T)$ is not computed for non-positive values of x because it is not biologically meaningful.

On the other hand, an optimal solution never has more than m genes present in a species, so there is no need to compute $g(x, T)$ for $x > m$.


Algorithm

Algorithm GenCost(tree T)

1. if T is a leaf then

   1.1 for i ← 1 to m do

      1.1.1 if i ≥ label(T) then g[i, T] ← (i − label(T)) × cλ

      1.1.2 if i < label(T) then g[i, T] ← (label(T) − i) × cδ

   1.2 exit

2. GenCost(TL); GenCost(TR)

3. [l1, l2] ← OPT(TL); [r1, r2] ← OPT(TR)

4. t1 ← min{l1, r1}; t2 ← max{l2, r2}

5. for i ← t1 to t2 do g[i, T] ← g[i, TL] + g[i, TR]

6. for i ← t2+1 to m do g[i, T] ← min{g[i−1, T] + cλ, g[i, TL] + g[i, TR]}

7. for i ← t1−1 downto 1 do g[i, T] ← min{g[i+1, T] + cδ, g[i, TL] + g[i, TR]}
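The GenCost pseudocode can be transcribed into a short runnable sketch. This is one possible reading, not the author's implementation: the `Node` class, the helper `opt_interval`, and the parameter names `c_loss` (for cλ) and `c_dup` (for cδ) are assumptions made for illustration, and the function returns the vector g[·, T] instead of filling a shared table. The sample tree uses four leaves with multiplicities 3, 1, 4, 2 and the illustrative costs cλ = 2, cδ = 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: Optional[int] = None       # leaf multiplicity (leaves only)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def opt_interval(g: List[int]) -> Tuple[int, int]:
    """OPT as [first, last] index in 1..m where g is minimal (index 0 unused)."""
    best = min(g[1:])
    xs = [x for x in range(1, len(g)) if g[x] == best]
    return xs[0], xs[-1]


def gen_cost(t: Node, m: int, c_loss: int, c_dup: int) -> List[int]:
    """Return g[1..m] for the tree rooted at t, following steps 1-7."""
    g = [0] * (m + 1)                 # g[0] is a placeholder
    if t.left is None:                # step 1: leaf
        for i in range(1, m + 1):
            if i >= t.label:
                g[i] = (i - t.label) * c_loss
            else:
                g[i] = (t.label - i) * c_dup
        return g
    gl = gen_cost(t.left, m, c_loss, c_dup)     # step 2
    gr = gen_cost(t.right, m, c_loss, c_dup)
    l1, l2 = opt_interval(gl)                   # step 3
    r1, r2 = opt_interval(gr)
    t1, t2 = min(l1, r1), max(l2, r2)           # step 4
    for i in range(t1, t2 + 1):                 # step 5
        g[i] = gl[i] + gr[i]
    for i in range(t2 + 1, m + 1):              # step 6
        g[i] = min(g[i - 1] + c_loss, gl[i] + gr[i])
    for i in range(t1 - 1, 0, -1):              # step 7
        g[i] = min(g[i + 1] + c_dup, gl[i] + gr[i])
    return g


tree = Node(left=Node(left=Node(label=3), right=Node(label=1)),
            right=Node(left=Node(label=4), right=Node(label=2)))
print(gen_cost(tree, 4, c_loss=2, c_dup=3)[1:])  # prints [14, 11, 9, 10]
```

Each node does O(m) work, which matches the O(mn) bound stated for the algorithm; `opt_interval` relies on OPT being an integer interval, as Proposition 1 states.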


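Algorithm Example

Tree: the root A has children B and C; B has the leaves D and E; C has the leaves F and G.

Leaf multiplicities: D = 3, E = 1, F = 4, G = 2. The multiplicity of the root is 1.

Loss: Cλ = X. Duplication: Cδ = Y. Assume 0.5Y < X < Y.

g(i,T)    A        B        C        D     E     F     G
i = 1     4Y+X     2Y       3Y       2Y    0     3Y    1Y
i = 2     3Y+X     1Y+1X    2Y       1Y    1X    2Y    0
i = 3     1Y+3X    2X       1Y+1X    0     2X    1Y    1X
i = 4     5X       3X       2X       1X    3X    0     2X

OPT of the leaves: D {3,3}, E {1,1}, F {4,4}, G {2,2}.

[t1, t2] for B and C: {1,3} and {2,4}. With 0.5Y < X < Y, OPT(B) = {3,3} and OPT(C) = {4,4}.

[t1, t2] for A: {3,4}.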


Algorithm

The correctness of the algorithm is a result of

the proof of the propositions.

The complexity of the algorithm is $O(mn)$.

Once the table g has been determined by the algorithm, finding an optimal D/L score history is easy.

Starting with $g[1, T]$, at each step one checks how g is minimized.


Algorithm

The total complexity will be $O(mn + nk)$ for computing k optimal answers.

Note that multiple optimal histories correspond to the nodes and values of x such that g is minimized by two lines of the recurrence.

$$g(x, T) = \min \begin{cases} g(x+1, T) + c_\delta & (1) \\ g(x-1, T) + c_\lambda & (2) \\ g(x, T_L) + g(x, T_R) & (3) \end{cases} \qquad (\#)$$


Algorithm

In practice, depending on the values of $c_\lambda$ and $c_\delta$, this complexity can be improved.

The function $g(x, T)$ is convex.

$\Delta g$ is increasing.

$-c_\delta \le \Delta g(x, T) \le c_\lambda$.

Store and update the function g just by computing the points at which $\Delta g$ changes its value from x to x + 1.

This will reduce the complexity of the algorithm to $O(cn)$, where $c = c_\lambda + c_\delta$.

The complexity of the algorithm for generating k optimal histories is $O(n(\min\{m, c\} + k))$.


Algorithm

Unit Loss/Duplication Costs. When $c_\lambda = c_\delta = 1$, the function $g(x, T)$ becomes very simple.

If $OPT(T) = [k_1, k_2]$, $g(x, T)$ is constant in $[k_1, k_2]$, increasing with step 1 for $x > k_2$ and decreasing with step -1 for $x < k_1$.

The complexity of the algorithm will become $O(nk)$ for finding k optimal trees.
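With unit costs, the shape described above means that each node can be summarized by its optimal interval and optimal cost alone. The sketch below is an illustration derived from that shape, not code from the slides: the merge rule for the two children (intersection when their intervals overlap, the gap between them otherwise) is an assumption worked out from the stated properties, and the names `unit_opt` and `unit_g` and the nested-tuple tree encoding are hypothetical.

```python
from typing import Tuple, Union

# A tree is either an int (leaf multiplicity) or a pair (left, right).
Tree = Union[int, Tuple["Tree", "Tree"]]


def unit_opt(tree: Tree) -> Tuple[int, int, int]:
    """Return (k1, k2, opt): the optimal interval [k1, k2] and optimal cost."""
    if isinstance(tree, int):
        return tree, tree, 0            # a leaf is optimal only at its label
    l1, l2, opt_l = unit_opt(tree[0])
    r1, r2, opt_r = unit_opt(tree[1])
    lo, hi = max(l1, r1), min(l2, r2)
    if lo <= hi:                        # intervals overlap: use the intersection
        return lo, hi, opt_l + opt_r
    # Disjoint intervals: the sum is constant on the gap [hi, lo] between
    # them and costs one unit per step of that gap.
    return hi, lo, opt_l + opt_r + (lo - hi)


def unit_g(x: int, tree: Tree) -> int:
    """g(x, T) for unit costs: opt plus the distance from x to [k1, k2]."""
    k1, k2, opt = unit_opt(tree)
    return opt + max(0, k1 - x, x - k2)


print([unit_g(x, (4, 2)) for x in range(1, 5)])  # prints [3, 2, 2, 2]
```

Each node is visited once and does constant work, so computing the interval and cost for the whole tree takes time linear in the number of nodes under these assumptions.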


Conclusion

Combinatorial properties of the optimal D/L histories for a given species tree.

Improved algorithm for finding the optimal histories, faster by an order of $O(m)$.

The improvement is $O(m^2)$ for the unit cost duplication/loss function.

The macro-evolutionary phylogeny problem has been shown to be useful and interesting for building phylogenies based on both macro- and micro-evolutionary processes.
