an improved algorithm for the macro-evolutionary …michaluz/seminar/ilia_parsimony.pdf ·...

An Improved Algorithm for the Macro-evolutionary Phylogeny Problem

Ilia Flax 015600042


Lecture Outline

1. Biological Motivation.

2. Introduction.

3. Description of the model and a formal definition of the problem.

4. Combinatorial properties of the optimal answers which will be useful for our algorithm design.

5. Improved algorithm for solving the macro-evolutionary phylogeny problem.

6. Improvement in the case of the unit cost duplication and loss events.

7. Conclusion.

Biological Motivation

Nucleotide: one of a group of organic compounds. Nucleotides appear in all living beings and make up the genetic material.

Nucleic acid: a chain of nucleotides.

DNA (deoxyribonucleic acid): a huge nucleic acid molecule that contains all of the information for the construction of all the proteins in an organic cell.

Biological Motivation

Protein: one of a group of organic compounds found in the cells of all living beings, in almost every structure of the cell. The characters of an organism are a direct result of protein activity.

Biological Motivation

Gene: a data unit that is passed from an organism to its offspring through the DNA. The genes contain the "manufacturing instructions" for tens of thousands of cell proteins. Physically, the genes are segments of the DNA.


Evolution

The process of genetic change in a population of organisms over time.

Using the process of evolution we can explain:

The biological diversity that exists in morphological, physical and behavioral characters.

The complexity of life.

The creation of new taxonomic species.

Evolution researchers struggle to understand the likelihood of evolutionary processes and to know which of them actually took place.


Human Evolution

Phylogenies are evolutionary histories.

Systematics is “an analytic approach to understanding the diversity of relationships of organisms” (p. 491, Campbell & Reece, 2005).

“Taxonomy is the ordered division of organisms into categories based on… similarities and differences.” (p. 495, Campbell & Reece, 2005)

Shown is a phylogenetic tree.


Micro-Evolution

The occurrence of small-scale changes in a population, over a few generations.

These changes may be due to several processes: mutation, natural selection, gene flow and genetic drift, the randomness of mating within populations.

For example, bacterial strains that have antibiotic resistance.


Macro-Evolution

The large-scale patterns, trends, and rates of change (i.e. gene frequencies) among living organisms over long periods of time.

It consists of extended microevolution.

The difference is largely one of approach.

The evidence comes from 2 main sources: fossils and comparisons between living organisms.

Morphological Divergence

Change from the body form of a common ancestor.

Produces homologous structures.

[Figure: homologous structures, with corresponding parts numbered 1 to 5, in an early reptile, pterosaur, chicken, bat, porpoise, penguin and human.]

Biological Motivation: Fossils

Developmental Patterns

[Figure: developmental patterns in FISH, REPTILE, BIRD and MAMMAL.]

Introduction

The goal of evolutionary biology: reconstruction of the evolutionary history.

The evolutionary history tree is the phylogenetic tree.

Internal nodes: ancestral species.

Leaves: current species.

DNA sequences are used as characters to estimate phylogenetic trees.

Related genes evolve through the process of gene duplication.

Copies of a duplicated gene can evolve distinct variations: paralogues.

Introduction

A single species may contain none, one, or several copies of what was a single gene in an ancestor.

Building a phylogenetic tree: we need to know which copies of a gene are to be compared.

Gene tree: evolution of a set of genes.

Species tree: evolution of species.

Determination of evolutionary history: micro-evolutionary events and macro-evolutionary events.

[Figure: phylogenetic tree of life, from the molecular origin of life to the EUBACTERIA, ARCHAEBACTERIA, PROTISTANS, FUNGI, PLANTS and ANIMALS, including a "crown of eukaryotes (rapid divergences)".]

[Figure: Gene Tree, with a DNA sequence at each leaf.]

Outline

2005: Durand: an algorithm to find the tree that optimizes a macro-evolutionary criterion.

Complexity of the algorithm is $O(nm^2)$, where

n is the number of species,

m is the maximum number of copies of the gene.

Our algorithm: $O(nm)$.

$O(n)$ for unit costs for loss and duplication of a gene.

Durand’s Algorithm

Two-phase approach to gene tree reconstruction: sequence evolution, then gene duplication (or loss), for the reconstruction of phylogenies.

Dynamic programming algorithm: finds phylogenies using a macro-evolutionary model of gene duplication and loss.

Input: the number of genes in each species.

Output: a tree with the fewest duplications and losses.

Durand’s Algorithm

1. A tree is constructed using the sequence evolution operations considering micro-evolutionary events only.

2. Some regions of the tree are refined with respect to a macro-evolutionary model.

Macro-evolutionary events are used only for explaining the areas where the sequence data cannot resolve the topology.

The total search space is reduced.

The presented algorithm

An improvement of $O(m)$ over Durand's algorithm.

This is important: a gene family can have hundreds of duplicates.

It uses combinatorial properties of the structure of optimal histories.

These properties are applied to the macro-evolutionary phylogeny problem.

Problem Description

A single current species may have zero, one, or several copies of what was a single gene in an ancestor.

Macro-evolutionary phylogeny problem: explain the different multiplicities by duplications and losses.

There are infinitely many different histories that generate a phylogenetic tree.

Interesting histories: those with the fewest losses and duplications.

Problem Description

The optimization criterion is based on the cost of duplication and loss.

Cost of a duplication: $c_\delta$.

Cost of a loss: $c_\lambda$.

The score of a gene tree is

$D/L\ \text{score} = c_\lambda L + c_\delta D$,

where $D$ is the number of duplications and $L$ is the number of losses.

[Figure: Species Tree over Human, Chimp, Gorilla and Iliagutan, annotated with the number of gene copies at each node.]

Problem Description

A history is presented as a species tree.

Each node is annotated with its multiplicity.

The multiplicity of the root is one.

The multiplicities of the leaves, $m_1, \ldots, m_s$, are specified in the input.

Duplication: increases the number of gene copies in the species by one.

Loss: decreases the number of gene copies of the species associated with a node by one.

Problem Description

i gene copies exist in species x and it passes j

copies to each of its children.

Explaining the data:

If j > i, j − i duplications.

If j < i, i − j losses.

If i = j, no loss or duplication.
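The rule above can be written out as a short Python sketch. The function names events and dl_score and the parameter names c_dup and c_loss are illustrative assumptions, not names used in the slides; c_dup stands for the duplication cost and c_loss for the loss cost.

```python
def events(i, j):
    """Return (duplications, losses) when a node with i entering copies
    passes j copies to each of its children."""
    if j > i:
        return (j - i, 0)   # j - i duplications
    if j < i:
        return (0, i - j)   # i - j losses
    return (0, 0)           # no event


def dl_score(duplications, losses, c_dup, c_loss):
    """D/L score: loss cost times losses plus duplication cost times duplications."""
    return c_loss * losses + c_dup * duplications


print(events(1, 3))   # one entering copy, three passed on
print(dl_score(*events(1, 3), c_dup=4, c_loss=3))
```

Summing dl_score over every node of an annotated species tree gives the score of the whole history.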

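Problem Description

[Figure: three histories (a), (b), (c) for a species tree over frog, mouse and human, whose leaf multiplicities are m = 1 or m = 2; every node is labelled with its number of gene copies.]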

Macro-evolutionary Phylogeny Problem

Input: A rooted species tree $S_T$ with $s$ leaves; a list of multiplicities $m_1, \ldots, m_s$, where $m_l$ is the number of gene family members found in species $l$; weights $c_\lambda$ and $c_\delta$.

Output: The set of all rooted gene trees $\{G_T\}$ such that the $D/L$ score of $G_T$ is minimal.

Macro-evolutionary Phylogeny Problem

[Figure: the problem as a diagram. Input: the species tree $S_T$ with leaf multiplicities $m_1, m_2, \ldots, m_s$ and the two costs. Output: the set $\{G_T\}$ of trees with minimal $D/L$ score.]

Problem Description

Durand’s dynamic programming algorithm:

1. Fill a table Cost[v, i, j]:

   the minimum D/L score for the sub-tree rooted at v,

   where v has i entering copies

   and v passes j gene copies to its children;

   1 ≤ i, j ≤ m, where m is the maximum multiplicity for a leaf of the tree.

2. Reconstruct the gene trees using the table.

3. The complexity is $O(nm^2)$ for giving one optimal history and $O(nm^2 + nmk)$ for reporting k optimal histories.
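A minimal Python sketch of a table-based dynamic program in the spirit of step 1 above. It is not Durand's published code. The tree encoding (an int for a leaf, holding its multiplicity, and a pair for an internal node) and the names naive_cost and step are assumptions made for illustration, and the j dimension of Cost[v, i, j] is minimised away immediately instead of being stored.

```python
def naive_cost(tree, m, c_dup, c_loss):
    """Return a list g where g[i] (1 <= i <= m) is the minimum D/L score of
    the sub-tree when i copies enter its root. g[0] is unused."""

    def step(i, j):
        # cost of changing i copies into j copies inside one node
        return (j - i) * c_dup if j > i else (i - j) * c_loss

    if isinstance(tree, int):            # leaf: must end with its multiplicity
        return [None] + [step(i, tree) for i in range(1, m + 1)]

    left, right = tree
    gl = naive_cost(left, m, c_dup, c_loss)
    gr = naive_cost(right, m, c_dup, c_loss)
    # try every number j of copies passed to both children: O(m^2) per node
    return [None] + [
        min(step(i, j) + gl[j] + gr[j] for j in range(1, m + 1))
        for i in range(1, m + 1)
    ]


# Two leaves with multiplicities 4 and 2, unit costs.
print(naive_cost((4, 2), m=4, c_dup=1, c_loss=1)[1:])
```

The two nested loops over i and j at every internal node are where the $m^2$ factor in the complexity comes from.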

Properties of Optimal Histories

Definition 1.

Let $T$ be a tree with a given list of multiplicities for its leaves.

$g(x, T)$ is the minimum D/L score for a duplication/loss history of tree $T$,

where $x$ is the number of gene copies entering the root of the tree.

In an optimal history, in a given node there either is no event, or a first duplication (or loss) event is followed by an optimal history:

$$ g(x,T) = \min \begin{cases} g(x+1,T) + c_\delta & (1) \\ g(x-1,T) + c_\lambda & (2) \\ g(x,T_L) + g(x,T_R) & (3) \end{cases} \qquad (\#) $$

Properties of Optimal Histories

Example of recurrence (#) with unit costs, $c_\lambda = c_\delta = 1$.

For every vertex we save a vector representing the value of the function g for i entering gene copies, i = 1, ..., 4.

Node C has two leaf children, A and B, with multiplicities 4 and 2. Their vectors are g = {3,2,1,0} (multiplicity 4) and g = {1,0,1,2} (multiplicity 2).

g(1,c)=min(g(2,c)+1,g(1,a)+g(1,b))

g(2,c)=min(g(1,c)+1,g(3,c)+1,g(2,a)+g(2,b))

g(3,c)=min(g(2,c)+1,g(4,c)+1,g(3,a)+g(3,b))

g(4,c)=min(g(3,c)+1,g(4,a)+g(4,b))

This gives g = {3,2,2,2} for C.

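Properties of Optimal Histories

1. $c_\lambda, c_\delta > 0$, so for a sufficiently large $N$:

   $g(x,T_L) + g(x,T_R) < g(x+N,T) + Nc_\delta$ and $g(x,T_L) + g(x,T_R) < g(x-N,T) + Nc_\lambda$.

2. The recurrences are therefore finite for any tree $T$ with sub-trees $T_L$, $T_R$.

3. For any $k \ge 0$:

   $g(x,T) \le g(x+k,T) + kc_\delta$

   $g(x,T) \le g(x-k,T) + kc_\lambda$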


Properties of Optimal Histories

Definition 2. For a given tree $T$ and a given list of multiplicities for its leaves, $OPT(T)$ is defined as the set of all integers $x$ such that for any integer $x'$, $g(x,T) \le g(x',T)$. This optimal cost itself is denoted by $opt(T)$.

Lemma 1. Let $T$ be a binary tree and $T_L$ ($T_R$) be its left (right) sub-tree. Then

$opt(T) \ge opt(T_L) + opt(T_R)$

The optimal history of $T$ includes the optimal histories of $T_L$ and $T_R$.

We will prove that $OPT(T)$ is an integer interval for any tree $T$.
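Definition 2 translates directly into a lookup on a stored vector. The Python sketch below assumes the vector holds g(x, T) for x = 1, ..., m with position 0 holding x = 1, so it only sees the part of OPT(T) inside that range. The name opt_interval is hypothetical, and returning just the first and last minimiser relies on the claim that OPT(T) is an integer interval.

```python
def opt_interval(g):
    """Return ((x1, x2), opt): the range of x that minimises g and the minimum.
    g is 0-based: g[0] is the value for x = 1."""
    best = min(g)
    minimisers = [x for x, value in enumerate(g, start=1) if value == best]
    return (minimisers[0], minimisers[-1]), best


print(opt_interval([3, 2, 2, 2]))
print(opt_interval([3, 2, 1, 0]))
```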

Properties of Optimal Histories

Proposition 1.

For any tree T with given input multiplicities for the leaves, $OPT(T)$ is an integer interval.

By Proposition 1, the function $g(x,T)$ is minimum in an integer interval, which is denoted by $[x_1, x_2]$.

We will show that the function $g(x,T)$ is strictly decreasing for $x < x_1$ and strictly increasing for $x > x_2$.


Properties of Optimal Histories

Proposition 2.

Let T be a binary tree with given multiplicities for leaves and let $[x_1, x_2]$ be $OPT(T)$. The function $g(x,T)$ is strictly decreasing for all x smaller than $x_1$ and strictly increasing for all x larger than $x_2$.

Proposition 3.

Let T be a binary tree with given multiplicities for leaves. $g(x,T)$ is a convex function.


Properties of Optimal Histories

[Figure: the optimal interval.]


Properties of Optimal Histories

The general structure of $g(x,T)$:

1. $g(x,T)$ is first strictly decreasing, then it takes its minimum on an interval, and then it is strictly increasing.

2. The function $g(x,T)$ is convex; its slope $g'(x,T)$ is increasing.

3. For large values of x, $g'(x,T) = c_\lambda$.

4. For small values of x, $g'(x,T) = -c_\delta$.

It would be convenient to extend the definition of function $g(x,T)$ for negative values of x.

Rather than proving the above three propositions separately, we prove them all together by induction on the size of the tree.


Properties of Optimal Histories

Proof

The proof is done by induction on the size of the tree.

1. As the base step, let us consider a tree T which has one leaf with multiplicity p. In this case it is easy to verify that

$$ g(x,T) = \begin{cases} 0 & \text{if } x = p \\ (p-x)\,c_\delta & \text{if } x < p \\ (x-p)\,c_\lambda & \text{if } x > p \end{cases} $$


Properties of Optimal Histories

2. All three propositions are true for this function.

3. Now suppose that the three propositions are true for any tree with strictly fewer than k leaves (k > 1), and consider a tree T with k leaves.

4. Both the left and right sub-trees of T, which are denoted by $T_L$ and $T_R$, have fewer than k leaves, so the three propositions are true for them by the induction hypothesis.


Properties of Optimal Histories

5. Consider an interval I of integers where $g(x,T_L)$ is non-decreasing and $g(x,T_R)$ is non-increasing.

6. The main part of the proof is that we show:

   $\forall x \in I:\ g(x,T) = g(x,T_L) + g(x,T_R) \quad (*)$

   a. Let us prove this by contradiction. Suppose (*) is not true.

   b. There exists x in I such that $g(x,T) < g(x,T_L) + g(x,T_R)$.


Properties of Optimal Histories

   c. By (#), w.l.o.g. we suppose that $g(x,T) = g(x+1,T) + c_\delta$. The number of consecutive duplications in a node in an optimal generation is finite.

   d. So there exists u such that:

      $g(x,T) = g(x+u,T) + uc_\delta$

      $g(x+u,T) = g(x+u,T_L) + g(x+u,T_R)$


Properties of Optimal Histories

$g(x+u,T) + uc_\delta < g(x,T_L) + g(x,T_R)$

$g(x+u,T_L) + g(x+u,T_R) + uc_\delta < g(x,T_L) + g(x,T_R)$

$uc_\delta < \underbrace{g(x,T_L) - g(x+u,T_L)}_{\le 0} + \underbrace{g(x,T_R) - g(x+u,T_R)}_{\le uc_\delta}$

$uc_\delta < uc_\delta$

Contradiction!


Properties of Optimal Histories

If $g(x,T) = g(x-1,T) + c_\lambda < g(x,T_L) + g(x,T_R)$, a similar contradiction can be obtained.

Symmetrically, if $g(x,T_L)$ is decreasing and $g(x,T_R)$ is increasing in an interval, the equality is correct.

As a consequence of this equality, $g(x,T)$ is convex in the interval: the sum of two convex functions is convex.

Note that if the optimal generating intervals for $T_L$ and $T_R$ are $[l_1, l_2]$ and $[r_1, r_2]$ respectively, then in the interval $[\min\{l_1, r_1\}, \max\{l_2, r_2\}]$ the equality (*) is correct.


Properties of Optimal Histories

Now let us consider the interval $[\max\{l_2, r_2\} + 1, \infty)$.

In this interval both $g(x,T_L)$ and $g(x,T_R)$ are strictly increasing by the induction hypothesis.

It is easy to verify that $g(x,T) \ne g(x+1,T) + c_\delta$, and so $g(x,T)$ is equal to the minimum of $g(x-1,T) + c_\lambda$ and $g(x,T_L) + g(x,T_R)$.

If for all values of x in this interval we have $g(x,T) = g(x,T_L) + g(x,T_R)$, then $g(x,T)$ becomes strictly increasing and convex, as the sum of two strictly increasing and convex functions.


Properties of Optimal Histories

Otherwise, consider the first value $x_0$ in the interval such that

$g(x_0,T) = g(x_0-1,T) + c_\lambda < g(x_0,T_L) + g(x_0,T_R)$

By the way we defined $x_0$, we have

$g(x_0-1,T) = g(x_0-1,T_L) + g(x_0-1,T_R)$

Consequently, we have

$g'(x_0-1,T_L) + g'(x_0-1,T_R) > c_\lambda$

On the other hand, $g(x,T_L)$ and $g(x,T_R)$ are convex, and so their slopes $g'(x,T_L)$ and $g'(x,T_R)$ are increasing.

So, $\forall x \ge x_0:\ g(x,T) = g(x-1,T) + c_\lambda$. The function $g(x,T)$ is strictly increasing and convex for any $x < x_0$, so $g(x,T)$ is convex and strictly increasing for any x in $[\max\{l_2, r_2\} + 1, \infty)$.


Properties of Optimal Histories

Similarly, we can show that in an interval where both $g(x,T_L)$ and $g(x,T_R)$ are strictly decreasing, $g(x,T)$ is strictly decreasing and convex.

In order to complete the proof we need to consider the different possible configurations of $g(x,T_L)$ and $g(x,T_R)$.

The next figure shows the three possible arrangements of the optimal intervals of the two functions.


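Properties of Optimal Histories

[Figure: the three possible arrangements (a), (b), (c) of the optimal intervals of the two functions; in each arrangement the x-axis is divided into intervals 1 to 5.]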


Properties of Optimal Histories

In all three cases, intervals 1 and 5 refer to the intervals in the proof where both $g(x,T_L)$ and $g(x,T_R)$ are strictly decreasing and strictly increasing, respectively.

Interval 2 and interval (3,4) refer to the interval I in the proof.

The proof of the convexity at the exchange points of these intervals is easy and is omitted.

It is also easy to show (by Lemma 1) that in cases (a) and (b) the optimal interval of T is interval 3.

In case (c) the optimal interval is an interval which is included in interval 3.

Algorithm

We present an algorithm for computation of the optimal D/L score histories for a given tree.

The algorithm fills the table $g[x,T]$ for any $1 \le x \le m$ and for all sub-trees of T.

$g(x,T)$ is not computed for non-positive values of x because it is not biologically meaningful.

On the other hand, an optimal solution never has more than m genes present in a species, so there is no need to compute $g(x,T)$ for $x > m$.

Algorithm

Algorithm GenCost(tree T)

1. if T is a leaf then

   1.1 for i ← 1 to m do

       1.1.1 if i ≥ label(T) then g[i, T] ← (i − label(T)) × cλ

       1.1.2 if i < label(T) then g[i, T] ← (label(T) − i) × cδ

   1.2 exit

2. GenCost(TL); GenCost(TR)

3. [l1, l2] ← OPT(TL); [r1, r2] ← OPT(TR)

4. t1 ← min{l1, r1}; t2 ← max{l2, r2}

5. for i ← t1 to t2 do g[i, T] ← g[i, TL] + g[i, TR]

6. for i ← t2+1 to m do g[i, T] ← min{g[i−1, T] + cλ, g[i, TL] + g[i, TR]}

7. for i ← t1−1 downto 1 do g[i, T] ← min{g[i+1, T] + cδ, g[i, TL] + g[i, TR]}
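The pseudocode can be transcribed into Python roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: a leaf is encoded as an int holding its multiplicity (assumed to be at least 1), an internal node as a pair of sub-trees, and the function returns the vector for its sub-tree instead of writing into a shared table. The names gen_cost and opt_interval are illustrative.

```python
def opt_interval(g):
    """First and last x in 1..m where g is minimal (g[0] is unused)."""
    best = min(g[1:])
    xs = [i for i in range(1, len(g)) if g[i] == best]
    return xs[0], xs[-1]


def gen_cost(tree, m, c_dup, c_loss):
    """Return g with g[i] = minimum D/L score for i entering copies, 1 <= i <= m."""
    g = [None] * (m + 1)
    if isinstance(tree, int):                       # step 1: leaf
        for i in range(1, m + 1):
            if i >= tree:
                g[i] = (i - tree) * c_loss
            else:
                g[i] = (tree - i) * c_dup
        return g
    gl = gen_cost(tree[0], m, c_dup, c_loss)        # step 2
    gr = gen_cost(tree[1], m, c_dup, c_loss)
    l1, l2 = opt_interval(gl)                       # step 3
    r1, r2 = opt_interval(gr)
    t1, t2 = min(l1, r1), max(l2, r2)               # step 4
    for i in range(t1, t2 + 1):                     # step 5
        g[i] = gl[i] + gr[i]
    for i in range(t2 + 1, m + 1):                  # step 6
        g[i] = min(g[i - 1] + c_loss, gl[i] + gr[i])
    for i in range(t1 - 1, 0, -1):                  # step 7
        g[i] = min(g[i + 1] + c_dup, gl[i] + gr[i])
    return g


# Four leaves with multiplicities 3, 1, 4, 2; loss cost 3, duplication cost 4.
print(gen_cost(((3, 1), (4, 2)), m=4, c_dup=4, c_loss=3)[1:])
```

Every node does a constant number of passes over 1..m, which is where the O(mn) bound comes from. The costs 3 and 4 in the usage line are arbitrary integers chosen so that the loss cost lies between half the duplication cost and the duplication cost.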

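Algorithm Example

Species tree: root A with children B and C; B has leaves D and E, C has leaves F and G. The leaf multiplicities are D = 3, E = 1, F = 4, G = 2, and one gene copy enters the root.

Loss: cλ = X

Duplication: cδ = Y, with 0.5Y < X < Y

g(i,T)   A       B       C       D     E     F     G
i = 1    4Y+X    2Y      3Y      2Y    0     3Y    1Y
i = 2    3Y+X    1Y+1X   2Y      1Y    1X    2Y    0
i = 3    1Y+3X   2X      1Y+1X   0     2X    1Y    1X
i = 4    5X      3X      2X      1X    3X    0     2X

OPT of the leaves: D {3,3}, E {1,1}, F {4,4}, G {2,2}.

[t1, t2] at B: {1,3}; at C: {2,4}.

OPT(B) = {3,3} and OPT(C) = {4,4}, so [t1, t2] at A: {3,4}.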

Algorithm

The correctness of the algorithm is a result of the proof of the propositions.

The complexity of the algorithm is $O(mn)$.

Once table g has been determined by the algorithm, finding an optimal D/L score history is easy.

Starting with $g[1,T]$, at each step one checks how g is minimized.
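The traceback described above might look like the following Python sketch. It assumes integer costs (so that equality tests are exact), the same int-or-pair tree encoding used for illustration elsewhere, and a simple two-sweep routine to fill the vectors; build and traceback are hypothetical names. It returns one optimal history as the list, in preorder, of the number of copies each internal node passes to its children.

```python
def build(tree, m, c_dup, c_loss):
    """Attach to every node its vector g (0-based: g[0] is for x = 1)."""
    if isinstance(tree, int):
        g = [(tree - x) * c_dup if x < tree else (x - tree) * c_loss
             for x in range(1, m + 1)]
        return {"g": g, "kids": None}
    kids = [build(sub, m, c_dup, c_loss) for sub in tree]
    g = [a + b for a, b in zip(kids[0]["g"], kids[1]["g"])]
    for i in range(1, m):                    # allow a loss first
        g[i] = min(g[i], g[i - 1] + c_loss)
    for i in range(m - 2, -1, -1):           # allow a duplication first
        g[i] = min(g[i], g[i + 1] + c_dup)
    return {"g": g, "kids": kids}


def traceback(node, x, c_dup, c_loss):
    """Copies passed down by each internal node, in preorder, for x entering copies."""
    if node["kids"] is None:
        return []
    g = node["g"]
    left, right = node["kids"]
    # follow duplications or losses until passing x copies down is optimal
    while g[x - 1] != left["g"][x - 1] + right["g"][x - 1]:
        if x < len(g) and g[x - 1] == g[x] + c_dup:
            x += 1                           # line (1): one duplication
        else:
            x -= 1                           # line (2): one loss
    return ([x] + traceback(left, x, c_dup, c_loss)
            + traceback(right, x, c_dup, c_loss))


root = build(((3, 1), (4, 2)), m=4, c_dup=4, c_loss=3)
print(root["g"][0], traceback(root, 1, c_dup=4, c_loss=3))
```

Whenever two of the tests succeed at the same node and value of x, there is more than one optimal history; this sketch simply follows the first one it finds.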

Algorithm

The total complexity will be $O(mn + nk)$ for computing k optimal answers.

Note that multiple optimal histories correspond to the nodes and values of x such that g is minimized by two lines of the recurrence:

$$ g(x,T) = \min \begin{cases} g(x+1,T) + c_\delta & (1) \\ g(x-1,T) + c_\lambda & (2) \\ g(x,T_L) + g(x,T_R) & (3) \end{cases} \qquad (\#) $$

Algorithm

In practice, depending on the values of $c_\lambda$ and $c_\delta$, this complexity can be improved.

The function $g(x,T)$ is convex.

Its slope $g'(x,T)$ is increasing.

$-c_\delta \le g'(x,T) \le c_\lambda$.

Store and update the function g just by computing the points at which $g'$ changes its value from x to x + 1.

This will reduce the complexity of the algorithm to $O(cn)$, where $c = c_\lambda + c_\delta$.

The complexity of the algorithm for generating k optimal histories is $O(n(\min\{m, c\} + k))$.

Algorithm

Unit Loss/Duplication Costs. When $c_\lambda = c_\delta = 1$, the function $g(x,T)$ becomes very simple.

If $OPT(T) = [k_1, k_2]$, $g(x,T)$ is constant in $[k_1, k_2]$, increasing with step 1 for $x > k_2$ and decreasing with step −1 for $x < k_1$.

The complexity of the algorithm will become $O(nk)$ for finding k optimal trees.
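With unit costs, the shape described above means each sub-tree can be summarised by three numbers, $k_1$, $k_2$ and the optimal cost. The Python sketch below shows one way to exploit this. The merging rule for two children is derived here from that shape and is not taken from the slides; it also assumes every leaf multiplicity is at least 1. The names unit_cost_summary and unit_cost_score are hypothetical.

```python
def unit_cost_summary(tree):
    """Return (k1, k2, opt) such that, under unit costs,
    g(x, T) = opt + max(0, k1 - x, x - k2)."""
    if isinstance(tree, int):                # leaf with this multiplicity
        return tree, tree, 0
    l1, l2, opt_left = unit_cost_summary(tree[0])
    r1, r2, opt_right = unit_cost_summary(tree[1])
    a, b = max(l1, r1), min(l2, r2)
    # a <= b: the children's optimal intervals overlap in [a, b]
    # a >  b: they are disjoint and the gap between them costs a - b
    return min(a, b), max(a, b), opt_left + opt_right + max(0, a - b)


def unit_cost_score(tree, x=1):
    """Minimum D/L score with x copies entering the root (unit costs)."""
    k1, k2, opt = unit_cost_summary(tree)
    return opt + max(0, k1 - x, x - k2)


print(unit_cost_summary((4, 2)), unit_cost_score((4, 2)))
```

Each node is handled in constant time, so a single optimal score costs time linear in the number of nodes.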

Conclusion

We presented combinatorial properties of the optimal D/L histories for a given species tree.

An improved algorithm finds the optimal histories a factor of $O(m)$ faster.

The improvement is $O(m^2)$ for the unit cost duplication/loss function.

The macro-evolutionary phylogeny problem has been shown to be useful and interesting for building phylogenies based on both macro- and micro-evolutionary processes.
