http://creativecommons.org/licenses/by-sa/2.0/. cis786, lecture 3 usman roshan

http://creativecommons.org/licenses/by-sa/2.0/

CIS786, Lecture 3

Usman Roshan

Maximum Parsimony

• Character based method

• NP-hard (reduction to the Steiner tree problem)

• Widely-used in phylogenetics

• Slower than NJ but more accurate

• Faster than ML

• Assumes sites are i.i.d. (independent and identically distributed)

Maximum Parsimony

• Input: Set S of n aligned sequences of length k

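• Output: A phylogenetic tree T

– leaf-labeled by sequences in S

– additional sequences of length k labeling the internal nodes of T

such that $\sum_{(i,j) \in E(T)} H(i,j)$ is minimized, where $H(i,j)$ is the Hamming distance between the sequences labeling nodes i and j.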

Maximum parsimony (example)

• Input: Four sequences

– ACT

– ACA

– GTT

– GTA

• Question: which of the three trees has the best MP score?

Maximum Parsimony

[Figure: the three possible unrooted tree topologies on the leaves ACT, ACA, GTT, GTA]

Maximum Parsimony

[Figure: tree pairing ACT with GTT and ACA with GTA; internal nodes labeled GTT and GTA; edge costs 1, 2, 2]

MP score = 5

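[Figure: tree pairing ACA with GTT and ACT with GTA; internal nodes labeled ACA and ACT; edge costs 3, 1, 3]

Length with this labeling = 7 (labeling both internal nodes ACA lowers it to 6, which is the MP score of this topology)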

[Figure: tree pairing ACT with ACA and GTT with GTA; internal nodes labeled ACA and GTA; edge costs 1, 2, 1]

MP score = 4

Optimal MP tree
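The arithmetic on the three example slides can be checked with a few lines of Python. This is a minimal sketch, not part of the original lecture: each tree is written as a list of edges between node labels, using the internal labels shown on the slides, and the names `hamming`, `tree_length`, `tree_a`, `tree_b`, and `tree_c` are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def tree_length(edges):
    """Sum of Hamming distances over the edges of a fully labeled tree."""
    return sum(hamming(u, v) for u, v in edges)


# Each edge joins two node labels (a leaf or an internal node).
# Internal labels are the ones drawn on the slides.
tree_a = [("ACT", "GTT"), ("GTT", "GTT"), ("GTT", "GTA"),
          ("GTA", "ACA"), ("GTA", "GTA")]
tree_b = [("ACA", "ACA"), ("GTT", "ACA"), ("ACA", "ACT"),
          ("ACT", "ACT"), ("GTA", "ACT")]
tree_c = [("ACT", "ACA"), ("ACA", "ACA"), ("ACA", "GTA"),
          ("GTA", "GTT"), ("GTA", "GTA")]

lengths = [tree_length(t) for t in (tree_a, tree_b, tree_c)]
print(lengths)  # [5, 7, 4]
```

The third tree has the smallest length, in agreement with the slide that marks it as the optimal MP tree.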

Maximum Parsimony: computational complexity

[Figure: the optimal tree from the example, pairing ACT with ACA and GTT with GTA; MP score = 4]

Finding the optimal MP tree is NP-hard

Optimal labeling of a fixed tree can be computed in linear time O(nk)
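The slide does not name the linear-time procedure; the standard choice is Fitch's algorithm, so the sketch below assumes that is what is meant. It takes a tree written as nested 2-tuples of leaf sequences (the unrooted tree rooted on an arbitrary edge, which does not change the score) and returns the MP score in one bottom-up pass. The function name `fitch_score` and the tuple encoding are illustrative choices.

```python
def fitch_score(tree):
    """MP score of a binary tree given as nested 2-tuples of leaf sequences."""

    def state_sets(node):
        # Returns (per-site candidate state sets, changes counted so far).
        if isinstance(node, str):
            return [{c} for c in node], 0
        (left_sets, left_cost) = state_sets(node[0])
        (right_sets, right_cost) = state_sets(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)      # children agree: no change needed
            else:
                sets.append(a | b)       # children disagree: one change
                cost += 1
        return sets, cost

    return state_sets(tree)[1]


# The three topologies on the example sequences.
t1 = (("ACT", "GTT"), ("ACA", "GTA"))
t2 = (("ACA", "GTT"), ("ACT", "GTA"))
t3 = (("ACT", "ACA"), ("GTT", "GTA"))

scores = [fitch_score(t) for t in (t1, t2, t3)]
print(scores)  # [5, 6, 4]
```

Each leaf and internal node is visited once and does O(k) work, which gives the O(nk) bound. For the second topology the optimal score is 6, one less than the length 7 of the particular labeling drawn in the example.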

Local search strategies

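[Figure: cost landscape over the space of phylogenetic trees (x-axis: phylogenetic trees, y-axis: cost), marking a local optimum and the global optimum]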

Local search for MP

• Determine a candidate solution s

• While s is not a local minimum

– Find a neighbor s' of s such that MP(s') < MP(s)

– If found set s = s'

– Else return s and exit

• Time complexity: unknown---could take forever or end quickly depending on starting tree and local move

• Need to specify how to construct starting tree and local move
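The loop above can be sketched generically. In this minimal example the starting solution, the neighborhood function, and the scoring function are all passed in as parameters; the names `local_search`, `neighbors`, and `score` are illustrative, and the toy one-dimensional landscape stands in for tree space only to show how the result depends on the starting point.

```python
def local_search(start, neighbors, score):
    """Move to an improving neighbor until none exists, then return."""
    s = start
    while True:
        current = score(s)
        better = next((t for t in neighbors(s) if score(t) < current), None)
        if better is None:
            return s          # s is a local minimum
        s = better


# Toy landscape: positions 0..6 with these costs; neighbors are adjacent positions.
cost = [5, 3, 4, 6, 2, 1, 7]


def adjacent(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(cost)]


from_left = local_search(0, adjacent, lambda i: cost[i])
from_right = local_search(6, adjacent, lambda i: cost[i])
print(from_left, from_right)  # 1 5
```

Starting from position 0 the search stops at a local minimum (cost 3), while starting from position 6 it reaches the global minimum (cost 1).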

Starting tree for MP

• Random phylogeny---O(n) time

• Greedy-MP

Greedy-MP

Greedy-MP takes O(n^3k) time
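The slide illustrating Greedy-MP did not survive extraction, so the sketch below assumes the usual stepwise-addition scheme: add sequences one at a time, each on the edge that gives the lowest parsimony score. Trees are nested 2-tuples, scoring uses Fitch's algorithm, and the names `fitch`, `insertions`, and `greedy_mp` are illustrative. Trying O(n) edges at O(nk) per score for each of n sequences is consistent with the O(n^3k) bound.

```python
def fitch(tree):
    """Return (per-site state sets, parsimony cost) for nested 2-tuples."""
    if isinstance(tree, str):
        return [{c} for c in tree], 0
    (ls, lc), (rs, rc) = fitch(tree[0]), fitch(tree[1])
    sets, cost = [], lc + rc
    for a, b in zip(ls, rs):
        if a & b:
            sets.append(a & b)
        else:
            sets.append(a | b)
            cost += 1
    return sets, cost


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` above some subtree."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)


def greedy_mp(seqs):
    """Stepwise addition in the given order (assumed form of Greedy-MP)."""
    tree = (seqs[0], seqs[1])
    for s in seqs[2:]:
        tree = min(insertions(tree, s), key=lambda t: fitch(t)[1])
    return tree


tree = greedy_mp(["ACT", "GTT", "ACA", "GTA"])
print(fitch(tree)[1])  # 4
```

On the four example sequences this order of addition happens to recover the optimal score of 4; in general the result depends on the order, which is why the experiments later start from randomized greedy trees.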

Faster Greedy MP: 3-way labeling

• If we can assign optimal labels to each internal node rooted in each possible way, we can speed up computation by a factor of n

• Optimal 3-way labeling

– Sort all 3n subtrees using bucket sort in O(n)

– Starting from small subtrees compute optimal labelings

– For each subtree rooted at v, the optimal labelings of the children nodes are already computed

– Total time: O(nk)


With the optimal 3-way labeling it takes constant time per site to compute the MP score for each edge, so total Greedy-MP time is O(n^2k)

Local moves for MP: NNI

• For each internal edge we get two different topologies

• Neighborhood size is 2n-6
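The count 2n-6 follows from an unrooted binary tree on n leaves having n-3 internal edges, each giving two rearrangements. A minimal sketch, assuming the tree is stored as an adjacency dictionary with string node names (leaves have degree 1, internal nodes degree 3); `nni_neighbors` is an illustrative name:

```python
import copy


def nni_neighbors(adj):
    """Yield the trees one NNI move away from an unrooted binary tree."""
    internal = [v for v in adj if len(adj[v]) == 3]
    for u in internal:
        for v in sorted(adj[u]):
            # Visit each internal edge {u, v} exactly once.
            if len(adj[v]) != 3 or not u < v:
                continue
            a = sorted(adj[u] - {v})[0]          # one subtree on u's side
            for c in sorted(adj[v] - {u}):       # either subtree on v's side
                new = copy.deepcopy(adj)
                # Swap subtree a with subtree c across the edge {u, v}.
                new[u].remove(a)
                new[a].remove(u)
                new[v].remove(c)
                new[c].remove(v)
                new[u].add(c)
                new[c].add(u)
                new[v].add(a)
                new[a].add(v)
                yield new


# Five-leaf example: internal nodes x, y, z; leaves A..E.
edges = [("x", "A"), ("x", "B"), ("x", "y"), ("y", "C"),
         ("y", "z"), ("z", "D"), ("z", "E")]
adj = {}
for p, q in edges:
    adj.setdefault(p, set()).add(q)
    adj.setdefault(q, set()).add(p)

neighbors = list(nni_neighbors(adj))
print(len(neighbors))  # 4
```

With n = 5 leaves the sketch yields 2*5 - 6 = 4 neighbors, and the input tree is left unchanged because each neighbor is built on a copy.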

Local moves for MP: SPR

• Neighborhood size is quadratic in number of taxa

• Computing the minimum number of SPR moves between two rooted phylogenies is NP-hard

Local moves for MP: TBR

• Neighborhood size is cubic in number of taxa

• Computing the minimum number of TBR moves between two rooted phylogenies is NP-hard

Tree Bisection and Reconnection (TBR)


Delete an edge


Reconnect the trees with a new edge that bifurcates an edge in each tree

Local optima are a problem

[Figure: plot for TNT; y-axis ticks from 0 to 0.08, x-axis ticks from 1 to 336; axis labels not recoverable]

Iterated local search: escape local optima by perturbation

[Figure: local search reaches a local optimum; a perturbation moves away from it, and local search from the output of the perturbation reaches a new local optimum; the cycle repeats]

ILS for MP

• Ratchet

• Iterative-DCM3

• TNT


[Figure: the iterated local search cycle (local optimum, perturbation, local search)]

Ratchet

• Perturbation input: alignment and phylogeny

– Sample with replacement p% of sites and reweight them to w

– Perform local search on modified dataset starting from the input phylogeny

– Reset the alignment to original after completion and output the local minimum
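The perturbation above can be sketched as a reweighting of alignment columns. This is a minimal illustration under stated assumptions: unsampled sites keep weight 1, the search routine is a caller-supplied function `local_search(tree, weights)` (hypothetical, not a PAUP* or TNT call), and `ratchet_weights` and `ratchet_step` are illustrative names.

```python
import random


def ratchet_weights(k, p, w, rng):
    """Per-site weights after sampling p% of the k sites with replacement."""
    weights = [1] * k
    for site in rng.choices(range(k), k=round(k * p / 100)):
        weights[site] = w                     # sampled sites get weight w
    return weights


def ratchet_step(tree, k, p, w, local_search, rng):
    """One ratchet iteration: search on reweighted data, then on the original."""
    perturbed = ratchet_weights(k, p, w, rng)
    escaped = local_search(tree, perturbed)   # local minimum of modified data
    return local_search(escaped, [1] * k)     # weights reset to the original


rng = random.Random(0)
weights = ratchet_weights(20, 25, 2, rng)

# Stand-in search that records the weights it was called with.
calls = []


def record_search(tree, site_weights):
    calls.append(list(site_weights))
    return tree


result = ratchet_step("T", 20, 25, 2, record_search, rng)
```

Because sampling is with replacement, fewer than p% of the sites may end up reweighted; here at most 5 of the 20 weights are set to 2.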

Ratchet: escaping local minima by data perturbation

[Figure: from a local optimum, a ratchet search produces the output of ratchet; local search from that output reaches a new local optimum]


But how well does this perform? We have to examine this experimentally on real data

Experimental methodology for MP on real data

• Collect alignments of real datasets

– Usually constructed using ClustalW

– Followed by manual (eye) adjustments

– Must be reliable to get sensible tree!

• Run methods for a fixed time period

• Compare MP scores as a function of time

– Examine how scores improve over time

– Rate of convergence of different methods (not sequence length but as a function of time)


• We use rRNA and DNA alignments

• Obtained from researchers and public databases

• We run iterative improvement and ratchet each for 24 hours beginning from a randomized greedy MP tree

• Each method was run five times and average scores were plotted

• We use PAUP*---very widely used software package for various types of phylogenetic analysis

500 aligned rbcL sequences (Zilla dataset)

854 aligned rbcL sequences

2000 aligned Eukaryotes

7180 aligned 3domain

13921 aligned Proteobacteria

Comparison of MP heuristics

• What about other techniques for escaping local minima?

• TNT: a combination of divide-and-conquer, simulated annealing, and genetic algorithms

– Sectorial search (random): construct ancestral sequence states using parsimony; randomly select a subset of nodes; compute iterative-improvement trees and if better tree found then replace

– Genetic algorithm (fuse): Exchange subtrees between two trees to see if better ones are found

– Default search: (1) Do sectorial search starting from five randomized greedy MP trees; (2) apply genetic algorithm to find better ones; (3) output best tree


How does this compare to PAUP*-ratchet?


• We use rRNA and DNA alignments

• Obtained from researchers and public databases

• We run PAUP*-ratchet, TNT-default, and TNT-ratchet each for 24 hours beginning from randomized greedy MP trees

• Each method was run five times on each dataset and average scores were plotted

500 aligned rbcL sequences (Zilla dataset)

854 aligned rbcL sequences

2000 aligned Eukaryotes

7180 aligned 3domain

13921 aligned Proteobacteria

Can we do even better?

Yes! But first let’s look at

Disk-Covering Methods

Disk Covering Methods (DCMs)

• DCMs are divide-and-conquer booster methods. They divide the dataset into small subproblems, compute subtrees using a given base method, merge the subtrees, and refine the supertree.

• DCMs to date

– DCM1: for improving statistical performance of distance-based methods

– DCM2: for improving heuristic search for MP and ML

– DCM3: latest, fastest, and best (in accuracy and optimality) DCM

DCM2 technique for speeding up MP searches

1. Decompose sequences into overlapping subproblems

2. Compute subtrees using a base method

3. Merge subtrees using the Strict Consensus Merge (SCM)

4. Refine to make the tree binary

DCM2 decomposition

• Input: distance matrix d, threshold $q \in \{d_{ij}\}$, sequences S

• Algorithm:

1a. Compute a threshold graph G using q and d

1b. Perform a minimum weight triangulation of G

2. Find separator X in G which minimizes $\max_i |X \cup A_i|$, where $A_i$ are the connected components of $G - X$

3. Output subproblems as $X \cup A_i$.

Threshold graph

• Add edges until graph is connected

• Perform minimum weight triangulation

– NP-hard

– Triangulated graph = perfect elimination ordering (PEO)

– Max cliques can be determined in linear time

– Use greedy triangulation heuristic: compute PEO by adding vertices which minimize largest edge added

– Worst case is O(n^3) but fast in practice
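The first step, building the threshold graph and adding edges until it is connected, can be sketched as follows. The slides do not spell out the edge rule, so this assumes the usual definition: taxa i and j are joined whenever d[i][j] <= q. The triangulation step is not shown, and the function names and the small distance matrix are illustrative.

```python
def threshold_graph(d, q):
    """Adjacency sets of the graph with an edge {i, j} whenever d[i][j] <= q."""
    n = len(d)
    return {i: {j for j in range(n) if j != i and d[i][j] <= q}
            for i in range(n)}


def is_connected(adj):
    """Depth-first search from vertex 0; True if every vertex is reached."""
    seen, stack = {0}, [0]
    while stack:
        for j in adj[stack.pop()] - seen:
            seen.add(j)
            stack.append(j)
    return len(seen) == len(adj)


def smallest_connecting_threshold(d):
    """Smallest q among the pairwise distances whose threshold graph is connected."""
    n = len(d)
    for q in sorted({d[i][j] for i in range(n) for j in range(i + 1, n)}):
        if is_connected(threshold_graph(d, q)):
            return q
    return None


# Illustrative symmetric distance matrix on four taxa.
d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]

q = smallest_connecting_threshold(d)
print(q)  # 3
```

For this matrix the graph first becomes connected at q = 3, when the edge between taxa 1 and 2 joins the two pairs {0, 1} and {2, 3}.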

Finding DCM2 separator

1. Find separator X in G which minimizes $\max_i |X \cup A_i|$, where $A_i$ are the connected components of $G - X$

2. Output subproblems as $X \cup A_i$

3. This takes O(n^3) worst case time: perform depth first search on each component (O(n^2)) for each of O(n) separators
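The separator objective can be evaluated with one graph traversal per candidate. In this minimal sketch the candidate separators are supplied by the caller (in DCM2 they would come from the triangulated threshold graph, which is not computed here), and the names `components`, `separator_cost`, `subproblems`, and `best_separator` are illustrative.

```python
def components(adj, removed):
    """Connected components of the graph after deleting the vertices in `removed`."""
    left, comps = set(adj) - set(removed), []
    while left:
        start = left.pop()
        comp, stack = {start}, [start]
        while stack:
            for j in adj[stack.pop()] & left:
                left.remove(j)
                comp.add(j)
                stack.append(j)
        comps.append(comp)
    return comps


def separator_cost(adj, X):
    """max_i |X ∪ A_i| over the components A_i of G - X."""
    return max((len(set(X) | a) for a in components(adj, X)),
               default=len(set(X)))


def subproblems(adj, X):
    """The overlapping subproblems X ∪ A_i."""
    return [set(X) | a for a in components(adj, X)]


def best_separator(adj, candidates):
    """Candidate with the smallest largest subproblem."""
    return min(candidates, key=lambda X: separator_cost(adj, X))


# Path graph 0 - 1 - 2 - 3 - 4 as a toy example.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

best = best_separator(adj, [{1}, {2}, {3}])
print(best, separator_cost(adj, best))  # {2} 3
```

Removing the middle vertex splits the path into two components of size 2, so both subproblems have size 3, whereas separators {1} and {3} leave a subproblem of size 4.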

DCM2 subsets

DCM1 vs DCM2

DCM1 decomposition: NJ gets better accuracy on small diameter subproblems (which we shall return to later)

DCM2 decomposition: Getting a smaller number of smaller subproblems speeds up solution

We saw how decomposition takes place, now on to supertree methods





Supertree Methods

Optimization problems

• Subtree Compatibility: Given set of trees $\mathcal{T} = \{T_1, \ldots, T_k\}$, does there exist tree $T$ such that $\forall t \in \mathcal{T}$, $T|_{L(t)} = t$ (we say $T$ contains $\mathcal{T}$).

• NP-hard (Steel 1992)

• Special cases are poly-time (rooted trees, DCM)

• MRP: also NP-hard

Direct supertree methods

• Strict consensus supertrees, MinCutSupertrees

Indirect supertree methods

• MRP, Average consensus

MRP---Matrix Representation using Parsimony (very popular)

Strict Consensus Merger---faster and used in DCMs

[Figure: example trees on leaves labeled 1 to 7]

Strict Consensus Merger: compatible subtrees

Strict Consensus Merger: compatible but collision

Strict Consensus Merger: incompatible subtrees

Strict Consensus Merger: incompatible and collision

Strict Consensus Merger: difference from Gordon’s SC method

Tree Refinement

• Challenge: given an unresolved tree, find a refinement with optimal parsimony score

• NP-hard

Tree Refinement

[Figure: four trees on leaves a to h illustrating refinement of an unresolved tree]

Comparing DCM decompositions

Study of DCM decompositions

DCM2 is faster and better than DCM1

Comparison of MP scores Comparison of running times

Best DCM (DCM2) vs Random


DCM2 is better than RANDOM w.r.t MP scores and running times

DCM2 (comparing two different thresholds)


Threshold selection techniques

Biological dataset of 503 rRNA sequences. Threshold value at which we get two subproblems has best MP score.

Comparing supertree methods

MRP vs. SCM

1. SCM is better than MRP


Comparing tree refinement techniques

Study of tree refinement techniques


Constrained tree search had best MP scores but is slower than other methods

Next time

• DCM1 for improving NJ

• Recursive-Iterative-DCM3: state of the art in solving MP and ML
