Improved models of molecular evolution in statistical phylogenetics

Stephane Guindon

Department of Statistics
The University of Auckland
New Zealand

Outline

1 Introduction

2 Variability of rates across sites

3 Variability of rates across sites and lineages

4 Variability of selection regimes across sites and lineages

5 Conclusion

1 Introduction


Data used in phylogenetics

Multiple sources:

(Alignment of) homologous sequences

Calibration (typically fossil data)

Geography, environment (e.g., GIS data)

In this talk, I will mainly focus on pre-determined alignment(s) of orthologous coding sequences.

Phylogenetic model

“Hybrid” object made of a discrete parameter, the tree topology, and multiple continuous parameters such as branch lengths, substitution rates between pairs of characters (nucleotides, amino acids, codons), population sizes, migration rates, etc.

Estimation relies on the likelihood, i.e., the probability of the data given the model parameter values.

Bayesian or maximum-likelihood inference, depending on the type of problem and the amount of time/computing resources available.

Likelihood

$$\Pr(S_1, S_2, \ldots, S_6 \mid M) = \Pr(S_1 = \mathrm{AAAA} \mid M) \times \Pr(S_2 = \mathrm{CGGC} \mid M) \times \cdots \times \Pr(S_6 = \mathrm{GGAA} \mid M)$$

$4^n$ combinations (for nucleotides): not computationally tractable for the vast majority of data sets...

Clever tree traversal algorithm by Felsenstein (1981): $4 \times 4 \times n$ operations required → can go up to 5,000 - 10,000 sequences!
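The tree traversal idea can be sketched on a toy example. The code below is a minimal illustration, not the implementation used in any particular software: it assumes a Jukes-Cantor substitution model, a hypothetical rooted four-taxon tree with made-up edge lengths, and tip names `s1` to `s4`. It computes per-site likelihoods by passing conditional likelihood vectors from the tips to the root, and checks them against the brute-force sum over all internal-node states.

```python
import itertools
import numpy as np

STATES = "ACGT"

def jc69(length):
    """Jukes-Cantor transition probabilities for an edge of the given length."""
    e = np.exp(-4.0 * length / 3.0)
    P = np.full((4, 4), 0.25 - 0.25 * e)
    np.fill_diagonal(P, 0.25 + 0.75 * e)
    return P

def partial(node, site):
    """Pr(tip states below `node` | state at `node`), one entry per state."""
    if isinstance(node, str):            # tip: indicator of the observed state
        vec = np.zeros(4)
        vec[STATES.index(site[node])] = 1.0
        return vec
    out = np.ones(4)
    for child, length in node:           # internal node: list of (child, length)
        out *= jc69(length) @ partial(child, site)
    return out

def site_likelihood(tree, site):
    """Average the root partial likelihoods over uniform state frequencies."""
    return float(np.full(4, 0.25) @ partial(tree, site))

def brute_force(site, lengths):
    """Same quantity, summing over all states of the three internal nodes."""
    la, lb, lc, ld, lu, lv = lengths
    a, b, c, d = (STATES.index(site[s]) for s in ("s1", "s2", "s3", "s4"))
    total = 0.0
    for r, u, v in itertools.product(range(4), repeat=3):
        total += (0.25 * jc69(lu)[r, u] * jc69(la)[u, a] * jc69(lb)[u, b]
                  * jc69(lv)[r, v] * jc69(lc)[v, c] * jc69(ld)[v, d])
    return total

# Hypothetical edge lengths and topology ((s1,s2),(s3,s4)).
lengths = (0.1, 0.2, 0.3, 0.1, 0.05, 0.05)
la, lb, lc, ld, lu, lv = lengths
tree = [([("s1", la), ("s2", lb)], lu), ([("s3", lc), ("s4", ld)], lv)]

# Three alignment columns taken from the slide; sites are multiplied together.
columns = ["AAAA", "CGGC", "GGAA"]
sites = [dict(zip(("s1", "s2", "s3", "s4"), col)) for col in columns]
alignment_likelihood = float(np.prod([site_likelihood(tree, s) for s in sites]))
```

The recursion visits each edge once per site, which is where the saving over enumerating every combination of internal states comes from.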

Core of the likelihood

$N(t)$: number of substitutions in short time interval $[0, t]$.

Poisson process:

$\Pr(N(t + dt) - N(t) = 1) \simeq \lambda\,dt$

$\Pr(N(t + dt) - N(t) = 0) \simeq 1 - \lambda\,dt$

$\Pr(N(t + dt) - N(t) \geq 2) \simeq 0$

Poisson probability:

$$\Pr(N(t) = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}$$


0/1 data, 2 substitutions

$$\begin{pmatrix} p_{0\to0}\,p_{0\to0} + p_{0\to1}\,p_{1\to0} & p_{0\to0}\,p_{0\to1} + p_{0\to1}\,p_{1\to1} \\ p_{1\to0}\,p_{0\to0} + p_{1\to1}\,p_{1\to0} & p_{1\to0}\,p_{0\to1} + p_{1\to1}\,p_{1\to1} \end{pmatrix} = \begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix} \times \begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix} = R^2$$

where $R$ is the matrix of probabilities of change for a single substitution.

In general, for $k \geq 0$ substitutions, the probabilities of change from one state to another are given by $R^k$.


Combine the Poisson probability of $k$ substitutions happening with $R^k$ in order to derive $P(t)$, the matrix of transition probabilities between states in time $t$:

$$\begin{aligned} P(t) &= \sum_{k=0}^{\infty} R^k \, \frac{(\mu t)^k e^{-\mu t}}{k!} \\ &= e^{-\mu t} \sum_{k=0}^{\infty} \frac{(R \mu t)^k}{k!} \\ &= e^{-\mu t} e^{R \mu t} \\ &= e^{(R - I)\mu t} \end{aligned}$$

Writing $Q = R - I$:

$$P(\mu t) = e^{Q \mu t}$$

and, with $l = \mu t$:

$$P(l) = e^{Ql}$$
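The identity between the Poisson-weighted sum of powers of R and the matrix exponential can be checked numerically. The sketch below uses an arbitrary two-state matrix R and arbitrary values of µ and t (all chosen for illustration only), and a plain truncated Taylor series for the matrix exponential, which is adequate for such a small example.

```python
import math
import numpy as np

def expm_series(A, terms=60):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

def poisson_mixture(R, mu, t, terms=60):
    """Sum over k of R^k weighted by Pr(N(t) = k), with N(t) ~ Poisson(mu * t)."""
    P = np.zeros_like(R, dtype=float)
    Rk = np.eye(R.shape[0])
    for k in range(terms):
        weight = (mu * t) ** k * math.exp(-mu * t) / math.factorial(k)
        P += weight * Rk
        Rk = Rk @ R
    return P

# Illustrative values: rows of R are probabilities and sum to one.
R = np.array([[0.7, 0.3],
              [0.4, 0.6]])
mu, t = 1.5, 0.8

P_sum = poisson_mixture(R, mu, t)                 # first line of the derivation
P_exp = expm_series((R - np.eye(2)) * mu * t)     # last line: exp((R - I) mu t)
```

Both matrices agree up to truncation error, and each row sums to one, as expected for transition probabilities.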

2 Variability of rates across sites

Simplest model

Same rate matrix Q throughout the tree

Sites are independent and identically distributed (iid)

Edges all have the same length l


Standard model, no variation across sites

Same rate matrix throughout the tree

Sites are iid

Each edge has its own length


Standard model, variation across sites

Same rate matrix throughout the tree

Each edge has its own length

Sites are still iid, with $\pi_{fast} + \pi_{slow} = 1$ and $\pi_{fast}\, r_{fast} + \pi_{slow}\, r_{slow} = 1$
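Under this two-class model, the likelihood at a site is a weighted sum over the two rate classes. The expression below is the usual mixture formulation, written as a sketch; $D_s$ (the data at site $s$) and $T$ (the tree with its edge lengths) are notation introduced here for illustration, and each rate multiplies every edge length of $T$.

```latex
\Pr(D_s \mid M) = \pi_{fast} \, \Pr(D_s \mid r_{fast}, T, Q)
                + \pi_{slow} \, \Pr(D_s \mid r_{slow}, T, Q)
```

The second constraint keeps the average rate equal to one, so that edge lengths remain expressed in expected numbers of substitutions per site.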


Continuous Gamma model


Discrete Gamma model (Yang, 1994)
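A minimal sketch of how one value of α can determine all the class rates, assuming the equal-probability, mean-per-class variant of the discrete Gamma model with a Gamma distribution of mean one (shape α, rate α). The helper names are hypothetical, and the incomplete gamma function is computed by its power series to avoid any dependency beyond the standard library.

```python
import math

def reg_lower_gamma(s, x):
    """Regularized lower incomplete gamma function P(s, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0
    total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= x / (s + n)
        total += term
    return total * math.exp(s * math.log(x) - x - math.lgamma(s + 1.0))

def gamma_quantile(p, alpha):
    """Quantile of Gamma(shape=alpha, rate=alpha), found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, C=4):
    """Mean rate within each of C equal-probability classes."""
    cuts = [0.0] + [gamma_quantile(k / C, alpha) for k in range(1, C)] + [math.inf]
    rates = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        # The partial mean of Gamma(alpha, alpha) over (a, b) equals the
        # probability of (a, b) under Gamma(alpha + 1, alpha).
        upper = 1.0 if math.isinf(b) else reg_lower_gamma(alpha + 1.0, alpha * b)
        lower = reg_lower_gamma(alpha + 1.0, alpha * a)
        rates.append(C * (upper - lower))
    return rates

rates = discrete_gamma_rates(0.5)   # alpha = 0.5 is an arbitrary example value
```

The class rates average to one by construction, so the only free parameter is α.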


Transition to new models

Benefit of Yang’s discrete Gamma approach: one parameter (α) determines the values of all the $r_i$’s.

Limiting the number of parameters to estimate is convenient from a computational perspective.

In practice, the variation of rates across sites is a strong feature of molecular evolution, i.e., estimating α is easy.

Modern genetic data sets are much bigger than they were in the 1990s.

Computers are also much faster.

It is high time we moved on...

Designing more flexible models of rate variation is relatively straightforward.


FreeRate model

Non-parametric estimation of $\pi_i$’s and $r_i$’s: estimate these parameters under the two constraints $\sum_i \pi_i = 1$ and $\sum_i \pi_i r_i = 1$.


FreeRate model

Main drawback is the greater number of parameters to estimate: one for the discrete gamma model vs. $2C - 2$ for FreeRate.

Benefits:

more flexibility in modelling the variability of rates across sites.

possibility to select the “best” number of rate classes using a sound statistical approach (e.g., likelihood ratio tests)
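One way to satisfy the two FreeRate constraints during optimisation is to map unconstrained values onto valid frequencies and rates. The sketch below is an assumption made for illustration (a softmax for the frequencies and a rescaling of the rates); it is not claimed to be the parameterisation used in PhyML, and all function names are hypothetical.

```python
import numpy as np

def freerate_params(raw_weights, raw_rates):
    """Map unconstrained vectors to (pi, r) with sum(pi) = 1 and sum(pi * r) = 1."""
    w = np.exp(raw_weights - np.max(raw_weights))   # softmax, numerically stable
    pi = w / w.sum()
    r = np.exp(raw_rates)                           # positive rates
    r = r / np.dot(pi, r)                           # rescale to mean rate one
    return pi, r

def n_free_parameters(C):
    """C frequencies and C rates, minus one constraint on each."""
    return 2 * C - 2

# Arbitrary raw values for C = 4 classes.
pi, r = freerate_params(np.array([0.0, 0.5, -0.3, 1.0]),
                        np.array([-2.0, -0.5, 0.4, 1.5]))
```

With C = 4 classes this gives six free parameters, against a single one (α) for the discrete Gamma model.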


Results: nucleotide data sets


Results: amino-acid data sets


Prediction of amino-acid diversity


Summary

FreeRate generally fits data better than +Γ4.

Similar computational costs.

Soubrier et al. (2012, Mol. Biol. Evol.): FreeRate returns more accurate estimates of node ages compared to +Γ4.

In PhyML: command-line option --freerate (or --freerates).


3 Variability of rates across sites and lineages

Actual rate patterns (?)


Modelling site-specific rate patterns

Each site and each edge has its own rate of evolution.

No-common-mechanism model: poor statistical properties.

Alternative: each edge has the same distribution of rates.


The Integrated Length (IL) approach



The length of a branch is a random variable, characterized by a mean (number of substitutions) and a variance.

In the current implementation, the variance is proportional to the mean (one extra parameter for the whole tree compared to the standard approach).



Integrate over a wide range of scenarios...

Including the good ones...

...and the not so good ones.


Theory

Standard approach:

$$P(l) = e^{Ql}$$

IL approach:

$$P(l) = \int_0^{\infty} e^{Ql}\, p(l)\, dl$$

If $l$ is distributed as $\Gamma(\alpha, \beta)$, then:

$$P(\alpha, \beta) = (I - \beta Q)^{-\alpha}$$

Same computational cost as that of the standard approach.
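The closed form can be checked against direct numerical integration. The sketch below assumes that β is the scale parameter of the Gamma distribution (so the mean length is αβ), uses a Jukes-Cantor rate matrix as a stand-in for Q, and picks arbitrary values for α and β; the function names are hypothetical.

```python
import math
import numpy as np

# Jukes-Cantor rate matrix, used here only as a convenient symmetric example.
Q = np.full((4, 4), 1.0 / 3.0)
np.fill_diagonal(Q, -1.0)

def il_closed_form(Q, alpha, beta):
    """(I - beta Q)^(-alpha), through the eigendecomposition of symmetric Q."""
    lam, V = np.linalg.eigh(Q)
    return V @ np.diag((1.0 - beta * lam) ** (-alpha)) @ V.T

def il_quadrature(Q, alpha, beta, upper=10.0, n=200_001):
    """Trapezoid-rule integral of exp(Q l) p(l) dl, with l ~ Gamma(alpha, scale=beta)."""
    l = np.linspace(0.0, upper, n)
    dens = l ** (alpha - 1.0) * np.exp(-l / beta) / (math.gamma(alpha) * beta ** alpha)
    lam, V = np.linalg.eigh(Q)
    vals = np.exp(np.outer(lam, l)) * dens          # one row per eigenvalue
    dl = l[1] - l[0]
    integrals = dl * (vals[:, 1:] + vals[:, :-1]).sum(axis=1) / 2.0
    return V @ np.diag(integrals) @ V.T

alpha, beta = 2.0, 0.05          # illustrative values: mean edge length 0.1
P_closed = il_closed_form(Q, alpha, beta)
P_num = il_quadrature(Q, alpha, beta)
```

Both routes need one eigendecomposition of Q; the closed form only changes the scalar function applied to the eigenvalues, which is consistent with the cost being the same as for the standard approach.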


Results: nucleotide data sets


Results: amino-acid data sets


Summary

IL incurs approximately the same computational cost as the standard model.

The standard model is nested within IL: avenue for hypothesis testing.

Gamma distribution is a good model for the branch length if the rate of evolution fluctuates according to a (geometric) Brownian process (Guindon, Syst. Biol., 2013).

Large improvement for a small proportion of data sets.

In PhyML: command-line option --il.


4 Variability of selection regimes across sites and lineages

Codon models

Use alignments of homologous coding sequences to estimate the ratio of non-synonymous to synonymous (dN/dS) substitution rates.

The Q matrix is now 61 by 61 (instead of 4×4 or 20×20).

We are no longer interested in the variation of the overall rate at which substitutions accumulate. Rather, we focus on the variation of dN/dS.
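The dimension 61 is the number of sense codons. A short check, assuming the standard genetic code with its three stop codons:

```python
import itertools

STOP_CODONS = {"TAA", "TAG", "TGA"}   # stop codons of the standard genetic code

codons = ["".join(c) for c in itertools.product("TCAG", repeat=3)]
sense_codons = [c for c in codons if c not in STOP_CODONS]
```

Other genetic codes (for instance mitochondrial ones) have a different set of stop codons, and therefore a different number of states.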


Variation across sites: M2a model


M2a model: a different viewpoint

Consider a new model where each state of the Markov model is a combination of a codon state and a selection regime.

The M2a model is then defined by a rate matrix Q with dimension (3 × 61) by (3 × 61):

$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix}$$


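Extending M2a: branch-site model

$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix} + \begin{pmatrix} - & \pi_{\omega_1} I & \pi_{\omega_2} I \\ \pi_{\omega_0} I & - & \pi_{\omega_2} I \\ \pi_{\omega_0} I & \pi_{\omega_1} I & - \end{pmatrix}$$

Guindon, Rodrigo, Dyer, Huelsenbeck (2004, PNAS)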
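The structure of this matrix can be sketched with a Kronecker product. The code below follows the matrix as written on the slide and is only an illustration: `toy_codon_q` is a hypothetical three-state stand-in for the 61 by 61 codon matrices, the ω and π values are made up, and the diagonal entries marked "-" are taken to be whatever makes each row sum to zero.

```python
import numpy as np

def toy_codon_q(omega, n=3):
    """Hypothetical stand-in for a 61 x 61 codon rate matrix Q_omega."""
    Q = np.full((n, n), 1.0)
    Q[0, n - 1] = Q[n - 1, 0] = omega     # pretend this pair is non-synonymous
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows of a rate matrix sum to zero
    return Q

def switching_q(omegas, pis, n=3):
    """Block-diagonal codon matrices plus switches between selection regimes."""
    K = len(omegas)
    block_diag = np.zeros((K * n, K * n))
    for i, omega in enumerate(omegas):
        block_diag[i * n:(i + 1) * n, i * n:(i + 1) * n] = toy_codon_q(omega, n)
    S = np.tile(np.asarray(pis, dtype=float), (K, 1))   # S[i, j] = pi_j
    np.fill_diagonal(S, 0.0)
    np.fill_diagonal(S, -S.sum(axis=1))
    # kron(S, I) places pi_j * I in off-diagonal block (i, j).
    return block_diag + np.kron(S, np.eye(n))

omegas = (0.1, 1.0, 2.5)     # illustrative dN/dS values for the three regimes
pis = (0.6, 0.3, 0.1)        # illustrative regime frequencies
Q_switch = switching_q(omegas, pis)
```

Because the off-diagonal blocks are multiples of the identity, a switch changes the selection regime at a site while leaving the codon unchanged.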


Shan et al. (2009, Mol. Biol. Evol.)


Standard branch-site model

Gold standard: branch-site model (PAML), where the user specifies which branches are likely to be affected by positive selection at some sites a priori.

The stochastic branch-site model does not require such prior information.

Also, the standard branch-site model assumes that the same branches undergo positive selection in different regions of the alignment.

The stochastic branch-site approach does not impose that constraint.


Simulations


Power to detect positive selection

Truth: 50% A + 50% F    XU      XV      XW

std-BS with tree A      0.912   0.916   1.000
std-BS with tree B      0.020   0.036   0.190
std-BS with tree C      0.346   0.346   0.976
std-BS with tree D      0.006   0.010   0.166
std-BS with tree E      0.172   0.150   0.000

std-BS multi            0.022   0.040   0.264
sto-BS                  0.148   0.178   0.682

XU, XV: 20% of the sites evolve under positive selection (on green edges only).

XW: 40% of the sites evolve under positive selection (on green edges only).

Lu & Guindon (2013, Mol. Biol. Evol.)


Summary

Strong prior on where positive selection might have occurred: use the standard branch-site model.

Exploratory analysis: the stochastic branch-site model performs better. Also clearly outperforms the MEME approach implemented in HyPhy (Murrell et al., 2012).

Stochastic branch-site model implemented in fitmodel: http://code.google.com/p/fitmodel.

5 Conclusion

FreeRate model almost systematically outperforms the standard (+Γ4) one.

IL approach brings significant improvement for a smaller fraction of the alignments (why?)

Stochastic branch-site model of codon evolution is well suited for exploratory analysis where one does not have a clear idea about the lineages evolving under positive selection a priori.

The future?

More data means more variability to account for → improving models of molecular evolution is (still) essential.

Better models for other sources of data, in particular spatial coordinates of collected sequences and fossils.
