Improved models of molecular evolution in statistical phylogenetics - Stephane Guindon
DESCRIPTION
In this talk, I will present new models of molecular evolution suitable for phylogeny estimation. These models provide an improved description of the variation of rates of evolution along genomes and during the course of evolution. I will present examples that demonstrate the superiority of these new models and show how to use them in PhyML.
TRANSCRIPT
Improved models of molecular
evolution in statistical phylogenetics
Stephane Guindon
Department of Statistics
The University of Auckland
New Zealand
Outline
1 Introduction
2 Variability of rates across sites
3 Variability of rates across sites and lineages
4 Variability of selection regimes across sites and lineages
5 Conclusion
Data used in phylogenetics
Multiple sources:
(Alignment of) homologous sequences
Calibration (typically fossil data)
Geography, environment (e.g., GIS data)
In this talk, I will mainly focus on pre-determined alignment(s) of orthologous coding sequences.
Phylogenetic model
“Hybrid” object made of a discrete parameter, the tree topology, and multiple continuous parameters such as branch lengths, substitution rates between pairs of characters (nucleotides, amino acids, codons), population sizes, migration rates, etc.
Estimation relies on the likelihood, i.e., the probability of the data given the model parameter values.
Bayesian or maximum-likelihood inference, depending on the type of problem and the amount of time/computing resources available.
Likelihood
$$\Pr(S_1, S_2, \ldots, S_6 \mid M) = \Pr(S_1 = \text{AAAA} \mid M) \times \Pr(S_2 = \text{CGGC} \mid M) \times \cdots \times \Pr(S_6 = \text{GGAA} \mid M)$$
Likelihood
$4^n$ combinations (for nucleotides): not computationally tractable for the vast majority of data sets...
Clever tree traversal algorithm by Felsenstein (1981): $4 \times 4 \times n$ operations required → can go up to 5,000 - 10,000 sequences!
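The tree traversal idea can be illustrated with a minimal sketch. It assumes the JC69 substitution model and a nested-list tree representation, neither of which comes from the slides; the sequences reproduce the site patterns AAAA, CGGC and GGAA shown earlier at sites 1, 2 and 6, while the other sites and all branch lengths are made up.

```python
from math import exp, log

NUCLEOTIDES = "ACGT"

def jc69(t):
    """JC69 transition probabilities for a branch of length t (assumed model)."""
    e = exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial_likelihoods(node, site):
    """Return L[x] = Pr(observed tips below `node` | state x at `node`)."""
    if isinstance(node, str):  # leaf: `node` is the sequence itself
        return [1.0 if NUCLEOTIDES[x] == node[site] else 0.0 for x in range(4)]
    result = [1.0] * 4
    for child, length in node:  # internal node: list of (child, branch length)
        p = jc69(length)
        below = partial_likelihoods(child, site)
        for x in range(4):  # 4 x 4 operations per edge
            result[x] *= sum(p[x][y] * below[y] for y in range(4))
    return result

def site_likelihood(tree, site):
    """Sum over root states, weighted by the uniform JC69 frequencies."""
    return sum(0.25 * lx for lx in partial_likelihoods(tree, site))

def log_likelihood(tree, n_sites):
    """Sites are independent, so per-site log-likelihoods add up."""
    return sum(log(site_likelihood(tree, s)) for s in range(n_sites))

# Hypothetical four-taxon tree ((s1, s2), (s3, s4)) with made-up branch lengths.
tree = [([("ACTTAG", 0.1), ("AGTTAG", 0.2)], 0.05),
        ([("AGTCAA", 0.1), ("ACTCGA", 0.3)], 0.05)]
print(log_likelihood(tree, 6))
```

Each edge is visited once per site and costs 4 × 4 multiplications, which is where the linear dependence on the number of sequences comes from.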
Core of the likelihood
$N(t)$: number of substitutions in short time interval $[0, t]$.
Poisson process:
$\Pr(N(t + dt) - N(t) = 1) \simeq \lambda\, dt$
$\Pr(N(t + dt) - N(t) = 0) \simeq 1 - \lambda\, dt$
$\Pr(N(t + dt) - N(t) \geq 2) \simeq 0$
Poisson probability:
$\Pr(N(t) = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}$
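As a quick numerical illustration of how the Poisson probability relates to the short-interval approximations, here is a small sketch. The rate and interval values are arbitrary and chosen only for the example.

```python
from math import exp, factorial

def poisson_probability(k, rate, t):
    """Pr(N(t) = k) = (rate * t)^k / k! * exp(-rate * t)."""
    return (rate * t) ** k / factorial(k) * exp(-rate * t)

rate, dt = 2.0, 1e-4  # hypothetical values
# Over a short interval, one event has probability close to rate * dt,
# no event close to 1 - rate * dt, and two or more events close to 0.
print(poisson_probability(1, rate, dt), rate * dt)
print(poisson_probability(0, rate, dt), 1.0 - rate * dt)
print(1.0 - poisson_probability(0, rate, dt) - poisson_probability(1, rate, dt))
```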
Core of the likelihood
0/1 data, 2 substitutions:
$$\begin{pmatrix}
p_{0\to0}\, p_{0\to0} + p_{0\to1}\, p_{1\to0} & p_{0\to0}\, p_{0\to1} + p_{0\to1}\, p_{1\to1} \\
p_{1\to0}\, p_{0\to0} + p_{1\to1}\, p_{1\to0} & p_{1\to0}\, p_{0\to1} + p_{1\to1}\, p_{1\to1}
\end{pmatrix}
=
\begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix}
\times
\begin{pmatrix} p_{0\to0} & p_{0\to1} \\ p_{1\to0} & p_{1\to1} \end{pmatrix}$$
In general, for $k \geq 0$ substitutions, the probabilities of change from one state to another are given by $R^k$, where $R$ is the matrix of single-substitution probabilities above.
Core of the likelihood
Combine the Poisson probability of $k$ substitutions happening with $R^k$ in order to derive $P(t)$, the matrix of transition probabilities between states in time $t$:
$$P(t) = \sum_{k=0}^{\infty} R^k \frac{(\mu t)^k e^{-\mu t}}{k!} = e^{-\mu t} \sum_{k=0}^{\infty} \frac{(R \mu t)^k}{k!} = e^{-\mu t} e^{R \mu t} = e^{(R - I)\mu t}$$
With $Q = R - I$ and $l = \mu t$:
$$P(\mu t) = e^{Q \mu t}$$
$$P(l) = e^{Q l}$$
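The Poisson-weighted sum over powers of R can be checked numerically. The sketch below truncates the series after a fixed number of terms and uses a hypothetical four-state R in which every substitution moves to one of the three other states with probability 1/3; for that choice, exp(Ql) has a known closed form with diagonal entries 1/4 + 3/4 exp(-4l/3).

```python
from math import exp

def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(r, mu_t, terms=60):
    """P = sum_k R^k (mu t)^k exp(-mu t) / k!, truncated after `terms` terms."""
    n = len(r)
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # R^0 = I
    weight = exp(-mu_t)                                            # Poisson weight, k = 0
    p = [[0.0] * n for _ in range(n)]
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                p[i][j] += weight * power[i][j]
        power = matmul(power, r)          # R^(k+1)
        weight *= mu_t / (k + 1)          # Poisson weight for k + 1
    return p

# Hypothetical R: uniform jump to one of the three other nucleotides.
R = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(R, 0.5)
print(P[0][0], 0.25 + 0.75 * exp(-4.0 * 0.5 / 3.0))
```

In practice the matrix exponential is usually computed from an eigendecomposition of Q rather than from a truncated series; the series is used here only because it mirrors the derivation.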
Outline
1 Introduction
2 Variability of rates across sites
3 Variability of rates across sites and lineages
4 Variability of selection regimes across sites and lineages
5 Conclusion
Simplest model
Same rate matrix Q throughout the tree
Sites are independent and identically distributed (iid)
Edges all have the same length l
Standard model, no variation across sites
Same rate matrix throughout the tree
Sites are iid
Each edge has its own length
Standard model, variation across sites
Same rate matrix throughout the tree
Each edge has its own length
Sites are still iid, $\pi_{\text{fast}} + \pi_{\text{slow}} = 1$, $\pi_{\text{fast}}\, r_{\text{fast}} + \pi_{\text{slow}}\, r_{\text{slow}} = 1$
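A minimal sketch of what the two-class mixture means for the likelihood of one site, assuming two sequences, the JC69 model and arbitrary values for the fast class (all of these are illustrative choices, not taken from the slides). Each rate class rescales the edge length, and the site likelihood is the weighted average over classes.

```python
from math import exp

def jc69_pair_likelihood(identical, length):
    """Likelihood of one site for two sequences separated by `length` under JC69."""
    p_same = 0.25 + 0.75 * exp(-4.0 * length / 3.0)
    return 0.25 * p_same if identical else 0.25 * (1.0 - p_same) / 3.0

def mixture_site_likelihood(identical, length, weights, rates):
    """Average the site likelihood over rate classes; each class rescales the edge."""
    return sum(w * jc69_pair_likelihood(identical, r * length)
               for w, r in zip(weights, rates))

pi_fast, r_fast = 0.2, 3.0                    # hypothetical values
pi_slow = 1.0 - pi_fast                       # pi_fast + pi_slow = 1
r_slow = (1.0 - pi_fast * r_fast) / pi_slow   # mean rate constrained to 1
weights, rates = [pi_fast, pi_slow], [r_fast, r_slow]
print(mixture_site_likelihood(True, 0.3, weights, rates))
print(mixture_site_likelihood(False, 0.3, weights, rates))
```

The constraint on the mean rate keeps edge lengths interpretable as expected numbers of substitutions per site.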
Continuous Gamma model
Discrete Gamma model (Yang, 1994)
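To give a feel for how a single shape parameter fixes every class rate in a discrete Gamma model, here is a rough Monte Carlo sketch. It is not Yang's quantile-based computation: it simply draws from a Gamma distribution with mean 1, splits the sorted draws into equally sized bins and takes the mean of each bin. The function name, the number of draws and the seed are arbitrary.

```python
import random

def discrete_gamma_rates(alpha, n_classes=4, n_draws=100_000, seed=1):
    """Approximate equal-weight class rates for a Gamma(alpha, scale 1/alpha) distribution."""
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    size = n_draws // n_classes            # n_draws is assumed divisible by n_classes
    means = [sum(draws[c * size:(c + 1) * size]) / size for c in range(n_classes)]
    overall = sum(means) / n_classes
    return [m / overall for m in means]    # rescale so that the mean rate is exactly 1

for alpha in (0.5, 2.0):
    print(alpha, [round(r, 3) for r in discrete_gamma_rates(alpha)])
```

Small values of alpha give strongly unequal rates, larger values bring all classes closer to 1.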
Transition to new models
Benefit of Yang’s discrete Gamma approach: one parameter (α) determines what the values of all the $r_i$'s are.
Limiting the number of parameters to estimate is convenient from a computational perspective.
In practice, the variation of rates across sites is a strong feature of molecular evolution, i.e., estimating α is easy.
Modern genetic data sets are much bigger than they were in the 1990s.
Computers are also much faster.
It is high time we moved on...
Designing more flexible models of rate variation is relatively straightforward.
FreeRate model
Non-parametric estimation of the $\pi_i$'s and $r_i$'s: estimate these parameters under the two constraints $\sum_i \pi_i = 1$ and $\sum_i \pi_i r_i = 1$.
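One simple way to enforce the two constraints during optimisation is to work with unconstrained positive values and rescale them. The sketch below shows that rescaling only; it is an illustration and not necessarily how PhyML handles the constraints, and the input values are made up.

```python
def normalise_freerate(raw_weights, raw_rates):
    """Rescale positive values so that sum(pi) = 1 and sum(pi * r) = 1."""
    total = sum(raw_weights)
    weights = [w / total for w in raw_weights]
    mean_rate = sum(w * r for w, r in zip(weights, raw_rates))
    rates = [r / mean_rate for r in raw_rates]
    return weights, rates

weights, rates = normalise_freerate([2.0, 1.0, 1.0], [0.1, 1.0, 5.0])
print(weights, rates)
```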
FreeRate model
Main drawback is the greater number of parameters to estimate: one for the discrete Gamma model vs. $2C - 2$ for FreeRate with $C$ rate classes.
Benefits:
more flexibility in modelling the variability of rates across sites.
possibility to select the “best” number of rate classes using a sound statistical approach (e.g., likelihood ratio tests).
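As a sketch of how a likelihood ratio test between C and C + 1 rate classes could look: adding one class adds two free parameters, and a chi-square distribution with 2 degrees of freedom has survival function exp(-x/2). Treating the test statistic as chi-square distributed is an assumption here (it may not hold well for mixture models, where parameters can sit on a boundary), and the log-likelihood values are invented for the example.

```python
from math import exp

def freerate_free_parameters(n_classes):
    """C weights and C rates, minus the two constraints."""
    return 2 * n_classes - 2

def lrt_one_extra_class(lnl_small, lnl_large):
    """LRT statistic and chi-square(2 df) p-value for nested models differing by one class."""
    statistic = max(0.0, 2.0 * (lnl_large - lnl_small))
    p_value = exp(-statistic / 2.0)   # survival function of chi-square with 2 df
    return statistic, p_value

extra = freerate_free_parameters(5) - freerate_free_parameters(4)
statistic, p_value = lrt_one_extra_class(-1000.0, -995.0)  # hypothetical log-likelihoods
print(extra, statistic, p_value)
```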
Results: nucleotide data sets
Results: amino-acid data sets
Prediction of amino-acid diversity
Summary
FreeRate generally fits data better than +Γ4.
Similar computational costs.
Soubrier et al. (2012, Mol. Biol. Evol.): FreeRate returns more accurate estimates of node ages compared to +Γ4.
In PhyML: command-line option --freerate (or --freerates).
Outline
1 Introduction
2 Variability of rates across sites
3 Variability of rates across sites and lineages
4 Variability of selection regimes across sites and lineages
5 Conclusion
Actual rate patterns (?)
Modelling site-specific rate patterns
Each site and each edge has its own rate of evolution.
No-common-mechanism model: poor statistical properties.
Alternative: each edge has the same distribution of rates.
The Integrated Length (IL) approach
The length of a branch is a random variable, characterized by a mean (number of substitutions) and a variance.
In the current implementation, the variance is proportional to the mean (one extra parameter for the whole tree compared to the standard approach).
The Integrated Length (IL) approach
Integrate over a wide range of scenarios...
Including the good ones...
...and the not so good ones.
Theory
Standard approach:
$$P(l) = e^{Ql}$$
IL approach:
$$P(l) = \int_0^{\infty} e^{Ql}\, p(l)\, dl$$
If $l$ is distributed as $\Gamma(\alpha, \beta)$, then:
$$P(\alpha, \beta) = (I - \beta Q)^{-\alpha}$$
Same computational cost as that of the standard approach.
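The closed form can be checked numerically in a simple case. The sketch below assumes the JC69 model, for which the diagonal of exp(Ql) is 1/4 + 3/4 exp(-4l/3), and reads β as the scale of the Gamma distribution (which is what the formula (I - βQ)^(-α) implies). Under these assumptions the diagonal of (I - βQ)^(-α) is 1/4 + 3/4 (1 + 4β/3)^(-α), and a midpoint-rule integration should agree with it. The parameter values are arbitrary.

```python
from math import exp, gamma

def il_p_same(alpha, beta):
    """Diagonal entry of (I - beta Q)^(-alpha) for JC69 (closed form)."""
    return 0.25 + 0.75 * (1.0 + 4.0 * beta / 3.0) ** (-alpha)

def il_p_same_numeric(alpha, beta, steps=200_000):
    """Same quantity by midpoint-rule integration of exp(Ql) against the Gamma density."""
    upper = 60.0 * beta * max(alpha, 1.0)   # far enough into the tail for this example
    h = upper / steps
    norm = gamma(alpha) * beta ** alpha
    total = 0.0
    for i in range(steps):
        l = (i + 0.5) * h
        density = l ** (alpha - 1.0) * exp(-l / beta) / norm
        total += (0.25 + 0.75 * exp(-4.0 * l / 3.0)) * density * h
    return total

alpha, beta = 2.0, 0.1   # hypothetical values: mean length alpha * beta = 0.2
print(il_p_same(alpha, beta), il_p_same_numeric(alpha, beta))
```

With this parameterisation the mean of the branch length is αβ and its variance is αβ², so the variance-to-mean ratio is β.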
Results: nucleotide data sets
Results: amino-acid data sets
Summary
IL incurs approximately the same computational cost as the standard model.
The standard model is nested within IL: avenue for hypothesis testing.
Gamma distribution is a good model for the branch length if the rate of evolution fluctuates according to a (geometric) Brownian process (Guindon, Syst. Biol., 2013).
Large improvement for a small proportion of data sets.
In PhyML: command-line option --il.
Outline
1 Introduction
2 Variability of rates across sites
3 Variability of rates across sites and lineages
4 Variability of selection regimes across sites and lineages
5 Conclusion
Codon models
Use alignments of homologous coding sequences to estimate the ratio of non-synonymous to synonymous (dN/dS) substitution rates.
The Q matrix is now 61 by 61 (instead of 4×4 or 20×20).
We are no longer interested in the variation of the overall rate at which substitutions accumulate. Rather, we focus on the variation of dN/dS.
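The dimension 61 corresponds to the 64 possible nucleotide triplets minus the stop codons. A short sketch, assuming the standard genetic code with its three stop codons (TAA, TAG, TGA):

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard genetic code (assumption)

SENSE_CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)
                if "".join(triplet) not in STOP_CODONS]

print(len(SENSE_CODONS))   # number of states of the codon model
```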
Variation across sites: M2a model
M2a model: a different viewpoint
Consider a new model where each state of the Markov model is a combination of a codon state and a selection regime.
The M2a model is then defined by a rate matrix Q with dimension (3 × 61) by (3 × 61):
$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix}$$
Extending M2a: branch-site model
$$Q = \begin{pmatrix} Q_{\omega_0} & 0 & 0 \\ 0 & Q_{\omega_1} & 0 \\ 0 & 0 & Q_{\omega_2} \end{pmatrix} + \begin{pmatrix} - & \pi_{\omega_1} I & \pi_{\omega_2} I \\ \pi_{\omega_0} I & - & \pi_{\omega_2} I \\ \pi_{\omega_0} I & \pi_{\omega_1} I & - \end{pmatrix}$$
Guindon, Rodrigo, Dyer, Huelsenbeck (2004, PNAS)
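The structure of this combined matrix can be sketched with toy blocks. The example below uses 2 × 2 blocks instead of 61 × 61 codon matrices and made-up class frequencies; it also assumes that the "-" entries stand for diagonal values chosen so that each row sums to zero. Any additional scaling of the switching rates in the published model is not represented here.

```python
def branch_site_q(blocks, class_freqs):
    """Block-diagonal within-class rates plus switches between selection classes.

    State (class a, character i) has index a * n + i. A switch changes the class
    but keeps the character, at a rate equal to the target class frequency.
    """
    n, k = len(blocks[0]), len(blocks)
    q = [[0.0] * (n * k) for _ in range(n * k)]
    for a in range(k):
        for i in range(n):
            for j in range(n):
                if i != j:
                    q[a * n + i][a * n + j] = blocks[a][i][j]   # within-class change
            for b in range(k):
                if b != a:
                    q[a * n + i][b * n + i] = class_freqs[b]    # class switch
    for row in range(n * k):
        q[row][row] = -sum(q[row][col] for col in range(n * k) if col != row)
    return q

# Toy two-state blocks standing in for the three codon matrices (hypothetical rates).
toy_blocks = [[[-w, w], [w, -w]] for w in (0.1, 1.0, 3.0)]
toy_freqs = [0.7, 0.2, 0.1]
Q = branch_site_q(toy_blocks, toy_freqs)
for row in Q:
    print([round(x, 2) for x in row])
```

Because the classes are now connected, a site can move from one selection regime to another along the tree.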
Shan et al. (2009, Mol. Biol. Evol.)
Standard branch-site model
Gold standard: branch-site model (PAML), where the user specifies a priori which branches are likely to be affected by positive selection at some sites.
The stochastic branch-site model does not require such prior information.
Also, the standard branch-site model assumes that the same branches undergo positive selection in different regions of the alignment.
The stochastic branch-site approach does not impose that constraint.
Simulations
Power to detect positive selection
Truth: 50% A + 50% F     XU      XV      XW
std-BS with tree A       0.912   0.916   1.000
std-BS with tree B       0.020   0.036   0.190
std-BS with tree C       0.346   0.346   0.976
std-BS with tree D       0.006   0.010   0.166
std-BS with tree E       0.172   0.150   0.000
std-BS multi             0.022   0.040   0.264
sto-BS                   0.148   0.178   0.682
XU, XV: 20% of the sites evolve under positive selection (on green edges only).
XW: 40% of the sites evolve under positive selection (on green edges only).
Lu & Guindon (2013, Mol. Biol. Evol.)
Summary
Strong prior on where positive selection might have occurred: use the standard branch-site model.
Exploratory analysis: the stochastic branch-site model performs better. Also clearly outperforms the MEME approach implemented in HyPhy (Murrell et al., 2012).
Stochastic branch-site model implemented in fitmodel: http://code.google.com/p/fitmodel.
Outline
1 Introduction
2 Variability of rates across sites
3 Variability of rates across sites and lineages
4 Variability of selection regimes across sites and lineages
5 Conclusion
Conclusion
FreeRate model almost systematically outperforms the standard (+Γ4) one.
IL approach brings significant improvement for a smaller fraction of the alignments (why?)
Stochastic branch-site model of codon evolution is well suited for exploratory analysis where one does not have a clear a priori idea about the lineages evolving under positive selection.
The future?
More data means more variability to account for → improving models of molecular evolution is (still) essential.
Better models for other sources of data, in particular spatial coordinates of collected sequences and fossils.