


Syst. Biol. 69(1):155–183, 2020 © The Author(s) 2019. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: [email protected] DOI:10.1093/sysbio/syz028 Advance Access publication June 6, 2019

An Annealed Sequential Monte Carlo Method for Bayesian Phylogenetics

LIANGLIANG WANG 1,∗, SHIJIA WANG 1,∗, AND ALEXANDRE BOUCHARD-CÔTÉ 2,∗

1 Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada and 2 Department of Statistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

∗Correspondence to be sent to: Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada; E-mail: [email protected]; [email protected]; [email protected].

Received 1 June 2018; reviews returned 12 April 2019; accepted 20 April 2019. Associate Editor: Edward Susko

Abstract.—We describe an "embarrassingly parallel" method for Bayesian phylogenetic inference, annealed Sequential Monte Carlo (SMC), based on recent advances in the SMC literature such as adaptive determination of annealing parameters. The algorithm provides an approximate posterior distribution over trees and evolutionary parameters as well as an unbiased estimator for the marginal likelihood. This unbiasedness property can be used for the purpose of testing the correctness of posterior simulation software. We evaluate the performance of phylogenetic annealed SMC by reviewing and comparing with other computational Bayesian phylogenetic methods, in particular, different marginal likelihood estimation methods. Unlike previous SMC methods in phylogenetics, our annealed method can utilize standard Markov chain Monte Carlo (MCMC) tree moves and hence benefit from the large inventory of such moves available in the literature. Consequently, the annealed SMC method should be relatively easy to incorporate into existing phylogenetic software packages based on MCMC algorithms. We illustrate our method using simulation studies and real data analysis. [Marginal likelihood; phylogenetics; Sequential Monte Carlo.]

INTRODUCTION

The Bayesian paradigm is widely used in systematic biology, principally for the purpose of phylogenetic reconstruction as well as for evaluating the empirical support of evolutionary models (Chen et al. 2014). Both of these tasks, Bayesian phylogenetic reconstruction and model selection, involve an intractable sum over topologies as well as a high-dimensional integral over branch lengths and evolutionary parameters. Consequently, Markov chain Monte Carlo (MCMC) methods have been widely used in the past 20 years to approximate posterior distributions defined over the space of phylogenetic trees (Rannala and Yang 1996).

Despite their success, MCMC phylogenetic methods are still afflicted by two key limitations, hence motivating the need for alternative approximation methods for posterior distributions over phylogenetic trees.

Firstly, MCMC methods do not readily take advantage of highly parallel computer architectures. This is problematic in the current context as progress in computational power mostly comes in the form of parallelism gains. Although there are techniques available to parallelize phylogenetic MCMC methods, they are generally not "embarrassingly parallel": for example, parallel Metropolis coupled MCMC (Altekar et al. 2004) may reach a point where the addition of cores actually reduces sampling efficiency (Atchadé et al. 2011).

A second challenge with MCMC-based phylogenetic approximations arises in the context of model selection. By comparing the marginal likelihood Z = p(y), where y denotes observed data, under different models, one can approach scientific questions under the Bayesian framework while naturally taking into account differences in model complexity. More specifically, the ratio r = p_1(y)/p_2(y) of two marginal likelihoods based on two evolutionary models, p_1(·), p_2(·), can be used to assess the strength of evidence y provides for p_1 (when r > 1) or p_2 (when r < 1). The ratio r is called the Bayes factor (Jeffreys 1935; Lartillot et al. 2006; Oaks et al. 2018). In the context of phylogenetics, the Bayes factor assesses how much support a set of sequencing data provides for one evolutionary model against another one.

Several methods have been proposed to estimate marginal likelihoods based on MCMC methods (Newton and Raftery 1994; Gelman and Meng 1998; Friel and Pettitt 2008, inter alia), including work tailored to the phylogenetic context (Huelsenbeck et al. 2004; Lartillot et al. 2006; Fan et al. 2010; Xie et al. 2010). However, these methods all have different drawbacks; see, for example, the aptly named review, "Nineteen dubious ways to compute the marginal likelihood of a phylogenetic tree topology" (Fourment et al. 2018b). Moreover, one limitation shared by all MCMC-based marginal likelihood estimators is that they are generally biased (in the technical sense of the term as used in computational statistics, reviewed in the theory section of the paper), unless one is able to initialize the MCMC chains to the exact stationary distribution, which in practice is not possible. We argue that in certain scenarios, it can be useful to have unbiased methods. One example we elaborate on is for the purpose of a new test to ascertain correctness of posterior simulation software. Another class of examples comes from the burgeoning field of pseudo-marginal methods (Andrieu and Roberts 2009).

Sequential Monte Carlo (SMC) methods (see Doucet and Johansen 2011 for an accessible introduction to SMC) provide a flexible framework to construct unbiased estimators, and past work has shown they can be very efficient in a phylogenetic context (Teh et al. 2008; Görür and Teh 2009; Bouchard-Côté et al. 2012; Görür et al. 2012; Wang et al. 2015; Everitt et al. 2016;



Dinh et al. 2017; Smith et al. 2017; Fourment et al. 2018a). One drawback caused by the high degree of flexibility that comes with SMC is that the phylogenetic SMC algorithms developed so far are non-trivial to adapt to existing MCMC-based phylogenetic frameworks. Here, we propose a different construction based on the seminal work of Del Moral et al. (2006), in turn based on annealed importance sampling (AIS) (Neal 2001), which yields an SMC method which is in a sense much closer to standard MCMC, while providing unbiased estimators of the marginal likelihood. The proposed method, which we call phylogenetic annealed SMC, can directly make use of any existing phylogenetic MCMC proposals, a rich literature covering many kinds of phylogenetic trees (Rannala and Yang 1996; Yang and Rannala 1997; Larget and Simon 1999; Mau et al. 1999; Li et al. 2000; Holder and Lewis 2003; Rannala and Yang 2003; Höhna et al. 2008; Lakner et al. 2008; Höhna and Drummond 2012). It is easy to incorporate the proposed annealed SMC into existing phylogenetic software packages that implement MCMC algorithms, such as RevBayes (Höhna et al. 2016) or BEAST (Drummond and Rambaut 2007). At the same time, our method can leverage state-of-the-art advances in the field of adaption of SMC algorithms, making the algorithm fully automated in most cases.

Our implementation of the proposed method is available at https://github.com/liangliangwangsfu/annealedSMC. All our experimental setups and results are available at https://github.com/shijiaw/AnnealingSimulation. The algorithms described here are also available in the Blang probabilistic programming language https://github.com/UBC-Stat-ML/blangSDK, which supports a small but growing set of phylogenetic models.

LITERATURE REVIEW

There is a growing body of work on SMC-based Bayesian phylogenetic inference. Indeed, a powerful feature of the general SMC framework (Del Moral et al. 2006) is that the space on which the distributions π_r are defined is allowed to vary from one iteration to the next. All previous work on SMC methods for phylogenetics has exploited this feature for various purposes reviewed here.

In one direction, several "bottom up" approaches (Teh et al. 2008; Görür and Teh 2009; Bouchard-Côté et al. 2012; Görür et al. 2012; Wang et al. 2015) have been proposed to allow more efficient reuse of intermediate stages of the Felsenstein pruning recursions. For these methods, the intermediate distributions are defined over forests over the observed taxa, and hence their dimensionality increases with r. These methods are most effective in clock or nearly clock trees. For general trees, it is typically necessary to perform additional MCMC steps, which makes it harder to use in the context of estimation of marginal likelihoods.

In a related direction, Dinh et al. (2017) and Fourment et al. (2018a) use a sequence of targets where π_r is a tree over the first r tips. This construction is especially useful in scenarios where taxonomic data come in an online fashion.

Another use case of SMC methods in phylogenetics arises from Bayesian analysis of intractable evolutionary models. For example, SMC has been used for Bayesian phylogenetic analysis based on infinite state-space evolutionary models (Hajiaghayi et al. 2014) or for joint inference of transmission networks (Smith et al. 2017).

Finally, a concurrent line of work (Everitt et al. 2016) has explored a combination of reversible jump methods with phylogenetic models.

One drawback of letting the dimensionality of π_r vary with r, as all the above methods do, is that it makes it significantly harder to incorporate SMC into existing Bayesian phylogenetic inference packages such as MrBayes (Huelsenbeck and Ronquist 2001), RevBayes, or BEAST. In contrast, in our method the target distributions π_r are all defined over the same space. The annealed SMC framework in this context utilizes Metropolis-Hastings kernels in the inner loop but combines them in a different fashion compared with standard MCMC algorithms, or even compared with parallel tempering MCMC algorithms.

SETUP AND NOTATION

We let t denote a phylogenetic tree with tips labeled by a fixed set of operational taxonomic units X. The variable t encapsulates the tree topology and a set of positive branch lengths. Our methodology is directly applicable to any class of phylogenetic trees where MCMC proposal distributions are available. This includes, for example, clock trees (Höhna et al. 2008) as well as nonclock trees (Lakner et al. 2008).

We let θ denote evolutionary parameters, for example the parameters of a family of rate matrices such as the general time reversible (GTR) model (Tavaré 1986), or diffusion parameters in the case of continuous traits (Lemey et al. 2010). Again, our method is applicable to any situation where MCMC proposals are available for exploring the space of θ. We use x = (t, θ) to denote these two latent variables.

We let y denote observed data indexed by the tips X of t. We assume a likelihood function p(y|x) is specified such that for any hypothesized tree and parameters, the value p(y|x) can be computed efficiently. This assumption is sometimes called pointwise computation. This is a typical assumption in Bayesian phylogenetics, where this computation is done with some version of Felsenstein pruning (Felsenstein 1973, 1981) (an instance of the Forward-Backward algorithm; Forney 1973).
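To make the notion of pointwise likelihood computation concrete, here is a minimal sketch of single-site Felsenstein pruning under the Jukes–Cantor model. The nested-tuple tree encoding, function names, and uniform root distribution are illustrative choices for this sketch, not the paper's implementation.

```python
import math

# Trees are nested tuples: a leaf is ("name", branch_length);
# an internal node is (left_child, right_child, branch_length).

STATES = "ACGT"

def jc69_prob(i, j, t):
    """Jukes-Cantor transition probability from state i to j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, tip_states):
    """Conditional likelihood vector at `node` for one site (Felsenstein pruning)."""
    if isinstance(node[0], str):  # leaf: (name, branch_length)
        name, _ = node
        return [1.0 if s == tip_states[name] else 0.0 for s in STATES]
    left, right, _ = node
    pl = partial_likelihoods(left, tip_states)
    pr = partial_likelihoods(right, tip_states)
    result = []
    for i in STATES:
        lsum = sum(jc69_prob(i, j, left[-1]) * pl[k] for k, j in enumerate(STATES))
        rsum = sum(jc69_prob(i, j, right[-1]) * pr[k] for k, j in enumerate(STATES))
        result.append(lsum * rsum)
    return result

def site_likelihood(root, tip_states):
    """p(y | t) for one site: sum over root states under the uniform stationary distribution."""
    return sum(0.25 * v for v in partial_likelihoods(root, tip_states))
```

The recursion visits each branch once, so the cost is linear in the number of taxa, whereas naive enumeration over internal-node states grows exponentially.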

Finally, let p(x) denote a prior on the parameters and trees, which we assume can also be computed pointwise efficiently. This defines a joint distribution, denoted γ(x) = p(x) p(y|x). We ignore the argument y from now on since the data are viewed as fixed in a Bayesian analysis context.


We are interested in approximating a posterior distribution on x given data y, denoted:

π(x) = γ(x) / ∫ γ(x′) dx′.  (1)

Here, the integral and dx′ are viewed in an abstract sense and include both summation over discrete latent variables such as topologies and standard integration over continuous spaces.

The denominator can be interpreted as the marginal likelihood under the model specified by the prior and likelihood functions, which we denote by Z:

Z = p(y) = ∫ γ(x) dx.  (2)

Computation of this quantity, also called the normalization constant or evidence, is the main challenge involved when doing Bayesian model selection.

Other quantities of interest include expectations with respect to the posterior distribution, characterized by a real-valued function of interest f based on which we would like to compute

∫ π(x) f(x) dx.  (3)

For example, if we seek a posterior clade support for a subset X′ ⊂ X of the leaves X,

f(x) = f(t, θ) = 1[t admits X′ as a clade], where 1[s] denotes the indicator function, which is equal to one if the Boolean expression s is true and zero otherwise.
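The clade-indicator f can be evaluated by collecting the clades a topology induces and testing membership. The nested-tuple topology encoding and helper names below are hypothetical conveniences for this sketch, not part of the paper's software.

```python
def clades_of(node, out=None):
    """Return (leaf set, set of clades) for a nested-tuple topology.

    A leaf is a string label; an internal node is a pair (left, right).
    """
    if out is None:
        out = set()
    if isinstance(node, str):  # leaf
        return {node}, out
    left, right = node
    l, _ = clades_of(left, out)
    r, _ = clades_of(right, out)
    leaves = l | r
    out.add(frozenset(leaves))  # record the clade rooted at this node
    return leaves, out

def f_clade(topology, x_prime):
    """Indicator f(x) = 1[t admits X' as a clade]."""
    _, clades = clades_of(topology)
    return 1 if frozenset(x_prime) in clades else 0
```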

ANNEALED SMC FOR PHYLOGENETICS

Sequences of Distributions

In standard MCMC methods, we are interested in a single probability distribution, the posterior distribution. However, there are several reasons why we may use a sequence of distributions rather than only one.

A first possibility is that we may have an online problem, where the data are revealed sequentially and we want to perform inference sequentially in time based on the data available so far. The distribution at step r is then the posterior distribution conditioning on the first r batches of data. This approach is explored in the context of phylogenetics in Dinh et al. (2017), where a batch of data consists in genomic information for one additional operational taxonomic unit. We do not pursue this direction here but discuss some possibilities for combinations in the discussion.

A second reason for having multiple distributions, and the focus of this work, is to facilitate the exploration of the state space. This is achieved for example by raising the likelihood term to a power φ_r between zero and one, which we multiply with the prior

γ_r(x) = [p(y|x)]^{φ_r} p(x).  (4)

MCMC may get stuck in a region of the space of phylogenetic trees around the initial value. This may happen, for example, around a local maximum (mode) in the posterior density. Such a region is sometimes called a "basin of attraction," and no single basin of attraction may be enough to represent the full posterior distribution well. Introducing a series of powered posterior distributions can alleviate this issue. A small value of φ_r flattens the posterior and makes MCMC samplers move easily between the different basins of attraction. The samples are initially overly dispersed but are then coerced into the posterior distribution π(x) by slowly increasing the annealing parameter φ_r.
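The flattening effect of Equation (4) can be made concrete with a tiny numerical check: if one state fits the data 100 times better than another, the annealed density ratio at φ_r = 0.1 is only 100^0.1 ≈ 1.6, so a sampler crosses between the two basins far more easily. The `log_likelihood` and `log_prior` callables below are hypothetical stand-ins for a generic model.

```python
import math

def annealed_log_gamma(x, phi_r, log_likelihood, log_prior):
    """Unnormalized annealed log-posterior: log γ_r(x) = φ_r · log p(y|x) + log p(x)."""
    return phi_r * log_likelihood(x) + log_prior(x)
```

Raising φ_r back toward 1 restores the full contrast between modes, which is the mechanism the annealing schedule exploits.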

We do not anneal the prior, to ensure that γ_r(x) has a finite normalization constant:

∫ γ_r(x) dx = E_{p(x)}[(p(y|X))^{φ_r}] ≤ (E_{p(x)}[p(y|X)])^{φ_r} = (p(y))^{φ_r} < ∞,

where the inequality follows from the concavity of (·)^{φ_r} and Jensen's inequality.

A third scenario is that we may encounter a "tall data" problem, for example biological sequences with a large number of sites. When the number of sites is large, evaluation of the unnormalized posterior γ_r(x) defined in Equation (4) is computationally expensive. The idea of data subsampling (Quiroz et al. 2018a, 2018b; Bardenet et al. 2017; Gunawan et al. 2018) could be used to define the sequence of distributions. The construction of the sequence of distributions is described in Appendix 1.

The probability distributions

π_r(x) = γ_r(x) / ∫ γ_r(x′) dx′  (5)

are therefore well defined, and we denote their respective normalization constants by

Z_r = ∫ γ_r(x) dx.  (6)

If the exponent φ_r is zero, then the distribution π_r becomes the prior, which is often easy to explore; in fact, independent samples can be extracted in many situations. At the other extreme, the distribution at power φ_r = 1 is the distribution of interest.

The intermediate distributions {π_r}_{r=1,...,R} are defined on a common measurable space (X, E). The annealed SMC is a generalization of the standard SMC method (Doucet et al. 2001). In standard SMC, the intermediate distributions are defined on a space of strictly increasing dimension.


Basic Annealed SMC Algorithm

We now turn to the description of annealed SMC in the context of Bayesian phylogenetic inference. The algorithm fits into the generic framework of SMC samplers (Del Moral et al. 2006): at each iteration, indexed by r = 1, 2, ..., R, we maintain a collection, indexed by k ∈ {1, 2, ..., K}, of imputed latent states x_{r,k}, each paired with a non-negative number called a weight w_{r,k}; such a pair is called a particle. A latent state in our context consists in a hypothesized tree t_{r,k} and a set of evolutionary parameters θ_{r,k}, i.e., x_{r,k} = (t_{r,k}, θ_{r,k}). In contrast to previous SMC methods, x_{r,k} is always of the same data type: no partial states such as forests or trees over subsets of leaves are considered here.

A particle population consists in a list of particles (x_{r,·}, w_{r,·}) = {(x_{r,k}, w_{r,k}) : k ∈ {1, ..., K}}. A particle population can be used to estimate posterior probabilities as follows: first, normalize the weights, denoted after normalization using a capital letter, W_{r,k} = w_{r,k} / Σ_{k′} w_{r,k′}. Second, use the approximation:

∫ π_r(x) f(x) dx ≈ Σ_{k=1}^{K} W_{r,k} f(x_{r,k}).  (7)
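Equation (7) amounts to a normalized weighted average over the particle population; a minimal sketch (names illustrative):

```python
def normalize(weights):
    """Turn unnormalized weights w_{r,k} into normalized weights W_{r,k}."""
    total = sum(weights)
    return [w / total for w in weights]

def particle_estimate(particles, weights, f):
    """Approximate ∫ π_r(x) f(x) dx by Σ_k W_{r,k} f(x_{r,k}) as in Equation (7)."""
    W = normalize(weights)
    return sum(Wk * f(xk) for Wk, xk in zip(W, particles))
```

Taking f to be an indicator function recovers posterior probability estimates such as the clade support below.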

For example, if we seek a posterior clade support for a subset X′ ⊂ X of the leaves X, this becomes

Σ_{k=1}^{K} W_{r,k} · 1[sampled tree t_{r,k} admits X′ as a clade].

The above formula is most useful at the last SMC iteration, r = R, because π_R coincides with the posterior distribution by construction.

At the first iteration, each particle's tree and evolutionary parameters are sampled independently and identically from their prior distributions. We assume for simplicity that this prior sampling step is tractable, a reasonable assumption in many phylogenetic models. After initialization, we therefore have a particle-based approximation of the prior distribution. Intuitively, the goal behind the annealed SMC algorithm is to progressively transform this prior distribution approximation into a posterior distribution approximation.

To formalize this intuition, we use the sequence of distributions introduced in the previous section. The last ingredient required to construct an SMC algorithm is an SMC proposal distribution K_r(x_{r−1,k}, x_{r,k}), used to sample a particle for the next iteration given a particle from the previous iteration. Because x_{r−1,k} and x_{r,k} have the same dimensionality in our setup, it is tempting to use MCMC proposals q_r(x_{r−1,k}, x_{r,k}) in order to build SMC proposals, for example, subtree prune and regraft moves, and Gaussian proposals for the continuous parameters and branch lengths. Indeed, there are several advantages to using MCMC proposals as the basis of SMC proposals. First, it means we can leverage a rich literature on the topic (Rannala and Yang 1996; Yang and Rannala 1997; Larget and Simon 1999; Mau et al. 1999; Li et al. 2000; Holder and Lewis 2003; Rannala and Yang 2003; Höhna et al. 2008; Lakner et al. 2008; Höhna and Drummond 2012). Second, it makes it easier to add SMC support to existing MCMC-based software libraries. Third, it makes certain benchmark comparisons between SMC and MCMC more direct, as we can then choose the set of moves to be the same for both. On the flip side, constructing MCMC proposals is somewhat more constrained, so some of the flexibility provided by the general SMC framework is lost.

Naively, we could pick the SMC proposal directly from an MCMC proposal, K_r(x_{r−1,k}, x_{r,k}) = q_r(x_{r−1,k}, x_{r,k}). However, doing so would have the undesirable property that the magnitude of the fluctuation of the weights of the particles from one iteration to the next, ‖W_{r−1,·} − W_{r,·}‖, does not converge to zero when the annealing parameter change φ_r − φ_{r−1} goes to zero. This lack of convergence to zero can potentially cause severe particle degeneracy problems, forcing the use of a number of particles larger than what can be realistically accommodated in memory (although workarounds exist, e.g., Jun and Bouchard-Côté 2014). To avoid this issue, we follow Del Moral et al. (2006) and use as SMC proposal the accept–reject Metropolis-Hastings transition probability based on q_r (called a Metropolized proposal), reviewed in Algorithm 1.

Algorithm 1 Accept–reject Metropolis-Hastings algorithm

1. Propose a new tree and/or new evolutionary parameters, x*_r ∼ q_r(x_{r−1}, ·). ▷ For example, using a nearest neighbor interchange, and/or a symmetric normal proposal on branch lengths and/or evolutionary parameters.
2. Compute the Metropolis-Hastings ratio based on γ_r:
   α_r(x_{r−1}, x*_r) = min{1, [γ_r(x*_r) q(x*_r, x_{r−1})] / [γ_r(x_{r−1}) q(x_{r−1}, x*_r)]}.
3. Simulate u ∼ U(0, 1).
4. if u < α_r(x_{r−1}, x*_r) then
5.   x_r = x*_r. ▷ Output the proposal x*_r.
6. else
7.   x_r = x_{r−1}. ▷ Output the previous state x_{r−1}.
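Algorithm 1 can be written generically in a few lines. The sketch below works in log space for numerical stability; `propose` is assumed to return both the proposed state and the log proposal-density correction log q(x*, x) − log q(x, x*), which is zero for symmetric proposals. Names are illustrative, not from the paper's software.

```python
import math
import random

def mh_step(x, log_gamma_r, propose, rng=random):
    """One accept-reject Metropolis-Hastings move targeting γ_r (Algorithm 1)."""
    x_star, log_q_correction = propose(x, rng)
    log_ratio = log_gamma_r(x_star) - log_gamma_r(x) + log_q_correction
    if math.log(rng.random()) < min(0.0, log_ratio):
        return x_star   # accept the proposal x*
    return x            # keep the previous state
```

For a quick sanity check one can target a standard normal density with a symmetric random-walk proposal and verify that the chain's sample mean and variance are close to 0 and 1.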

The key point is that a theoretical argument (reviewed in Appendix 2) shows that provided that (1) K_r has stationary distribution π_r (which is true by construction, a consequence of using the Metropolis-Hastings algorithm) and (2) we use the weight formula:

w_{r,k} = γ_r(x_{r−1,k}) / γ_{r−1}(x_{r−1,k}),  (8)

then we obtain a valid SMC algorithm, meaning that the key theoretical properties expected from SMC hold under regularity conditions; see the Theoretical Properties section.

In the important special case where γ_r(x_r) is equal to the prior times an annealed likelihood, we obtain

w_{r,k} = [p(y|x_{r−1,k})]^{φ_r − φ_{r−1}}.  (9)


As hoped, the update shown in Equation (9) has the property that weight fluctuations vanish as the annealing parameter difference φ_r − φ_{r−1} goes to zero. This will form the basis of the annealing parameter sequence adaptation strategies described in the next section. But for now, assume for simplicity that the number of iterations R and the annealing schedule φ_r, r ∈ {1, ..., R}, are prespecified. For example, a simple choice for the annealing parameter sequence (Friel and Pettitt 2008) is φ_r = (r/R)^3, where R is the total number of SMC iterations. In this case, the difference between successive annealing parameters is (3r^2 − 3r + 1)/R^3. An annealed SMC with a larger value of R is computationally more expensive but has better performance.
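The cubic schedule and the stated difference formula are easy to verify numerically, since r^3 − (r − 1)^3 = 3r^2 − 3r + 1:

```python
def cubic_schedule(R):
    """Fixed annealing schedule φ_r = (r/R)^3 for r = 0, ..., R (Friel and Pettitt 2008)."""
    return [(r / R) ** 3 for r in range(R + 1)]
```

A practical consequence of the cubic shape is that successive φ_r are packed tightly near zero, where the target changes shape fastest, and spaced more widely near one.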

In contrast to other SMC algorithms, the annealed SMC algorithm does not require pointwise evaluation of the proposal K_r(x_{r−1,k}, x_{r,k}); that is, given x_{r−1,k} and a sampled x_{r,k}, we do not need to compute the numerical value of K_r(x_{r−1,k}, x_{r,k}), as it does not appear in the weight update formula, Equation (8). This point is important because, for Metropolis-Hastings kernels, pointwise evaluation would require computation of a typically intractable integral under the proposal in order to compute the total probability of rejection. The theoretical justification as to why we do not need pointwise evaluation of K_r is detailed in Appendix 2.

In practice, many proposals are needed to modify different latent variables and to improve mixing. We give in Appendix 3 the list of MCMC proposals we consider. Let q_r^i, i = 1, ..., M, denote the various proposals, and K_r^i the corresponding Metropolized transition probabilities. We need to combine them into one proposal K_r. To ensure that condition (1) above is satisfied, namely that K_r obeys global balance with respect to π_r, we use the following property (Tierney 1994; Andrieu et al. 2003): if each of the transition kernels K^i, i = 1, ..., M, respects global balance with respect to π, then the cycle hybrid kernel ∏_{i=1}^{M} K^i and the mixture hybrid kernel Σ_{i=1}^{M} p_i K^i, with Σ_{i=1}^{M} p_i = 1, also satisfy global balance with respect to π. The global balance condition,

∫ π_r(x) K_r(x, x′) dx = π_r(x′),

ensures that the Markov chain encoded by K_r admits π_r as a stationary distribution. In practice, the mixture kernel is implemented by randomly selecting K^i with probability p_i at each iteration (Andrieu et al. 2003).

We can now introduce in Algorithm 2 the simplest version of the annealed SMC, which alternates between reweighting, propagating, and resampling. Figure 1 presents an overview of the annealed SMC algorithmic framework. In the proposal step, we propose new particles through MCMC moves (typically Metropolis-Hastings moves). Finally, we use resampling to prune particles with small weights. A list of unweighted particles is obtained after the resampling step.

In the annealed SMC algorithm, note that the weighting and proposal steps can be interchanged. This is different from standard SMC algorithms, where in general the proposal has to be computed before weighting. This interchange is possible because in the annealed SMC algorithm the weighting function, Equation (8), only depends on particles from the previous iteration and not on those just proposed, as in standard SMC algorithms. This flexibility will come in handy when designing adaptive schemes.

Algorithm 2 The simplest version of the annealed SMC algorithm (for pedagogy)

1. Inputs:
2. (a) Prior over evolutionary parameters and trees, p(x), where x = (θ, t);
3. (b) Likelihood function p(y|x);
4. (c) Sequence of annealing parameters 0 = φ_0 < φ_1 < ··· < φ_R = 1.
5. Outputs: Approximation of the posterior distribution, ∑_k W_{R,k} δ_{x_{R,k}}(·) ≈ π(·).
6. Initialize SMC iteration index: r ← 0.
7. Initialize annealing parameter: φ_r ← 0.
8. for k ∈ {1,2,...,K} do
9.     Initialize particles x_{0,k} ← (θ_{0,k}, t_{0,k}) ∼ p(·).
10.    Initialize weights to unity: w_{0,k} ← 1.
11. for r ∈ {1,2,...,R} do
12.    for k ∈ {1,2,...,K} do
13.        Sample particles x_{r,k} ∼ K_r(x_{r−1,k}, ·); K_r is a π_r-invariant Metropolis-Hastings kernel.
14.        Compute unnormalized weights: w_{r,k} = [p(y|x_{r−1,k})]^{φ_r − φ_{r−1}}.
15.    if r < R then
16.        for k ∈ {1,2,...,K} do
17.            Resample the particles: x_{r,k} ∼ ∑_{k′} W_{r,k′} δ_{x_{r,k′}}(·).
18.    else
19.        No resampling needed at the last iteration.
20. Return the particle population x_{r,·}, W_{r,·}.

Before moving on to more advanced versions of the algorithm, we first provide some intuition to motivate the need for resampling. Theoretically, the algorithm produces samples from an artificial distribution with state space X × X × ··· × X = X^R (this is described in more detail in Appendix 2). However, since we only make use of one copy of X (corresponding to the particles at the final SMC iteration), we would like to decrease the variance of the state at iteration R (more precisely, of Monte Carlo estimators of functions of the state at iteration R). This is what resampling for iterations r < R achieves, at the cost of increasing the variance for the auxiliary part of the state space r < R. From this argument, it follows that resampling at the last iteration should be avoided.

When resampling is performed at every iteration but the last, an estimate of the marginal likelihood, p(y), is given by the product of the average unnormalized weights, namely:

$$ \hat{Z}_K := \prod_{r=1}^{R} \frac{1}{K} \sum_{k=1}^{K} w_{r,k}. \qquad (10) $$
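Equation (10) is straightforward to compute from the matrix of unnormalized weights; a minimal sketch of ours, working in log space because phylogenetic likelihoods are typically astronomically small:

```python
import math

def log_marginal_likelihood(log_weights):
    """Estimate log Z via Equation (10):
    Z_hat = prod_r (1/K) sum_k w_{r,k}.
    `log_weights` holds one row of K log-weights per SMC iteration r;
    a log-sum-exp per row avoids numerical underflow."""
    log_z = 0.0
    for log_w_r in log_weights:
        K = len(log_w_r)
        m = max(log_w_r)
        log_z += m + math.log(sum(math.exp(lw - m) for lw in log_w_r)) - math.log(K)
    return log_z
```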


FIGURE 1. An overview of the annealed SMC algorithmic framework for phylogenetic trees. The algorithm iterates the following three steps: (1) compute the weights using samples from the previous iteration, (2) perform MCMC moves to propose new samples, and (3) resample from the weighted samples to obtain an unweighted set of samples.

ADAPTIVE MECHANISMS FOR ANNEALED SMC

We discuss how two adaptive schemes from the SMC literature can be applied in our Bayesian phylogenetic inference setup to improve the scalability and usability of the algorithm described in the previous section. The first scheme relaxes the assumption that resampling is performed at every step, and the second is a method for automatic construction of the annealing parameter sequence. The two mechanisms go hand in hand and we recommend using both simultaneously. The combination yields Algorithm 3, which we explain in detail in the next two subsections.

The two adaptive mechanisms make theoretical analysis considerably more difficult. This is a common situation in the SMC literature. A common work-around used in the SMC literature is to run the algorithm twice: a first time to adaptively determine the resampling and annealing schedules, and then a second, independent time using the schedule fixed in the first pass. We call this debiased adaptive annealed SMC.

Measuring Particle Degeneracy Using Relative (Conditional) Effective Sample Size

Both adaptive methods rely on being able to assess the quality of a particle approximation. For completeness, we provide more background in Appendix 5 on the notions of effective sample size (ESS) and conditional ESS (CESS), a recent generalization which we use here (Zhou et al. 2016). The notion of ESS in the context of importance sampling (IS) or SMC is distinct from the notion of ESS in the context of MCMC. The two are related in the sense of expressing a variance inflation compared with an idealized Monte Carlo scheme, but they differ in the details. We will assume from now on that ESS refers to the SMC context.

We will use a slight variation of the definitions of ESS and CESS in which the measures obtained are normalized to be between zero and one. Some tuning parameters of the adaptive algorithms are easier to express in this fashion. We use the terminology relative (conditional) ESS to avoid confusion. Motivated by the analysis of the error of Monte Carlo estimators, the key measure of particle degeneracy needed in the following is the relative CESS:

$$ \mathrm{rCESS}(W, u) = \left( \sum_{k=1}^{K} W_k u_k \right)^{2} \Bigg/ \sum_{k=1}^{K} W_k u_k^{2}, \qquad (11) $$

where W = (W_1, W_2, ..., W_K) is a vector of weights of a set of reference weighted particles being updated using a vector of non-negative values u = (u_1, u_2, ..., u_K). What


Algorithm 3 An adaptive annealed SMC algorithm

1. Inputs: (a) Prior over evolutionary parameters and trees p(x), where x = (θ, t); (b) Likelihood function p(y|x).
2. Outputs: (a) Approximation Ẑ of the marginal data likelihood, Ẑ ≈ p(y) = ∫ p(dx) p(y|x); (b) Approximation of the posterior distribution, ∑_k W_{R,k} δ_{x_{R,k}}(·) ≈ π(·).
3. Initialize SMC iteration index: r ← 0.
4. Initialize annealing parameter: φ_r ← 0.
5. Initialize marginal likelihood estimate: Ẑ ← 1.
6. for k ∈ {1,2,...,K} do
7.     Initialize particles with independent samples: x_{0,k} ← (θ_{0,k}, t_{0,k}) ∼ p(·).
8.     Initialize weights to unity: w_{0,k} ← 1.
9. for r ∈ {1,2,...} do
10.    Determine next annealing parameter: φ_r = NextAnnealingParameter(x_{r−1,·}, w_{r−1,·}, φ_{r−1}).
11.    for k ∈ {1,...,K} do
12.        Compute pre-resampling unnormalized weights: w̃_{r,k} = w_{r−1,k} [p(y|x_{r−1,k})]^{φ_r − φ_{r−1}}.
13.        Sample particles x̃_{r,k} ∼ K_r(x_{r−1,k}, ·); K_r is a π_r-invariant Metropolis-Hastings kernel.
14.    if φ_r = 1 then
15.        Update Ẑ ← (Ẑ/K) · ∑_k w̃_{r,k}, then return updated Ẑ and particle population x̃_{r,·}, W̃_{r,·}.
16.    else
17.        if particle degeneracy is too severe, i.e. rESS(W̃_{r,·}) < ε, then
18.            Update marginal likelihood estimate: Ẑ ← (Ẑ/K) · ∑_k w̃_{r,k}.
19.            Resample the particles.
20.            for k ∈ {1,...,K} do
21.                Reset particle weights: w_{r,k} = 1.
22.        else
23.            for k ∈ {1,...,K} do
24.                w_{r,k} = w̃_{r,k}; x_{r,k} = x̃_{r,k}.   ▷ No resampling is needed.

W and u specifically represent will be explained in the next subsection.

Having a high rCESS value is a necessary but not sufficient condition for a good SMC approximation. If it is low during some of the intermediate SMC iterations, then the ESS at the final iteration may not be representative of the true posterior approximation quality.
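The relative CESS in Equation (11) is a one-liner; a minimal sketch of ours, for illustration:

```python
def rcess(W, u):
    """Relative conditional ESS, Equation (11): lies in (0, 1] for
    normalized weights W (Cauchy-Schwarz), and equals 1 exactly when
    the update u is constant across particles."""
    num = sum(w * x for w, x in zip(W, u)) ** 2
    den = sum(w * x * x for w, x in zip(W, u))
    return num / den
```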

Dynamic Resampling

As explained in the Basic Annealed SMC Algorithm section, the construction of the proposal guarantees that as the difference φ_r − φ_{r−1} goes to zero, the fluctuation of the weights vanishes. In this context (of having small weight updates), resampling at every iteration is wasteful. Fortunately, SMC algorithms can be modified to forgo a subset of the resampling steps. From a theoretical stand-point, this is achieved by “grouping” the SMC proposals when they are not separated by a resampling round (and grouping similarly the intermediate distributions π_r). For example, to resample every other round, use a transformed SMC algorithm with proposal K′_{r/2}(x_r, (x_{r+1}, x_{r+2})) = K_{r+1}(x_r, x_{r+1}) K_{r+2}(x_{r+1}, x_{r+2}), for each even r. For convenience, this can be implemented as an algorithm over R iterations instead of R/2, with two modifications. First, when resampling is skipped, we multiply the weights; otherwise, we reset the weights to one after resampling. This is implemented in Lines 12 and 21 of Algorithm 3. Second, we only use the weights corresponding to resampling rounds in the estimate of the marginal likelihood (Equation (10)). This is implemented in Lines 15 and 18 of Algorithm 3.

Instead of specifying in advance the subset of iterations in which resampling should be performed, it is customary in the SMC literature to determine whether to resample in an adaptive fashion (Doucet and Johansen 2011). To do so, the standard approach is to compute a measure of particle degeneracy at every iteration, and to perform resampling only when the particle degeneracy exceeds a predetermined threshold. In Appendix 4, we empirically compare the performance of the adaptive annealed SMC algorithm with different resampling thresholds. All our numerical experiments use the multinomial resampling method, but we recommend more advanced schemes such as stratified resampling (Douc and Cappé 2005).

The standard measure of particle degeneracy used for this purpose is called the relative ESS, defined as:

$$ \mathrm{rESS}(W_{r,\cdot}) = \left( K \sum_{k=1}^{K} W_{r,k}^{2} \right)^{-1}. \qquad (12) $$
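A sketch of the adaptive resampling decision (the threshold name ε is from Algorithm 3; the helper functions are ours):

```python
def ress(W):
    """Relative ESS, Equation (12): (K * sum_k W_k^2)^{-1},
    where W are normalized weights summing to one. Equals 1 for
    uniform weights and 1/K for a fully degenerate population."""
    K = len(W)
    return 1.0 / (K * sum(w * w for w in W))

def should_resample(W, eps=0.5):
    """Resample only when particle degeneracy is severe, i.e. when
    the relative ESS drops below the threshold eps (line 17 of
    Algorithm 3; the default 0.5 is an illustrative choice)."""
    return ress(W) < eps
```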

The above formula can be shown to be a special case of rCESS, Equation (11), as follows. Let r* denote the iteration of the latest resampling round preceding the current iteration r. This implies W_{r*,k} = 1/K for all k. Plugging the weight update u_k = w̃_{r,k} into Equation (11), we obtain

$$ \mathrm{rCESS}(W_{r^*,\cdot}, \tilde{w}_{r,\cdot}) = \left( \sum_{k=1}^{K} \frac{1}{K}\, \tilde{w}_{r,k} \right)^{2} \Bigg/ \sum_{k=1}^{K} \frac{1}{K}\, \tilde{w}_{r,k}^{2} = \frac{1}{K} \cdot \frac{\left( \sum_{k=1}^{K} \tilde{w}_{r,k} \right)^{2}}{\sum_{k=1}^{K} \tilde{w}_{r,k}^{2}} = \left( K \sum_{k=1}^{K} W_{r,k}^{2} \right)^{-1}. $$


Adaptive Determination of Annealing Parameters

Our sequence of intermediate artificial distributions π_r as defined in Equation (4) is determined by the choice of the annealing schedule, {φ_r}, or equivalently, by choosing the successive differences φ_r − φ_{r−1}. Ideally, the sequence of intermediate distributions changes gradually from the prior distribution (φ_0 = 0) to the posterior distribution (φ_R = 1), so that the propagated particles from the current iteration can well approximate the next intermediate distribution.

In practice, constructing such a sequence {φ_r}, r = 1, ..., R, is difficult and inconvenient. Not only may the number of distributions R needed to attain a certain accuracy depend on the number of taxa, the number of sites, and the complexity of the evolutionary model, but the optimal spacing between consecutive annealing parameters is also in general non-regular. To alleviate this, in the following we borrow an adaptive strategy from the Approximate Bayesian Computation literature (Del Moral et al. 2012), also generalized to Bayesian model selection by Zhou et al. (2016).

The adaptive annealing scheme is based on two observations. First, our discrete set of intermediate distributions π_1, π_2, ..., π_R is actually continuously embedded into a continuum of distributions indexed by φ ∈ [0,1]. Second, in the SMC algorithm presented in Algorithm 2, the weight update, Line 14, depends only on x_{r−1} (whereas in general SMC algorithms, the weight update could depend on both x_{r−1} and x_r; here it does not because of the cancellation explained in Appendix 2). The consequence of the lack of dependence on x_r is that we can swap the order of the proposal (Line 13) and particle weighting (Line 14) in Algorithm 2. So instead of computing the weights only for one predetermined annealing parameter φ_r, we can search over several tentative values. For each tentative value, we can score the choice using a measure of weight degeneracy applied to the putative weights. Crucially, each choice can be scored quickly without having to propose particles, which is key since proposals are typically the computational bottleneck: in a phylogenetic context, the cost of one proposal step scales linearly in the number of sites, whereas the search over φ_r proposed in this section has a running time constant in the number of sites and taxa. This is because the search involves fixed values of p(y|x_{r−1,k}), cached from the last proposal step, which are exponentiated to different values.

Based on these observations, we select an annealing parameter φ such that we achieve a controlled increase in particle degeneracy, namely such that

$$ g(\phi) = \alpha\, g(\phi_{r-1}), \qquad (13) $$

where the function g : [φ_{r−1}, ∞) → [0,1] is defined as

$$ g(\phi) = \mathrm{rCESS}\left( W_{r-1,\cdot},\; p(y \mid x_{r-1,\cdot})^{\phi - \phi_{r-1}} \right), $$

and α ∈ (0,1) is a tuning parameter, which in practice is close to 1. By construction, g(φ_{r−1}) = 1, so Equation (13) is equivalent to g(φ) = α.

More precisely, since we want φ ∈ [0,1], the annealing parameter adaptation procedure, NextAnnealingParameter (Algorithm 4), is designed to return φ_r = 1 if g(1) ≥ α. Otherwise, because there is no closed-form solution for φ in Equation (13), we use bisection to solve this one-dimensional search problem in the interval φ ∈ (φ_{r−1}, 1) (Line 7 of Algorithm 4).

We now argue that the search problem in Line 7 of Algorithm 4 always has a solution. Indeed, g is a continuous function with, on the left end of the search interval, g(φ_{r−1}) = 1, and on the right end, g(1) < α (otherwise the algorithm sets φ_r = 1 in Line 5). It follows that there must indeed be an intermediate point φ* with g(φ*) = α. Note that continuity and the identification of the left end point of the interval are possible thanks to the form of our weight update in Equation (9), hence justifying the earlier informal argument about the need to have the fluctuation of the weights vanish as φ_r − φ_{r−1} goes to zero.

As in the previous section on dynamic resampling, NextAnnealingParameter is again based on relative CESS, but this time we are interested in the degeneracy of a single iteration, i.e. we do not trace back until the previous resampling step (because the optimization over the annealing schedule can only impact the current iteration). As a corollary, the previous iteration's particles are not always equally weighted, hence the simplification in Equation (12) is not possible here and we use the full formula for relative CESS.

Algorithm 4 Procedure NextAnnealingParameter

1. Inputs: (a) Particle population from the previous SMC iteration (x_{r−1,·}, w_{r−1,·}); (b) Annealing parameter φ_{r−1} of the previous SMC iteration; (c) A degeneracy decay target α ∈ (0,1).
2. Outputs: automatic choice of the annealing parameter φ_r.
3. Initialize the function g assessing the particle population quality associated with a putative annealing parameter φ:

$$ g(\phi) = \mathrm{rCESS}\left( W_{r-1,\cdot},\; p(y \mid x_{r-1,\cdot})^{\phi-\phi_{r-1}} \right) = \frac{\left( \sum_{k=1}^{K} W_{r-1,k}\, p(y \mid x_{r-1,k})^{\phi-\phi_{r-1}} \right)^{2}}{\sum_{k=1}^{K} W_{r-1,k}\, p(y \mid x_{r-1,k})^{2(\phi-\phi_{r-1})}}. $$

4. if g(1) ≥ α then
5.     return φ_r = 1.
6. else
7.     return φ_r = φ* ∈ (φ_{r−1}, 1) such that g(φ*) = α, found via bisection.
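Algorithm 4 can be sketched as follows; this is a minimal illustration of ours, assuming the log-likelihood values log p(y | x_{r−1,k}) are cached from the previous iteration (function and variable names are our own):

```python
import math

def rcess(W, u):
    """Relative conditional ESS, Equation (11)."""
    return sum(w * x for w, x in zip(W, u)) ** 2 / sum(w * x * x for w, x in zip(W, u))

def next_annealing_parameter(W, log_lik, phi_prev, alpha, tol=1e-10):
    """Algorithm 4: return phi_r = 1 if g(1) >= alpha; otherwise
    bisect for phi* in (phi_prev, 1) such that g(phi*) = alpha.
    `log_lik` holds the cached values log p(y | x_{r-1,k})."""
    def g(phi):
        u = [math.exp((phi - phi_prev) * ll) for ll in log_lik]
        return rcess(W, u)

    if g(1.0) >= alpha:
        return 1.0
    lo, hi = phi_prev, 1.0  # invariant: g(lo) >= alpha > g(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) >= alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that only the cached log-likelihoods are re-exponentiated in each bisection step, which is why the search costs nothing in the number of sites or taxa.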

The parameter α used in Algorithm 4 encodes the decay in particle population quality that we are aiming for. Based on our experiments, we recommend values very close to one. For this reason, we reparameterize α as α = 1 − 10^{−β} and recommend a default


value of β = 5 as a reasonable starting point. Increasing β improves the approximation accuracy.

Computational Complexity

The computational complexity of annealed SMC is linear in both the number of intermediate distributions R and the number of particles K. Naively, the resampling step scales like O(K²), but a linear-time multinomial resampling algorithm is obtained by generating order statistics via normalization of a Poisson process (Devroye 1986, Section 2.1, p. 214). This technique is well known in the SMC literature (Doucet and Johansen 2011). Alternatively, one can use stratified or systematic resampling (Doucet and Johansen 2011), which provides a simple-to-implement linear-time resampling algorithm.

The memory consumption of annealed SMC is linear in K and constant in R.
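As an illustration of a linear-time scheme, systematic resampling (one of the alternatives cited above) can be sketched as follows; this is a textbook version of ours, not code from the paper:

```python
import random

def systematic_resample(weights, rng=random):
    """Systematic resampling in O(K): draw a single uniform offset
    u in [0, 1/K), then sweep the K evenly spaced points u + k/K
    through the cumulative normalized weights. Returns the list of
    K selected particle indices."""
    K = len(weights)
    total = sum(weights)
    u = rng.random() / K
    indices, i = [], 0
    cum = weights[0] / total
    for k in range(K):
        point = u + k / K
        while point >= cum and i < K - 1:
            i += 1
            cum += weights[i] / total
        indices.append(i)
    return indices
```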

REVIEW OF OTHER MARGINAL LIKELIHOOD ESTIMATION METHODS

For completeness, we review here some alternatives to Equation (10) for estimating marginal likelihoods, which we will compare to SMC from both a theoretical and empirical stand-point.

Stepping Stone

The Stepping Stone (SS) algorithm (Xie et al. 2010) is a method for marginal likelihood estimation. It is widely used via its MrBayes implementation (Huelsenbeck and Ronquist 2001). As with SMC, the SS method introduces a list of annealed posterior distributions connecting the posterior distribution and the prior distribution. We use a notation analogous to SMC, with {π_d}, d = 0, 1, ..., D, denoting the intermediate distributions, π_d(x) ∝ γ_d(x) = p(y|x)^{φ_d} p(x), 0 = φ_0 < φ_1 < φ_2 < ··· < φ_D = 1. The marginal likelihood Z can be written as

$$ Z \equiv Z_D = Z_0 \prod_{d=1}^{D} \frac{Z_d}{Z_{d-1}}. $$

We can rewrite the ratio of Z_d and Z_{d−1} as

$$ \frac{Z_d}{Z_{d-1}} = \int \frac{\gamma_d(x)}{\gamma_{d-1}(x)}\, \pi_{d-1}(x)\, dx. \qquad (14) $$

The SS method prescribes running several MCMC chains targeting π_{d−1}(x) to obtain N posterior samples x_{d−1,1}, x_{d−1,2}, ..., x_{d−1,N}, then

$$ \widehat{\frac{Z_d}{Z_{d-1}}} = \frac{1}{N} \sum_{i=1}^{N} \left\{ p(y \mid x_{d-1,i}) \right\}^{\phi_d - \phi_{d-1}}. \qquad (15) $$

The estimator of the marginal likelihood admits the form

$$ \hat{Z}_D = \prod_{d=1}^{D} \frac{1}{N} \sum_{i=1}^{N} \left\{ p(y \mid x_{d-1,i}) \right\}^{\phi_d - \phi_{d-1}}. $$

The number of intermediate distributions is a trade-off between computing cost and accuracy: a larger number of MCMC chains can provide a better approximation of the marginal likelihood, but the computational cost will be higher. To make a fair comparison between the marginal likelihood estimators provided by annealed SMC and SS, we set K_{SMC} R_{SMC} = N_{SS} D_{SS}. Another factor that impacts the SS estimator is the choice of the annealing parameter sequence {φ_d}, d = 1, 2, ..., D. In this paper, we use the annealing scheme φ_d = (d/D)^{1/a} recommended by Xie et al. (2010), where a is between 0.2 and 0.4.
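Given per-chain log-likelihood samples, the SS estimator above is easy to sketch (a log-space illustration of ours; names are not from the paper):

```python
import math

def stepping_stone_log_z(log_lik_samples, phis):
    """Stepping stone estimate of log Z. log_lik_samples[d] holds
    log p(y | x_{d,i}) for the N MCMC samples targeting pi_d;
    phis is the annealing schedule 0 = phi_0 < ... < phi_D = 1.
    Each factor is (1/N) sum_i p(y | x_{d-1,i})^{phi_d - phi_{d-1}},
    accumulated in log space via log-sum-exp."""
    log_z = 0.0
    for d in range(1, len(phis)):
        delta = phis[d] - phis[d - 1]
        terms = [delta * ll for ll in log_lik_samples[d - 1]]
        m = max(terms)
        log_z += m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(len(terms))
    return log_z
```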

Linked Importance Sampling

Stepping stone uses IS to approximate the ratio of marginal likelihoods for two intermediate distributions. However, the IS approximation would be poor if the two successive distributions do not have enough overlap. Linked IS (Neal 2005) improves the performance of IS by introducing bridge distributions, for example the “geometric” bridge: γ_{d−1*d}(x) = √(γ_{d−1}(x) γ_d(x)). More importantly, Linked Importance Sampling (LIS) provides an unbiased marginal likelihood estimator. The ratio of two marginal likelihoods can be written as

$$ \frac{Z_d}{Z_{d-1}} = \frac{Z_{d-1*d}}{Z_{d-1}} \Bigg/ \frac{Z_{d-1*d}}{Z_d} = \left\{ \int \frac{\gamma_{d-1*d}(x)}{\gamma_{d-1}(x)}\, \pi_{d-1}(x)\, dx \right\} \Bigg/ \left\{ \int \frac{\gamma_{d-1*d}(x)}{\gamma_d(x)}\, \pi_d(x)\, dx \right\}. $$

For d = 1, ..., D, to estimate the ratio Z_d/Z_{d−1}, we first run MCMC targeting π_{d−1}(x) to obtain N posterior samples x_{d−1,1}, x_{d−1,2}, ..., x_{d−1,N} (when d = 1, we sample from the prior distribution). Then we sample the initial state of π_d. Two successive MCMC chains π_{d−1}(x) and π_d(x) are linked by a state x_{d−1,μ_{d−1}}, where the index μ_{d−1} is sampled from {1, 2, ..., N} according to the following probabilities:

$$ p(\mu_{d-1} \mid x_{d-1,1:N}) = \frac{\gamma_{d-2*d-1}(x_{d-1,\mu_{d-1}})}{\gamma_{d-1}(x_{d-1,\mu_{d-1}})} \Bigg/ \sum_{i=1}^{N} \frac{\gamma_{d-2*d-1}(x_{d-1,i})}{\gamma_{d-1}(x_{d-1,i})}. $$

In case d = 1, the linked state μ_0 is uniformly sampled from the N samples of π_0(x). Finally, we run the MCMC chain π_d(x) starting from initial state x_{d−1,μ_{d−1}} to obtain N posterior samples x_{d,1}, x_{d,2}, ..., x_{d,N}. The ratio of two marginal likelihoods can be approximated by

$$ \widehat{\frac{Z_d}{Z_{d-1}}} = \widehat{\frac{Z_{d-1*d}}{Z_{d-1}}} \Bigg/ \widehat{\frac{Z_{d-1*d}}{Z_d}} = \left\{ \frac{1}{N} \sum_{i=1}^{N} \frac{\gamma_{d-1*d}(x_{d-1,i})}{\gamma_{d-1}(x_{d-1,i})} \right\} \Bigg/ \left\{ \frac{1}{N} \sum_{i=1}^{N} \frac{\gamma_{d-1*d}(x_{d,i})}{\gamma_d(x_{d,i})} \right\}. $$


In this article, we use the “geometric” bridge. Hence, the estimator of the ratio simplifies to

$$ \widehat{\frac{Z_d}{Z_{d-1}}} = \left\{ \sum_{i=1}^{N} \left\{ p(y \mid x_{d-1,i}) \right\}^{\frac{\phi_d - \phi_{d-1}}{2}} \right\} \Bigg/ \left\{ \sum_{i=1}^{N} \left\{ p(y \mid x_{d,i}) \right\}^{\frac{\phi_{d-1} - \phi_d}{2}} \right\}. $$

We refer to Appendix 6 for more background on the LIS algorithm.
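The geometric-bridge ratio above can be sketched as follows (an illustrative helper of ours, again working in log space; the γ ratios reduce to powers of the likelihood under the geometric bridge, which is why only log-likelihood samples are needed):

```python
import math

def _log_sum_exp_half(log_liks, delta):
    """log of sum_i exp((delta/2) * log p(y | x_i))."""
    terms = [0.5 * delta * ll for ll in log_liks]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def lis_log_ratio(log_lik_prev, log_lik_curr, phi_prev, phi_curr):
    """Log of the LIS geometric-bridge estimate of Z_d / Z_{d-1}:
    the numerator uses samples from pi_{d-1} with exponent
    (phi_d - phi_{d-1})/2, the denominator samples from pi_d with
    exponent (phi_{d-1} - phi_d)/2."""
    delta = phi_curr - phi_prev
    return _log_sum_exp_half(log_lik_prev, delta) - _log_sum_exp_half(log_lik_curr, -delta)
```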

THEORETICAL PROPERTIES

In this section, we review three theoretical properties of interest (consistency, unbiasedness of the marginal likelihood estimate, and asymptotic normality), with an emphasis on their respective practical importance.

Properties of Annealed SMC

In the context of SMC algorithms, the first property, consistency, means that as the number of particles is increased, the approximation of posterior expectations can become arbitrarily close to the true posterior expectation. This makes the approximation in Equation (7) more precise:

$$ \sum_{k=1}^{K} W_{r,k}\, f(x_{r,k}) \to \int \pi_r(x)\, f(x)\, dx \quad \text{as } K \to \infty, \qquad (16) $$

provided f satisfies regularity conditions, for example that f is bounded, and where convergence of the random variables holds for a set of random seeds having probability one. See, for example, Wang et al. (2015).

Consistency can be viewed as the “bare minimum” expected from modern SMC algorithms. A more informative class of results consists of central limit theorem equivalents of Equation (16). These results can be used to assess the total variance of Monte Carlo estimators (whereas measures such as the ESS described previously are local in nature); see Chan and Lai (2013). However, since numerically stable versions of these methods are still in their infancy (Olsson and Douc 2017), we will focus in the remainder on the third property, unbiasedness.

We say an estimator Ẑ of a constant Z is unbiased if E[Ẑ] = Z. Here the expectation is defined with respect to the randomness of the approximation algorithm. This contrasts with the classical statistical definition of unbiasedness, in which the randomness comes from the data generation process.

For SMC algorithms, unbiasedness holds in a more restrictive sense compared with consistency. In general:

$$ \mathbb{E}\left[ \sum_{k=1}^{K} W_{r,k}\, f(x_{r,k}) \right] \neq \int \pi_r(x)\, f(x)\, dx; \qquad (17) $$

in other words, repeatedly running SMC with a fixed number of particles but different random seeds and averaging the results does not provide arbitrarily precise approximations (the same negative result holds with MCMC). However, if we restrict our attention to marginal likelihood estimates, remarkably the unbiasedness property does hold (Del Moral et al. 2006); that is, for any finite K, Ẑ_K as defined in Equation (10) is such that:

$$ \mathbb{E}\left[ \hat{Z}_K \right] = Z = \int \gamma_R(x)\, dx. \qquad (18) $$

More details on the unbiasedness of the marginal likelihood SMC estimator and other theoretical properties of annealed SMC can be found in Appendix 2, in the Unbiasedness, Consistency and Central Limit Theorem subsection.

Although the notion of unbiasedness has been central to frequentist statistics since its inception, only in the past decade has it started to emerge as a property of central importance in the context of (computational) Bayesian statistics. Traditionally, the main theoretical property analyzed for a given Monte Carlo method Ẑ estimating Z was consistency.

With the emergence of pseudo-marginal methods, the bias of Monte Carlo methods is now under closer scrutiny. Pseudo-marginal methods are MCMC methods which replace probability factors in the Metropolis-Hastings ratio by positive unbiased estimators of these probabilities. For example, Andrieu et al. (2010) provide examples where global parameters of state-space models are sampled using an MCMC algorithm in which the probability of the data given the global parameters, marginally over the latent states, is estimated using an SMC algorithm. We refer the reader to Andrieu and Roberts (2009) for more examples where unbiasedness is used to compose MCMC algorithms in order to attack inference in complex models. In the context of phylogenetic inference, this is useful for Bayesian analysis of intractable evolutionary models; see, for example, Hajiaghayi et al. (2014).

Another area where unbiasedness can play a role is in checking the correctness of Monte Carlo procedures. In contrast to correctness checks based on consistency, such as Geweke (2004), which are asymptotic in nature and hence necessarily have false positive rates (i.e., cases where the test indicates the presence of a bug when in fact the code is correct), checks based on unbiasedness can achieve a false positive rate of zero, using the strategy described in the next section.

Using Unbiasedness to Test Implementation Correctness

Typically, the algorithm shown in Algorithm 2 is implemented in a model-agnostic fashion. Hence it is reasonable to assume that we can construct test cases on discrete state spaces. For example, one can use phylogenetic trees with fixed branch lengths, or even simpler models such as hidden Markov models (HMMs). Furthermore, we conjecture that many software defects can be detected in relatively small examples, where


exhaustive enumeration is possible, and hence Z can be computed exactly. We can determine the sufficient complexity of the examples to use via code coverage tools (Miller and Maloney 1963).

We would like to test whether the equality of Equation (18) holds for a given implementation. The right-hand side can be computed easily because we assume the example considered is small. To compute analytically the expectation on the left-hand side, we use a method borrowing ideas from probabilistic programming (Wingate et al. 2011), and use an algorithm, called ExhaustiveRandom, that automatically visits all possible execution traces ω_i of a given randomized algorithm. The execution trace of a randomized algorithm refers to a realization of all random choices in the algorithm (in the context of SMC, both the resampling steps and the proposal steps). ExhaustiveRandom enumerates all the execution traces while also computing the respective probability p_i of each trace. This is done by performing a depth-first traversal of the decision tree corresponding to the randomized algorithm being tested. The number of execution traces grows exponentially fast, but this is still a useful tool, as very small examples are generally sufficient to reach code coverage.

For each execution trace ω_i, we can also obtain the normalization estimate ẑ_i corresponding to that trace, and hence get the value of the left-hand side of Equation (18) as ∑_i p_i ẑ_i. We used this check via an open source implementation of ExhaustiveRandom (https://github.com/alexandrebouchard/bayonet/blob/1b9772e91cf2fb14a91f2e5e282fcf4ded61ee22/src/main/java/bayonet/distributions/ExhaustiveDebugRandom.java) to ensure that our software satisfies the unbiasedness property. See the numerical simulation section for details.
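The cited implementation is in Java; the sketch below is our own Python illustration of the same idea, replaying the randomized algorithm once per trace by forcing a prefix of its discrete choices and defaulting to branch 0 afterwards (a depth-first traversal of the decision tree):

```python
def enumerate_traces(algorithm):
    """Enumerate all execution traces of `algorithm`, which must draw
    all of its randomness through the provided choose(probs) callback
    (returning an index into probs). Sibling branches past the forced
    prefix are pushed onto a stack. Returns a list of
    (trace_probability, return_value) pairs, one per trace."""
    results, stack = [], [[]]
    while stack:
        prefix = stack.pop()
        drawn, widths, prob = [], [], [1.0]

        def choose(probs):
            # Force the recorded prefix; take branch 0 beyond it.
            k = prefix[len(drawn)] if len(drawn) < len(prefix) else 0
            drawn.append(k)
            widths.append(len(probs))
            prob[0] *= probs[k]
            return k

        value = algorithm(choose)
        results.append((prob[0], value))
        # Queue every unexplored sibling branch past the prefix.
        for j in range(len(prefix), len(drawn)):
            for k in range(1, widths[j]):
                stack.append(drawn[:j] + [k])
    return results
```

Checking unbiasedness then amounts to comparing ∑_i p_i ẑ_i, where ẑ_i is the estimate returned along trace ω_i, against the exactly computed Z.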

Properties of the SS Method

For the stepping stone method, the expected value of Equation (15) depends on the nature of the samples x_{d−1,1}, x_{d−1,2}, ..., x_{d−1,N}. If they are independent, the procedure is unbiased. However, if the samples are obtained from a Markov chain, there is no guarantee that the procedure is unbiased unless the MCMC chain is initialized at the exact stationary distribution. In practice, this is not possible: Xie et al. (2010) use a burned-in MCMC chain, which implies that the chain is asymptotically unbiased; however, for any finite number of iterations, a bias remains. Unfortunately, the two main motivations for unbiasedness (pseudo-marginal methods and the correctness checks described earlier) both require unbiasedness to hold for any finite number of Monte Carlo samples; asymptotic unbiasedness is not sufficient.

We show in the numerical simulation section an explicit counterexample where we compute the non-zero bias of the stepping stone method. This motivates the need for implementable unbiased methods, such as the annealed SMC method described in this work.

Comparison of Unbiased Marginal Likelihood Estimators

In Bayesian phylogenetics, the marginal likelihood estimate is generally a very small number. Instead of computing Ẑ directly, we compute the logarithm of the marginal likelihood estimate, log(Ẑ). For SMC and LIS, although Ẑ is an unbiased estimator, taking the logarithm of Ẑ introduces bias. Jensen's inequality shows that log(Ẑ) is a biased estimator of log(Z), which is generally underestimated:

$$ \mathbb{E}[\log(\hat{Z})] \le \log(\mathbb{E}(\hat{Z})) = \log Z. $$
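A tiny numeric illustration of this downward bias (our example, not from the paper): suppose an unbiased estimator returns 0.5 or 1.5 with equal probability, so E[Ẑ] = 1 and log Z = 0.

```python
import math

# Unbiased estimator Z_hat taking values 0.5 and 1.5, each with probability 1/2.
values, probs = [0.5, 1.5], [0.5, 0.5]

mean_z = sum(p * v for p, v in zip(probs, values))            # E[Z_hat] = 1.0
mean_log_z = sum(p * math.log(v) for p, v in zip(probs, values))

# Jensen's inequality: E[log Z_hat] < log E[Z_hat] = log Z = 0.
assert mean_z == 1.0
assert mean_log_z < math.log(mean_z)
```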

This provides a tool to compare the performance of an unbiased normalization constant estimation method m_1 to another one m_2. Suppose we run each method M times with different seeds and a fixed computational budget. Let L_i = ∑_{j=1}^{M} log Ẑ_{i,j} / M denote the average estimate of the log marginal likelihood for the i-th method, where Ẑ_{i,j} is the estimate with the j-th random seed using the i-th method. If m_2 is also unbiased, then for M large enough, both m_1 and m_2 underestimate log E[Ẑ], and the largest L_i is closest to log E[Ẑ], which determines the best performing method. If m_2 is not unbiased, then if L_1 > L_2 and M is large enough, we can conclude that m_1 is superior (but we cannot confidently order the methods if L_2 > L_1).

However, the Monte Carlo counterparts of these orderings should be considered with a pinch of salt, because the number of replicates M needed may be intractable in some cases.

SIMULATION STUDIES

Simulation Setup and Tree Distance

In order to simulate data sets, we first generated a set of random unrooted trees, including topology and branch lengths, as the reference trees. The tree topology was sampled from a uniform distribution. Each branch length was generated from an exponential distribution with rate 10.0.

Then, for each reference tree, we simulated DNA sequences using the K2P model with parameter κ = 2.0 (Kimura 1980). Although the main focus of this work is on marginal likelihood estimation, we also performed some benchmarking on the quality of the inferred trees. To do so, we used the majority-rule consensus tree (Felsenstein 1981) to summarize the weighted phylogenetic tree samples obtained from annealed SMC. We measured the distance between each estimated consensus tree and its associated reference tree using three types of distance metrics: the Robinson-Foulds (RF) metric based on sums of differences in branch lengths (Robinson and Foulds 1979), the Kuhner-Felsenstein (KF) metric (Kuhner and Felsenstein 1994), and the partition metric (PM), also known as the symmetric difference or topology-only RF metric (Robinson and Foulds 1981).



FIGURE 2. Graphical representation of (a) an HMM; (b) transitions between hidden states.

[Figure 3: three panels, nTemps = 10, 100, 1000; x-axis: number of particles or samples (100, 1000); y-axis: estimated log Z; methods: AnnealedSMC and SteppingStone.]

FIGURE 3. Estimates of the log marginal likelihood based on 100 independent random seeds for each configuration. The axis label “nTemps” refers to the number of equally spaced intermediate distributions (“inverse temperatures”). The black horizontal line shows the true value computed using the forward-backward algorithm.

Hidden Markov Models

As discussed when we introduced the unbiasedness correctness test, it is useful to perform some preliminary experiments on finite state models. We used an HMM with a finite latent state space (see the graphical representations of an HMM and hidden state transitions in Fig. 2). The variables X_t shown in the figure are unobserved and take on discrete values with a distribution depending on the previous variable X_{t−1}. For each unobserved variable, we define an observed variable Y_t, also discrete, with a conditional distribution depending on X_t. The latent state space in our experiment was set to {0,1,2,3,4}, and latent transitions were set uniformly across neighboring integers. The emissions we used take two possible values, with conditional probabilities given by (0.2,0.8), (0.1,0.9), (0.01,0.99), (0.2,0.8), and (0.3,0.7). The proposals were based on the Gibbs sampler on a single variable. The posterior distribution of interest is over the latent variables X_1, X_2, ... given the observations Y_1, Y_2, .... Of course, such a model would not normally be approached using approximate inference methods. Moreover, notice that this is a non-standard way of using SMC for a sequential model, in which we do not make use of the sequential structure of the model.

We first performed unbiasedness correctness tests on a chain of length two based on three equally spaced annealing parameters (0, 1/2, 1), and observations (0,1). We computed the true marginal likelihood, 0.345. Using the method described in the Theoretical Properties section, we computed the exact value of E[Z] by exhaustive enumeration of all execution traces for SMC and the SS method. For SMC with two particles, the ExhaustiveRandom algorithm enumerated 1,992,084 traces, resulting in an expectation of 0.34499999999999525. For SS with two MCMC iterations per annealing parameter, the ExhaustiveRandom algorithm enumerated 1,156,288 traces, resulting in an expectation of 0.33299145257312235. This supports that SMC is unbiased, and provides an explicit counterexample demonstrating the bias of the stepping stone method.

Second, we ran experiments on larger versions of the same model, a chain of length 32, as well as with more annealing steps and particles per step. In this regime, it is no longer possible to enumerate all the execution traces, so we averaged over 100 realizations of each algorithm instead. The true marginal likelihood can still be computed using a forward–backward algorithm. We show the results in Figure 3.
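The ground-truth marginal likelihood referred to above can be computed exactly with the forward algorithm. The sketch below transcribes the model description in Python; the initial distribution and the exact "neighbor" convention for the transition matrix are not fully specified in the text, so both are assumptions here:

```python
import numpy as np

n_states = 5

# Transition matrix: uniform over neighboring integers (our reading of the
# text; boundary states have a single neighbor).
A = np.zeros((n_states, n_states))
for i in range(n_states):
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n_states]
    A[i, nbrs] = 1.0 / len(nbrs)

# Emission probabilities of the two observable values, one row per latent state.
B = np.array([[0.2, 0.8], [0.1, 0.9], [0.01, 0.99], [0.2, 0.8], [0.3, 0.7]])

def forward_marginal_likelihood(obs, pi0=None):
    """Forward algorithm: p(y_1,...,y_T), summing out the latent chain."""
    if pi0 is None:
        pi0 = np.full(n_states, 1.0 / n_states)  # assumed uniform initial dist.
    alpha = pi0 * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
    return alpha.sum()

print(forward_marginal_likelihood([0, 1]))
```

Because the initial distribution and neighbor convention are assumptions, the printed value need not reproduce the 0.345 figure of the two-step experiment; the point is the structure of the exact computation.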

Comparison of Marginal Likelihood Estimates

In this section, we benchmark the marginal likelihood estimates provided by adaptive annealed SMC (ASMC), debiased adaptive annealed SMC (DASMC),



FIGURE 4. Marginal likelihood (in log scale) estimates for different numbers of taxa (5, 10, 15, 20, and 25) with a fixed computational budget, for the methods ASMC, DASMC, LIS, and SS.

TABLE 1. Number of annealing parameters (R_SMC) for ASMC with K = 1000 and α = 5

#taxa    5      10     15     20     25
R_SMC    1932   3741   5142   6047   7219

deterministic annealed SMC (DSMC), LIS, and SS. In DASMC, the annealing scheme was determined before running annealed SMC, using the same annealing parameters obtained from the ASMC. In DSMC, we used the annealing scheme φ_r = (r/R)^3 with a predetermined R.

In the first experiment, we focus on evaluating the marginal likelihood estimates using ASMC, DASMC, LIS, and SS with the same computing budget. We simulated unrooted trees of varying sizes (numbers of taxa): 5, 10, 15, 20, and 25. For each tree, we generated one data set of DNA sequences. Sequence length was set to 100. The execution of each algorithm and setting was repeated 100 times with different random seeds. We used α = 5 for adaptive annealed SMC, and the number of particles was set to 1000. In stepping stone and linked IS, we set the total number of heated chains D to 50, and the annealing scheme was set to φ_d = (d/D)^3, where d = 1, 2, ..., D. We enforced K_SMC · R_SMC = N_SS · D_SS = N_LIS · D_LIS in order to make the comparisons fair. Information about R_SMC is shown in Table 1.

Figure 4 shows the comparison of the performance of the four algorithms in terms of the marginal likelihood in log scale as the number of taxa increases. As described in the Theoretical Properties section, for the unbiased estimators, the expected log of the estimate underestimates the log marginal likelihood, by Jensen's inequality. The results support that ASMC and DASMC achieve more accurate marginal likelihood estimates compared with SS and LIS at the same computational cost. The performances of the two SMC algorithms are quite similar, whereas the marginal likelihood estimates provided by LIS and SS are close to each other. In Appendix 9, we describe an experiment comparing ASMC, DASMC, LIS, and SS with a very large value of K. The means of the log marginal likelihood for the four methods are close (a reduction of the gap is expected, because all methods are consistent), whereas ASMC and DASMC still exhibit smaller variance across seeds compared with LIS and SS.
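The Jensen's inequality point above is easy to verify numerically: for a noisy unbiased estimator of Z, the expected log of the estimator falls below log Z. The following self-contained illustration uses a synthetic lognormal estimator; the value of Z and the noise level are arbitrary choices for the demonstration, not values from this article:

```python
import math
import random

random.seed(1)
Z = 2.0          # hypothetical true marginal likelihood
sigma = 0.5      # noise level of the synthetic estimator

# Lognormal with E[Z_hat] = Z: log Z_hat ~ Normal(log Z - sigma^2/2, sigma^2).
draws = [math.exp(random.gauss(math.log(Z) - sigma**2 / 2, sigma))
         for _ in range(200_000)]

mean_Z_hat = sum(draws) / len(draws)                    # close to Z (unbiased)
mean_log = sum(math.log(z) for z in draws) / len(draws) # below log Z (Jensen)

print(mean_Z_hat, mean_log, math.log(Z))
```

The averaged estimates sit near Z = 2, while the averaged log estimates sit visibly below log 2, matching the systematic downward shift seen for the biased-on-the-log-scale curves in Figure 4.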

Another experiment was conducted to measure the variability of the marginal likelihood estimates from each algorithm, by comparing the coefficients of variation (CV) for different numbers of taxa with the same setting. The CV is defined as CV = sd(Z)/E(Z). We simulated 70 trees of increasing size (10, 15, 20, 25, 30, 35, 40 taxa; 10 trees of each size), and created 10 data sets for each tree. For each data set, we repeated each algorithm 10 times with different random seeds. The upper bound of the CV equals √(n−1), where n represents the number of repeats with different random seeds; we refer to Appendix 7 for the derivation of this bound. In our setting, the upper bound is √(10−1) = 3. In ASMC, the computational cost was fixed at K = 1000 and α = 5. In DSMC, we used the same number of particles, and the annealing scheme was set to φ_r = (r/R)^3, where the total number of annealing parameters R was fixed to be the one obtained from running ASMC with K = 1000 and α = 5 for a tree with 10 taxa.
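The √(n−1) bound quoted above holds when the CV is computed with the population standard deviation over the n repeats, and it is attained when one estimate dominates all the others. A quick illustrative check (the second input vector is arbitrary):

```python
import math

def cv(xs):
    """Coefficient of variation using the population standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sd / m

n = 10
extreme = [1.0] + [0.0] * (n - 1)   # one dominant estimate, rest negligible
print(cv(extreme))                   # attains the bound sqrt(n - 1) = 3
print(cv([0.9, 1.1, 1.0, 1.05, 0.95, 1.0, 1.02, 0.98, 1.01, 0.99]))
```

A CV near 3 with n = 10 repeats therefore signals that essentially a single seed carries all the estimator mass, which is how the DSMC curve in Figure 5 should be read as the number of taxa grows.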

Figure 5 displays the CV for ASMC and DSMC as a function of the number of taxa. The error bars in the figure represent 95% confidence intervals. The CV of DSMC increases faster than that of ASMC once the number of taxa exceeds 15, and it gradually converges to the upper bound of the CV as the number of taxa reaches 35. The CV of ASMC increases more slowly as the number of taxa increases.

Comparison of Model Selection by Annealed SMC versus SS

In this section, we compare the performance of ASMC and SS on a Bayesian model selection task. We simulated 20 unrooted trees of 10 taxa using a uniform distribution for the tree topology and branch lengths generated from an exponential distribution with rate 10. A total of 60 data sets of DNA sequences of length 500 were generated using each of the simulated trees and the following



FIGURE 5. CV for the marginal likelihood estimates versus the number of taxa (10–40) for ASMC and DSMC with a fixed number of particles.

TABLE 2. Comparison of model selection by ASMC and SS based on the Bayes factor

                     Data generated from
Method   Model    JC69   K2P   GTR+Γ
ASMC     JC69     20     0     1
         K2P      0      20    1
         GTR      0      0     18
SS       JC69     20     0     4
         K2P      0      20    1
         GTR      0      0     15

three evolutionary models: JC69, K2P, and GTR+Γ. The parameter κ in the K2P model was set to 2.0. In the GTR+Γ model, a symmetric Dirichlet distribution with parameters (10,10,10,10) was used to generate the base frequencies, and a symmetric Dirichlet with parameters (10,10,10,10,10,10) was used to generate the GTR relative rates in the rate matrix. A discrete gamma distribution with 4 categories was used to model among-site rate heterogeneity, with the gamma shape parameter drawn from a Gamma distribution with parameters (2,3).
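The generating distributions described above can be sketched as follows. The helper name is ours, and the shape/scale parameterization of the Gamma(2,3) draw is an assumption, as the text does not specify it:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_gtr_gamma_params():
    """Draw one set of GTR+Gamma simulation parameters as described above."""
    base_freqs = rng.dirichlet([10.0] * 4)   # A, C, G, T base frequencies
    gtr_rates = rng.dirichlet([10.0] * 6)    # 6 relative exchange rates
    # Gamma(2, 3): shape/scale (rather than shape/rate) is our assumption.
    gamma_shape = rng.gamma(shape=2.0, scale=3.0)
    return base_freqs, gtr_rates, gamma_shape

freqs, rates, shape = draw_gtr_gamma_params()
print(freqs.sum(), rates.sum(), shape > 0)
```

Dirichlet draws automatically sum to one, which is what makes them natural priors for base frequencies and relative rates.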

For each data set, marginal likelihoods were estimated by ASMC and SS using the three evolutionary models JC69, K2P, and GTR, respectively. In ASMC, we used K = 1000 and α = 4. The total number of iterations in SS was set to the product of the number of particles and the number of iterations in ASMC. Table 2 shows the Bayesian model selection results. Both ASMC and SS choose the correct model for all 20 data sets generated from the JC69 and K2P models, respectively. For the data generated from GTR+Γ, ASMC chooses the closest model, GTR, 18 times out of 20, whereas SS only chooses GTR 15 times out of 20.
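The selection rule behind Table 2 amounts to choosing the model with the largest estimated log marginal likelihood; equivalently, the log Bayes factor between two models is the difference of their log marginal likelihoods. A minimal sketch with made-up numbers:

```python
# Hypothetical log marginal likelihood estimates for one data set
# (illustrative values only, not from the article).
log_ml = {"JC69": -1260.4, "K2P": -1255.1, "GTR": -1256.3}

# Log Bayes factor of K2P over JC69; a positive value favors K2P.
log_bf = log_ml["K2P"] - log_ml["JC69"]

# The model attaining the maximum is the one counted in Table 2.
selected = max(log_ml, key=log_ml.get)
print(selected, round(log_bf, 1))
```

Working with differences of log marginal likelihoods avoids underflow: the raw marginal likelihoods here would be on the order of exp(−1255), far below floating-point range.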

Comparison of Tree Distance Metrics

In this section, we compare the quality of reconstructed phylogenies using synthetic data. We simulated one unrooted tree, the reference tree, with 50 taxa, and then generated one data set of DNA sequences

TABLE 3. Comparison of tree distance metrics using ASMC and MCMC

Method   R          K     Metric                  Value
ASMC     54,876     100   ConsensusLogLL          −72,787.99
         54,876     100   BestSampledLogLL        −72,826.17
         54,876     100   PartitionMetric         0
         54,876     100   RobinsonFouldsMetric    0.70623
         54,876     100   KuhnerFelsenstein       0.00990
MCMC     1.0E+07          ConsensusLogLL          −72,833.82
         1.0E+07          PartitionMetric         0
         1.0E+07          RobinsonFouldsMetric    0.92031
         1.0E+07          KuhnerFelsenstein       0.03138
MCMC2    5.49E+06         ConsensusLogLL          −72,784.86
         5.49E+06         PartitionMetric         0
         5.49E+06         RobinsonFouldsMetric    0.73644
         5.49E+06         KuhnerFelsenstein       0.01066

of length 2000 from this tree. The ASMC was run with α = 6 and K = 100. The MCMC algorithm was initialized with a random tree from the prior distribution. To make a fair comparison, we set the number of MCMC iterations to be no less than K_SMC · R_SMC. We discarded 20% of the MCMC chain as "burn-in." Table 3 summarizes the iteration numbers, the log-likelihood of the consensus tree, and the tree distance metrics from running ASMC and MCMC. Although the computational cost of MCMC is set to about twice that of ASMC, the log-likelihood of the consensus tree from ASMC is much higher than that from MCMC. In addition, ASMC achieves much lower RF and KF distances to the reference tree. Further, to confirm that both ASMC and MCMC can converge to the same posterior distribution, MCMC was rerun with a better starting value, namely the consensus tree obtained after running ASMC. This run of MCMC is denoted as MCMC2 in Table 3. The computational cost of MCMC2 is set the same as that of the ASMC algorithm. This time MCMC achieved consensus tree log-likelihood and tree distance metrics similar to ASMC, which supports that the first MCMC run was indeed "trapped" in a subspace.

Influence of Number of Threads, α, and K

The runtime of the ASMC depends on the number of threads, as well as on the tuning parameters α and K. In this section, we focus on investigating the effects of these factors on ASMC. We simulated an unrooted tree with 30 taxa at the leaves, and then generated DNA sequences of length 1500.

Figure 6 displays the computing time versus the number of threads for an implementation of ASMC in which the proposal step is parallelized. The error bars represent the 95% confidence intervals based on 100 runs. We used K = 1000 and α = 2 for each number of threads. The results indicate that increasing the number of cores notably speeds up the ASMC algorithm.
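The proposal step parallelizes well because each particle is moved independently; synchronization is only needed for weight normalization and resampling. A structural sketch is given below: the `propose` kernel is a hypothetical stand-in for an MCMC tree move, and note that pure-Python threads only illustrate the structure, since real speedups require native threads or processes:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def propose(particle, seed):
    """Hypothetical stand-in for one MCMC-style move on one particle."""
    rng = random.Random(seed)
    return particle + rng.gauss(0.0, 0.1)  # e.g., a branch-length perturbation

def parallel_propose(particles, n_threads=4):
    # Each particle is propagated independently: no communication is needed
    # until the weights are normalized, so this step scales with the cores.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(propose, particles, range(len(particles))))

new_particles = parallel_propose([0.5] * 1000)
print(len(new_particles))
```

Passing an explicit per-particle seed keeps the runs reproducible even though the work is distributed across workers.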

In Table 4, we compare the performance of the ASMC algorithm as a function of K, with α fixed at 5. We chose



TABLE 4. Comparison of the adaptive SMC algorithm with different numbers of particles K and values of α

K (α = 5)   log(Z)                              α (K = 1000)   log(Z)
100         −28,288.5 (−28,283.9, −28,293.7)    3              −28,524.8 (−28,466.2, −28,641.0)
300         −28,283.5 (−28,281.1, −28,287.9)    4              −28,312.2 (−28,304.5, −28,328.8)
1000        −28,280.5 (−28,278.3, −28,283.5)    5              −28,280.5 (−28,278.3, −28,283.5)
3000        −28,279.3 (−28,278.1, −28,280.4)    5.3            −28,279.5 (−28,278.7, −28,280.5)

FIGURE 6. Computing time (in milliseconds) of ASMC versus the number of threads (1–4).

four different particle counts, K = 100, 300, 1000, 3000. The marginal likelihood estimates improve as K increases.

We also compared the performance of the ASMC algorithm as a function of α, with K = 1000. We selected four distinct values, α = 3, 4, 5, 5.3. As expected, the marginal likelihood estimates improve as α increases. The likelihoods of the consensus trees and the tree distance metrics from these two experiments are displayed in Appendix 8. In practice, a value of α close to 5 is recommended as the default.

Trade-Off between R and K

We conducted an experiment to investigate, for a given amount of computation, the relative importance of R and K in improving the quality of the posterior distribution inferred by annealed SMC. We used DSMC with a cubic annealing scheme. We selected values for the tuning parameter K (100, 300, 1000, 3000, 10,000), and for each value of K, a corresponding value of R such that the total computational cost K·R is fixed at 10^6. We simulated one unrooted tree of 15 taxa, and generated one data set of DNA sequences. Sequence length was set to 300. Figure 7 displays the marginal likelihood estimates and the KF metric provided by DSMC with different K values when the total computational budget (K·R) is fixed. These results indicate that for a given amount of computation, a relatively small K and a large R are optimal. However, K cannot be too small, as an extremely small K necessarily leads to a large Monte Carlo variance.

Analysis of Subsampling SMC

Subsampling SMC, detailed in Appendix 1, can be used to speed up the SMC algorithms at the cost of decreasing the accuracy of estimation. The idea is to divide the data, the sites of biological sequences in our case, into batches, and only use a subset of the data in the intermediate distributions. In this section, we evaluate the impact of the batch size (the number of sites of biological sequences in each batch), denoted bs, on the speed of the algorithm and on the posterior approximation.

In a first experiment, we analyzed the relative computational cost of subsampling SMC with respect to annealed SMC for different batch sizes. We simulated an unrooted tree with 10 taxa, and then generated DNA sequences of length 6000. The annealing parameter sequence φ_r, r = 0, 1, ..., R, was chosen by running adaptive ASMC using α = 4 and K = 100. The computational cost in this subsection is measured by the total number of sites involved in computing the unnormalized posterior and the weight update function. For example, in this simulation study, the total number of annealing parameters in adaptive ASMC is 2318, and the number of sites involved in each SMC iteration is 6000. Using the fact that the likelihood for particles evaluated at iteration r−1 can be reused to evaluate the weight update function at iteration r, the total cost for ASMC is 2318 × 6000 ≈ 1.39 × 10^7. Figure 8 displays the ratio of computational cost (subsampling/annealing) versus the batch size. The cost ratio increases slowly as the batch size increases from 1 to 100.

We investigated the performance of subsampling SMC with different batch sizes, bs = 1, 10, 100, 1000, 6000, in terms of phylogenetic tree inference. We used K = 100 and ran the subsampling SMC algorithm 10 times for each value of bs. The schedule φ_r used to compute the annealing parameter φ(s, φ_r) in subsampling SMC was obtained by running adaptive annealed SMC once using α = 4 and K = 100. Figure 9 displays the performance of the subsampling algorithm with different bs. As expected, there is a trade-off between the computational cost and the accuracy of most metrics: for all metrics except the partition metric, subsampling produces lower quality approximations at a lower cost. However, if the user only requires a reconstruction of the tree topology, the partition metric results provide an example where subsampling is advantageous.

Comparison of ASMC and Combinatorial SMC (CSMC)

We compared the performance of the annealed SMC and the combinatorial SMC (CSMC) algorithm (Wang et al. 2015) for three different kinds of trees: clock, relaxed clock, and nonclock. The clock trees were simulated by assuming that the waiting time between two coalescent



FIGURE 7. Performance of the deterministic SMC algorithm on a fixed computational budget (K·R = 10^6), showing the log marginal likelihood estimates and the Kuhner–Felsenstein metric for the five values K = 100, 300, 1000, 3000, 10,000, from left to right on the x-axis.

FIGURE 8. Ratio of computational cost (subsampling/annealing) versus the batch size (1, 10, 100, 1000, 6000).

events is exponentially distributed with rate 10. The relaxed clock trees were obtained by perturbing the branch lengths of clock trees. More specifically, we modified each branch of length l by adding to it a noise term randomly sampled from Unif(−0.3l, 0.3l). The nonclock trees were simulated with uniformly distributed tree topologies and exponentially distributed branch lengths with rate 10.

For each type of phylogenetic tree, we simulated 10 trees with 10 leaves. The JC69 evolutionary model was used to generate sequences of length 500. Three data sets were generated for each tree. We ran the annealed SMC, DASMC, and CSMC, respectively, for each data set three times with different random seeds. In the annealed SMC, α was set to 5 and K = 100; in CSMC, the number of particles was set to 100,000.

Figure 10 shows the boxplots of the log-likelihood of the consensus trees, three tree distance metrics from the true trees, and computing time (in milliseconds) obtained from running the three algorithms for clock trees (top), relaxed clock trees (middle), and nonclock trees (bottom). Note that we ran CSMC for a longer time to favor this reference method. CSMC performs well for clock trees and relaxed clock trees, whereas the annealed SMC works for all three types of trees and clearly outperforms CSMC for nonclock trees.

REAL DATA SETS

We analyzed two difficult real data sets from TreeBASE: M336 and M1809 in Table 1 of Lakner et al. (2008). M336 contains DNA sequences of length 1949 for 27 species. In M1809, there are 59 species and the length of each DNA sequence is 1824. We compared the marginal likelihood estimates, log-likelihood of the

FIGURE 9. Comparison of subsampling SMC algorithms with different batch sizes (1, 10, 100, 1000, and full), in terms of log Z, consensus tree log-likelihood, partition metric, Robinson–Foulds metric, and Kuhner–Felsenstein metric.



FIGURE 10. Comparison of adaptive SMC algorithms with CSMC for three types of simulated trees: clock, relaxed clock, and nonclock (from top to bottom).

consensus tree, and tree distance metrics provided by ASMC and MrBayes (with the default settings) with the same computational budget. The reference trees used to compute tree distances are based on at least 6 independent long MrBayes parallel tempering runs provided by Lakner et al. (2008). Convergence to the posterior in these "reference runs" was established with high confidence in that previous work. Note that the comparison handicaps ASMC, as the set of tree moves in MrBayes is a superset of those used in ASMC. The evolutionary model we consider in the real data analysis is the JC69 model.

Data set M336

We used K = 500 and α = 5.3 for the ASMC algorithm. The log marginal likelihood estimated from ASMC is −7103.73, which is higher than the log marginal likelihood provided by MrBayes using SS (−7114.04). Table 5 displays the log-likelihood of the consensus tree and the tree distance metrics provided by ASMC and MrBayes. In the table, R represents the number of annealing parameters in ASMC and the total number of MCMC iterations in MrBayes, respectively. The log-likelihood of the consensus tree estimated from ASMC is slightly lower than that of MrBayes. The RF and KF metrics



TABLE 5. Comparison of running ASMC and MrBayes for M336 from TreeBASE

Method    R          K     Metric                  Value
ASMC      15,706     500   ConsensusLogLL          −6892.16
          15,706     500   BestSampledLogLL        −6901.31
          15,706     500   PartitionMetric         0
          15,706     500   RobinsonFouldsMetric    0.01269
          15,706     500   KuhnerFelsenstein       5.55E-06
MrBayes   8.0E+06          ConsensusLogLL          −6889.52
          8.0E+06          PartitionMetric         0
          8.0E+06          RobinsonFouldsMetric    0.01832
          8.0E+06          KuhnerFelsenstein       2.25E-05

estimated from MrBayes are slightly higher than those from ASMC. The majority-rule consensus trees provided by ASMC and MrBayes are identical, and coincide with the reference tree. Figure 11 displays the estimated majority-rule consensus trees and the clade posterior probabilities provided by ASMC and MrBayes. Most clade posterior probabilities provided by ASMC and MrBayes are close. ASMC provides lower posterior support for some clades, which is consistent with the hypothesized superior tree exploration provided by ASMC on a fixed budget.

Data set M1809

We used K = 1000 and α = 5 for the ASMC algorithm. The log marginal likelihood estimated from ASMC is −37,542.25, whereas the one estimated by MrBayes using SS is −37,335.73. Table 6 displays the tree metrics provided by ASMC and MrBayes. The log-likelihood of the consensus tree provided by ASMC is higher than the one from MrBayes, and the PM, RF, and KF metrics estimated from ASMC are lower.

CONCLUSION AND DISCUSSION

The annealed SMC algorithm discussed in this article provides a simple but general framework for phylogenetic tree inference. Unlike previous SMC methods in phylogenetics, annealed SMC considers the same state space for all the intermediate distributions. As a consequence, many conventional Metropolis–Hastings tree moves used in the phylogenetic MCMC literature can be utilized as the basis of SMC proposal distributions. Because MCMC tree moves are available for a large class of trees, including nonclock as well as strict and relaxed clock models, the annealed SMC method is automatically applicable to a wide range of phylogenetic models. It should also be relatively easy to incorporate the proposed ASMC into existing phylogenetic software packages that implement MCMC algorithms, such as MrBayes, RevBayes, or BEAST.

The annealed SMC algorithm has two adaptive mechanisms, dynamic resampling and adaptive determination of annealing parameters, which make the algorithm efficient while requiring less tuning. Dynamic resampling based on the ESS is a common practice in the SMC literature. Adaptive determination of the annealing parameter sequence is a more recent practice (Del Moral et al. 2012). The annealing parameter sequence can be determined dynamically based on the CESS criterion. Because the particle weights of the current iteration only depend on the previous particles, there is negligible computational cost for finding annealing parameters.
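The adaptive determination of the next annealing parameter can be sketched as follows: for fixed weights, the relative CESS is a decreasing function of the proposed increment, so the largest increment meeting a target threshold can be found by bisection. This is our paraphrase of the scheme (following Del Moral et al. 2012); the target value 0.999, the tolerance, and the synthetic log-likelihoods are illustrative choices, not values from the article:

```python
import math
import random

def rel_cess(W, log_like, delta):
    """Relative conditional ESS in (0, 1] for annealing increment delta.

    W: normalized particle weights; log_like: per-particle log-likelihoods.
    rCESS = (sum_k W_k w_k)^2 / sum_k W_k w_k^2, with w_k = L_k^delta.
    """
    m = max(log_like)
    w = [math.exp(delta * (ll - m)) for ll in log_like]  # stabilized; ratio unchanged
    num = sum(Wk * wk for Wk, wk in zip(W, w)) ** 2
    den = sum(Wk * wk * wk for Wk, wk in zip(W, w))
    return num / den

def next_phi(W, log_like, phi_old, target=0.999, tol=1e-10):
    """Find the next annealing parameter by bisection on the increment."""
    if rel_cess(W, log_like, 1.0 - phi_old) >= target:
        return 1.0  # can jump straight to the posterior
    lo, hi = 0.0, 1.0 - phi_old
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rel_cess(W, log_like, mid) >= target:
            lo = mid
        else:
            hi = mid
    return phi_old + lo

random.seed(0)
K = 100
W = [1.0 / K] * K
log_like = [random.gauss(-50.0, 5.0) for _ in range(K)]  # synthetic values
phi1 = next_phi(W, log_like, 0.0)
print(phi1)
```

Note that the bisection only re-evaluates the already-computed per-particle log-likelihoods at different exponents, which is why the adaptation cost is negligible relative to a likelihood evaluation.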

The consistency of annealed SMC discussed in the theoretical results section holds when K goes to infinity. However, K cannot in practice be made arbitrarily large, as the memory requirements scale linearly in K. In contrast, increasing the number of intermediate distributions R (in our adaptive algorithm, by increasing α) does not increase memory consumption. We conjecture that consistency for large R but fixed K also holds, in the sense that the marginal distribution of each particle at the last iteration converges to the posterior distribution. We have explored the relative importance of K and R under fixed computational budgets using simulations. These results suggest that increasing R and K improves the approximation at different rates, with increasing R giving a bigger bang for the buck. Assuming that our conjecture on the convergence in R holds true, in the regime of very large R and fixed K, the particle population at the last iteration can be conceptualized as K independent Monte Carlo samples from the true posterior distribution (in particular, we conjecture that the weights will converge to a uniform distribution and hence to naive Monte Carlo based on independent exact samples). We remind the reader that the power of independent exact Monte Carlo is that the variance does not depend on the dimensionality of the problem. Hence, if R is sufficiently large, a lower bound for K can be obtained by selecting K large enough so that, for independent samples X_k with distribution π, the variance of the Monte Carlo average (1/K)∑_{k=1}^{K} f(X_k) is sufficiently small. Here is a concrete example: suppose we have a test function f of interest, for example an indicator function on a fixed clade, with unknown posterior support p = ∫ f(x)π(x)dx. We should take K large enough so that the Monte Carlo average will have a 95% Monte Carlo confidence interval with a width of no more than, say, min{p, 1−p}/10. For p ≤ 1/2, this yields K ≥ (p/10)^{−2} (z*)^2 Var_π f ≈ 384(1−p)/p, where z* ≈ 1.96 is the 95% critical value. For example, if the clade of interest is believed from a test run to be highly uncertain, p ≈ 1/2, then in the large R regime, at the very minimum K = 400 particles should be used. The value of K should also be sufficiently large to accommodate the number of parallel cores available, and to ensure that the adaptive annealing scheme is stable (i.e., that a further increase in K results in a qualitatively similar annealing schedule). See also Olsson and Douc (2017) for more sophisticated schemes for estimating the Monte Carlo variance of SMC algorithms.
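The back-of-the-envelope calculation above is simple enough to script: for an indicator test function, Var_π f = p(1−p), so the bound becomes K ≥ 100 (z*)^2 (1−p)/p. A small sketch:

```python
import math

def min_particles(p, z=1.96):
    """Smallest K so that z * sqrt(Var/K) <= p/10 for an indicator
    test function with posterior support p (take p <= 1/2)."""
    var = p * (1 - p)                  # variance of an indicator under pi
    return math.ceil((10 / p) ** 2 * z ** 2 * var)

print(min_particles(0.5))   # 385: the ~384 of the text, rounded up
print(min_particles(0.1))   # rarer clades require many more particles
```

The bound grows like (1−p)/p, so estimating the support of a low-probability clade accurately is far more particle-hungry than estimating a highly uncertain one.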

Importantly, annealed SMC provides an efficient way to estimate the marginal likelihood, which is still a challenging task in Bayesian phylogenetics. We have also reviewed other marginal likelihood estimation methods, including SS and LIS. Our annealed SMC algorithm



FIGURE 11. The majority-rule consensus trees for the M336 data set estimated by (a) ASMC and (b) MrBayes. The numbers on the trees represent the clade posterior probabilities (the number 100 is omitted).

TABLE 6. Comparison of running ASMC and MrBayes for M1809 from TreeBASE

Method    R          K      Metric                  Value
ASMC      17,639     1000   ConsensusLogLL          −36,972.513
          17,639     1000   BestSampledLogLL        −36,991.443
          17,639     1000   PartitionMetric         2.0
          17,639     1000   RobinsonFouldsMetric    0.13741
          17,639     1000   KuhnerFelsenstein       3.95E-4
MrBayes   1.76E+07          ConsensusLogLL          −36,996.13
          1.76E+07          PartitionMetric         16.0
          1.76E+07          RobinsonFouldsMetric    0.513285
          1.76E+07          KuhnerFelsenstein       0.01137

enjoys advantageous theoretical properties. The main property that justifies the use of the annealed SMC is the unbiasedness of its marginal likelihood estimate. In addition, the unbiasedness of the marginal likelihood estimate can be used to test the implementation correctness of the algorithm. Our simulation studies have shown that ASMC gives a marginal likelihood estimate similar to the one obtained from the ASMC with the same but deterministic annealing parameter sequence (debiased ASMC). With the same computing budget, ASMC has been demonstrated to produce more accurate estimates. Moreover, the ASMC algorithm requires less tuning than the other methods considered. Both LIS and SS need a predetermined annealing parameter sequence, which is often inconvenient to choose in practice. ASMC leads to a more stable estimate of the marginal likelihood compared with the other methods considered.


MCMC moves often come with tuning parameters. For example, proposal distributions typically have a bandwidth parameter that needs to be tuned (Roberts et al. 1997). To improve the performance of annealed SMC, it would be possible to use automatic tuning of proposal distributions within SMC algorithms, as proposed in Zhou et al. (2016).

A second future direction would be to investigate modifications in the specification of the sequence of intermediate distributions. For example, Fan et al. (2010) proposed an alternative to the prior distribution to replace π_0. The same choice could be used within our framework. In another direction, it may be possible to combine the construction of Dinh et al. (2017) with ours to handle online problems via standard moves: instead of integrating the new taxon with all of its sites un-annealed (which requires specialized proposals), it may be beneficial to anneal the newly introduced site.

In terms of empirical comparisons, it would be interesting to expand the set of metrics, models, and data sets used to compare the algorithms. For example, in addition to the tree distance metrics used in this article, the geodesic tree distance (Billera et al. 2001) is also an important metric for comparing phylogenetic trees. The GTP software (Owen and Provan 2011) allows easy calculation of the geodesic tree distance.

We have investigated the subsampling SMC algorithm for "tall data" phylogenetic problems. The annealing parameters φ(s, φ_r) of the subsampling SMC are derived from the annealing parameter sequence φ_r of the adaptive ASMC without subsampling. This choice was made for implementation convenience, and there is no reason why the two optimal sequences of distributions should coincide. To improve the algorithm's performance, one direction is therefore to design an adaptive scheme tailored to the subsampling version. One challenge is that pointwise evaluation of the adaptation function g(φ) is more expensive in the subsampling setup, with a cost that grows with φ. Bayesian optimization might be useful in this context. Another use of subsampling arises in situations where the sampling algorithm is not constructed using an accept–reject step. For example, conjugate Gibbs sampling on an augmented target distribution (Lartillot 2006) is used by PhyloBayes (Lartillot et al. 2009) to efficiently sample the evolutionary model parameters. It is not clear how the conjugate Gibbs sampling step can be modified to accommodate the annealed distribution in Equation (4) for φ < 1. On the other hand, conjugate sampling is directly applicable to intermediate distributions that consist of taking subsets of sites. This sequence of distributions could be used to handle conjugate Gibbs sampling not only in annealed SMC but also in the context of parallel tempering or any other method based on a sequence of measures. One last line of work is to combine control variates with annealed SMC to reduce the variance of the likelihood estimator, a general strategy that has been very successful in other subsampling work (Bardenet et al. 2017).

FUNDING

This research was supported by Discovery Grants of A. Bouchard-Côté and L. Wang from the Natural Sciences and Engineering Research Council of Canada and a Canadian Statistical Sciences Institute Collaborative Research Team Project. Research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada (www.computecanada.ca).

ACKNOWLEDGMENTS

We would like to thank the reviewers and editor for their constructive feedback, as well as Arnaud Doucet and Fredrik Ronquist for helpful discussions. Part of this work was done while visiting the Department of Statistics at Oxford.

APPENDIX 1

Construction of Intermediate Distributions for Subsampling SMC

Let us decompose the unnormalized posterior distribution as

γ_1(x) = p(x) ∏_{s=1}^{#S} p(y_s | x),

where x refers to the phylogenetic tree and evolutionary parameter of interest, s is an index for one batch of sites from a biological sequence, and #S represents the total number of batches. Each batch contains one or more sites of the biological sequence; we denote the number of sites in each batch by b_s.

Consider the annealing parameter sequence 0 = φ_0 < φ_1 < ··· < φ_R = 1. We define the sequence of intermediate distributions for subsampling as follows:

γ_{φ_r}(x) = p(x) ∏_{s=1}^{#S} p(y_s | x)^{φ(s, φ_r)},

where

φ(s, φ_r) = 1 if φ_r ≥ s/#S,
            0 if φ_r ≤ (s−1)/#S,
            #S·φ_r − (s−1) otherwise.

The subsampling SMC algorithm is a more general version of the annealed SMC algorithm. If we define #S = 1 in γ_1(x), then the sequence of intermediate distributions of subsampling SMC is exactly the same as the intermediate distributions of the annealed SMC; in this case, the computational cost of subsampling SMC is exactly the same as that of annealed SMC. The other extreme case is #S = n, in which case we incorporate the sites of the sequence one by one.
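The piecewise-linear exponent φ(s, φ_r) above can be sketched as a small helper (a minimal sketch; the name `batch_exponents` is ours, not the article's):

```python
def batch_exponents(phi, num_batches):
    """Annealing exponents phi(s, phi_r) for batches s = 1..num_batches.

    Early batches (s/num_batches <= phi) are fully included (exponent 1),
    later batches are excluded (exponent 0), and at most one batch in
    between receives a fractional exponent, interpolating linearly.
    """
    out = []
    for s in range(1, num_batches + 1):
        if phi >= s / num_batches:
            out.append(1.0)
        elif phi <= (s - 1) / num_batches:
            out.append(0.0)
        else:
            out.append(num_batches * phi - (s - 1))
    return out
```

At φ = 0 every batch is switched off (the prior), and at φ = 1 every batch contributes fully (the posterior), matching the two extreme cases discussed above.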

Downloaded from https://academic.oup.com/sysbio/article/69/1/155/5484767 by Simon Fraser University user on 28 March 2021


APPENDIX 2

Theoretical Foundations of Annealed SMC

In this section, we review the construction of Del Moral et al. (2006), which is the basis for our work. See also Wang et al. (2015) for a similar construction tailored to a phylogenetic setup.

The corresponding sequence of un-normalized distributions is denoted by {γ_r}_{r=1,...,R}. The annealed SMC can be obtained by defining an auxiliary sequence of distributions that admit the distribution of interest, π_r(x_r), as the marginal of the latest iteration:

π̃_r(x_{1:r}) = π_r(x_r) ∏_{j=1}^{r−1} L_j(x_{j+1}, x_j),

where L_j(x_{j+1}, x_j) is an auxiliary "backward" Markov kernel with ∫ L_j(x_{j+1}, x_j) dx_j = 1. We never sample from L_j; rather, its role is to allow us to derive weight updates that yield a valid SMC algorithm.

The idea is then to apply standard SMC (i.e., SMC for product spaces such as state-space models) to this auxiliary sequence of distributions, π̃_1, π̃_2, ..., π̃_R. The resulting sampler has a weight update given by

w(x_{r−1}, x_r) ∝ [γ̃_r(x_{1:r}) / γ̃_{r−1}(x_{1:r−1})] · [1 / K_r(x_{r−1}, x_r)]
              = [γ_r(x_r) L_{r−1}(x_r, x_{r−1})] / [γ_{r−1}(x_{r−1}) K_r(x_{r−1}, x_r)],

which is different from the one in a standard SMC. When K_r satisfies global balance with respect to π_r, a convenient backward Markov kernel that allows an easy evaluation of the importance weight is

L_{r−1}(x_r, x_{r−1}) = [π_r(x_{r−1}) K_r(x_{r−1}, x_r)] / π_r(x_r).

This choice is a properly normalized backward kernel, ∫ L_{r−1}(x_r, x_{r−1}) dx_{r−1} = 1: this follows from the assumption that K_r satisfies global balance with respect to π_r. With this backward kernel, the incremental importance weight becomes

w(x_{r−1}, x_r) = [γ_r(x_r) / γ_{r−1}(x_{r−1})] · [L_{r−1}(x_r, x_{r−1}) / K_r(x_{r−1}, x_r)]
               = [γ_r(x_r) / γ_{r−1}(x_{r−1})] · [π_r(x_{r−1}) K_r(x_{r−1}, x_r) / π_r(x_r)] · [1 / K_r(x_{r−1}, x_r)]
               = γ_r(x_{r−1}) / γ_{r−1}(x_{r−1}).
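These identities can be checked numerically on a toy discrete state space. The sketch below (our own illustration, not code from the article) builds a three-state Metropolis kernel K_r that satisfies global balance with respect to a target π_r, forms the backward kernel L_{r−1} from the formula above, and verifies that each row of L_{r−1} is properly normalized:

```python
pi = [0.5, 0.3, 0.2]          # toy target pi_r on a 3-point space
n = len(pi)

# Metropolis kernel K_r w.r.t. pi: uniform proposal over the other
# states, accepted with probability min(1, pi_j / pi_i).
K = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            K[i][j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    K[i][i] = 1.0 - sum(K[i])  # rejection mass stays put

# Global balance: sum_i pi[i] K[i][j] == pi[j] for every j.
for j in range(n):
    assert abs(sum(pi[i] * K[i][j] for i in range(n)) - pi[j]) < 1e-9

# Backward kernel L_{r-1}(x_r, x_{r-1}) = pi(x_{r-1}) K(x_{r-1}, x_r) / pi(x_r).
L = [[pi[i] * K[i][j] / pi[j] for i in range(n)] for j in range(n)]

# For each fixed x_r = j, L sums to one over x_{r-1}, as claimed.
for j in range(n):
    assert abs(sum(L[j]) - 1.0) < 1e-9
```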

General Estimates of Marginal Likelihood

In the Basic Annealed SMC Algorithm section, we describe the estimator for marginal likelihood in a simplified setting, that is, without adaptation. Here, we describe the marginal likelihood estimator in full generality.

Recall that we denote the marginal likelihood by Z for simplicity. With a slight abuse of notation, we use K_r(x_{r−1}, ·) to denote the proposal distribution for x_r in this section.

Let us start by rewriting the normalization constant of the first intermediate distribution as

Z_1 = ∫ [γ_1(x_1) / K_1(x_1)] K_1(x_1) dx_1 = ∫ w_1(x_1) K_1(x_1) dx_1,

where K_1(·) is the proposal distribution for x_1. Correspondingly, an estimate of Z_1 is

Ẑ_{1,K} = (1/K) ∑_{k=1}^{K} w_{1,k}.

Similarly, we can rewrite the ratio of the normalization constants of two intermediate distributions as

Z_r / Z_{r−1} = [∫ γ_r(x_r) dx_r] / Z_{r−1}
= ∫∫ [γ_r(x_r) L_{r−1}(x_r, x_{r−1}) / (γ_{r−1}(x_{r−1}) K_r(x_{r−1}, x_r))] π_{r−1}(x_{r−1}) K_r(x_{r−1}, x_r) dx_{r−1} dx_r
= ∫∫ w_r(x_{r−1}, x_r) π_{r−1}(x_{r−1}) K_r(x_{r−1}, x_r) dx_{r−1} dx_r.

Straightforwardly, an estimate of Z_r / Z_{r−1} is provided by

Ẑ_r / Ẑ_{r−1} = (1/K) ∑_{k=1}^{K} w_{r,k}.

Since the estimate of the marginal likelihood can be rewritten as

Z ≡ Z_R = Z_1 ∏_{r=2}^{R} Z_r / Z_{r−1},

an estimate of the marginal likelihood Z is

Ẑ_{R,K} = ∏_{r=1}^{R} [ (1/K) ∑_{k=1}^{K} w_{r,k} ] = ∏_{r=1}^{R} [ (1/K) ∑_{k=1}^{K} {p(y | x_{r−1,k})}^{φ_r − φ_{r−1}} ],   (19)

which can be readily obtained from an SMC algorithm. If resampling is not conducted at each iteration r, an


alternative form is provided by

Ẑ_{R,K} = ∏_{j=1}^{t_{R−1}+1} [ ∑_{k=1}^{K} W_{n_{j−1},k} ∏_{m=n_{j−1}+1}^{n_j} {p(y | x_{m−1,k})}^{φ_m − φ_{m−1}} ],   (20)

where n_j is the SMC iteration index at which we do the jth resampling, and t_{R−1} is the number of resampling steps between 1 and R−1.
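Equation (19) can be sketched directly from per-particle log-likelihoods (our illustration; `log_marginal_likelihood` is a hypothetical helper, and the product is accumulated in log space for numerical stability):

```python
import math

def log_marginal_likelihood(log_like, phis):
    """Log of the annealed-SMC estimate in Equation (19), assuming
    resampling at every iteration.

    log_like[r][k] holds log p(y | x_{r-1,k}) for SMC iteration r+1;
    phis is the sequence 0 = phi_0 < phi_1 < ... < phi_R = 1.
    """
    total = 0.0
    for r in range(1, len(phis)):
        delta = phis[r] - phis[r - 1]
        # log of (1/K) sum_k p(y | x_{r-1,k})^delta, via log-sum-exp
        terms = [delta * ll for ll in log_like[r - 1]]
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms) / len(terms))
    return total
```

With a constant log-likelihood c across particles and iterations, the annealing increments telescope and the estimate reduces to c, which gives a quick sanity check.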

Unbiasedness, Consistency, and Central Limit Theorem for Annealed SMC

Here, we provide more information on the theoretical properties discussed in the Properties of Annealed Sequential Monte Carlo section.

Theorem 1 (Unbiasedness): For fixed 0 = φ_0 < φ_1 < ··· < φ_R = 1, Ẑ_{R,K} is an unbiased estimate of Z:

E(Ẑ_{R,K}) = Z.

This result is well known in the literature, although many statements of the result are specialized to SMC for state-space models (Doucet and Johansen 2011). The results in Theorem 7.4.2 of Del Moral (2004) provide a very general set of conditions which includes the annealed SMC algorithm presented here. However, the theoretical framework in Del Moral (2004) being very general and abstract, we outline below an alternative line of argument to establish the unbiasedness of phylogenetic annealed SMC.

First, by the construction reviewed in the Theoretical Foundations of Annealed SMC section, we can transform the sequence of distributions on a fixed state space, π_r(x_r), into a sequence of augmented distributions π̃_r(x_{1:r}) on a product space admitting π_r(x_r) as a marginal. We now apply Theorem 2 of Andrieu et al. (2010), with the distribution π(x_{1:n}) in this reference set to π̃_r(x_{1:r}) in our notation. To be able to use Theorem 2, we only need to establish the "minimum assumptions" 1 and 2 in Andrieu et al. (2010). Assumption 1 is satisfied by the fact that valid MCMC proposals are guaranteed to be such that q(x, x′) > 0 ⇔ q(x′, x) > 0. Assumption 2 holds because we use multinomial resampling. Next, since the conditions of Theorem 2 of Andrieu et al. (2010) hold, we have the following result from the proof of Theorem 2 in Appendix B1 of Andrieu et al. (2010):

π̃^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}) / q^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}) = Ẑ^N(x_1, ..., x_P) / Z,

and hence

Z · π̃^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}) = Ẑ^N(x_1, ..., x_P) · q^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}).

Now taking the integral over all variables k, x_1, ..., x_P, a_1, ..., a_{P−1} with respect to the reference measure associated to π̃^N, we obtain:

Z ∫ π̃^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}) d(k, x_1, ..., x_P, a_1, ..., a_{P−1})
= ∫ Ẑ^N(x_1, ..., x_P) q^N(k, x_1, ..., x_P, a_1, ..., a_{P−1}) d(k, x_1, ..., x_P, a_1, ..., a_{P−1}).

The left-hand side is just Z because π̃^N is a density with respect to this reference measure. For the right-hand side, note that q^N is the law of the full set of states produced by the particle filter, hence the right-hand side is just E[Ẑ_{R,K}] in our notation. This concludes the proof.

Next, we discuss consistency results. In the SMC literature, they are generally available both in the L² convergence and almost sure convergence flavors. We cover the L² case here and refer to Del Moral et al. (2006) for almost sure consistency results.

Theorem 2 (Consistency): Assume there is a constant C such that |f| ≤ C and w_{r,k} ≤ C almost surely. For a fixed φ_r (r = 1, ..., R), the annealed SMC algorithm provides asymptotically consistent estimates:

∑_{k=1}^{K} W_{r,k} f(x_{r,k}) → ∫ π_r(x) f(x) dx as K → ∞,

where the convergence holds in the L² norm sense. The result can be deduced from Proposition 5 in Wang et al. (2015) as follows. Assumption 3 in Wang et al. (2015) holds because MCMC proposals satisfy q(x, x′) > 0 ⇔ q(x′, x) > 0. Assumption 4 holds because the support of the prior coincides with the support of the posterior.

Note that the assumption that the weights are bounded is not valid for general tree spaces. However, if the branch lengths are assumed to be bounded, then the space is compact and the assumption therefore holds in that setting.

Finally, we turn to the central limit theorem.

Theorem 3 (Central Limit Theorem): Under the integrability conditions given in Theorem 1 of Chopin (2004), or Del Moral (2004), Section 9.4, pages 300–306,

K^{1/2} [ ∑_{k=1}^{K} W_{r,k} f(x_{r,k}) − ∫ π_r(x) f(x) dx ] → N(0, σ_r²(f)) as K → ∞,

where the convergence is in distribution. The form of the asymptotic variance σ_r²(f) depends on the resampling scheme, the Markov kernel K_r, and the artificial backward kernel L_r. We refer readers to Del Moral et al. (2006) for details of this asymptotic variance.


APPENDIX 3

MCMC Proposals for Bayesian Phylogenetics

In this article, we used the proposals q_r^i defined as follows:

1. q_r^1: the multiplicative branch proposal. This proposal picks one edge at random and multiplies its current value by a random number distributed uniformly in [1/a, a] for some fixed parameter a > 1 (controlling how bold the move is) (Lakner et al. 2008).

2. q_r^2: the global multiplicative branch proposal, which proposes all the branch lengths by applying the above multiplicative branch proposal to each branch.

3. q_r^3: the stochastic NNI proposal. We consider the nearest neighbor interchange (NNI) (Jow et al. 2002) to propose a new tree topology.

4. q_r^4: the stochastic NNI proposal with resampling of the edge, which uses the above NNI proposal in (3) and the multiplicative branch proposal in (1) for the edge under consideration.

5. q_r^5: the subtree prune and regraft (SPR) move, which selects and removes a subtree from the main tree and reinserts it elsewhere on the main tree to create a new tree.

Note that here, we only describe the MCMC kernels for phylogenetic trees. To sample the evolutionary parameters θ, one can use simple proposals such as symmetric Gaussian distributions, or more complex ones; see for example Zhao et al. (2016).
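The first proposal can be sketched as follows, assuming the uniform multiplier m ∈ [1/a, a] described above. The returned log Hastings correction −log m is our own derivation for this uniform-in-m draw (by a change of variables on the proposal densities); note that the log-scale multiplier variant common in the literature instead has correction +log m:

```python
import math
import random

def multiplicative_branch_proposal(t, a=2.0, rng=random):
    """Propose t' = m * t with m ~ Uniform[1/a, a], for branch length t > 0.

    Returns (t_new, log_hastings), where log_hastings = -log(m) is the
    log proposal-density ratio q(t'->t)/q(t->t') for this move.
    """
    m = rng.uniform(1.0 / a, a)
    return t * m, -math.log(m)
```

The proposed value always lies in [t/a, a·t], so the reverse multiplier 1/m is again in [1/a, a] and the move is reversible.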

APPENDIX 4

Comparison of Resampling Strategies

In this section, we compare the performance of adaptive annealed SMC with three different resampling thresholds. The first resampling threshold is ε_1 = 0; in this case, the particles are never resampled. The second resampling threshold is ε_2 = 1, in which case resampling is triggered at every iteration. The third resampling threshold is ε_3 = 0.5. The resampling method we used in all our experiments was the multinomial resampling scheme. We simulated one unrooted tree of 15 taxa and generated one data set of DNA sequences of length 200. The tree simulation setup was the same as in the Simulation Studies section. We ran the adaptive annealed SMC algorithm 20 times with the three resampling thresholds described above. We used rCESS_r = 0.99999 and K = 100. Figure A.1 demonstrates the advantage of resampling triggered by a threshold of ε_3 = 0.5 over the other two choices by displaying the marginal likelihood estimates, the log-likelihood of the consensus tree, and the tree metrics provided by adaptive annealed SMC using the three different ε's. The log marginal likelihood estimate and log-likelihood of the consensus tree provided by adaptive annealed SMC using ε_3 = 0.5 are higher and admit smaller variation. The PF, RF, and KF metrics provided by adaptive annealed SMC using ε_3 = 0.5 are lowest. Therefore, the threshold 0.5 has been used in the rest of the article.
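The thresholded multinomial resampling rule compared above can be sketched as follows (our illustration; rESS here is the effective sample size of the normalized weights divided by K, so it lies in (0, 1]):

```python
import random

def ress(weights):
    """Relative effective sample size of (unnormalized) weights, in (0, 1]."""
    s = sum(weights)
    return 1.0 / (len(weights) * sum((w / s) ** 2 for w in weights))

def maybe_resample(particles, weights, threshold=0.5, rng=random):
    """Multinomial resampling, triggered only when rESS < threshold.

    After resampling, the weights are reset to uniform.
    """
    if ress(weights) >= threshold:
        return particles, weights
    k = len(particles)
    new_particles = rng.choices(particles, weights=weights, k=k)
    return new_particles, [1.0 / k] * k
```

A threshold of 0 never resamples, a threshold of 1 resamples at every iteration, and 0.5 sits between the two extremes, matching the three schemes compared in this appendix.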

APPENDIX 5

Review of Particle Degeneracy Measures

The two adaptive schemes in ASMC, adaptively conducting resampling and automatically constructing the annealing parameter sequence, rely on being able to assess the quality of a particle approximation. For completeness, we provide in this section some background on the classical notion of ESS and on CESS, a recent generalization which we use here (Zhou et al. 2016). The notion of ESS in the context of importance sampling (IS) or SMC is distinct from the notion of ESS in the context of MCMC. The two are related in the sense of expressing a variance inflation compared with an idealized Monte Carlo scheme, but they differ in the details. We will assume from now on that ESS refers to the SMC context.

We will also use a slight variation of the derivation of ESS and CESS where the measures obtained are normalized to be between zero and one (some hyperparameters of the adaptive algorithms are easier to express in this fashion). We use the terminology relative (conditional) ESS to avoid confusion.

The fundamental motivation of (relative and/or conditional) ESS stems from the analysis of the error of Monte Carlo estimators. Recall that for a given function of interest f (think for example of f being an indicator function on a clade),

I = ∫ π_r(dx) f(x) ≈ ∑_{k=1}^{K} W_{r,k} f(x_{r,k}) =: Î.

The quantity on the right-hand side is a random variable (with respect to the randomness of the SMC algorithm), Î, and we can think about it as an estimator of the deterministic quantity on the left-hand side, I. Moreover, the right-hand side is a real-valued random variable, so we can define its mean square error, which can be further decomposed as a variance term and a squared bias term. For SMC algorithms, the variance term dominates as the number of particles goes to infinity (Del Moral 2004). For this reason, we are interested in estimates of the variance of Î across SMC random seeds, Var_SMC[Î]. However, the variance of Î depends on the choice of the function f, which is problem dependent, and we would like to remove this dependency. The first step is to consider a notion of relative variance, comparing to the variance we would obtain from a basic Monte Carlo scheme Î* relying on iid exact samples x*_1, ..., x*_K ∼ π_r:

Var_MC[Î*] = Var[ (1/K) ∑_k f(x*_k) ].


FIGURE A.1. Comparison of three resampling thresholds (0, 0.5, and 1) in terms of the log marginal likelihood estimate (logZ), the log-likelihood of the consensus tree, and the partition, Robinson–Foulds, and Kuhner–Felsenstein tree metrics.

To make further progress, we will make approximations of the ratio Var_MC[Î*] / Var_SMC[Î].

To understand these approximations, let us start with a simplified version of Algorithm 3, where the function NextAnnealingParameter returns the value 1.0. In this setting, no resampling occurs, and the algorithm reduces to an IS algorithm (more specifically, it reduces to a single iteration of the AIS algorithm; Neal 2001). IS is easier to analyze because the individual particles are independent and identically distributed, allowing us to summarize the behavior based on one particle, say k = 1. If we assume further that (A1) γ_i = π_i, that is, the normalization constant is one, then a classical argument by Kong (1992) based on the Delta method yields

Var_MC[Î*] / Var_SMC[Î] = Var_MC[Î*] / Var_{π_0}[Î] ≈ 1 / (1 + Var_{π_0}[w_{1,1}]) = 1 / E_{π_0}[(w_{1,1})²],

where we used the fact that in this simple setting the distribution of one proposed particle is just π_0, so Var_SMC[·] = Var_{π_0}[·], and in the last step,

E_{π_0}[w_{1,1}] = ∫ π_0(x_{0,1}) [π_1(x_{0,1}) / π_0(x_{0,1})] dx_{0,1} = 1.

In general, assumption (A1) does not hold, that is, the normalization constant is not one, so for a general one-step AIS algorithm we get instead the approximation:

Var_MC[Î*] / Var_SMC[Î] ≈ ( E_{π_0}[ (π_1(x_{0,1}) / π_0(x_{0,1}))² ] )^{−1}
= ( E_{π_0}[ ( (γ_1(x_{0,1})/Z_1) / (γ_0(x_{0,1})/Z_0) )² ] )^{−1}
= (Z_1/Z_0)² / E_{π_0}[ ((γ_1/γ_0)(x_{0,1}))² ].

Generalizing the notation of this section to a general SMC setup, γ_1 here plays the role of the current iteration, and γ_0 that of the previous iteration. However, since π_0 is not known in this case, we plug in a particle approximation π̂_0 = ∑_{k=1}^{K} W_{0,k} δ_{x_{0,k}} to get:

Var_MC[Î*] / Var_SMC[Î] ≈ (Z_1/Z_0)² / E_{π̂_0}[ ((γ_1/γ_0)(x_{0,1}))² ]
= (Z_1/Z_0)² / ∑_{k=1}^{K} W_{0,k} ((γ_1/γ_0)(x_{0,k}))².

The effect of this additional approximation is that it makes our estimator over-optimistic, by ignoring the error of the approximation π̂_0 of π_0. It is nonetheless a useful tool to assess the degradation of performance over a small number of SMC iterations.

Finally, since the ratio of normalization constants is also unknown, we also need to estimate it. Based on a particle approximation of Equation (14), we obtain:

Var_MC[Î*] / Var_SMC[Î] ≈ ( ∑_{k=1}^{K} W_{0,k} (γ_1/γ_0)(x_{0,k}) )² / ∑_{k=1}^{K} W_{0,k} ((γ_1/γ_0)(x_{0,k}))².

This quantity is called the relative CESS (rCESS), Equation (11). Having a high rCESS value is a necessary but not sufficient condition for a good SMC approximation. If it is low during some SMC iteration, especially an iteration close to the final iteration, then with high probability most of the particles will have very small or zero weights, which will lead to a collapse of the quality of the annealed SMC algorithm.
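The final expression can be sketched as a small helper (our illustration; `ratios[k]` stands for (γ_1/γ_0)(x_{0,k})):

```python
def rcess(W_prev, ratios):
    """Relative conditional ESS:

        (sum_k W_k v_k)^2 / sum_k W_k v_k^2,

    with W the normalized weights from the previous iteration and
    v_k the unnormalized density ratios (gamma_1/gamma_0)(x_{0,k}).
    """
    num = sum(w * v for w, v in zip(W_prev, ratios)) ** 2
    den = sum(w * v * v for w, v in zip(W_prev, ratios))
    return num / den
```

By Jensen's inequality the value is at most 1, with equality exactly when the ratios are constant across particles, i.e., when the next annealed target is indistinguishable from the current one.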

Comparison of rESS and rCESS

In earlier work on adaptive SMC methods, the function NextAnnealingParameter was implemented using a different criterion based on rESS instead of rCESS. Later, Zhou et al. (2016) argued that rCESS was more appropriate. Here, we confirm that this is also the case in a phylogenetic context. We provide two experiments. In the first experiment, we simulated one unrooted tree of 10 taxa, and generated one data set of DNA sequences


FIGURE A.2. Comparison of rCESS (left) and rESS (right) in terms of rESS as a function of r (top) and φ_r − φ_{r−1} as a function of φ_r (bottom).

and each sequence has length 100. The setup of the tree simulation is the same as in the Simulation Studies section. We ran the adaptive annealed SMC algorithm in two schemes: (1) rCESS_r = 0.99; (2) rESS_r = 0.99. The one based on rESS_r only differs in the way NextAnnealingParameter is implemented, as shown in Algorithm 5. We used K = 1000 particles. Resampling of particles was triggered when rESS < 0.5. Figure A.2 demonstrates the advantage of using rCESS over rESS in adaptive annealed SMC. The annealing parameter difference (φ_r − φ_{r−1}) increases smoothly in the rCESS scheme, whereas in the rESS scheme there are big gaps in the annealing parameter increment after resampling, and then the consecutive annealing parameter change decreases gradually until the next resampling time. The number of iterations R for adaptive annealed SMC using rESS is much larger than using rCESS.

In our second experiment, we compared the performance of adaptive annealed SMC using rCESS and rESS in terms of tree metrics and marginal likelihood. We simulated one unrooted tree of 15 taxa and generated one data set of DNA sequences. Each sequence has length 200. The tree simulation setup is the same as in the Simulation Studies section. We ran the adaptive annealed SMC algorithm 20 times with rCESS_r = 0.999 and rESS_r = 0.978, respectively. The number of particles was set to K = 500. Under this setting, the computational costs of the two schemes are quite similar. The numbers of annealing parameters (R) selected via rCESS and

Algorithm 5 Alternative NextAnnealingParameter procedure (suboptimal)

1. Inputs: (a) Particle population from previous SMC iteration (x_{r−1,·}, w_{r−1,·}); (b) Annealing parameter φ_{r−1} of previous SMC iteration; (c) A degeneracy decay target α ∈ (0, 1).

2. Outputs: automatic choice of annealing parameter φ_r.

3. Initialize the function g assessing the particle population quality associated to a putative annealing parameter φ:

g(φ) = ( ∑_{k=1}^{K} W_{r−1,k} p(y | x_{r−1,k})^{φ−φ_{r−1}} )² / ∑_{k=1}^{K} ( W_{r−1,k} p(y | x_{r−1,k})^{φ−φ_{r−1}} )².

4. if g(1) ≥ α·g(φ_{r−1}) then
5.   return φ_r = 1.
6. else
7.   return φ_r = φ* ∈ (φ_{r−1}, 1) such that g(φ*) = α·g(φ_{r−1}), via bisection.
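Algorithm 5 can be sketched as follows (our illustration, using per-particle log-likelihoods for numerical stability; the bisection assumes g is monotone on (φ_{r−1}, 1)):

```python
import math

def next_annealing_parameter(log_like, W_prev, phi_prev, alpha, iters=100):
    """Bisection search for the next annealing parameter.

    log_like[k] = log p(y | x_{r-1,k}); W_prev are normalized weights;
    alpha in (0, 1) is the degeneracy decay target.
    """
    def g(phi):
        a = [W_prev[k] * math.exp((phi - phi_prev) * log_like[k])
             for k in range(len(W_prev))]
        return sum(a) ** 2 / sum(x * x for x in a)

    target = alpha * g(phi_prev)
    if g(1.0) >= target:          # degeneracy acceptable: jump to the posterior
        return 1.0
    lo, hi = phi_prev, 1.0        # g(lo) >= target > g(hi): bisect
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) >= target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When all particles share the same log-likelihood, g is constant in φ and the procedure returns 1.0 in a single step; widely dispersed log-likelihoods instead force a small annealing increment.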

rESS are similar. Figure A.3 displays the marginal likelihood estimates, consensus likelihood, and tree metrics provided by adaptive SMC using rCESS and


FIGURE A.3. Comparison of adaptive annealed SMC using rCESS (0.999) and rESS (0.978) in terms of estimating the log marginal likelihood, the log-likelihood of the consensus tree, tree distance metrics, and the number of SMC iterations (R).

rESS. The log marginal likelihood estimates and log consensus likelihoods provided by adaptive SMC using rCESS are higher and have lower variability. The PF, RF, and KF metrics provided by the two schemes are quite close, whereas the metrics provided by the rCESS scheme have lower variability.

APPENDIX 6

Estimates of Marginal Likelihood from LIS

We describe the LIS procedure as follows:

1. Sample an index v_0 randomly from {1, 2, ..., N}, and sample x_{0,v_0} ∼ π_0(·).

2. For d = 0, 1, ..., D, sample N states from π_d as follows:

(a) If d > 0: sample an index v_d from {1, 2, ..., N}, and set x_{d,v_d} = x_{(d−1)∗d}.

(b) For k = v_d + 1, ..., N, sample x_{d,k} from the forward kernel, x_{d,k} ∼ K_d(x_{d,k−1}, ·).

(c) For k = v_d − 1, ..., 1, sample x_{d,k} from the backward kernel, x_{d,k} ∼ L_d(x_{d,k+1}, ·).

(d) If d < D, sample an index μ_d from {1, 2, ..., N} according to the following probabilities:

p(μ_d | x_d) = [γ_{d∗(d+1)}(x_{d,μ_d}) / γ_d(x_{d,μ_d})] / ∑_{k=1}^{N} [γ_{d∗(d+1)}(x_{d,k}) / γ_d(x_{d,k})],

and set x_{d∗(d+1)} to x_{d,μ_d}.

3. Compute the likelihood estimate

Ẑ_LIS = ∏_{d=1}^{D} [ (1/N) ∑_{k=1}^{N} γ_{(d−1)∗d}(x_{d−1,k}) / γ_{d−1}(x_{d−1,k}) ] / [ (1/N) ∑_{k=1}^{N} γ_{(d−1)∗d}(x_{d,k}) / γ_d(x_{d,k}) ].

Note that if the MCMC kernel is reversible, then the forward kernel is the same as the backward kernel. In this article, we use the MCMC kernel as both the backward and forward kernel in LIS.
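Step 3 reduces to a product of ratios of sample means, which can be sketched as follows (our illustration; the two arguments hold the precomputed density ratios appearing in the numerator and denominator for each level d):

```python
def z_lis(bridge_over_prev, bridge_over_curr):
    """LIS likelihood estimate from precomputed density ratios (step 3).

    bridge_over_prev[d][k] = gamma_{(d-1)*d}(x_{d-1,k}) / gamma_{d-1}(x_{d-1,k}),
    bridge_over_curr[d][k] = gamma_{(d-1)*d}(x_{d,k})   / gamma_d(x_{d,k}),
    for d = 1..D (stored as 0-indexed lists of per-state ratios).
    """
    z = 1.0
    for num, den in zip(bridge_over_prev, bridge_over_curr):
        z *= (sum(num) / len(num)) / (sum(den) / len(den))
    return z
```

Each factor is a ratio of two importance-sampling averages targeting the same bridge distribution γ_{(d−1)∗d}, one from the level-(d−1) states and one from the level-d states.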

APPENDIX 7

Derivation of Upper Bound of CV

CV = sd(Ẑ) / E(Ẑ)
= √[ (1/n) ∑_{i=1}^{n} (Z_i − (1/n) ∑_{i=1}^{n} Z_i)² ] / [ (1/n) ∑_{i=1}^{n} Z_i ]
= √n · √[ ∑_{i=1}^{n} ( Z_i / ∑_{i=1}^{n} Z_i − 1/n )² ].

For non-negative Z_i, the CV is maximized when Z_i / ∑_{i=1}^{n} Z_i = 1 for some i, and 0 for the rest. In this extreme case, one Z_i is much larger than the rest. The upper bound of the CV then simplifies to √(n−1).
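A quick numerical check of this bound (our illustration): the empirical CV of the extreme configuration attains √(n−1), and other non-negative configurations stay below it.

```python
import math

def coefficient_of_variation(zs):
    """Empirical CV: population standard deviation divided by the mean."""
    n = len(zs)
    mean = sum(zs) / n
    sd = math.sqrt(sum((z - mean) ** 2 for z in zs) / n)
    return sd / mean

# Extreme case: one estimate dominates, so CV equals sqrt(n - 1).
extreme = [1.0, 0.0, 0.0, 0.0, 0.0]
assert abs(coefficient_of_variation(extreme) - math.sqrt(len(extreme) - 1)) < 1e-9
```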

APPENDIX 8

Tuning of the Annealing Schedule Parameter and K

In Figure A.4, we compare the performance of the ASMC algorithm as a function of K, with the annealing schedule parameter fixed at 5. We used four different numbers of particles, K = 100, 300, 1000, 3000. Both the marginal likelihood estimate and the tree metrics improve as K increases. Figure A.5 displays the performance of the ASMC algorithm as a function of the annealing schedule parameter, with K = 1000; here, R is the total number of SMC iterations. We used five distinct values: 3, 4, 4.3, 5, and 5.3. The marginal likelihood estimates and tree metrics improve as this parameter increases; they tend to be stable after it reaches 5. A larger value of this parameter can improve the performance of ASMC more significantly than an increase in K.


FIGURE A.4. Comparison of the adaptive SMC algorithm with different numbers of particles, from left to right K = 100, 300, 1000, 3000.

FIGURE A.5. Comparison of the adaptive SMC algorithm with different values of the annealing schedule parameter, from left to right 3, 4, 4.3, 5, 5.3. Here, R is the total number of SMC iterations.

APPENDIX 9

Comparison of ASMC, DASMC, LIS, and SS for Large K

In this experiment, we focus on evaluating the marginal likelihood estimates using ASMC, DASMC, LIS, and SS with a shared, large computational budget. We simulated an unrooted tree of 4 taxa and generated one data set of DNA sequences of length 10. Every algorithm was repeated 50 times on the data set with different random seeds. We set the annealing schedule parameter to 2 and K = 200,000. The setup of DASMC, LIS, and SS is the same as in the Comparison of Marginal Likelihood Estimates section.

Figure A.6 shows the comparison of the performance of the four algorithms in terms of the marginal likelihoods on the log scale. The mean log marginal likelihood estimates provided by ASMC, DASMC, LIS, and SS are quite close. The variance of the estimates for ASMC and DASMC is smaller than for LIS and SS.

REFERENCES

Altekar G., Dwarkadas S., Huelsenbeck J., Ronquist F. 2004. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20:407–415.

Andrieu C., de Freitas N., Doucet A., Jordan M.I. 2003. An introduction to MCMC for machine learning. Mach. Learn. 50:5–43.

FIGURE A.6. Comparison of marginal likelihood (in log scale) provided by ASMC, DASMC, LIS, and SS with a fixed computational budget when K = 200,000.

Andrieu C., Doucet A., Holenstein R. 2010. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Series B Stat. Methodol. 72:269–342.

Andrieu C., Roberts G.O. 2009. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat. 37:697–725.

Atchadé Y.F., Roberts G.O., Rosenthal J.S. 2011. Towards optimal scaling of metropolis-coupled Markov chain Monte Carlo. Stat. Comput. 21:555–568.


Bardenet R., Doucet A., Holmes C. 2017. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 18:1515–1557.

Billera L.J., Holmes S.P., Vogtmann K. 2001. Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27:733–767.

Bouchard-Côté A., Sankararaman S., Jordan M.I. 2012. Phylogenetic inference via sequential Monte Carlo. Syst. Biol. 61:579–593.

Chan H.P., Lai T.L. 2013. A general theory of particle filters in hidden Markov models and some applications. Ann. Stat. 41:2877–2904.

Chen M.-H., Kuo L., Lewis P.O. 2014. Bayesian phylogenetics: methods, algorithms, and applications. Boca Raton, FL: CRC.

Chopin N. 2004. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Stat. 32:2385–2411.

Del Moral P. 2004. Feynman-Kac formulae: genealogical and interacting particle systems with applications. New York: Springer.

Del Moral P., Doucet A., Jasra A. 2006. Sequential Monte Carlo samplers. J. R. Stat. Soc. Series B Stat. Methodol. 68:411–436.

Del Moral P., Doucet A., Jasra A. 2012. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Stat. Comput. 22:1009–1020.

Devroye L. 1986. Non-uniform random variate generation. New York: Springer-Verlag.

Dinh V., Darling A.E., Matsen IV F.A. 2017. Online Bayesian phylogenetic inference: theoretical foundations via sequential Monte Carlo. Syst. Biol. 67:503–517.

Douc R., Cappé O. 2005. Comparison of resampling schemes for particle filtering. In: ISPA 2005. Proceedings of the 4th international symposium on image and signal processing and analysis, 2005. IEEE, Zagreb, Croatia. p. 64–69.

Doucet A., de Freitas N., Gordon N. 2001. Sequential Monte Carlo methods in practice. New York: Springer.

Doucet A., Johansen A.M. 2011. A tutorial on particle filtering and smoothing: fifteen years later. In: Crisan D., Rozovskii B., editors. The Oxford handbook of nonlinear filtering. New York: Oxford University Press. p. 656–704.

Drummond A., Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7:214.

Everitt R.G., Culliford R., Medina-Aguayo F., Wilson D.J. 2016. Sequential Bayesian inference for mixture models and the coalescent using sequential Monte Carlo samplers with transformations. Technical report, arXiv:1612.06468.

Fan Y., Wu R., Chen M.-H., Kuo L., Lewis P.O. 2010. Choosing among partition models in Bayesian phylogenetics. Mol. Biol. Evol. 28:523–532.

Felsenstein J. 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22:240–249.

Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.

Forney G.D. 1973. The Viterbi algorithm. Proc. IEEE 61:268–278.

Fourment M., Claywell B.C., Dinh V., McCoy C., Matsen IV F.A., Darling A.E. 2018a. Effective online Bayesian phylogenetics via sequential Monte Carlo with guided proposals. Syst. Biol. 67:490–502.

Fourment M., Magee A.F., Whidden C., Bilge A., Matsen IV F.A., Minin V.N. 2018b. 19 dubious ways to compute the marginal likelihood of a phylogenetic tree. Technical report, arXiv:1811.11804.

Friel N., Pettitt A.N. 2008. Marginal likelihood estimation via power posteriors. J. R. Stat. Soc. Series B Stat. Methodol. 70:589–607.

Gelman A., Meng X.-L. 1998. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat. Sci. 13:163–185.

Geweke J. 2004. Getting it right. J. Am. Stat. Assoc. 99:799–804.

Görür D., Boyles L., Welling M. 2012. Scalable inference on Kingman's coalescent using pair similarity. J. Mach. Learn. Res. 22:440–448.

Görür D., Teh Y.W. 2009. An efficient sequential Monte Carlo algorithm for coalescent clustering. In: Koller D., Schuurmans D., Bengio Y., Bottou L., editors. Advances in neural information processing systems. Curran Associates, Inc. p. 521–528.

Gunawan D., Kohn R., Quiroz M., Dang K.-D., Tran M.-N. 2018.Subsampling sequential Monte Carlo for static Bayesian models.Technical report, arXiv:1805.03317.

Hajiaghayi M., Kirkpatrick B., Wang L., Bouchard-Côté A. 2014.Efficient continuous-time Markov chain estimation. In: Eric P.Xing, Tony Jebara, editors. Proceedings of the 31st InternationalConference on Machine Learning (ICML), Vol. 32. Bejing, China:PMLR. p. 638–646.

Höhna S., Defoin-Platel M., Drummond A. 2008. Clock-constrainedtree proposal operators in Bayesian phylogenetic inference.In: 8th IEEE international conference on bioinformatics andbioengineering, Athens, Greece. p. 1–7.

Höhna S., Drummond A.J. 2012. Guided tree topology proposals forBayesian phylogenetic inference. Syst. Biol. 61:1–11.

Höhna S., Landis M.J., Heath T.A., Boussau B., Lartillot N., Moore B.R., Huelsenbeck J.P., Ronquist F. 2016. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol. 65:726–736.

Holder M., Lewis P.O. 2003. Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Gen. 4:275.

Huelsenbeck J.P., Larget B., Alfaro M.E. 2004. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. 21:1123–1133.

Huelsenbeck J.P., Ronquist F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.

Jeffreys H. 1935. Some tests of significance, treated by the theory of probability. In: Mathematical proceedings of the Cambridge Philosophical Society, Vol. 31. Cambridge University Press. p. 203–222.

Jow H., Hudelot C., Rattray M., Higgs P.G. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:1591–1601.

Jun S.-H., Bouchard-Côté A. 2014. Memory (and time) efficient sequential Monte Carlo. In: Proceedings of the 31st International Conference on Machine Learning (ICML), Vol. 31. Beijing, China: PMLR. p. 514–522.

Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111–120.

Kong A. 1992. A note on importance sampling using standardized weights. Technical report 348, Department of Statistics, The University of Chicago.

Kuhner M.K., Felsenstein J. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459–468.

Lakner C., van der Mark P., Huelsenbeck J.P., Larget B., Ronquist F. 2008. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol. 57:86–103.

Larget B., Simon D. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750–759.

Lartillot N. 2006. Conjugate Gibbs sampling for Bayesian phylogenetic models. J. Comput. Biol. 13:1701–1722.

Lartillot N., Lepage T., Blanquart S. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25:2286–2288.

Lartillot N., Philippe H., Lewis P. 2006. Computing Bayes factors using thermodynamic integration. Syst. Biol. 55:195–207.

Lemey P., Rambaut A., Welch J.J., Suchard M.A. 2010. Phylogeography takes a relaxed random walk in continuous space and time. Mol. Biol. Evol. 27:1877–1885.

Li S., Pearl D.K., Doss H. 2000. Phylogenetic tree construction using Markov chain Monte Carlo. J. Am. Stat. Assoc. 95:493–508.

Mau B., Newton M., Larget B. 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo. Biometrics 55:1–12.

Miller J.C., Maloney C.J. 1963. Systematic mistake analysis of digital computer programs. Commun. ACM 6:58–63.

Neal R.M. 2001. Annealed importance sampling. Stat. Comput. 11:125–139.

Neal R.M. 2005. Estimating ratios of normalizing constants using linked importance sampling. Technical report, arXiv:math/0511216.

Newton M.A., Raftery A.E. 1994. Approximate Bayesian inference with the weighted likelihood bootstrap. J. R. Stat. Soc. Series B Methodol. 56:3–48.

Oaks J.R., Cobb K.A., Minin V.N., Leaché A.D. 2018. Marginal likelihoods in phylogenetics: a review of methods and applications. Technical report, arXiv:1805.04072.

Olsson J., Douc R. 2017. Numerically stable online estimation of variance in particle filters. Technical report, arXiv:1701.01001.

Owen M., Provan J.S. 2011. A fast algorithm for computing geodesic distances in tree space. IEEE/ACM Trans. Comput. Biol. Bioinform. 8:2–13.

Quiroz M., Kohn R., Villani M., Tran M.-N. 2018a. Speeding up MCMC by efficient data subsampling. J. Am. Stat. Assoc. 1–13.

Quiroz M., Tran M.-N., Villani M., Kohn R. 2018b. Speeding up MCMC by delayed acceptance and data subsampling. J. Comput. Graph. Stat. 27:12–22.

Rannala B., Yang Z. 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43:304–311.

Rannala B., Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164:1645–1656.

Roberts G.O., Gelman A., Gilks W.R. 1997. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7:110–120.

Robinson D., Foulds L. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.

Robinson D.F., Foulds L.R. 1979. Comparison of weighted labelled trees. In: Horadam A.F., Wallis W.D., editors. Combinatorial mathematics VI. Berlin: Springer. p. 119–126.

Smith R.A., Ionides E.L., King A.A. 2017. Infectious disease dynamics inferred from genetic data via sequential Monte Carlo. Mol. Biol. Evol. 34:2065–2084.

Tavaré S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17:57–86.

Teh Y.W., Daume III H., Roy D.M. 2008. Bayesian agglomerative clustering with coalescents. In: Platt J.C., Koller D., Singer Y., Roweis S.T., editors. Advances in neural information processing systems 20. Vancouver, B.C., Canada: Curran Associates, Inc. p. 1473–1480.

Tierney L. 1994. Markov chains for exploring posterior distributions. Ann. Stat. 22:1701–1762.

Wang L., Bouchard-Côté A., Doucet A. 2015. Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. J. Am. Stat. Assoc. 110:1362–1374.

Wingate D., Goodman N., Stuhlmueller A., Siskind J.M. 2011. Nonstandard interpretations of probabilistic programs for efficient inference. In: Shawe-Taylor J., Zemel R.S., Bartlett P.L., Pereira F., Weinberger K.Q., editors. Advances in neural information processing systems 24. Granada, Spain: Curran Associates, Inc. p. 1152–1160.

Xie W., Lewis P.O., Fan Y., Kuo L., Chen M.-H. 2010. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst. Biol. 60:150–160.

Yang Z., Rannala B. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14:717–724.

Zhao T., Wang Z., Cumberworth A., Gsponer J., Freitas N.D., Bouchard-Côté A. 2016. Bayesian analysis of continuous time Markov chains with application to phylogenetic modelling. Bayesian Anal. 11:1203–1237.

Zhou Y., Johansen A.M., Aston J.A. 2016. Toward automatic model comparison: an adaptive sequential Monte Carlo approach. J. Comput. Graph. Stat. 25:701–726.
