*beast in beast 2.3.x estimating species trees from...

*BEAST in BEAST 2.3.x

Estimating Species Trees from Multilocus Data

Joseph Heled, Remco Bouckaert, Walter Xie and Alexei J Drummond

November 10, 2015

1 Introduction

In this tutorial we describe a full Bayesian framework for species tree estimation.The statistical methodology described in this tutorial is known by the acronym*BEAST (pronounced ”star beast”) [3].

You will need the following software at your disposal:

• BEAST [2]- this package contains the BEAST program, BEAUti, TreeAn-notator and other utility programs. This tutorial is written for BEASTv2.3.x, which has support for multiple partitions. It is available for down-load from http://beast2.org.

• Tracer - this program is used to explore the output of BEAST (and otherBayesian MCMC programs). It graphically and quantitively summarizesthe distributions of continuous parameters and provides diagnostic infor-mation. At the time of writing, the current version is v1.6. It is availablefor download from http://tree.bio.ed.ac.uk/software/figtree/.

• FigTree - this is an application for displaying and printing molecularphylogenies, in particular those obtained using BEAST. At the time ofwriting, the current version is v1.4.2. It is available for download fromhttp://tree.bio.ed.ac.uk/software/tracer/.

2 *BEAST

This tutorial will guide you through the analysis of three loci sampled from26 individuals representing nine species of pocket gophers. This is a subsetof previous published data [1]. The objective of this tutorial is to estimatethe species tree that is most probable given the multi-individual multi-locussequence data. The species tree has nine taxa, whereas each gene tree has 26taxa. *BEAST 2 will co-estimate three gene trees embedded in a shared speciestree [3, for details].

1

Figure 1: Select a new template in BEAUti.

The first step will be to convert a NEXUS file with a DATA or CHARAC-TERS block into a BEAST XML input file. This is done using the programBEAUti (Bayesian Evolutionary Analysis Utility). This is a user-friendly pro-gram for setting the evolutionary model and options for the MCMC analysis.The second step is to actually run BEAST using the input file that contains thedata, model and settings. The final step is to explore the output of BEAST inorder to diagnose problems and to summarize the results.

BEAUti

Run BEAUti by double clicking on its icon.

Set up BEAUti for *BEAST

*BEAST uses a different template from the standard. This means that to useBEAUti for *BEAST, the first thing to do is change the template. Choose theFile/Template/StarBeast item. When changing a template, BEAUti deletes allpreviously imported data and start with a new empty template. So, if youalready loaded some data, a warning message pops up indicating that this datawill be lost if you switch templates.

Loading the NEXUS file

*BEAST is a multi-individual, multi-locus method method. The data for eachlocus is stored as one alignment in its own NEXUS file. Taxa names in eachalignment have to be unique, but duplicates across alignments are fine.

2

To load a NEXUS format alignment, simply select the Import Alignment

option from the File menu:Select three files called 26.nex, 29.nex, 47.nex by holding shift key. You can

find the files in the examples/nexus directory in the directory where BEASTwas installed. Or click the links to download the data. After the data is openedin your web browser each time, right click mouse and separately save them as26.nex, 29.nex, 47.nex.

Each file contains an alignment of sequences of from an independent locus.The 26.nex looks like this (content has been truncated):

#NEXUS

[TBO26oLong]

BEGIN DATA;

DIMENSIONS NTAX =26 NCHAR=614;

FORMAT DATATYPE = DNA GAP = - MISSING = ?;

MATRIX

Orthogeomys_heterodus ATTCTAGGCAAAAAGAGCAATGC ...

Thomomys_bottae_awahnee_a ????????????????????ATGCTG ...

Thomomys_bottae_awahnee_b ????????????????????ATGCTG ...

Thomomys_bottae_xerophilus ????????????????????ATGCTG ...

Thomomys_bottae_cactophilus ????????????????AGCAATGCT ...

... ...

;

END;

Once loaded, the three partitions are displayed in the main panel. You candouble click any alignment (partition) to show its detail.

For multi-locus analyses, BEAST can link or unlink substitutions modelsacross the loci by clicking buttons on the top of Partitions panel. The defaultof *BEAST is unlinking all models: substitution model, clock model, and treemodels. Note that you should only unlink the tree model across data partitionsthat are actually genetically unlinked. For example, in most organisms all themitochondrial genes are effectively linked due to a lack of recombination andthey should be set up to use the same tree model in a *BEAST analysis.

Import trait(s) from a mapping file to fire *BEAST

Each taxon in a *BEAST analysis is associated with a species. Typicallythe species name is already embedded inside the taxon. The species nameshould be easy to extract; place it either at the beginning or the end, sepa-rated by a “special” character which does not appear in names. For exam-ple, aria 334259, coast 343436 (using an underscore) or 10x017b.wrussia,

2x305b.eastis (using a dot).

3

https://github.com/CompEvol/beast2/blob/master/examples/nexus/26.nex?raw=true



Figure 2: Data partition panel after loading alignments.

We need to tell BEAUti somehow which lineages in the alignments go withtaxa in the species tree. Select the Taxon Set panel, and a list of taxa from thealignments is shown together with a default guess by BEAUti. In this case, theguess is not very good, so we want to change this. You can manually changeeach of the entries in the table, or press the guess button and a dialog is shownwhere you can choose from several ways to try to detect the taxon from thename of the lineages, or have a mapping stored in a file. In this case, splittingthe name on the underscore character (’ ’) and selecting the second group willgive us the mapping that we need.

Alternatively, the mapping can be read from a trait file. A proper trait file istab delimited. The first row is always traits followed by the keyword species

in the second column and separated by tab. The rest of the rows map eachindividual taxon name to a species name: the taxon name in the first columnand species name in the second column separated by tab. For example:

traits species

taxon1 speciesA

taxon2 speciesA

taxon3 speciesB

... ...

Setting the substitution model

The next thing to do is to click on the Site Model tab at the top of themain window. This will reveal the evolutionary model settings for BEAST.Exactly which options appear depend on whether the data are nucleotides, or

4

Figure 3: Selecting taxon sets in BEAUti using the guess dialog from the taxonset panel.

5

Figure 4: Setting up substitution and site models for the gopher alignments.

amino acids, or binary data, or general data. The settings that will appear afterloading the data set will be the default values so we need to make some changes.

Most of the models should be familiar to you. For this analysis, we will se-lect each substitution model listed on the left side in turn to make the followingchange: select HKY for substitution model and Empirical for the Frequen-cies.

Setting the clock model

Second, click on the Clock Models tab at the top of the main window. Inthis analysis, we use the Strict Clock molecular clock model as default. Yourmodel options should now look like this:

The Estimate check box is unchecked for the first clock model and checkedfor the rest clock models, because we wish to estimate the substitution rate ofeach subsequent locus relative to the first locus whose rate is fixed to 1.0.

6

Figure 5: Setting up clock models for the gopher alignments.

7

Figure 6: Setting up multi species coalescent parameters.

Multi Species Coalescent

The Multi Species Coalescent panel allows settings to the multi speciescoalescent model to be specified for each tree. *BEAST has a different tree priorpanel where users can only configure the species tree prior not gene tree priors(which are automatically specified by the multispecies coalescent). Currently,three population size models are available: Piecewise linear and constantroot, Piecewise linear, and Piecewise constant. In this analysis, we usepiecewise linear and constant root.

The Ploidy item determines the type of sequence (mitochondrial, nuclear,X, Y). This matters since different modes of inheritance gives rise to differenteffective population sizes. In this analysis, we simply use a random startingtree.

Priors and Operators

The Priors panel allows priors to be specified for each parameter in the model.Several models are available as tree priors, however, only the Yule or the Birth-Death Model are suitable for *BEAST analyses. Leave the tree prior setting atthe default Yule model. The Operators panel (hidden) is used to configuretechnical settings that affect the efficiency of the MCMC program. We leavethese two panels unchanged in this analysis.

8

Setting the MCMC options

The next tab, MCMC, provides more general settings to control the length ofthe MCMC and the file names.

Firstly we have the Length of chain. This is the number of steps theMCMC will make in the chain before finishing. The appropriate length of thechain depends on the size of the data set, the complexity of the model and theaccuracy of the answer required. The default value of 10,000,000 is entirelyarbitrary and should be adjusted according to the size of your data set. Forthis data set let’s set the chain length to 5,000,000 as this will run reasonablyquickly on most modern computers (a few minutes).

The next options specify how often the parameter values in the Markovchain should be displayed on the screen and recorded in the log file. The screenoutput is simply for monitoring the programs progress so can be set to any value(although if set too small, the sheer quantity of information being displayed onthe screen will actually slow the program down). For the log file, the valueshould be set relative to the total length of the chain. Sampling too often willresult in very large files with little extra benefit in terms of the precision ofthe analysis. Sample too infrequently and the log file will not contain muchinformation about the distributions of the parameters. You probably want toaim to store no more than 10,000 samples so this should be set to no less thanchain length / 10,000.

For this exercise we will set the screen log to 10,000 and the trace log

to 5,000. Set the file name of the trace log to gopher.log and that of thespeciesTreeLogger to gopher.species.trees.

If you are using windows then we suggest you add the suffix .txt to both ofthese (so, gopher.log.txt and gopher.species.trees.txt) so that Windowsrecognizes these as text files.

Generating the BEAST XML file

We are now ready to create the BEAST XML file. To do this, either select theFile/Save or File/Save As option from the File menu. Check the defaultpriors setting and click Continue. Save the file with an appropriate name (weusually end the filename with .xml, i.e., gopher.xml). We are now ready torun the file through BEAST.

Running BEAST

Now run BEAST and when it asks for an input file, provide your newly createdXML file as input by click Choose File ..., and then click Run.

BEAST will then run until it has finished reporting information to the screen.The actual results files are saved to the disk in the same location as your inputfile. The output to the screen will look something like this:

BEAST v2.3.1, 2002-2015

Bayesian Evolutionary Analysis Sampling Trees

Designed and developed by

Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard

9

Figure 7: Setting up the MCMC paremeters.

10

Figure 8: Launching BEAST.

11

Department of Computer Science

University of Auckland

[email protected]

[email protected]

Institute of Evolutionary Biology

University of Edinburgh

[email protected]

David Geffen School of Medicine

University of California, Los Angeles

[email protected]

Downloads, Help & Resources:

http://beast2.org/

Source code distributed under the GNU Lesser General Public License:

http://github.com/CompEvol/beast2

BEAST developers:

Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,

Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,

Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,

Oliver Pybus, Chieh-Hsi Wu, Walter Xie

Thanks to:

Roald Forsberg\includeimage, Beth Shapiro and Korbinian Strimmer

... ...

4980000 -3827.2093 279.2 -4293.2486 22.1146 24s/Msamples

4990000 -3842.9689 276.3 -4295.6213 17.6877 24s/Msamples

5000000 -3798.1962 278.2 -4280.7865 22.0904 24s/Msamples

Operator Tuning #accept #reject Pr(m) Pr(acc|m)

NodeReheight(Reheight.t:Species) - 205317 486572 0.1386 0.2967

ScaleOperator(popSizeBottomScaler.t:Species) 0.1902 10224 26338 0.0074 0.2796

ScaleOperator(popMeanScale.t:Species) 0.4890 5795 16242 0.0044 0.2630

UpDownOperator(updown.all.Species) 0.5029 38885 108717 0.0295 0.2634

ScaleOperator(YuleBirthRateScaler.t:Species) 0.2416 7155 15125 0.0044 0.3211

ScaleOperator(treeScaler.t:26) 0.7580 8510 42962 0.0102 0.1653

ScaleOperator(treeRootScaler.t:26) 0.4424 11487 39392 0.0102 0.2258

Uniform(UniformOperator.t:26) - 287517 225590 0.1024 0.5603

SubtreeSlide(SubtreeSlide.t:26) 0.5560 418 256419 0.0512 0.0016 Try decreasing size to about 0.278

Exchange(narrow.t:26) - 115438 140923 0.0512 0.4503

Exchange(wide.t:26) - 1363 49988 0.0102 0.0265

WilsonBalding(WilsonBalding.t:26) - 1882 49890 0.0102 0.0364

ScaleOperator(StrictClockRateScaler.c:29) 0.4744 13355 37595 0.0102 0.2621






Exchange(wide.t:29) - 2229 49303 0.0102 0.0433


UpDownOperator(updown.29) 0.7793 10314 40694 0.0102 0.2022

ScaleOperator(StrictClockRateScaler.c:47) 0.5490 12974 38097 0.0102 0.2540






Exchange(wide.t:47) - 731 50281 0.0102 0.0143


UpDownOperator(updown.47) 0.7397 10853 40537 0.0102 0.2112

UpDownOperator(strictClockUpDownOperator.c:47) 0.7278 9987 41373 0.0102 0.1945

UpDownOperator(strictClockUpDownOperator.c:29) 0.7761 10040 40724 0.0102 0.1978

ScaleOperator(KappaScaler.s:26) 0.3401 469 1197 0.0003 0.2815



ScaleOperator(popSizeTopScaler.t:Species) 0.1910 23665 54936 0.0157 0.3011

Tuning: The value of the operator’s tuning parameter, or ’-’ if the operator can’t be optimized.

#accept: The total number of times a proposal by this operator has been accepted.

#reject: The total number of times a proposal by this operator has been rejected.

Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).

Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).

Total calculation time: 135.584 seconds

End likelihood: -3798.196207797654

12

Figure 9: Tracer with the gopher data.

Analyzing the results

Run the program called Tracer to analyze the output of BEAST. When themain window has opened, choose Import Trace File... from the File menuand select the file that BEAST has created called gopher.log. You should nowsee a window like in Figure 9.

Remember that MCMC is a stochastic algorithm so the actual numbers willnot be exactly the same.

On the left hand side is a list of the different quantities that BEAST haslogged. There are traces for the posterior (this is the log of the product ofthe tree likelihood and the prior probabilities), and the continuous parameters.Selecting a trace on the left brings up analyses for this trace on the right handside depending on tab that is selected. When first opened, the ‘posterior’ traceis selected and various statistics of this trace are shown under the Estimatestab. In the top right of the window is a table of calculated statistics for theselected trace.

Tracer will plot a (marginal posterior) distribution for the selected parameterand also give you statistics such as the mean and median. The 95% HPD lower

or upper stands for highest posterior density interval and represents the mostcompact interval on the selected parameter that contains 95% of the posteriorprobability. It can be thought of as a Bayesian analog to a confidence interval.

Select the treeModel.rootHeight parameter and the next three (hold shift

13

Figure 10: Tracer showing the root heights of the lineage trees.

whilst selecting). This will show a display of the age of the root and the threegene trees. If you switch the tab at the top of the window to Marginal Densitythen you will get a plot of the marginal posterior densities of each of these dateestimates overlayed, as shown in Figure 10.

Obtaining an estimate of the phylogenetic tree

BEAST also produces a sample of plausible trees. These can be summarizedusing the program TreeAnnotator. This will take the set of trees and identifya single tree that best represents the posterior distribution. It will then anno-tate this selected tree topology with the mean ages of all the nodes as well asthe 95% HPD interval of divergence times for each clade in the selected tree.It will also calculate the posterior clade probability for each node. Run theTreeAnnotator program and set it up to look like in Figure 11.

The burnin is the number of trees to remove from the start of the sample.Unlike Tracer which specifies the number of steps as a burnin, in TreeAnno-tator you need to specify the actual number of trees. For this run, we use thedefault setting.

The Posterior probability limit option specifies a limit such that if a nodeis found at less than this frequency in the sample of trees (i.e., has a posteriorprobability less than this limit), it will not be annotated. The default of 0.5

14

Figure 11: Using TreeAnnotator to summarise the tree set.

means that only nodes seen in the majority of trees will be annotated. Set thisto zero to annotate all nodes.

For Target tree type you can either choose a specific tree from a file or askTreeAnnotator to find a tree in your sample. The default option, Maximumclade credibility tree, finds the tree with the highest product of the posteriorprobability of all its nodes.

Choose Mean heights for node heights. This sets the heights (ages) of eachnode in the tree to the mean height across the entire sample of trees for thatclade.

For the input file, select the trees file that BEAST created (by default thiswill be called gopher.species.trees) and select a file for the output (here wecalled it gopher.species.tree).

Now press Run and wait for the program to finish.

Viewing the Species Tree

Finally, we can look at the tree in another program called FigTree. Run thisprogram, and open the gopher.species.tree file by using the Open commandin the File menu. The tree should appear. You can now try selecting some ofthe options in the control panel on the left. Try selecting Node Bars to getnode age error bars. Also turn on Branch Labels and select posterior to getit to display the posterior probability for each node. Under Appearance youcan also tell FigTree to colour the branches by the rate. You should end upwith something like Figure 12.

15

Figure 12: Figtree representation of the species tree.

16

Alternatively, you can load the species tree set into DensiTree and set it upas follows.

• Set burn-in to 500. The tree should not be collapsed any more.

• Show a root-canal tree to guide the eye. Perhaps, the intensity of thetrees is not large enough, so you might want to increase the intensity byclicking the icon in the button bar.

• Show clades, their mean and 95% HPD graphically, and posterior supportusing text. Now, too many clades are shown, and most are not of interest.Select ’Selected only’, then open the clade toolbar (menu Window/Viewclade toolbar), and select only highly supported clades. Also, select theclade consisting of heterodus, bottea, umbinus and townsendii, to showthat heterodus is an outgroup, but there is some support (over 16%) thatit is not.

• Drag the clade monticola and idahoensis up so that the 95% HPD bar doesnot overlap with the one for mazama, monticola, idahoensis and talpoidis.Increase font size of the label for better readability.

The image should look something like Figure 13.Exercise: There is about 75% support for heterodus to be an outgroup, and

about 17% for heterodus to be in a clade with bottea, umbinus and townsendii.Can you explain where the other 8% went?

DensiTree can be used to show the branch widths of summary tree fromtree annotator as population sizes. Under ‘Line Width’ in DensiTree, choose‘BY METADATA NUMBER’ for the bottom and for the top, and choose num-bers 2 and 3 in the ’top’ and ’bottom’ spinner. Left, the bottom representsdmv1, the top dmv2 in the summary tree, which do not quite match in areaswith little posterior support for the clades (see Figure 13 to see which cladeshave little support).

Right, the top is matched up with the bottom of the branch above, usingthe ‘Make fit to bottom’ option for the top. This looks a bit prettier, but maynot be quite accurate.

Alternatively, a consensus tree can be generated by biopy (https://github.com/jheled/biopy) with using 1-norm left, and 2-norm right.

Showing all consensus trees with population widths (use By Metadata Pat-tern, for top and bottom and use

.*dmv=.([^,]*).*

for the bottom pattern and

.*dmv=.[^,]*,([^\}]*).*

for the top pattern. This gives us this visualisation:

17

https://github.com/jheled/biopy

https://github.com/jheled/biopy

Figure 13: DensiTree representation of the species tree.

m a z a m a

monticola

idahoensis

talpoides

b o t t a e

umbrinus

townsendii

heterodus

m a z a m a

monticola

idahoensis

talpoides

b o t t a e

umbrinus

townsendii

heterodus

18

idahoensis

monticola

talpoides

m a z a m a

b o t t a e

umbrinus

townsendii

heterodus

idahoensis

monticola

talpoides

m a z a m a

b o t t a e

umbrinus

townsendii

heterodus

m a z a m a

monticola

idahoensis

talpoides

b o t t a e

umbrinus

townsendii

heterodus

19

Comparing your results to the prior

Using BEAUti, set up the same analysis but under the MCMC options, selectthe Sample from prior only option. This will allow you to visualize thefull prior distribution in the absence of your sequence data. Summarize thetrees from the full prior distribution and compare the summary to the posteriorsummary tree.

Bonus: Species delimitation using STACEY

Arguably, some of the gopher species in the dataset analysed so far belongs toa single species, judging from the posterior tree distribution in Figure 13. Totest this, we can run the same analysis as above with STACEY, which standsfor Species Tree And Classification Estimation, Yarely. You need the following:

• STACEY package, easiest installed using the package manager from theBEAUti File/Manage packages menu.

• SpeciesDelimitationAnalyser, available by downloadinghttp://www.indriid.com/2014/speciesDA.jar.

You can find more documentation on how to set up an analysis and opti-mise some settings in the STACEY package documentation, which is part of theSTACEY package (in the directory <BEAST package directory>/2.3/STACEY/doc/where <BEAST package directory> depends on the OS you are using. To findit, click the button with the question mark in the package manager dialog inBEAUti).

To set up the analysis in BEAUti do the same as for the *BEAST analysiswith the following changes:

• use the STACEY template instead of the StarBeast template.

• do not change the tree prior in the priors tab.

• for the tutorial, set the chain length to 2 million for a coffee break ofabout 2 minutes, or higher to get more acceptable ESSs (and a longercoffee break).

Save the XML file and run it with BEAST.Analyse the log file using

java -jar /path/to/speciesDA.jar -burnin 10 species_123456.trees out.txt

where the number 123455 should be equal to the seed used to run BEAST. Thiscreates a file out.txt which looks something like this:

20

count fraction similarity nclusters talpoides idahoensis townsendii umbrinus monticola bottae heterodus mazama

16 0.08376963350785341 165.85714285714275 5 1 4 5 5 4 5 2 3

19 0.09947643979057591 151.21428571428558 5 4 1 5 5 2 5 3 4

104 0.5445026178010471 139.92857142857142 6 1 2 6 6 3 6 4 5

2 0.010471204188481676 21.53571428571428 5 1 2 5 5 4 5 3 4

11 0.05759162303664921 17.5 5 4 1 5 5 4 5 2 3

1 0.005235602094240838 15.85714285714286 6 5 1 2 6 5 6 3 4

2 0.010471204188481676 15.000000000000002 4 3 3 4 4 3 4 1 2

6 0.031413612565445025 14.57142857142857 7 1 2 3 7 4 7 5 6

10 0.05235602094240838 10.0 4 2 3 4 4 3 4 1 2

2 0.010471204188481676 8.5 5 1 4 5 5 2 5 3 4

1 0.005235602094240838 7.607142857142858 8 1 2 3 4 5 6 7 8

2 0.010471204188481676 4.7857142857142865 4 3 3 4 4 1 4 2 3

3 0.015706806282722512 3.9642857142857144 5 4 4 5 5 1 5 2 3

2 0.010471204188481676 3.857142857142857 6 1 5 2 6 5 6 3 4

2 0.010471204188481676 2.928571428571429 7 1 2 7 3 4 7 5 6

2 0.010471204188481676 2.0 6 5 1 2 6 3 6 4 5

2 0.010471204188481676 2.0 4 3 1 4 4 3 4 2 3

2 0.010471204188481676 2.0 4 1 3 4 4 3 4 2 3

1 0.005235602094240838 1.0 7 1 2 7 7 3 4 5 6

1 0.005235602094240838 1.0 4 2 2 4 4 3 4 1 3

Note that the third line (with count=104) suggests more than half the timethe taxa townsendi, umbrinus and bottea form a single species.

References

[1] N.M. Belfiore, L. Liu, and C. Moritz. Multilocus phylogenetics of a rapidradiation in the genus Thomomys (Rodentia: Geomyidae). Systematic Bi-ology, 57(2):294, 2008.

[2] Remco R. Bouckaert, Joseph Heled, Denise Kuhnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drum-mond. Beast 2: a software platform for bayesian evolutionary analysis. PLoSComput Biol, 10(4):e1003537, Apr 2014.

[3] Joseph Heled and Alexei J Drummond. Bayesian inference of species treesfrom multilocus data. Mol Biol Evol, 27(3):570–80, Mar 2010.

21

*beast in beast 2.3.x estimating species trees from...

Documents