phylobayes - what, why, and how - evolution and...
TRANSCRIPT
-
PhyloBayes- what, why, and how
Xiaofan ZhouIntegrative Microbiology Research Centre
South China Agricultural University2019-01-25
-
PhyloBayesPhyloBayes
a Bayesian phylogenetic software based on mixture models
-
Bayesian phylogenetics
๐๐ ๐๐ ๐ท๐ท =๐๐ ๐ท๐ท ๐๐ โ ๐๐(๐๐)
๐๐(๐ท๐ท)
๐๐ ๐๐,๐ท๐ท = ๐๐ ๐๐ ๐ท๐ท โ ๐๐ ๐ท๐ท = ๐๐ ๐ท๐ท ๐๐ โ ๐๐(๐๐)
๐๐ ๐๐ ๐ท๐ท =๐๐ ๐ท๐ท ๐๐ โ ๐๐(๐๐)โ๐๐ ๐ท๐ท ๐๐ โ ๐๐(๐๐)
Posterior probability Likelihood Prior
-
Among site variation in sequence evolutionโข Evolutionary rates
โข Substitution modelโข Exchange rate matrixโข Equilibrium frequency profile (PhyloBayes-CAT!!!)
-
Components of substitution model
To
From A T C G
A aฯT bฯC cฯGT aฯA dฯC eฯGC bฯA dฯT fฯGG cฯA eฯT fฯC
Rate matrix of the GTR model
โexchangeabilityโ
To
From A T C G
A a b c
T a d e
C b d f
G c e f
Equilibrium frequencies
[ฯA, ฯT, ฯC, ฯG]
F81 (Poisson) model:a = b = c = d = e = f =1
-
Dirichlet process for site assignment
-
Long-Branch Attraction
True tree Inferred tree
-
PhyloBayes-CAT can suppress LBA
-
Models available in PhyloBayes
Exchangeability matrix
Equilibrium frequency Evolutionary Rate Data partition
โข Poisson
โข Empirical model- JTT- LGโฆ
โข GTR
โข Mixture model
โข Site homogeneous- similar to โ+Fโ in ML
โข Site heterogeneous- CAT- empirical mixture models (e.g. C10-C60)
โข Gamma distribution- discrete- continuous
โข CAT model of rate
โข Partially linked branch-length
โข Partition-specific substitution matrix not supported (?)
-
Partitioned PhyloBayes-MPI
-
Limitations of PhyloBayes
โข CAT+GTR model is very slowโฆโฆโข Even with PhyloBayes-MPI
โข Hard to convergeโข Particularly CAT+GTR
โข CAT-Poisson is often used instead of CAT-GTR
โข Results of unconverged analyses were reported
-
โโฆโฆ CAT-F81 consistently performed worse than other models (partitioning and CAT-GTR) in inferring the correct branching patterns โฆโฆโ
-
Alternative solution in ML
โข Empirical mixture models in MLโข C10 to C60
โข Firstly introduced in PhyML-CAT
โข Fast implementation in IQ-TREE, further accelerated using PMSF
-
Alternative solution in ML
โข Empirical mixture models in MLโข C10 to C60
โข Fast implementation in IQ-TREE, further accelerated using PMSF
Tree,Alignment,Model (e.g., C10)
Wang et al. (2017) Syst. Biol.
-
Model comparison in PhyloBayes
โข Bayes factor
โข Cross-Validationโข How well can the models predict future data?
โข Posterior predictive analysisโข How well does simulated data approximate observed data?
-
Cross Validation
โข ๐๐ ๐ท๐ท๐ก๐ก๐ก๐ก๐ก๐ก๐ก๐ก ๐ท๐ท๐๐๐ก๐ก๐๐๐๐๐๐,๐๐๐๐ ๐ฃ๐ฃ๐ก๐ก. ๐๐(๐ท๐ท๐ก๐ก๐ก๐ก๐ก๐ก๐ก๐ก|๐ท๐ท๐๐๐ก๐ก๐๐๐๐๐๐,๐๐๐๐)
Original alignment
L0
L1...
Ln
T0
T1...
Tn
Posterior distribution of model parameters
๐๐ ๐ท๐ท๐ก๐ก๐ก๐ก๐ก๐ก๐ก๐ก ๐ท๐ท๐๐๐ก๐ก๐๐๐๐๐๐,๐๐
-
Posterior predictive analysis
โข Compare observed data vs. simulate data (using trained model)
โข Resemble parametric bootstrap in MLโข Parameters sampled from posterior distributions instead of being fixed
โข Available tests:โข Biochemical specificityโข Compositional homogeneityโข Saturation
-
Tutorial
1. Run PhyloBayes analyses on a toy dataset with four taxa:โข Check for convergence (e.g., TRACER)โข Compare inferred tree with true tree (e.g., phylo.io)โข Try different models (e.g., F81, GTR, CAT+F81, CAT+GTRโฆ)
2. Perform model comparisonโข Cross-Validationโข Posterior predictive analysis
3. Compare with empirical mixture models in ML (IQ-TREE)
-
The dataset was simulated under LBA
โข 4 taxa, ~21,000 amino acid sites
โข Simulated under LG+C60+F
โข ๐๐ ranges from 1 to 10 to model different levels of LBA artefact
True tree
0.7
0.01 โ ๐๐
Wang et al. (2018) Syst. Biol.
-
Markov chain Monte Carlo
-
When does it finish?
โข Check for convergence between runs
โข Visually:โข TRACER (http://beast.community/tracer)โข AWTY (http://king2.scs.fsu.edu/CEBProjects/awty/awty_start.php)โข Graphylo (https://github.com/wrf/graphphylo)
โข bpcomp (in PhyloBayes)
-
When does it finish?
-
PhyloBayes๏ฟฝ- what, why, and howPhyloBayes๏ฟฝBayesian phylogeneticsAmong site variation in sequence evolutionComponents of substitution modelDirichlet process for site assignmentLong-Branch AttractionPhyloBayes-CAT can suppress LBAModels available in PhyloBayesPartitioned PhyloBayes-MPILimitations of PhyloBayesๅนป็ฏ็็ผๅท 12Alternative solution in MLAlternative solution in MLModel comparison in PhyloBayesCross ValidationPosterior predictive analysisๅนป็ฏ็็ผๅท 18TutorialThe dataset was simulated under LBAMarkov chain Monte CarloWhen does it finish?When does it finish?ๅนป็ฏ็็ผๅท 24