Christian Arnold
Bioinformatics Group, University of Leipzig
Bioinformatics Herbstseminar
October 23rd, 2009
Three Weeks of Experience at the formatics Institute
Content
1. The 10kTrees Project
2. Phylogenetic Targeting
3. Acknowledgements
1. The 10kTrees Project
Goals
• Updated primate phylogeny that includes phylogenetic uncertainty
– Use the newest available sequence data, include as many primate species as possible, and update regularly
– Use Bayesian methods to produce a set of >=10,000 primate-wide trees (with branch lengths) that are appropriate for taxonomically broad comparative research on primate behavior, ecology and morphology
• Make it accessible to other researchers
Methodology
1. Download sequences from GenBank
2. Select the longest available sequence for each gene in each species
3. Create individual fasta file with all available sequences for each gene
4. Create availability matrix
5. Identify species with non-overlapping genes
6. Create MSA using Muscle
7. Improve alignment quality using GBLOCKS
8. Identify best substitution model for each gene
9. Concatenate sequences and create partitioned dataset in MrBayes format
10. Run MrBayes
11. Evaluate MrBayes analysis and calculate consensus tree
12. Update website
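Steps 2 and 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline code: the function names (`longest_per_gene`, `availability_matrix`) and the toy records are made up, and the sequences are placeholders rather than real GenBank entries.

```python
def longest_per_gene(records):
    """Step 2: keep the longest sequence for each (species, gene) pair.

    records: iterable of (species, gene, sequence) tuples.
    """
    best = {}
    for species, gene, seq in records:
        key = (species, gene)
        if key not in best or len(seq) > len(best[key]):
            best[key] = seq
    return best


def availability_matrix(best, species_list, gene_list):
    """Step 4: 1 if a sequence exists for (species, gene), else 0."""
    return [[1 if (s, g) in best else 0 for g in gene_list]
            for s in species_list]


# Toy input (hypothetical sequences, for illustration only)
records = [
    ("Homo sapiens", "CYTB", "ATGACC"),
    ("Homo sapiens", "CYTB", "ATGACCCCAA"),
    ("Homo sapiens", "COI", "ATGTTC"),
    ("Pan troglodytes", "CYTB", "ATGACCCC"),
]
species_list = ["Homo sapiens", "Pan troglodytes"]
gene_list = ["CYTB", "COI"]

best = longest_per_gene(records)
matrix = availability_matrix(best, species_list, gene_list)

# Fraction of empty cells in the species-by-gene matrix
cells = len(species_list) * len(gene_list)
missing = 1 - sum(map(sum, matrix)) / cells
print(matrix, missing)
```

The availability matrix is also what the "missing data" figures in a sparse supermatrix refer to: the share of species-by-gene cells without a sequence.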
Version 1 vs. Version 2
• Species
– Version 1: 187
– Version 2: 231
• Genes
– Version 1: 4 mitochondrial (COI, COII, CYTB and ND1) and 1 autosomal gene (SRY)
– Version 2: 6 mitochondrial (12S rRNA, 16S rRNA, COI, COII, CYTB, cluster of other mitochondrial genes) and 3 autosomal genes (SRY, CCR5, MC1R)
• Genetic loci
– Version 1: 2
– Version 2: 4
• Total no. of sites
– Version 1: 5134
– Version 2: ~9000
• Collected sequences
– Version 1: 413 out of 935 total (55.8% missing data)
– Version 2: 1007 out of 2079 total (51.6% missing data)
• No. of constraints
– Version 1: 29
– Version 2: 1
• Generations
– Version 1: 8 million
– Version 2: 60 million
• Computing time
– Version 1: ~48 days (16 processors in parallel, ~3 days each)
– Version 2: ~2 years (32 processors in parallel, ~3 weeks each)
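The "missing data" percentages can be checked with simple arithmetic. The totals are consistent with species times genes (187 × 5 = 935 and 231 × 9 = 2079), so the following sketch assumes that this is how the totals were obtained; the helper name `missing_fraction` is made up for illustration.

```python
def missing_fraction(collected, n_species, n_genes):
    """Return (total cells, fraction of cells without a sequence)."""
    total = n_species * n_genes
    return total, 1 - collected / total


total_v1, missing_v1 = missing_fraction(413, 187, 5)
total_v2, missing_v2 = missing_fraction(1007, 231, 9)

print(total_v1, round(100 * missing_v1, 1))  # 935 55.8
print(total_v2, round(100 * missing_v2, 1))  # 2079 51.6
```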
Preliminary consensus tree
Green: Cercopithecines
Blue: Hominoids
Red: Platyrrhines
Yellow: Tarsiers
Brown: Strepsirrhines
Rooted with Galeopterus variegatus
The 10kTrees Website
http://10ktrees.fas.harvard.edu/
Current Progress
• Submitted to Evolutionary Anthropology, in press.
• Will be presented at the AAPA conference (April 2010) in Albuquerque, New Mexico
• Version 2 is almost finished
• Available at http://10kTrees.fas.harvard.edu
Summary
• Bayesian approach is time-consuming, but works well, even though data matrix is very sparse
• Increased number of sequences in Version 2 dramatically reduces the need for constraints and improves the quality of the tree and of the branch length estimates
• Ongoing project
• Total number of downloaded trees since June 2009: 95800
2. Phylogenetic Targeting
Which species should we study?
For which species should we collect data in order to increase the size of comparative data sets?
Goals
?
Example 1/2
• Hypothesis: Two characters (x and y) show correlated evolution
• Goal: Test this hypothesis comparatively (e.g. by using phylogenetically independent contrasts and correlation tests)
• Problem 1: Data has been only collected for x, but not for y
• Solution 1: Collect data for y and test hypothesis
• Problem 2: From which species should we collect data for y?
• Solution 2: Phylogenetic targeting!?
Example 2/2
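[Figure: phylogeny of five species (s1 to s5) with two characters at the tips. Brain size is known for all five species (values 4, 9, 10, 3, 2); cognitive data is known for one species only (value 7) and marked "?" for the other four.]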
Collecting new data is time-consuming and expensive…
Methods
• Systematically generate all possible pairwise comparisons
• For every pairwise comparison, calculate character differences for the two species that form the pair and assign a score
• Determine set of phylogenetically independent pairs that maximizes the sum of all selected pair scores (maximal pairing)
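The three steps above can be illustrated with a brute-force sketch on a toy tree. Several things here are assumptions made for illustration and are not taken from the slides: the four-species tree, the character values, the score (absolute difference in one character), and the reading of "phylogenetically independent" as "the paths connecting the pairs share no branch". Exhaustive search is only feasible for very small trees.

```python
from itertools import combinations

# Hypothetical rooted tree ((s1,s2)a,(s3,s4)b)root, stored as child -> parent
PARENT = {"s1": "a", "s2": "a", "s3": "b", "s4": "b", "a": "root", "b": "root"}
# Hypothetical values of the character that is already known
X = {"s1": 2.0, "s2": 3.0, "s3": 4.0, "s4": 10.0}


def path_to_root(node, parent):
    """List of (child, parent) edges from node up to the root."""
    edges = []
    while node in parent:
        edges.append((node, parent[node]))
        node = parent[node]
    return edges


def path_edges(u, v, parent):
    """Edges on the tree path between u and v."""
    return set(path_to_root(u, parent)) ^ set(path_to_root(v, parent))


def score(u, v, x):
    """Assumed pair score: absolute character difference."""
    return abs(x[u] - x[v])


def maximal_pairing(leaves, parent, x):
    """Try all sets of pairs whose paths share no edge; keep the best sum."""
    pairs = [(u, v, path_edges(u, v, parent), score(u, v, x))
             for u, v in combinations(leaves, 2)]
    best = (0.0, [])

    def extend(start, used, total, chosen):
        nonlocal best
        if total > best[0]:
            best = (total, list(chosen))
        for i in range(start, len(pairs)):
            u, v, edges, s = pairs[i]
            if not (edges & used):  # independent of all chosen pairs
                chosen.append((u, v))
                extend(i + 1, used | edges, total + s, chosen)
                chosen.pop()

    extend(0, frozenset(), 0.0, [])
    return best


result = maximal_pairing(sorted(X), PARENT, X)
print(result)  # (8.0, [('s1', 's4')])
```

In this toy case the single distant pair (s1, s4) scores 8 and beats the two sister pairs (1 + 6 = 7), which shows why the best set of pairs is not always the set of closest relatives.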
[Figure: example phylogeny with seven species, s1 to s7]
Maximal pairing: Example
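Time complexity: $O(n^3)$, for balanced trees: $O(n^2 \log^2 n)$
Decomposition of the maximal pairing
[Equation: recursive decomposition of the maximal pairing score $S_T$ of a tree $T$, with maxima taken over subtrees $R$ and over the descendants of $T$; the formula did not survive text extraction.]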
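To give a feel for how a dynamic programming decomposition of this kind can work, here is a minimal sketch for strictly binary trees. It is not the recursion from the slide, which is not recoverable here, and it makes no attempt to reproduce the stated running times. The nested-tuple tree format, the function name `maximal_pairing_score`, the toy character values, and the pair weight (absolute character difference) are all assumptions; pairs count as independent when their connecting paths share no branch.

```python
def maximal_pairing_score(tree, weight):
    """Best total weight over sets of leaf pairs with edge-disjoint paths.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    weight: function (leaf, leaf) -> float.
    """
    def solve(node):
        # closed: best score if no path leaves this subtree through its top
        # open_:  {leaf: best score if that leaf's path leaves through the top}
        if isinstance(node, str):
            return 0.0, {node: 0.0}
        left, right = node
        closed_l, open_l = solve(left)
        closed_r, open_r = solve(right)
        # Option 1: no pair joins the two subtrees
        closed = closed_l + closed_r
        # Option 2: exactly one pair (u, v) joins them through this node
        for u, su in open_l.items():
            for v, sv in open_r.items():
                closed = max(closed, su + sv + weight(u, v))
        # A leaf whose path continues upward blocks the sibling subtree
        open_ = {u: su + closed_r for u, su in open_l.items()}
        open_.update({v: sv + closed_l for v, sv in open_r.items()})
        return closed, open_

    return solve(tree)[0]


# Hypothetical character values and two toy tree shapes
x = {"s1": 2.0, "s2": 3.0, "s3": 4.0, "s4": 10.0}


def diff(u, v):
    return abs(x[u] - x[v])


balanced = (("s1", "s2"), ("s3", "s4"))
caterpillar = ((("s1", "s2"), "s3"), "s4")
print(maximal_pairing_score(balanced, diff))
print(maximal_pairing_score(caterpillar, diff))
```

The sketch returns only the optimal score; recovering the selected pairs would require keeping back-pointers alongside each table entry.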
Simulation results 1/2
• Random (Rnd) selection of species
– Type 1 errors close to nominal level
– Power: ~40%, independent of number of taxa
– Uses 67% of available variation
• Phylogenetic targeting (PT) induced selection of species
– Type 1 errors close to nominal level
– Power: 67-81%, increases with number of taxa
– Uses 89% of available variation
Detecting correlated character evolution, based on selection of 12 species
Simulation results 2/2
[Figure: fraction of available variation after sampling (y-axis) for PT and Rnd selection, by number of selected species (x-axis: 12, 18, 24)]
Current Progress
• A revised version will be resubmitted to American Naturalist in the not too distant future
• TODO: Extend simulations and clarify some issues
• Available at http://phylotargeting.fas.harvard.edu
Summary
• A focused selection of species can save valuable time and money
• Phylogenetic targeting provides a very flexible approach and can address different questions in the context of limited resources
• Dynamic programming algorithms are everywhere
3. Acknowledgements
• Harvard University
• Max-Planck Institute for Evolutionary Anthropology
• University of Leipzig
• Charlie Nunn
• Luke Matthews
• Peter F. Stadler
Thanks!
Thank you for your attention!
Questions?
If not: Cheers (it’s early, but not too early…)