An Introduction to Molecular Phylogeny
Dr. Kerstin Hoef-Emden
Universität zu Köln
Botanisches Institut
Gyrhofstr. 15
50931 Köln
-
What is molecular phylogeny?
phylon = Greek for stem
genesis = Greek for origin
molecular phylogeny = studying relationships among organisms using molecular markers (e.g. DNA or protein sequences)
dissimilarities among sequences = genetic divergence caused by mutations during the course of time
-
Molecular Phylogenetic Methods
- their accuracy can be tested in in silico simulations
- are based on assumptions about the processes of molecular evolution
- may be computationally intensive (this refers more to CPU time than to memory)
- may be sensitive to artefacts
- usually results are displayed as trees
-
Accuracy of Molecular Phylogenetic Methods
consistency = Does a method reconstruct the correct tree given an infinite amount of data? (All methods do, if their assumptions are not violated.)
efficiency = How quickly does a method converge to the correct tree with a finite amount of data? (The less data needed to infer the correct tree, the more efficient the method.)
robustness = How well does a method perform if its assumptions about the evolutionary process are violated?
-
How to test phylogenetic methods?
e.g. by simulation in silico (= in the computer)
a) Simulate the evolution of a randomly chosen DNA or protein sequence under a given evolutionary model and tree topology into several lineages.
b) Use the phylogenetic method under test to infer a phylogenetic tree.
c) Does the resulting phylogenetic tree correspond to the true tree?
d) Modify the tree topology to different extremes in branch lengths and repeat the test.
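Step a) can be sketched in a few lines of Python. The sketch assumes the Jukes-Cantor model (all substitutions equally likely) and a hypothetical tree with only two lineages descending from one ancestor; the function names, branch lengths and random seed are illustrative assumptions, not part of any published simulation study.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve seq along one branch under Jukes-Cantor (JC69).

    branch_length is the expected number of substitutions per site.
    """
    # JC69 probability that a site shows a different base at the end of the branch
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_two_lineages(length, t1, t2, seed=1):
    """Return (ancestor, descendant 1, descendant 2) for branch lengths t1 and t2."""
    rng = random.Random(seed)
    ancestor = "".join(rng.choice(BASES) for _ in range(length))
    return ancestor, evolve(ancestor, t1, rng), evolve(ancestor, t2, rng)
```

Feeding the simulated descendants into the method under test and comparing the inferred tree with the known one corresponds to steps b) and c); making t1 and t2 very unequal corresponds to step d).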
-
Phylogenetic Methods and Real Life Sequences
- The true tree is unknown; each inferred tree represents a hypothesis.
- No infinite amounts of data are available (no nuclei with infinite space, which contain infinite amounts of DNA).
- By using robust and efficient methods and appropriate evolutionary models, the inferred trees hopefully come as close as possible to the real phylogeny.
- The simulation studies give some hints about potential vulnerabilities of the phylogenetic methods.
-
Trees: Nomenclature
terminal branch
internal branch
terminal node = operational taxonomic unit (OTU) = contemporary taxon
internal node = unknown ancestor = extinct taxon
mathematics: branch = edge; node = vertex (plural: vertices)
-
Tree Types: Unscaled Trees
slanted cladogram
rectangular cladogram
disadvantage: no information about evolutionary rates in a tree
-
Tree Types: Scaled Trees
rooted phenogram
unrooted phenogram
disadvantage: direction of evolution is unknown
advantage: higher resolution
-
Treefile Formats
Nexus
#NEXUS
Begin trees; [Treefile saved Thu Sep 16 03:12:19 2004]
Translate
1 MPorph,
2 S11679,
3 CCMP736,
4 MCont316,
5 S1382,
[...]
21 UTEX637
;
tree PAUP_1 = [&U] (1:0.096043,(((((2:0.041968,12:0.011298):0.014339,(13:0,20:0):0.012408):0.012250,3:0.188535):0.027987,(4:0.107423,5:0.115320):0.013260,((((14:0.001531,15:0):0.038667,19:0.005060):0.000989,18:0.018300):0.015597,(17:0.006567,21:0.029673):0.017735):0.009011):0.011165,(((6:0.021826,9:0.009905):0.002602,(7:0.014488,11:0.066979):0.005662):0.014995,(8:0.046098,10:0.079671):0.010953):0.009027):0.020750,16:0.048813);
End;
Phylip (Newick)
(MPorph:0.094736,(((((S11679:0.041475,U1424:0.011216):0.014077,(M1712:0,S10379:0):0.012338):0.011839,CCMP736:0.183396):0.027333,(MCont316:0.106050,S1382:0.117050):0.013282,((((S3794:0.001532,S3694:0):0.038496,S899:0.004933):0.001127,S3194:0.018102):0.015038,(S4094:0.006579,UTEX637:0.029760):0.018250):0.009256):0.011383,(((Chondrus:0.021641,S13531a:0.009722):0.002405,(S4194a:0.014451,S1896:0.065892):0.005690):0.014595,(S4194b:0.046357,S13531b:0.078882):0.010916):0.008914):0.020599,S5981:0.048038);
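To illustrate how a Newick string encodes a tree, the following sketch pulls the tip labels out of such a string. It is a deliberately small helper, not a full parser: it assumes unquoted labels and no bracketed comments such as [&U]. The example tree is a shortened, hypothetical tree that merely reuses some taxon names from the slide.

```python
import re

def newick_leaf_names(newick):
    """Return the tip labels of a Newick string in left-to-right order."""
    # A tip label directly follows "(" or ","; anything after ")" belongs
    # to an internal node (label or branch length) and is skipped.
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)

# Shortened illustrative tree (not the tree shown on the slide)
tree = "(MPorph:0.094736,((S11679:0.041475,U1424:0.011216):0.014077,S5981:0.048038):0.020599);"
print(newick_leaf_names(tree))
```

The nesting of the parentheses gives the topology and the number after each colon is the length of the branch leading to that node.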
-
Displaying Trees
Paup for MacOS 9: graphical output to screen, file or printer; nexus or Newick format (unscaled, scaled, rooted and unrooted trees)
Phylip: treefile-to-graphics converter; Newick format (unscaled, scaled, rooted and unrooted trees)
Paup for Windows or portable format (Unixoids): auxiliary program necessary
- Phylip converter programs
- TreeView
  for MacOS 9 (and Windows?): all tree types and tree editing; nexus and Newick
  for Unixoids: no unrooted trees, no tree editing; nexus and Newick
- TreeEdit (MacOS), Treetool etc.
-
Phylogeny Programs (1)
General purpose
Paup 4b: Windows, MacOS 9, Unixoids (Linux, Solaris, MacOS X etc.)
Phylip 3.62: Windows, MacOS 8, 9, X, Linux, C sources
Bayesian Analyses
MrBayes 3: Windows, MacOS, C sources (Unixoids)
Links to phylogeny-related software collected by Joe Felsenstein: http://evolution.genetics.washington.edu/phylip/software.html
-
Phylogeny Programs (2)
Paup* 4b10 = Phylogenetic Analysis Using Parsimony (* and other methods)
Written by David Swofford. The first versions up to Paup 3 were available for MacOS < 9 only and focused on the parsimony method. Paup 4 is available for different operating systems and is one of the most powerful tools for phylogenetic analyses of nucleotide sequences. It is sold as a beta version, but is more stable than some final releases of other commercial software.
MacOS 9 PPC: graphical user interface (i.e. mouse-driven) and graphical output of trees
All others (Windows, Unixoids): command line (can be submitted to batch queues, unfortunately no checkpointing)
Distributor in Europe: Palgrave-MacMillan, UK (Windows and MacOS 9 PPC; GBP 62/72)
Distributor in USA (and for portable versions): Sinauer Associates (USD 85-150)
-
Phylogeny Programs (3)
Phylip 3.62 = Phylogenetic Inference Package
Written by Joe Felsenstein. Multi-purpose package for nucleotide as well as protein sequences. Approx. 30 different programs to fulfil different tasks.
Freely available over the internet (http://evolution.gs.washington.edu/phylip.html).
Precompiled for: MacOS 8/9 PPC, MacOS X, Windows, Red Hat Linux (i386)
C sources for all other Unixoids
User interface: text-based menu system
-
Phylogeny Programs (4)
MrBayes 3
Written by John Huelsenbeck and Fredrik Ronquist. Specialised in Bayesian analyses. Handles nucleotide as well as protein sequences. Partitioned computation of concatenated data sets.
Freely available over the internet (http://morphbank.ebc.uu.se/mrbayes/)
Runs under MacOS X, Windows, Unixoids
C sources
User interface: command-line driven; syntax similar to Paup
Parallelised version available; no checkpointing.
-
Phylogeny Programs (5)
Input Formats
Paup: nexus format (interleaved or sequential)
Phylip: phylip format
MrBayes: nexus format (interleaved or sequential)
-
Phylogeny Programs (6)
Nexus Format
#NEXUS
BEGIN TAXA;
 DIMENSIONS NTAX=6;
 TAXLABELS 'S3694' 'S5981' 'S4094' 'S3194' 'S899' 'S10379' ;
END;
BEGIN CHARACTERS;
 DIMENSIONS NCHAR=14;
 FORMAT DATATYPE=NUCLEOTIDE GAP=- ;
MATRIX
[1] 'S3694'  cCCAAGCGTTTCCG
[2] 'S5981'  CCCAATCGTTTCCC
[3] 'S4094'  CCCAATCGTTTCCG
[4] 'S3194'  GCCAATCGTTTCCG
[5] 'S899'   CCCAAGCGTTTCCG
[6] 'S10379' CCCAATCGTTTCCG
;
END;
Phylip Format
6 14
S3694     cCCAAGCGTTTCCG
S5981     CCCAATCGTTTCCC
S4094     CCCAATCGTTTCCG
S3194     GCCAATCGTTTCCG
S899      CCCAAGCGTTTCCG
S10379    CCCAATCGTTTCCG
-
Work-Flow
aim: group of organisms or gene family
choice of molecular marker(s) and taxon sampling
amplification/sequencing
alignment
choice of evolutionary model
phylogenetic analyses
tree(s)
results
user-defined trees and topology testing
improvement of
-
Taxon Sampling
Strategy for an initial Taxon Sampling
- the diversity of the group should be represented (guessing by looking at the phenotype or using the systematics of the group), e.g. combinations of morphological characters or representatives of all species/genera of a group or serotypes or ...
- at least two representatives of each presumed clade (guessing)
- not too few taxa (> 15)
- outgroup taxa (closest related sister group only!)
-
Choice of Molecular Marker(s): Phylogenies of Gene Families (1)
All orthologues and paralogues (or alleles) of a gene in an organism have to be sequenced!
Why?
-
Choice of Molecular Marker(s): Phylogenies of Gene Families (2)
A very fictitious example of a weird tree caused by incomplete sampling of a gene family (this taxon sampling is also not recommended).
[Figure: "correct tree" with the tip labels Homo, Drosophila, Arabidopsis, Homo, Drosophila, Arabidopsis, Chlamydomonas, Chlamydomonas, next to a "very bad tree!" with the tip labels Drosophila, Homo, Arabidopsis, Chlamydomonas]
-
Choice of Molecular Marker(s): Phylogenies of Organisms (1)
- choose single-copy genes (protein-coding) or highly synchronised genes (ribosomal DNA)
- choose more variable genes for closely related organisms and conserved genes for more distantly related organisms
- in sexually reproducing organisms, two alleles may occur
-
Choice of Molecular Marker(s): Phylogenies of Organisms (2)
e.g. the eukaryotic ribosomal operon
[Figure: SSU rDNA, ITS1, 5.8S rDNA, ITS2, LSU rDNA]
conserved = potentially suited for phylogenies of genera or higher level taxa
highly variable = potentially suited for phylogenies of species or lower level taxa
-
Choice of Molecular Marker(s): Phylogenies of Organisms (3)
some examples of more conserved genes:
actin
elongation factor 1 (EF-1)
rbcL
tubulins
and lots more ...
-
DNA Amplification (1)
[Figure: flow chart with the elements genomic DNA, PCR, cloning, template for sequencing, mRNA, RT-PCR, cDNA]
-
DNA Amplification (2)
              genomic DNA                 cDNA
advantage     introns (add information)   no introns (just ORF)
disadvantage  introns (splice sites?)     no introns
-
DNA Amplification (3)
Taq polymerase
no proofreading = introduces reading errors (predominantly transitions)
direct sequencing (large template pool) -> usually no problem
cloning of PCR products -> problem!
solutions:
a) proofreading polymerase instead of Taq
b) sequence more than two clones (better: more than three; only an option if no allelic variation can be expected!)
-
Sequencing
Reduce/avoid sequencing errors
- sequencing of forward and reverse strands
- thorough proofreading
  ribosomal RNA sequences: secondary structure
  protein sequences: translation (stop codons?)
- BLAST search (PCR contamination, chimaeric sequences?)
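The translation check for protein-coding sequences can be sketched as follows. The sketch assumes the standard genetic code and a sequence that is already in the correct reading frame; the function names are illustrative. An internal stop codon in a supposedly coding region is a hint to go back to the raw sequencing data.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: the 64 codons enumerated in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna, frame=0):
    """Translate a DNA string; codons with ambiguous bases or gaps become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def internal_stops(dna, frame=0):
    """Return the codon indices of stop codons that are not the last codon."""
    protein = translate(dna, frame)
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]
```

For example, `translate("ATGTCACGAGTATAA")` gives `"MSRV*"`, and `internal_stops` returns an empty list for this sequence because the only stop codon is the terminal one.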
-
Alignment (1)
automatic alignment
good as a starting point for an alignment
not good if sequences contain a lot of indels and highly variable regions (i.e. non-coding regions such as ITS or intron sequences or variable regions in ribosomal RNA sequences)
proteins: check alignment afterwards by eye
ribosomal RNA, intron and ITS sequences: manual editing always needed
-
Alignment (2)
Second round of proofreading
Unusual amino acids in the translated sequence?
Deviations in a highly conserved region of ribosomal RNA?
One G whereas all others have two in a highly conserved region?
-> back to the assembly data and cross-checking (it may be true, though!)
-
Alignment (3)
The alignment is the very basis of the phylogenetic analyses. Software cannot distinguish between a real mutation and a sequencing or alignment error.
Effects of an Erroneous Alignment
Decreased resolution.
Worst case: artefactual tree topology
-
Alignment (4)
Preparation of the Alignment for the Phylogenetic Analyses
Exclusion of nonalignable regions and saving of the data set in nexus (PAUP) or phylip (Phylip, PAML, Molphy) format, depending on the software used for the phylogenetic analyses.
In protein-coding sequences: perhaps exclusion of the third codon position
-
Alignment (5)
e.g. ribosomal DNA
CCMP152 TAGGAAATCTAGAGCTAATACATGCACCATCGCTCTAATTTGATATTTT--------
M1303   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
M2180   TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTGTGTTTAGTT-------
S9772e  TAGGAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTTACAATATCTAA-----
S9772b  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
C9772a  TAGgAAA-CTACAGCTAATACATGCTCCATCGCTTTTTTTATATATATTTGT-----
M1703   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1481   TAGGAATTCTAGAGCTAATACATGCACCATCGCTTTTTTTTCTTTTTTCTTTTTTCTT
SB9801  TAGGAATTCTAGAGCTAATACATGCACCATCGTTTTTCTTGACAGGAAGGAAGAAAAA
M1318   TAGGAATTCTAGAGCTAATACATGCACCAGTGCCCTTAGTTTATTCTTTTTTAAGACA
M1312   TAGgAATTCTAGAGCTAATACATGCACCATAGCCTTTTGTAATTTTTTTTAAAGTTTT
Hruf    TAGGAATTCTAGAGCTAATACATGCCCCATCGCTTTCGAAGTTTTTTAATTTTTTTTC
mask    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXxxx
CTC = discussion zone between alignable and nonalignable regions
TTT = highly variable region -> exclude from analyses
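Applying such a mask before the analysis amounts to keeping only the marked columns. A minimal sketch, assuming the alignment is held as a dictionary of equally long strings and that only upper-case "X" marks a column to keep (treating the lower-case "x" discussion zone as excluded is a choice made for this example):

```python
def apply_mask(alignment, mask, keep="X"):
    """Keep only the alignment columns whose mask character equals keep.

    Columns beyond the end of the mask are dropped.
    """
    columns = [i for i, flag in enumerate(mask) if flag == keep]
    return {name: "".join(seq[i] for i in columns if i < len(seq))
            for name, seq in alignment.items()}

# Tiny made-up alignment: the first three columns are kept.
example = {"a": "ACGTAC", "b": "AC-TTT"}
print(apply_mask(example, "XXXxx"))
```

Alignment editors with mask support do the same thing on export, so that the nonalignable regions never reach the phylogeny program.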
-
Alignment (6)
Properties of an alignment editor suited for phylogenetic purposes
- multiple sequence alignment
- limits of sequence number and length sufficiently high?
- manual editing
- protected mode (no deletion of nucleotides)
- several import/export formats (e.g. clustal for automated pre-alignment)
- for phylogenetic analyses: nexus/phylip format export
- definition of a mask to exclude non-alignable regions
-
Alignment (7)
Automatic Alignment Tools
Clustal W and relatives
T-Coffee
Malign
TreeAlign
PileUp
and others ...
Manual Alignment
Align
BioEdit
SeaView
SeAl
ARB (RNA-coding DNA)
DCSE (RNA-coding DNA)
SeqLab (GCG)
MacVector
and others ...
-
Questions?
-
Evolutionary Models (1)
consist of a combination of the following parameters:
- base frequencies
- substitution rate matrix
- proportion of invariable sites
- gamma-distributed among-site rate variation
- (covarion/covariotide)
-
Evolutionary Models (2)
base frequencies
percentages of A, C, G or T in the alignment; can be set to:
- equal (0.25 for each nt)
- empirical (= computed from the alignment)
- estimate (= optimised as a likelihood parameter)
- manually set
homogeneity of base frequencies among taxa:
The program Paup performs a chi-square test for biased base frequencies (command line: basefreqs).
-
Evolutionary Models (3)
substitution rate matrix
assumptions about substitution rates of point mutations
Jukes-Cantor model: All point mutations occur at the same rate (number of substitution types [nst] = 1).
Hasegawa-Kishino-Yano and Kimura-2-parameter: Differing rates for transitions and transversions (nst=2).
General time reversible model (GTR): Each type of substitution has a different substitution rate; a substitution and its reversal are considered equally likely (nst=6).
-
Evolutionary Models (4)
substitution rate matrix

nst=2 (i = transition, v = transversion)
    A  C  G  T
A   -  v  i  v
C   v  -  v  i
G   i  v  -  v
T   v  i  v  -

nst=6 (equal rates for both directions)
    A  C  G  T
A   -  a  b  c
C   a  -  d  e
G   b  d  -  f
T   c  e  f  -

nst=6 (but 3 rate classes): the Tamura-Nei model
    A  C  G  T
A   -  a  b  a
C   a  -  a  e
G   b  a  -  a
T   a  e  a  -
-
Evolutionary Models (5)
proportion of invariable sites and gamma-distributed among-site rate variation
DNA sequences do not evolve at the same rate in all positions.
protein-coding sequences: faster rates at the third codon position
ribosomal DNA, internal transcribed spacers: alternating pattern of more conserved and highly variable regions correlating with secondary structure (helices and unpaired regions).
-
Evolutionary Models (6)
protein-coding genes: degenerate code
position  123 123 123
DNA       TCA CGA GTA
          TCC CGC GTC
          TCG CGG GTG
          TCT CGT GTT
protein   Ser Arg Val
ribosomal DNA
[Figure: RNA secondary structure (helix with paired bases and unpaired loop) shown above the DNA sequence TACCATGAAAAAGTGGAC; red = highly variable positions]
-
Evolutionary Models (7)
proportion of invariable sites = proportion of positions, which do not evolve
gamma- distributed among- site rate variation = nucleotides evolve at differing rates in differing positions; modelled by the shape parameter
both parameters may be used separately or can be combined
-
Evolutionary Models (8)
gamma-distributed among-site rate variation
[Figure: proportion of sites plotted against substitution rate; continuous gamma distribution with curves for the shape parameter values 0.25, 1 and ~10; discrete gamma distribution]
-
Evolutionary Models (9)
Covarion/Covariotide
Individual sequences or lineages evolve faster than others. = not evolving according to a molecular clock.
Not implemented in most phylogeny programs (MrBayes is an exception).
-
Evolutionary Models (10)
Example of a command line in Paup 4b to calculate the parameters of a Tamura-Nei model with unequal base frequencies, proportion of invariable sites and gamma distribution (a tree has to be available in memory):
lscores 1/ nst=6 basefre=est rmat=est rclass=(a b a a e a) pinv=est rate=gam shape=est;
-
Evolutionary Models (11)
different combinations of
base frequencies+
substitution rate matrix+
proportion of invariable sites/gamma distribution
= 56 evolutionary models in the program Paup 4b
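The number 56 can be reproduced by enumeration: the Modeltest output lists 14 substitution schemes, each of which is scored without rate heterogeneity, with +I, with +G and with +I+G. The short sketch below only builds the list of model names; the variable names are illustrative.

```python
from itertools import product

# Substitution schemes as named in the Modeltest output; each can be combined
# with no among-site rate variation, +I, +G, or +I+G.
SCHEMES = ["JC", "F81", "K80", "HKY", "TrNef", "TrN", "K81", "K81uf",
           "TIMef", "TIM", "TVMef", "TVM", "SYM", "GTR"]
RATE_OPTIONS = ["", "+I", "+G", "+I+G"]

MODELS = [scheme + rate for scheme, rate in product(SCHEMES, RATE_OPTIONS)]
print(len(MODELS))  # 56
```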
How to decide which model fits a data set best?
-
Choice of Evolutionary Model (1)
The program Modeltest 3.5 (by Posada and Crandall) performs hierarchical likelihood ratio tests (hLRT) and also computes the Akaike information criterion (AIC).
Modeltest consists of a command file for Paup (modelblock) and an executable (Posada Lab at http://darwin.uvigo.es/)
-
Choice of Evolutionary Model (2)
Running Modeltest
1.) Start Paup and load the data set.
2.) Load the modelblock of Modeltest into Paup with the command execute.
3.) Paup will follow the commands given in the modelblock: First a tree is constructed using the simplest and fastest method. Then Paup computes the likelihood values for all 56 evolutionary models for the data set given the tree. The likelihood scores of the 56 models are saved in a file called model.scores.
4.) The Modeltest executable is started and fed with the model.scores file. It performs the hLRT and AIC tests and saves the results to a file.
-
Choice of Evolutionary Model (3)
Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada ([email protected])
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed Sep 15 21:33:13 2004
Input format: Paup matrix file
** Log Likelihood scores **
                      +I         +G         +I+G
JC    = 3853.2573  3853.2573  3814.9705  3806.2795
F81   = 3843.7336  3843.7336  3806.0015  3797.3303
K80   = 3852.4849  3852.4849  3814.0562  3805.3757
HKY   = 3842.8232  3842.8232  3804.8003  3796.1357
TrNef = 3852.4060  3852.4060  3813.3804  3804.7378
TrN   = 3842.6401  3842.6401  3804.7976  3796.1355
K81   = 3851.4536  3851.4536  3813.0771  3804.3674
K81uf = 3842.2886  3842.2886  3804.3494  3795.6624
TIMef = 3851.3740  3851.3740  3812.3914  3803.7239
TIM   = 3842.1130  3842.1130  3804.3457  3795.6621
TVMef = 3846.7188  3846.7188  3807.1191  3798.2488
TVM   = 3840.2319  3840.2319  3802.2058  3793.3215
SYM   = 3846.6523  3846.6523  3806.4158  3797.5940
GTR   = 3839.9839  3839.9839  3802.2041  3793.3086
-
** Hierarchical Likelihood Ratio Tests (hLRTs) **
Equal base frequencies
 Null model = JC -lnL0 = 4124.0898
 Alternative model = F81 -lnL1 = 4117.4712
 2(lnL1-lnL0) = 13.2373 df = 3
 P-value = 0.004151
Ti=Tv
 Null model = F81 -lnL0 = 4117.4712
 Alternative model = HKY -lnL1 = 4117.0146
 2(lnL1-lnL0) = 0.9131 df = 1
 P-value = 0.339297
Equal rates among sites
 Null model = F81 -lnL0 = 4117.4712
 Alternative model = F81+G -lnL1 = 3806.0015
 2(lnL1-lnL0) = 622.9395 df = 1
 Using mixed chi-square distribution
 P-value =
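The arithmetic behind one likelihood ratio test can be retraced with a few lines of Python. This is only a sketch of the calculation, not Modeltest's own code: it takes the two -lnL values and the degrees of freedom from the output above and uses closed-form chi-square tail probabilities that are valid for 1, 2 or 3 degrees of freedom only. The mixed chi-square distribution used for the rate heterogeneity test is not covered.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution, for df = 1, 2 or 3 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 1, 2 or 3 are covered by this sketch")

def likelihood_ratio_test(neg_lnl_null, neg_lnl_alt, df):
    """Return (test statistic, P-value) for two nested models given their -lnL values."""
    statistic = 2.0 * (neg_lnl_null - neg_lnl_alt)
    return statistic, chi2_sf(statistic, df)

# JC versus F81 (three additional free base frequencies)
print(likelihood_ratio_test(4124.0898, 4117.4712, 3))
# F81 versus HKY (one additional transition/transversion parameter)
print(likelihood_ratio_test(4117.4712, 4117.0146, 1))
```

A small P-value means that the more complex model fits significantly better, so in this example the test favours F81 over JC but sees no need to move from F81 to HKY.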
-
Choice of Evolutionary Model (5)
Model selected: F81+I+G
 -lnL = 3797.3303
Base frequencies:
 freqA = 0.3045
 freqC = 0.2348
 freqG = 0.2328
 freqT = 0.2279
Substitution model:
 All rates equal
Among-site rate variation
 Proportion of invariable sites (I) = 0.4495
 Variable sites (G)
 Gamma distribution shape parameter = 0.6163
[...]
BEGIN PAUP;
Lset Base=(0.3045 0.2348 0.2328) Nst=1 Rates=gamma Shape=0.6163 Pinvar=0.4495;
END;
-
Questions?
-
Phylogenetic Analysis Methods
Combination of an Optimality Criterion with a Tree Search Algorithm
1) Optimality Criterion
Scoring method to decide which tree is the best
e.g. maximum parsimony, distance analysis, maximum likelihood
2) Tree Search Algorithm
Method to construct a tree
e.g. exhaustive search, branch-and-bound, heuristic search, quartet puzzling, neighbor-joining
-
Phylogenetic Analysis: Maximum Parsimony
The tree which requires the fewest mutation steps to explain the nucleotide pattern of an alignment is the best.
Each point mutation equals one point in the scoring system; thus in unweighted parsimony only integer values are possible as scores.
best tree = maximum parsimony tree (MPT)
if several trees are equally scored = equally parsimonious trees (EPT)
Problem: Evolutionary model is implicit and cannot be adapted to the data set. All mutations at all positions are considered equal, even in more variable regions.
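How the number of mutation steps is counted for a single alignment column can be sketched with Fitch's algorithm, a standard way of scoring unweighted parsimony. The tree representation as nested tuples and the function name are assumptions made for this example.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one alignment column on a rooted binary tree.

    tree: nested 2-tuples of tip names, e.g. (("Seq1", "Seq2"), ("Seq3", "Seq4"))
    states: dict mapping tip name -> character state in this column
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # the children agree on at least one state: no extra step needed
            return common, left_cost + right_cost
        # the children disagree: one mutation step is required
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]

tree = (("Seq1", "Seq2"), ("Seq3", "Seq4"))
column = {"Seq1": "A", "Seq2": "A", "Seq3": "C", "Seq4": "G"}
print(fitch_score(tree, column))  # 2
```

The parsimony score of a tree is the sum of these column scores over the whole alignment; the tree search then looks for the topology with the lowest sum.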
-
Phylogenetic Analysis: Distance Matrix Methods (1)
All sequences of an alignment are compared pairwise with each other. Each pair is assigned an evolutionary distance value expressing the degree of divergence. The results are listed in a distance matrix, which is used to construct a tree.
To calculate the distances, one of the 56 different evolutionary models can be chosen, or the estimators of the maximum likelihood method can be used.
The tree with the shortest sum of distances is the best. Since the distances are not integers, there is usually only one best tree.
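As an illustration of a model-corrected distance, the sketch below computes the uncorrected proportion of differing sites (p-distance) and the Jukes-Cantor correction for one pair of sequences. Skipping gapped columns and ignoring letter case are choices made for this example; the correction is undefined for p of 0.75 or more.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two sequences from the nexus example (S3694 and S5981)
print(p_distance("cCCAAGCGTTTCCG", "CCCAATCGTTTCCC"))
print(jc69_distance("cCCAAGCGTTTCCG", "CCCAATCGTTTCCC"))
```

The corrected distance is always at least as large as the p-distance, because the model accounts for multiple substitutions at the same site that are no longer visible in the alignment.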
-
Phylogenetic Analysis: Distance Matrix Methods (2)
paup> showdist
HKY85 distance matrix
                   1        2        3        4        5        6        7
1 MPorph           -
2 S11679     0.15819        -
3 CCMP736    0.16875  0.15158        -
4 MCont316   0.15953  0.13655  0.18535        -
5 S1382      0.17529  0.14077  0.18336  0.15227        -
6 Chondrus   0.10471  0.10205  0.15863  0.12481  0.13192        -
7 S4194a     0.10967  0.10370  0.15845  0.12654  0.12807  0.03539        -
-
Phylogenetic Analysis: Maximum Likelihood (1)
Probabilistic method: Tries to find the tree that maximises the probability of observing the data in the alignment. The likelihood is expressed as a negative natural logarithm (-lnL; the lowest -lnL is the best).
Computation steps:
- A tree is given.
- For each position in the alignment, the site-wise likelihood is calculated. This includes all possible combinations of ancestral character states in the tree.
- The likelihoods of all positions of the alignment are multiplied, i.e. their logarithms are summed, to give the total log likelihood value.
-
Phylogenetic Analysis: Maximum Likelihood (2)
Example
Alignment
       1234...
Seq 1  ATTA...
Seq 2  ACTA...
Seq 3  CCTA...
Seq 4  GGTG...
position 1 of 1500 positions
four-taxon tree with the tips A, A, C, G and two unknown ancestral nodes (?, ?) = 16 possible combinations of ancestral states:
A-A C-A G-A T-A
A-C C-C G-C T-C
A-G C-G G-G T-G
A-T C-T G-T T-T
probabilities depend on the evolutionary model settings
site-wise likelihood: sum of the probabilities of all 16 character combinations
total log likelihood: sum of the 1500 site-wise log likelihoods (= logarithm of the product of the site-wise likelihoods)
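The four-taxon example can be written out as a short calculation. The sketch assumes the Jukes-Cantor model, equal base frequencies and one common length t for all five branches, which is a simplification chosen only to keep the example small; real programs optimise each branch length and the model parameters.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69_prob(a, b, t):
    """JC69 probability of observing base b after branch length t, starting from base a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one column on the four-taxon tree ((tip1,tip2),(tip3,tip4)).

    Sums over all 16 combinations of the two ancestral states x and y.
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25 * jc69_prob(x, s1, t) * jc69_prob(x, s2, t)
                  * jc69_prob(x, y, t)
                  * jc69_prob(y, s3, t) * jc69_prob(y, s4, t))
    return total

def log_likelihood(columns, t=0.1):
    """Total log likelihood: the sum of the site-wise log likelihoods."""
    return sum(math.log(site_likelihood(col, t)) for col in columns)

# The first four columns of the example alignment, read top to bottom
print(log_likelihood(["AACG", "TCCG", "TTTT", "AAAG"]))
```

Summing the logarithms instead of multiplying the raw likelihoods avoids numerical underflow, since the product of 1500 small probabilities would be far too small to represent.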
-
Phylogenetic Analysis: Exhaustive Tree Search
All possible trees are calculated according to the chosen optimality criterion.
Theoretically good: Safest method to find the best tree!
Problem: Computationally intensive! Impossible to do with a lot of taxa.
e.g. rooted bifurcating trees:
6 taxa = 945 trees
10 taxa = 34,459,425 trees
15 taxa = 213,458,046,676,875 trees
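These numbers follow from the formula (2n - 3)!! for the number of rooted bifurcating trees with n labelled taxa, i.e. the product of all odd numbers up to 2n - 3. A short sketch (the function name is illustrative):

```python
def rooted_tree_count(n_taxa):
    """Number of rooted bifurcating trees for n_taxa labelled tips: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count

for n in (6, 10, 15):
    print(n, rooted_tree_count(n))
```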
-
Phylogenetic Analysis: Branch-and-Bound Tree Search
Branch-and-bound is a speed-up procedure for exhaustive search. It also considers all possible trees.
The score/distance/likelihood of a randomly generated starting tree is calculated and used as a threshold. All trees that are already worse than this threshold during construction procedure are not finished, but skipped. If a tree turns out to be better, it is used as a new threshold.
Disadvantage: Still too time consuming for larger data sets.
-
Phylogenetic Analysis: Heuristic Tree Search (1)
The trees are considered to form a landscape called the tree space. The best trees are on top of the hills, the worst trees are in the valleys.
Heuristic searches start with a random tree, which may be located in a valley, and try to find the best tree, located at the global maximum of the tree space, by rearranging the branches of the starting tree.
-
Phylogenetic Analysis: Heuristic Tree Search (2)
1 Starts with a randomly generated tree, which may be in a valley.
2 Local rearrangements by exchanging neighbouring branches optimise the tree. The tree may end up in a local optimum only.
3 Global rearrangements help to cross the valley and find another hill.
4 The new tree is again rearranged by small exchanges to climb up the hill to the top. If this is the global optimum, further global rearrangements will not improve the tree.
[Figure: tree space landscape with the positions 1 to 4 marked]
-
Phylogenetic Analysis: Heuristic Tree Search (3)
Tree Rearrangement Methods
Nearest- Neighbour Interchange (NNI) = Adjacent branches are rearranged.
Subtree Pruning and Regrafting (SPR) = A branch with a subtree is removed from a tree and added between two nodes somewhere else in the tree (= one new tree).
Tree Bisection and Reconnection (TBR) = A tree is split into two subtrees and both parts are connected between all possible nodes of the other (= several new trees are considered).
-
Phylogenetic Analysis: Neighbor- Joining (1)
Preferred method to infer trees from distance matrices. Belongs to the clustering methods.
Computation steps:
1) Calculate the net divergence of each taxon from the others, and compute a corrected distance for further use.
2) Start with a star-like tree (belongs to the star-decomposition methods).
3) Join the two taxa with the lowest divergence.
4) Recalculate the distance matrix by treating the joined taxa as one.
5) Repeat steps 3 to 4 until all taxa are joined and the tree is resolved.
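Steps 1 and 3 can be sketched as follows. The corrected distance used here is the commonly used neighbor-joining criterion Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r is the net divergence (row sum) of a taxon; the function name and the small distance matrix are made up for this example.

```python
def nj_pick_pair(dist):
    """Return ((i, j), Q) for the pair that neighbor-joining would join first.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    net = [sum(row) for row in dist]  # net divergence of each taxon
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # corrected distance: pairwise distance adjusted for net divergence
            q = (n - 2) * dist[i][j] - net[i] - net[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q

# Made-up distances among five taxa a, b, c, d, e
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(example))
```

Note that taxa d and e have the smallest raw distance (3), yet a and b are joined first: the correction for net divergence is what distinguishes neighbor-joining from simple clustering of the closest pair.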
-
Phylogenetic Analysis: Neighbor- Joining (2)
first distance matrix
corrected distance matrix
recalculation of distance matrix
-
Phylogenetic Analysis: A Comparison of Methods
Maximum Parsimony
- discrete characters
- implicit evolutionary model
- heuristic tree search
Distance Matrix
- continuous characters
- explicit evolutionary model
- neighbor-joining trees
- fastest method
- large data sets
Maximum Likelihood
- discrete characters
- explicit evolutionary model
- heuristic tree search
- robust method
- computationally intensive
-
Phylogenetic Analysis: Paup Commands
Begin paup;
 set autoclose increase=auto outroot=monophy;
 outgroup 1-4;
 set crit=p;
 hsear addseq=rand nreps=10;
 savetrees file=pars.tre brlens;
 Lset Base=(0.3315 0.2201 0.2334) Nst=1 Rates=gamma Shape=0.9144 Pinvar=0.3911;
 set crit=d;
 dset dist=ml;
 nj;
 savetrees file=nj.tre brlens;
 set crit=l;
 hsear addseq=rand nreps=1;
 savetrees file=ml.tre brlens;
 quit;
End;
-
Phylogenetic Analysis: Bayesian Analysis (1)
Also uses likelihoods in its calculations, but is based on a formula introduced by Reverend Bayes and uses posterior probabilities.
Bayesian analysis starts with a set of a priori expectations about evolutionary model, tree topology and branch lengths. By examining the data (the alignment), the posterior probabilities of the hypotheses given the data are calculated using the Bayes formula.
Since it is impossible to compute the complete joint posterior probability distribution of trees and evolutionary model parameters (a landscape with hills and valleys), samples are drawn using a Metropolis-coupled Markov chain Monte Carlo method.
-
Phylogenetic Analysis: Bayesian Analysis (2)
1) initialization of the Markov chain with a random tree and random evolutionary parameters -> calculation of probability
2) proposal of a new state of the chain with one changed parameter (topology, branch length or evolutionary model) -> calculation of probability
3) If P(Tnew)/P(Told) ≥ 1 -> the new state is accepted; if the ratio is < 1, a random number decides
This corresponds to one generation of the Markov chain. A chain is run over several thousands to millions of generations. Every 100th generation a tree and its parameters are sampled and saved to files. After a while, the chain starts circling around a probability optimum comprising the best trees.
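The acceptance rule in step 3 can be written as a tiny function. This is a simplified sketch of the Metropolis rule only: actual programs work with logarithms of the probabilities and also account for prior and proposal ratios, which are left out here. The function name and the optional rng argument are assumptions made for this example.

```python
import random

def accept_proposal(p_new, p_old, rng=random):
    """Metropolis rule: always accept an improvement, otherwise accept
    with probability p_new / p_old."""
    ratio = p_new / p_old
    if ratio >= 1.0:
        return True
    # the "random number decides": a uniform draw between 0 and 1
    return rng.random() < ratio
```

Because worse states are sometimes accepted, the chain can leave a local optimum, which a strict hill-climbing search cannot do.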
-
Phylogenetic Analysis: Bayesian Analysis (3)
[Figure: cold chain C and heated chains H1, H2, H3]
Since the Markov chain may end up stuck in a local maximum, three so-called heated chains (H1 to H3) are initialized in addition to this cold chain. These chains have lower thresholds, so that they can jump over valleys more easily. From time to time a heated chain exchanges parameters (= Metropolis-coupled) with the cold chain, helping it to find the global optimum.
-
Phylogenetic Analysis: Bayesian Analysis (4)
Maximum Likelihood usually results in one optimal tree (sometimes also two), whereas Bayesian analysis results in a set of optimal trees.
Results of a Bayesian analysis after summarizing of the sampled trees and evolutionary model parameters:
- A tree file listing the trees according to their posterior and cumulative posterior probabilities.
- A list of credibility intervals and mean values for all parameters of the evolutionary model.
- A consensus tree with branch lengths and posterior probabilities indicating support for branches.
-
Phylogenetic Analysis: Bayesian Analysis (5)
Example command block for MrBayes to be attached to the nexus file.
Begin mrbayes;
 set autoclose=yes;
 lset nst=6 rates=invgamma ngammacat=4 covarion=yes;
 mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=SSU;
 mcmc;
 quit;
End;
-
Phylogenetic Analysis: Bayesian Analysis (6)
Summary of analysis data
sump filename=SSU.p burnin=8000;
sumt filename=SSU.t burnin=8000;
All trees and parameters that were sampled prior to reaching the likelihood plateau (i.e. during the burn-in phase before arriving at the global optimum) are excluded from the summaries. The sump command results in a plot showing the likelihood values. If the burn-in was not properly removed, the command has to be repeated with a higher burnin value.
-
Questions?
-
Trees - Support for Branches: Bootstrap Analysis
In bootstrap analysis, single positions are randomly drawn with replacement from the alignment (imagine a lottery) and assembled into a new data set of the same size as the original alignment. As a result, some positions may occur several times, whereas others are excluded.
This lottery is repeated at least 100 times, resulting in at least 100 resampled versions of the original alignment.
A phylogenetic analysis (MP, distance, ML) is done for each resampled data set. The results are summarised in a consensus tree.
In a 50 percent majority rule consensus tree, all branches that occur in at least 50% of the bootstrap trees are displayed, and the bootstrap values are shown on the branches.
bootstrap values > 95% = significantly supported branches
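The resampling step of the bootstrap can be sketched in Python as follows. The sketch only generates the resampled alignments; the tree inference and the consensus step are left to the phylogeny program. The dictionary representation of the alignment, the function names and the seed are assumptions made for this example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Draw columns with replacement to build a pseudo-alignment of the original length."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_reps=100, seed=1):
    """Return n_reps resampled alignments (reproducible for a fixed seed)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_reps)]
```

The same column indices are used for every sequence of a replicate, so each resampled column is still a genuine column of the original alignment.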
-
Trees - Support for Branches: Posterior Probabilities
The consensus tree resulting from a Bayesian analysis is a 50% majority rule consensus tree, inferred from the sampled trees of the Markov chain.
Similar to the support values of a bootstrap consensus, the posterior probabilities express the proportion of the sampled trees in which a branch is found.
Posterior probabilities are usually higher than bootstrap support values and have been the subject of debate.
-
Artefacts
Trees: The Long Branch Attraction Artefact (LBA) (1)
[Figure: correct tree of the taxa A, B, C and D next to the tree resulting from the phylogenetic analysis]
-
Artefacts
Trees: The Long Branch Attraction Artefact (LBA) (2)
Explanation
Long branches indicate a higher rate of mutations.
a) A high rate of mutations results in multiple reversals to the original character state (= homoplasies).
b) In addition, a high mutation rate causes signal noise blurring the information in the sequences.
Consequence: Homoplasies are erroneously interpreted as indicators of relatedness.
Choice of an inappropriate evolutionary model increases the vulnerability to LBA.
-
Trees: Potential Indicators for LBA (1)
Tree is ladderised at the root, i.e. most or all long branches emerge successively close to the root of the tree.
-
Trees: Potential Indicators for LBA (2)
Differing tree topologies depending on whether simple or complex evolutionary models are used.
(Figure: maximum parsimony tree next to maximum likelihood tree (F81+I+Γ).)
-
Work-Flow
aim: group of organisms or gene family
choice of molecular marker(s) and taxon sampling
amplification/sequencing
alignment
choice of evolutionary model
phylogenetic analyses
tree(s)/results
user-defined trees and topology testing
improvement of (feedback arrow to earlier steps)
-
Preventing/Reducing LBA
1.) Improved Taxon Sampling
Break up the long branches by adding related taxa.
more taxa = higher resolution by adding information to variable positions
2.) Improved Choice of Markers
Use several genes with differing evolutionary rates and concatenate them. This may break the influence of the long branches (also good: use genes from different genomes, e.g. nuclear, mitochondrial, plastid).
more positions = higher resolution by extending the data matrix
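Concatenating several genes amounts to joining the single-gene alignments side by side and remembering where each gene starts and ends. The following sketch assumes that each alignment is a dictionary of equally long strings; the gene names, the toy sequences and the padding of missing taxa with "?" are choices made for this example only.

```python
def concatenate(genes):
    """Concatenate single-gene alignments into one data matrix.

    genes: list of (name, alignment) pairs; alignment maps taxon -> sequence.
    Taxa missing from a gene are padded with '?' (missing data).
    Returns (supermatrix, charsets); charsets holds 1-based inclusive ranges.
    """
    taxa = sorted({taxon for _, aln in genes for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    charsets = []
    start = 1
    for name, aln in genes:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, "?" * length)
        charsets.append((name, start, start + length - 1))
        start += length
    return supermatrix, charsets

# Two toy genes (hypothetical names and data).
gene1 = {"A": "ACGT", "B": "ACGA"}
gene2 = {"A": "GG", "B": "GC", "C": "GT"}
matrix, charsets = concatenate([("gene1", gene1), ("gene2", gene2)])

for taxon, seq in matrix.items():
    print(taxon, seq)
for name, first, last in charsets:
    print(f"charset {name}={first}-{last};")
```

The recorded ranges are what a partitioned analysis needs later in order to assign a separate evolutionary model to each gene.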
-
How to Handle Concatenated Data? (1)
Choice of Evolutionary Model
Different genes will most likely need different evolutionary models.
Neither Paup nor Phylip allows for a partitioning of data.
The more genes are included and the more divergent the data are in terms of evolutionary rates, the more likely Modeltest will propose the most complex evolutionary model, GTR+I+Γ.
-
How to Handle Concatenated Data? (2)
Choice of Evolutionary Model: MrBayes allows for a partitioning of data
Begin mrbayes;
    set autoclose=yes;
    log start filename=sum.log;
    charset NM=1-1564;
    charset ITS2=1565-1850;
    charset LSU=1851-2733;
    partition concP2=3:NM,ITS2,LSU;
    set partition=concP2;
    lset applyto=(1) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    lset applyto=(2) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    lset applyto=(3) nst=6 nucmodel=4by4 rates=invgamma ngammacat=4 covarion=yes;
    unlink statefreq=(all);
    unlink shape=(all);
    unlink revmat=(all);
    unlink switchrates=(all);
    prset ratepr=variable;
    mcmcp ngen=3500000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes filename=conc;
    mcmc;
    quit;
End;
-
How to Handle Concatenated Data? (3)
The Likelihood Summation Method
- Run a phylogenetic analysis with each single-gene dataset and with the concatenated dataset.
- Let Paup save the 1000 best trees resulting from each analysis.
- Concatenate the tree files.
- Calculate the likelihood scores for each of the trees with the lscores command in Paup, but use only the single-gene alignments as data matrices.
- Load the scorefiles for each dataset into a spreadsheet program and calculate the sum of the log likelihoods for each tree.
- Sort the trees according to their summed log likelihood.
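The last two steps (summing and sorting) can also be done with a short script instead of a spreadsheet. The sketch below assumes that the log likelihoods have already been extracted from the scorefiles into one list per gene, with the trees in the same order in every list; reading the scorefiles themselves is not shown. The gene names and numbers are invented.

```python
def sum_log_likelihoods(scores_by_gene):
    """Sum the per-gene log likelihoods of each tree and rank the trees.

    scores_by_gene: dict mapping gene name -> list of lnL values,
    one per tree, with the same tree order in every list.
    Returns a list of (tree number, summed lnL), best tree first.
    """
    lengths = {len(scores) for scores in scores_by_gene.values()}
    if len(lengths) != 1:
        raise ValueError("every gene needs a score for every tree")
    n_trees = lengths.pop()
    totals = [sum(scores[i] for scores in scores_by_gene.values())
              for i in range(n_trees)]
    # Tree numbers are 1-based, as in a treefile; highest lnL first.
    return sorted(enumerate(totals, start=1), key=lambda item: item[1], reverse=True)

# Hypothetical lnL values of three trees for two genes.
scores = {
    "gene1": [-1000.0, -1005.0, -1002.0],
    "gene2": [-500.0, -490.0, -495.0],
}
for tree, lnl in sum_log_likelihoods(scores):
    print(f"tree {tree}: summed lnL = {lnl:.1f}")
```

In this toy case tree 2 is not the best tree for gene1, but it has the highest summed log likelihood over both genes.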
-
Work-Flow
aim: group of organisms or gene family
choice of molecular marker(s) and taxon sampling
amplification/sequencing
alignment
choice of evolutionary model
phylogenetic analyses
tree(s)/results
user-defined trees and topology testing
improvement of (feedback arrow to earlier steps)
-
Testing Tree Topology (1)
Kishino-Hasegawa and Shimodaira-Hasegawa tests are used to test hypothetical user-defined trees by comparison with the optimal tree.
Problem: The tests were designed to compare trees selected independently of the data (e.g. random trees), but biologists use them to compare trees with the optimal tree.
-
Testing Tree Topology (2): Consel
Consel
- written by H. Shimodaira
- needs input files with site-wise log likelihoods
- accepts Paup, Molphy and PAML scorefiles
- consists of a suite of programs for different tasks
- C source code; Unix command line or DOS console
- performs:
    Approximately unbiased test
    Kishino-Hasegawa test (unweighted/weighted)
    Shimodaira-Hasegawa test (unweighted/weighted)
    bootstrap probabilities
    posterior probabilities
-
Testing Tree Topology (3): Consel and Paup
Using Consel in Combination with Paup
1) Construct constraints to test a hypothesis.
2) Infer the optimal constraint trees with Paup (ML).
3) Concatenate all treefiles that are to be subjected to the test.
4) Let Paup calculate the total log likelihood and the site-wise log likelihoods and save the data to a scorefile.
5) Use a text editor to delete superfluous data (all parameters of the evolutionary model).
6) Feed Consel with the scorefile.
-
Testing Tree Topology (4): Consel Commands
1) Generate bootstrap subsamples
makermt --paup scorefile.txt
2) Perform the test
consel scorefile
3) Generate an output file with the test results
catpv scorefile
-
Testing Tree Topology (5): How Does Consel Work?
makermt
Generates multiscale bootstrap subsamples from the site-wise log likelihoods in the scorefile.
Bootstrap samples from site-wise log likelihoods = the RELL method
Multiscale bootstrap = sizes of subsamples differ from the original data set
By default, makermt generates 10 sets of replicates, each with 10,000 subsamples (0.5-, 0.6-, 0.7-, 0.8-, 0.9-, 1.1-, 1.2-, 1.3-, 1.4- and 1.5-fold the size of the original data set), and an additional set with 10,000 subsamples of 1.0-fold size.
consel
Calculates the probabilities for KHT, SHT and normal bootstrap using the 10,000 subsamples of 1.0-fold size, and the probabilities for the AUT and the multiscale bootstrap using the multiscale bootstrap samples.
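The RELL idea can be illustrated with a small sketch: instead of re-estimating trees, the site-wise log likelihoods are resampled and summed, and one counts how often each tree comes out best. This is not Consel's code; the function name, its parameters and the toy values are assumptions made for the example, and only plain bootstrap proportions are computed (the AU p-value, which Consel derives from how these proportions change with the subsample size, is not attempted here).

```python
import random

def rell_bootstrap(site_lnl, n_replicates=1000, scale=1.0, seed=0):
    """Bootstrap proportions of trees by the RELL method.

    site_lnl[t][s] is the log likelihood of site s under tree t.
    scale is the subsample size relative to the original number of sites
    (1.0 = normal bootstrap; other values give one multiscale set).
    Returns, for each tree, the fraction of replicates in which it has
    the highest resampled log likelihood.
    """
    rng = random.Random(seed)
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])
    n_draw = max(1, round(scale * n_sites))
    wins = [0] * n_trees
    for _ in range(n_replicates):
        # Resample site indices with replacement; use the same sites for all trees.
        sites = [rng.randrange(n_sites) for _ in range(n_draw)]
        totals = [sum(tree[s] for s in sites) for tree in site_lnl]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / n_replicates for w in wins]

# Hypothetical site-wise log likelihoods of two trees over eight sites.
toy = [
    [-1.2, -0.8, -2.0, -1.1, -0.9, -1.5, -1.0, -1.3],
    [-1.0, -1.1, -1.7, -1.4, -1.2, -1.3, -1.2, -1.1],
]
for scale in (0.5, 1.0, 1.5):
    print(scale, rell_bootstrap(toy, scale=scale))
```

Because only sums of precomputed site likelihoods are needed, thousands of replicates are cheap, which is what makes the large default replicate numbers of makermt feasible.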
-
Testing Tree Topology (6): Consel Output
catpv summarises the results of the test (the probability values)
# reading nm.pv
# rank item    obs     au     np |     bp     pp     kh     sh    wkh    wsh |
#    1    1  -15.8  0.957  0.942 |  0.944  1.000  0.936  0.994  0.936  0.999 |
#    2    6   15.8  0.064  0.055 |  0.054  1e-07  0.064  0.415  0.064  0.218 |
#    3    5   35.2  0.005  0.002 |  0.002  5e-16  0.006  0.079  0.006  0.027 |
#    4    4   36.3  0.002  0.001 |  4e-04  2e-16  0.005  0.063  0.005  0.019 |
#    5    7   56.8  7e-05  1e-04 |  2e-04  2e-25  3e-04  0.012  3e-04  0.001 |
#    6    2  129.4  2e-64  5e-21 |      0  6e-57      0      0      0      0 |
#    7    3  142.2  4e-06  5e-06 |      0  2e-62      0      0      0      0 |
rank = tree ranking
item = no. of tree in treefile
obs = log likelihood difference
au = p-values of the approximately unbiased test
np = p-values of the multiscale bootstrap
bp = p-values of the normal bootstrap
pp = posterior probabilities
kh = Kishino-Hasegawa test
sh = Shimodaira-Hasegawa test
wkh, wsh = weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests
-
Phylogeny With Protein Sequences (1)
Due to the degeneracy of the genetic code, codons may be biased! This bias may apply not only to the third position, but also to the first and second.
Presumably it would be better to use protein sequences instead, but:
Protein alignments have 20 character states instead of 4 = analyses, especially ML analyses, take much longer!
Using nucleotide data, but considering nonsynonymous/synonymous substitutions and/or translating during the analysis -> this is also quite time-consuming!
-
Phylogeny With Protein Sequences (2)
Maximum likelihood analyses of protein sequences are usually based on substitution matrices derived from empirical data instead of estimating the substitution rate matrix from the data set.
e.g.
Dayhoff (Dayhoff et al. 1978)
JTT (Jones, Taylor, Thornton 1992)
WAG (Whelan, Goldman 2001)
-
Phylogeny With Protein Sequences (3)
Programs for phylogenetic analyses of protein sequences
Paup: only limited possibilities, no substitution matrices included; no maximum likelihood
Phylip (text-based menu): phylip format; maximum likelihood; PAM, JTT, PMB
Tree-Puzzle (text-based menu): phylip format; maximum likelihood with quartet puzzling; Dayhoff, JTT, WAG, VT etc.
PAML (Unixoids, DOS console): phylip format; maximum likelihood; Dayhoff, JTT, WAG etc.
(Molphy [Unixoids]: phylip format; maximum likelihood; Dayhoff, JTT)
MrBayes: Bayesian analysis
-
Phylogeny With Protein Sequences (4)
One possibility:
Calculate a tree and gamma categories using Tree- Puzzle 5.2 (Schmidt, Strimmer and von Haeseler 2004).
Use the gamma category estimates as settings to perform a maximum likelihood analysis with proml from the Phylip 3.62 package (Felsenstein 2004).
-
Phylogeny With Protein Sequences (5)
Text based menu of Tree- Puzzle 5.2
GENERAL OPTIONS
 b   Type of analysis?                  Tree reconstruction
 k   Tree search procedure?             Quartet puzzling
 v   Approximate quartet likelihood?    Yes
 u   List unresolved quartets?          No
 n   Number of puzzling steps?          1000
 j   List puzzling step trees?          No
 o   Display as outgroup?               Chondrus (1)
 z   Compute clocklike branch lengths?  No
 e   Parameter estimates?               Approximate (faster)
 x   Parameter estimation uses?         Neighbor-joining tree
SUBSTITUTION PROCESS
 d   Type of sequence input data?       Auto: Amino acids
 m   Model of substitution?             Auto: JTT (Jones et al. 1992)
 f   Amino acid frequencies?            Estimate from data set
RATE HETEROGENEITY
 w   Model of rate heterogeneity?       Uniform rate
Quit [q], confirm [y], or change [menu] settings:
-
Phylogeny With Protein Sequences (6)
Phylip 3.62: suite of 30 programs
e.g. ML bootstrapping:
a) start seqboot = generate bootstrap samples from the data set
b) run dnaml or proml (depending on the data set)
c) run consense to create a consensus tree
d) the consensus treefile may be loaded into a tree displaying program
-
Miscellaneous
Molecular Clock
Assumes that sequences evolve at equal evolutionary rates. This is usually not the case and puts the analysis under a constraint. The molecular clock hypothesis may be tested with a likelihood ratio test, similar to the comparison of evolutionary models in Modeltest.
If fossils are available, one may try dating the divergences of lineages.
Secondary Structure Analyses
may add information to the results (predominantly RNA-coding, ITS or intron regions)
Synapomorphy Analyses
Searching for synapomorphic characters or strings of characters may be useful for systematic purposes
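The likelihood ratio test of the molecular clock can be sketched as follows. The sketch relies on the usual approximation that twice the log likelihood difference between the unconstrained and the clock-constrained tree follows a chi-square distribution with n - 2 degrees of freedom for n taxa. The function names and the example numbers are invented, and the chi-square tail probability is computed with the standard library only.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for integer df."""
    y = x / 2.0
    # Start from the closed forms for df = 1 (odd) or df = 2 (even) ...
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # ... and raise the shape parameter in steps of one using
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def clock_lrt(lnl_clock, lnl_no_clock, n_taxa):
    """Likelihood ratio test: clock-constrained versus unconstrained tree."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf(max(statistic, 0.0), df)

# Hypothetical log likelihoods for a data set with 10 taxa.
statistic, df, p = clock_lrt(lnl_clock=-5010.0, lnl_no_clock=-5000.0, n_taxa=10)
print(f"2*delta lnL = {statistic:.1f}, df = {df}, p = {p:.4f}")
```

A small p-value means that the clock-constrained tree fits significantly worse, i.e. the molecular clock hypothesis is rejected for the data set.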
-
Questions?
-
That's it!