![Page 1: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/1.jpg)
Phylogenetics 101
Eddie Holmes
Center for Infectious Disease Dynamics, Department of Biology, The Pennsylvania State University
Fogarty International Center, National Institutes of Health
![Page 2: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/2.jpg)
Modern Phylogenetics
![Page 3: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/3.jpg)
Useful Textbooks & Software
Books:
• Page RDM & Holmes EC. (1998). Molecular Evolution: A Phylogenetic Approach. Blackwell Science Ltd, Oxford.
• Lemey P, Salemi M & Vandamme A-M. (2009). The Phylogenetic Handbook, 2nd Edition. Cambridge University Press.
Computer Software:
• BEAST (Bayesian Evolutionary Analysis Sampling Trees) - http://beast.bio.ed.ac.uk/
• MEGA (Molecular Evolutionary Genetics Analysis) - http://megasoftware.net/
• MrBayes (Bayesian inference of phylogeny) - http://mrbayes.csit.fsu.edu/
• PhyML (Maximum likelihood phylogenetics) - http://www.atgc-montpellier.fr/phyml/
• HyPhy/DATAMONKEY (Selection, recombination & hypothesis testing) - http://datamonkey.org/
• RDP3 (Recombination detection program) - darwin.uvigo.es/rdp/rdp.html
• PAUP* (Phylogenetic Analysis Using Parsimony *and other methods) - http://paup.csit.fsu.edu/
![Page 4: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/4.jpg)
Topics in Evolutionary Inference
• Estimating genetic distances between sequences
• Inferring phylogenetic trees
• Detecting recombination events
• The inference of selection pressures (particularly detecting positive selection)
• Estimating rates of evolutionary change
• Inferring demographic history (population dynamics)
• Phylogeography
![Page 5: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/5.jpg)
The Quasispecies
Useful summary reference:
• Bull JJ, Meyers LA & Lachmann M. (2005). Quasispecies made simple. PLoS Comput. Biol. 1:e61.
![Page 6: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/6.jpg)
The Quasispecies
• Idea introduced by Manfred Eigen as a mathematical model of early life forms (RNA replicators) based upon chemical kinetics and first used in virology by Esteban Domingo in the 1970s. Now the dominant model in RNA virus evolution.
• A distribution of variant genomes ordered around the fittest sequence (often called the ‘master sequence’) and produced by a combination of mutation and selection (“mutation-selection” balance). Only functions at high mutation rates.
• Only considers intra-host evolution.
![Page 7: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/7.jpg)
The Quasispecies
• The frequency of any variant in the quasispecies is a function of its own replication rate and the probability that it is produced by the erroneous replication of other variants in the population.
• Viral genomes are not independent entities due to mutational coupling (i.e. variants are linked in mutational space). The entire mutant distribution forms an organised structure which acts like (quasi) a single unit (species).
• Natural selection acts on the mutant distribution as a whole, not on individual variants, and the quasispecies evolves to maximise its average replication fitness. So, it is a form of group selection.
• Important implication: low fitness variants can out-compete high fitness variants if they are surrounded by beneficial mutational neighbours (“survival of the flattest”).
![Page 8: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/8.jpg)
The Quasispecies
‘survival of the fittest’
‘survival of the flattest’
![Page 9: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/9.jpg)
“Survival of the Flattest”
![Page 10: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/10.jpg)
Experimental Verification of Quasispecies Dynamics
Population A = red (high replication rate)
Population B = blue (low replication rate)
• Sanjuán R, Cuevas JM, Furió V, Holmes EC & Moya A. (2007). Selection for robustness in mutagenized RNA viruses. PLoS Genet. 3: e93.
• However, this only occurs at artificially elevated mutation rates.
![Page 11: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/11.jpg)
Do RNA Viruses Form Quasispecies?
• Most people simply use “quasispecies” as a synonym for genetic diversity. However, genetic diversity is not the same as the quasispecies!
• The quasispecies works in theory, in “digital organisms” and perhaps in some laboratory populations where mutation rates are increased artificially (i.e. when RNA viruses are about to breach the “error threshold”). However, quasispecies do not occur in laboratory populations with ‘normal’ error rates.
• No good evidence as yet that RNA viruses in nature form quasispecies:
- no evidence that selection acts on the whole population
- mutation rates are too low
• The mutation rate required for the survival of the flattest (> 2 per genome replication) is higher than that seen in nature (< 1 per genome replication). Such rates only occur during the treatment of viral infections with mutagens (“lethal mutagenesis”).
![Page 12: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/12.jpg)
Shameless Self-Publicity
• Amazon.com Sales Rank: # 666,415 in Books
![Page 13: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/13.jpg)
Estimating Genetic Distances Between Sequences
![Page 14: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/14.jpg)
Estimating Genetic Distance

SIVcpz ATGGGTGCGA GAGCGTCAGT TCTAACAGGG GGAAAATTAG ATCGCTGGGA
HIV-1  ATGGGTGCGA GAGCGTCAGT ATTAAGCGGG GGAGAATTAG ATCGATGGGA

SIVcpz AAAAGTTCGG CTTAGGCCCG GGGGAAGAAA AAGATATATG ATGAAACATT
HIV-1  AAAAATTCGG TTAAGGCCAG GGGGAAAGAA AAAATATAAA TTAAAACATA

SIVcpz TAGTATGGGC AAGCAGGGAG CTGGAAAGAT TCGCATGTGA CCCCGGGCTA
HIV-1  TAGTATGGGC AAGCAGGGAG CTAGAACGAT TCGCAGTTAA TCCTGGCCTG

SIVcpz ATGGAAAGTA AGGAAGGATG TACTAAATTG TTACAACAAT TAGAGCCAGC
HIV-1  TTAGAAACAT CAGAAGGCTG TAGACAAATA CTGGGACAGC TACAACCATC

SIVcpz TCTCAAAACA GGCTCAGAAG GACTGCGGTC CTTGTTTAAC ACTCTGGCAG
HIV-1  CCTTCAGACA GGATCAGAAG AACTTAGATC ATTATATAAT ACAGTAGCAA

SIVcpz TACTGTGGTG CATACATAGT GACATCACTG TAGAAGACAC ACAGAAAGCT
HIV-1  CCCTCTATTG TGTGCATCAA AGGATAGAGA TAAAAGACAC CAAGGAAGCT

SIVcpz CTAGAACAGC TAAAGCGGCA TCATGGAGAA CAACAGAGCA AAACTGAAAG
HIV-1  TTAGACAAGA TAGAG--GAA -----GAGCA AAACAAAAGT AA---GAAAA

SIVcpz TAACTCAGGA AGCCGTGAAG GGGGAGCCAG TCAAGGCGCT AGTGCCTCTG
HIV-1  AAGCACAGCA AGC-----AG CAGCTGACA- -CAGGACAC- AG--CAGC--

SIVcpz CTGGCATTAG TGGAAATTAC
HIV-1  CAGG--TCAG CCAAAATTAC
![Page 15: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/15.jpg)
Multiple Substitutions at a Single Site - Hidden Information
[Figure: two examples of hidden substitutions at a single site (Example 1 and Example 2). Captions: only 1 mutation is counted when 2 have occurred; 0 mutations are counted when 3 have occurred.]
![Page 16: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/16.jpg)
The Problem of Multiple Substitution
• When % divergence is low, observed distance (p) is a good estimator of genetic distance (d)
• When % divergence is high, p underestimates d and a “correction statistic” is required, i.e. a model of DNA substitution
[Figure: % divergence (y-axis, ticks at 25, 50 and 75) against time (x-axis), with curves for actual and observed divergence and a label “Hidden information”.]
![Page 17: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/17.jpg)
Models of DNA Substitution
• Models of DNA sequence evolution are required to recover the missing information by correcting for multiple substitutions. They describe:
i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)
![Page 18: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/18.jpg)
Models of DNA Substitution 1 (Jukes-Cantor, 1969)
• Assumptions:
i. All bases evolve independently
ii. All bases are at equal frequency
iii. Each base can change with equal probability (α)
iv. Mutations arise according to a Poisson distribution (rare and independent events)
• From this the number of substitutions per site (d) can be estimated by:
$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}P\right)$$
where P is the proportion of observed nucleotide differences between 2 sequences.
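As a rough illustration of the Jukes-Cantor correction, the sketch below computes the observed proportion of differences and then the corrected distance. The function names and the decision to skip gapped sites are assumptions made for this example, not part of the original slides; the two sequences are the first 30 aligned sites of the SIVcpz/HIV-1 alignment shown earlier.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping alignment gaps ('-')."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor_distance(p):
    """JC69 corrected distance d = -3/4 ln(1 - 4/3 p)."""
    if p >= 0.75:
        # The logarithm is undefined once the sequences look saturated.
        raise ValueError("p >= 0.75: distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# First 30 aligned sites of SIVcpz and HIV-1 from the alignment above.
seq1 = "ATGGGTGCGAGAGCGTCAGTTCTAACAGGG"
seq2 = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGG"

p = p_distance(seq1, seq2)
d = jukes_cantor_distance(p)
print(f"p = {p:.3f}, JC69 d = {d:.3f}")
```

The corrected distance is always at least as large as p, and the gap widens as p grows, which is the "hidden information" the correction is meant to recover.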
![Page 19: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/19.jpg)
[Figure: the four bases A, T, C and G with all six substitution arrows labelled α.]
All substitutions occur at the same rate (α)
Is this model too simple for real data?
![Page 20: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/20.jpg)
[Figure: the four bases A, T, C and G with two substitution arrows labelled α and four labelled β.]
Transitions (α) and transversions (β) occur at different rates
![Page 21: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/21.jpg)
Models of DNA Substitution 2 (Kimura 2-parameter, 1980)
• Assumptions:
i. All bases evolve independently
ii. All bases are at equal frequency
iii. Transitions and transversions occur with different probabilities (α and β)
iv. The Jukes-Cantor model is applied to transitions and transversions independently
• From this the expected number of substitutions per site (d) can be estimated by:
$$d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$$
where P is the proportion of observed transitions and Q the proportion of observed transversions between 2 sequences.
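The Kimura 2-parameter distance can be sketched in the same way: count transitions and transversions separately, then apply the formula above. The helper names are hypothetical, and sites containing anything other than A, C, G or T are simply skipped, which is an assumption of this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of sites showing a transition or a transversion."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    transitions = sum(
        1 for a, b in pairs
        if a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)
    )
    transversions = sum(1 for a, b in pairs if a != b) - transitions
    return transitions / len(pairs), transversions / len(pairs)

def kimura_2p_distance(P, Q):
    """K2P distance d = -1/2 ln[(1 - 2P - Q) * sqrt(1 - 2Q)]."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("distance is undefined for these proportions")
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

# First 30 aligned sites of SIVcpz and HIV-1 from the alignment above.
seq1 = "ATGGGTGCGAGAGCGTCAGTTCTAACAGGG"
seq2 = "ATGGGTGCGAGAGCGTCAGTATTAAGCGGG"

P, Q = transition_transversion_proportions(seq1, seq2)
print(f"P = {P:.3f}, Q = {Q:.3f}, K2P d = {kimura_2p_distance(P, Q):.3f}")
```

When transitions make up exactly one third of the observed differences (P = Q/2), the expression reduces to the Jukes-Cantor distance, which is a handy sanity check.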
![Page 22: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/22.jpg)
Models of DNA Substitution
From simplest (few parameters) to most complex (many parameters):
1. Base frequencies are equal and all substitutions are equally likely (Jukes-Cantor)
2. Base frequencies are equal but transitions and transversions occur at different rates (Kimura 2-parameter)
3. Unequal base frequencies, and transitions and transversions occur at different rates (Hasegawa-Kishino-Yano)
4. Unequal base frequencies and all substitution types occur at different rates (General Reversible Model)
All these models can be tested using the program jMODELTEST (darwin.uvigo.es/software/jmodeltest.html)
![Page 23: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/23.jpg)
Models of DNA Substitution
i. The probability of substitution between bases (e.g. A to C, C to T…)
ii. The probability of substitution along a sequence (different sites/regions evolve at different rates)
![Page 24: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/24.jpg)
A Gamma Distribution Can be Used to Model Among-Site Rate Heterogeneity
[Figure: gamma distributions of site rates, labelled “Little among-site rate variation” and “Frequent among-site rate variation”.]
![Page 25: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/25.jpg)
Estimates of the Shape Parameter (α) of Among-Site Rate Variation

| Gene | α |
|---|---|
| Prolactin | 1.37 |
| Albumin | 1.05 |
| C-myc | 0.47 |
| Cytochrome b (mtDNA) | 0.44 |
| Insulin | 0.40 |
| D-loop (mtDNA) | 0.17 |
| 12S rRNA (mtDNA) | 0.16 |

• Viruses are usually characterized by extensive among-site rate variation (α < 1).
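To get a feel for what the shape parameter means, the sketch below draws relative site rates from a gamma distribution with mean 1 (shape α, scale 1/α, a common parameterisation that is assumed here) and reports how many sites are nearly invariant. The function name, the 0.1 cut-off for a "slow" site and the sample size are all choices made for this illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_slow(rates, cutoff=0.1):
    """Fraction of sites evolving at less than `cutoff` times the mean rate."""
    return sum(r < cutoff for r in rates) / len(rates)

# Shape values taken from the table above (12S rRNA, C-myc, Prolactin).
for alpha in (0.16, 0.47, 1.37):
    rates = sample_site_rates(alpha, 20000)
    mean_rate = sum(rates) / len(rates)
    print(f"alpha = {alpha}: mean rate = {mean_rate:.2f}, "
          f"slow sites = {fraction_slow(rates):.2f}")
```

With small α most sites change very slowly while a few change very fast; with larger α the rates cluster around the mean.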
![Page 26: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/26.jpg)
Estimating Genetic Distance: SIVcpz vs HIVlai
• Uncorrected (p-distance) = 0.406
• Jukes-Cantor = 0.586
• Kimura 2-parameter = 0.602
• Hasegawa-Kishino-Yano = 0.611
• General reversible = 0.620
• General reversible + gamma = 1.017
![Page 27: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/27.jpg)
Other Models
• Allowing a different rate of nucleotide substitution for each codon position in a coding sequence (SRD06; tends to work better than gamma distributions in RNA viruses)
• Allowing different sets of nucleotides to change along different lineages (“covarion” model)
e.g. sites that are variable in bacteria might be conserved in eukaryotes
• Accounting for the non-independence of nucleotides (caused by protein and RNA secondary structures)
![Page 28: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/28.jpg)
Inferring Phylogenetic Trees
![Page 29: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/29.jpg)
Important Problems in Molecular Phylogenetic Analysis
• Is there a tree at all (e.g. recombination)?
• Many possible trees:
- For 10 taxa there are $2 \times 10^{6}$ unrooted trees
- For 50 taxa there are $3 \times 10^{74}$ unrooted trees
- hence the need for efficient and powerful search algorithms
• Choosing the right model of nucleotide substitution
• Rate variation among lineages (causes “long branch attraction”). Need a representative sample of taxa.
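The tree counts quoted above follow from the standard result that n taxa can be joined in (2n - 5)!! = 3 × 5 × … × (2n - 5) distinct unrooted bifurcating trees. A minimal sketch (the function name is just for illustration):

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n taxa: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 10, 50):
    print(f"{n} taxa: {count_unrooted_trees(n):.3e} unrooted trees")
```

The count grows faster than exponentially with the number of taxa, which is why exhaustive evaluation of every tree quickly becomes impractical.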
![Page 30: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/30.jpg)
Why Having a Representative Sample of Taxa is Important
[Figure: long branch attraction. In a small tree the long branches are drawn together (convergent sites pull branches together); in a large tree the long branches are far apart (convergent sites are distributed across the tree). The legend marks convergent sites and informative sites.]
![Page 31: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/31.jpg)
Tree-Building Methods
• No explicit model of sequence evolution:
- Parsimony: application of the parsimony principle
• Explicit model of sequence evolution:
- Distance: pairwise comparison of sequences
- Maximum likelihood and Bayesian: statistical approach
![Page 32: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/32.jpg)
Methods for Inferring Phylogenetic Trees
• Parsimony (PAUP*): Find the tree with the minimum number of mutations between sequences (i.e. choose the tree with the least convergent evolution)
• Neighbor-Joining (PAUP*, MEGA): Estimate genetic distances between sequences and cluster these distances into a tree that minimises genetic distance over the whole tree
• Maximum Likelihood (PAUP*, GARLI, PhyML, RAxML, MEGA): Find the tree (and branch lengths) that maximises the probability of the observed sequence data, given a particular model of molecular evolution
• Bayesian (BEAST, MrBayes): Similar to likelihood but where there is information about the prior distribution of parameters. Also returns a (posterior) distribution of trees
![Page 33: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/33.jpg)
Distance Methods
๏ Advantages:
- Allows the use of an explicit model of evolution
- Very fast
- Simple
๏ Disadvantages:
- Only produces one tree with no indication of its quality
- Reduces all sequence information into a single distance value
- Dependent on the evolutionary model used (preferably this model should be estimated from the data)
![Page 34: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/34.jpg)
Optimality Methods
๏ Parsimony
- Fast
- Not statistically consistent with most models of evolution
- “The” method for morphological data
๏ Maximum Likelihood
- Requires explicit statement of evolutionary model
- Slow
- Statistically consistent
- Most commonly used with molecular data
![Page 35: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/35.jpg)
Maximum Likelihood in Phylogenetics
• Best described by Joe Felsenstein
‣ Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368-376
• Now considered the most statistically valid approach to molecular phylogenetics along with the closely related Bayesian methods
• Allows us to incorporate extremely detailed models of molecular evolution
![Page 36: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/36.jpg)
Likelihood
• Likelihood is a quantity proportional to the probability of observing an outcome/data/event X given a hypothesis H:
$$P(X \mid H) \quad \text{or} \quad P(X \mid p)$$
• We would then talk about the likelihood
$$L(p \mid X)$$
that is, the likelihood of the parameters given the data.
• In this case the hypothesis is a tree + branch lengths and the data are the sequences
• Assumes an explicit model of molecular evolution, such as those described previously
[Photo: R.A. Fisher (looking suitably grumpy)]
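A very small worked case may help here. In the sketch below the "hypothesis" is just a single branch length d separating two sequences, the model is Jukes-Cantor, and the data are summarised as the number of aligned sites and the number of differences. The numbers (100 sites, 20 differences) and the grid search are assumptions of this illustration; real programs optimise over whole trees with much better numerical methods.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for two sequences under the JC69 model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # one specific identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * e)  # one specific mismatched pair, e.g. A/C
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

n_sites, n_diff = 100, 20  # hypothetical data

# Evaluate the likelihood on a grid of candidate distances and keep the best one.
grid = [i / 1000 for i in range(1, 2001)]
d_ml = max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Closed-form Jukes-Cantor distance for comparison.
d_formula = -0.75 * math.log(1 - 4 / 3 * n_diff / n_sites)
print(f"grid ML estimate = {d_ml:.3f}, JC69 formula = {d_formula:.3f}")
```

The value of d that maximises the likelihood agrees with the Jukes-Cantor formula, which shows that the distance correction is itself a maximum likelihood estimate under that model.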
![Page 37: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/37.jpg)
Bayesian Phylogenetics
๏ Using Bayesian statistics, you search for a set of plausible trees instead of a single best tree
๏ In this method, the “space” that you search in is limited by prior information
๏ The posterior distribution of trees can be translated to a probability of any branching event
- Allows estimate of uncertainty!
- BUT incorporates prior beliefs
Andrew Rambaut will explain in more detail…
![Page 38: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/38.jpg)
Searching Through ‘Tree Space’
![Page 39: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/39.jpg)
Searching Through Tree Space
๏ There are two ways in which we can search through tree space to find the best tree for our data:
– Branch-and-bound: finds the optimal tree by implicitly checking all possible trees (cutting off paths in the search tree that cannot possibly lead to optimal trees)
– Heuristic: searches by randomly perturbing the tree, does not check all trees and cannot guarantee to find the optimal one(s). Most commonly used.
(exhaustive searching is only possible for very small data sets)
![Page 40: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/40.jpg)
Heuristic searching
[Figure: likelihood (y-axis) plotted across trees (x-axis), showing the global maximum likelihood tree, a local optimum, and two different starting trees of the heuristic search.]
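The local-optimum problem can be mimicked with a toy hill-climb. This is only an analogy: the "trees" here are positions in a list with made-up scores, and a move goes to the neighbouring position, whereas real heuristic searches rearrange tree topologies (for example by branch swapping). All names and values are hypothetical.

```python
def hill_climb(scores, start):
    """Step to the better neighbour until no neighbour improves the score."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[current]:
            return current  # no neighbour is better: an optimum (maybe only local)
        current = best

# Hypothetical likelihood scores for a row of neighbouring "trees".
landscape = [1, 3, 5, 4, 2, 3, 6, 9, 7]

print(hill_climb(landscape, start=0))  # stops on the local optimum at index 2
print(hill_climb(landscape, start=5))  # reaches the global optimum at index 7
```

The outcome depends on the starting point, which is one reason heuristic searches are often repeated from several different starting trees.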
![Page 41: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/41.jpg)
Bootstrapping (How Robust is a Tree?)
Non-Parametric Bootstrap
• Statistical technique that uses random resampling of data to determine sampling error.
• Characters are resampled with replacement to create many replicate data sets. A tree is then inferred from each replicate.
• Agreement among the resulting trees is summarized with a consensus tree. The frequencies of occurrence of groups, bootstrap proportions, are a measure of support for those groups.
Parametric Bootstrap (Monte Carlo simulation)
• Compare the likelihoods of competing trees on the data.
• Simulate replicate sequences using the parameters (including the tree) obtained for the worse tree (null hypothesis).
• Compare the likelihoods of the trees for each replicate data set as before to create a null distribution.
![Page 42: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/42.jpg)
Non-Parametric Bootstrapping
[Figure: the sites (columns 1-6) of an alignment are resampled with replacement multiple times to give replicate data sets 1 to 1000.]
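The resampling step can be sketched as follows: alignment columns are drawn with replacement until the replicate has as many sites as the original. The toy alignment and taxon names are invented for this example, and the tree-building step that would follow each replicate is deliberately left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; the replicate keeps the same length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

# Hypothetical six-site alignment of four taxa.
alignment = {
    "taxon1": "ACGTAC",
    "taxon2": "ACGTTC",
    "taxon3": "ATGTTG",
    "taxon4": "ATGAAG",
}

rng = random.Random(42)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(1000)]
print(replicates[0])
```

In a full analysis a tree would be inferred from each of the 1000 replicates, and the proportion of replicate trees containing a given group would be reported as its bootstrap support.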
![Page 43: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/43.jpg)
Detecting Recombination
![Page 44: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/44.jpg)
Recombination & Reassortment
• The Problems:
- Generates new genetic configurations
- Complicates our attempts to infer phylogenetic history and other evolutionary processes (e.g. positive selection)
• The Solutions:
- Find recombinants and remove them from the data set (usual plan)
- Incorporate recombinants into an explicit evolutionary model (far harder)
• “Topological incongruence”, where different gene regions (or genes) produce different phylogenetic trees, is the strongest signal for recombination (although conservative)
![Page 45: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/45.jpg)
Methods for Recombination Detection
• Measure level of linkage disequilibrium:
- LDhat, D’
• Look for changes in patterns of sequence similarity (often pairwise):
- GENECOV, RDP, Max Chi-Square, SimPlot, SiScan, TOPAL
• Look for incongruent phylogenetic trees:
- BOOTSCAN, 3SEQ, LARD, PLATO, LIKEWIND
• Look for “networked” evolution:
- SplitsTree, NeighborNet
• Look for excessive convergent evolution:
- Homoplasy test, PIST
• See http://www.bioinf.manchester.ac.uk/recombination/programs.shtml for a more complete list
• Many of these methods are available in the Recombination Detection Program (RDP3) - http://darwin.uvigo.es/rdp/rdp.html
![Page 46: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/46.jpg)
Sliding Window Diversity Plots can Graphically Show Recombination (e.g. “SimPlot”)
• Magiorkinis et al. Gene 349, 165-171 (2005).
Hepatitis B virus
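The idea behind a sliding-window plot can be sketched in a few lines: similarity between a query and a reference is computed in successive windows, and a sudden switch in which reference is most similar hints at a breakpoint. This is a simplified stand-in and not the SimPlot implementation; the sequences, window size and step are invented for the example.

```python
def sliding_window_similarity(query, reference, window=20, step=5):
    """Proportion of identical sites between two aligned sequences in each window."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = reference[start:start + window]
        identity = sum(a == b for a, b in zip(q, r)) / window
        results.append((start, identity))
    return results

# Hypothetical parents that differ at every site, and a mosaic query
# whose first half comes from parent A and second half from parent B.
parent_a = "ACGT" * 10
parent_b = "TGCA" * 10
query = parent_a[:20] + parent_b[20:]

for (start, sim_a), (_, sim_b) in zip(
    sliding_window_similarity(query, parent_a),
    sliding_window_similarity(query, parent_b),
):
    print(f"window at {start:2d}: vs A = {sim_a:.2f}, vs B = {sim_b:.2f}")
```

Similarity to parent A falls while similarity to parent B rises as the window crosses the middle of the sequence, which is the crossing pattern such plots are used to spot.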
![Page 47: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/47.jpg)
Detecting Recombination: Looking for Incongruent Trees
• Different genes produce different trees
[Figure: trees of sequences A, B and C for gene region 1 and gene region 2, separated by the maximum likelihood break-point.]
• Programme “LARD” (a maximum likelihood approach)
• Compute likelihood of each possible breakpoint in the alignment
• Identify breakpoint with the highest likelihood in the alignment
• Compare recombination likelihood to that with no recombination
• Assess significance with Monte Carlo simulation
Although reassortment is commonplace in influenza virus, the occurrence of homologous recombination is highly controversial
![Page 48: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/48.jpg)
Analyzing Natural Selection
![Page 49: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/49.jpg)
Ways of Measuring Selection Pressures (Especially Detecting Positive Selection)
• Phylogenetic methods: Identify cases of strong parallel or convergent evolution
• Population genetic methods:
(i) Look for regional reductions in genetic diversity, usually using SNPs (commonly used with genomic data)
(ii) Compare estimates of effective population size obtained using different measures of genetic diversity (e.g. the H statistic of Fay & Wu)
(iii) Estimate the speed of allele fixation compared to neutrality
• Combined phylogenetic and population genetic methods: Compare the relative numbers of nonsynonymous (dN) and synonymous (dS) substitutions per site
![Page 50: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/50.jpg)
Detecting Positive Selection by Examining Patterns/Rates of Fixation
• Bhatt S, Holmes EC & Pybus OG. (2011). The genomic rate of molecular adaptation of the human influenza A virus. Mol.Biol.Evol. 28, 2443-2451.
![Page 51: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/51.jpg)
Measuring Selection Pressures
• Compare the ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site (dN/dS = ω):

       Ser Met Leu Gly Gly
Seq 1: TCA ATG TTA GGG GGA
        †   *   †   †  **
Seq 2: TCG ATA CTA GGT ATA
       Ser Ile Leu Gly Ile

† Synonymous substitution
* Nonsynonymous substitution

dN/dS < 1.0 = purifying selection
dN/dS ~ 1.0 = neutral evolution
dN/dS > 1.0 = positive selection

• Cases where dN > dS (ω > 1) are evidence for positive selection because the rate of fixation of nonsynonymous changes (dN) is greater than the neutral mutation rate (dS), which is impossible under genetic drift
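The classification of changes in the two example sequences can be reproduced with a short script. It is a deliberately naive count: every nucleotide difference in a codon is labelled by whether the two codons encode the same amino acid, so it returns raw numbers of differences rather than dN and dS per site (which also require counting synonymous and nonsynonymous sites, as in the Nei & Gojobori method). The function name is hypothetical; the codon table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def count_substitutions(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) nucleotide differences between coding sequences."""
    syn = nonsyn = 0
    usable = len(seq_a) - len(seq_a) % 3  # ignore any incomplete trailing codon
    for i in range(0, usable, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += n_diff
        else:
            nonsyn += n_diff
    return syn, nonsyn

seq1 = "TCAATGTTAGGGGGA"  # Ser Met Leu Gly Gly
seq2 = "TCGATACTAGGTATA"  # Ser Ile Leu Gly Ile
print(count_substitutions(seq1, seq2))  # (3, 3)
```

The result matches the three † and three * marks in the example above.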
![Page 52: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/52.jpg)
Analysing Selection Pressures in Genes Using dN/dS
• Pairwise methods:
(i) Compute dS and dN in each pair of sequences and then compute the mean across all pairs
(ii) Various methods, including:
- Nei & Gojobori 1986 (distance matrix method)
- Li et al. 1985 (distance matrix method)
- Yang et al. 2000 (maximum likelihood method)
(iii) Problems of pseudo-replication, sometimes use poor substitution models, and lack of power (many false-negatives)
• Site-by-site (and branch) methods:
(i) Incorporate phylogenetic relationships of sequences (i.e. estimate dN/dS along a tree)
(ii) Allow variable selection pressures among codons and realistic models of nucleotide substitution
(iii) Can employ parsimony, likelihood or Bayesian methods
(iv) Has now been extended to account for directional selection (DEPS)
(v) Tendency for false-positive results, especially in branch-site methods
![Page 53: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/53.jpg)
Datamonkey
http://www.datamonkey.org/
• Online version of the more powerful HyPhy package
• Contains multiple programs for the analysis of selection pressures (and recombination)
![Page 54: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/54.jpg)
Variable Selection Pressures in RNA Viruses
![Page 55: Phylogenetics 101](https://reader036.vdocuments.mx/reader036/viewer/2022081420/56816775550346895ddc6acb/html5/thumbnails/55.jpg)
Intra-Host Evolution of HIV-1
Tip of the V3 loop (part of the envelope protein of HIV-1) - diversity in a single patient

SIHIGPGRAFYTTGE
SIPIGPGRAFYTTGQ
SIHIGPGGAFYTTGQ
SIHIGPGRAFYTTGD
SIPIGPGRAFYTTGD
GIHIGPGSAFYATGD
SIHIGPGRAFYTTGG
SIHIGPGRAVYTTGQ
GIHIGPGSAFYATGG
GIHIGPGRAVYTTEQ
RIHIGPGRAVYTTEQ
GIHIGPGSAFYATGR
RIYIGPGRAVYTTEQ
GIHIGPGSAVYATGG
RIYIGPGSAVYTTEQ
GIHIGPGSAFYATGG
RIGIGPGRSVYTAEQ
GIHIGPGSAVYATGD
GIHIGPGRAFYATGD
GIHIGPGRAVYTTGD
RIYIGPGRAVYTTDQ

• The HIV-1 envelope protein is under very strong positive selection to help the virus escape from the human immune response (the V3 loop contains epitopes for neutralising antibodies and cytotoxic T-lymphocytes (CTLs)).
• V3 loop dN/dS = 13.182 (Nielsen & Yang. Genetics 148, 929. 1998).