The coalescent process
Introduction
‡ Random drift can be seen in several ways
Forwards in time: variation in allele frequency
Backwards in time: a process of inbreeding/coalescence
Allele frequencies
Random variation in reproduction causes random fluctuations in allele frequency:
$$\mathrm{var}(p) = \frac{pq}{2N_e}$$
After many generations, the distribution can be approximated by a diffusion.
With random drift and mutation ($P \to Q$ at rate $\mu$, $Q \to P$ at rate $\nu$) the equilibrium distribution is:
$$\mathrm{prob}(p) \sim p^{4N_e\nu - 1}\, q^{4N_e\mu - 1}$$
The left-hand plot shows the distribution of $p$ for $N_e = 2{,}500$, $\nu = 2.5 \times 10^{-5}$, $\mu = 5 \times 10^{-5}$; the right-hand plot is for $N_e = 20{,}000$.
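The equilibrium distribution has the form of a Beta distribution with parameters $4N_e\nu$ and $4N_e\mu$, so it can be evaluated directly. The sketch below is only an illustration: the function names are made up for this example, and the normalising constant is the standard Beta function, which the notes do not write out.

```python
import math

def equilibrium_density(p, Ne, mu, nu):
    """Density of allele frequency p under drift and two-way mutation.

    prob(p) is proportional to p**(4*Ne*nu - 1) * q**(4*Ne*mu - 1),
    i.e. a Beta(4*Ne*nu, 4*Ne*mu) distribution.
    """
    a = 4 * Ne * nu   # exponent set by Q -> P mutation (rate nu)
    b = 4 * Ne * mu   # exponent set by P -> Q mutation (rate mu)
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def mean_frequency(mu, nu):
    """Mean of the Beta distribution: nu / (mu + nu), independent of Ne."""
    return nu / (mu + nu)

# The two parameter sets quoted for the plots.
for Ne in (2500, 20000):
    a, b = 4 * Ne * 2.5e-5, 4 * Ne * 5e-5
    shape = "U-shaped" if a < 1 and b < 1 else "unimodal"
    print(Ne, a, b, shape, equilibrium_density(0.25, Ne, 5e-5, 2.5e-5))
```

With $N_e = 2{,}500$ both exponents are negative, so the density piles up near 0 and 1; with $N_e = 20{,}000$ mutation dominates and the density has a single interior peak.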
The diffusion approximation can also include other forces, such as selection and migration. For example, the equilibrium distribution under mutation, random drift, and selection is:
$$\mathrm{prob}(p) \sim p^{4N_e\nu - 1}\, q^{4N_e\mu - 1}\, \bar{W}^{2N_e}$$
With heterozygote advantage (fitnesses $1-s : 1 : 1-s$), $\bar{W} = 1 - s(p^2 + q^2)$, so that $\bar{W}^{2N_e} \sim \exp[-2N_e s(p^2 + q^2)]$.
With $N_e = 2{,}500$, $\nu = 2.5 \times 10^{-5}$, $\mu = 5 \times 10^{-5}$, and $s = 0.0001, 0.001, 0.004$ (left to right):
¤ The key parameters are $N_e\mu$, $N_e\nu$, $N_e s$, which give the strength of drift relative to mutation and selection.
ü Further reading: Kimura, The neutral theory of molecular evolution, Chap. 3
Identity by descent
‡ Definition
Wright (1921, 1922), Haldane & Moshinsky (1939), Cotterman (1940) and Malécot (1948) developed the idea of identity by descent.
Two genes are identical by descent if they descend from the same gene in some ancestral population.
ü Note:
- Identity by descent is distinct from identity in state.
- i.b.d. is defined relative to some ancestral reference population.
- Identity measures can extend to many genes; usually, however, we just deal with identity between pairs of genes. This is related to variance of allele frequency, correlation between genes, and homozygosity.
- Relationships among many genes are better thought of in terms of coalescence of lineages in a genealogy.
‡ The probability of identity by descent is easily calculated for pedigrees
e.g. brother-sister mating
2 Coalescent process.nb
Genes are NOT i.b.d. in this case
Probability of identity by descent is 1/4
In general, the probability that two distinct genes in a diploid individual are i.b.d. is
$$f = \sum_{\text{loops}} \left(\frac{1}{2}\right)^{n-1} (1 + f_A),$$
where the sum is over all loops in the pedigree, $n$ is the number of individuals in the loop, and $f_A$ the identity between genes in the common ancestor.
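As a worked check of the loop formula, here is a small sketch. The function name and the way loops are listed are assumptions made for this example; in particular, it assumes that $n$ counts every individual in the loop, including the inbred individual itself, which is the reading that reproduces the value of 1/4 quoted for brother-sister mating.

```python
from fractions import Fraction

def inbreeding_from_loops(loops):
    """Sum (1/2)**(n - 1) * (1 + f_A) over pedigree loops.

    `loops` is a list of (n, f_A) pairs: n individuals in the loop
    and the identity f_A in that loop's common ancestor.
    """
    half = Fraction(1, 2)
    return sum(half ** (n - 1) * (1 + Fraction(f_A)) for n, f_A in loops)

# Offspring of a brother-sister mating: one loop through each grandparent,
# each containing 4 individuals (offspring, father, grandparent, mother).
brother_sister = [(4, 0), (4, 0)]
print(inbreeding_from_loops(brother_sister))
```

Exact fractions are used so that the result can be compared with the pedigree value without rounding.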
Note that the random element here is in segregation, not reproduction
‡ The increase in i.b.d. with random mating
ü Wright-Fisher model
Suppose that there are $2N_t$ individuals in a haploid population. In the next generation, there are $2N_{t+1}$, drawn randomly from all $2N_t$ possible parents.
On this scheme, individuals produce a number of offspring which is close to a Poisson distribution.
The Wright-Fisher model also applies to a random-mating diploid population, provided that individuals are as likely to mate with themselves as with anyone else.
Then, the probability that two genes are i.b.d. from the previous generation is $1/(2N_t)$:
$$f_{t+1} = \frac{1}{2N_t} + \left(1 - \frac{1}{2N_t}\right) f_t, \qquad f_0 = 0$$
$$h_{t+1} \equiv 1 - f_{t+1} = \left(1 - \frac{1}{2N_t}\right) h_t \quad \text{hence} \quad h_t = \prod_{i=0}^{t-1} \left(1 - \frac{1}{2N_i}\right)$$
With constant population size, $h_t$ declines by a factor $(1 - 1/2N)$ per generation, approximately as $\exp(-t/2N)$. The typical timescale for inbreeding and random drift is $2N$ generations.
With fluctuating sizes, $h_t$ declines (approximately) as $\exp\left(-\sum_{i=0}^{t-1} \frac{1}{2N_i}\right) = \exp(-t/2N_H)$, where $N_H$ is the harmonic mean population size.
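The harmonic-mean result can be checked numerically. This is a minimal sketch with a made-up population history (alternating between 100 and 10,000 individuals); the function names are illustrative only.

```python
import math

def heterozygosity(sizes):
    """Exact h_t = prod_i (1 - 1/(2 N_i)), starting from h_0 = 1."""
    h = 1.0
    for N in sizes:
        h *= 1.0 - 1.0 / (2.0 * N)
    return h

def harmonic_mean(sizes):
    """Harmonic mean of the population sizes."""
    return len(sizes) / sum(1.0 / N for N in sizes)

# Hypothetical population that alternates between 100 and 10,000 individuals.
sizes = [100, 10000] * 50          # 100 generations
N_H = harmonic_mean(sizes)         # close to 200, far below the arithmetic mean 5050
exact = heterozygosity(sizes)
approx = math.exp(-len(sizes) / (2.0 * N_H))
print(N_H, exact, approx)
```

The small generations dominate: drift proceeds almost as if the population always had about 200 individuals.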
Coalescence
The ancestry of a sample of neutral genes has a simple statistical distribution: the chance that any two lineages coalesce is $\frac{1}{2N_e}$ per generation.
More precisely:
- suppose that each gene leaves $\nu$ descendants
- as $N \to \infty$, the probability that any pair of lineages coalesce, per generation, tends to $\frac{\mathrm{var}(\nu)}{2N}$, i.e. $N_e = N/\mathrm{var}(\nu)$
The coalescent process refers to this limit
- equivalent to the diffusion approximation
An influential idea:
- DNA sequences are best described by their genealogy
- a variety of mutation models can be superimposed
- tracing back samples of alleles
  - speeds up simulations
  - gives statistical tests on sampled data
ü References
Hudson, R. (1990). Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7, 1-44.
Hudson, R. (1993). The how and why of generating gene genealogies. In Mechanisms of molecular evolution, ed. Takahata N & Clark AG, pp 23-36.
Donnelly, P. and S. Tavaré. (1995). Coalescents and genealogical structure under neutrality. Ann. Rev. Genet. 29, 401-421.
Rosenberg, N. A., and M. Nordborg. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics 3, 380-390.
‡ Properties of the coalescent process
The time during which there are $k$ lineages is exponentially distributed with expectation $\frac{1}{\lambda} = \frac{2N_e}{k(k-1)/2}$:
$$P(t_k) = \exp(-\lambda t_k)\, \lambda\, dt_k \quad \text{where} \quad \lambda = \frac{k(k-1)}{4N_e}$$
ü The genealogy is dominated by the deepest split.
The expected depth of the tree is:
$$2N_e\left(\frac{2}{k(k-1)} + \frac{2}{(k-1)(k-2)} + \dots + \frac{1}{6} + \frac{1}{3} + 1\right)$$
$$= 2N_e\left(\left(\frac{2}{k-1} - \frac{2}{k}\right) + \left(\frac{2}{k-2} - \frac{2}{k-1}\right) + \dots + \left(\frac{2}{2} - \frac{2}{3}\right) + \left(\frac{2}{1} - \frac{2}{2}\right)\right)$$
$$= 2N_e\left(\left(1 - \frac{2}{k}\right) + 1\right) \sim 4N_e \text{ for large } k$$
Thus, the tree collapses to 2 lineages in ~ $2N_e$ generations; these take another $2N_e$ generations to coalesce.
Hence, pairwise measures are uninformative.
ü The expected length of the genealogy is ~ $4N_e \log(1.78k)$
The expected length of the tree is:
$$2N_e\left(k\,\frac{2}{k(k-1)} + (k-1)\,\frac{2}{(k-1)(k-2)} + \dots + \frac{4}{6} + \frac{3}{3} + 2\right)$$
$$= 2N_e\left(\frac{2}{k-1} + \frac{2}{k-2} + \dots + \frac{2}{3} + \frac{2}{2} + \frac{2}{1}\right)$$
$$= 4N_e \sum_{j=1}^{k-1} \frac{1}{j}$$
$$\sim 4N_e \log(1.78k) \text{ for large } k$$
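Both sums are easy to evaluate exactly, which shows how quickly the depth saturates while the length keeps growing. The sketch below uses illustrative function names and measures time in units of $N_e$ generations; the factor 1.78 is $e^{\gamma}$, with $\gamma$ Euler's constant.

```python
import math

def expected_depth(k, Ne):
    """Expected time to the most recent common ancestor of k lineages."""
    return sum(4.0 * Ne / (j * (j - 1)) for j in range(2, k + 1))

def expected_length(k, Ne):
    """Expected total branch length: sum of j * E[t_j] = 4 Ne * sum_{j<k} 1/j."""
    return 4.0 * Ne * sum(1.0 / j for j in range(1, k))

Ne = 1.0  # measure time in units of Ne generations
for k in (2, 10, 100, 1000):
    approx = 4.0 * Ne * math.log(1.78 * k)
    print(k, expected_depth(k, Ne), expected_length(k, Ne), approx)
```

The depth approaches $4N_e$ from below as $4N_e(1 - 1/k)$, whereas the length grows only logarithmically with sample size.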
The distribution of length is highly variable:
The dots show the quantiles at 0.001, 0.01, 0.1, 0.9, 0.99, 0.999.
Figure 1: tree length L (vertical axis) against sample size n (horizontal axis), both on logarithmic scales.
‡ Fluctuating population size
Changes in Ne cause changes in timescale
The standard coalescent
Expanding populations → "star phylogeny"
exponential growth: population was 10% of the current size at TMRCA
Population bottlenecks → burst of coalescence
a bottleneck equivalent to $2N_e$ 'ordinary' generations of drift
ü Changing timescales
The "scaled time" is a measure of the total amount of genetic drift that has occurred:
$$T = \int_0^t \frac{dt}{2N(t)}$$
For a constant population size, $T = t/(2N)$. If the population is growing at a rate $\lambda$, and the present size is $N_0$, then $N = N_0 e^{-\lambda t}$, and so:
$$T = \int_0^t \frac{e^{\lambda t}}{2N_0}\, dt = \frac{1}{2N_0\lambda}\left(e^{\lambda t} - 1\right)$$
The parameter $\lambda$ is a measure of the amount of population growth over the current timescale set by population size, $2N_0$. Here is the transformation for $\lambda = 1.5$:
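The transformation can be tabulated in a few lines. This sketch assumes that both actual and scaled time are measured in units of $2N_0$ generations, and that the growth rate is expressed per $2N_0$ generations, so that $T = (e^{\lambda\tau} - 1)/\lambda$; the function name is illustrative.

```python
import math

def scaled_time(tau, lam):
    """Scaled time T for actual time tau under exponential growth.

    Both tau and T are measured in units of 2*N0 generations, and lam is the
    growth rate per 2*N0 generations, so T = (exp(lam*tau) - 1) / lam.
    """
    if lam == 0:
        return tau  # constant population size: T = t / (2N)
    return math.expm1(lam * tau) / lam

for tau in (0.5, 1.0, 1.5, 2.0):
    print(tau, scaled_time(tau, 1.5))
```

Going back in time, the population is smaller, so each unit of actual time contributes more and more drift.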
Figure: scaled time (vertical axis) against actual time (horizontal axis).
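‡ Branching processes
The coalescent process only applies to samples from a large population
If all genes are observed, we have a branching process
e.g. discrete time: # of offspring $i$ follows a Poisson distribution with $E[i] = \lambda$
Figure: probability of survival $P$ (vertical axis) against $\lambda$ (horizontal axis).
More generally, for $\lambda \sim 1$, the probability of survival is $P \sim 2(\lambda - 1)/\mathrm{var}(i)$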
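For Poisson offspring numbers the survival probability solves $1 - P = e^{-\lambda P}$, which can be found by simple iteration and compared with the approximation. The sketch below is an illustration only; the function name, tolerance and iteration cap are choices made for this example.

```python
import math

def survival_probability(lam, tol=1e-12, max_iter=100000):
    """Chance that a lineage with Poisson(lam) offspring never dies out.

    The extinction probability e is the smallest root of e = exp(lam * (e - 1)),
    found here by iterating from e = 0.
    """
    e = 0.0
    for _ in range(max_iter):
        e_next = math.exp(lam * (e - 1.0))
        if abs(e_next - e) < tol:
            break
        e = e_next
    return 1.0 - e_next

for lam in (1.05, 1.1, 1.5, 2.0):
    approx = 2.0 * (lam - 1.0) / lam   # var(i) = lam for a Poisson distribution
    print(lam, survival_probability(lam), approx)
```

The approximation is close for $\lambda$ just above 1 and overestimates survival as $\lambda$ grows.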
Figure: a genealogy of 20 genes from the coalescent, compared with a sample from a branching process with $\lambda = 1.1$.
Mutation
‡ Infinite alleles
Assuming that every mutation generates a new allele, the probability of identity in allelic state ("homozygosity") is $F = \sum_t f_t (1 - \mu)^{2t}$, where $f_t$ is the distribution of coalescence times.
$$F \sim E[e^{-2\mu t}] = \int_0^\infty e^{-2\mu t} f_t\, dt = \int_0^\infty e^{-2\mu t}\, e^{-t/2N_e}\, \frac{dt}{2N_e} = \frac{1}{1 + 4N_e\mu}$$
Identity coefficients, F, can easily be calculated by going back in time one generation:
$$F = (1-\mu)^2\left(\left(1 - \frac{1}{2N_e}\right) F + \frac{1}{2N_e}\right) \;\Rightarrow\; F = \frac{(1-\mu)^2}{2N_e\left(1 - \left(1 - \frac{1}{2N_e}\right)(1-\mu)^2\right)} \sim \frac{1}{1 + 4N_e\mu}$$
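The one-generation recursion can be iterated directly to confirm both the exact equilibrium and the approximation $1/(1 + 4N_e\mu)$. A minimal sketch, with illustrative function names and hypothetical parameter values:

```python
def equilibrium_identity(Ne, mu):
    """Exact fixed point of F = (1-mu)^2 * ((1 - 1/(2Ne)) F + 1/(2Ne))."""
    u = (1.0 - mu) ** 2
    return u / (2.0 * Ne * (1.0 - (1.0 - 1.0 / (2.0 * Ne)) * u))

def iterate_identity(Ne, mu, generations, F=0.0):
    """Apply the one-generation recursion repeatedly, starting from F."""
    u = (1.0 - mu) ** 2
    for _ in range(generations):
        F = u * ((1.0 - 1.0 / (2.0 * Ne)) * F + 1.0 / (2.0 * Ne))
    return F

Ne, mu = 1000, 1e-4                      # hypothetical values, 4*Ne*mu = 0.4
print(iterate_identity(Ne, mu, 50000))   # converges to the fixed point
print(equilibrium_identity(Ne, mu))
print(1.0 / (1.0 + 4.0 * Ne * mu))       # approximation, 1/1.4
```

The approximation drops terms of order $\mu$ and $1/N_e$, so the agreement improves as both become small.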
Identity coefficients are generating functions for the distribution of coalescence times:
$$F \sim E[e^{-2\mu t}] \quad \therefore \quad F = 1 \text{ when } \mu = 0$$
$$\frac{dF}{d\mu} \sim E[-2t\, e^{-2\mu t}] \quad \therefore \quad \frac{dF}{d\mu} = -2E[t] \text{ when } \mu = 0$$
$$\frac{d^2F}{d\mu^2} \sim E[4t^2 e^{-2\mu t}] \quad \therefore \quad \frac{d^2F}{d\mu^2} = 4E[t^2] \text{ when } \mu = 0$$
‡ More general models of mutation
Bases mutate at rate $\mu$, and change to A, T, G, C with equal probability.
Probability of identity in state of two genes is:
$$F = E\left[\tfrac{1}{4}\left(1 - e^{-2\mu t}\right) + e^{-2\mu t}\right]$$
‡ Infinite sites
For DNA sequences, the 'infinite sites' model is more appropriate: each mutation is at a new site in the sequence.
Two alleles may differ by mutations at 1, 2… sites - giving a measure of the time for which they have been diverging.
If there are mutations on every internal branch, the genealogy can be reconstructed:
Figure: genealogy of the five genes a, b, c, d, e.

Gene   Mutation
       1  2  3  4  5  6
a      1  1  1  1  0  0
b      0  0  0  1  0  0
c      0  0  0  0  1  1
d      0  1  1  0  0  0
e      1  0  0  0  0  0
To root the tree, we must know which mutations are derived - which requires an outgroup
Any pair of sites which carries all four combinations is incompatible with a tree; this implies either:
- recombination
- multiple mutations
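This incompatibility check (often called the four-gamete test) is simple to automate. The sketch below uses made-up haplotypes, written as strings of 0s and 1s, rather than any data from these notes; the function name is illustrative.

```python
from itertools import combinations

def incompatible_pairs(haplotypes):
    """Return pairs of sites (1-based) that show all four of 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(row[i], row[j]) for row in haplotypes}
        if len(gametes) == 4:
            bad.append((i + 1, j + 1))
    return bad

# Hypothetical sample: four haplotypes that fit a tree...
compatible = ["000", "100", "110", "001"]
print(incompatible_pairs(compatible))

# ...and the same sample with one extra haplotype that does not.
print(incompatible_pairs(compatible + ["011"]))
```

Under infinite sites, any flagged pair implies at least one recombination event between the two sites.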
The mean pairwise diversity, $\pi$, is just $E[2\mu t] = 4N_e\mu$
The number of segregating sites, $n_s$, in a sample is proportional to the total length of the tree: $E[n_s] = \mu E[L]$, where $L = \sum_{j=2}^{k} j\, t_j$
$$E[n_s] = E[\mu L] = 4N_e\mu\left(\frac{1}{k-1} + \frac{1}{k-2} + \dots + \frac{1}{3} + \frac{1}{2} + 1\right) \sim 4N_e\mu \log(1.78k)$$
Under neutrality, we expect a definite relation between the # of segregating sites and the pairwise diversity
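The expected relation can be made concrete: both quantities estimate $\theta = 4N_e\mu$, one directly and one after dividing by the harmonic sum. The sketch below uses illustrative function names; the second estimator is usually attributed to Watterson, a name that does not appear in these notes.

```python
def expected_pairwise_diversity(theta):
    """E[pi] = E[2 mu t] = 4 Ne mu = theta."""
    return theta

def expected_segregating_sites(theta, k):
    """E[n_s] = theta * sum_{j=1}^{k-1} 1/j for a sample of k genes."""
    return theta * sum(1.0 / j for j in range(1, k))

def theta_from_segregating_sites(n_s, k):
    """Estimate of theta obtained by inverting the expectation above."""
    return n_s / sum(1.0 / j for j in range(1, k))

theta, k = 30.0, 20   # hypothetical values for illustration
n_s = expected_segregating_sites(theta, k)
print(n_s, theta_from_segregating_sites(n_s, k), expected_pairwise_diversity(theta))
```

A marked difference between the two estimates in real data suggests a departure from the neutral, constant-size model.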
Recombination
‡ Ancestral graphs
With sexual reproduction, genomes have multiple ancestors.
Ancestry is described by an ancestral graph:
Coalescence amongst $k$ lineages at a rate $\frac{k(k-1)}{2}\,\frac{1}{2N_e}$
Recombination at a rate $kr$
Pattern depends on $R = 2N_e r$
Each recombination generates a pair of unique junctions
Junctions can disappear if they meet each other in a coalescence
At any time, any one genome is distributed across several ancestral lineages
$$1 + R - \frac{R^2}{3} + \frac{13}{54}R^3 + O(R^4) \quad \text{(Derrida \& Jung-Muller 1999)}$$
‡ Example: R = 50
Number of ancestral lineages:
A typical sample, with 18 ancestors:
Six sampled genomes represented by colours ($t/2N_e = 0.6$):
‡ Looking along the genome....
Different regions have different genealogies:
ü Patterns of diversity vary along the genome:
Numbers of segregating sites (20 sampled genomes; $\theta = 4N_e\mu = 30$; sliding window width 0.5)
Mean number of pairwise differences:
‡ Pedigrees - or an infinitely long genome
Probability of ancestor repetitions in the genealogical tree of the king Edward III. The continuous and dashed lines show simulations of $F(r)$ in a closed population with $2^{11}$ and $2^{12}$ individuals for our model.
Distribution $H(r, t)$ of $r$ repetitions after $t$ generations. $t$ = 9, 13, 15, 17, 19, 21, and 23 for a population with $N = 2^{15}$.
Derrida, B., S. C. Manrubia, and D. H. Zanette. 1999. Statistical properties of genealogical trees. Physical Review Letters 82:1987-1990.
‡ Forwards in time
What is the fate of a single ancestral genome?
In an infinitely large population, this is a branching process.
The chance that the pedigree will survive is ~ 80%
Any finite piece of genome is certain to be lost - but very slowly
The probability of survival of a neutral genome ($S = 0$) as a function of map length, $R$. From top to bottom, the curves show $P_t[R]$ for $t$ = 0, 1, 2...10; 20, 30...100; and 200, 300...1000 generations.
The distribution of blocks of genome that remain after 50 generations; map length $R = 1$. The two panels show two random realisations of this process. Each line represents one genome.
(a) The increase in mean block number over time (±1 standard error), compared with the expectation $1 + Rt$. (b) The mean amount of ancestral material over time, compared with the constant expectation $R$. (c) The probability of survival, $P$, compared with the value calculated from Eq. 2. (d) The distribution of block sizes at time $t = 30$ compared with the expectation. ($R = 1$).
‡ What do we see?
What is the relation between the ancestry of segments of genome, and the patterns we see?
Patil et al. 2001 Science 294:1719
21,676,868 bases, 36000 SNPs; ~4000 "blocks" identified; ~2700 SNPs capture ~80% of haplotype variation
What is the actual structure of these 20 chromosomes?
Selection on linked sites
‡ Balancing selection
ü Complete linkage
Kreitman & Aguade (Genetics, 1986) observed excess polymorphism in the Adh region of D. melanogaster.
Hudson, Kreitman & Aguade (Genetics, 1987) introduced the "HKA test" to detect balancing selection.
A polymorphism with two alleles P, Q divides linked markers into two separate gene pools.
Eventually, there will be a set of alleles with homozygosity $\frac{1}{1 + 4Np\mu}$ associated with P, and a distinct set associated with Q, with homozygosity $\frac{1}{1 + 4Nq\mu}$. The overall homozygosity is:
$$F = \frac{p^2}{1 + 4N\mu p} + \frac{q^2}{1 + 4N\mu q}$$
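The effect of the polymorphism on linked diversity can be tabulated from this formula. A minimal sketch, with an illustrative function name and $\theta$ standing for $4N\mu$:

```python
def overall_homozygosity(p, theta):
    """F = p^2/(1 + theta*p) + q^2/(1 + theta*q), with theta = 4*N*mu."""
    q = 1.0 - p
    return p * p / (1.0 + theta * p) + q * q / (1.0 + theta * q)

for theta in (0.1, 1.0):
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(theta, p, 1.0 - overall_homozygosity(p, theta))
```

Heterozygosity $1 - F$ is highest at $p = 1/2$, where the two gene pools are equally large and at least half of all pairs of genes lie in different pools.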
e.g. $1 - F$ vs $p$ for $4N\mu = 0.1$ (bottom), $4N\mu = 1$ (top):
ü Recombination
We must follow identities between genes both associated with P, $F_{PP}$, both with Q, $F_{QQ}$, or one with each, $F_{PQ}$
$$F_{PP}' = (1 - rq)^2 F_{PP} + 2rq(1 - rq) F_{PQ} + r^2q^2 F_{QQ}$$
Assuming $r$ small:
$$\delta F_{PP} = 2rq(F_{PQ} - F_{PP})$$
$$\delta F_{PQ} = r(qF_{QQ} + pF_{PP} - F_{PQ})$$
$$\delta F_{QQ} = 2rp(F_{PQ} - F_{QQ})$$
The effects of mutation and drift can be found in a similar way.
Overall:
$$\delta F_{PP} = -2\mu F_{PP} + 2rq(F_{PQ} - F_{PP}) + \frac{1 - F_{PP}}{2Np}$$
$$\delta F_{PQ} = -2\mu F_{PQ} + r(qF_{QQ} + pF_{PP} - F_{PQ})$$
$$\delta F_{QQ} = -2\mu F_{QQ} + 2rp(F_{PQ} - F_{QQ}) + \frac{1 - F_{QQ}}{2Nq}$$
At equilibrium, $\delta F = 0$. The average $F$ is:
$$\bar{F} = \frac{2 + \rho - 4pq\left(1 - N\mu(2 + 3\rho + \rho^2)\right)}{2 + \rho + 4N\mu\left(2 + (1 + 4pq)\rho + pq\rho^2\right) + 16N^2\mu^2 pq(2 + 3\rho + \rho^2)}$$
where $\rho = r/\mu$.
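The closed form can be checked by solving the three equilibrium equations numerically and averaging as $p^2 F_{PP} + 2pq F_{PQ} + q^2 F_{QQ}$. The sketch below is an independent check written for these notes, not the original calculation; the function names and the small linear solver are assumptions of this example.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def mean_identity_numeric(N, mu, r, p):
    """Set the three changes in F to zero and average over pairs of genes."""
    q = 1.0 - p
    A = [[2*mu + 2*r*q + 1/(2*N*p), -2*r*q, 0.0],
         [-r*p, 2*mu + r, -r*q],
         [0.0, -2*r*p, 2*mu + 2*r*p + 1/(2*N*q)]]
    b = [1/(2*N*p), 0.0, 1/(2*N*q)]
    Fpp, Fpq, Fqq = solve3(A, b)
    return p*p*Fpp + 2*p*q*Fpq + q*q*Fqq

def mean_identity_formula(N, mu, r, p):
    """Closed form for the average F, with rho = r/mu."""
    q, rho, Nm = 1.0 - p, r / mu, N * mu
    poly = 2 + 3*rho + rho**2
    num = 2 + rho - 4*p*q*(1 - Nm*poly)
    den = 2 + rho + 4*Nm*(2 + (1 + 4*p*q)*rho + p*q*rho**2) + 16*Nm**2*p*q*poly
    return num / den

N, mu, p = 2500, 1e-5, 0.5      # 4*N*mu = 0.1
for rho in (0.0, 1.0, 5.0, 10.0):
    r = rho * mu
    print(rho, 1 - mean_identity_numeric(N, mu, r, p), 1 - mean_identity_formula(N, mu, r, p))
```

At $\rho = 0$ the result reduces to the complete-linkage formula, and for large $\rho$ it tends to the neutral value $1/(1 + 4N\mu)$.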
Note that the effect is only over recombination rates of order $\mu$
ü Plot of heterozygosity $(1 - \bar{F})$ against $r/\mu$ for $4N\mu = 0.1$
ss = Solve[{
    0 == -2 μ FPP + 2 r q (FPQ - FPP) + (1 - FPP)/(2 n p),
    0 == -2 μ FPQ + r (q FQQ + p FPP - FPQ),
    0 == -2 μ FQQ + 2 r p (FPQ - FQQ) + (1 - FQQ)/(2 n q)},
   {FPP, FPQ, FQQ}];

({FPP, FPQ, FQQ, p^2 FPP + 2 p q FPQ + q^2 FQQ} /. ss[[1]] /.
     {μ -> γ μ, r -> γ ρ μ, n -> 1/(γ nn), q -> 1 - p} // Cancel) /. nn -> 1/n // Simplify

{((2 + ρ) (-1 + 4 n (-1 + p) μ (1 + p ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) +
    4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 (ρ (-1 + 4 n (-1 + p) p μ (2 + ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) +
    4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 ((2 + ρ) (-1 + 4 n p μ (-1 + (-1 + p) ρ)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) +
    4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)),
 -(2 + ρ + 4 p (-1 + n μ (2 + 3 ρ + ρ^2)) - 4 p^2 (-1 + n μ (2 + 3 ρ + ρ^2)))/
   (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) +
    4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2))}

Plot[1 + (2 + ρ + 4 p (-1 + n μ (2 + 3 ρ + ρ^2)) - 4 p^2 (-1 + n μ (2 + 3 ρ + ρ^2)))/
     (-2 - ρ + 16 n^2 (-1 + p) p μ^2 (2 + 3 ρ + ρ^2) +
      4 n μ (-2 + (-1 - 4 p + 4 p^2) ρ + (-1 + p) p ρ^2)) /.
   {n -> 0.025/μ, p -> 1/2, ρ -> Abs[ρ]}, {ρ, 0, 10},
  PlotRange -> {{0, 10}, {0, 1}}];
‡ Selective sweeps
Fixation of a single favourable mutation carries with it a segment of linked genome
Figure labels: mutation; branching process; ns >> 1; deterministic increase; p << 1; fixation; sample.
An example: $s = 0.1$, $N = 10^5$, sampled when $p = 0.1$. $r = \{-0.05, 0.15\}$; $s/\log(2N) = 0.008$
Fixation takes ~ $\frac{\log(2N)}{s}$ generations, so a region of $r \sim \frac{s}{\log(2N)}$ has reduced diversity
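The two scalings are quick to evaluate for the example above. A minimal sketch with illustrative function names; logarithms are natural logarithms.

```python
import math

def sweep_duration(N, s):
    """Approximate time to fixation, log(2N)/s generations."""
    return math.log(2 * N) / s

def sweep_width(N, s):
    """Recombination distance r ~ s/log(2N) over which diversity is reduced."""
    return s / math.log(2 * N)

N, s = 10**5, 0.1   # the example quoted above
print(sweep_duration(N, s))
print(sweep_width(N, s))
```

Because the width depends on $N$ only through a logarithm, it is set almost entirely by the selection coefficient.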
ü References
Maynard Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.
Hudson, R. R., and N. L. Kaplan. 1988. The coalescent process in models with selection and recombination. Genetics 120:831-840.
Kaplan, N. L., R. R. Hudson, and C. H. Langley. 1989. The hitch-hiking effect revisited. Genetics 123:887-899.
Barton, N. H. 2000. Genetic hitch-hiking. Philosophical Transactions of the Royal Society (London) B 355:1553-1562.
Kim, Y., and W. Stephan. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765-777.
Gillespie, J. H. 2001. Is the population size of a species relevant to its evolution? Evolution 55:2161-2169.
Monte Carlo methods
ü Generalities
How can we make inferences from genetic data?
- statistics such as # of segregating sites, pairwise diversity…
- likelihood: the probability of observing the data, given some hypothesis
Statistical inference:
- significance tests
- likelihood
- Bayesian inference
ü Griffiths-Tavare
Griffiths, R. C., and S. Tavare. 1994. Simulating probability distributions in the coalescent. Theoretical Population Biology 46:131-159.
We observe some configuration of mutations:

       1  2  3  4  5  6  7  8  9
a      0  0  0  1  1  0  1  0  0
b      0  0  0  1  1  0  0  1  1
c      1  1  1  0  0  1  0  0  0
d      1  1  1  0  0  0  0  0  0
e      0  0  0  0  0  0  0  0  0

This configuration was produced by this genealogy:
Figure: rooted genealogy with tips in the order e, c, d, a, b.
This rooted genealogy cannot be fully reconstructed, because there were no mutations along the branches leading down to e and to {a,b,c,d}
ü The algorithm (exact version):
- Work back along the genealogy, until the most recent mutation or coalescence
- Sites can only lose a mutation if that mutation is represented only in one leaf; let there be $J$ such sites. (In the example above, sites 6, 7, 8, 9 are singletons; $J = 4$).
- A pair of lineages can only coalesce if they carry the same set of mutations; let there be $K$ such pairs. In the example, there are no such possibilities: $K = 0$.
- With $n$ lineages, the rate of events is $\lambda_n = n\frac{\theta}{2} + \frac{n(n-1)}{2}$; a sum is taken over these events, with the appropriate probability, and expressed in terms of the probabilities of the simpler configurations generated by loss of a mutation or coalescence.
- This sum over $J + K$ possible previous configurations is weighted by the overall weight $\frac{1}{\lambda}$:
$$P[S] = \frac{1}{\lambda_n}\left(\sum_{j=1}^{J} \frac{\theta}{2} P[S_j^*] + \sum_{k=1}^{K} P[S_k^*]\right) \quad \text{where} \quad \lambda_n = n\frac{\theta}{2} + \frac{n(n-1)}{2}$$
$S_j^*$ represents deletion of the $j$'th singleton site from $S$, and $S_k^*$ the coalescence of the $k$'th pair.
This algorithm becomes extremely slow for large numbers of mutations and lineages.
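The first step of the recursion, counting the possible moves from a configuration, can be sketched as follows. This covers only the bookkeeping for $J$, $K$ and $\lambda_n$ as described above, not the full recursion; the function name and the value $\theta = 1$ are assumptions made for illustration.

```python
from itertools import combinations

def count_moves(config, theta):
    """Count possible mutation losses (J) and coalescences (K), and the rate lambda_n.

    `config` maps each lineage to its 0/1 states at the segregating sites.
    """
    rows = list(config.values())
    n = len(rows)
    n_sites = len(rows[0])
    # J: sites whose mutation is carried by exactly one lineage
    J = sum(1 for site in range(n_sites) if sum(row[site] for row in rows) == 1)
    # K: pairs of lineages that carry identical sets of mutations
    K = sum(1 for a, b in combinations(rows, 2) if a == b)
    rate = n * theta / 2.0 + n * (n - 1) / 2.0
    return J, K, rate

# The configuration of five genes and nine sites shown above.
config = {
    "a": (0, 0, 0, 1, 1, 0, 1, 0, 0),
    "b": (0, 0, 0, 1, 1, 0, 0, 1, 1),
    "c": (1, 1, 1, 0, 0, 1, 0, 0, 0),
    "d": (1, 1, 1, 0, 0, 0, 0, 0, 0),
    "e": (0, 0, 0, 0, 0, 0, 0, 0, 0),
}
print(count_moves(config, theta=1.0))
```

For this configuration the only possible moves are the four singleton losses, in agreement with $J = 4$ and $K = 0$.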
ü Monte Carlo version:
A Monte Carlo estimate can be made by sampling possible paths back through the genealogy, with relative probability $\frac{\phi}{2}$ for possible losses of mutations, and 1 for possible coalescences:
$$P[S] = \left(\frac{\theta}{\phi}\right)^m E\left[\prod \frac{1}{\lambda_i}\left(\frac{\phi}{2} J_i^* + K_i^*\right)\right]$$
where $J_i^*$ is the number of possible losses of mutations, $K_i^*$ the number of possible coalescences, $m$ the number of segregating sites, and $i$ the current # of lineages
The parameter $\phi$ can be chosen arbitrarily: it should take a value which minimises the variance of the estimator. Note that while $\phi = \theta$ seems natural, it does not give an optimal estimator.
ü Other applications:
Joint estimation of recombination and mutation ($4N_e r$, $4N_e\mu$):
Kuhner, M. K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156:1393-401.
Fearnhead, P., and P. Donnelly. 2001. Estimating recombination rates from population genetic data. Genetics 159:1299-1318.
Estimation of population structure:
Beerli, P., and J. Felsenstein. 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences (U.S.A.) 98:4563-4568