Transposable elements distributions and
genome evolution’s footprints.
Maria Rita Fumagalli
Tutor: Michele Caselle
Università degli Studi di Torino
Ph.D. Program in Complex Systems for Life Sciences
XXVIII Cycle - 2013/2015
Introduction
Evolution is a process common to all known lifeforms and plays a central role in shaping
biological diversity. This is a core problem of contemporary biology, where a substantial amount
of data is available, but, to be solved, it requires mixing different competences and backgrounds.
Evolution implies the concept of inheritability, the acquisition of new abilities that are
transmitted to the new generation, and, consequently, DNA modifications. In the last decades
the intuitive chain “one gene, one transcript, one protein” has been enriched by the discovery
of a complex relation between coding and non-coding information. The growing knowledge
of the regulatory roles fulfilled by that part of our DNA that was once referred to as “junk”,
contributed in dramatically changing the idea of “genetic information” encoded by DNA.
For a physicist, evolution is a stochastic process driven by rare events. Evaluating and
modeling costs and benefits of evolutionary moves and how they drive the process of evolution
represent an interesting challenge.
Recently, the idea of using microorganisms to study evolution in the laboratory has made it
possible to directly investigate how their genomes and phenotypic properties evolve. Replicated
experiments under controlled conditions, combined with modern sequencing technologies, make it
possible to deduce quantitative principles of evolution and test theoretical hypotheses.
Laboratory evolution experiments show a number of large-scale changes of the genome, which
appear in combination with point mutations. Large-scale mutations can involve individual
genes, genomic segments, or even entire chromosomes. Regarding this kind of mutation, the
open question is how these a priori detrimental moves may be turned into an advantage. Standard
population genetics models do not explicitly address the problem of large-scale moves and
usually consider point mutations only.
The first part of this thesis focuses on the role of transposable elements in genome evolution,
as a source of large-scale evolutionary moves. We introduce a model to describe their distribution,
in order to obtain information on large-scale genomic rearrangements over time.
The second part of this thesis aims to develop theoretical and quantitative models for long-
term evolution of microbial populations, comparing them with data from biological experiments.
Most of this part has been published in the article:
Speed of evolution in large asexual populations with diminishing returns
Fumagalli M. R., Osella M., Thomen P., Heslot F. and Cosentino Lagomarsino M.
Journal of theoretical biology, 2015.
Contents
I Distribution of Transposable Elements and Genome Evolution
1 Transposable elements in human genome
2 Models for REs distributions in human genome
2.1 Null model
2.2 Expansion-duplication model
2.3 Source estimate
Conclusions
Appendix - Part I
A1 Data analysis and selection
A2 GC-rich and GC-poor regions
A3 Jukes-Cantor and Kimura divergence
A4 Maximum likelihood estimate
Supplementary Figures - Part I
II Stochastic Models of Evolution in Large Asexual Populations
1 Mutations and fitness in asexual populations
2 Models for microbial evolution in long-term experiments
2.1 Model definition
2.2 The diminishing return model
2.3 Parameter-matching procedure for diminishing return model
Discussion and conclusions
Appendix - Part II
A1 Self-consistency scaling argument estimating the adaptation speed
A2 Simulation algorithm and effective parameters
A3 Phenotypic variability
A4 Estimate of the effective beneficial mutation rate
A5 Estimate of mean fitness as a function of time
Supplementary Figures - Part II
Part I
Distribution of Transposable
Elements and Genome Evolution
Chapter 1
Transposable elements in human
genome
Transposable elements (TEs), also known as “jumping genes” or transposons, are sequences
of DNA able to insert and move within a host genome.
The existence of “jumping genes” was first observed by Barbara McClintock in maize in the
late 1940s [1, 2]. Since then, TEs have been found in essentially all genomes, with
very few exceptions. Transposable elements compose nearly half of the human genome [3, 4, 5]
(Fig. 1.1), but their contribution is likely to be even larger because the identification of ancestral
sequences is complicated by the mutations acquired over time.
TEs classification
Transposable elements can be divided into two classes, retrotransposons (Class I) and DNA
transposons (Class II), according to their mechanism of transposition. These classes are
subdivided into a wide variety of groups, families and subfamilies according to their sequences.
Moreover, TEs can also be defined as autonomous (active) or nonautonomous, depending on whether
they encode the proteins necessary for their own transposition.
DNA transposons (DTEs) account for a small fraction of the TEs in human and in other
mammals [3, 6], while they contribute substantially in other species. Although they seem to have
lost the ability to transpose in the human genome [3], DTEs were active during early primate
evolution, until 37 million years ago (MYA) [6].
Figure 1.1: (a) Relative contribution of different TEs to the amount of repetitive elements in selected species
as percentage over the total number of insertions. (b) Contribution of different TEs to the composition of
human genome. Alu and L1 families together represent one third of our genome. Data obtained as described in
Appendix.
These elements are able to excise from the genome, move and paste themselves as DNA into
a new genomic position.
The transposition process does not necessarily increase the copy number of the DTEs, be-
cause most of them move through a non-replicative “cut-and-paste” mechanism. The increase
in the number of copies of DTEs should be caused by indirect mechanisms that rely on host
replicative processes [7].
Most of the active elements encode a transposase enzyme that is able to bind at the termini
of the DTE sequence performing a breakage reaction to remove the transposon from its site.
The same enzyme, binding at the new integration site of the target DNA, inserts the DTE
into its new position. The inserted element is flanked by small gaps which are filled in by host
enzymes, leading to small target site duplications [8].
There are other types of DNA transposons, named Helitrons and Mavericks, whose mecha-
nism of transposition is not yet well understood but, most likely, implies the displacement and
replication of a single-stranded DNA intermediate [7].
Retrotransposable elements (REs) proliferate through a “copy-and-paste” mechanism.
They are transcribed in RNA intermediates and reverse transcribed into the host genome in a
different location. Therefore the retrotransposition process increases the number of REs present
in the genome since it preserves the original copy duplicating the element.
Retrotransposons can be further subdivided into two groups: long terminal repeats (LTRs)
and non-LTRs.
LTRs, including human endogenous retroviruses (HERVs), have some similarity with viral
sequences. These elements encode different viral proteins (reverse transcriptase, ribonuclease and
integrase) that provide the enzymatic activities needed to reverse-transcribe the RNA and insert
the cDNA into the genome. LTR sequences lack the genes encoding the envelope proteins that permit the
movement from one cell to another. Hence, unlike viruses, they can only reinsert themselves
into the genome from which they originated. Reverse transcription of LTR-retrotransposon
RNA is a multistep process occurring in the cytoplasm [8].
Non-LTR retrotransposons are classified according to their size into two categories: short
interspersed elements (SINEs, shorter than 500 base pairs (bp)) and long interspersed elements
(LINEs) [9]. The two major families∗ of LINEs and SINEs are L1 and Alu, globally consisting
of $\approx 2 \cdot 10^{6}$ elements and accounting for nearly 30% of human DNA (Fig. 1.1). Currently, few
elements from these two families are able to move in the human genome and can cause genomic
instability through both insertion and post-insertion mechanisms [10].
In contrast to LTR retrotransposons and retroviruses, RNA is transported to the nucleus and
reverse transcription process takes place on nuclear genomic DNA. Occasionally, nonautonomous
TEs and other cellular transcripts can compete for the retrotranscription machinery, leading to
the retrotransposition of these elements and processed pseudogenes [11].
L1 non-LTR retrotransposons are autonomous REs widespread in mammals. This family represents
almost one fifth of human DNA, with around $10^{6}$ insertions.
Full-length elements are 6 kb long, but the vast majority of human L1 insertions are 5’
truncated to less than 1 kb (Fig. S2). Truncated elements are also observed among
novel insertions in cancer cells [14]. The large number of incomplete sequences is likely
related to premature polyadenylation of the RNA rather than post-insertion events [5]. A few
hundred full-length (and potentially functional) L1s can be detected in the human genome, and
germline retrotransposition events occur at an estimated rate of one event every 200 live births
per generation [15, 16].
∗Actually, the classification of these REs is much more complicated, especially for LINEs. For sake of simplicity,
in this work we refer to L1 and Alu as “families” of REs, and all the elements belonging to these families are
classified in “subfamilies”.
Figure 1.2: (a) L1 and Alu structure. Figure shows the schematic structure of L1 and Alu elements. L1s
contain an RNA polymerase II binding site and two open reading frames encoding the proteins necessary for
RNA retrotranscription and insertion in the host. Alus are short non-coding sequences. They are composed of
two monomers, connected by an A-rich region. The left monomer contains a bipartite promoter for RNA Polymerase
III. Figure adapted from ref. [12]. (b) L1 retrotransposition cycle. L1 mRNA (blue line) is translated in
the cytoplasm, where it assembles with its own encoded proteins. Ribonucleoprotein complexes are re-imported
into the nucleus. The endonuclease makes a single-stranded nick in the host DNA and the reverse transcriptase
uses the nicked DNA to prime reverse transcription from the 3’ end of the L1 RNA. Retrotranscription can be
incomplete, in which case the inserted element is truncated (e.g. in the figure only the 3’ UTR and ORF2 are
successfully inserted in the host genome). Alus, SVAs or cellular mRNAs can recruit L1 proteins, using the same
machinery to retrotranscribe themselves in the host genome. Figure adapted from ref. [13].
L1 elements contain in the 5’UTR an internal RNA Polymerase II (Pol II) promoter [5, 17].
Thus, lacking the Pol II promoter, 5′ truncated L1s constitute inactive elements of the family.
In the central part of their sequence, L1s contain two open reading frames (ORFs): the
first one encodes a nucleic acid binding protein and the second encodes an endonuclease and a
reverse transcriptase [8, 11]. These proteins, when the element is transcribed by RNA Pol II, are
probably responsible for the transport and retrotranscription of the RNA into a new genomic
position (Fig. 1.2b).
The L1-encoded endonuclease preferentially cleaves the sequence 5’-TTTT/AA-3’ [17, 18],
and L1s are particularly abundant in AT-rich and gene-poor regions.
Although there is some evidence of tendency of L1s to cluster [19], insertion sites can be
considered randomly distributed, at least at genomic scale [15, 20] (see also Fig. S10).
Alus are primate-specific endogenous SINEs and, with more than $10^{6}$ copies, are the most
successful transposons in humans. They are 300-bp non-coding nonautonomous sequences
composed of two different GC-rich monomers. The monomers derive from the 7SL RNA gene (the
nucleic acid component of the signal recognition particle) and are connected by an A-rich region.
Since Alus are nonautonomous REs, retrotransposition requires two basic components: ex-
pression of the RNA template and accessibility to the retrotransposition machinery. Transcrip-
tion is guaranteed by the presence of a bipartite RNA Polymerase III promoter in the left
monomer [17]. Specific Alu subfamilies share the 3’-terminal sequences with “partner” L1s [21],
and this can facilitate the recruitment of L1-encoded proteins. Due to the lack of coding se-
quences, Alu transcripts are not targeted by the translation machinery. Thus, the Alu RNA
can exist as a “free” RNA, eventually recruiting a Poly-A binding protein (PABP).
This may favor the localization of Alu ribonucleoproteins on ribosomes, increasing the prob-
ability of coming in close proximity to newly synthesized reverse transcriptase [17].
The relation between Alus and L1s is further supported by studies on another 7SL-derived
SINE, the rodent B1 element [17]. Despite the fact that Alu retrotranscription involves the L1
endonuclease, Alus are enriched in GC-rich (and gene-rich) regions. This bias seems to be
caused by post-insertion mechanisms [22].
Impact of TEs on the host genome: expansion, competition and
evolution
Transposable elements impact genome integrity in several ways, not only during their in-
sertion but also participating in post-insertion rearrangements and structural variations of the
genome [10, 23].
The predisposition of TEs towards involvement in genomic rearrangements is a consequence
of their ability to mobilize DNA, their abundance, and their high sequence identity [10]. Creating
homology regions between nonallelic sequences, TEs can potentially interact during both DNA
repair processes and the crossing over in meiotic recombination [8, 10].
Transposable elements are known to play a role in a number of genetic diseases such as
dystrophy and hemophilia (see for example [24, 25, 26, 17]). They have also been observed to
be mobilized in cancer development. This is not unexpected, since a typical characteristic of
cancer cells is their genomic instability, even though the cause-effect relation is still unclear [14,
27, 28, 29, 30]. TE mobilization, if occurring in somatic cells, can affect the phenotype at the single
cell level [13]. Somatic insertions have been detected in neuronal cells, while the occurrence of
such events in other (healthy) tissues is still under investigation [31].
[Figure 1.3 axes: Divergence (%) on the horizontal axis, Frequency on the vertical axis; time marks at 50 MYA and 100 MYA on top.]
Figure 1.3: Rate of amplification for different Alu subfamilies. Figure shows the frequency distribution
of Kimura divergence (corrected for the CpG content) of different Alu subfamilies from their relative consensus
sequence (see Appendix). Each subfamily experienced a peak of diffusion followed by inactivation. Indicative
time scale, obtained assuming $\approx 1.3 \cdot 10^{-9}$ mutations per nucleotide per year, is reported on top of the figure
[MYA = Million years ago].
It has recently been demonstrated that L1s can mediate genomic deletions [32]. The insertion
of an Alu element close to another Alu but on the opposite strand of the DNA can cause
the excision of the two elements [32]. On the other hand, Alus are believed to have a role in
duplications. This hypothesis is supported by the presence of Alus in the junctions of duplicated
regions [33, 34, 35].
Although these moves represent an important source of renewal in the genome, the unregu-
lated activity of TEs can cause significant genetic damage. Cells have developed mechanisms
to minimize TEs mobilization and limit their impact.
The first line of defense against the damaging effects of retrotransposons is to prevent their
expression at transcriptional and post-transcriptional level through methylation, PIWI proteins
and PIWI-interacting RNAs [5, 17].
In order to describe the expansion/inactivation dynamics of TEs and the competition be-
tween different subfamilies it is necessary to establish the age of single elements.
A retrotransposon, once inserted in a specific position in the host genome, accumulates
mutations over time. It is possible to measure the evolutionary distance (or divergence) between the
mutated RE sequence and the original sequence that was retrotransposed (consensus sequence).
This makes it possible to estimate the age of the insertion. The Jukes-Cantor and Kimura models
provide simple expressions for this divergence [36, 37, 38] (see Appendix).
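As an illustration of the divergence estimate, the following minimal Python sketch implements the standard textbook Jukes-Cantor and Kimura two-parameter formulas for a pair of aligned sequences. The function names, the toy sequences and the gap handling are assumptions made for this example; the CpG correction mentioned in the caption of Fig. 1.3 is not included, and the conversion to an age simply divides the divergence by the indicative rate quoted there.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def count_differences(seq, consensus):
    """Return (P, Q): fractions of transitions and transversions
    between two aligned sequences, ignoring gap positions ('-')."""
    pairs = [(a, b) for a, b in zip(seq.upper(), consensus.upper())
             if a != "-" and b != "-"]
    n = len(pairs)
    ts = sum(1 for a, b in pairs if (a, b) in TRANSITIONS)
    tv = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITIONS)
    return ts / n, tv / n

def jukes_cantor_distance(p):
    """Jukes-Cantor distance from the fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def kimura_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and
    transversion (Q) fractions."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the Kimura model")
    return -0.5 * math.log(a * math.sqrt(b))

# Toy example (hypothetical sequences): one transition, one transversion.
consensus = "ACGTACGTAC"
insertion = "ACATACGTCC"
P, Q = count_differences(insertion, consensus)
d_jc = jukes_cantor_distance(P + Q)
d_k2p = kimura_distance(P, Q)

# Indicative age, assuming the rate quoted in the caption of Fig. 1.3.
MUTATION_RATE = 1.3e-9  # mutations per nucleotide per year
age_years = d_k2p / MUTATION_RATE
```

When transitions make up exactly one third of the substitutions (P = Q/2), the two estimates coincide, which is a convenient sanity check.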
The distribution of the divergence between a specific REs subfamily and its consensus
sequence shows that insertion events are usually localized in time (Fig. 1.3). In fact, the
proliferation dynamics of Alu, L1 and other TE families is characterized by rapid bursts of amplification
followed by inactivation [9, 39, 40, 41].
The three major groups of Alu subfamilies J, S and Y experienced discrete waves of
retrotransposition activity, appearing at different times during primate evolution. Figure 1.3 shows
the divergence distribution and the peaks of diffusion for three subfamilies representative of
these groups (AluJb, AluSx and AluY respectively). The relation between the increase and
decrease of the rate of insertion for different pairs of Alus and L1s resembles the classical results
obtained for predator-prey models [17, 21, 39].
Studying the dynamics of birth and expansion of different subfamilies makes it possible to approach
these processes with models typical of population genetics and ecology [42, 43, 44, 45, 46].
These models deal with fitness disadvantages caused by TEs and copy number control, and can
also include the effect of silencing.
L1 regulatory regions (3’ and 5’UTR) have been frequently modified during evolution,
whereas the two open reading frames (ORF1 and ORF2) remained relatively conserved [5, 40].
This rapid evolution of regulatory regions is possibly due to a strategy to escape host silencing
mechanisms and bypass the competition for the retrotranscription machinery with
non-autonomous REs [39].
The evolution of different L1 subfamilies and their coexistence suggests the presence of
competition also within L1s for the recruitment of the polymerase [40].
The relationship between autonomous and nonautonomous elements is not restricted to L1s
and Alus in human: it has also been suggested in other organisms and for other pairs of
LINEs/SINEs, including for example B1/L1 in mouse [21, 47]. L2s and MIRs, similarly to Alus
and L1s, share the 3’ terminal sequences and, notably, nonautonomous MIRs became extinct
when L2s stopped proliferating [48].
Most of the REs that can be identified in the current human genome are inactive
fossil sequences, with few subfamilies still currently expanding [4]. Therefore, the genomic
distribution of REs reflects possible specific preferences or biases of the insertion mechanisms, but
also carries information about the most relevant evolutionary forces driving the rearrangements
of the host genome.
The next chapter introduces a model to describe the current distributions of genomic distances
between REs of different subfamilies, focusing on members of Alu and L1 families. These distri-
butions seem to be compatible with a process of insertion in approximately random positions,
followed by host genome expansion and segmental duplications.
Chapter 2
Models for REs distributions in
human genome
2.1 Null model
The distribution of the REs insertion sites can be considered almost random on chromosomal
and genomic scale [15, 20].
The “null” hypothesis of initial random insertion of REs in the genome suggests an analogy
with the stick-breaking model previously introduced to describe the fragmentation process of a
polymer chain [49]. In fact, Alus and L1s can be considered as point-like breaks placed on a
segment of length L equal to the genome size. Considering that Alus are ≈ 300 bp long and
most L1 insertions are less than 1 kbp long (see Fig. S2), the point-like approximation is
quite accurate.
Therefore, the distribution of inter-REs distances for elements randomly inserted in the
genome should be equivalent to the size distribution of fragments after the random scission of
a polymer.
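The analogy with random scission can be illustrated with a short simulation. The sketch below, with hypothetical toy values for the genome length and the number of elements, places point-like REs uniformly at random on a segment and collects the resulting inter-RE distances, including the two fragments adjacent to the segment ends.

```python
import numpy as np

def inter_re_distances(positions, genome_length):
    """Distances between consecutive point-like REs on [0, genome_length],
    including the two border fragments (B positions give B + 1 distances)."""
    pos = np.sort(np.asarray(positions, dtype=float))
    edges = np.concatenate(([0.0], pos, [float(genome_length)]))
    return np.diff(edges)

# Toy values chosen for illustration only (not the thesis datasets).
rng = np.random.default_rng(0)
L = 1_000_000   # segment (genome) length in bp
B = 1_000       # number of randomly inserted elements
positions = rng.uniform(0.0, L, size=B)
distances = inter_re_distances(positions, L)
mean_distance = distances.mean()  # equals L / (B + 1) by construction
```

The histogram of `distances` is the empirical counterpart of the fragment-size distribution discussed in the text.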
The processes of fragmentation and aggregation have been extensively studied in order to
describe different phenomena such as polymer and sequences degradation, breakup of liquid
droplets, and the crushing of rocks [49, 50, 51, 52].
A general expression for this kind of process is given by [51]:
\[
\begin{aligned}
\frac{\partial c(x,t)}{\partial t} = {} & -\int_0^{x} c(x,t)\,F(y,x-y)\,dy + 2\int_x^{\infty} c(y,t)\,F(x,y-x)\,dy \\
& + \int_0^{x} c(x-y,t)\,c(y,t)\,K(y,x-y)\,dy - 2\int_0^{\infty} c(x,t)\,c(y,t)\,K(y,x)\,dy,
\end{aligned}
\tag{2.1}
\]
where the functions K and F are the kernels that describe how the elements coagulate and
fragment and c(x, t) is the number of “pieces” of size x at time t.
We are interested in a random fragmentation process (K = 0) in which all bonds break with
equal probability. The solution of this process can be evaluated analytically [49].
Applying the solution in ref. [49] to the insertion of REs in the genome, the parameter time
should be substituted by the average number of REs per unit chain. The expected number of
distances equal to x after the random placement of B REs on a genome of length L is:
\[
c_0(x;B,L) =
\begin{cases}
\left[\dfrac{2B}{L} + \left(\dfrac{B}{L}\right)^{2}(L-x)\right] e^{-\frac{B}{L}x} & \text{for } 0 < x < L \\[2mm]
e^{-B} & \text{for } x = L
\end{cases}
\tag{2.2}
\]
The probability distribution p0(x;B,L) is obtained by normalizing c0(x;B,L) with the total
number of inter-REs distances B + 1.
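The normalization stated above can be checked numerically. The following sketch evaluates Eq. (2.2) and verifies that the continuous part, integrated over 0 < x < L, plus the weight e^{-B} at x = L, adds up to B + 1. The function names and the small test values of B and L are assumptions chosen so that the integral is easy to resolve on a grid.

```python
import numpy as np

def c0(x, B, L):
    """Expected number density of inter-RE distances of size x
    for 0 < x < L (continuous part of Eq. (2.2))."""
    x = np.asarray(x, dtype=float)
    rho = B / L
    return (2.0 * rho + rho**2 * (L - x)) * np.exp(-rho * x)

def p0(x, B, L):
    """Probability density obtained by normalizing c0 with B + 1."""
    return c0(x, B, L) / (B + 1)

def trapezoid(y, x):
    """Plain trapezoidal rule (kept local to avoid version-specific NumPy names)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Illustrative values only.
B, L = 50, 1000.0
x = np.linspace(0.0, L, 200_001)
total = trapezoid(c0(x, B, L), x) + np.exp(-B)  # continuous part + atom at x = L
```

Analytically the continuous part integrates to B + 1 - e^{-B}, so adding the term at x = L gives exactly B + 1.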
The above formula can be justified using very intuitive arguments. Considering a polymer
of length L and B “breaks” randomly placed on it the probability of insertion of a break in a
precise position is 1/L for all of them∗.
A fragment of length x can be obtained either by placing a single break at distance x from
one border of the polymer or placing two breaks at relative distance x. Moreover the remaining
B − 1 (or B − 2) breaks should fall out of the fragment.
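Hence, the probability of having a fragment of size x from one end of the polymer is
$\frac{1}{L}\left(1-\frac{x}{L}\right)^{B-1}$. Including the correct multiplicity, the expected number of fragments of size
x after placing B breaks is:
\[
\begin{aligned}
\pi_B(x,\text{border}) &= B\,\frac{1}{L}\left(1-\frac{x}{L}\right)^{B-1} \\
\pi_B(x,\text{center}) &= B(B-1)(L-x)\,\frac{1}{L^{2}}\left(1-\frac{x}{L}\right)^{B-2}.
\end{aligned}
\tag{2.3}
\]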
∗For simplicity of notation we consider that the insertion of a break does not reduce the number of the possible
insertion sites of the others.
[Figure 2.1 panels: (a) Alu Y, (b) Alu Jr, (c) Alu Sx on chromosome 1, GC-poor and GC-rich regions. Axes: Distance [bp] (horizontal), Frequency (vertical).]
Figure 2.1: The genomic distribution of retrotransposable elements shows deviations from random
placement. The figure shows examples of the empirical inter-REs distance distribution (symbols) of three different
Alu subfamilies in the human genome. Panels (a) and (b) refer to genomic distributions. Dashed black lines
represent the parameter-free analytical expectation given by the random-placement null model (Eq. (2.2)). Panel
(c) shows the inter-Alus distance distribution in GC-rich (> 41%) and GC-poor regions (symbols) in a single
chromosome, with the corresponding null model predictions (dashed and dotted-dashed line). More examples
supporting a similar deviation from the random placement expectation can be found in the Appendix.
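In the limit of large B (i.e. $B \approx B-1$) and small x ($x/L \ll 1$):
\[
\begin{aligned}
\pi_B(x,\text{border}) &\approx \frac{B}{L}\,e^{-x\frac{B}{L}} \\
\pi_B(x,\text{center}) &\approx \left(\frac{B}{L}\right)^{2}(L-x)\,e^{-x\frac{B}{L}}
\end{aligned}
\tag{2.4}
\]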
Since there are two borders of the chain c(x) = 2πB(x, border) + πB(x, center), and we
recover the result in Eq. (2.2).
We will refer to Eq. (2.2) as stick-breaking or null model distribution for inter-REs distances.
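The quality of the large-B, small-x approximation can be checked numerically. The sketch below compares the exact combinatorial counts of Eq. (2.3), summed as two border terms plus the central term, with the exponential form of Eq. (2.2). The values of B and L are illustrative assumptions of the same order as those used for the simulations in Fig. 2.3, not fitted quantities.

```python
import numpy as np

def exact_counts(x, B, L):
    """2 * pi_B(x, border) + pi_B(x, center) from Eq. (2.3)."""
    border = B / L * (1.0 - x / L) ** (B - 1)
    center = B * (B - 1) * (L - x) / L**2 * (1.0 - x / L) ** (B - 2)
    return 2.0 * border + center

def approx_counts(x, B, L):
    """Stick-breaking form of Eq. (2.2) for 0 < x < L."""
    rho = B / L
    return (2.0 * rho + rho**2 * (L - x)) * np.exp(-rho * x)

# Illustrative values: 10^4 elements on a 10^9 bp segment.
B, L = 10_000, 1.0e9
x = np.logspace(2, 5, 50)  # distances from 100 bp to 100 kbp
rel_err = np.abs(exact_counts(x, B, L) / approx_counts(x, B, L) - 1.0)
max_rel_err = rel_err.max()
```

For these values the two expressions agree to well below one part in a thousand over the whole range of x considered.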
Divergence from the null model
The stick-breaking distribution (2.2) depends only on the parameters B and L. In terms of
inter-REs distance distribution, these parameters are known, representing the number of REs
of a given subfamily and the genome length. Thus, Eq. (2.2) gives a parameter-free analytical
prediction of random-placement distribution.
Comparison between empirical data and the null model for several REs subfamilies clearly
shows evidence of non-random positioning. The disagreement between the distributions is
particularly evident for very short and very long inter-REs distances (see Fig. 2.1). This behavior
is also reproduced by single chromosome distributions and in GC-rich/GC-poor regions (see
Fig. 2.1(c), Appendix).
[Figure 2.2 panels: (a) and (b) Stick-breaking distance versus Jukes-Cantor divergence, with panel (b) split into GC-poor and GC-rich regions; (c) Stick-breaking distance versus Density B/L. Points are labeled with Alu subfamily names (Jb, Jo, Jr, Jr4, Sc, Sc5, Sc8, Sg, Sg4, Sg7, Sp, Sq, Sq2, Sq4, Sq10, Sx, Sx1, Sx3, Sx4, Sz, Sz6, Y, Ya5, Yb8, Yc, Ye5, Yf1, Yj4, Yk2, Yk3, Yk4, Ym1).]
Figure 2.2: The distance from the null model increases with subfamily age. Left and middle panels
show the distance from the null model for different Alu subfamilies calculated over the whole genome (a) and in
GC-rich/poor regions (b) as a function of Jukes-Cantor divergence from the consensus sequence. The correlation
coefficient is > 0.7 in all three cases. Panel (c) shows the distance from the null model as in panel (a) as a function of
the density (B/L) of Alu subfamilies. Different colors in panels (a) and (c) identify the three major Alu groups:
Y (red), S (orange) and J (green).
Defining a distance between data and the null model can give a quantitative estimate of
the observed deviations. We define the distance of the distribution of a given subfamily i from
the stick-breaking prediction as:
\[
D_{SB} = \int_0^{L_i} \frac{B_i}{L_i}\left|\left[1-\left(1-\frac{xB_i}{L_i}\,\frac{1}{B_i+1}\right)e^{-xB_i/L_i}\right]-f_i(x)\right| dx .
\tag{2.5}
\]
The first term in the integral is the cumulative distribution of the null model and $f_i(x)$ is the
cumulative frequency of the empirical distances.
The cumulative stick-breaking distribution can be expressed using the rescaled variable
$y = xB_i/L_i$. Since the quantity $x/L$ is expected to be very small compared to $B$†, we can neglect
the linear term $y/(B_i+1)$. In this case, Eq. (2.5) can be approximated as:
\[
D_{SB} \approx \int_0^{B_i} \left|\left(1-e^{-y}\right)-f_i(y)\right| dy.
\tag{2.6}
\]
The above expression shows that the distances we calculated can be compared between different
subfamilies. In fact, we are measuring the distance of different empirical distributions from
almost the same function‡ $\approx 1-e^{-y}$.
†For our datasets it is usually $x/L < 10^{-3}$.
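A possible numerical implementation of the approximate distance in Eq. (2.6) is sketched below. It is not the code used for the analysis in this work: the function name, the grid resolution and the treatment of the tail (integrated analytically beyond the largest observed distance, where the empirical cumulative frequency equals 1) are assumptions made for illustration. The two test cases are synthetic: random placement, and perfectly regular spacing.

```python
import numpy as np

def stick_breaking_distance(distances, n_grid=20_001):
    """Approximate D_SB (Eq. (2.6)) from a set of B + 1 inter-RE distances."""
    d = np.asarray(distances, dtype=float)
    B = d.size - 1
    L = d.sum()
    y_data = np.sort(d * B / L)           # rescaled variable y = x B / L
    y = np.linspace(0.0, y_data[-1], n_grid)
    # Empirical cumulative frequency f_i(y) evaluated on the grid.
    f_emp = np.searchsorted(y_data, y, side="right") / y_data.size
    integrand = np.abs((1.0 - np.exp(-y)) - f_emp)
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(y))
    # Beyond the largest distance f_i = 1, so the integrand is exp(-y).
    tail = np.exp(-y_data[-1]) - np.exp(-float(B))
    return float(integral + tail)

# Synthetic examples (illustrative values only).
rng = np.random.default_rng(0)
B, L = 5_000, 3.0e9
breaks = np.sort(rng.uniform(0.0, L, size=B))
d_random = np.diff(np.concatenate(([0.0], breaks, [L])))
d_regular = np.full(B + 1, L / (B + 1))   # equally spaced elements

D_random = stick_breaking_distance(d_random)
D_regular = stick_breaking_distance(d_regular)
```

Random placement gives a small value that shrinks as B grows, whereas the regular arrangement gives a distance close to 2/e, so the measure separates the two cases clearly.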
The calculated distances are well correlated with the average Jukes-Cantor divergence of
different subfamilies and thus, with a certain approximation, with their age (r ≈ 0.7 and
r ≈ 0.5 for Alus and L1s respectively, see Fig. 2.2a and Fig. S5). A similar result is also
obtained considering the distribution of distances between Alus in GC-rich and GC-poor regions
(Fig. 2.2). Moreover, the divergence from the stick-breaking expectation increases with the
density of the subfamilies (Fig. 2.2c). Together these observations suggest the existence of a
time-dependent mechanism, active in both GC-rich and GC-poor regions, capable of reshaping the
inter-REs distance distribution in a similar way.
2.2 Expansion-duplication model
There are different processes capable of reshaping DNA architecture in eukaryotes.
Some of them, such as insertion of transposable elements or pseudogenes, have the net effect of
expanding the genome, and thus the inter-REs distances.
Duplication and deletion events can influence not only the reciprocal distance between REs
but also their number. Translocations and inversions, by contrast, “change the order” of the
REs, modifying only the two inter-REs distances adjacent to the borders of the translocated or
inverted region. Since the null model of stick-breaking does not involve the order of inter-REs
distances but only their size, we can essentially ignore the contributions of translocations and
inversions.
‡Note that the upper limit of the integral, that depends on the specific subfamily i, is never reached. In fact
$y = B_i$ would imply $x = L_i$.
Under the hypothesis of randomly distributed events, the probability of insertion or deletion
in a certain position is uniform over the whole genome. As a consequence, the probability that
an expansion/deletion event modifies an inter-REs distance x is proportional to x. It is possible
to write a very general equation for the evolution of the distances distribution:
\[
\begin{aligned}
c(x,t+1) = c(x,t) & + \gamma_e\,\frac{x-\lambda_e}{L(t)}\,c(x-\lambda_e,t) - \gamma_e\,\frac{x}{L(t)}\,c(x,t) \\
& + \gamma_d\,\frac{x+\lambda_d}{L(t)}\,c(x+\lambda_d,t) - \gamma_d\,\frac{x}{L(t)}\,c(x,t) \\
& + \mu_+ q_+(x,t) - \mu_- q_-(x,t).
\end{aligned}
\tag{2.7}
\]
The first terms on the right account for the increase and decrease in size of preexisting distances
due to insertions and deletions of length λe and λd respectively. The last two terms, q+ and q−,
are generic source terms for gain or loss of inter-REs distances. Note that the gain (loss) of a
distance implies the increase (decrease) of the number of REs. The total number of distances
between B elements always equals B+1. As a consequence, the normalization of c(x, t) changes
over time.
It is possible to generalize Eq. (2.7) including a distribution of lengths ρ(λ).
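In the limit $x \gg \lambda_d$, $x \gg \lambda_e$, Eq. (2.7) can be approximated by introducing partial derivatives
\[
\begin{aligned}
\frac{\partial c(x,t)}{\partial t} = {} & -\gamma_e\,\frac{\lambda_e}{L(t)}\,\frac{\partial\,(x\,c(x,t))}{\partial x}
+ \gamma_d\,\frac{\lambda_d}{L(t)}\,\frac{\partial\,(x\,c(x,t))}{\partial x} \\
& + \mu_+ q_+(x,t) - \mu_- q_-(x,t)
\end{aligned}
\tag{2.8}
\]
and it assumes the simple form:
\[
\frac{\partial c(x,t)}{\partial t} = -\gamma\,\frac{\lambda}{L(t)}\,\frac{\partial\,(x\,c(x,t))}{\partial x} + \mu\,q(x,t).
\tag{2.9}
\]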
The factor γλ = γeλe− γdλd is the effective expansion/deletion parameter and q(x, t) is the
effective source term. In the next section we will consider both these quantities as positive,
assuming that expansions and insertions dominate over the deletions. However, the solution of
Eq. (2.9) would be exactly the same considering the opposite situation.
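The step from Eq. (2.7) to Eq. (2.8) can be made explicit with a first-order Taylor expansion. The lines below are a sketch, assuming that $x\,c(x,t)$ varies slowly on the scales $\lambda_e$ and $\lambda_d$ and that one discrete update corresponds to one unit of time:

```latex
\begin{align*}
(x-\lambda_e)\,c(x-\lambda_e,t) - x\,c(x,t)
  &\approx -\lambda_e\,\frac{\partial\,(x\,c(x,t))}{\partial x}, \\
(x+\lambda_d)\,c(x+\lambda_d,t) - x\,c(x,t)
  &\approx +\lambda_d\,\frac{\partial\,(x\,c(x,t))}{\partial x}, \\
c(x,t+1) - c(x,t)
  &\approx \frac{\partial c(x,t)}{\partial t}.
\end{align*}
```

Multiplying the first line by $\gamma_e/L(t)$ and the second by $\gamma_d/L(t)$ and adding them reproduces the first two terms of Eq. (2.8), whose net coefficient is $-(\gamma_e\lambda_e-\gamma_d\lambda_d)/L(t) = -\gamma\lambda/L(t)$, as in Eq. (2.9).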
Pure expansion model
* Continuous case
We solve Eq. (2.9) in the case of “pure expansion”, ignoring the source term q, using the
method of characteristics.
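Rewriting the equation as
$\frac{\partial c(x,t)}{\partial t} = -\gamma\frac{\lambda}{L(t)}\,x\,\frac{\partial c(x,t)}{\partial x} - \gamma\frac{\lambda}{L(t)}\,c(x,t)$
and introducing the variable k we obtain:
\[
\frac{dc(k)}{dk} = -\gamma\,\frac{\lambda}{L(k)}\,c(k)
\quad\text{where}\quad
\begin{cases}
\dfrac{dt}{dk} = 1 \\[2mm]
\dfrac{dx}{dk} = \gamma\,\dfrac{\lambda}{L(k)}\,x
\end{cases}
\tag{2.10}
\]
The variable k can be identified with the time and, substituting $k \to t$, the solution is given
by:
\[
\begin{cases}
\dfrac{dx}{dt} = \gamma\,\dfrac{\lambda}{L(t)}\,x \\[2mm]
\dfrac{dc(x(t),t)}{dt} = -\gamma\,\dfrac{\lambda}{L(t)}\,c(x(t),t)
\end{cases}
\Rightarrow
\begin{cases}
x(t) = x_0\, e^{\int_0^t \gamma\frac{\lambda}{L(t')}dt'} \\[2mm]
c(x(t),t) = c(x_0,0)\, e^{-\int_0^t \gamma\frac{\lambda}{L(t')}dt'}
\end{cases}
\tag{2.11}
\]
Since the term $e^{-\int_0^t \gamma\frac{\lambda}{L(t')}dt'}$ does not depend on x, the evolved distribution has the same
functional form as the initial condition: the expansion process does not modify the shape of the
distribution.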
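Imposing as initial condition
\[
c(x_0, 0) = c_0(x_0; B_0, L_0) = \left(2\frac{B_0}{L_0} + \left(\frac{B_0}{L_0}\right)^2 (L_0 - x_0)\right) e^{-\frac{B_0}{L_0} x_0} + e^{-B_0}\, \delta(x_0 - L_0),
\]
the expanded distribution results:
\[
c(x(t), t) = \left[\left(2\frac{B_0}{L} + \left(\frac{B_0}{L}\right)^2 \left(L - x(t)\right)\right) e^{-\frac{B_0}{L} x(t)} + e^{-B_0}\, \delta\left(x(t) - L\right)\right]
\tag{2.12}
\]
where $L = L_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'}$.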
For this model, the total number of REs is constant and the normalization condition $\int_0^{L(t)} c(x(t), t)\, dx(t) = B_0 + 1$ must hold. As a consequence, we obtain L = L(t) and the simple relation
\[
c(x(t), t) = c_0\left(x(t); B_0, L(t)\right)
\tag{2.13}
\]
between the null model and the expanded distribution.
[Figure 2.3: two panels of Counts versus Distance [bp]. (a) "Expansion"; legend: T=0, λ=50, λ=500, λ=5000. (b) "Expansion removal"; legend: T=0, Lf=1.2Gbp, Lf=3Gbp.]
Figure 2.3: Random expansion does not affect a stick-breaking distribution. (a) The panel shows the
simulated distribution of distances (orange dots) between B0 = 10^4 points randomly placed on a segment of
length L0 = 10^9. The expansion process was simulated using different values of λ (different symbols, as in legend).
The final genome size is Lf = 3·10^9 for all the distributions. Black dashed lines indicate the null model prediction
as in Eq. 2.13. (b) The panel shows the simulated distribution of distances (orange dots) between B0 = 5·10^4
points randomly placed on a segment of length L0 = 10^9. We simulated expansion (λ = 5000) and removal of
points at different rates (different symbols, as in legend). The final number of points is Bf = 10^4 for both the
distributions. Black dashed lines correspond to the null model distribution with the correct number of REs and
genome size. Continuous red and green lines represent the solutions for x < λ (Eq. (2.21)) and λ ≤ x ≤ 2λ
(Eq. (2.22)) respectively.
Moreover, it is possible to derive an explicit solution for the genome size as a function of
time:
\[
L(t) = L_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'} \;\rightarrow\; L(t) = L_0 + \gamma\lambda t
\tag{2.14}
\]
The linear increase of L in time, obtained through analytical arguments, is not unexpected
since the insertion rate γ is kept constant.
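The invariance of the stick-breaking form under expansion can be checked numerically. The following sketch is only an illustration under the assumptions of this section (pure expansion, L(t) = L0 + γλt, so that the exponential factor in Eq. (2.11) equals L0/L); the function names are hypothetical. It follows a characteristic back to x0, rescales the initial condition, and compares the result with c0 evaluated at the expanded genome size, as in Eq. (2.13). It also checks the normalization B0 + 1 by a simple trapezoidal integration.

```python
import math


def c0(x, B, L):
    """Continuous part of the stick-breaking distribution c0(x; B, L), 0 <= x < L."""
    b = B / L
    return (2.0 * b + b * b * (L - x)) * math.exp(-b * x)


def expanded(x, B0, L0, L):
    """Eq. (2.11): trace x back to x0 = x * L0 / L and rescale by L0 / L."""
    x0 = x * L0 / L
    return c0(x0, B0, L0) * L0 / L


def integrate_c0(B, L, steps=100_000):
    """Trapezoidal integral of the continuous part of c0 over [0, L]."""
    h = L / steps
    total = 0.5 * (c0(0.0, B, L) + c0(L, B, L))
    for i in range(1, steps):
        total += c0(i * h, B, L)
    return total * h


# Example values (hypothetical): B0 = 50 breaks, genome expanding from 1e6 to 3e6.
print(expanded(1.0e5, 50, 1.0e6, 3.0e6), c0(1.0e5, 50, 3.0e6))
```

The delta term at x = L carries the remaining weight exp(-B), which has to be added to the integral of the continuous part to recover B + 1.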
When the deletion term in Eq. (2.8) dominates over expansion, we obtain exactly the above
solution where λγ < 0. In this case both x and L decrease in time.
Note that we should impose some realistic limit to the increase (decrease) of genome size.
However, we are interested in modeling a “realistic” expansion process spanning a finite time.
We consider L(t) and L0 to be of the same order of magnitude.
For short distances (i.e. λ > x) the discrete equation (2.7) should be modified and its
continuous approximation is no longer valid. In this region the inter-REs distance x is not
expanding and the distribution evolves according to:
\[
c(x, t+1) = c(x, t) - \gamma \frac{x}{L(t)}\, c(x, t)
\;\Rightarrow\;
\frac{\partial c(x,t)}{\partial t} = -\gamma \frac{x}{L(t)}\, c(x, t).
\tag{2.15}
\]
Imposing L(t) as in Eq. (2.14), the evolved distribution at time t results:
\[
c(x, t) = c(x, 0)\, e^{-\gamma x \int \frac{dt}{L(t)}} = c_0(x; B_0, L_0) \left(\frac{L_0}{L(t)}\right)^{x/\lambda}.
\tag{2.16}
\]
The complete solution should be given by a superimposition of Eq. (2.13) and Eq. (2.16).
* Discrete case
In the case of a discrete system, it is possible to solve the expansion process for small
(x < λ ) and intermediate distances (x ≈ λ).
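Considering an initial point xa < λ and the successive distances xb = xa + λ, xc = xb + λ, ...
we can write:
\[
\begin{aligned}
c(x_a, t+1) &= c(x_a, t) - \gamma \frac{x_a}{L(t)}\, c(x_a, t) \\
c(x_b, t+1) &= c(x_b, t) - \gamma \frac{x_b}{L(t)}\, c(x_b, t) + \gamma \frac{x_a}{L(t)}\, c(x_a, t) \\
c(x_c, t+1) &= c(x_c, t) - \gamma \frac{x_c}{L(t)}\, c(x_c, t) + \gamma \frac{x_b}{L(t)}\, c(x_b, t) \\
&\;\;\vdots
\end{aligned}
\tag{2.17}
\]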
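Hence, we can rewrite the first equation as
\[
c(x_a, t+1) = c(x_a, 0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right),
\tag{2.18}
\]
while, with some additional manipulations§, the second equation results:
\[
c(x_b, t+1) = c(x_b, 0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_b}{L(i)}\right)
+ \sum_{k=0}^{t} c(x_a, k)\, \frac{\gamma x_a}{L(k)} \prod_{j=k+1}^{t} \left(1 - \frac{\gamma x_b}{L(j)}\right).
\tag{2.19}
\]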
§For simplicity of notation, we assume here and in the following equations that $\prod_{i=k+1}^{t} \left(1 - \frac{\gamma x}{L(i)}\right)$ is equal to 1 for k = t.
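Considering γλ << L0, it is possible to introduce the approximation:
\[
\prod_{i=n}^{m} \left(1 - \frac{\gamma x}{L_i}\right) \approx
\left(\prod_{i=n}^{m} \left(1 - \frac{\gamma \lambda}{L(i)}\right)\right)^{\frac{\gamma x}{\gamma \lambda}}
= \left(\frac{L(n-1)}{L(m)}\right)^{x/\lambda}.
\tag{2.20}
\]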
Using this approximation, Eq. (2.18) becomes
\[
c(x_a, t+1) \approx c(x_a, 0) \left(\frac{L_0}{L(t)}\right)^{x_a/\lambda} \left(1 - \frac{\gamma x_a}{L_0}\right)
\tag{2.21}
\]
similarly to Eq. (2.16).
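The quality of the approximation in Eq. (2.20) can be checked with a few lines of code. The sketch below is an illustration only: it assumes the linear growth L(i) = L0 + γλi of Eq. (2.14), and the function names and parameter values are hypothetical. It compares the exact product of Eq. (2.18) with the closed form of Eq. (2.21), both divided by c(xa, 0).

```python
def exact_decay(x, gamma, lam, L0, t):
    """Product in Eq. (2.18): prod_{i=0}^{t} (1 - gamma*x/L(i)), L(i) = L0 + gamma*lam*i."""
    prod = 1.0
    for i in range(t + 1):
        prod *= 1.0 - gamma * x / (L0 + gamma * lam * i)
    return prod


def approx_decay(x, gamma, lam, L0, t):
    """Right-hand side of Eq. (2.21) divided by c(x_a, 0)."""
    Lt = L0 + gamma * lam * t
    return (L0 / Lt) ** (x / lam) * (1.0 - gamma * x / L0)


# Hypothetical parameters with gamma*lam << L0; the genome doubles in size.
params = dict(gamma=1.0, lam=50, L0=1.0e6, t=20_000)
for x in (10, 40):
    print(x, exact_decay(x, **params), approx_decay(x, **params))
```

With these values the two expressions agree to several significant digits, as expected when γλ << L0.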
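Substituting this solution in Eq. (2.19) allows us to calculate the distribution at the point xb:
\[
\begin{aligned}
c(x_b, t+1) &\approx c(x_b, 0) \left(\frac{L_0}{L(t)}\right)^{x_b/\lambda} \left(1 - \frac{\gamma x_b}{L_0}\right) \\
&\quad + c(x_a, 0) \left(\frac{L_0}{L(t)}\right)^{x_a/\lambda} \left(1 - \frac{\gamma x_a}{L_0}\right) \frac{\gamma x_a}{L(t)} \sum_{k=0}^{t} \left(1 + \frac{\gamma x_a}{L(k-1)}\right) \\
&\approx c(x_b, 0) \left(\frac{L_0}{L(t)}\right)^{x_b/\lambda} \left(1 - \frac{\gamma x_b}{L_0}\right)
+ c(x_a, 0) \left(\frac{L_0}{L(t)}\right)^{x_a/\lambda} \left(1 - \frac{\gamma x_a}{L_0}\right) \frac{\gamma x_a}{L(t)}\, t.
\end{aligned}
\tag{2.22}
\]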
In the last step we neglected a contribution of the order ≈ xb/λ ln(L(t)/L0) ≈ (xb/L0) t << t.
In the point xc = xb + λ a similar expression for the evolved distribution can be derived.
Increasing the distance from the initial point xa, the correction due to c(xa, t) becomes less
relevant. In the limit of distances x >> λ it is possible to derive a solution equivalent to the
one obtained from the continuous approximation.
We verified by simulations that the expansion process does not change the shape of the
distribution for initial random insertions and that the estimates in Eq. (2.21) and Eq. (2.22) perform
quite well (see Figure 2.3 and the Appendix).
Moreover, we simulated random removals of REs. The intuitive result is that this process
decreases the parameter B0, without affecting the shape of the distribution. Thus, an expansion-
removal process would lead to a distribution c(x(t), t) = c0(x(t);B(t), L(t)). An equivalent result
can be obtained with random insertion of new REs.
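A minimal simulation of the expansion process can be sketched as follows. This is not the code used for Figure 2.3: the function names, the parameter values and the choice of summary statistic are assumptions made for illustration. Points are placed uniformly at random, and each insertion of length λ lands in a gap with probability proportional to the gap length. For an exponential-like distribution, the fraction of distances larger than the mean is expected to stay close to 1/e.

```python
import random


def simulate_expansion(B0, L0, lam, Lf, seed=0):
    """Place B0 random points on [0, L0], then insert segments of length lam until size Lf."""
    rng = random.Random(seed)
    points = sorted(rng.uniform(0, L0) for _ in range(B0))
    edges = [0.0] + points + [float(L0)]
    gaps = [b - a for a, b in zip(edges, edges[1:])]  # B0 + 1 distances
    n_insertions = round((Lf - L0) / lam)
    indices = range(len(gaps))
    for _ in range(n_insertions):
        # A gap of length x is hit with probability x / L(t).
        i = rng.choices(indices, weights=gaps)[0]
        gaps[i] += lam
    return gaps


def fraction_above_mean(gaps):
    """Fraction of distances longer than the mean distance."""
    mean = sum(gaps) / len(gaps)
    return sum(g > mean for g in gaps) / len(gaps)


# Small demonstration run (hypothetical sizes, much smaller than in Figure 2.3).
demo = simulate_expansion(B0=200, L0=2.0e5, lam=50, Lf=4.0e5, seed=0)
print(len(demo), sum(demo), fraction_above_mean(demo))
```

Working with the list of gaps rather than with point coordinates avoids shifting all downstream positions after every insertion.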
Expansion and insertion
The solution of the pure expansion model presented above can be extended in order to account
for the presence of an external source as in Eq. (2.9).
In the continuous case, using the method of characteristics and rewriting Eq. (2.9) as a
total derivative of the function xc(x, t), the solution is straightforward:
\[
\begin{cases}
\dfrac{dx}{dt} = \gamma \dfrac{\lambda}{L(t)}\, x \\[2mm]
\dfrac{1}{x(t)} \dfrac{d\left(x\, c(x,t)\right)}{dt} = \mu q(x)
\end{cases}
\Rightarrow\;
c(x(t), t) = \frac{x_0}{x(t)}\, c(x_0, 0) + \frac{\mu}{x(t)} \int_0^t q(x(t'))\, x(t')\, dt'.
\tag{2.23}
\]
We assume a linear increase of the genome size L(t) = L0 + ϕt, where ϕ is a combined
function of expansion and insertion contributions. As a consequence, the relation between x
and L is given by:
\[
\frac{dx}{x} = \frac{\lambda \gamma}{\phi} \frac{dL}{L}
\;\rightarrow\;
x(t) = x_0 \left(\frac{L(t)}{L_0}\right)^{\gamma\lambda/\phi}.
\tag{2.24}
\]
Substituting c0(x0; B0, L0) as initial condition, the first term on the right in Eq. (2.23) is still
a stick-breaking function:
\[
\frac{x_0}{x(t)}\, c_0(x_0; B_0, L_0) = \left(2\frac{B_0}{L_E(t)} + \left(\frac{B_0}{L_E(t)}\right)^2 \left(L_E(t) - x(t)\right)\right) e^{-B_0 x(t)/L_E(t)}.
\tag{2.25}
\]
The expanded genome is now $L_E(t) = L(t) \left(\frac{L(t)}{L_0}\right)^{\gamma\lambda/\phi - 1}$, since the whole genome L(t)
contains also the contributions of the (expanded) source.
In the limit ϕ→ γλ we recover the result of the former section.
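The closed form in Eq. (2.24) can be compared with a direct numerical integration of the characteristic equation dx/dt = γλ x / L(t) with L(t) = L0 + ϕt. The sketch below uses a plain Euler scheme; the function names and parameter values are hypothetical and serve only as an illustration. It also evaluates the expanded genome size LE(t) and its limit ϕ → γλ.

```python
def x_closed_form(x0, L0, Lt, gamma_lam, phi):
    """Eq. (2.24): x(t) = x0 * (L(t)/L0)^(gamma*lambda/phi)."""
    return x0 * (Lt / L0) ** (gamma_lam / phi)


def expanded_genome(L0, Lt, gamma_lam, phi):
    """L_E(t) = L(t) * (L(t)/L0)^(gamma*lambda/phi - 1)."""
    return Lt * (Lt / L0) ** (gamma_lam / phi - 1.0)


def x_euler(x0, L0, gamma_lam, phi, t_end, steps):
    """Euler integration of dx/dt = gamma*lambda * x / (L0 + phi*t)."""
    dt = t_end / steps
    x, t = x0, 0.0
    for _ in range(steps):
        x += dt * gamma_lam / (L0 + phi * t) * x
        t += dt
    return x


# Hypothetical parameters: gamma*lambda = 500, phi = 800, L grows from 1e9 to 1.8e9.
L0, gamma_lam, phi, t_end = 1.0e9, 500.0, 800.0, 1.0e6
Lt = L0 + phi * t_end
print(x_closed_form(1.0e4, L0, Lt, gamma_lam, phi),
      x_euler(1.0e4, L0, gamma_lam, phi, t_end, steps=100_000))
```

Setting phi equal to gamma_lam in expanded_genome returns L(t) itself, which is the pure expansion result.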
The discrete solution of a generic expansion-insertion model is quite complex.
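For x < λ Eq. (2.18) becomes
\[
c(x_a, t+1) = c(x_a, 0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right)
+ q(x_a) \sum_{k=0}^{t} \prod_{i=k+1}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right)
\tag{2.26}
\]
and successive manipulations lead to:
\[
c(x_a, t+1) \approx c(x_a, 0) \left(\frac{L_0}{L(t)}\right)^{\gamma x_a/\phi} \left(1 - \frac{\gamma x_a}{L_0}\right)
+ q(x_a)\,(t+1) \left(1 - \frac{\gamma x_a t}{2 L_t}\right).
\tag{2.27}
\]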
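For a generic coordinate xi the additional contribution of the source depends on q at different
time points and at different positions xi − λ:
\[
\begin{aligned}
& q(x_i) \sum_{k=0}^{t} \prod_{j=k+1}^{t} \left(1 - \frac{\gamma x_i}{L_j}\right) \\
&+ q(x_i - \lambda) \Bigg\{ \sum_{k=0}^{t-1} \prod_{j=k+1}^{t-1} \left(1 - \frac{\gamma (x_i - \lambda)}{L_j}\right) \gamma \frac{x_i - \lambda}{L_t} \\
&\qquad + \sum_{k=0}^{t-2} \prod_{j=k+1}^{t-2} \left(1 - \frac{\gamma (x_i - \lambda)}{L_j}\right) \gamma \frac{x_i - \lambda}{L_{t-1}} \left(1 - \gamma \frac{x_i}{L_t}\right) + \dots \Bigg\} + \dots
\end{aligned}
\tag{2.28}
\]
Unfortunately the exact solution is not easy to compute, but it is expected to be roughly
proportional to $\frac{1}{x_i} \sum_{m=0}^{t} q(x_i - m\lambda) \prod_{n=0}^{m} \gamma (x_i - n\lambda)\, t$ ¶.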
In order to fully solve the expansion-insertion model we have to choose an explicit form for
the source function q(x).
Segmental duplication is a relevant mechanism for genome evolution. The duplication of a
DNA sequence can imply the passive duplication of any retrotransposons contained in
the sequence. The source term q(x, t) can be seen as the effective source of inter-REs distances
resulting from these events.
Considering duplications of length κ << L and assuming a Poissonian distribution for the
number n of duplicated retrotransposable elements (i.e. $p(n, D) = \frac{D^n}{n!} e^{-D}$) the source results:
\[
q(x) = \sum_{n=1}^{\infty} \frac{D^n}{n!} e^{-D}\, \pi_n(x),
\tag{2.29}
\]
where πn(x) is the expected number of duplicated inter-REs distances of length x.
The case n = 0 corresponds to expansion events and can be included in the factor γλ.
At the beginning of the duplication process D = κB0/L0. Assuming that the genome size
increases as L(t) = L0 + ϕt = L0 + (γλ + µκ)t and B(t) = B(t − 1) + µD(t − 1), the density
as a function of time is:
\[
\frac{B(t)}{L(t)} = \frac{B_0}{L_0} \left(1 - \frac{\gamma\lambda}{L_0}\right) \left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\phi}.
\tag{2.30}
\]
¶We implicitly consider the condition xi > nλ.
If the contribution of expansion is negligible (γλ ≈ 0) the duplication process implies a
conservation of the mean density.
D is a mean expected value, while the actual number of duplicated insertions is random.
Therefore, the above density has to be considered as a mean value.
A local duplication model allows us to calculate πn(x) and the source term explicitly. In
fact, during local duplication all the internal distances are copied and a new inter-REs distance
is created. The length of the new distance is equal to the sum of those at the borders of the
sequence κ ‖. This is not precise for non-local duplication where, in general, two new random
distances are created. However, we expect that the results obtained for the local duplication
process are valid for non-local duplication when D >> 1 (i.e. when the effect of the newly
created distances is negligible).
Following the same approach adopted in the previous section, we assume that the probability
of finding a retrotransposon in a precise position is 1/κ. Therefore, the number of duplicated
inter-REs distances equal to x is:
\[
\pi_n(x, \text{internal}) = \frac{n(n-1)}{\kappa^2} (\kappa - x) \left(1 - \frac{x}{\kappa}\right)^{n-2}.
\tag{2.31}
\]
The probability that the sum of the lengths at the borders of the duplicated sequence equals
x is:
\[
\pi_n(x, \text{borders}) = \frac{n(n-1)}{\kappa^2}\, x \left(1 - \frac{x}{\kappa}\right)^{n-2}.
\tag{2.32}
\]
In the particular case n = 1, π1(x, internal) = 0 and π1(x, borders) = δ(κ − x).
The integral $\int_0^{\kappa} dx \left(\pi_n(x, \text{internal}) + \pi_n(x, \text{borders})\right)$ is equal to the total number n of inter-
REs distances added by the process∗∗.
‖This is very intuitive observing Fig. S12.
∗∗This is the total number of fragments given n cuts on a circular polymer.
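The effective source function results:
\[
\begin{aligned}
q(x) &= \sum_{n=1}^{\infty} \frac{D^n}{n!} e^{-D} \left[\pi_n(x, \text{borders}) + \pi_n(x, \text{internal})\right] \\
&= \sum_{n=1}^{\infty} \frac{D^n}{n!} e^{-D} \left[\frac{n(n-1)}{\kappa^2}\, \kappa \left(1 - \frac{x}{\kappa}\right)^{n-2} + \delta(n-1)\, \delta(\kappa - x)\right] \\
&= \frac{D^2}{\kappa} e^{-Dx/\kappa} + D e^{-D} \delta(\kappa - x)
\end{aligned}
\tag{2.33}
\]
with x ≤ κ.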
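The resummation leading to Eq. (2.33) can be verified numerically. The sketch below (illustrative only; function names and parameter values are hypothetical) sums the Poisson-weighted terms of Eqs. (2.31) and (2.32) for x < κ and compares the result with the closed form D²/κ · exp(−Dx/κ). It also checks that the continuous parts of πn integrate to n for n ≥ 2.

```python
import math


def pi_internal(n, x, kappa):
    """Eq. (2.31), valid for n >= 2 and 0 <= x < kappa."""
    return n * (n - 1) / kappa**2 * (kappa - x) * (1 - x / kappa) ** (n - 2)


def pi_borders(n, x, kappa):
    """Eq. (2.32), valid for n >= 2 and 0 <= x < kappa."""
    return n * (n - 1) / kappa**2 * x * (1 - x / kappa) ** (n - 2)


def source_series(x, D, kappa, n_max=200):
    """Continuous part of Eq. (2.33), summed term by term from n = 2."""
    total = 0.0
    weight = math.exp(-D) * D  # Poisson weight D^n e^{-D} / n! for n = 1
    for n in range(2, n_max + 1):
        weight *= D / n        # update the weight to the current n
        total += weight * (pi_internal(n, x, kappa) + pi_borders(n, x, kappa))
    return total


def source_closed_form(x, D, kappa):
    """Continuous part of the last line of Eq. (2.33)."""
    return D**2 / kappa * math.exp(-D * x / kappa)


def distances_added(n, kappa, steps=10_000):
    """Midpoint-rule integral of pi_internal + pi_borders over [0, kappa]."""
    h = kappa / steps
    return h * sum(pi_internal(n, (i + 0.5) * h, kappa) + pi_borders(n, (i + 0.5) * h, kappa)
                   for i in range(steps))


print(source_series(1234.0, 5.0, 1.0e4), source_closed_form(1234.0, 5.0, 1.0e4))
```

The n = 1 term contributes only the delta function at x = κ, which is why the series for the continuous part starts at n = 2.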
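Given a fixed ratio B(t)/L(t) = b, the source of duplication has essentially the functional
form of a stick-breaking distribution:
\[
q(x) =
\begin{cases}
\left(\dfrac{B(t)}{L(t)}\right)^2 \kappa\, e^{-\frac{B(t)}{L(t)} x} & x < \kappa \\[3mm]
\dfrac{B(t)}{L(t)}\, \kappa\, e^{-\frac{B(t)}{L(t)} \kappa} & x = \kappa
\end{cases}
\tag{2.34}
\]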
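The substitution of Eq. (2.34) in Eq. (2.23) would lead to a source term ∝ e^{−D} and a second
term proportional to
\[
\left(\frac{1}{x(t)}\right)^{1+\phi/\gamma\lambda}
\left\{ \Gamma\left[1 + \frac{\phi}{\gamma\lambda},\, \frac{D x(t)}{\kappa} \left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\phi}\right]
- \Gamma\left[1 + \frac{\phi}{\gamma\lambda},\, \frac{D x(t)}{\kappa}\right] \right\}.
\tag{2.35}
\]
A series expansion of the incomplete gamma functions Γ[a, x], assuming $\left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\phi} \approx 1$,
gives an exponential dependence of the source term on x.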
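Including the effect of expansion on the density as in Eq. (2.30) the expanded source results:
\[
\frac{\mu}{x(t)} \int dt'\, q(x')\, x' =
\frac{\mu}{\phi - \gamma\lambda} \left(\frac{B(t)}{L(t)}\right)^2 x(t)\, \kappa\, e^{-\frac{B(t)}{L(t)} x(t)} \left(\frac{L(t)}{x(t)} - \frac{L_0}{x_0}\right)
+ \Theta\, \frac{\mu B(t)}{\gamma\lambda} \left(\frac{\kappa}{x(t)}\right)^{\phi/\gamma\lambda} e^{-\frac{B(t)}{L(t)} x(t)},
\tag{2.36}
\]
where the symbol Θ indicates an appropriate Heaviside function.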
In conclusion, the effect of random local duplication can be evaluated as an additional expo-
nential contribution to the stick-breaking.
Supposing that duplications are facilitated by the presence of repetitive sequences, we can
expect that “dense” regions tend to be duplicated more.
It is possible to include the “actual” distribution as a source of duplicated distances consid-
ering the function q(x, t) = f(x)c(x, t). Assuming that short distances are more likely to be
duplicated, f(x) is a decreasing function. In this case, equation (2.23) can be integrated:
\[
\frac{dc(x(t),t)}{dt} = \left[-\gamma \frac{\lambda}{L(t)} + \mu f(x)\right] c(x(t), t)
\;\rightarrow\;
c(x(t), t) = c(x_0, 0)\, e^{\int_0^t dt' \left(-\gamma \frac{\lambda}{L(t')} + \mu f(x(t'))\right)}
= \frac{x_0}{x(t)}\, c(x_0, 0)\, e^{\mu \int_0^t dt' f(x(t'))}.
\]
In the limit of large inter-REs distances x we would expect the duplication probability f(x) to
vanish and $e^{\mu \int_0^t dt' f(x(t'))} \approx 1$, recovering the result obtained for the pure expansion model.
2.3 Source estimate
The solutions in the previous section, besides the arbitrary choice of the functional form
of q(x), contain the unknown parameter κ and the ratio between expansion and duplication.
While a distribution for duplication lengths can be roughly inferred (e.g. using data from
Ref. [33]), estimating the parameters ϕ and γλ is more complex.
In order to estimate the fraction of duplicated REs and a functional form for the phenomeno-
logical source, we introduce a simplified model. According to the results in Sec. 2.2, expansion
process does not change the shape of the initial stick-breaking distribution.
As a consequence, relying on the assumption that the duplication process generates short
inter-REs distances, the right-tail of the total distribution should be well fitted by the stick-
breaking function with the correct normalization.
Thus, we aim to fit the data with a function:
c(x,B,L) = θ c0(x;B0, LE) + (1− θ)q(x) (2.37)
where LE is the expanded genome size as in Eq. (2.25) and B0 indicates the initial number
of insertions (i.e. the original number of retrotransposition events).
[Figure 2.4: schematic plot of Frequency versus Distance; labels: xmin, "Pure stick-breaking", "Stick-breaking + duplication".]
Figure 2.4: Schematic representation of source estimate procedure.
The inter-REs distance distribution (orange dots - data do not refer to any
particular subfamily) for a value larger than a minimum value xmin (black
dashed line) is well approximated by a stick-breaking distribution (blue
line). The same distribution at shorter distances is the sum of the stick-
breaking and the source. The parameters of the stick-breaking distribution
and the threshold xmin can be optimized using a maximum likelihood ap-
proach.
[Figure 2.5: two panels of Frequency versus Distance [bp], (a) Alu Y and (b) Alu Jr; insets labeled "source", with an "exponential" fit in one panel and a "double exp." fit in the other.]
Figure 2.5: The inter-REs distance distribution is well described by a stick-breaking model and a
short-range source of distances. Figure shows two examples of inter-REs distance distribution (orange dots)
and the best fit (continuous red line) obtained as the sum of a stick-breaking distribution and an external source
(see main text). The insets show the normalized sources distribution (orange dots) fitted using an exponential
or double exponential function (continuous red line).
In other words, we posit that the tail of the observed distance distribution of B retrotransposons
can be well explained by a stick-breaking solution with a smaller initial number of breaks
B0. The short-distance region is a superposition of this stick-breaking process with an extra
contribution captured by the rapidly decaying term q(x).
This approximation allows us to estimate the functional form of q(x) as the difference be-
tween the empirical data and the stick-breaking distribution.
Using a maximum likelihood approach we identified, for each subfamily, the threshold
value xmin that gives the best possible fit of the tail of the distribution, obtaining the initial
number of breaks B0 (see Appendix).
[Figure 2.6: plot of the estimated average source length (axis "Source") versus the average inter-REs distance (axis "Avg. Distance"); points are labeled by Alu subfamily: Jb, Jo, Jr4, Jr, Sc5, Sc8, Sc, Sg4, Sg7, Sg, Sp, Sq2, Sq, Sx1, Sx3, Sx4, Sx, Sz6, Sz, Ya5, Yc, Y, Yk2, Yk3, Ym1.]
Figure 2.6: The average source length correlates with inter-REs distances. The figure shows the esti-
mated average source length as a function of the average distance L/B between REs for different Alu
subfamilies. Different colors identify the three major Alu groups: Y (red), S (orange) and J (green).
According to our results, the expanded source term is well described by an exponentially
decreasing function for young subfamilies, while it is better approximated by a double exponential
for the older ones. Considering that a high-density subfamily has a small typical distance
between elements, this small distance could be preferentially added to the distribution through
the duplication process. Such a mechanism could explain the apparently more than exponential
trend of the data. Using an exponential function to fit the source we found that the mean value
of q(x) is correlated with the density B/L of the corresponding subfamily. These observations
are in agreement with the dependence we derived for the local duplication model.
Conclusions
Recently, the role of transposable elements in genome evolution has been increasingly recog-
nized. These elements, widely diffused in different species, can be assimilated to parasites that
evolve and compete with their host genome and with other transposable elements present in it.
In this thesis we presented a null model for the distribution of transposable elements based
on the hypothesis of random insertion in the human genome, focusing in particular on Alus and
L1s. The large number of possible insertion sites and data on recent retrotransposition events
suggest that, at a large scale, this is a reasonable assumption [20, 22, 35]. Obviously, the idea
of “large scale” is not easily defined, and we refer here in general to a genomic scale.
However, the stick-breaking null model does not reproduce correctly the inter-Alus and inter-
L1s distance distributions. Assuming the null hypothesis, all the observed deviations are due
to post-insertion events.
Different mechanisms are believed to be responsible for this discrepancy [22, 35, 53]. Our
intention was to investigate if selection must necessarily enter as a dominant force, or if a
neutral evolution model could be sufficient to explain the empirical data.
Selective elimination of Alus from GC-poor regions after their insertion is believed to be
responsible for the bias of the Alu distribution towards GC-rich regions [22]. This mechanism,
of course, is in contrast with the neutral evolution hypothesis.
However, it is easy to include this effect in a random placement model on two different
genomes that mimic GC-rich and GC-poor regions with two rates of insertion. Hence, the
selective elimination can be modeled as a non-retrotransposition event.
The insertion of new genetic material has been suggested to be at the basis of a progres-
sive transition toward a power-law distribution of inter-REs distances [53]. We verified, both
analytically and using simulations, that the expansion process due to insertion of new transpos-
able elements and pseudogenes does not affect significantly the shape of the random placement
distribution.
Including an external source of inter-REs distances in our model we were able to explain the
empirical data. The “effective” phenomenological source (that we suppose to be a source of
duplicated sequences) can be evaluated under the simple hypothesis that it generates short-range
distances.
The estimated amount of duplicated REs is quite large, especially for the older subfamilies
(≈ 50%). Applying the estimate procedure separately on GC-rich and GC-poor regions, these
values improve, with the fraction of duplicated Alus in GC-rich regions decreasing to ≈ 30%.
The reliability of these estimates is not easy to evaluate. A first source of error is the contri-
bution of non-expanding short inter-REs distances (see Eq. (2.16) or its discrete counterpart).
This term depends on the characteristic insertion length and on the initial genome size, that
are unknown. Moreover, large expansion events (e.g. large duplications of tens of kilobases) can
affect the tail of the distribution. In fact, such events increase suddenly the distance between
two REs and, as a consequence, the probability of insertion of new genetic material in the same
region. Hence, large expansion events could create an effect similar to the one described by
Sellis and coworkers [53]. According to simulated data, this could create a slight deviation from
the null model predictions (see figure S11).
Concluding, we suggest that the observed distribution of recent retrotransposable elements
can be explained with a neutral evolution model of random insertion in the genome followed by
an expansion-duplication process. In the future, we plan to apply the same model to functional
sequences. The goal of this new research is to use the model presented here as a null model to
investigate the role of selective pressure on genomic functional elements, such as transcription
factor binding sites and enhancers.
Appendix - Part I
A1 Data analysis and selection
Data on the number of insertions and position of transposable elements (TEs) in human
genome (hg38) were downloaded from RepeatMasker [54] official website http://www.repeatmasker.org
(RepeatMasker open-4.0.5 - Repeat Library 20140131).
The human genome sequence (assembly hg38) was downloaded from UCSC database [55]
(last accessed October 2014). We considered only data referred to reference chromosomes 1-22,
X, Y.
Data analysis and plots were performed using Python, R, Gnuplot and Mathematica.
Estimate of the distance from the null model and the maximum likelihood estimate of the
source were performed on Alu and L1 subfamilies with more than 1000 elements, in order to
have sufficiently high statistics. We included in the analysis a total of 32 Alu and 107 L1
subfamilies∗.
About 3% of the elements identified are labeled by RepeatMasker as "broken" into pieces
during evolution but originating from a single retrotransposition event, due to the insertion of
other TEs inside the element or to genomic rearrangements.
∗We use, here and throughout the thesis, the term ”family” referred to Alu and L1, while the term ”subfamily”
refers, in general, to any subgroup included in these families (e.g. L1M1, AluJb...)
These fragmentation events cannot be taken into account by our model, which considers TEs as
"point-like" insertions. We collapsed the data according to the identification number (ID) assigned
by RepeatMasker (see Fig. S1). This process leads to a very small number of ambiguities. In
a few cases it was not possible to assign a precise subfamily to the collapsed element.
We verified that neglecting or including these REs does not significantly affect
the inter-REs distance distribution and has no practical relevance for our model.
The inter-REs distance was calculated as the difference between the start coordinate in
the genome of an element and the stop position of the previous one.
The distance between the first (last) element and the start (end) of each chromosome is
neglected. Moreover, we neglect those distances that fall in centromeric and pericentromeric
regions, which can generate unusually long distances in our dataset (of the order of a few
Mbp).
We include in centromeric and pericentromeric regions the cytobands labeled as acen and
gvar, according to UCSC repository [55].
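The distance definition above can be sketched in a few lines. The code below is an illustration and not the pipeline used for the thesis: the function name, the tuple layout (chromosome, start, end) and the way excluded regions are passed are assumptions. Distances are computed per chromosome as the start of an element minus the end of the previous one; the distances to the chromosome ends are never produced, and gaps overlapping an excluded region (e.g. an acen or gvar cytoband) are skipped.

```python
from collections import defaultdict


def inter_re_distances(elements, excluded=()):
    """Return inter-RE distances from (chrom, start, end) tuples.

    excluded is an iterable of (chrom, start, end) regions; any gap between two
    consecutive elements that overlaps an excluded region is discarded.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in elements:
        by_chrom[chrom].append((start, end))
    distances = []
    for chrom, items in by_chrom.items():
        items.sort()
        for (_, prev_end), (start, _) in zip(items, items[1:]):
            overlaps = any(c == chrom and s < start and e > prev_end
                           for c, s, e in excluded)
            if not overlaps:
                distances.append(start - prev_end)
    return distances


# Toy coordinates (hypothetical), with one excluded region on chr1.
elements = [("chr1", 100, 400), ("chr1", 1000, 1300), ("chr1", 5000, 5300),
            ("chr2", 50, 350), ("chr2", 900, 1200)]
excluded = [("chr1", 2000, 3000)]
print(inter_re_distances(elements, excluded))
```

Overlapping or nested annotations would give non-positive distances with this definition, so in practice the elements should be collapsed first, as described above.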
As a whole, during this "selection" process a total of about 50 inter-REs distances were
discarded for each subfamily. The genome size is thus reduced to ≈ 2.8 Gb, depending on the subfamily.
A2 GC-rich and GC-poor regions
We define the “GC content” of a sequence as the ratio between the number of Cs and Gs
present in the sequence and the total number of bases.
We calculated the GC content in non-overlapping windows of 10^6 bp, neglecting those with
more than 20% of masked bases. A window is considered GC-rich when its GC content is
> 41%. Then, contiguous windows of the same class (GC-rich or GC-poor) are collapsed into single regions.
We obtained a total of 521 different regions, 263 defined as GC-rich and 258 defined GC-
poor. This procedure guarantees a sufficient “minimal” length of the regions in order to have a
significant number of REs inside them.
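The windowing procedure can be sketched as follows. This is a simplified illustration, not the script used for the thesis: the function names are hypothetical, masked bases are assumed to be written as N, and the GC content of a window is assumed to be computed over its non-masked bases only.

```python
def gc_windows(seq, window, max_masked=0.2, threshold=0.41):
    """Label non-overlapping windows as 'rich' or 'poor'; None marks discarded windows."""
    labels = []
    for i in range(0, len(seq) - window + 1, window):
        w = seq[i:i + window].upper()
        masked = w.count("N")
        if masked / window > max_masked:
            labels.append(None)  # too many masked bases: window neglected
            continue
        gc = (w.count("G") + w.count("C")) / (window - masked)
        labels.append("rich" if gc > threshold else "poor")
    return labels


def collapse(labels, window):
    """Merge contiguous windows with the same label into (start, end, label) regions."""
    regions = []
    for k, label in enumerate(labels):
        if label is None:
            continue
        start, end = k * window, (k + 1) * window
        if regions and regions[-1][2] == label and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end, label)  # extend the previous region
        else:
            regions.append((start, end, label))
    return regions


# Toy sequence with a window of 10 bp (the text uses windows of 10^6 bp).
seq = "GGGGCCCCAT" + "GCGCGCGCGC" + "ATATATATAT" + "NNNNNATATA" + "ATATATGCAT"
print(collapse(gc_windows(seq, 10), 10))
```

A discarded window breaks contiguity, so the two GC-poor windows on either side of the masked one remain separate regions.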
The procedure was repeated on the masked sequence of the human genome where the repeats
are masked by capital Ns [55]. In this case we neglected windows with more than 50% of masked
bases. The results obtained with the two methods are comparable.
A3 Jukes-Cantor and Kimura divergence
There exist different substitution models to measure the evolutionary divergence between
two DNA sequences.
Insertion time of a TE can be roughly determined as the ratio between the divergence d of
the TE from its consensus sequence, according to a given model, and the total substitution rate
r [56]. Substitution rate indicates the probability of mutation from a generic base (A, C, G, T)
to another one.
In general, substitution rate depends on the organism, the effective population size, the spe-
cific locus of the genome (e.g. coding regions of essential genes will tend to be more conserved),
it differs for synonymous and not-synonymous mutation and varies in time [57, 58, 59, 60, 61].
The overall substitution rate in humans is estimated to be of the order of 2·10^−8 mutations
per nucleotide per generation [58, 62]. Hence, assuming a generation time of ≈ 15 years, the
mutation rate is ≈ 1.3·10^−9 mutations per nucleotide per year.
Jukes and Cantor in 1969 [36, 38] proposed a simple substitution model for the divergence
between two sequences. In this model the mutation rate α is equal for all the four nucleotides.
Considering a reference sequence and a mutable one, we can define q_t as the fraction of
bases conserved between the two sequences, and p_t = 1 − q_t as the fraction of mutated ones. The
substitution rate is r = 3α, since all three possible substitutions (e.g. A → C, A → G or
A → T) have the same probability. For mismatching nucleotides, the probability of making the
correct backward mutation is r/3, while the probability of mutating into another nucleotide still
different from the original one is (2/3)r. Thus, the fractions of conserved and mutated nucleotides
q and p evolve according to:
\[
q_{t+1} = q_t - r q_t + p_t \frac{r}{3}
\;\rightarrow\;
\begin{cases}
q_{t+1} - q_t = \dfrac{r}{3} \left(1 - 4 q_t\right) \\[2mm]
p_{t+1} - p_t = r \left(1 - \dfrac{4}{3} p_t\right)
\end{cases}
\tag{0.38}
\]
Imposing the initial conditions q(0) = 1 and p(0) = 0, one immediately obtains:
\[
q(t) = \frac{1}{4} \left(1 + 3 e^{-4rt/3}\right)
\;\leftrightarrow\;
p(t) = \frac{3}{4} \left(1 - e^{-4rt/3}\right)
\tag{0.39}
\]
Thus, the Jukes-Cantor distance between the two sequences is defined as:
\[
d = rt = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p(t)\right).
\tag{0.40}
\]
Given the fraction of conserved and mutated nucleotides divergence time can be estimated
as t = d/r [56].
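As a worked example, Eqs. (0.39) and (0.40) translate directly into code. The sketch below is illustrative: the function names are hypothetical, and the rate per year is the rough value quoted above (2·10^−8 per generation with 15-year generations). It computes the mutated fraction expected after a given time, then recovers the time from that fraction.

```python
import math


def jc_mutated_fraction(r, t):
    """Eq. (0.39): expected fraction of mutated bases after time t at rate r."""
    return 0.75 * (1.0 - math.exp(-4.0 * r * t / 3.0))


def jc_distance(p):
    """Eq. (0.40): Jukes-Cantor distance from the fraction p of mutated bases."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor distance is defined for 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_age(p, r):
    """Age estimate t = d / r for an element diverged by p from its consensus."""
    return jc_distance(p) / r


r_per_year = 2e-8 / 15  # rough rate from the text, an order-of-magnitude value
p = jc_mutated_fraction(r_per_year, 3.0e7)
print(p, insertion_age(p, r_per_year))
```

The correction matters more as p grows: for small p the distance is close to p itself, while it diverges as p approaches 3/4.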
Considering two sequences evolved from a common ancestor, equation (0.39) becomes
\[
q(t) = 1 - \frac{3}{4} \left(1 - e^{-8rt/3}\right)
\;\leftrightarrow\;
p(t) = \frac{3}{4} \left(1 - e^{-8rt/3}\right)
\tag{0.41}
\]
and their divergence results $d_2 = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p(t)\right)$. The divergence time is obtained as
t = d2/(2r) because, if both the sequences mutate, they need half of the time to reach the same
divergence level.
A more precise estimate of the divergence can be obtained using the Kimura two-parameter
model [37, 38]. This model accounts for two different mutation rates, α for transitions
between two purines or two pyrimidines (i.e. A ↔ G and C ↔ T) and β for transversions (e.g. A ↔ C).
Hence, the total substitution rate is r = α + 2β. The Kimura divergence is given by
\[
d = -\frac{1}{2} \ln(1 - 2p_1 - p_2) - \frac{1}{4} \ln(1 - 2p_2)
\tag{0.42}
\]
where p1 and p2 are the fractions of transitions and transversions respectively.
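A short sketch of Eq. (0.42), given here only as an illustration with hypothetical function names. It also shows a useful consistency check: when one third of the observed differences are transitions and two thirds are transversions (no transition bias), the Kimura divergence reduces to the Jukes-Cantor one.

```python
import math


def k2p_distance(p1, p2):
    """Eq. (0.42): Kimura two-parameter divergence.

    p1 is the fraction of transitions, p2 the fraction of transversions.
    """
    a = 1.0 - 2.0 * p1 - p2
    b = 1.0 - 2.0 * p2
    if a <= 0.0 or b <= 0.0:
        raise ValueError("divergence too large for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def jc_distance(p):
    """Jukes-Cantor divergence for a total mutated fraction p, Eq. (0.40)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = 0.12
print(k2p_distance(p / 3, 2 * p / 3), jc_distance(p))  # no transition bias
print(k2p_distance(0.08, 0.04), jc_distance(p))        # transition-biased case
```

With the same total fraction of differences, an excess of transitions gives a Kimura divergence larger than the Jukes-Cantor value.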
Moreover, it is possible to add a correction on these divergences accounting for the hyper-
mutability of CpG dinucleotides [59, 63, 64]. This correction is particularly relevant for Alus
since they have a high GC content. RepeatMasker implements this correction neglecting the
mutated CpG dinucleotides. The resulting values of Kimura divergence for repetitive elements
from their consensus sequences are reported in the RepeatMasker alignment file. We compared
the average values of Kimura corrected divergence with Jukes-Cantor divergence for different
subfamilies.
Kimura divergence is ≈ 1.5 times larger than Jukes-Cantor divergence for Alus. In the case of
L1s, which are not enriched in CpG, the corrected Kimura divergence essentially coincides with
the simple Jukes-Cantor divergence (see Fig. S4). Hence, the estimate of the age of Alus is highly
influenced by the specific substitution model and by the CpG content correction. However, the
linear correlation between Kimura corrected divergence and Jukes-Cantor values suggests that
the relative distance between Alu subfamilies can be well described even using the Jukes-Cantor
divergence.
A4 Maximum likelihood estimate
In Sec. 2.3 we suggested that the tail of the inter-REs distance distribution could be described
by the stick-breaking function, with the correct parameters.
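We consider the probability distribution
\[
p(x, x_{min}, B_0, L_E) = \frac{1}{Z} \left(2\frac{B_0}{L_E} + \left(\frac{B_0}{L_E}\right)^2 (L_E - x)\right) e^{-\frac{B_0}{L_E} x},
\qquad
Z = \left(1 + \frac{B_0}{L_E} (L_E - x_{min})\right) e^{-\frac{B_0}{L_E} x_{min}}
\tag{0.43}
\]
where the factor Z guarantees the normalization of the stick-breaking function on the range
[xmin, LE]†.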
The log-likelihood function of this probability distribution was maximized numerically for
different values of the threshold xmin and of the parameters {B0, LE}.
We selected the value of xmin (and, as a consequence, the corresponding optimal couple
{B0, LE}) that minimizes the Kolmogorov-Smirnov distance between the stick-breaking func-
tion and the data.
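The two ingredients of this procedure, the log-likelihood of Eq. (0.43) and the Kolmogorov-Smirnov distance, can be sketched as follows. This is a simplified illustration and not the code used for the thesis: the function names are hypothetical, LE is kept fixed, and the maximization is replaced by a search over a list of candidate values of B0. The cumulative distribution follows from integrating Eq. (0.43) from xmin to x with the normalization Z given there.

```python
import math


def log_pdf(x, xmin, B0, LE):
    """Logarithm of the truncated stick-breaking density of Eq. (0.43)."""
    b = B0 / LE
    log_z = math.log(1.0 + b * (LE - xmin)) - b * xmin
    return math.log(2.0 * b + b * b * (LE - x)) - b * x - log_z


def log_likelihood(data, xmin, B0, LE):
    """Sum of log_pdf over the data points in the tail x >= xmin."""
    return sum(log_pdf(x, xmin, B0, LE) for x in data if x >= xmin)


def cdf(x, xmin, B0, LE):
    """Cumulative distribution of Eq. (0.43) on [xmin, LE]."""
    b = B0 / LE
    ratio = (1.0 + b * (LE - x)) / (1.0 + b * (LE - xmin))
    return 1.0 - ratio * math.exp(-b * (x - xmin))


def ks_distance(data, xmin, B0, LE):
    """Kolmogorov-Smirnov distance between the tail of the data and the model."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = cdf(x, xmin, B0, LE)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def best_B0(data, xmin, LE, candidates):
    """Candidate B0 with the highest log-likelihood (LE fixed, an assumption)."""
    return max(candidates, key=lambda B0: log_likelihood(data, xmin, B0, LE))
```

In a full analysis, best_B0 would be replaced by a numerical optimizer over both B0 and LE, and the scan over xmin would keep the threshold with the smallest ks_distance.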
The maximum likelihood procedure described above was applied to all the 32 selected Alu
subfamilies. For seven small subfamilies it was impossible to estimate the parameters (usually,
we obtained xmin = 0 and B0 > Breal). For most of the other subfamilies xmin ≈ 10^5. The
estimated sources are reported in Fig. S9.
†We are using the same notation as in the main text, where LE indicates the expanded genome.
Supplementary Figures.
[Supplementary Figure S1: two histograms of Frequency versus Distance [bp] for AluJr, AluJb and AluY; left panel "Raw", right panel "Collapsed".]
Supplementary Figure S1: Short inter-Alus distances are affected by insertion of TEs. The figure shows inter-
Alus distances for three subfamilies calculated on raw data from RepeatMasker (left) and after collapsing repeats
identified as single insertions (right). The peak at ≈ 300 bp in the AluJb and AluJr distributions is due to the insertion
of recent Alus in elements of the older families. The black line on the right shows an indicative stick-breaking
prediction (B = 10^5, L = 3·10^9) for the three subfamilies, which have almost the same density.
[Supplementary Figure S2: histogram of Counts versus Length [bp]; legend: Alu, Mir, L1, L2, DNA, LTR, Others; group labels: Sine, Line.]
Supplementary Figure S2: Length distribution of transposable elements in human genome. Most of the
TEs insertions in human genome are less than 1kbp long, including LINEs elements.
Data refer to collapsed transposable elements according to RepeatMasker ID (see Sec.A1). Satellite repeats are
neglected.
Bin size is 500bps.
[Supplementary Figure S3, panels (a1) Chr 1, (a2) Chr 17, (a3) Chr 18.]
Chr %GC %GC masked
1 0,417 0,417
2 0,402 0,397
3 0,397 0,388
4 0,382 0,365
5 0,395 0,386
6 0,396 0,386
7 0,407 0,400
8 0,402 0,395
9 0,413 0,412
10 0,416 0,415
11 0,416 0,420
12 0,408 0,399
13 0,385 0,371
14 0,408 0,404
15 0,421 0,425
16 0,447 0,453
17 0,455 0,463
18 0,398 0,391
19 0,480 0,509
20 0,439 0,448
21 0,409 0,406
22 0,470 0,498
X 0,395 0,390
Y 0,392 0,377
b
Supplementary Figure S3: GC content changes in different chromosomes. (a1-a3) Figures show the GC
content in human chromosomes 1, 17, and 18. Blue refer to regions of low GC content, yellow/red indicate an
high GC content (scale on the right). The script used to produce the figures is based on the one available at
http://genomat.img.cas.cz/. (b) Table shows the average GC content (as percentage of G or C bases over all the
non-masked bases) in different chromosomes, including (left) and excluding (right column) repetitive regions (see
Sec.A1).
[Supplementary Figure S4: two panels of Jukes-Cantor div. versus Kimura div.; panel labels and best-fit annotations as extracted: L1, y=1.42x+0.02; Alu, y=1.01x+0.00.]
Supplementary Figure S4: Kimura and Jukes-Cantor divergence correlation. The figure shows the Jukes-
Cantor divergence as a function of Kimura divergence (corrected for CpG hypermutability) for different Alu and
L1 subfamilies. Best linear fit values are shown. While for L1 subfamilies the two divergences are essentially
equivalent, for Alus the correction due to CpG hypermutability increases all the Jukes-Cantor values by a factor
≈ 1.5. The correlation coefficient is > 0.9 in both cases.
[Supplementary Figure S5: panel "L1 family"; plot of Stick-breaking distance versus Jukes-Cantor Divergence.]
Supplementary Figure S5: Divergence of inter-L1s distance distributions from null model correlates with the age. The figure shows the calculated distance from the null model distribution for 107 L1 subfamilies as a function of their average Jukes-Cantor divergence. The correlation coefficient is ≈ 0.5.
Subfamily  B (no centr.)  L (no centr.) [bp]  DivSB (GC-poor)  DivSB (GC-rich)  DivSB (total)
AluJb 106131 2796266905 0,27 0,27 0,35
AluJo 42419 2812872175 0,28 0,28 0,39
AluJr4 11258 2795179391 0,20 0,18 0,25
AluJr 104562 2800128859 0,25 0,27 0,33
AluSc5 3927 2769968260 0,06 0,11 0,10
AluSc8 19429 2811262186 0,12 0,19 0,20
AluSc 36042 2811769081 0,11 0,18 0,18
AluSg4 11204 2801195939 0,13 0,23 0,26
AluSg7 5478 2787169000 0,10 0,15 0,17
AluSg 40788 2813571523 0,17 0,26 0,28
AluSp 61418 2808277749 0,21 0,33 0,35
AluSq10 1392 2643504441 0,09 0,17 0,15
AluSq2 49205 2811617633 0,17 0,24 0,27
AluSq4 1505 2648682617 0,08 0,12 0,12
AluSq 11613 2805403229 0,16 0,21 0,28
AluSx1 101443 2796168223 0,21 0,27 0,33
AluSx3 36617 2815790407 0,14 0,23 0,24
AluSx4 11185 2797095951 0,11 0,19 0,20
AluSx 115930 2794142909 0,21 0,29 0,34
AluSz6 38796 2811128834 0,22 0,23 0,30
AluSz 112224 2793895984 0,23 0,27 0,33
AluYa5 3615 2739162849 0,04 0,06 0,04
AluYb8 2521 2690164499 0,03 0,04 0,02
AluYc 6366 2791755286 0,03 0,10 0,08
AluY 93727 2802302678 0,10 0,24 0,20
AluYe5 1207 2617578240 0,06 0,08 0,07
AluYf1 1866 2694779102 0,11 0,12 0,12
AluYj4 2606 2747766668 0,03 0,09 0,08
AluYk2 6357 2798203017 0,06 0,14 0,14
AluYk3 5800 2773998664 0,07 0,13 0,12
AluYk4 1022 2601615948 0,04 0,11 0,05
AluYm1 4417 2769035437 0,07 0,14 0,12
Supplementary Figure S6: Parameters for the 32 analyzed Alu subfamilies. The table shows, for each subfamily,
the number of insertions (B), the total genome size (L), and the distances from the null model (DivSB)
in GC-poor and GC-rich regions as well as over the whole genome. Data refer to REs selected as described in
Sec. A1, neglecting centromeric and telomeric regions.
[Plots not reproduced. Six log-log panels of frequency versus distance [bp]: AluJb, AluSx3, AluSq2, AluYf1, AluY - Chr 2, AluSc - Chr 1.]
Supplementary Figure S7: Examples of inter-Alu distance distributions. The figure shows a few examples of
inter-Alu distance distributions (orange dots) for different subfamilies. If the chromosome is not specified, data refer
to the whole-genome (hg38) distribution, excluding centromeric and high-variability regions. Fragments of
Alus detected by RepeatMasker as single insertion events are collapsed (see Sec. A1). The null-model prediction for
each subfamily is shown (black dashed line).
[Plots not reproduced. Six log-log panels of counts versus distance [bp], each with separate GC-poor and GC-rich series: AluJb, AluSx, AluSx4, AluY, AluYk3, AluSc.]
Supplementary Figure S8: Examples of inter-Alu distance distributions in GC-rich and GC-poor
regions. The figure shows a few examples of inter-Alu distance distributions for different subfamilies in GC-rich and
GC-poor regions (different symbols as in legend, see Sec. A2). The stick-breaking prediction for each distribution is
shown (black dashed line).
[Plots not reproduced. Twenty-five log-log panels of frequency versus distance [bp]: AluJb, AluJo, AluJr4, AluJr, AluSc5, AluSc8, AluSc, AluSg4, AluSg7, AluSg, AluSp, AluSq2, AluSq, AluSx1, AluSx3, AluSx4, AluSx, AluSz6, AluSz, AluYa5, AluYc, AluY, AluYk2, AluYk3, AluYm1.]
Supplementary Figure S9: Estimated sources for Alu subfamilies. The figure shows the result of the source estimate
procedure for 25 Alu subfamilies (orange dots). Most of the sources are well described by an exponential function,
while older subfamilies are better described by a double exponential (continuous red lines). All the data obtained
as the difference between the empirical distribution and the optimized stick-breaking are shown, including distances
larger than xmin (in most of the cases xmin ≈ 10^5). The exponential fit was performed using only data in the
range x < xmin.
[Plot not reproduced. Log-log plot of counts (ticks 1-100) versus distance in Mbp (ticks 1-100).]
Supplementary Figure S10: Example of inter-REs dis-
tance distribution for recent insertions. Figure shows
inter-REs distance distribution for 367 L1H elements de-
tected in a sample of 25 individual human genomes (data
from ref.[15]). These insertions are not fixed in human pop-
ulation since they are not present in the reference genome.
Hence, we can suppose that this pool of L1s is due to recent
retrotransposition events. Comparison between genomic dis-
tribution of these elements (orange dots) and stick-breaking
prediction (black dashed line) shows a good agreement.
[Plots not reproduced. Two log-log panels, AluJb and AluY, with x-axis ticks from 1000 to 1000000 and y-axis ticks from 10^0 to 10^4.]
Supplementary Figure S11: Effect of expansion on stick-breaking distribution. We wanted to evaluate
the effect of a “realistic” expansion due to TE insertions and duplications on inter-RE distances. We
calculated the size distribution of all the TEs with Jukes-Cantor divergence smaller than AluY and AluJb
subfamily respectively. These TEs are likely to be younger than the Alus belonging to the two subfami-
lies, and could have contributed to expansion of inter-AluYs and inter-AluJbs distances. Data on genomic
duplications in human larger than 1kbp and identity > 90% (genome build GRCh37) were downloaded from
http://humanparalogy.gs.washington.edu/build37/build37.htm. We selected those duplications that do not con-
tain any insertion of AluY and AluJb in human genome (GRCh37). We simulated the random placement of a
number of REs equal to the number of AluY and AluJb members present in human genome (not shown). Genome
size was reduced to 2.5Gbp and 2Gbp for AluY and AluJb respectively in order to obtain for both the subfamilies
a final LE ≈ 2.8Gbp .
The figure shows the actual distribution of distances for AluY and AluJb elements (orange dots) and simulated data
(red squares) of the expanded distribution after random insertion of the selected TEs and duplicated regions. In the
case of AluJb subfamily, duplicated regions are inserted twice, since data on duplications are likely to refer to
events newer than AluJb insertion.
Expanded distributions cannot reproduce the empirical data and are well described by the null model prediction
(black dashed line). However, for AluJb we observed a small deviation of the tail from the null model distribution,
possibly due to an effect similar to the one described by Sellis and coworkers [53]. This deviation is purely
indicative, since the simulated expansion process does not reproduce a “real” expansion of the human genome.
Despite the many arbitrary choices we made, these results support the idea that a random expansion process is
not sufficient to explain the empirical distributions.
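The random placement step mentioned in this caption can be sketched as follows. This is only an illustration under simplifying assumptions (point-like elements, uniform placement, a single linear genome); the function name `random_inter_element_distances` and the parameter values in the usage line are hypothetical, and the null model actually used in this work is the one defined in the main text.

```python
import random

def random_inter_element_distances(B, L, rng):
    """Place B point-like elements uniformly at random on a genome of
    length L (in bp) and return the B - 1 distances between neighbours.

    Assumption of this sketch: elements have no length and may fall on
    any position with equal probability.
    """
    positions = sorted(rng.randrange(L) for _ in range(B))
    return [b - a for a, b in zip(positions, positions[1:])]

# Illustrative usage with made-up sizes (not the values used in the thesis).
distances = random_inter_element_distances(1000, 10**7, random.Random(0))
print(len(distances), sum(distances) / len(distances))
```

For uniformly placed elements the mean spacing is close to L/B, which gives a quick sanity check on the simulated distances.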
[Schematic not reproduced. Legend: retrotransposon, DNA, duplicated retrotransposon, duplicated DNA. Labelled segments: α, β, γ, δ, ε1, ε2; labelled inter-RE distances: A, B, C, D, E, E1, E2.]
Supplementary Figure S12: Schematic representation of local duplication. (Top) During the duplication
of a DNA sequence (delimited by black dashed lines), REs internal to the sequence can be passively copied. This
process leads to the duplication of those inter-REs distances that are inside the sequence (distances B and C
in figure), while it can modify the two inter-REs distances that are at the border of the sequence (A and D in
figure). The local duplication process (bottom left) preserves the distances at the borders (A and D), adding a new
inter-REs distance E equal to the sum of the distances between the first (the last) copied RE and the ends of
the duplication (i.e. β and γ in figure). Non-local duplication (bottom right) implies the loss of an inter-REs
distance ε1 +ε2 and creation of two new distances E1 = β+ε1 and E2 = γ+ε2. Since the duplication is non-local,
ε1 and ε2 do not have any relationship with the duplicated region, and it is impossible to establish a priori the
size of E1 and E2.
Part II
Stochastic Models of Evolution in
Large Asexual Populations
Chapter 1
Mutations and fitness in asexual
populations
Thanks to contemporary technologies such as high-throughput sequencing and phenotypic
characterization, previously unachievable quantitative measurements of the results of controlled
laboratory evolution experiments are now possible. This is guiding theoretical investigations
and could make the validation and falsification of phenomenological theories feasible [65, 66, 67].
Moreover, the reproducibility of evolutionary outcomes can be studied in microbial populations
that are founded by the same ancestor and placed in identical environments [68].
Microorganisms have the advantage of being small and relatively easy to grow in the laboratory.
Their short generation time, ranging from minutes to hours, allows evolution to be observed on a
“laboratory” time scale. Moreover, studying prokaryotes makes it possible to bypass most of
the complex regulatory features that characterize eukaryotic organisms.
Most experiments in microbial evolution are conceptually simple. Populations are estab-
lished (often from single clones), then propagated in a controlled and reproducible environment
for many generations. Samples of the ancestral population and from various time points in the
experiment are stored indefinitely (e.g. frozen) so that the ancestral and derived genotypes can
be compared with respect to any genetic or phenotypic properties of interest. This provides
information on the dynamics of the evolutionary process and the extent of evolutionary change.
In asexual (or rarely mating) populations, new genotypes arise from DNA mutations. If these
events are not lethal for the offspring, the result is a mutant with a genotype different from
that of the parent. The effect of a mutation is measured by the increase or decrease in
the fitness of an individual.
Fitness is a complicated trait, and from a biological point of view it can be defined as the
success of an individual in reproducing. In the context of experiments on asexual populations
it can be identified with the growth rate µ of a clone [68, 69]. When evolutionary phenomena are studied
using statistical mechanics, fitness can be defined as the mean expected number of surviving
offspring of an individual.
The effect of a particular mutation can be classified as beneficial if it increases the reproductive
success, the growth rate or the number of offspring, neutral if it has no effect,
or deleterious if it decreases them. The increase (decrease) of fitness due to the acquisition of a mutation is called
the advantage of that mutation.
Fitness depends not only on the genotype of an individual, but also on the phenotype and
on the environment in which it is measured. In fact, a mutation that is beneficial in a specific
context can be deleterious, for the same individual, if the environment changes.
In the specific case of large asexual populations of microorganisms, a high number of ben-
eficial mutations emerge in different clones, and cannot be mixed because of slow or absent
recombination. These beneficial mutations appearing in parallel coexist and compete to drive
adaptation. This phenomenon of concurrent beneficial mutations is related to the Fisher-Muller
hypothesis (or Hill-Robertson effect) for the advantage of recombination [70]. In general, bene-
ficial mutations also arise with a distribution of fitness advantages, which is generally believed
to be exponential [71].
Recent models have generally dealt with the competition between mutations of different
strengths and the competition between mutations that arise on different fitness backgrounds
separately. The first effect, the role of a distribution of fitness changes, is analyzed by mod-
els in which any individual is either the wild type or a mutant derived directly from the wild
type [69, 72, 73]. Thus, multiple mutations arising in the extant mutants are neglected. Con-
versely, models that explicitly deal with multiple mutations typically assume that all mutations
have the same effect [69, 74, 75, 76] (a recent work incorporating both effects [77] has shown
that for peaked distribution of advantage this is the correct effective theory). The latter kind
of model has the advantage of being simpler to treat and accessible analytically.
There appears to be one important discrepancy between the models described so far and the
behavior of bacteria evolved in the laboratory for a long time (roughly, > 1000 generations). This
discrepancy is well represented by the experimental sub-linear increase on long time-scales of
the average population log-fitness (or fitness advantage) [78]. This is in contrast with the linear
increase predicted theoretically by many models even if they have been compared successfully
with the diversity and adaptation speed of short-time laboratory evolution experiments [79].
Therefore, the core issue is to understand the evolutionary mechanisms at the basis of the
experimental slowing down of the adaptation process.
Furthermore, two recent experimental studies [80, 81] have shown a common trend in the
advantage of combined beneficial mutations occurring in different genes. In most of the cases
analyzed, the combined advantage is lower than the sum of that of individual mutations. In
other words, when mutations of loci in different genes accumulate, the effective advantage of
each of them is lower. This was shown by combinatorial genetics techniques, by constructing all
the possible configurations of a small set of mutations, and evaluating their advantage through
competition experiments. This decrease of the advantage carried by a mutation as the back-
ground fitness increases provides a possible mechanism at the basis of the observed sub-linear
increase of the average fitness in long-term evolutionary experiments [80].
This trend, referred to as diminishing returns epistasis, had been previously suggested the-
oretically on the basis of the general pattern of adaptation observed in long-term microbial
experiments [78], using a modeling framework that neglected concurrent or multiple mutations.
Another study predicts the same principle on the basis of a simple fitness landscape model
combined with the distribution of single mutation effects measured experimentally [82]. The
actual pattern in the fitness advantage associated to the same mutation in different backgrounds
observed by the two studies in ref. [80, 81] is complex, as, on top of the diminishing return ef-
fect, the advantage appears to depend on the mutation identity. Even more recent systematic
experiments [67] are unveiling a complex scenario where different mechanisms coexist for the in-
teractions of mutations between and within functional “blocks”, which can span multiple genes
along the genome.
However, the full experimental complexity is difficult to incorporate in a treatable model, and
experimental data on linked mutations and interference between them are difficult to obtain.
Thus, simplified descriptions, such as the multiple-mutations model, are useful to model evolving
populations using a minimal quantity of information on mutations and fitness advantage.
Here, we take a simplified approach to study the diversity and speed of adaptation in the presence
of diminishing returns. We define a framework that can account for multiple mutations and
incorporates the effect of diminishing returns epistasis.
Namely, the fitness advantage of a mutation depends only on its order of appearance in a clone, and
decreases with it. This generalizes the non-epistatic multiple-mutations model [75, 76] (recovered
when the advantage does not decrease with the number of acquired mutations). We preserve the
model assumption that evolution is driven by beneficial mutations, which appear at a constant
rate.
Chapter 2
Models for microbial evolution in
long-term experiments
We build a minimal population-genetics model [73, 83, 84] that includes diminishing returns epistasis
in the presence of competition between beneficial mutations.
2.1 Model definition
The model describes a population of N haploid individuals, or sequences. The evolutionary
dynamics of the population is based on the well-known Wright-Fisher model [83, 84, 85]. Each
individual at generation $t+1$ is chosen as the offspring of an individual in class $i$ present at
generation $t$ with probability $\chi_i/N$, where $\chi_i = w_i/\langle w\rangle$ is the relative fitness of class $i$. More
intuitively, we can say that each individual of type $i$ produces a random number of offspring
with average equal to its relative fitness $\chi_i$.
Inheritance is introduced by assigning the fitness of the parent to the offspring. Mutations
change the mean fitness of the offspring $w'_i$ relative to the parental one $w_i$ according to the
relation
$$ w'_i = w_i\,(1+s) \simeq w_i\,e^{s}, $$
and we refer to $s$ as the advantage or selection coefficient of the mutation.
These prescriptions for the evolutionary dynamics assume non-overlapping generations since
at each generation there is a complete replacement of parents with progeny. The population size
N is kept constant, and there is no recombination. Fitness differences in the population
lead to natural selection, since classes of individuals with higher fitness generate increasingly
larger fractions of the population, while classes with low fitness progressively disappear.
Since the rules defining the dynamics contain the relative fitness χk, the model is unaffected
by multiplication of all the wk by a common factor. Therefore, the fitness value w0 = 1 can be
arbitrarily assigned to the ancestral genotype with no mutation (or wild-type, WT).
While the fitness advantage associated with new mutations is a complex issue [86], in the presence of
abundant beneficial mutations, deleterious mutations (negative effect on fitness, s < 0) do not
typically contribute to the adaptation of large populations and are customarily neglected [73, 75,
76]. For this reason, we consider only beneficial mutations with a positive selection coefficient
s, acquired by the population at a given rate.
Following this approach, in the model each offspring has a constant probability per genera-
tion Ub (the beneficial mutation rate) of acquiring a beneficial mutation.
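The update rule defined in this section can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used for the results of this work: the function name `wright_fisher_step`, the representation of the population as a list of mutation counts, and the `log_fitness` callback are assumptions made for the example, as are the parameter values in the usage lines.

```python
import math
import random

def wright_fisher_step(population, log_fitness, U_b, rng):
    """One non-overlapping generation at constant population size N.

    population  : list holding the number of beneficial mutations k of
                  each individual (hypothetical representation)
    log_fitness : function k -> ln w(k), supplied by the caller
    U_b         : probability per offspring per generation of acquiring
                  one new beneficial mutation
    """
    N = len(population)
    # Fitness of every parent. random.choices normalizes the weights, so
    # each offspring picks a given parent with probability chi_i / N.
    weights = [math.exp(log_fitness(k)) for k in population]
    offspring = rng.choices(population, weights=weights, k=N)
    # Offspring inherit the parental class, then mutate with probability U_b.
    return [k + 1 if rng.random() < U_b else k for k in offspring]

# Illustrative usage: non-epistatic case with a constant advantage s.
rng = random.Random(1)
s = 0.05
pop = [0] * 500
for _ in range(100):
    pop = wright_fisher_step(pop, lambda k: s * k, 1e-3, rng)
print(sum(pop) / len(pop))  # mean number of mutations per individual
```

Sampling all N offspring at once with fitness-proportional weights keeps the population size fixed and replaces the whole parental generation, as required by the non-overlapping generations assumption.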
Clonal interference regime The conditions for the emergence of the interference phe-
nomenon between sub-populations with different genotypes in a large population can be under-
stood with simple scaling arguments as a competition between processes occurring at different
time scales [73].
Since the dynamics is stochastic, it is possible to define for each mutation a “survival”
probability $\pi(s) \simeq c\,s$, where $c$ is a constant factor that depends on the specific
model used [87]. For the algorithm used here, $\pi(s) \sim 2s$ (see Fig. S4 and ref. [73] for a self-
contained motivation). This is the probability that, when a new mutant with advantage $s$
arises, its lineage grows sufficiently in size to overcome genetic drift (stochastic fluctuations in
the reproductive process).
When a certain mutation is carried by a number of individuals larger than $1/\pi(s)$,
the mutation starts to expand deterministically in the population (it is “established”) and, in
the absence of additional beneficial mutations, it will take over the population (go to “fixation”)
logistically.
The establishment size can be explained using an intuitive argument. Consider a sub-
population with $n_{mut}$ mutants in a population of $N$ individuals. The probability that at least
one mutant goes to fixation (i.e. its lineage takes over the population) is
$$ \Pi_{fix} = 1 - \bigl(1 - \pi(s)\bigr)^{n_{mut}} . $$ (2.1)
The size for which the fixation probability $\Pi_{fix}$ is around 1 is given by $(1-\pi(s))^{n_{mut}} \approx 0$, and,
for small $\pi(s)$, we obtain $n_{mut} \approx 1/\pi(s)$.
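The step from Eq. (2.1) to the establishment size can be made explicit with the standard small-$\pi$ expansion. The lines below are a sketch of that reasoning, not an additional result:

```latex
\begin{align*}
\Pi_{fix} &= 1 - \bigl(1-\pi(s)\bigr)^{n_{mut}}
           = 1 - e^{\,n_{mut}\ln(1-\pi(s))} \\
          &\simeq 1 - e^{-n_{mut}\,\pi(s)} \qquad \text{for } \pi(s) \ll 1 .
\end{align*}
```

The fixation probability therefore becomes of order one when $n_{mut}\,\pi(s) \sim 1$, that is for $n_{mut} \sim 1/\pi(s)$. With $\pi(s) \sim 2s$ and an illustrative value $s = 0.01$, this corresponds to roughly 50 individuals.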
For a population of size $N$, the scale of the fixation time can be estimated by imposing that
$\frac{1}{2s}\exp(s\,\tau_{fix}) \simeq N$, giving a characteristic time $\tau_{fix} \simeq \ln(2Ns)/s$ [75].
On the other hand, the number of new mutations arising in the population at each generation
is $N U_b$. Hence, the time scale for appearance and establishment of a new beneficial mutation
is $\tau_{est} \simeq 1/(N U_b\,\pi(s))$.
Therefore, when $N U_b \ll 1/(2\ln(2Ns))$ (i.e. $\tau_{fix} \ll \tau_{est}$) a beneficial mutation can fix before any
new mutation can establish, making the evolutionary dynamics driven by successive sweeps of
new lineages arising in an essentially monoclonal population. This regime is called “selective
sweeps” or “periodic selection”. Instead, a sufficiently large population with high beneficial
mutation rate (i.e. NUb >> 1) evolves in the opposite regime, in which multiple mutations can
establish before fixation of any of them and interfere with each other (clonal interference).
This is the regime considered here, which is believed to be relevant for laboratory evolution
experiments with microorganisms.
In fact, the typical values of the selection coefficient $s$ in laboratory evolution experiments
with bacteria are in the range $s \simeq 0.001 - 0.005$ [88, 89]. Exploring for the parameter $U_b$ the
(experimentally relevant) range $10^{-10} - 10^{-3}$ [88, 89], and considering population sizes between
$10^{6}$ and $10^{10}$ [65], we ensure that the system is in a clonal interference regime.
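The comparison of time scales described above can be written as a small helper. This is a rough sketch: the function names `time_scales` and `regime` are hypothetical, the survival probability is taken as $\pi(s) = 2s$ as stated for the algorithm used here, and the sharp threshold replaces what is in reality a smooth crossover between the two regimes. Parameter values in the usage line are illustrative.

```python
import math

def time_scales(N, U_b, s, c=2.0):
    """Return (tau_fix, tau_est) in generations, assuming pi(s) = c * s."""
    pi_s = c * s
    tau_fix = math.log(2 * N * s) / s    # from (1/2s) exp(s tau_fix) ~ N
    tau_est = 1.0 / (N * U_b * pi_s)     # waiting time for an established mutant
    return tau_fix, tau_est

def regime(N, U_b, s):
    """Crude classification: sweeps if fixation is faster than establishment."""
    tau_fix, tau_est = time_scales(N, U_b, s)
    return "selective sweeps" if tau_fix < tau_est else "clonal interference"

# Illustrative values only.
print(time_scales(1e7, 1e-5, 0.005), regime(1e7, 1e-5, 0.005))
```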
2.2 The diminishing return model
In the simplest scenario, we assume that each new beneficial mutation hits a new site
on the genome, so that each offspring has a constant probability per generation $U_b$ of acquiring
a new beneficial mutation. The model assumes that successive mutations do not lead to the
same fitness gain, but that the fitness gain depends on the mutations that have already occurred in an
individual [78]. This feature extends the non-epistatic model for multiple mutations [75, 76] and
considers selection coefficients dependent on the number of mutations, i.e. $s = s_0\,g'(k)$, where
$g'(k)$ is a decreasing function of the number of acquired mutations $k$.
The total advantage is given by the sum of the effects of contributing mutations, i.e.
$w(k) = e^{s_0 g(k)}$ and $g(k) = \sum_{k'=1}^{k} g'(k')$.
Definition of the advantage functions
In order to fully specify the model, one has to choose a specific form for the advantage
function g′(k), describing the strength of the negative epistasis between mutations.
A simple example is given by the choice of a fitness gain that depends on the number of the
extant mutations k as a power law.
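In this case, the fitness is
$$ w_k = \exp\Bigl(\sum_{k'=1}^{k} s_0\,\alpha\,k'^{\,\alpha-1}\Bigr), $$ (2.2)
with $\alpha < 1$ for diminishing returns, and $\alpha = 1$ for no epistasis.
In the general case $\alpha \neq 0$, one can write
$$ w_k = e^{s_0 \alpha H_{k,1-\alpha}} \approx \exp\Bigl(s_0\,\alpha\Bigl(k^{\alpha-1}\Bigl[\frac{k}{\alpha} + \frac{1}{2} + O(1/k)\Bigr] + \zeta(1-\alpha)\Bigr)\Bigr), $$ (2.3)
where $H_{k,1-\alpha} = \sum_{k'=1}^{k} k'^{\,\alpha-1}$ is the generalized harmonic number and $\zeta$ indicates the Riemann zeta function.
Thus, the relative fitness can be expressed as $\chi_k \approx e^{s_0(k^{\alpha} - \langle k^{\alpha}\rangle)}$.
For the particular case $\alpha = 0$ the fitness becomes
$$ w_k = \exp\Bigl(\sum_{k'=1}^{k} s_0\,k'^{-1}\Bigr) \approx e^{s_0(\ln(k) + \gamma)}, $$ (2.4)
which is obtained truncating the harmonic number expansion (neglecting terms $O(1/k)$), and
where $\gamma$ is the Euler-Mascheroni constant ($\approx 0.577$).
Under this assumption, the relative fitness becomes $\chi_k \approx e^{s_0 \ln(k/\langle k\rangle)}$.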
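The quality of the truncation used in Eq. (2.4) can be checked numerically. The sketch below compares the exact harmonic sum with the logarithmic approximation for the case $\alpha = 0$; the function names are hypothetical and the values of $k$ and $s_0$ are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def log_fitness_exact(k, s0):
    """ln w_k = s0 * H_k, with H_k the k-th harmonic number (alpha = 0)."""
    return s0 * sum(1.0 / kp for kp in range(1, k + 1))

def log_fitness_approx(k, s0):
    """Truncated expansion ln w_k ~ s0 * (ln k + gamma), for k >= 1."""
    return s0 * (math.log(k) + EULER_GAMMA)

# The neglected terms are of order 1/k: the gap shrinks as k grows.
for k in (10, 100, 1000):
    print(k, log_fitness_exact(k, 1.0) - log_fitness_approx(k, 1.0))
```

The leading neglected term is $s_0/(2k)$, so the approximation is already accurate to better than one percent of $s_0$ after a few tens of mutations.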
These expressions show that we can essentially directly assume fitness functions of the form
$$ w_k = e^{s_0 k^{\alpha}} \quad \text{if } \alpha \neq 0, \qquad w_k = e^{s_0 \ln(k+1)} \quad \text{if } \alpha = 0, $$ (2.5)
which can be expressed, neglecting terms of order $1/k$, as a sum of powers of the number of
accumulated mutations∗. The fitness function in Eq. (2.5), using $\ln(k+1)$ instead of $\ln(k)$,
automatically includes the case $k = 0$ with the correct normalization condition $w(0) = 1$.
We will refer to the case $\alpha \neq 0$ as the power law model, while the case $\alpha = 0$ corresponds
to the logarithmic model.
The third model we consider is characterized by a “geometric dependence” of the fitness
advantage on the number of acquired mutations. In this case the advantage function is
$g'(k) = q^{k-1}$ with $q < 1$ and the fitness is given by
$$ w_k = \exp\Bigl(\sum_{k'=1}^{k} s_0\,q^{k'-1}\Bigr) = \exp\Bigl(s_0\,\frac{1-q^{k}}{1-q}\Bigr). $$ (2.6)
In other words, the advantage accumulates following a geometric sum. As for the former models,
considering only the final form of the advantage function, $g(k) = \frac{1-q^{k}}{1-q}$, directly satisfies
the condition $g(0) = 0$. The constant factor $1-q$ could be absorbed in $s_0$, as for the power law
model.
Note that, while in the power law and logarithmic models the fitness can increase indefinitely,
the fitness for the geometric model is bounded from above: $\ln w(k \to \infty) = s_0/(1-q)$.
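The three advantage functions can be collected in a short sketch. The function names and the helper `increment` are hypothetical; `increment` returns the effective contribution of the $k$-th mutation to $g$, which is what decreases with $k$ under diminishing returns. Parameter values in the usage lines are illustrative.

```python
import math

def g_power(k, alpha):
    """Power law model, g(k) = k**alpha with 0 < alpha <= 1."""
    return k ** alpha

def g_log(k):
    """Logarithmic model, g(k) = ln(k + 1), so that g(0) = 0."""
    return math.log(k + 1)

def g_geometric(k, q):
    """Geometric model, g(k) = (1 - q**k) / (1 - q) with 0 < q < 1."""
    return (1 - q ** k) / (1 - q)

def increment(g, k):
    """Effective contribution of the k-th mutation, g(k) - g(k - 1)."""
    return g(k) - g(k - 1)

# Illustrative comparison of the advantage carried by the first mutations.
for k in range(1, 5):
    print(k,
          increment(lambda x: g_power(x, 0.5), k),
          increment(g_log, k),
          increment(lambda x: g_geometric(x, 0.5), k))
```

All three functions vanish at $k = 0$, and only the geometric one saturates, at $1/(1-q)$.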
The diminishing return model: phenomenology.
The population is composed of classes of individuals with the same number of mutations
that are in one-to-one correspondence to fitness advantage classes (Fig. 1a and Fig. S2). Direct
simulation of the model shows that both the mean advantage 〈s0g(k)〉 and average number of
mutations 〈k〉 grow sub-linearly for intermediate to long times (Fig. 1b, 1c).
This trend is independent from α (or from the specific model of epistasis g′(k)) and is due
to the fact that decreasing advantage and the consequent rise of the establishment threshold
for clones together slow down adaptation.
∗This notation requires an appropriate rescaling of $s_0\alpha \to \alpha$ in the case of $\alpha \neq 0$.
The time derivatives of $\langle s_0 g(k)\rangle$ and $\langle k\rangle$ estimate the adaptation speed $v_s$ and the mutation-
accumulation speed $v_k$ of a typical realization. Figure 1d shows an average of $v_k$ over 100
realizations, plotted as a function of time. The simulations indicate that vk relaxes to a plateau
which is close to the beneficial mutation rate Ub. Equivalently, for long times, the mean number
of fixed mutations shows a linear behavior in time with a rate close to Ub (red line in Fig. 1c).
In the same long-time limit, the advantage of a mutation s = s0g′(k) drops asymptotically to
zero.
At long times there is a transition to an effectively neutral regime: the selection coefficient
s0g′(k) becomes too small to be relevant, and the fixation dynamics is driven solely by genetic
drift. The probability of fixation of an essentially neutral mutation is ∼ 1/N while the rate
of appearance of new mutations is UbN . Therefore, the pace at which new mutations are
accumulated is approximately vk ∼ Ub. This neutral long-time regime is outside of the limit of
applicability of the model, and has to be regarded as unphysical, since when vk = Ub, deleterious
mutations cannot be neglected [90]. Thus, for any finite N the asymptotic trend of 〈k〉 has to be
interpreted as an effective signature of a change of regime for both vk and vs, where beneficial
mutations should be in equilibrium with deleterious ones, which could possibly be captured by
a variant of the model including deleterious mutations [90].
Note also that realistically the beneficial mutation rate itself could decrease in the later
stages of evolution [73, 91], eventually giving a contribution to the adaptation dynamics (see
Discussion).
The diminishing return model: infinite population limit
In the limit of infinite population, N → ∞, the dynamics of the model can be described
using the following equation [73],
$$ f(k,t) = (1-U_b)\,f(k,t-1)\,\frac{w(k)}{\langle w\rangle(t-1)} + U_b\,f(k-1,t-1)\,\frac{w(k-1)}{\langle w\rangle(t-1)}, $$ (2.7)
where $f(k,t)$ is the frequency of individuals with $k$ beneficial mutations at generation $t$,
$w(k) = e^{s_0 g(k)}$, and $\langle w\rangle(t) = \sum_k w_k f(k,t)$ is the mean fitness.
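Eq. (2.7) can be iterated directly on a list of class frequencies. The following is a minimal sketch under the assumption that the frequencies are stored in a Python list indexed by $k$; the function names `iterate_infinite_N` and `mean_k`, and the parameter values in the usage lines, are illustrative choices rather than the procedure used in this work.

```python
import math

def iterate_infinite_N(f, log_w, U_b):
    """One generation of Eq. (2.7); f[k] is the frequency of class k.

    Returns a list one element longer than f, since the highest class
    can gain one further mutation.
    """
    K = len(f)
    w = [math.exp(log_w(k)) for k in range(K)]
    mean_w = sum(wk * fk for wk, fk in zip(w, f))
    new_f = [0.0] * (K + 1)
    for k in range(K):
        selected = f[k] * w[k] / mean_w     # selection step
        new_f[k] += (1 - U_b) * selected    # offspring without a new mutation
        new_f[k + 1] += U_b * selected      # offspring moving to class k + 1
    return new_f

def mean_k(f):
    """Mean number of mutations for a frequency list f."""
    return sum(k * fk for k, fk in enumerate(f))

# Illustrative run with the power law model, starting from the wild type.
f = [1.0]
for _ in range(200):
    f = iterate_infinite_N(f, lambda k: 0.5 * k ** 0.02, 1e-3)
print(mean_k(f))
```

Each iteration conserves the total frequency, and the change of `mean_k` over one step equals the selection term plus $U_b$, which is the content of Eq. (2.8).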
Multiplying Eq. (2.7) by k and summing over k gives the following expression for the dy-
[Plots not reproduced. Panel (a): schematic of the distributions f(k, t) versus number of mutations k (width Lk, speed Vk) and f(s0g(k), t) versus advantage s0g(k) (width Ls, speed Vs). Panel (b): 〈s0g(k)〉 versus generations. Panel (c): 〈k〉 versus generations. Panel (d): Vk (axis label as extracted: Vk x Ub) versus generations.]
Figure 1: Basic features of the model. (a) Because of competition between beneficial mutations, the popu-
lation is divided into sub-populations with different frequencies, defined by the number of mutations k (see also
Fig.S2). Lk is the difference, in number of mutations, between the maximum number of mutations found in a
clone kmax and the mean. This induces a distribution for the log-fitness s0g(k). (b - c) Both distributions travel
in time, driven by established beneficial mutations, with instantaneous velocities vk and vs. The panels show the
increase in time of 〈k〉 and of 〈s0g(k)〉 obtained by direct simulation of the diminishing return model. The plot in
panel (c) is in log-log scale, and the data are compared with a reference straight line (dashed blue line, with slope
$5 \cdot 10^{-3}$) to highlight the sublinear growth of 〈k〉. The continuous red line shows the asymptotic long-time
linear behavior with slope corresponding to Ub. (d) Long-time behavior of the mean speed of fixed mutations vk
(green symbols), averaged over different realizations. For long times, this quantity decreases (as a power law)
towards the limit value vk = Ub, where the assumptions of the model break down and deleterious mutations need
to be accounted for [90]. This limit also corresponds to the limit value of vk obtained by an infinite-N estimate
(see text). Simulations are carried out using the parameters $N = 5 \cdot 10^{7}$, $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1 \cdot 10^{-3}$.
Averages are computed over 100 realizations (these averages are implied in the notations for the y-axis labels).
[Plots not reproduced. Panel (a): 〈Vark〉R versus 〈VarMF〉R (both axes from 0 to 300). Panel (b): σR(Vark)/〈Vark〉R versus Log10(N), with series for generations 10^5, 10^6 and 10^7.]
Figure 2: The infinite-N approximation captures a relation between vk and the width of the fitness
class distribution, which is valid at intermediate times for moderate N and until longer times
for large population sizes. (a) Simulated variance of the mutation class distribution, shown as a function
of the expected variance from the infinite-N estimate ($\mathrm{Var}_{inf} = (v_k - U_b)\,\langle k\rangle^{1-\alpha}\,(s_0\alpha)^{-1}$, see Eq. (2.9)). To
avoid ambiguities, averages over realizations are indicated by a suffix R. The continuous red line represents the
theoretical prediction $\mathrm{Var}_{inf} = \mathrm{Var}_k$. The error bars (standard deviations over realizations, $\sigma_R(\mathrm{Var}_k)$) become
larger with $\langle \mathrm{Var}_k\rangle_R$. (b) While for increasing times $\sigma_R(\mathrm{Var}_k)$ diverges, the relative variability $\sigma_R(\mathrm{Var}_k)/\langle \mathrm{Var}_k\rangle_R$
over realizations decreases with increasing population size $N$, for any fixed time. This suggests that the infinite-N
estimate is well-defined. Simulations are carried out using the parameters $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1 \cdot 10^{-3}$.
Population size in panel (a) is $N = 10^{7}$.
namics of the mean number of mutations $\langle k\rangle(t) = \sum_k k\,f(k,t)$,
$$ \langle k\rangle(t+1) = \frac{\langle k\,w\rangle(t)}{\langle w\rangle(t)} + U_b . $$ (2.8)
This expression can be further simplified assuming that the frequency distribution is narrowly
peaked around the mean (which travels in time) and that it can be expressed as f(k, t) ≈
δ(k; 〈k〉(t)). In this case it can be easily verified that vk = Ub.
A different, more instructive, relation, which takes into account the width of $f(k,t)$, can be
obtained starting from Eq. (2.8) and expanding $w(k)$ under the assumption that
$D_k \equiv (k - \langle k\rangle) \ll \langle k\rangle$ for every index $k$ of non-empty classes. The fitness for the power law
model, for example, becomes $w(k) = e^{s_0 k^{\alpha}} \approx e^{s_0 \langle k\rangle^{\alpha}}\bigl(1 + s_0\alpha\langle k\rangle^{\alpha-1} D_k\bigr)$.
The assumption Dk ≪ 〈k〉 is verified by simulations (see Supplementary Fig. S5) and by
further considerations on the finite-N width of the distribution given in the following sections.
Computing the averages 〈w〉 and 〈kw〉 to first order in Dk, and noticing that 〈Dk〉 = 0 and
〈kDk〉 = Vark, we obtain an expression for the speed of accumulated mutations as a function
of the variance of the fitness class distribution.
Estimating vk as $d\langle k\rangle/dt \approx \langle k\rangle(t+1) - \langle k\rangle(t)$ gives, for the case $g(k) = k^\alpha$,
$$\frac{d\langle k\rangle}{dt} \approx s_0\alpha\langle k\rangle^{\alpha-1}(t)\,\mathrm{Var}_k(t) + U_b . \qquad (2.9)$$
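The first-order estimate of Eq. (2.9) can be compared numerically with the exact one-step change of 〈k〉 implied by Eq. (2.8). The following is a minimal sketch, not the simulation code used in this work: the discretized Gaussian used as a test distribution, its center and width, and all function names are illustrative assumptions.

```python
import math

def fitness(k, s0, alpha):
    """Power law model: w(k) = exp(s0 * k**alpha)."""
    return math.exp(s0 * k ** alpha)

def moments(f):
    """Mean and variance of a normalized distribution given as {k: frequency}."""
    mean = sum(k * p for k, p in f.items())
    var = sum((k - mean) ** 2 * p for k, p in f.items())
    return mean, var

def speed_exact(f, s0, alpha, ub):
    """One-step change of <k> from Eq. (2.8): <k w>/<w> - <k> + Ub."""
    mean, _ = moments(f)
    w_avg = sum(fitness(k, s0, alpha) * p for k, p in f.items())
    kw_avg = sum(k * fitness(k, s0, alpha) * p for k, p in f.items())
    return kw_avg / w_avg - mean + ub

def speed_eq29(f, s0, alpha, ub):
    """First-order estimate of Eq. (2.9): s0*alpha*<k>**(alpha-1)*Var_k + Ub."""
    mean, var = moments(f)
    return s0 * alpha * mean ** (alpha - 1) * var + ub

def narrow_distribution(center, width, half_range=15):
    """Hypothetical test case: discretized Gaussian over mutation classes."""
    ks = range(center - half_range, center + half_range + 1)
    weights = {k: math.exp(-((k - center) ** 2) / (2 * width ** 2)) for k in ks}
    total = sum(weights.values())
    return {k: wgt / total for k, wgt in weights.items()}

# Illustrative distribution centered at <k> = 50 with Dk << <k>.
f = narrow_distribution(center=50, width=2.0)
exact = speed_exact(f, s0=0.1, alpha=0.2, ub=6e-6)
approx = speed_eq29(f, s0=0.1, alpha=0.2, ub=6e-6)
print(exact, approx)
```

For a distribution this narrow the two values should be very close, because the neglected terms are of higher order in Dk/〈k〉.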
According to this equation, vk is driven by two terms, the increase of mutations due to the
beneficial mutation rate Ub and the selection of individuals with larger fitness. This result is
connected to Fisher’s fundamental theorem and the result obtained by Guess [92], which relate
the speed of adaptation to the variance of the fitness. In this case, the speed of accumulation
of successive mutations is related to the width of the mutation class histogram, but rescaled by
the factor $\langle k\rangle^{\alpha-1}(t)$, which decreases with time.
Note that in the limit α = 1 we recover the usual linear proportionality, since the advantage
is linear in the mutation class index [73].
The infinite-N limit of the model with α = 1 has been previously addressed by Park et al. [73]
with a moment generating function approach. In particular, they estimated the distribution
variance as $\mathrm{Var}_k \simeq (1 - U_b)/s_0$. This estimate, substituted in Eq. (2.9), gives their expression for the
speed, $v_k \simeq 1$ (in the limit of small Ub), suggesting that our result is a consistent generalization
to the epistatic case.
For the diminishing returns model, the increase in the width of the distribution of k does
not compensate for the term $\langle k\rangle^{\alpha-1}$ (which tends to 0), and, for long times, Eq. (2.9) predicts
the limit velocity vk = Ub, as observed in simulations (Fig. 1e).
Simulated data agree well, at intermediate times, with the mean-field estimate for
the variance Vark of the mutation class distribution (Fig. 2). Additionally, Vark/〈k〉 decreases
quickly with time (see Fig. S5), justifying the assumption of small Dk/〈k〉 †.
However, the variability of Vark over different realizations increases quickly. In order to verify
whether these fluctuations are well-behaved, we have evaluated γ = σR(Vark)/〈Vark〉R. Here Vark
indicates the variance of the mutation class distribution in a single realization (roughly analogous
to $L_k^2$), and the suffix R indicates averages over realizations. Specifically, 〈x〉R indicates the
average of the quantity x over different realizations, while σR(x) is its standard deviation.
Therefore, γ, plotted in Fig. 2b as a function of N, represents the relative variability over the
realizations of the variance Vark of the distribution.
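As a small illustration of the definition of γ, the sketch below computes the relative variability from a list of per-realization variances. The input values are hypothetical placeholders, and the use of the population (rather than sample) standard deviation is an assumption, since the estimator is not specified here.

```python
import statistics

def relative_variability(variances):
    """gamma = sigma_R(Var_k) / <Var_k>_R over per-realization variances."""
    mean = statistics.fmean(variances)
    # Assumption: population standard deviation over realizations.
    return statistics.pstdev(variances) / mean

# Hypothetical variances of the mutation class distribution, one per realization.
example = [3.0, 5.0]
gamma = relative_variability(example)
print(gamma)
```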
For any fixed time, this quantity decreases with N, suggesting that the mean-field limit is
well-defined for infinite populations. Conversely, fixing N and increasing t, γ appears to reach
finite values, so that Vark (and hence vk) seems to be non-self-averaging in time.
This effect and its extent are due to both genetic drift and to the shape of the fitness
landscape.
In summary, a mean-field description of the population dynamics does not work properly
for longer time-scales at finite N , as previously shown in the case of no epistasis (α = 1) [73].
However, the infinite-population limit is instructive for understanding the main mechanisms
driving the model, and provides reasonably good estimates at intermediate times even for finite
populations. Finally, the extreme diversity between realizations at finite population sizes might
have some empirical relevance since most of the experimental results concern a single or a few
evolutionary trajectories.
†It is expected that $|D_k| \lesssim (\mathrm{Var}_k)^{1/2}$ for every k.
The diminishing return model: behavior at finite population size.
Experimental conditions should be modeled accounting for finite-size populations at intermediate
times (i.e. on the relevant experimental time scale, ≈ $10^2$ to $10^4$ generations, with
〈k〉 ≈ 10 to $10^2$). Direct simulations of the diminishing return model show that the frequency
distributions of the mutation and advantage classes are approximately Gaussian (as in a constant
advantage model), even when the advantage function has fairly high curvature (Fig. 3a,
Fig. S7).
The mean-field estimate suggests that the widths of these histograms are related to the
speed of adaptation and, consequently, of mutation accumulation. Given the discrete nature of
these distributions, their widths are well represented by the distances Lk and Ls of the foremost
bin from the average (shown in Fig. 1a). For a diminishing return model, Lk is an increasing
function of the mean number of mutations 〈k〉, while Ls decreases with 〈k〉 (Fig. 3b).
The two speeds are connected by the decay of the advantage between k and k + 1, as well as
by the change in the width of the distributions with increasing 〈k〉. Indeed, in the long-time
limit, even if vk > 0, vs vanishes. The existence of a large number of sub-populations with
different k but essentially the same fitness, leads to the effectively neutral behavior discussed in
the previous section.
The “stochastic edge” estimates for the finite population adaptation speed available for the
multiple-mutations model are based on the hypothesis that the only class subjected to substantial
stochastic effects is the fittest one, i.e. that ∆s ≳ Ub [75, 90]. For the diminishing return
model, when ∆s ≈ Ub this approximation in general fails. However, for experimentally relevant
parameters we can always suppose that the stochastic edge approximation is valid. Supposing
that $s_0 = 5 \cdot 10^{-1}$, α = 0.02 (this is a much stronger epistatic effect than the one we estimate
from experimental data, see Sec. 2.3) and that Lk ≈ 50 (of the order of the values obtained
from simulations with this parameter set and population size $N = 10^7$ to $10^{13}$), then ∆s becomes
close to a beneficial mutation rate $U_b \approx 10^{-3}$ (which can be considered very large [88, 89]) for
$\langle k\rangle \approx 5 \cdot 10^2$. This exceeds the interesting experimental range of 〈k〉 ($10^1$ to $10^2$).
Thus, since the simulated advantage and mutation class histograms are both nearly Gaussian
(but not stationary in width), it is possible to generalize the estimates applied for the standard
multiple-mutations model [75]. Supposing a slow increase of the width of the distribution with
k and assuming that the width of the histogram is stable during the establishment time
[Figure 3, panels (a)-(c). (a) Frequency versus Number of Mutations (top) and versus Advantage (bottom),
at generations $10^3$, $5 \cdot 10^3$, $10^4$, $2.5 \cdot 10^4$, $5 \cdot 10^4$ and $7.5 \cdot 10^4$. (b) Lk (top) and Ls (bottom)
versus 〈k〉. (c) Vk (top) and Vs (bottom) versus 〈k〉.]
Figure 3: (Color online) The histograms of fitness advantage and mutation classes have nearly Gaus-
sian forms. While adaptation slows down, the latter histogram expands while the former becomes
increasingly peaked. (a) Histograms of the mutation classes (top) and advantage classes (bottom) obtained
from simulations averaging over 200 realizations at different generations (different symbols, see legend). The
parabolic form in the semi-log plot indicates that they are approximately Gaussian (solid lines connecting the
symbols). The establishment size 1/π(s) is represented as a dashed line in the top panel. (b) Simulated data for
the widths of the mutation class histogram Lk (green squares, top) and of the fitness advantage histogram, Ls
(blue circles, bottom), plotted as a function of the mean number of mutations 〈k〉. The continuous line represents
the theoretical estimates of the width (see Eq. (2.10) in Appendix). (c) Plots of the speed of mutation accu-
mulation vk (top, green squares) and of adaptation vs (bottom, blue circles), as a function of the mean number
of mutations 〈k〉. Continuous lines are the corresponding theoretical estimates (see Eq. (S16) of Appendix).
Note that since τk depends logarithmically on Lk, the estimates for both speeds are in satisfactory agreement
with the simulated data even if Lk is approximated more roughly. The parameters used in the simulations are
$N = 10^9$, $U_b = 6 \cdot 10^{-6}$, $s_0 = 0.1$, $\alpha = 0.2$, compatible with those estimated (see Sec. 2.3). Averages are
performed over 200 realizations.
τk necessary for the new class to reach the establishment threshold, the mean of the distribution
moves from $\langle k\rangle_t$ to $\langle k\rangle_t + 1$ during this time and $L_k \approx L_{k+1}$. For 〈k〉 ≫ Lk > 1 (and 〈k〉 not
too high due to the condition of sufficiently large ∆s) the advantage of the edge with respect
to the mean class is $\Delta s_k = s_0(k^\alpha - \langle k\rangle^\alpha) \approx s_0 \alpha L_k k^{\alpha-1}$.
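The threshold 〈k〉 ≈ 5 · 10^2 quoted above for s0 = 0.5, α = 0.02, Lk ≈ 50 and Ub ≈ 10^-3 can be recovered from the approximate edge advantage s0 α Lk k^(α−1). The sketch below is only a rough check under one assumption: the mean 〈k〉 is used in place of the edge index k, which is reasonable when 〈k〉 ≫ Lk. The function names are hypothetical.

```python
def edge_advantage(k_mean, l_k, s0, alpha):
    """Approximate edge advantage: s0 * alpha * L_k * <k>**(alpha - 1)."""
    return s0 * alpha * l_k * k_mean ** (alpha - 1)

def k_where_edge_meets_ub(l_k, s0, alpha, ub):
    """Mean mutation number at which the approximate edge advantage equals Ub."""
    return (s0 * alpha * l_k / ub) ** (1.0 / (1.0 - alpha))

# Parameter values quoted in the text for this worst-case example.
k_star = k_where_edge_meets_ub(l_k=50, s0=0.5, alpha=0.02, ub=1e-3)
print(k_star)
```

The resulting value is of the order of a few hundred mutations, above the experimentally relevant range of 〈k〉.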
Different estimates can be obtained depending on the approximations taken, giving implicit
or explicit formulas. For example, assuming that k ≫ Lk and neglecting the logarithmic
corrections in Lk we obtain a closed expression for Lk,
$$L_k = \frac{2\log\left(N s_0 \alpha k^{\alpha-1}\right)}{\log\left(\frac{s_0 \alpha k^{\alpha-1}}{U_b}\right)} . \qquad (2.10)$$
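Eq. (2.10) is straightforward to evaluate numerically. The sketch below uses the parameters quoted in the caption of Fig. 3 (N = 10^9, Ub = 6 · 10^-6, s0 = 0.1, α = 0.2); the function name and the sample values of k are illustrative choices, and natural logarithms are assumed.

```python
import math

def edge_distance(k, n, s0, alpha, ub):
    """Closed expression of Eq. (2.10) for the width L_k (natural logs assumed)."""
    s_eff = s0 * alpha * k ** (alpha - 1)  # effective advantage s0*alpha*k**(alpha-1)
    return 2.0 * math.log(n * s_eff) / math.log(s_eff / ub)

# Evaluate L_k at a few illustrative values of k.
for k in (10, 40, 80):
    print(k, edge_distance(k, n=1e9, s0=0.1, alpha=0.2, ub=6e-6))
```

With these parameters the expression gives widths of a few mutation classes that grow slowly with k, which is the qualitative behavior described for Lk in the text.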
Comparison with simulated data shows that the expression for Lk in Eq. (2.10), including
small-L corrections, is a reasonably good estimate of the width of the distribution for empirically
plausible parameters (Fig. 3b). In particular, the speed of adaptation and mutation
accumulation (Fig. 3c) are well captured by this analytical description.
A mathematical argument for the validity of this extension of the multiple-mutations model
is presented in the Appendix.
The approximations discussed above are valid as long as 〈k〉 is large enough to exceed Lk,
but small enough not to make ∆sk too small (one can e.g. impose that ∆sk ≫ 1/Ub to stay far
from the neutral limit).
The main source of error in this estimate is the neglect of logarithmic corrections. It is possible
to account for these corrections using the simulated values of Lk. We verified that this leads
to a small underestimate of the width of the distribution (not shown). Additional sources of
deviation are the neglect of non-leading mutation classes in estimating the establishment
time, the use of an average growth rate and the assumption of exponential growth of each class
starting from the establishment size (see Appendix). For the empirically plausible values of N,
Ub, and s0 we estimate that the model should be valid for values of 〈k〉 up to $10^2$. Depending on
the parameters and the specific model chosen, this range of validity can be far larger. Moreover,
for even larger k we verified that vk = 1/τk tends to a constant (Ub, with the approximations
taken), restoring the correct result for infinite population size.
2.3 Parameter-matching procedure for diminishing return model
We used a simple procedure for choosing a set of parameters and a functional form of the
advantage g(k) compatible with existing data.
We considered fitness/mutations data from two laboratory evolution experiments using the
three different variants of the diminishing return model defined in the former section. The
power law model, where the advantage is described by $s_0 g(k) = s_0 k^\alpha$, is the main case we
presented, while the advantage functions of the other two models considered have a logarithmic and
an exponential dependence on k.
Experiments analyzed
We considered data from two experiments. The first data set concerns Acinetobacter baylyi
and was obtained from a chemostat experiment [93], while the second data set comes from the
initial $2 \cdot 10^4$ generations of the well-known “Escherichia coli long-term evolution experiment”
(LTEE) [66, 68, 94].
The A. baylyi experiment studied the population dynamics in a chemostat using a minimal
medium supply for about four months (≈ 3000 generations), at a dilution rate $D \approx 0.7\ \mathrm{h}^{-1}$.
The use of a chemostat made it possible to grow a large population ($N \approx 3 \cdot 10^{10}$) under controlled
conditions for a fairly long time. Since the number of individuals is large, it is expected that
different sub-populations will grow in parallel in the clonal interference regime. This assumption
has been confirmed by population sequencing data [93]. Maximum growth-rate measurements
were performed in batch on 21 isolated clones and on the original strain introduced in the
chemostat (wild type), fitting the growth curve during the exponential phase.
Even if generations are not synchronous, the mean growth rate is held fixed by the dilution
rate in the chemostat. In fact, when a monoclonal population grows in the chemostat, its growth
rate equals the dilution rate D, and the generation time is defined as tgen = ln(2)/D.
In the presence of different sub-populations the dynamics becomes more complex [95]. The measured
values of µmax differ from the effective growth rate that each population can reach in the
chemostat because of competition between different sub-populations for a limited amount of
nutrients.
For the sake of simplicity, and since the detailed dynamics is not experimentally accessible, we
assume that the growth rate of all clones is fixed by the dilution rate over the whole experiment,
while the measured values of µmax are supposed to be indicative of the effective fitness of the
different sub-populations inside the chemostat [93].
Thus, the different maximum growth rates of clones measured in batch are interpreted as different
survival probabilities of their offspring inside the chemostat.
Fitness is defined as:
$$w(i) = e^{m_i} = e^{\mu_{max,i}\, t_{gen}} \qquad (2.11)$$
where the index i indicates the i-th sub-population, mi is the growth rate expressed in
1/generation, experimental data of µmax,i have the units of $\mathrm{h}^{-1}$, and tgen = ln(2)/D ≈ 1 h.
In simple words, the fitness of an individual is defined as the amount of offspring assuming it
can grow at the maximum growth rate over an average generation defined by the chemostat
dilution rate (hence, fitness is a dimensionless quantity).
Since the relevant quantity of the dynamics is the relative fitness, during the parameter-matching
procedure we used the normalized fitness of clone i, $w_{exp}(i) = e^{\mu_{max}(i) - \mu_{WT}}$, to infer
the functional form of the advantage. Note that, since the reference fitness value is given by
the ancestral growth rate, wexp(WT) = 1.
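The definitions above translate into a few lines of code. The sketch below is illustrative only: the growth rates are hypothetical placeholders, not the measured values of ref. [93], and the optional t_gen factor in the normalized fitness is an assumption that reduces to the expression in the text when tgen = 1 h.

```python
import math

def generation_time(dilution_rate):
    """t_gen = ln(2) / D; in hours if D is given in 1/h."""
    return math.log(2) / dilution_rate

def fitness(mu_max, t_gen):
    """Eq. (2.11): w = exp(mu_max * t_gen), a dimensionless quantity."""
    return math.exp(mu_max * t_gen)

def normalized_fitness(mu_max, mu_wt, t_gen=1.0):
    """Fitness relative to the ancestor; exp(mu_max - mu_WT) when t_gen = 1 h."""
    return math.exp((mu_max - mu_wt) * t_gen)

t_gen = generation_time(0.7)   # dilution rate D of about 0.7 1/h, as in the text
mu_wt, mu_clone = 0.9, 1.1     # hypothetical growth rates in 1/h
print(t_gen, normalized_fitness(mu_clone, mu_wt))
```

By construction the ancestor has normalized fitness 1, and any clone with a larger maximum growth rate has a value above 1.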
Whole-genome sequencing was performed on two single clones isolated at the end of the
experiment (AB2800b and AB2800a) and on the ancestral strain, which served as a reference for
identifying mutations in evolved clones. A total of 11 mutations were detected in the evolved
clones, eight of them in common between the two.
Population sequencing was performed at three time points, confirming the presence of different
subpopulations.
Additional sequencing was performed in the remaining selected clones on PCR fragments
encompassing the mutated loci identified in the two end-point clones. This made it possible to
reconstruct a sketch of the history (or the genealogy) of the mutation appearances. Thus, the
number of accumulated mutations has to be considered as a lower bound for all the clones,
except for the ancestor and the two fully sequenced clones, where it is measured directly.
However, the indications of the number of mutations from the population sequencing data are
compatible with the inferred values of k [93].
A comprehensive description of the experimental methods used for the experiments with
A. baylyi can be found in ref. [93].
The E. coli long-term evolution experiment concerns twelve E. coli populations evolved
in parallel in batch for about twenty-five years, corresponding to more than $6 \cdot 10^4$ generations.
Serial dilution 1 : 100 was performed daily, allowing ≈ 6.6 generations each day and an effective
population size of ≈ $2 \cdot 10^7$ individuals [68]. The maximum population size (≈ $5 \cdot 10^8$) is
fixed by the total amount of nutrient in the medium. Interestingly, six populations became
hypermutators after a few thousand generations, accumulating a large number of mutations.
We used the mutation and fitness data of the population designated Ara-1 for the first
20000 generations (corresponding to ≈ 10 years), as given in ref. [66].
Fitness was measured experimentally through competition experiments as the logarithm
of the increase in frequency. Competition experiments were performed between samples at
different times (every 1000 generations) and a spontaneous mutant of the ancestor, which is
easier to track visually and has been verified to have almost the same fitness [96]. The relative
log-fitness of population i with respect to the wild type is defined as
$$\varphi(i) = \frac{\log\left(f(1)_i / f(0)_i\right)}{\log\left(f(1)_{WT} / f(0)_{WT}\right)} , \qquad (2.12)$$
where f(0) and $f(1) = f(0)e^{\mu t}$ indicate the frequencies at the beginning and the end of the
competition experiment [66, 96]. The ratio ϕ(i) can be seen as the ratio between the growth
rates of the two populations and corresponds to log-fitness in the model. Note again that the
experimental data are dimensionless.
The difference between the growth rates can be approximately deduced as
$$(\varphi(i) - 1)\,\mu \approx (\mu_i - \mu_{WT})$$
where µ is the mean growth rate [68]. Since we are interested in expressing the fitness as the
mean number of offspring per generation, we use as mean growth rate $\mu = \log(2)\ \mathrm{generation}^{-1}$.
Thus we define the normalized fitness as $w(i) = e^{(\varphi(i)-1)\log(2)}$. The fitness of the reference strain
is again wexp = 1.
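The two definitions, Eq. (2.12) and the normalized fitness, can be chained as in the following sketch. The frequencies used here are hypothetical numbers chosen for illustration, not LTEE measurements, and natural logarithms are assumed.

```python
import math

def relative_log_fitness(f0_i, f1_i, f0_wt, f1_wt):
    """Eq. (2.12): ratio of the log-increases of population i and of the wild type."""
    return math.log(f1_i / f0_i) / math.log(f1_wt / f0_wt)

def normalized_fitness(phi):
    """w = exp((phi - 1) * log(2)); equals 1 for the reference strain (phi = 1)."""
    return math.exp((phi - 1.0) * math.log(2))

# Hypothetical competition: both strains start from 1.0; the evolved one grows
# with rate*time = 1.2, the reference with rate*time = 1.0.
phi = relative_log_fitness(1.0, math.exp(1.2), 1.0, math.exp(1.0))
print(phi, normalized_fitness(phi))
```

A strain growing exactly like the reference has ϕ = 1 and normalized fitness 1, while ϕ = 2 corresponds to twice as many offspring per generation.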
Genome sequencing was performed on samples from generations 2000, 5000, 10000, 15000
and 20000, as well as on the ancestor. A total of 45 mutations were found in the most evolved
strain, most of which were stable in later clones [66]. Detailed information about the long-term
evolution experiment with E. coli, and in particular about fitness measurements, can be found
in refs. [66, 68, 94, 96].
Comparison of the experiments reveals some features in common and some remarkable
differences between the two. Both experiments concern bacteria but were performed using
distinct propagation techniques. Their duration is quite different both in terms of generations
($2 \cdot 10^4$ and $3 \cdot 10^3$) and experimental time (≈ 10 years and ≈ 4 months). However, their
duration is long enough to observe a deceleration of fitness increase [78]. Another feature
shared by the two experiments is the large (effective) population size, which suggests that the
clonal interference regime might be relevant. The simultaneous presence of different genotypes
within the population has been verified in both the experiments [93, 97]. Moreover, the decrease
of the beneficial effect of the first five fixed mutations in the E. coli serial dilution experiment
has been demonstrated [80]. The effect is more complex than the description of diminishing
returns given here. However, as suggested by the authors, one can surmise that a simplified
model including epistatic interactions might be useful to roughly describe this phenomenon.
For all the analyzed clones in both experiments, it is possible to associate fitness values with
numbers of mutations, and thus bridge fitness with the parameter k in our model.
Note that in our modeling framework, as in most standard evolutionary models, the popula-
tion size is kept constant, all individuals are substituted by newly generated offspring at every
generation and we assume a constant time interval between generations, although none of these
assumptions are completely verified in the two experiments we considered. Since the speed of
evolution depends logarithmically on the population size, the error on the estimate of Ub given
by the constant population approximation should be very low.
Matching procedure and estimate of Ub
During the first step of the parameter-matching procedure, we found the best-fitting parameters
for each functional form of the fitness advantage function g(k). This uses data on fitness
values and number of acquired mutations, using the definitions given in the previous section.
Since the number of experimental points is low (5 for the E. coli and 21 for the A. baylyi
experiment, the latter corresponding to 9 different values of k), it is possible to obtain good fits with different
functional forms of the advantage s0g(k). The results of the fit using the three different
functional forms considered here (power law, logarithm and geometric) are shown in the top panel of
Fig. 5.
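For the power law form s0 k^α, one simple way to carry out such a fit is a linear least-squares regression in log-log space. This is only a sketch of one possible approach: the fitting method actually used is not specified here, the helper name is hypothetical, and the input points below are synthetic values generated from the power law parameters quoted later in this section for A. baylyi, not experimental data.

```python
import math

def fit_power_law(ks, advantages):
    """Least-squares fit of log(adv) = log(s0) + alpha*log(k); returns (s0, alpha).

    Requires k >= 1 and positive advantages (the ancestor, k = 0, is excluded).
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(a) for a in advantages]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    s0 = math.exp(y_mean - alpha * x_mean)
    return s0, alpha

# Synthetic, noise-free points generated from s0 = 0.42, alpha = 0.19.
ks = [1, 2, 4, 6, 8, 11]
advantages = [0.42 * k ** 0.19 for k in ks]
s0_fit, alpha_fit = fit_power_law(ks, advantages)
print(s0_fit, alpha_fit)
```

With only a handful of noisy points, as in the experiments considered, several functional forms can fit comparably well, which is why the later steps of the procedure are needed to discriminate among them.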
In a second step of the procedure, simulations are repeated for a wide range of values of Ub.
A qualitative estimate of this parameter can be obtained using the mean number of mutations
at the end of the two experiments (Fig. 4). In our case, 〈k〉 = 11 and 〈k〉 = 45 for the A. baylyi
experiment and the E. coli experiment, respectively. In the model, for a fixed interval of time steps,
the number of accumulated mutations increases significantly and monotonically with Ub, so that
the number of time steps necessary to reach 〈k〉 = 11 and 〈k〉 = 45 decreases with Ub. Thus,
for each model there is a single value of the parameter that verifies 〈k〉(texp) = 〈k〉exp (Fig. 4,
Fig. 5).
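The matching condition 〈k〉(texp) = 〈k〉exp can be viewed as a one-dimensional root-finding problem in Ub. The sketch below uses bisection on log(Ub) and assumes that the time needed to reach the experimental 〈k〉 decreases monotonically with Ub. The function `toy_time_to_reach` is a deliberately crude, hypothetical stand-in (constant speed vk = Ub) for the actual population simulations, used only to make the example runnable.

```python
import math

def match_ub(time_to_reach, t_exp, ub_lo=1e-9, ub_hi=1e-1, iters=60):
    """Bisection on log(Ub) for the value at which time_to_reach(Ub) = t_exp.

    Assumes time_to_reach decreases monotonically with Ub and that the
    solution lies between ub_lo and ub_hi.
    """
    lo, hi = math.log(ub_lo), math.log(ub_hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if time_to_reach(math.exp(mid)) > t_exp:
            lo = mid   # too slow: a larger Ub is needed
        else:
            hi = mid   # fast enough: try a smaller Ub
    return math.exp(0.5 * (lo + hi))

def toy_time_to_reach(ub, k_exp=45):
    """Hypothetical placeholder for a simulation: constant speed v_k = Ub."""
    return k_exp / ub

ub_est = match_ub(toy_time_to_reach, t_exp=20000)
print(ub_est)
```

In a real application `time_to_reach` would wrap simulations of the model with a fixed s0g(k), averaged over realizations; working in log(Ub) is a natural choice because the candidate values span several orders of magnitude.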
Note that referring to the experimental values of k as mean values we are assuming that
the sequenced clones are representative of the populations. For the A. baylyi experiment, the
uncertainty on the number of mutations present in the clones is a relevant source of error. For
the E. coli experiment, the number of mutations is referred to single clones, but the associated
fitness is a population mean.
The procedure described so far yields some estimated values of the beneficial mutation rate.
However, these values depend on the advantage model, and vary up to three orders of magnitude.
Nevertheless, the estimates are roughly in the expected biological range ($10^{-8}$ to $10^{-5}$, see
refs. [88, 89]).
The third step of the estimate procedure makes it possible to select among the advantage models, using the
comparison of simulated and experimentally measured dynamics for the fitness and the number
of mutations as a function of time (similarly to ref. [78]). Note that the latter function is not
trivially equivalent to the fitness as a function of k, but contains the effects of the population
dynamics in the presence of clonal interference, for which the model provides a description.
[Figure 4: schematic in three panels. Top: Advantage versus Number of mutations k (estimate of the fitness
advantage s0g(k) from data). Middle: 〈k〉 versus Time steps (simulations with fixed g(k) and varying Ub),
with the estimated experimental 〈k〉 marked. Bottom: Time steps versus Beneficial Mutation Rate Ub
(estimate of Ub using the experimental time), with the experimental number of generations marked.]
Figure 4: Sketch of the parameter matching procedure leading to an estimate of the beneficial
mutation rate from experimental fitness data. Top: the experimental data for fitness as a function
of the number of mutations (blue symbols) make it possible to estimate the advantage fitness function s0g(k) from a fit
(continuous green line). Middle: simulations are run using the estimated advantage function for different values of
Ub, which remains undetermined. This leads to different predicted dynamics for the number of acquired mutations
(dotted, dashed, and dashed-dotted lines) and the fitness increase in time. These predictions can be matched with
the experiment to estimate Ub. In particular, the predicted number of time steps necessary to reach the final
experimental number of mutations varies with the beneficial mutation rate (filled circles). Bottom: The estimate
of Ub is obtained by matching the predicted final time with the experimental one. The best value of Ub (blue star) is
obtained when the predicted number of time steps necessary to reach the final number of mutations (determined
in the experiment) corresponds to the number of experimental generations (horizontal continuous red line). We
applied this procedure to the three model variants for the diminishing return described in the main text.
[Figure 5, panels (a)-(d), each shown for A. baylyi and E. coli. (a) Advantage versus Number of mutations
(legend: Power Law, Geometric, Logarithm, Data). (b) Time steps versus Beneficial mutation rate.
(c) Advantage versus Generations. (d) Mutations versus Generations.]
Figure 5: Estimate of the beneficial mutation rate from different experiments. Estimate of the beneficial
mutation rate using data from Jezequel et al. [93] and Barrick et al. [66]. (a) Estimate of the fitness advantage
function s0g(k) from data of advantage as a function of mutation number (red symbols), as described in the top
panel of Fig. 4. The different lines correspond to the three model variants for the diminishing-return advantage
(power-law dot-dashed purple lines, logarithmic long-dashed green lines, geometric blue dashed lines, as in legend).
For the sake of simplicity, for the A. baylyi experiment (left panel) only the mean value of the advantage is
shown for each k. Complete data are reported in ref. [93]. (b) Estimates of Ub, from different model variants, obtained
by matching the time for which 〈k〉(texp) = 〈k〉exp predicted by simulations with a given s0g(k) to the experimental
number of generations, as described in the bottom panel of Fig. 4 (the different line styles refer to the model
variants as above). For the A. baylyi experiment we considered a total of 2850 generations. (c-d) Comparison of
the performance of the three model variants with the estimated values of Ub. For the Jezequel et al. data (left
panels), the power law and geometric variants are quite close, but the power law model for the diminishing return
gives the best agreement, especially considering the data relative to the number of acquired mutations as a
function of time. This choice leads to an estimate of Ub between $10^{-7}$ and $10^{-6}$. For the Barrick et al. data, the logarithm
and power-law models perform better, and give equivalent estimates of $U_b \approx 10^{-5}$.
A qualitative comparison between data and simulations indicates that for the A. baylyi
experiment the power law and geometric advantage models (with parameters s0 = 0.42, α = 0.19
and s0 = 0.24, q = 0.63 respectively) best describe the increase of the fitness with time (bottom
panel of Fig. 5). Comparing the increase of the number of mutations as a function of time in
the two models suggests that the power law model better resembles the data. This leads us
to prefer the power law model to compare with data. With this choice the estimated value of the
beneficial mutation rate is around $U_b \approx 3 \cdot 10^{-7}$. In the E. coli experiment, the power
law and logarithmic advantage models describe the data best, and are roughly equivalent (with
parameters s0 = 0.22, α = 0.26 and s0 = 0.16 respectively). Note that the beneficial effect of
the first fixed mutation is comparable to the value s0 ≈ 0.1 estimated from experiments [68, 98].
The estimated values obtained for the beneficial mutation rate are also essentially
equivalent ($U_b \approx 1 \cdot 10^{-5}$ for the power law advantage model and $U_b \approx 6 \cdot 10^{-6}$ for the logarithmic
advantage model). Note that the data points used in this step are much more abundant, since
they include a value of fitness every 1000 generations. A logarithmic increase of the fitness
advantage for the E. coli long-term evolution experiment is also suggested by a parallel work [61].
Finally, this simple procedure for comparing the model with data should not be affected by the
loss of the self-averaging property found in the model, which becomes relevant in a later regime. For
example, assuming the estimated parameters for the E. coli long-term evolution experiment,
and using the model to explore the relative variance of the speed over realizations σ(vk)/vk,
at $t = 2 \cdot 10^4$ generations one gets an error of the order of 2%. This value is still relatively
small. Considering the much larger number of accumulated mutations $k \approx 10^3$, corresponding
to $t \approx 2 \cdot 10^6$, the speed of evolution is ≈ $3 \cdot 10^{-4}$ (≫ Ub), and its relative variance over the
realizations is σ(vk)/vk ≈ 8%, which would still be under control.
Discussion and conclusions
Different laboratory evolution experiments show a decrease of the fitness advantage due
to newly acquired mutations and a decrease of the speed of evolution [65, 78, 99]. We considered
a simplified model, using a minimal number of parameters, which is a direct generalization of
the multiple-mutations model with constant advantage, but describes this feature in terms of
diminishing returns [78]. Specifically, it is assumed that the selective advantage of all individuals
having k beneficial mutations is identical, but decreases with k. We verified analytically and
with simulations that, in the infinite population limit, considering phenotypic fluctuations does
not affect our model (see Sect. A3).
We have shown that the basic phenomenology of the model entails a sublinear increase of the
mean number of fixed mutations and a steeper sublinear increase of the mean advantage. This
is in qualitative agreement with previous results using a similar model applicable in a regime
where concurrent mutations do not occur [78].
The evolutionary speeds of mutation accumulation and advantage are related to the width
of the distribution of coexisting advantage classes. We showed how a theoretical infinite-N
argument produces a relation between the speed of fixed mutations vk and the second moment of
the histogram of mutation classes, confirmed by simulations. Interestingly, simulations indicate
that for any finite N , different model realizations behave increasingly differently with time in
terms of both vk and width of the mutation class histogram. This non-self-averaging property
implies that even at intermediate times, the behavior of a realization can be quite different from
the average. However, as we have seen, this effect appears to become relevant in the model on
time scales longer than the experimental times considered here.
Finally, for finite population size, we were able to define through analytical arguments the
regime where the stochastic edge estimate of the adaptation and mutation speeds can be extended
to the case of diminishing returns, provided that the advantage s is substituted with the
appropriate function s0g′(〈k〉).
While more complex and realistic descriptions exist (see ref. [100], which incorporates genetically
linked multiple mutations explicitly), the advantage of this approach is that the model
depends on few parameters. We performed a numerical experiment giving a qualitative idea
of the comparison of the model with data, and allowing gross estimates. We considered
two different experimental data sets from long-term evolution experiments, and defined a
simple procedure to compare the model with data. Assuming the model, this procedure yields an
order-of-magnitude estimate of the beneficial mutation rate Ub. The values obtained for the
beneficial mutation rate fall within the range of the available measurements [88, 89], of the
order of $10^{-6}$ to $10^{-5}$ mutations per genome per generation for the E. coli long-term evolution
experiment, and between $10^{-7}$ and $10^{-6}$ in the case of A. baylyi.
As previously mentioned, the constant-advantage model can be seen as an effective description
of a multiple-mutation framework with a distribution of advantages [77], provided an effective
advantage is used. This description also produces an effective beneficial mutation rate that, in
a diminishing return framework, would vary with time, and thus would need an extension of
the present model to be fully implemented.
It is possible to give a rough but quantitative estimate of the underlying beneficial mutation
rate in the simple case of an exponential distribution of the advantages using the rescaling
procedure proposed by Good and coworkers. This estimate (Appendix, Sec. A4) indicates that
the rescaling should not affect the current order-of-magnitude estimates.
A very recent work [98] examines fitness trajectories of the long-term evolution experiment up
to 50k generations and matches them with a theoretical argument based on epistasis, but which
neglects multiple mutations [69]. Their results are fully in line with ours: power-law
behavior of the fitness and an estimated beneficial mutation rate of ≈ $10^{-6}$ or higher. Comparing
the model to the first 20k generations of the new dataset (where both mutation data and fitness
are available) gives α = 0.27, which is in line with our previous estimate and provides a measure
of the intensity of the epistatic interactions. In their work, Wiser and coworkers [98] introduce
epistasis using a parameter g that expresses the decrease of the expected advantage of new mutations.
They find that fitness, as a function of time, is proportional to $t^{1/(2g)}$. To obtain an
equivalent expression for w(t) in our model, we can relate k and t through the establishment
time (see Appendix Sec. A5), obtaining the approximate relation
$w(k) = s_0 k^{\alpha} \propto s_0 t^{\alpha/(2-\alpha)}$.
Comparing the two expressions maps the parameters of the two models onto each other: $g = (2-\alpha)/(2\alpha)$.
Substituting the estimated value of α gives g ≈ 3.2, which is close to the range of values
g ≈ 4 to 9 derived by Wiser and coworkers, and in particular similar to the value g ≈ 4 obtained
for the Ara-1 population analyzed here.
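The parameter map above can be checked with a few lines of code. The following is a minimal sketch; the function names are ours and purely illustrative, and the only input is the value α = 0.27 quoted in the text.

```python
def wiser_g_from_alpha(alpha: float) -> float:
    """Epistasis parameter g implied by the diminishing-return exponent alpha.

    Matching t**(1/(2*g)) to t**(alpha/(2 - alpha)) gives g = (2 - alpha)/(2*alpha).
    """
    if not 0.0 < alpha < 2.0:
        raise ValueError("alpha must lie in the open interval (0, 2)")
    return (2.0 - alpha) / (2.0 * alpha)


def fitness_time_exponent(alpha: float) -> float:
    """Exponent of the power law w(t) ~ t**(alpha/(2 - alpha))."""
    return alpha / (2.0 - alpha)


if __name__ == "__main__":
    g = wiser_g_from_alpha(0.27)
    print(f"g = {g:.1f}")  # prints "g = 3.2"
```

By construction, `1 / (2 * wiser_g_from_alpha(alpha))` equals `fitness_time_exponent(alpha)`, so the two power laws have the same exponent.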
Moreover our estimate of the epistatic interaction can be compared to the measurements
obtained in ref. [80] on the first five mutations acquired during the E. coli long-term evolution
experiment. A qualitative comparison shows agreement with our results. However, each single
mutation seems to have a specific value of s0, which would require a more complex modelling
approach.
The constant-advantage multiple-mutations model has also previously been applied to
short-term laboratory evolution experiments [79]. In those early stages, adaptation does not slow
down, and the assumption of constant advantage is justified. We can compare our results
to those obtained with the same procedure, using a constant advantage function g(k) = s0k,
applied to the increase in fitness during the late stages of both experiments. We
performed this test considering different “starting points”, i.e. initial generations in the empirical
data. The order-of-magnitude values of $U_b$ obtained with this procedure are similar to the ones
quoted above. However, one is forced to discard the information about the initial mutations,
and, as can be expected, the outcome for $U_b$ depends on the chosen starting point. We found
that it could vary by almost an order of magnitude across time intervals that appeared equally
reasonable to fit with a constant-advantage model. By contrast, the diminishing-return
model allows using data from all mutations and does not leave this freedom. Additionally, it
includes the early mutations, where s varies much more and where the relative accuracy
of its experimental measurement is presumably higher.
Other scenarios have been proposed and possibly co-occur with diminishing returns epista-
sis [78, 80, 81], and, accordingly, different models have been formulated in this context. For
example, the speed of evolution could decrease because beneficial mutations with larger advan-
tage fix sooner in the population [100] and because the mutation rate or the number of possible
beneficial mutations decreases with time [90, 91].
We tested that, for a fixed advantage s(k) = s0k, a decrease of the beneficial mutation rate
leads to a slowdown in adaptation. Applying a procedure similar to the one described in
Sec. 2.3, we were able to estimate the beneficial mutation rate for the E. coli experiment using a
power-law and an exponential model for the decrease of $U_b$. According to our results, during
the first $5 \cdot 10^4$ generations of the E. coli experiment, $U_b$ varies in the reasonable range
$10^{-6}$ to $10^{-5}$ for both models. Interestingly, we observed a decrease in the variability of the
population in terms of classes of mutation. The intuitive reasoning behind this phenomenon is that, as $U_b$
decreases, the edge of the distribution becomes less likely to gain new mutations, even if it
reaches the establishment size, and the population tends to collapse into a single genotype. Thus,
it is possible to distinguish between the diminishing-return and the diminishing-beneficial-mutation-rate
models using this parameter. Since experiments suggest an increase in the number of
subpopulations over time, the diminishing-return model seems to describe the experimental
dynamics of evolution better.
The two models can be easily “collapsed” and the detailed dynamics depends on the chosen
parameters.
The different explanations proposed for the slowdown of adaptation are not necessarily
mutually exclusive, and could be stratified in actual laboratory evolution experiments [67]. It is
currently unclear whether the available experimental observables allow establishing the relative
weights of these distinct phenomena, or precisely which different experimental measurements
would; we believe that simple and possibly falsifiable models could help explore these
questions.
Appendix - Part II
A1 Self-consistency scaling argument estimating the adaptation speed.
This section discusses the generalization of the self-consistency considerations used to
estimate the adaptation speed from the standard multiple-mutations model [75, 73, 76], applied to
the case of diminishing returns with power-law increase of fitness.
For a power-law model, the advantage of the edge with respect to the mean can be expressed as
$s_{\rm edge} \approx s_0 \alpha L_k k^{\alpha-1}$.
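Imposing the condition that one new mutation class is established at the edge of the
histogram gives
\[ 1 = \int_0^{\tau_k} dt \left( \frac{U_b}{2 s_0 \alpha L_k k^{\alpha-1}}\, e^{s_0 \alpha (L_k - 1) k^{\alpha-1} t} \right) \left( 2 s_0 \alpha L_{k+1} (k+1)^{\alpha-1} \right). \tag{S13} \]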
The last term in this integral is the establishment probability of a new fittest class, while
the first is the rate of beneficial mutations from the previous fitness class, which is born with
size $1/(2 s_0 \alpha L_k k^{\alpha-1})$ and grows exponentially.
An estimate of $\tau_k$ can be obtained by integrating the above expression and using the
approximation $L_k \approx L_{k+1}$. Under the assumption that k is not too large (i.e. the advantage of the fittest
class is sufficiently high), the contribution of the integration boundary t = 0 can be neglected, giving
\[ \tau_k = \frac{1}{s_0 \alpha (L_k - 1) k^{\alpha-1}} \log\left( \frac{s_0 \alpha (L_k - 1) k^{2(\alpha-1)}}{U_b (k+1)^{\alpha-1}} \right). \tag{S14} \]
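As a consistency check, the establishment condition of Eq. (S13) can be integrated numerically up to the time given by Eq. (S14). The sketch below does this with a plain trapezoidal rule; the parameter values and the histogram width passed to the functions are illustrative assumptions, not fitted values, and $L_{k+1} = L_k$ is assumed as in the text.

```python
import math


def establishment_time(k, width, s0, alpha, ub):
    """Establishment time tau_k of Eq. (S14), for a given histogram width L_k."""
    rate = s0 * alpha * (width - 1.0) * k ** (alpha - 1.0)
    argument = (s0 * alpha * (width - 1.0) * k ** (2.0 * (alpha - 1.0))
                / (ub * (k + 1) ** (alpha - 1.0)))
    return math.log(argument) / rate


def establishment_integral(tau, k, width, s0, alpha, ub, steps=20000):
    """Trapezoidal estimate of the right-hand side of Eq. (S13), with L_{k+1} = L_k."""
    seed_size = 1.0 / (2.0 * s0 * alpha * width * k ** (alpha - 1.0))
    rate = s0 * alpha * (width - 1.0) * k ** (alpha - 1.0)
    p_establish = 2.0 * s0 * alpha * width * (k + 1) ** (alpha - 1.0)

    def integrand(t):
        # mutation supply from the growing edge class times establishment probability
        return ub * seed_size * math.exp(rate * t) * p_establish

    h = tau / steps
    total = 0.5 * (integrand(0.0) + integrand(tau))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h


if __name__ == "__main__":
    # Illustrative values only (assumed for this sketch).
    params = dict(k=10, width=4.0, s0=0.1, alpha=0.2, ub=2e-6)
    tau = establishment_time(**params)
    print(tau, establishment_integral(tau, **params))
```

The integral falls slightly short of 1 because Eq. (S14) neglects the contribution of the boundary t = 0; the shortfall is the inverse of the argument of the logarithm.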
The above expression can be further simplified by assuming that both k and $L_k$ are large enough
that $k + 1 \simeq k$ and $L_k - 1 \simeq L_k$, and by neglecting the logarithmic term in $L_k$, leading to
\[ \tau_k = \frac{1}{s_0 \alpha L_k k^{\alpha-1}} \log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right). \tag{S15} \]
This indicates that for intermediate k, the estimate of the standard multiple-mutations model
is valid provided the diminishing-return advantage function $s_0 k^{\alpha}$ is substituted for the constant
advantage. For sufficiently large k, expansion of the exponential in Eq. (S13) gives $\tau_k \sim 1/U_b$,
compatible with the infinite-N result.
The time $\tau_k$ is related to the instantaneous speed of the mutation class histogram, $v_k$. The
speed $v_s$ can be obtained knowing that, during the time in which the mutation class histogram travels by
one class, the advantage histogram has to move by the relative fitness between the newly added
fittest class and the previous one, $s_0 \alpha k^{\alpha-1}$. Using the simplified Eq. (S15) (which neglects
logarithmic terms in $L_k$), one obtains the speeds
\[ v_k = \frac{s_0 \alpha L_k k^{\alpha-1}}{\log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right)} ; \qquad v_s = \frac{s_0 \alpha k^{\alpha-1}}{\tau_k}. \tag{S16} \]
The second part of the estimate involves the normalization condition. Assuming that the
largest term of the histogram dominates, we need to evaluate the time $\tau'_k$ necessary for the fittest
class to become the class with mean advantage, whose size is of order N/2. If the fittest class has
k mutations, its establishment size is $1/(2 s_0 \alpha L_k k^{\alpha-1})$. Its growth will be roughly exponential, with
a rate that decreases as it gets closer to the mean. We estimate its growth by its mean
growth rate during the time $\tau'_k$. Immediately after establishment, its relative growth rate will
be $s_0 \alpha L_k k^{\alpha-1}$, while its rate will tend to zero when it gets close to the mean. Thus, on average,
we can assume that it grows exponentially with rate $s_0 \alpha L_k k^{\alpha-1}/2$.
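This argument leads to the equation
\[ N/2 \approx \frac{1}{2 s_0 \alpha L_k k^{\alpha-1}}\, e^{\frac{s_0 \alpha L_k k^{\alpha-1}}{2} \tau'_k}, \tag{S17} \]
which implies
\[ \tau'_k = \frac{2}{s_0 \alpha L_k k^{\alpha-1}} \log\left( N s_0 \alpha L_k k^{\alpha-1} \right). \tag{S18} \]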
In order to estimate $v_s$, we need to determine how much the histogram of fitness advantage
has progressed during the time $\tau'_k$ from the establishment of the k-th mutation class. We
assume that during this time $L_k$ is roughly constant, so that after time $\tau'_k$, $k + L_k$ mutations
are established, and the advantage of the edge has reached $s_0 \alpha L_k (k + L_k)^{\alpha-1}$.
This allows estimating $v_s$ as the advantage gained divided by the time $\tau'_k$, i.e.
\[ v_s = \frac{(s_0 \alpha L_k)^2}{2}\, \frac{k^{\alpha-1} (k + L_k)^{\alpha-1}}{\log\left( N s_0 \alpha L_k k^{\alpha-1} \right)}. \tag{S19} \]
Eqs. (S16) and (S19) together allow determining $L_k$, which can subsequently be used to
obtain the speed of adaptation $v_s$, or of fixed mutations $v_k$. Assuming that $k \gg L_k$ and
neglecting the logarithmic corrections in $L_k$, as in Eq. (S19), we can obtain the following closed
expression for $L_k$,
\[ L_k = \frac{2 \log\left( N s_0 \alpha k^{\alpha-1} \right)}{\log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right)}, \tag{S20} \]
which is Eq. (2.10) of the main text.
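Finally, substituting $L_k$ into the expression for $v_s$ in Eq. (S16), with $\tau_k$ from Eq. (S15), gives an
explicit expression for the velocity,
\[ v_s = \frac{2 \left( s_0 \alpha k^{\alpha-1} \right)^2 \log\left( N s_0 \alpha k^{\alpha-1} \right)}{\left[ \log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right) \right]^2}, \tag{S21} \]
which generally agrees rather well with simulations, despite the approximations
made.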
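The closed-form estimates of this section are easy to evaluate numerically. The sketch below implements the width of Eq. (S20), the establishment time of Eq. (S15) and the adaptation speed obtained as $v_s = s_0 \alpha k^{\alpha-1}/\tau_k$ from Eq. (S16), which is how we read Eq. (S21). Function names and the parameter values in the usage lines are illustrative assumptions.

```python
import math


def edge_step(k, s0, alpha):
    """Advantage increment s_0 * alpha * k**(alpha - 1) of the k-th mutation."""
    return s0 * alpha * k ** (alpha - 1.0)


def width_closed_form(k, n, s0, alpha, ub):
    """Histogram width L_k from Eq. (S20)."""
    s = edge_step(k, s0, alpha)
    return 2.0 * math.log(n * s) / math.log(s / ub)


def establishment_time_simple(k, n, s0, alpha, ub):
    """Establishment time tau_k from Eq. (S15), with L_k taken from Eq. (S20)."""
    s = edge_step(k, s0, alpha)
    return math.log(s / ub) / (s * width_closed_form(k, n, s0, alpha, ub))


def adaptation_speed(k, n, s0, alpha, ub):
    """Speed v_s = s / tau_k (Eq. S16), i.e. 2 s^2 log(N s) / log(s / U_b)^2."""
    s = edge_step(k, s0, alpha)
    return s / establishment_time_simple(k, n, s0, alpha, ub)


if __name__ == "__main__":
    # Illustrative parameters (assumed for this sketch).
    for k in (1, 10, 100):
        print(k, width_closed_form(k, 1e9, 0.1, 0.2, 2e-6),
              adaptation_speed(k, 1e9, 0.1, 0.2, 2e-6))
```

For α < 1 the speed decreases with k, which is the slowdown of adaptation discussed in the main text.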
More generally, from the estimated times of Eqs. (S14) and (S18), taking into
account the increase in fitness as described in the main text, it is possible to derive two
expressions for the speed of adaptation $v_s$ as a function of $L_k$. Equating these two expressions generates
an implicit estimate of the width $L_k$,
\[ L_k = \frac{2 \log\left( N s_0 \alpha L_k k^{\alpha-1} \right)}{\log\left( \frac{s_0 \alpha (L_k - 1) k^{\alpha-1}}{U_b} \right)}\, \frac{L_k - 1}{L_k}. \tag{S22} \]
This expression gives a more precise estimate of $L_k$ and $v_s$, even in the case where $L_k$ is very small.
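Since Eq. (S22) is implicit in $L_k$, it has to be solved numerically. One simple option, sketched below, is a fixed-point iteration started from the closed-form value of Eq. (S20). This is our choice of solver, not necessarily the one used for the results in the text, and the parameter values in the usage lines are illustrative assumptions.

```python
import math


def width_rhs(width, k, n, s0, alpha, ub):
    """Right-hand side of the implicit Eq. (S22), evaluated at a trial width."""
    s = s0 * alpha * k ** (alpha - 1.0)
    ratio = 2.0 * math.log(n * s * width) / math.log(s * (width - 1.0) / ub)
    return ratio * (width - 1.0) / width


def width_implicit(k, n, s0, alpha, ub, tol=1e-12, max_iter=10000):
    """Solve Eq. (S22) by fixed-point iteration, starting from Eq. (S20)."""
    s = s0 * alpha * k ** (alpha - 1.0)
    width = 2.0 * math.log(n * s) / math.log(s / ub)  # closed form, Eq. (S20)
    for _ in range(max_iter):
        if width <= 1.0 + ub / s:
            # the logarithm in the denominator would be non-positive
            raise ValueError("iteration left the domain of Eq. (S22)")
        new_width = width_rhs(width, k, n, s0, alpha, ub)
        if abs(new_width - width) < tol:
            return new_width
        width = new_width
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    # Illustrative parameters (assumed for this sketch).
    print(width_implicit(10, 1e9, 0.1, 0.2, 2e-6))
```

For these values the implicit width comes out smaller than the closed-form one, as expected from the extra factor $(L_k - 1)/L_k$.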
Note that for α = 1, the power-law return model reduces to the particular case of absence
of epistasis (i.e. g′(k) = 1, hence $w_k = e^{k s_0}$) [101, 75].
A2 Simulation algorithm and effective parameters
Simulations used the algorithm of Park and Krug, as described in refs. [85, 73]. The
simulation scheme is sketched in Fig. S3. In a typical initial configuration, all clones have k = 0
mutations. At each subsequent time step, the progeny of the individuals of all fitness classes
are sampled from a multinomial distribution with parameters $\{p(k, t+1)\}_{k \in [k_{\min}, k_{\max}]}$. The
parameters p(k, t+1) take into account the frequency f(k, t) of the class at the previous step, the
relative fitness $\chi_k$ computed at time t, and the contribution of beneficial mutations arising from
the preceding class. Specifically,
\[ p(k, t+1) = (1 - U_b) f(k, t) \chi_k(t) + U_b f(k-1, t) \chi_{k-1}(t) \tag{S23} \]
(see also Eq. 2 of the main text). Together, these definitions are equivalent to a Wright-Fisher
model with separate selection and mutation steps.
The multinomial random numbers are generated by iteratively drawing binomial random
numbers with parameters $q(k, t+1) = p(k, t+1)/\sum_k p(k, t+1)$, starting from $k_{\max}$.
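One generation of the scheme described above can be sketched as follows. This is a minimal illustration and not the simulation code used for the results. It assumes the power-law fitness $w_k = e^{s_0 k^{\alpha}}$, and it replaces the iterated binomial draws with direct categorical sampling through `random.choices`, which is practical only for small N. All function names are ours.

```python
import math
import random
from collections import Counter


def class_probabilities(freqs, s0, alpha, ub):
    """Eq. (S23): freqs[k] = f(k, t) for k = 0..k_max; returns p(k, t+1) for k = 0..k_max+1."""
    fitness = [math.exp(s0 * k ** alpha) for k in range(len(freqs))]
    mean_fitness = sum(f * w for f, w in zip(freqs, fitness))
    chi = [w / mean_fitness for w in fitness]  # relative fitness of each class
    probs = []
    for k in range(len(freqs) + 1):
        stay = (1.0 - ub) * freqs[k] * chi[k] if k < len(freqs) else 0.0
        arrive = ub * freqs[k - 1] * chi[k - 1] if k >= 1 else 0.0
        probs.append(stay + arrive)
    return probs


def sample_next_generation(freqs, n, s0, alpha, ub, rng):
    """Draw the class frequencies at t+1 for a population of n individuals."""
    probs = class_probabilities(freqs, s0, alpha, ub)
    # Direct categorical sampling (assumption of this sketch, small n only).
    draws = rng.choices(range(len(probs)), weights=probs, k=n)
    counts = Counter(draws)
    return [counts.get(k, 0) / n for k in range(len(probs))]


if __name__ == "__main__":
    rng = random.Random(0)
    freqs = [1.0]  # all clones start with k = 0 mutations
    for _ in range(50):
        freqs = sample_next_generation(freqs, 10000, 0.1, 0.2, 1e-3, rng)
    print(freqs)
```

Because the relative fitnesses are normalized by the mean fitness, the probabilities returned by `class_probabilities` sum to one once the class $k_{\max} + 1$ is included.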
The model defined above is invariant under a suitable rescaling of time, provided the other model
parameters are also rescaled correctly. This feature is useful in comparisons with experiments
(see below) in order to understand the correspondence between time steps in the model and
generations in the experiment. It can also be useful for increasing the efficiency of simulations.
Suppose one wants to map a reference empirical time into the model time. This requires
the rescaling $t_{\rm model} = r\, t_{\rm emp}$.
A simple choice is r = 1, implying a one-to-one correspondence between empirical
generations and time steps in the model. Since an established clone with advantage s grows as $e^{st}$, the
advantage is proportional to the time scale, $s_{\rm model} = r\, s_{\rm emp}$ (this means that the value of the
fitness in the model depends exponentially on r). Similarly, since the beneficial mutation rate
is defined as the number of expected mutations per genome per generation, the map between
time steps and empirical generations implies $U_{b,{\rm emp}} = U_{b,{\rm model}}/r$. Finally, the correct rescaled
population size is $N_{\rm model} = N_{\rm emp}/r$. In practice N can be rather large in a typical experiment
(e.g. ≈ $10^9$), so the rescaling of population size does not much affect the dynamics provided
r is not too large. The rescaling of the parameters described above can easily be rationalized
keeping in mind that the basic time scales of the model are set by the products $sN$ and $NU_b$. In
order to verify that the invariance discussed above is valid for our model, where the advantage
s varies with the number of mutations as $s_0 g(k)$, we ran simulations choosing the model
parameters ($N$, $s_0$, $U_b$) using different maps between the time units (see Fig. S6). The results
indicate that, for any practical purpose, the invariance holds.
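The parameter map can be written down explicitly. The short sketch below applies the three relations stated above for a rescaling factor r and checks that the products $sN$ and $NU_b$ are left unchanged; the class and function names, and the numerical values, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameters:
    n: float   # population size N
    s: float   # selective advantage (s_0 in the diminishing-return model)
    ub: float  # beneficial mutation rate U_b


def to_model_units(empirical: Parameters, r: float) -> Parameters:
    """Apply s_model = r * s_emp, U_b,model = r * U_b,emp and N_model = N_emp / r."""
    if r <= 0:
        raise ValueError("the rescaling factor r must be positive")
    return Parameters(n=empirical.n / r, s=empirical.s * r, ub=empirical.ub * r)


if __name__ == "__main__":
    emp = Parameters(n=1e9, s=0.01, ub=2e-7)  # illustrative values
    model = to_model_units(emp, r=10.0)
    print(model)
    print(model.n * model.s, emp.n * emp.s)    # product sN is unchanged
    print(model.n * model.ub, emp.n * emp.ub)  # product N U_b is unchanged
```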
A3 Phenotypic variability
The population genetics model presented in the main text works under the hypothesis that
there is a one-to-one correspondence between the genotype (essentially k) and the phenotype
(represented by the fitness and the number of offspring of an individual). However, as already
mentioned, the “reproductive success” of an individual is a function of both genotype and
environment. Moreover, even when we consider asexual populations evolving under controlled
conditions, some variables will vary to some extent from one cell to another (e.g. the number
or concentration of given proteins) [102, 103].
We can introduce phenotypic fluctuations in the framework of our model by expressing the fitness as
\[ w_{\rm tot} = x_{\rm gen}\, y_{\rm phen}, \tag{S24} \]
where x and y indicate the two components (genotype and phenotype). Note that the phenotypic
component can be a function of the genotype (e.g. a certain genotype could have lower or higher
variability in the outcome compared to another one).
We define the advantage of the two fitness components as s(k) and $p_k(l)$, where l is the
index of the “phenotypic class”, and $p_k(l)$ is the contribution to the fitness due to class l given
the genotype k. In general, a total value of fitness $w_{\rm tot}$ can be obtained using more than one
combination of k and $l_k$.
Considering the phenotypic term as a stochastic fluctuation, we can assume that $l_k$ is not
inheritable ‡. If each individual of the class k chooses a class $l_k$ independently of its ancestor,
the model dynamics is essentially unaffected by the phenotype. This is easily verified in the
infinite population limit.
‡ This seems not to be completely true [103].
We define the fraction of individuals in the mutation class k as $f_k$, and $f_{l_k}$ as the fraction of
them in the phenotypic class $l_k$. The fraction of individuals with a given genotype and a given
phenotype over the whole population is $f_{k,l'_k} = f_k f_{l'_k}$.
As a consequence, the mean fitness of the population is
\[ w_t = \sum_k \sum_{l'_k} \left( f_{k,l'_k}(t)\, w_{(k,l'_k)} \right) = \sum_k \sum_{l'_k} \left( f_k(t) f_{l'_k}(t)\, x_k y_{l'_k} \right) = \sum_k \left( f_k(t) w_k(t) \right), \tag{S25} \]
where $w_k(t) = x_k(t) y_k(t)$ is the mean fitness of all the individuals with genotype k. Note
that the quantities $y_k(t)$ and $f_{l'_k}(t)$ are stochastic variables: they change in time but are not,
strictly speaking, functions of time.
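In the infinite population limit we can consider $y_k(t) = y_k$, and the equation for the evolution
of the frequency distribution can be rewritten as
\[ f_{k,l_k}(t+1) = f_{l_k} \left[ \sum_{l'_k} \left( f_{k,l'_k}(t) \frac{w_{(k,l'_k)}}{w_t} \right) (1 - U_b) + \sum_{l'_{k-1}} \left( f_{k-1,l'_{k-1}}(t) \frac{w_{(k-1,l'_{k-1})}}{w_t} \right) U_b \right] \]
\[ = f_{l_k} \left[ f_k(t) w_k \frac{1}{w_t} (1 - U_b) + f_{k-1}(t) w_{k-1} \frac{1}{w_t} U_b \right]. \]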
Essentially, we recover the standard result, where the fitness of a genotype is the average
fitness wk. Simulations confirm that this intuitive result holds also at finite population size
(data not shown).
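The factorization used in Eq. (S25) can be illustrated numerically. The sketch below compares the mean fitness computed over the joint genotype and phenotype classes with the one computed from genotype classes alone. As a simplifying assumption of this sketch, the phenotypic distribution is taken to be the same for every genotype; all numbers and names are illustrative.

```python
def mean_fitness_joint(f_gen, x_gen, f_phen, y_phen):
    """Double sum over genotype classes k and phenotypic classes l of f_k f_l x_k y_l."""
    return sum(fk * fl * xk * yl
               for fk, xk in zip(f_gen, x_gen)
               for fl, yl in zip(f_phen, y_phen))


def mean_fitness_by_genotype(f_gen, x_gen, f_phen, y_phen):
    """Single sum over k of f_k w_k, with w_k = x_k times the mean phenotypic factor."""
    y_mean = sum(fl * yl for fl, yl in zip(f_phen, y_phen))
    return sum(fk * xk * y_mean for fk, xk in zip(f_gen, x_gen))


if __name__ == "__main__":
    f_gen = [0.2, 0.5, 0.3]      # genotype frequencies f_k (illustrative)
    x_gen = [1.00, 1.10, 1.17]   # genotypic fitness components x_k (illustrative)
    f_phen = [0.25, 0.5, 0.25]   # phenotypic class frequencies (illustrative)
    y_phen = [0.9, 1.0, 1.1]     # phenotypic fitness components (illustrative)
    print(mean_fitness_joint(f_gen, x_gen, f_phen, y_phen))
    print(mean_fitness_by_genotype(f_gen, x_gen, f_phen, y_phen))
```

The two values coincide up to rounding, which is the statement that only the average fitness of each genotype enters the dynamics.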
A4 Estimate of the effective beneficial mutation rate
A recent study has shown how the constant-advantage model can be seen as an effective
description of a multiple-mutation framework with a distribution of advantages [77]. In
principle, this procedure gives an underlying beneficial mutation rate $U_b$ from the beneficial mutation
rate of the effective theory. The effective selection coefficient coincides with the advantage of
the most probable fixed mutation, and the effective mutation rate has to be rescaled by the
probability of observing that mutation under the original distribution. Therefore, the rescaling
factor (and the mutation advantage in the effective theory) depends on parameters related
to the distribution of fitness effects. In a diminishing-return scenario, this rescaling factor
becomes time dependent. However, the assumptions made on the fitness advantage distribution
and on the effects that negative epistasis has on this distribution are relevant for this rescaling
procedure. Therefore, a precise estimate of the underlying beneficial mutation rate requires
several additional assumptions about the process, which need to be carefully considered and
verified with data.
Assuming for simplicity an exponential distribution $\rho(s) = (1/\sigma)\, e^{-s/\sigma}$ of fitness effects, it is
possible to give a quantitative estimate of the underlying beneficial mutation rate. The relation
between the effective beneficial mutation rate $U_{\rm eff}$ and the underlying rate $U_{\rm real}$ is
\[ U_{\rm real} = \frac{U_{\rm eff}}{\sqrt{2\pi v_s}\, \rho(s^*)}, \tag{S26} \]
as given in Eq. 23 of Good et al. [77], where $s^*$ is the constant advantage used in the effective
theory. In the diminishing-return framework, the typical mutation advantage decreases with the
number of accumulated mutations, $s^* = \delta s(k)$, and the speed of adaptation $v_s$ slows down as
$v_s \propto (\delta s(k))^2\, \mathrm{Var}(k)$ (using the mean-field estimate described in the main text). Assuming that
the effect of epistasis on the advantage distribution is such that the mean advantage σ scales as
the effective advantage with the number of mutations (i.e., $\sigma \propto \delta s(k)$), we have
$\rho(s^*) \propto 1/s^* = 1/\delta s(k)$. In this case, the underlying beneficial mutation rate is simply
$U_{\rm real} \approx U_{\rm eff}/\sqrt{\mathrm{Var}(k)}$. In
other words, an estimated $U_{\rm eff}$ can be mapped to a time-dependent $U_{\rm real}$ with a scaling factor
inversely proportional to the width of the distribution of mutation classes ($\sqrt{\mathrm{Var}(k)} \sim L_k$).
On the experimental time scales, this procedure leads to small corrections to the mutation rate,
so that the order-of-magnitude estimates of $U_b$ can still be considered acceptable.
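The rescaling of Eq. (S26) can be sketched in code for the exponential distribution of fitness effects assumed above. The function names and the numerical values are illustrative assumptions; no value below is a fitted parameter.

```python
import math


def exponential_dfe(s, sigma):
    """Exponential distribution of fitness effects, rho(s) = exp(-s / sigma) / sigma."""
    return math.exp(-s / sigma) / sigma


def underlying_rate(u_eff, v_s, s_star, sigma):
    """Eq. (S26): U_real = U_eff / (sqrt(2 pi v_s) * rho(s*))."""
    return u_eff / (math.sqrt(2.0 * math.pi * v_s) * exponential_dfe(s_star, sigma))


if __name__ == "__main__":
    base = underlying_rate(u_eff=1e-6, v_s=1e-5, s_star=0.01, sigma=0.01)
    # Scaling s* and sigma by c and v_s by c**2 (fixed Var(k)) leaves U_real unchanged.
    scaled = underlying_rate(u_eff=1e-6, v_s=4e-5, s_star=0.02, sigma=0.02)
    print(base, scaled)
```

The second call illustrates the argument in the text: when σ and $s^*$ scale together and $v_s \propto (s^*)^2\, \mathrm{Var}(k)$, the dependence on the typical advantage cancels and only the factor $\sqrt{\mathrm{Var}(k)}$ remains.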
A5 Estimate of mean fitness as a function of time
In their work, Wiser and coworkers [98] show that mean fitness depends on time as a power
law with exponent $1/(2g)$, where g can be obtained from all the available data or from single
populations. To relate these findings to the present work, one has to express fitness as a function
of time. The k-th mutation typically appears at time $t_k$, which is the sum of the establishment
times, so that $w(k) = w(t_k)$. Thus, time can be related to k through
\[ t_k = \sum_{k'=1}^{k} \tau_{k'} \approx \sum_{k'=1}^{k} \frac{\log^2\left( s_0 \alpha k'^{(\alpha-1)} U_b^{-1} \right)}{2 s_0 \alpha k'^{(\alpha-1)} \log\left( N s_0 \alpha k'^{(\alpha-1)} \right)}, \tag{S27} \]
where we used Eqs. (S15) and (S20) to express the establishment time. This expression can be
estimated as an integral, assuming that the dependence on k inside the logarithms is weak and
that they can be considered almost constant on the experimental time scale. This assumption is
justified for the data under analysis, since varying the parameters in the estimated range for the E. coli
evolution experiment (i.e. $N = 10^7$ to $10^8$, $U_b = 10^{-6}$ to $10^{-5}$, $s_0 \approx 0.2$ and $\alpha \approx 0.25$) the ratio
between the logarithmic terms varies between 5 and 12 for k in the range 1 to 100. Using this
approximation, the integral can be solved, obtaining $t \propto k^{2-\alpha}$. Hence, fitness should scale with
time as $w \propto t^{\alpha/(2-\alpha)}$. This is in agreement with the results of Wiser and coworkers, and the
relation between the parameters giving the strength of epistasis in the two models is $g = (2-\alpha)/(2\alpha)$.
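The scaling $t \propto k^{2-\alpha}$ can be checked numerically. The sketch below treats the logarithmic factors of Eq. (S27) as constant, as assumed in the text, so that $t_k$ is proportional to the sum of $k'^{\,1-\alpha}$, and estimates the local exponent from a doubling of k. The function names are ours.

```python
import math


def cumulative_time(k_max, alpha):
    """Sum of k'**(1 - alpha) for k' = 1..k_max (Eq. S27 with constant logarithms)."""
    return sum(kp ** (1.0 - alpha) for kp in range(1, k_max + 1))


def local_exponent(k, alpha):
    """Effective exponent d log(t) / d log(k), estimated between k and 2k."""
    ratio = cumulative_time(2 * k, alpha) / cumulative_time(k, alpha)
    return math.log(ratio) / math.log(2.0)


if __name__ == "__main__":
    alpha = 0.25
    print(local_exponent(1000, alpha), 2.0 - alpha)
```

For α = 0.25 the estimated exponent is very close to 2 − α = 1.75, consistent with the integral approximation.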
Supplementary Figures.
[Figure S1: panels (a) and (b); vertical axis: frequency.]
Supplementary Figure S1: Clonal interference and selective sweep. The figure shows a schematic
representation of the emergence of new genotypes from a monoclonal population in the selective sweep regime (a) and
in the clonal interference regime (b). In the first case, the newly arising mutant can invade the population before the next one
establishes. In the clonal interference regime, many different genotypes arise at the same time, coexist and compete.
The establishment and fixation times are schematically represented for the selective sweep regime.
[Figure S2: frequency f(k,t) versus number of mutations k, with ⟨k⟩, k_max(t) and k_max(t+Δt) marked.]
Supplementary Figure S2: The dynamics is driven by beneficial mutations and the rate of establish-
ment of new classes. The dashed rectangles represent the class frequency histogram at a subsequent time.
The green arrows symbolize the growth and decrease of size of the classes depending on their relative fitness.
[Figure S3: flow diagram. Frequency f(k,t), beneficial mutation rate $U_b$ and fitness $w_k = e^{s_0 g(k)}$ give the surviving offspring probabilities p(k,t+1); f(k,t+1) is then drawn from a multinomial distribution.]
Supplementary Figure S3: Sketch of the algorithm used in the simulations (from ref. [73], see text).
The instantaneous frequencies of mutation classes define, through Eq. S23 (which incorporates selection and
mutation) the probability p(k, t+ 1) that an individual with k mutations is found at the subsequent generation.
The frequencies at the subsequent generations are then sampled from a multinomial distribution with parameters
p(k, t+ 1). This procedure gives access to large population sizes.
[Figure S4: fixation probability π(s) versus advantage s, logarithmic axes.]
Supplementary Figure S4: The fixation probability is proportional to the fitness advantage. The figure
shows the fixation probability π(s) of a single clone with advantage s that grows in a uniform background
(symbols), measured from our simulations. The fixation probability is obtained as the fraction of
realizations where the beneficial mutant fixes. We obtain π(s) = 2s (continuous red line), in accordance
with [85]. Simulations are performed using $N = 10^7$ over $10^8$ realizations.
[Figure S5: panel A, ⟨Var_k⟩_R/⟨k⟩_R; panel B, σ_R(L_k)/⟨L_k⟩_R; panel C, σ_R(V_k)/⟨V_k⟩_R; all versus generations.]
Supplementary Figure S5: The relative variance of the distribution decreases in time, while the
relative variances of $L_k$ and $v_k$ follow an increasing trend. Panel A shows the ratio between the variance
$\mathrm{Var}_k$ and the mean number of mutations ⟨k⟩ as a function of time. Simulation results confirm that, although $\mathrm{Var}_k$
increases in time, the mean number of mutations increases more rapidly. Since $D_k \le L_k$, simulation results
confirm the hypothesis $D_k/\langle k \rangle \ll 1$ used in the main text (see also Fig. 3B). However, as illustrated in panels
B and C, $v_k$ and $L_k$ can be quite different across realizations at the same time. The increase of the relative
error (calculated over different realizations) of the width $L_k$ (panel B) and of the velocity (panel C) reflects both
the increase of the variance of the distribution and the decrease of $v_k$. However, the error is sufficiently small
not to affect the results on experimentally relevant time scales. Simulations are performed using the parameters
$N = 10^7$, $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1 \cdot 10^{-3}$. Data are averaged over 100 realizations of the process.
[Figure S6: panel A, ⟨k⟩; panel B, ⟨s_0 g(k)⟩/r; both versus rescaled time steps, for r = 1, 10, 100.]
Supplementary Figure S6: The model is effectively invariant under rescaling of the parameters. The
figure shows the mean number of mutations (⟨k⟩, panel A) and the fitness advantage (panel B) obtained from
simulations run using three different rescaling factors (r = 1, 10, 100; different symbols as in legend). The simulated
dynamics is almost unaffected by the rescaling procedure. Simulations are made using the parameters $N = 10^9$,
$s_0 = 0.1$, $\alpha = 0.2$, $U_b = 2 \cdot 10^{-6}$. The data are averaged over 100 iterations, and error bars (standard error) are
smaller than symbols.
[Figure S7: top and bottom panels versus number of mutations (0 to 40); bottom panel on a logarithmic scale; legend: generations 250, 500, 2.5·10^3, 5·10^3, 5·10^4.]
Supplementary Figure S7: For empirically relevant parameters, the class population histograms at
intermediate times are nearly Gaussian. The figure refers to “geometric” advantage, $g(k) = (1 - q^k)/(1 - q)$,
plotted in the top panel, and shows (bottom panel) the mean mutation class histogram at different
generations (different symbols, see legend). The continuous lines in the bottom panel represent normalized Gaussian
distributions with the same mean and standard deviation as the simulated histograms. The dashed line indicates the
establishment size 1/π(s(k)). Simulations are averaged over 250 realizations using the parameters $N = 10^9$,
$s_0 = 0.1$, $U_b = 6 \cdot 10^{-6}$ and q = 0.8.
Bibliography
[1] B. McClintock, “The origin and behavior of mutable loci in maize,” Proc. Natl. Acad.
Sci., vol. 36, no. 6, 1950.
[2] B. McClintock, “Mutable loci in maize,” Carnegie Institution of Washington Yearbook,
vol. 50, 1951.
[3] E. S. Lander et al., “Initial sequencing and analysis of the human genome,” Nature,
vol. 409, Feb 2001.
[4] C. R. L. Huang, K. H. Burns, and J. D. Boeke, “Active transposition in genomes,” Annual
Review of Genetics, vol. 46, Dec 2012.
[5] R. Cordaux and M. A. Batzer, “The impact of retrotransposons on human genome evo-
lution,” Nature Reviews Genetics, vol. 10, Oct 2009.
[6] J. K. Pace and C. Feschotte, “The evolutionary history of human DNA transposons:
evidence for intense activity in the primate lineage,” Genome Research, vol. 17, Apr 2007.
[7] C. Feschotte and E. J. Pritham, “DNA transposons and the evolution of eukaryotic
genomes,” Annual Review of Genetics, vol. 41, 2007.
[8] H. H. Kazazian, “Mobile elements: drivers of genome evolution,” Science, vol. 303, Mar
2004.
[9] P. L. Deininger and M. A. Batzer, “Mammalian retroelements,” Genome Research, vol. 12,
Oct 2002.
[10] M. K. Konkel and M. A. Batzer, “A mobile threat to genome stability: the impact of
non-LTR retrotransposons upon the human genome,” Seminars in cancer biology, vol. 20,
no. 4, 2010.
[11] E. A. Bennett, L. E. Coleman, C. Tsui, W. S. Pittard, and S. E. Devine, “Natural genetic
variation caused by transposable elements in humans,” Genetics, vol. 168, Oct 2004.
[12] K. Ohshima, “RNA-mediated gene duplication and retroposons: retrogenes, LINEs,
SINEs, and sequence specificity,” International Journal of Evolutionary Biology, vol. 2013,
2013.
[13] T. Singer, M. J. McConnell, M. C. Marchetto, N. G. Coufal, and F. H. Gage, “LINE-
1 retrotransposons: mediators of somatic variation in neuronal genomes?,” Trends in
Neurosciences, vol. 33, Aug 2010.
[14] J. M. C. Tubio et al., “Extensive transduction of nonrepetitive DNA mediated by L1
retrotransposition in cancer genomes,” Science, vol. 345, Aug 2014.
[15] A. D. Ewing and H. H. J. Kazazian, “High-throughput sequencing reveals extensive vari-
ation in human-specific L1 content in individual human genomes,” Genome Research,
vol. 20, no. 9, 2010.
[16] J. Xing, Y. Zhang, K. Han, A. H. Salem, S. K. Sen, C. D. Huff, Q. Zhou, E. F. Kirkness,
S. Levy, M. A. Batzer, and L. B. Jorde, “Mobile elements create structural variation:
analysis of a complete human genome,” Genome Research, vol. 19, Sept 2009.
[17] A. M. Roy-Engel, “LINEs, SINEs and other retroelements: do birds of a feather flock
together?,” Front Biosci, vol. 17, Jan 2012.
[18] J. Jurka, “Sequence patterns indicate an enzymatic involvement in integration of mam-
malian retroposons,” Proc. Natl. Acad. Sci., vol. 94, Mar 1997.
[19] S. Boissinot, A. Entezam, P. J. Munson, L. Young, and A. V. Furano, “The insertional
history of an active family of L1 retrotransposons in humans,” Genome Research, vol. 14,
no. 7, 2004.
[20] I. Ovchinnikov, A. B. Troxel, and G. D. Swergold, “Genomic characterization of recent
human LINE-1 insertions: evidence supporting random insertion,” Genome Research,
vol. 11, Dec 2001.
[21] A. M. Weiner, “SINEs and LINEs: the art of biting the hand that feeds you,” Current
Opinion in Cell Biology, vol. 14, Jun 2002.
[22] M. Costantini, F. Auletta, and G. Bernardi, “The distributions of “new” and “old” Alu
sequences in the human genome: the solution of a “mystery”,” Molecular Biology and
Evolution, vol. 29, Nov 2012.
[23] K. R. Oliver and W. K. Greene, “Mobile DNA and the TE-thrust hypothesis: supporting
evidence from the primates,” Mobile DNA, vol. 2, no. 8, 2011.
[24] K. Kobayashi et al., “An ancient retrotransposal insertion causes Fukuyama-type
congenital muscular dystrophy,” Nature, vol. 394, 1998.
[25] H. Hassoun, T. L. Coetzer, J. N. Vassiliadis, K. E. Sahr, G. J. Maalouf, S. T. Saad,
and J. Palek, “A novel mobile element inserted in the alpha spectrin gene: spectrin
Dayton. A truncated alpha spectrin associated with hereditary elliptocytosis,” J. Clinical
Investigation, vol. 94, Aug 1994.
[26] H. J. Kazazian, C. Wong, H. Youssoufian, A. Scott, D. Phillips, and S. Antonarakis,
“Haemophilia A resulting from de novo insertion of L1 sequences represents a novel
mechanism for mutation in man,” Nature, vol. 332, Mar 1988.
[27] C. S. Lin, D. A. Goldthwait, and D. Samols, “Identification of Alu transposition in human
lung carcinoma cells,” Cell, vol. 54, Jun 1988.
[28] S. Solyom et al., “Extensive somatic L1 retrotransposition in colorectal tumors,” Genome
Research, vol. 22, Dec 2012.
[29] B. Chenais, “Transposable elements and human cancer: a causal relationship?,” Biochim-
ica et Biophysica Acta, vol. 1835, Jan 2013.
[30] F. E. Lock et al., “Distinct isoform of FABP7 revealed by screening for retroelement-
activated genes in diffuse large B-cell lymphoma,” Proc. Natl. Acad. Sci., vol. 111, Aug
2014.
[31] K. R. Upton et al., “Ubiquitous L1 mosaicism in hippocampal neurons,” Cell, vol. 161,
Apr 2015.
[32] T. Graham and S. Boissinot, “The genomic distribution of L1 elements: the role of inser-
tion bias and natural selection,” J. Biomed Biotechnol., vol. 1, 2006.
[33] J. A. Bailey, G. Liu, and E. E. Eichler, “An Alu transposition model for the origin and
expansion of human segmental duplications,” The American Journal of Human Genetics,
vol. 73, no. 4, 2003.
[34] M. Babcock, A. Pavlicek, E. Spiteri, C. D. Kashork, I. Ioshikhes, L. G. Shaffer, J. Jurka,
and B. E. Morrow, “Shuffling of genes within low-copy repeats on 22q11 (LCR22) by
Alu-mediated recombination events during evolution.,” Genome Research, vol. 13, Dec
2003.
[35] J. Jurka, O. Kohany, A. Pavlicek, V. V. Kapitonov, and M. V. Jurka, “Duplication, co-
clustering, and selection of human Alu retrotransposons,” Proc. Natl. Acad. Sci., vol. 101,
Feb 2004.
[36] T. H. Jukes and C. R. Cantor, Evolution of protein molecules. Academic Press, 1969.
[37] M. Kimura, “A simple method for estimating evolutionary rates of base substitutions
through comparative studies of nucleotide sequences,” Journal of Molecular Evolution,
vol. 16, no. 2, 1980.
[38] M. Nei and S. Kumar, Molecular evolution and phylogenetics. Oxford University Press,
New York, 2000.
[39] B. J. Wagstaff, E. N. Kroutter, R. S. Derbes, V. P. Belancio, and A. M. Roy-Engel,
“Molecular reconstruction of extinct LINE-1 elements and their interaction with nonau-
tonomous elements,” Mol. Biol. Evol., vol. 30, Aug 2013.
[40] H. Khan, A. Smit, and S. Boissinot, “Molecular evolution and tempo of amplification of
human LINE-1 retrotransposons since the origin of primates,” Genome Research, vol. 16,
Jan 2006.
[41] A. Sookdeo, C. M. Hepp, M. A. McClure, and S. Boissinot, “Revisiting the evolution of
mouse LINE-1 in the genomic era,” Mobile DNA, vol. 4, Jan 2013.
[42] A. Le Rouzic and G. Deceliere, “Models of the population genetics of transposable ele-
ments,” Genetics Research, vol. 85, Jun 2005.
[43] A. Le Rouzic and P. Capy, “Population genetics models of competition between transpos-
able element subfamilies,” Genetics, vol. 174, Oct 2006.
[44] A. Le Rouzic, S. B. Thibaud, and P. Capy, “Long-term evolution of transposable ele-
ments,” Proc. Natl. Acad. Sci., vol. 104, 2007.
[45] A. Le Rouzic, T. Payen, and A. Hua-Van, “Reconstructing the evolutionary history of
transposable elements,” Genome Biol. Evol., vol. 5, no. 1, 2012.
[46] G. Abrusan and H.-J. Krambeck, “Competition may determine the diversity of transpos-
able elements,” Theoretical Population Biology, vol. 70, Nov 2006.
[47] O. Piskurek, H. Nishihara, and N. Okada, “The evolution of two partner LINE/SINE
families and a full-length chromodomain-containing Ty3/Gypsy LTR element in the first
reptilian genome of Anolis carolinensis,” Gene, vol. 441, Jul 2009.
[48] Y. Terai, K. Takahashi, and N. Okada, “SINE cousins: the 3’-end tails of the two oldest
and distantly related families of SINEs are descended from the 3’-ends of LINEs with the
same genealogical origin,” Mol. Biol. Evol., vol. 15, Nov 1998.
[49] R. M. Ziff and E. D. McGrady, “The kinetics of cluster fragmentation and depolymerisa-
tion,” J. Phys. A: Math. Gen., vol. 18, 1985.
[50] Z. Cheng and S. Redner, “Kinetics of fragmentation,” J. Phys. A: Math. Gen., vol. 23,
1990.
[51] J. D. Barrow, “Coagulation with fragmentation,” J. Phys. A: Math. Gen., vol. 14, no. 3,
1981.
[52] F. Massip and P. F. Arndt, “Neutral evolution of duplicated DNA: an evolutionary stick-
breaking process causes scale-invariant behavior,” Phys. Rev. Lett., vol. 110, no. 14, 2013.
[53] D. Sellis, A. Provata, and Y. Almirantis, “Alu and LINE1 distributions in the human
chromosomes: evidence of global genomic organization expressed in the form of power
laws,” Mol. Biol. Evol., vol. 24, no. 11, 2007.
[54] A. Smit, R. Hubley, and P. Green, “RepeatMasker Open-4.0,” 2013-2015.
[55] “UCSC Genome Browser.” https://genome.ucsc.edu/.
[56] D. A. Ray, C. Feschotte, H. J. Pagan, J. D. Smith, E. J. Pritham, P. Arensburger, P. W.
Atkinson, and N. L. Craig, “Multiple waves of recent DNA transposon activity in the bat,
Myotis lucifugus,” Genome Research, vol. 18, May 2008.
[57] M. Kimura, “Evolutionary rate at the molecular level,” Nature, vol. 217, 1968.
[58] P. D. Keightley, “Rates and fitness consequences of new mutations in humans,” Genetics,
vol. 190, no. 2, 2012.
[59] A. Hodgkinson and A. Eyre-Walker, “Variation in the mutation rate across mammalian
genomes,” Nature Reviews Genetics, vol. 12, Nov 2011.
[60] S. Subramanian and S. Kumar, “Neutral substitutions occur at a faster rate in exons than
in noncoding DNA in primate genomes,” Genome Research, vol. 13, May 2003.
[61] S. Wielgoss, J. Barrick, O. Tenaillon, M. Wiser, W. Dittmar, S. Cruveiller, B. Chane-
Woon-Ming, C. Medigue, and R. E. Lenski, “Mutation rate dynamics in a bacterial
population reflect tension between adaptation and genetic load,” Proc. Natl. Acad. Sci.,
vol. 110, Jan 2013.
[62] M. W. Nachman and S. L. Crowell, “Estimate of the mutation rate per nucleotide in
humans,” Genetics, vol. 156, no. 1, 2000.
[63] G. E. Liu, C. Alkan, L. Jiang, S. Zhao, and E. E. Eichler, “Comparative analysis of Alu
repeats in primate genomes,” Genome Research, vol. 19, May 2009.
[64] L. Duret and P. F. Arndt, “The impact of recombination on nucleotide substitutions in
the human genome,” PLOS Genetics, vol. 4, no. 5, 2008.
[65] T. Hindre, C. Knibbe, G. Beslon, and D. Schneider, “New insights into bacterial adap-
tation through in vivo and in silico experimental evolution,” Nat Rev Microbiol, vol. 10,
May 2012.
[66] J. E. Barrick, D. S. Yu, S. H. Yoon, H. Jeong, T. K. Oh, D. Schneider, R. E. Lenski, and
J. F. Kim, “Genome evolution and adaptation in a long-term experiment with Escherichia
coli,” Nature, vol. 461, Oct 2009.
[67] O. Tenaillon, A. Rodríguez-Verdugo, R. L. Gaut, P. McDonald, A. F. Bennett, A. D. Long,
and B. S. Gaut, “The molecular diversity of adaptive convergence,” Science, vol. 335, Jan
2012.
[68] R. E. Lenski, M. R. Rose, S. C. Simpson, and S. C. Tadler, “Long-term experimental
evolution in Escherichia coli. I - adaptation and divergence during 2,000 generations,”
Am Nat, vol. 138, Dec 1991.
[69] P. J. Gerrish and R. E. Lenski, “The fate of competing beneficial mutations in an asexual
population,” Genetica, vol. 102-103, 1998.
[70] J. Felsenstein, “The evolutionary advantage of recombination,” Genetics, vol. 78, Oct
1974.
[71] H. A. Orr, “The distribution of fitness effects among beneficial mutations,” Genetics,
vol. 163, Apr 2003.
[72] C. O. Wilke, “The speed of adaptation in large asexual populations,” Genetics, vol. 167,
Aug 2004.
[73] S.-C. Park, D. Simon, and J. Krug, “The speed of evolution in large asexual populations,”
Journal of Statistical Physics, vol. 138, Feb 2010.
[74] L. S. Tsimring, H. Levine, and D. A. Kessler, “RNA virus evolution via a fitness-space
model,” Phys. Rev. Lett., vol. 76, Jun 1996.
[75] M. M. Desai and D. S. Fisher, “Beneficial mutation-selection balance and the effect of
linkage on positive selection,” Genetics, vol. 176, Jul 2007.
[76] E. Brunet, I. M. Rouzine, and C. O. Wilke, “The stochastic edge in adaptive evolution,”
Genetics, vol. 179, May 2008.
[77] B. H. Good, I. M. Rouzine, D. J. Balick, O. Hallatschek, and M. M. Desai, “Distribution
of fixed beneficial mutations and the rate of adaptation in asexual populations,” Proc.
Natl. Acad. Sci., vol. 109, Mar 2012.
[78] S. Kryazhimskiy, G. Tkacik, and J. B. Plotkin, “The dynamics of adaptation on correlated
fitness landscapes,” Proc. Natl. Acad. Sci., vol. 106, Nov 2009.
[79] M. M. Desai, D. S. Fisher, and A. W. Murray, “The speed of evolution and maintenance
of variation in asexual populations,” Curr. Biol., vol. 17, Mar 2007.
[80] A. I. Khan, D. M. Dinh, D. Schneider, R. E. Lenski, and T. F. Cooper, “Negative epistasis
between beneficial mutations in an evolving bacterial population,” Science, vol. 332, Jun
2011.
[81] H.-H. Chou, H.-C. Chiu, N. F. Delaney, D. Segre, and C. J. Marx, “Diminishing returns
epistasis among beneficial mutations decelerates adaptation,” Science, vol. 332, Jun 2011.
[82] G. Martin, S. F. Elena, and T. Lenormand, “Distributions of epistasis in microbes fit
predictions from a fitness landscape model,” Nat Genet, vol. 39, Apr 2007.
[83] S. Wright, “Evolution in Mendelian populations,” Genetics, vol. 16, Mar 1931.
[84] R. Fisher, The genetical theory of natural selection. Clarendon Press, Oxford, 1930.
[85] S.-C. Park and J. Krug, “Clonal interference in large populations,” Proc. Natl. Acad. Sci.,
vol. 104, Nov 2007.
[86] A. Eyre-Walker and P. D. Keightley, “The distribution of fitness effects of new mutations,”
Nat Rev Genet, vol. 8, Aug 2007.
[87] J. B. S. Haldane, “A mathematical theory of natural and artificial selection, part V:
selection and mutation,” Proc. Camb. Philos. Soc., vol. 23, Jul 1927.
[88] M. Hegreness, N. Shoresh, D. Hartl, and R. Kishony, “An equivalence principle for the
incorporation of favorable mutations in asexual populations,” Science, vol. 311, Mar 2006.
[89] L. Perfeito, L. Fernandes, C. Mota, and I. Gordo, “Adaptive mutations in bacteria: high
rate and small effects,” Science, vol. 317, Aug 2007.
[90] I. Rouzine, E. Brunet, and C. O. Wilke, “The traveling-wave approach to asexual evolu-
tion: Muller’s ratchet and speed of adaptation,” Theor. Popul. Biol., vol. 73, Feb 2008.
[91] S.-C. Park and J. Krug, “Evolution in random fitness landscapes: the infinite sites model,”
Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 4, 2008.
[92] H. A. Guess, “Limit theorems for some stochastic evolution models,” The Annals of
Probability, vol. 2, Feb 1974.
[93] N. Jezequel, M. C. Lagomarsino, F. Heslot, and P. Thomen, “Long-term diversity and
genome adaptation of Acinetobacter baylyi in a minimal-medium chemostat,” Genome
Biology and Evolution, vol. 5, no. 1, 2013.
[94] “E. coli long-term experimental evolution project.” http://myxo.css.msu.edu/index.html.
[95] D. E. Dykhuizen and D. L. Hartl, “Selection in chemostats,” Microbiol. Mol. Biol. Rev.,
vol. 47, no. 2, 1983.
[96] J. A. G. M. de Visser and R. E. Lenski, “Long-term experimental evolution in Escherichia
coli. XI - rejection of non-transitive interactions as cause of declining rate of adaptation,”
BMC Evolutionary Biology, vol. 2, Oct 2002.
[97] S. F. Elena and R. E. Lenski, “Long-term experimental evolution in Escherichia coli. VII
- mechanisms maintaining genetic variability within populations,” Evolution, vol. 51, Aug
1997.
[98] M. J. Wiser, N. Ribeck, and R. E. Lenski, “Long-term dynamics of adaptation in asexual
populations,” Science, vol. 342, Dec 2013.
[99] S. F. Elena and R. E. Lenski, “Evolution experiments with microorganisms: the dynamics
and genetic bases of adaptation,” Nat Rev Genet, vol. 4, Jun 2003.
[100] S. Schiffels, G. Szollosi, V. Mustonen, and M. Lassig, “Emergent neutrality in adaptive
asexual evolution,” Genetics, vol. 189, Dec 2011.
[101] I. M. Rouzine, J. Wakeley, and J. M. Coffin, “The solitary wave of asexual evolution,”
Proc. Natl. Acad. Sci., vol. 100, Jan 2003.
[102] K. Sato, Y. Ito, T. Yomo, and K. Kaneko, “On the relation between fluctuation and
response in biological systems,” Proc. Natl. Acad. Sci., vol. 100, no. 24, 2003.
[103] Y. Ito, H. Toyota, K. Kaneko, and T. Yomo, “How selection affects phenotypic fluctua-
tion,” Molecular Systems Biology, vol. 5, no. 64, 2009.