Transposable elements distributions and

genome evolution’s footprints.

Maria Rita Fumagalli

Tutor: Michele Caselle

Università degli Studi di Torino

Ph.D. Program in Complex Systems for Life Sciences

XXVIII Cycle - 2013/2015

Introduction

Evolution is a process common to all known lifeforms and plays a central role in shaping

biological diversity. This is a core problem of contemporary biology, where a substantial amount

of data is available, but, to be solved, it requires mixing different competences and backgrounds.

Evolution implies the concept of inheritability, the acquisition of new abilities that are

transmitted to the new generation, and, consequently, DNA modifications. In the last decades

the intuitive chain “one gene, one transcript, one protein” has been enriched by the discovery

of a complex relation between coding and non-coding information. The growing knowledge of the regulatory roles fulfilled by that part of our DNA that was once referred to as “junk” has contributed to dramatically changing the idea of “genetic information” encoded by DNA.

For a physicist, evolution is a stochastic process driven by rare events. Evaluating and modeling the costs and benefits of evolutionary moves, and how they drive the process of evolution, represents an interesting challenge.

Recently, the idea of using microorganisms to study evolution in the laboratory has made it possible to directly investigate how their genomes and phenotypic properties evolve. Replicated experiments under controlled conditions, combined with modern sequencing technologies, allow one to deduce quantitative principles of evolution and test theoretical hypotheses.

Laboratory evolution experiments show a number of large-scale changes of the genome, which

appear in combination with point mutations. Large-scale mutations can involve individual

genes, genomic segments, or even entire chromosomes. Regarding this kind of mutation, the open question is how these a priori detrimental moves may be turned into an advantage. Standard population genetics models do not explicitly address the problem of large-scale moves and usually consider point mutations only.

The first part of this thesis focuses on the role of transposable elements in genome evolution,

as a source of large-scale evolutionary moves. We introduce a model to describe their distribution, in order to obtain information on large-scale genomic rearrangements over time.

The second part of this thesis aims to develop theoretical and quantitative models for the long-term evolution of microbial populations, comparing them with data from biological experiments.

Most of this part has been published in the article:

Speed of evolution in large asexual populations with diminishing returns

Fumagalli M. R., Osella M., Thomen P., Heslot F. and Cosentino Lagomarsino M.

Journal of theoretical biology, 2015.


Contents

I Distribution of Transposable Elements and Genome Evolution
1 Transposable elements in human genome
2 Models for REs distributions in human genome
2.1 Null model
2.2 Expansion-duplication model
2.3 Source estimate
Conclusions
Appendix - Part I
A1 Data analysis and selection
A2 GC-rich and GC-poor regions
A3 Jukes-Cantor and Kimura divergence
A4 Maximum likelihood estimate
Supplementary Figures - Part I

II Stochastic Models of Evolution in Large Asexual Populations
1 Mutations and fitness in asexual populations
2 Models for microbial evolution in long-term experiments
2.1 Model definition
2.2 The diminishing return model
2.3 Parameter-matching procedure for diminishing return model
Discussion and conclusions
Appendix - Part II
A1 Self-consistency scaling argument estimating the adaptation speed
A2 Simulation algorithm and effective parameters
A3 Phenotypic variability
A4 Estimate of the effective beneficial mutation rate
A5 Estimate of mean fitness as a function of time
Supplementary Figures - Part II


Part I

Distribution of Transposable

Elements and Genome Evolution


Chapter 1

Transposable elements in human

genome

Transposable elements (TEs), also known as “jumping genes” or transposons, are sequences

of DNA able to insert and move within a host genome.

The existence of “jumping genes” was first observed by Barbara McClintock in maize in the late 1940s [1, 2]. Since then, TEs have been found in essentially all genomes, with very few exceptions. Transposable elements compose nearly half of the human genome [3, 4, 5] (Fig. 1.1), but their contribution is likely to be even larger because the identification of ancestral sequences is complicated by the mutations acquired over time.

TEs classification

Transposable elements can be divided into two classes, retrotransposons (Class I) and DNA transposons (Class II), according to their mechanism of transposition. These classes are subdivided into a wide variety of groups, families and subfamilies according to their sequences. Moreover, TEs can also be defined as autonomous (active) or nonautonomous, depending on whether they encode the proteins necessary for their own transposition.

DNA transposons (DTEs) account for a small fraction of the TEs in human and in other mammals [3, 6], while they make a relevant contribution in other species. Although they seem to have lost the ability to transpose in the human genome [3], DTEs were active during early primate evolution, until 37 million years ago (MYA) [6].

These elements are able to excise from the genome, move and paste themselves as DNA into a new genomic position.

Figure 1.1: (a) Relative contribution of different TEs to the amount of repetitive elements in selected species as percentage over the total number of insertions. (b) Contribution of different TEs to the composition of human genome. Alu and L1 families together represent one third of our genome. Data obtained as described in Appendix.

The transposition process does not necessarily increase the copy number of the DTEs, because most of them move through a non-replicative “cut-and-paste” mechanism. The increase in the number of copies of DTEs is presumably caused by indirect mechanisms that rely on host replicative processes [7].

Most of the active elements encode a transposase enzyme that is able to bind at the termini

of the DTE sequence performing a breakage reaction to remove the transposon from its site.

The same enzyme, binding at the new integration site of the target DNA, inserts the DTE

into its new position. The inserted element is flanked by small gaps which are filled in by host

enzymes, leading to small target site duplications [8].

There are other types of DNA transposons, named Helitrons and Mavericks, whose mechanism of transposition is not yet well understood but, most likely, involves the displacement and replication of a single-stranded DNA intermediate [7].

Retrotransposable elements (REs) proliferate through a “copy-and-paste” mechanism. They are transcribed into RNA intermediates and reverse transcribed into the host genome in a different location. Therefore the retrotransposition process increases the number of REs present in the genome, since it preserves the original copy while duplicating the element.

Retrotransposons can be further subdivided into two groups: long terminal repeats (LTRs) and non-LTRs.

LTRs, including human endogenous retroviruses (HERVs), have some similarity with virus sequences. These elements encode different viral proteins (reverse transcriptase, ribonuclease and integrase) that provide the enzymatic activities needed to retrotranscribe the RNA and insert the cDNA into the genome. LTR sequences lack the genes encoding the envelope proteins that permit the movement from one cell to another. Hence, unlike viruses, they can only reinsert themselves into the genome from which they originated. Reverse transcription of LTR-retrotransposon RNA is a multistep process occurring in the cytoplasm [8].

Non-LTR retrotransposons are classified according to their size into two categories: short interspersed elements (SINEs, shorter than 500 base pairs (bp)) and long interspersed elements (LINEs) [9]. The two major families∗ of LINEs and SINEs are L1 and Alu, globally consisting of ≈ $2 \cdot 10^6$ elements and accounting for nearly 30% of human DNA (Fig. 1.1). Currently, few elements from these two families are able to move in the human genome, and they can cause genomic instability through both insertion and post-insertion mechanisms [10].

In contrast to LTR retrotransposons and retroviruses, the RNA of non-LTR retrotransposons is transported to the nucleus and the reverse transcription process takes place on nuclear genomic DNA. Occasionally, nonautonomous TEs and other cellular transcripts can compete for the retrotranscription machinery, leading to the retrotransposition of these elements and processed pseudogenes [11].

L1 non-LTR retrotransposons are autonomous REs widespread in mammals. This family represents almost one fifth of human DNA, with around $10^6$ insertions.

Full-length elements are 6 kb long, but the vast majority of human L1 insertions are 5’ truncated to less than 1 kb (Fig. S2). Truncated elements are also observed among novel insertions in cancer cells [14]. The large number of incomplete sequences is likely related to premature polyadenylation of the RNA rather than to post-insertion events [5]. A few hundred full-length (and potentially functional) L1s can be detected in the human genome, and germline retrotransposition events occur at an estimated rate of one event every 200 live births per generation [15, 16].

∗Actually, the classification of these REs is much more complicated, especially for LINEs. For sake of simplicity,

in this work we refer to L1 and Alu as “families” of REs, and all the elements belonging to these families are

classified in “subfamilies”.


Figure 1.2: (a) L1 and Alu structure. The figure shows the schematic structure of L1 and Alu elements. L1s contain an RNA polymerase II binding site and two open reading frames encoding the proteins necessary for RNA retrotranscription and insertion in the host. Alus are short non-coding sequences. They are composed of two monomers, connected by an A-rich region. The left monomer contains a bipartite promoter for RNA Polymerase III. Figure adapted from ref. [12]. (b) L1 retrotransposition cycle. L1 mRNA (blue line) is translated in the cytoplasm, where it assembles with its own encoded proteins. Ribonucleoprotein complexes are re-imported into the nucleus. The endonuclease makes a single-stranded nick in the host DNA and the reverse transcriptase uses the nicked DNA to prime reverse transcription from the 3’ end of the L1 RNA. Retrotranscription can be incomplete, in which case the inserted element is truncated (e.g. in the figure only the 3’ UTR and ORF2 are successfully inserted in the host genome). Alus, SVAs or cellular mRNAs can recruit L1 proteins, using the same machinery to retrotranscribe themselves in the host genome. Figure adapted from ref. [13].

L1 elements contain in the 5’UTR an internal RNA Polymerase II (Pol II) promoter [5, 17].

Thus, lacking the Pol II promoter, 5′ truncated L1s constitute inactive elements of the family.

In the central part of their sequence, L1s contain two open reading frames (ORFs): the

first one encodes a nucleic acid binding protein and the second encodes an endonuclease and a

reverse transcriptase [8, 11]. These proteins, when the element is transcribed by RNA Pol II, are

probably responsible for the transport and retrotranscription of the RNA into a new genomic

position (Fig. 1.2b).

The L1-encoded endonuclease preferentially cleaves the sequence 5’-TTTT/AA-3’ [17, 18], and L1s are particularly abundant in AT-rich and gene-poor regions.

Although there is some evidence of a tendency of L1s to cluster [19], insertion sites can be considered randomly distributed, at least at genomic scale [15, 20] (see also Fig. S10).

Alus are primate-specific endogenous SINEs and, with more than $10^6$ copies, are the most successful transposons in humans. They are 300-bp non-coding nonautonomous sequences composed of two different GC-rich monomers. The monomers derive from the 7SL RNA gene (the nucleic acid component of the signal recognition particle) and are connected by an A-rich region.

Since Alus are nonautonomous REs, retrotransposition requires two basic components: expression of the RNA template and accessibility to the retrotransposition machinery. Transcription is guaranteed by the presence of a bipartite RNA Polymerase III promoter in the left monomer [17]. Specific Alu subfamilies share the 3’-terminal sequences with “partner” L1s [21], and this can facilitate the recruitment of L1-encoded proteins. Due to the lack of coding sequences, Alu transcripts are not targeted by the translation machinery. Thus, the Alu RNA can exist as a “free” RNA, eventually recruiting a Poly-A binding protein (PABP).

This may favor the localization of Alu ribonucleoproteins on ribosomes, increasing the probability of coming in close proximity to newly synthesized reverse transcriptase [17].

The relation between Alus and L1s is further supported by studies on another 7SL-derived SINE, the rodent B1 element [17]. Despite the fact that Alu retrotranscription involves the L1 endonuclease, Alus are enriched in GC-rich (and gene-rich) regions. This bias seems to be caused by post-insertion mechanisms [22].

Impact of TEs on the host genome: expansion, competition and

evolution

Transposable elements impact genome integrity in several ways, not only during their insertion but also participating in post-insertion rearrangements and structural variations of the genome [10, 23].

The predisposition of TEs towards involvement in genomic rearrangements is a consequence

of their ability to mobilize DNA, their abundance, and their high sequence identity [10]. Creating

homology regions between nonallelic sequences, TEs can potentially interact during both DNA

repair processes and the crossing over in meiotic recombination [8, 10].

Transposable elements are known to play a role in a number of genetic diseases such as dystrophy and hemophilia (see for example [24, 25, 26, 17]). They have also been observed to be mobilized in cancer development. This is not unexpected, since a typical characteristic of cancer cells is their genomic instability, even though the cause-effect relation is still unclear [14, 27, 28, 29, 30]. TEs mobilization, if occurring in somatic cells, can affect the phenotype at single cell level [13]. Somatic insertions have been detected in neuronal cells, while the occurrence of such events in other (healthy) tissues is still under investigation [31].

Figure 1.3: Rate of amplification for different Alu subfamilies. The figure shows the frequency distribution of Kimura divergence (corrected for the CpG content) of different Alu subfamilies from their relative consensus sequence (see Appendix); the horizontal axis is the divergence (%) and the vertical axis the frequency. Each subfamily experienced a peak of diffusion followed by inactivation. An indicative time scale (with marks at 50 and 100 MYA), obtained assuming ≈ $1.3 \cdot 10^{-9}$ mutations per nucleotide per year, is reported on top of the figure [MYA = million years ago].

It has recently been demonstrated that L1s can mediate genomic deletions [32]. The insertion

of an Alu element close to another Alu but on the opposite strand of the DNA can cause

the excision of the two elements [32]. On the other hand, Alus are believed to have a role in

duplications. This hypothesis is supported by the presence of Alus in the junctions of duplicated

regions [33, 34, 35].

Although these moves represent an important source of renewal in the genome, the unregulated activity of TEs can cause significant genetic damage. Cells have developed mechanisms to minimize TEs mobilization and limit their impact.

The first line of defense against the damaging effects of retrotransposons is to prevent their

expression at transcriptional and post-transcriptional level through methylation, PIWI proteins

and PIWI-interacting RNAs [5, 17].

In order to describe the expansion/inactivation dynamics of TEs and the competition between different subfamilies it is necessary to establish the age of single elements.

A retrotransposon, once inserted in a specific position in the host genome, accumulates mutations over time. It is possible to measure the evolutionary distance (or divergence) between the mutated RE sequence and the original sequence that was retrotransposed (consensus sequence). This allows one to evaluate the age of the insertion. The Jukes-Cantor and Kimura models make it possible to calculate the value of this divergence quite easily [36, 37, 38] (see Appendix).
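As a rough illustration of how a divergence is turned into an age, the sketch below applies the standard Jukes-Cantor and Kimura two-parameter corrections to observed mismatch fractions and then divides by a substitution rate. The function names are hypothetical and introduced only for this example; the default rate of 1.3 · 10^-9 mutations per nucleotide per year is the indicative value quoted in the caption of Fig. 1.3, and the CpG correction mentioned there is not included.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def kimura_2p(P, Q):
    """Kimura two-parameter distance from the observed fractions of
    transitions (P) and transversions (Q)."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("observed fractions are too large for the correction")
    return -0.5 * math.log(a * math.sqrt(b))

def insertion_age_years(divergence, rate_per_site_per_year=1.3e-9):
    """Indicative age of an insertion: divergence from the consensus divided
    by an assumed constant substitution rate."""
    return divergence / rate_per_site_per_year

# Example: 10% of sites differ from the consensus (hypothetical value).
d = jukes_cantor(0.10)
print(d, insertion_age_years(d))
```

When transversions are twice as frequent as transitions (Q = 2P), the two corrections coincide, which gives a quick consistency check between them.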

The distribution of the divergence between a specific REs subfamily and its consensus sequence shows that insertion events are usually localized in time (Fig. 1.3). In fact, the proliferation dynamics of Alu, L1 and other TEs families is characterized by rapid bursts of amplification followed by inactivation [9, 39, 40, 41].

The three major groups of Alu subfamilies, J, S and Y, experienced discrete waves of retrotransposition activity, appearing at different times during primate evolution. Figure 1.3 shows the divergence distribution and the peaks of diffusion for three subfamilies representative of these groups (AluJb, AluSx and AluY respectively). The relation between the increase and decrease of the rate of insertion for different pairs of Alus and L1s resembles the classical results obtained for predator-prey models [17, 21, 39].

Studying the dynamics of birth and expansion of different subfamilies allows one to approach these processes with models typical of population genetics and ecology [42, 43, 44, 45, 46]. These models deal with the fitness disadvantages caused by TEs and with copy number control, and can also include the effect of silencing.

L1s regulatory regions (3’ and 5’UTR) have been frequently modified during evolution, whereas the two open-reading frames (ORF1 and ORF2) remained relatively conserved [5, 40]. This rapid evolution of regulatory regions is possibly due to a strategy to escape host silencing mechanisms and bypass the competition for the retrotranscription machinery with non-autonomous REs [39].

The evolution of different L1 subfamilies and their coexistence suggests the presence of

competition also within L1s for the recruitment of the polymerase [40].

The relationship between autonomous and nonautonomous elements is not restricted to L1s and Alus in human: it has also been suggested in other organisms and for other LINE/SINE pairs, including for example B1/L1 in mouse [21, 47]. L2s and MIRs, similarly to Alus and L1s, share the 3’ terminal sequences and, notably, nonautonomous MIRs became extinct when L2s stopped proliferating [48].

Most of the REs that can be identified in the current human genome are generally inactive fossil sequences, with few subfamilies still currently expanding [4]. Therefore, the genomic distribution of REs reflects possible specific preferences or biases of the insertion mechanisms, but carries also information about the most relevant evolutionary forces driving the rearrangements of the host genome.

The next chapter introduces a model to describe the current distributions of genomic distances between REs of different subfamilies, focusing on members of Alu and L1 families. These distributions seem to be compatible with a process of insertion in approximately random positions, followed by host genome expansion and segmental duplications.


Chapter 2

Models for REs distributions in

human genome

2.1 Null model

The distribution of the REs insertion sites can be considered almost random on chromosomal

and genomic scale [15, 20].

The “null” hypothesis of initial random insertion of REs in the genome suggests an analogy with the stick-breaking model previously introduced to describe the fragmentation process of a polymer chain [49]. In fact, Alus and L1s can be considered as point-like breaks placed on a segment of length L equal to the genome size. Considering that Alus are ≈ 300 bp long, and most L1 insertions are less than 1 kbp long (see Fig. S2), the point-like approximation is quite accurate.

Therefore, the distribution of inter-REs distances for elements randomly inserted in the

genome should be equivalent to the size distribution of fragments after the random scission of

a polymer.

The processes of fragmentation and aggregation have been extensively studied in order to

describe different phenomena such as polymer and sequences degradation, breakup of liquid

droplets, and the crushing of rocks [49, 50, 51, 52].


A general expression for this kind of process is given by [51]:

$$
\begin{aligned}
\frac{\partial c(x,t)}{\partial t} = &-\int_0^x c(x,t)\,F(y,x-y)\,dy + 2\int_x^\infty c(y,t)\,F(x,y-x)\,dy \\
&+\int_0^x c(x-y,t)\,c(y,t)\,K(y,x-y)\,dy - 2\int_0^\infty c(x,t)\,c(y,t)\,K(y,x)\,dy,
\end{aligned}
\tag{2.1}
$$

where the functions K and F are the kernels that describe how the elements coagulate and

fragment and c(x, t) the number of “pieces” of size x at time t.

We are interested in a random fragmentation process (K = 0) in which all bonds break with

equal probability. The solution of this process can be evaluated analytically [49].

Applying the solution in ref. [49] to the insertion of REs in the genome, the parameter time

should be substituted by the average number of REs per unit chain. The expected number of

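distances equal to x after the random placement of B REs on a genome of length L is:

$$
c_0(x;B,L) =
\begin{cases}
\left[\dfrac{2B}{L} + \left(\dfrac{B}{L}\right)^2 (L-x)\right] e^{-\frac{B}{L}x} & \text{for } 0 < x < L \\[2ex]
e^{-B} & \text{for } x = L
\end{cases}
\tag{2.2}
$$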

The probability distribution p0(x;B,L) is obtained by normalizing c0(x;B,L) with the total

number of inter-REs distances B + 1.
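The normalization stated above can be checked numerically. The following minimal sketch (function names are hypothetical, and the values of B and L are arbitrary small numbers chosen so that the boundary term is visible) integrates the continuous part of Eq. (2.2) with a midpoint rule and adds the weight $e^{-B}$ carried by the point x = L; the total should be close to B + 1.

```python
import math

def c0_density(x, B, L):
    """Continuous part of Eq. (2.2), valid for 0 < x < L."""
    a = B / L
    return (2.0 * a + a * a * (L - x)) * math.exp(-a * x)

def expected_total_distances(B, L, n_steps=200000):
    """Midpoint-rule integral of c0 over (0, L) plus the weight e^{-B}
    of the unbroken segment x = L."""
    h = L / n_steps
    integral = sum(c0_density((k + 0.5) * h, B, L) for k in range(n_steps)) * h
    return integral + math.exp(-B)

def p0_density(x, B, L):
    """Probability density obtained by dividing c0 by B + 1."""
    return c0_density(x, B, L) / (B + 1)

print(expected_total_distances(3, 100.0))    # close to B + 1 = 4
print(expected_total_distances(50, 1000.0))  # close to B + 1 = 51
```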

The above formula can be justified using very intuitive arguments. Consider a polymer of length L with B “breaks” randomly placed on it: the probability of insertion of a break in a precise position is 1/L for all of them∗.

A fragment of length x can be obtained either by placing a single break at distance x from

one border of the polymer or placing two breaks at relative distance x. Moreover the remaining

B − 1 (or B − 2) breaks should fall out of the fragment.

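Hence, the probability of having a fragment of size x from one end of the polymer is $\frac{1}{L}\left(1 - \frac{x}{L}\right)^{B-1}$. Including the correct multiplicity, the expected number of fragments of size x after placing B breaks is:

$$
\begin{aligned}
\pi_B(x, \text{border}) &= B\,\frac{1}{L}\left(1-\frac{x}{L}\right)^{B-1} \\
\pi_B(x, \text{center}) &= B(B-1)(L-x)\,\frac{1}{L^2}\left(1-\frac{x}{L}\right)^{B-2}.
\end{aligned}
\tag{2.3}
$$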

∗For simplicity of notation we consider that the insertion of a break does not reduce the number of the possible

insertion sites of the others.

Figure 2.1: The genomic distribution of retrotransposable elements shows deviations from random placement. The figure shows examples of the empirical inter-REs distance distribution (symbols) of three different Alu subfamilies in the human genome (panel titles: Alu Y, Alu Jr, Alu Sx Chr 1; axes: frequency versus distance [bp]). Panels (a) and (b) refer to genomic distributions. Dashed black lines represent the parameter-free analytical expectation given by the random-placement null model (Eq. (2.2)). Panel (c) shows the inter-Alus distance distribution in GC-rich (> 41%) and GC-poor regions (symbols) in a single chromosome, with the corresponding null model predictions (dashed and dotted-dashed line). More examples supporting a similar deviation from the random placement expectation can be found in the Appendix.

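In the limit of large B (i.e. B ≈ B − 1) and small x ($x/L \ll 1$):

$$
\begin{aligned}
\pi_B(x, \text{border}) &\approx \frac{B}{L}\, e^{-x\frac{B}{L}} \\
\pi_B(x, \text{center}) &\approx \left(\frac{B}{L}\right)^2 (L-x)\, e^{-x\frac{B}{L}}
\end{aligned}
\tag{2.4}
$$

Since there are two borders of the chain, $c(x) = 2\pi_B(x, \text{border}) + \pi_B(x, \text{center})$, and we recover the result in Eq. (2.2).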

We will refer to Eq. (2.2) as stick-breaking or null model distribution for inter-REs distances.
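The null model can also be checked by direct simulation. The sketch below is only an illustration with hypothetical names and arbitrary parameter values (B = 10^4 breaks on a segment of 3 · 10^9 units): it places the breaks uniformly at random, collects the B + 1 resulting distances, and compares the fraction of distances shorter than L/B with the cumulative form obtained by integrating Eq. (2.2) and dividing by B + 1.

```python
import math
import random

def random_gaps(B, L, rng):
    """Place B point-like breaks uniformly on [0, L) and return the
    B + 1 distances between consecutive breaks and the two borders."""
    breaks = sorted(rng.uniform(0.0, L) for _ in range(B))
    edges = [0.0] + breaks + [L]
    return [right - left for left, right in zip(edges, edges[1:])]

def fraction_shorter_than(gaps, x):
    """Empirical cumulative frequency of the distances at x."""
    return sum(1 for g in gaps if g <= x) / len(gaps)

def null_cumulative(x, B, L):
    """Integral of Eq. (2.2) from 0 to x, divided by B + 1 (for x < L)."""
    a = B / L
    return 1.0 - (1.0 - a * x / (B + 1)) * math.exp(-a * x)

rng = random.Random(0)   # fixed seed, for reproducibility only
B, L = 10000, 3.0e9      # arbitrary illustrative values
gaps = random_gaps(B, L, rng)
x = L / B
print(fraction_shorter_than(gaps, x), null_cumulative(x, B, L))
```

Both numbers should be close to $1 - e^{-1}$, since for large B the null model is nearly exponential with mean distance L/B.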

Divergence from the null model

The stick-breaking distribution (2.2) depends only on the parameters B and L. In terms of

inter-REs distance distribution, these parameters are known, representing the number of REs

of a given subfamily and the genome length. Thus, Eq. (2.2) gives a parameter-free analytical

prediction of random-placement distribution.

Comparison between empirical data and the null model for several REs subfamilies clearly

shows evidence of non-random positioning. The disagreement between the distributions is particularly evident for very short and very long inter-REs distances (see Fig. 2.1). This behavior is also reproduced by single chromosome distributions and in GC-rich/GC-poor regions (see Fig. 2.1(c), Appendix).

Figure 2.2: The distance from the null model increases with subfamily age. Left and middle panels show the distance from the null model (stick-breaking distance) for different Alu subfamilies, each point labeled by subfamily name (Jb, Jo, Jr, Jr4, Sc, Sc5, Sc8, Sg, Sg4, Sg7, Sp, Sq, Sq2, Sq4, Sq10, Sx, Sx1, Sx3, Sx4, Sz, Sz6, Y, Ya5, Yb8, Yc, Ye5, Yf1, Yj4, Yk2, Yk3, Yk4, Ym1), calculated over the whole genome (a) and in GC-rich/poor regions (b) as a function of Jukes-Cantor divergence from the consensus sequence. The correlation coefficient is > 0.7 in all three cases. Panel (c) shows the distance from the null model as in panel (a) as a function of the density (B/L) of Alu subfamilies. Different colors in panels (a) and (c) identify the three major Alu groups: Y (red), S (orange) and J (green).

Defining a distance between data and the null model can give a quantitative estimate of the observed deviations. We define the distance of the distribution of a given subfamily i from the stick-breaking prediction as:

$$
D_{SB} = \int_0^{L_i} \frac{B_i}{L_i} \left| \left[ 1 - \left( 1 - \frac{x B_i}{L_i}\,\frac{1}{B_i+1} \right) e^{-x B_i/L_i} \right] - f_i(x) \right| dx . \tag{2.5}
$$

The first term in the integral is the cumulative distribution of the null model and $f_i(x)$ is the cumulative frequency of the empirical distances.

The cumulative stick-breaking distribution can be expressed using the rescaled variable $y = xB_i/L_i$. Since the quantity $x/L$ is expected to be very small compared to B†, we can neglect the linear term $y/(B_i+1)$. In this case, Eq. (2.5) can be approximated as:

$$
D_{SB} \approx \int_0^{B_i} \left| (1 - e^{-y}) - f_i(y) \right| dy. \tag{2.6}
$$

The above expression shows that the distances we calculated can be compared between different subfamilies. In fact, we are measuring the distance of different empirical distributions from almost the same function‡ ≈ $1 - e^{-y}$.

†For our datasets it is usually $x/L < 10^{-3}$.
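A possible numerical evaluation of the approximate distance of Eq. (2.6) is sketched below. It is not the procedure used for the figures, only a minimal illustration under stated assumptions: the function names are hypothetical, the integral is computed with a midpoint rule, and the upper limit is cut shortly after the largest rescaled distance, where both cumulative curves are indistinguishable from 1. The two test cases are synthetic: randomly placed breaks, and perfectly regular spacing.

```python
import bisect
import math
import random

def stick_breaking_distance(gaps, n_grid=20000):
    """Approximate D_SB of Eq. (2.6) for a list of B + 1 inter-RE distances."""
    n = len(gaps)                 # number of distances, B + 1
    B = n - 1
    L = float(sum(gaps))          # total length covered by the distances
    y = sorted(g * B / L for g in gaps)        # rescaled variable y = x B / L
    # Beyond the largest y the integrand is e^{-y}; 30 units make it negligible.
    y_top = min(float(B), y[-1] + 30.0)
    h = y_top / n_grid
    total = 0.0
    for k in range(n_grid):
        ym = (k + 0.5) * h
        empirical = bisect.bisect_right(y, ym) / n   # cumulative frequency f_i(y)
        null = 1.0 - math.exp(-ym)                   # null model, linear term neglected
        total += abs(null - empirical) * h
    return total

def random_gaps(B, L, rng):
    """Distances between B breaks placed uniformly at random on [0, L)."""
    breaks = sorted(rng.uniform(0.0, L) for _ in range(B))
    edges = [0.0] + breaks + [L]
    return [right - left for left, right in zip(edges, edges[1:])]

rng = random.Random(1)                            # fixed seed, illustrative values
d_random = stick_breaking_distance(random_gaps(2000, 1.0e8, rng))
d_regular = stick_breaking_distance([1000.0] * 2001)
print(d_random, d_regular)
```

For regularly spaced elements the empirical cumulative frequency is a step at y ≈ 1, so the distance approaches 2/e, while randomly placed elements give a value that shrinks as the number of distances grows.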

The calculated distances are well correlated with the average Jukes-Cantor divergence of different subfamilies and thus, with a certain approximation, with their age (r ≈ 0.7 and r ≈ 0.5 for Alus and L1s respectively, see Fig. 2.2a and Fig. S5). A similar result is also obtained considering the distribution of distances between Alus in GC-rich and GC-poor regions (Fig. 2.2). Moreover, the divergence from the stick-breaking expectation increases with the density of the subfamilies (Fig. 2.2c). Together these observations suggest the existence of a time-dependent mechanism, active in both GC-rich and GC-poor regions, capable of reshaping the inter-REs distance distribution in a similar way.

2.2 Expansion-duplication model

There are different processes capable of reshaping DNA architecture in eukaryotes.

Some of them, such as insertion of transposable elements or pseudogenes, have the net effect of

expanding the genome, and thus the inter-REs distances.

Duplication and deletion events can influence not only the reciprocal distance between REs

but also their number. Translocations and inversions, by contrast, “change the order” of the

REs, modifying only the two inter-REs distances adjacent to the borders of the translocated or

inverted region. Since the null model of stick-breaking does not involve the order of inter-REs

distances but only their size, we can essentially ignore the contributions of translocations and

inversions.

‡Note that the upper limit of the integral, that depends on the specific subfamily i, is never reached. In fact $y = B_i$ would imply $x = L_i$.

Under the hypothesis of randomly distributed events the probability of insertion or deletion in a certain position is uniform over the whole genome. As a consequence the probability that an expansion/deletion event modifies an inter-REs distance x is proportional to x. It is possible to write a very general equation for the evolution of the distances distribution:

$$
\begin{aligned}
c(x, t+1) = c(x,t) &+ \gamma_e \frac{x-\lambda_e}{L(t)}\, c(x-\lambda_e) - \gamma_e \frac{x}{L(t)}\, c(x) \\
&+ \gamma_d \frac{x+\lambda_d}{L(t)}\, c(x+\lambda_d) - \gamma_d \frac{x}{L(t)}\, c(x) \\
&+ \mu_+ q_+(x,t) - \mu_- q_-(x,t).
\end{aligned}
\tag{2.7}
$$

The first terms on the right account for the increase and decrease in size of preexisting distances

due to insertions and deletions of length λe and λd respectively. The last two terms, q+ and q−,

are generic source terms for gain or loss of inter-REs distances. Note that the gain (loss) of a

distance implies the increase (decrease) of the number of REs. The total number of distances

between B elements always equals B+1. As a consequence, the normalization of c(x, t) changes

over time.

It is possible to generalize Eq. (2.7) including a distribution of lengths ρ(λ).

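In the limit $x \gg \lambda_d$, $x \gg \lambda_e$, Eq. (2.7) can be approximated introducing partial derivatives

$$
\begin{aligned}
\frac{\partial c(x,t)}{\partial t} = &-\gamma_e \frac{\lambda_e}{L(t)} \frac{\partial (x\,c(x,t))}{\partial x}
+ \gamma_d \frac{\lambda_d}{L(t)} \frac{\partial (x\,c(x,t))}{\partial x} \\
&+ \mu_+ q_+(x,t) - \mu_- q_-(x,t)
\end{aligned}
\tag{2.8}
$$

and it assumes the simple form:

$$
\frac{\partial c(x,t)}{\partial t} = -\gamma \frac{\lambda}{L(t)} \frac{\partial (x\,c(x,t))}{\partial x} + \mu q(x,t). \tag{2.9}
$$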

The factor γλ = γeλe− γdλd is the effective expansion/deletion parameter and q(x, t) is the

effective source term. In the next section we will consider both these quantities as positive,

assuming that expansions and insertions dominate over the deletions. However, the solution of

Eq. (2.9) would be exactly the same considering the opposite situation.


Pure expansion model

* Continuous case

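We solve Eq. (2.9) in the case of “pure expansion”, ignoring the source term q, using the method of characteristics.

Rewriting the equation as

$$
\frac{\partial c(x,t)}{\partial t} = -\gamma \frac{\lambda}{L(t)}\, x\, \frac{\partial c(x,t)}{\partial x} - \gamma \frac{\lambda}{L(t)}\, c(x,t)
$$

and introducing the variable k we obtain:

$$
\frac{dc(k)}{dk} = -\gamma \frac{\lambda}{L(k)}\, c(k)
\quad \text{where} \quad
\frac{dt}{dk} = 1, \qquad \frac{dx}{dk} = \gamma \frac{\lambda}{L(k)}\, x
\tag{2.10}
$$

The variable k can be identified with the time and, substituting k → t, the solution is given by:

$$
\begin{cases}
\dfrac{dx}{dt} = \gamma \dfrac{\lambda}{L(t)}\, x \\[2mm]
\dfrac{dc(x(t),t)}{dt} = -\gamma \dfrac{\lambda}{L(t)}\, c(x(t),t)
\end{cases}
\;\Rightarrow\;
\begin{cases}
x(t) = x_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'} \\[2mm]
c(x(t),t) = c(x_0,0)\, e^{-\int_0^t \gamma \frac{\lambda}{L(t')} dt'}
\end{cases}
\tag{2.11}
$$

Since the term $e^{-\int_0^t \gamma \frac{\lambda}{L(t')} dt'}$ does not depend on x, the evolved distribution has the same functional form as the initial condition: the expansion process does not modify the shape of the distribution.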

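Imposing as initial condition

$$
c(x_0,0) = c_0(x_0;B_0,L_0) = \left(2\frac{B_0}{L_0} + \left(\frac{B_0}{L_0}\right)^2 (L_0 - x_0)\right) e^{-\frac{B_0}{L_0} x_0} + e^{-B_0}\,\delta(x_0 - L_0),
$$

the expanded distribution results:

$$
c(x(t),t) = \left[\left(2\frac{B_0}{L} + \left(\frac{B_0}{L}\right)^2 \left(L - x(t)\right)\right) e^{-\frac{B_0}{L} x(t)} + e^{-B_0}\,\delta\left(x(t) - L\right)\right]
\tag{2.12}
$$

where $L = L_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'}$.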

For this model, the total number of REs is a constant and the normalization condition $\int_0^{L(t)} c(x(t),t)\,dx(t) = B_0 + 1$ must hold. As a consequence, we obtain L = L(t) and the simple relation

$$
c(x(t),t) = c_0\left(x(t);\, B_0,\, L(t)\right)
\tag{2.13}
$$

between the null model and the expanded distribution.

[Figure 2.3: two log-log panels of Counts versus Distance [bp]. (a) Expansion: T=0, λ=50, λ=500, λ=5000. (b) Expansion removal: T=0, Lf=1.2Gbp, Lf=3Gbp.]

Figure 2.3: Random expansion does not affect a stick-breaking distribution. (a) The panel shows the simulated distribution of distances (orange dots) between B0 = 10^4 points randomly placed on a segment of length L0 = 10^9. The expansion process was simulated using different values of λ (different symbols, as in legend). The final genome size is Lf = 3·10^9 for all the distributions. Black dashed lines indicate the null model prediction as in Eq. 2.13. (b) The panel shows the simulated distribution of distances (orange dots) between B0 = 5·10^4 points randomly placed on a segment of length L0 = 10^9. We simulated expansion (λ = 5000) and removal of points at different rates (different symbols, as in legend). The final number of points is Bf = 10^4 for both the distributions. Black dashed lines correspond to the null model distribution with the correct number of REs and genome size. Continuous red and green lines represent the solutions for x < λ (Eq. (2.21)) and λ ≤ x ≤ 2λ (Eq. (2.22)) respectively.
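The normalization condition used above can be checked numerically. The following is a minimal sketch, not the thesis code: the function names, the small values of B0 and L0, and the midpoint-rule integration are illustrative assumptions. It integrates the continuous part of c0 and adds the weight e^{−B0} of the delta term at x = L.

```python
import numpy as np

def stick_breaking_density(x, B, L):
    """Continuous part of c0(x; B, L), as in the initial condition of Eq. (2.12)."""
    b = B / L
    return (2.0 * b + b**2 * (L - x)) * np.exp(-b * x)

def total_count(B, L, n_bins=200_000):
    """Midpoint-rule integral of c0 on [0, L] plus the weight exp(-B)
    carried by the delta term at x = L."""
    dx = L / n_bins
    x = (np.arange(n_bins) + 0.5) * dx
    return stick_breaking_density(x, B, L).sum() * dx + np.exp(-B)

# Illustrative values (assumptions, much smaller than a real genome).
B0, L0 = 50, 1.0e4
before = total_count(B0, L0)
after = total_count(B0, 3.0 * L0)  # same B0, expanded genome as in Eq. (2.13)
print(before, after)
```

Both totals should be close to B0 + 1, since replacing L0 with L(t) in c0 rescales the distances without changing their number.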

Moreover, it is possible to derive an explicit solution for the genome size as a function of time:

$$
L(t) = L_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'} \;\rightarrow\; L(t) = L_0 + \gamma\lambda t
\tag{2.14}
$$

The linear increase of L in time, obtained through analytical arguments, is not unexpected

since the insertion rate γ is kept constant.
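The step from the implicit to the explicit form of Eq. (2.14) can be spelled out by differentiating the implicit expression, assuming (as in the text) that γλ does not depend on time:

$$
\frac{dL}{dt} = \gamma\frac{\lambda}{L(t)}\, L_0\, e^{\int_0^t \gamma \frac{\lambda}{L(t')} dt'} = \gamma\frac{\lambda}{L(t)}\, L(t) = \gamma\lambda,
\qquad L(0) = L_0
\;\Rightarrow\;
L(t) = L_0 + \gamma\lambda t .
$$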

When the deletion term in Eq. (2.8) dominates over expansion, we obtain exactly the above

solution where λγ < 0. In this case both x and L decrease in time.

Note that we should impose some realistic limit to the increase (decrease) of genome size.

However, we are interested in modeling a “realistic” expansion process spanning a finite time.

We consider L(t) and L0 to be of the same order of magnitude.

For short distances (i.e. λ > x) the discrete equation (2.7) should be modified and its continuous approximation is no longer valid. In this region the inter-REs distance x is not expanding and the distribution evolves according to:

$$
c(x,t+1) = c(x,t) - \gamma \frac{x}{L(t)}\, c(x,t)
\;\Rightarrow\;
\frac{\partial c(x,t)}{\partial t} = -\gamma \frac{x}{L(t)}\, c(x,t).
\tag{2.15}
$$

Imposing L(t) as in Eq. (2.14), the evolved distribution at time t results:

$$
c(x,t) = c(x,0)\, e^{-\gamma x \int \frac{dt}{L(t)}} = c_0(x;B_0,L_0) \left(\frac{L_0}{L(t)}\right)^{\frac{x}{\lambda}}.
\tag{2.16}
$$
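As a rough numerical check of Eq. (2.16), one can iterate the discrete rule of Eq. (2.15) and compare it with the closed form. This is only a sketch under assumed parameter values; the function names are hypothetical and L(t) is taken to grow linearly as in Eq. (2.14).

```python
import numpy as np

def surviving_fraction_discrete(x, gamma, lam, L0, t):
    """Iterate c(x, i+1) = c(x, i) * (1 - gamma * x / L(i)) for i = 0..t-1,
    with L(i) = L0 + gamma * lam * i, starting from c(x, 0) = 1."""
    L = L0 + gamma * lam * np.arange(t)
    return np.prod(1.0 - gamma * x / L)

def surviving_fraction_closed(x, gamma, lam, L0, t):
    """Closed form (L0 / L(t)) ** (x / lam) from Eq. (2.16)."""
    return (L0 / (L0 + gamma * lam * t)) ** (x / lam)

# Assumed illustrative parameters: x < lam, genome doubling in size.
x, gamma, lam, L0, t = 100.0, 1.0, 500.0, 1.0e6, 2000
discrete = surviving_fraction_discrete(x, gamma, lam, L0, t)
closed = surviving_fraction_closed(x, gamma, lam, L0, t)
print(discrete, closed)
```

The two values should agree up to corrections of order x/L0, which is the regime γλ ≪ L0 assumed in the text.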

The complete solution should be given by a superposition of Eq. (2.13) and Eq. (2.16).

* Discrete case

In the case of a discrete system, it is possible to solve the expansion process for small (x < λ) and intermediate distances (x ≈ λ).

Considering an initial point xa < λ and the successive distances xb = xa + λ, xc = xb + λ ... we can write:

$$
\begin{aligned}
c(x_a,t+1) &= c(x_a,t) - \gamma \frac{x_a}{L(t)}\, c(x_a,t) \\
c(x_b,t+1) &= c(x_b,t) - \gamma \frac{x_b}{L(t)}\, c(x_b,t) + \gamma \frac{x_a}{L(t)}\, c(x_a,t) \\
c(x_c,t+1) &= c(x_c,t) - \gamma \frac{x_c}{L(t)}\, c(x_c,t) + \gamma \frac{x_b}{L(t)}\, c(x_b,t) \\
&\;\;\vdots
\end{aligned}
\tag{2.17}
$$

Hence, we can rewrite the first equation as

$$
c(x_a,t+1) = c(x_a,0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right),
\tag{2.18}
$$

while, with some additional manipulations§, the second equation results:

$$
c(x_b,t+1) = c(x_b,0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_b}{L(i)}\right) + \sum_{k=0}^{t} c(x_a,k)\, \frac{\gamma x_a}{L(k)} \prod_{j=k+1}^{t} \left(1 - \frac{\gamma x_b}{L(j)}\right).
\tag{2.19}
$$

§For simplicity of notation, we are assuming here and in the following equations that $\prod_{i=k+1}^{t} \left(1 - \frac{\gamma x}{L(i)}\right)$ for k = t is equal to 1.

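Considering γλ ≪ L0, it is possible to introduce the approximation:

$$
\prod_{i=n}^{m} \left(1 - \frac{\gamma x}{L_i}\right) \approx \left(\prod_{i=n}^{m} \left(1 - \frac{\gamma\lambda}{L(i)}\right)\right)^{\frac{\gamma x}{\gamma\lambda}} = \left(\frac{L(n-1)}{L(m)}\right)^{\frac{x}{\lambda}}.
\tag{2.20}
$$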
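The last equality in Eq. (2.20) follows from a telescoping product. With L(i) = L0 + γλ i as in Eq. (2.14), each factor is the ratio of two consecutive genome sizes:

$$
1 - \frac{\gamma\lambda}{L(i)} = \frac{L(i) - \gamma\lambda}{L(i)} = \frac{L(i-1)}{L(i)}
\;\Rightarrow\;
\prod_{i=n}^{m} \left(1 - \frac{\gamma\lambda}{L(i)}\right) = \frac{L(n-1)}{L(n)} \cdot \frac{L(n)}{L(n+1)} \cdots \frac{L(m-1)}{L(m)} = \frac{L(n-1)}{L(m)} .
$$

The first step of Eq. (2.20) uses $(1-\varepsilon)^{a} \approx 1 - a\varepsilon$ with $\varepsilon = \gamma\lambda/L(i) \ll 1$ and $a = x/\lambda$.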

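Using this approximation, Eq. (2.18) becomes:

$$
c(x_a,t+1) \approx c(x_a,0) \left(\frac{L_0}{L(t)}\right)^{\frac{x_a}{\lambda}} \left(1 - \frac{\gamma x_a}{L_0}\right)
\tag{2.21}
$$

similarly to Eq. (2.16).

Substituting this solution in Eq. (2.19) allows us to calculate the distribution at the point xb:

$$
\begin{aligned}
c(x_b,t+1) &\approx c(x_b,0) \left(\frac{L_0}{L(t)}\right)^{\frac{x_b}{\lambda}} \left(1 - \frac{\gamma x_b}{L_0}\right)
+ c(x_a,0) \left(\frac{L_0}{L(t)}\right)^{\frac{x_a}{\lambda}} \left(1 - \frac{\gamma x_a}{L_0}\right) \frac{\gamma x_a}{L(t)} \sum_{k=0}^{t} \left(1 + \frac{\gamma x_a}{L(k-1)}\right) \\
&\approx c(x_b,0) \left(\frac{L_0}{L(t)}\right)^{\frac{x_b}{\lambda}} \left(1 - \frac{\gamma x_b}{L_0}\right)
+ c(x_a,0) \left(\frac{L_0}{L(t)}\right)^{\frac{x_a}{\lambda}} \left(1 - \frac{\gamma x_a}{L_0}\right) \frac{\gamma x_a}{L(t)}\, t .
\end{aligned}
\tag{2.22}
$$

In the last step we neglected a contribution of the order ≈ (xb/λ) ln(L(t)/L0) ≈ (xb/L0) t ≪ t.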

In the point xc = xb + λ a similar expression for the evolved distribution can be derived.

Increasing the distance from the initial point xa, the correction due to c(xa, t) becomes less

relevant. In the limit of distances x >> λ it is possible to derive a solution equivalent to the

one obtained from the continuous approximation.

We verified by simulations that the expansion process does not change the shape of the distribution for initial random insertions, and that the estimates in Eq. (2.21) and Eq. (2.22) perform quite well (see Figure 2.3 and the Appendix).

Moreover, we simulated random removals of REs. The intuitive result is that this process

decreases the parameter B0, without affecting the shape of the distribution. Thus, an expansion-

removal process would lead to a distribution c(x(t), t) = c0(x(t);B(t), L(t)). An equivalent result

can be obtained with random insertion of new REs.
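The kind of simulation described above can be sketched as follows. This is a minimal illustration, not the code used for Figure 2.3: the function name, the seed, and the small parameter values are assumptions chosen so that the loop runs quickly. Each expansion event adds λ to one inter-RE distance, chosen with probability proportional to its current length.

```python
import numpy as np

def simulate_expansion(B0, L0, lam, Lf, seed=0):
    """Place B0 points uniformly on [0, L0], then insert segments of length
    lam at uniformly random genome positions until the genome reaches Lf."""
    rng = np.random.default_rng(seed)
    points = np.sort(rng.uniform(0.0, L0, size=B0))
    # B0 + 1 distances, including the two at the borders of the segment.
    gaps = np.diff(np.concatenate(([0.0], points, [L0])))
    n_events = int(round((Lf - L0) / lam))
    for _ in range(n_events):
        cum = np.cumsum(gaps)
        u = rng.uniform(0.0, cum[-1])  # uniform position on the current genome
        i = min(np.searchsorted(cum, u, side="right"), gaps.size - 1)
        gaps[i] += lam                 # the hit distance grows by lam
    return gaps

gaps = simulate_expansion(B0=1000, L0=1.0e6, lam=100.0, Lf=3.0e6)
mean_gap = gaps.mean()
cv = gaps.std() / mean_gap  # close to 1 for an exponential-like distribution
print(gaps.size, gaps.sum(), mean_gap, cv)
```

The number of distances stays at B0 + 1 while their sum grows to Lf. A coefficient of variation near 1 is consistent with the claim that the exponential shape survives the expansion; a histogram of `gaps` against c0(x; B0, Lf) would give a finer comparison.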


Expansion and insertion

The solution of the pure expansion model presented above can be extended in order to account for the presence of an external source as in Eq. (2.9).

In the continuous case, using the method of characteristics and rewriting Eq. (2.9) as a total derivative of the function xc(x, t), the solution is straightforward:

$$
\begin{cases}
\dfrac{dx}{dt} = \gamma \dfrac{\lambda}{L(t)}\, x \\[2mm]
\dfrac{1}{x(t)} \dfrac{d\left(x\,c(x,t)\right)}{dt} = \mu\, q(x)
\end{cases}
\;\Rightarrow\;
c(x(t),t) = \frac{x_0}{x(t)}\, c(x_0,0) + \frac{\mu}{x(t)} \int_0^t q(x(t'))\, x(t')\, dt' .
\tag{2.23}
$$

We assume a linear increase of the genome size L(t) = L0 + ϕt, where ϕ is a combined function of expansion and insertion contributions. As a consequence, the relation between x and L is given by:

$$
\frac{dx}{x} = \lambda \frac{\gamma}{\varphi} \frac{dL}{L}
\;\rightarrow\;
x(t) = x_0 \left(\frac{L(t)}{L_0}\right)^{\frac{\gamma\lambda}{\varphi}} .
\tag{2.24}
$$

Substituting c0(x0; B0, L0) as initial condition, the first term on the right in Eq. (2.23) is still a stick-breaking function:

$$
\frac{x_0}{x(t)}\, c_0(x_0;B_0,L_0) = \left(2\frac{B_0}{L_E(t)} + \left(\frac{B_0}{L_E(t)}\right)^2 \left(L_E(t) - x(t)\right)\right) e^{-B_0 x(t)/L_E(t)} .
\tag{2.25}
$$

The expanded genome is now $L_E(t) = L(t) \left(\frac{L(t)}{L_0}\right)^{\gamma\lambda/\varphi - 1}$, since the whole genome L(t) contains also the contributions of the (expanded) source.

In the limit ϕ→ γλ we recover the result of the former section.
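The expression for LE(t) can be made explicit with one intermediate step. Matching Eq. (2.25) to the stick-breaking form requires x0/L0 = x(t)/LE(t); inserting Eq. (2.24) gives

$$
L_E(t) = L_0\, \frac{x(t)}{x_0} = L_0 \left(\frac{L(t)}{L_0}\right)^{\frac{\gamma\lambda}{\varphi}} = L(t) \left(\frac{L(t)}{L_0}\right)^{\frac{\gamma\lambda}{\varphi} - 1},
$$

so that for ϕ = γλ the exponent vanishes and LE(t) = L(t), as in Eq. (2.13).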

The discrete solution of a generic expansion-insertion model is quite complex.

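For x < λ Eq. (2.18) becomes

$$
c(x_a,t+1) = c(x_a,0) \prod_{i=0}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right) + q(x_a) \sum_{k=0}^{t} \prod_{i=k+1}^{t} \left(1 - \frac{\gamma x_a}{L(i)}\right)
\tag{2.26}
$$

and successive manipulations lead to:

$$
c(x_a,t+1) \approx c(x_a,0) \left(\frac{L_0}{L(t)}\right)^{\frac{\gamma x_a}{\varphi}} \left(1 - \frac{\gamma x_a}{L_0}\right) + q(x_a)\,(t+1) \left(1 - \frac{\gamma x_a t}{2 L_t}\right) .
\tag{2.27}
$$

For a generic coordinate xi the additional contribution of the source depends on q at different time points and at different positions xi − λ:

$$
\begin{aligned}
& q(x_i) \sum_{k=0}^{t} \prod_{j=k+1}^{t} \left(1 - \frac{\gamma x_i}{L_j}\right) \\
&+ q(x_i - \lambda) \Bigg\{ \sum_{k=0}^{t-1} \prod_{j=k+1}^{t-1} \left(1 - \frac{\gamma (x_i - \lambda)}{L_j}\right) \gamma \frac{x_i - \lambda}{L_t} \\
&\qquad + \sum_{k=0}^{t-2} \prod_{j=k+1}^{t-2} \left(1 - \frac{\gamma (x_i - \lambda)}{L_j}\right) \gamma \frac{x_i - \lambda}{L_{t-1}} \left(1 - \gamma \frac{x_i}{L_t}\right) + \dots \Bigg\} + \dots
\end{aligned}
\tag{2.28}
$$

Unfortunately the exact solution is not easy to compute, but it is expected to be roughly proportional to $\frac{1}{x_i} \sum_{m=0}^{t} q(x_i - m\lambda) \prod_{n=0}^{m} \gamma (x_i - n\lambda)\, t$ ¶.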

In order to fully solve the expansion-insertion model we have to choose an explicit form for the source function q(x).

Segmental duplication is a relevant mechanism for genome evolution. The duplication of a DNA sequence can imply the passive duplication of any retrotransposons contained in the sequence. The source term q(x, t) can be seen as the effective source of inter-REs distances resulting from these events.

Considering duplications of length κ ≪ L and assuming a Poissonian distribution for the number n of duplicated retrotransposable elements (i.e. $p(n,D) = \frac{D^n}{n!} e^{-D}$) the source results:

$$
q(x) = \sum_{n=1}^{\infty} \frac{D^n}{n!}\, e^{-D}\, \pi_n(x),
\tag{2.29}
$$

where πn(x) is the expected number of duplicated inter-REs distances of length x.

The case n = 0 corresponds to expansion events and can be included in the factor γλ.

At the beginning of the duplication process D = κB0/L0. Assuming that the genome size increases as L(t) = L0 + ϕt = L0 + (γλ + µκ)t and B(t) = B(t − 1) + µD(t − 1), the density as a function of time is:

$$
\frac{B(t)}{L(t)} = \frac{B_0}{L_0} \left(1 - \frac{\gamma\lambda}{L_0}\right) \left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\varphi} .
\tag{2.30}
$$
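Eq. (2.30) can be compared with a direct iteration of the recursion for B(t). The sketch below is only illustrative: the function names and all numerical values are assumptions, and D(t) = κB(t)/L(t) is taken at every step, which extends the initial relation D = κB0/L0 quoted in the text.

```python
import numpy as np

def density_iterated(B0, L0, gamma_lam, mu, kappa, steps):
    """Iterate B(t) = B(t-1) + mu * D(t-1) with D = kappa * B / L,
    i.e. B(t) = B(t-1) * (1 + mu * kappa / L(t-1)), and L(t) = L0 + phi * t."""
    phi = gamma_lam + mu * kappa
    L = L0 + phi * np.arange(steps)  # L(0), ..., L(steps - 1)
    B = B0 * np.prod(1.0 + mu * kappa / L)
    return B / (L0 + phi * steps)

def density_closed(B0, L0, gamma_lam, mu, kappa, steps):
    """Right-hand side of Eq. (2.30)."""
    phi = gamma_lam + mu * kappa
    Lt = L0 + phi * steps
    return (B0 / L0) * (1.0 - gamma_lam / L0) * (L0 / Lt) ** (gamma_lam / phi)

# Assumed illustrative parameters.
params = dict(B0=1.0e5, L0=1.0e9, gamma_lam=500.0, mu=0.1, kappa=1.0e4, steps=200_000)
iterated = density_iterated(**params)
closed = density_closed(**params)
print(iterated, closed)
```

With γλ ≪ L0 the prefactor (1 − γλ/L0) is indistinguishable from 1 at this precision, so the check mainly probes the power-law dependence on L0/L(t).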

¶We implicitly consider the condition xi > nλ.


If the contribution of expansion is negligible (γλ ≈ 0) the duplication process implies a

conservation of the mean density.

D is a mean expected value, while the actual number of duplicated insertions is random. Therefore, the above density has to be considered as a mean value.

A local duplication model allows one to calculate πn(x) and the source term explicitly. In fact, during local duplication all the internal distances are copied and a new inter-REs distance is created. The length of the new distance is equal to the sum of those at the borders of the sequence κ ‖. This is not precise for non-local duplication where, in general, two new random distances are created. However, we expect that the results obtained for the local duplication process are valid for non-local duplication when D ≫ 1 (i.e. when the effect of the newly created distances is negligible).

Following the same approach adopted in the previous section, we assume that the probability of finding a retrotransposon in a precise position is 1/κ. Therefore, the number of duplicated inter-REs distances equal to x is:

$$
\pi_n(x, \text{internal}) = \frac{n(n-1)}{\kappa^2}\, (\kappa - x) \left(1 - \frac{x}{\kappa}\right)^{n-2} .
\tag{2.31}
$$

The probability that the sum of the lengths at the borders of the duplicated sequence equals x is:

$$
\pi_n(x, \text{borders}) = \frac{n(n-1)}{\kappa^2}\, x \left(1 - \frac{x}{\kappa}\right)^{n-2} .
\tag{2.32}
$$

In the particular case n = 1, π1(x, internal) = 0 and π1(x, borders) = δ(κ − x).

The integral $\int_0^{\kappa} dx \left(\pi_n(x, \text{internal}) + \pi_n(x, \text{borders})\right)$ is equal to the total number n of inter-REs distances added by the process∗∗.

‖This is very intuitive observing Fig. S12.

∗∗This is the total number of fragments given n cuts on a circular polymer.
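The statement that Eqs. (2.31) and (2.32) together integrate to n can be checked numerically for n ≥ 2 (the case n = 1 is a delta function and is handled separately in the text). The function names, the value of κ and the midpoint-rule integration below are illustrative assumptions.

```python
import numpy as np

def pi_internal(x, n, kappa):
    """Eq. (2.31): duplicated internal distances of length x."""
    return n * (n - 1) / kappa**2 * (kappa - x) * (1.0 - x / kappa) ** (n - 2)

def pi_borders(x, n, kappa):
    """Eq. (2.32): new distance formed by the two border segments."""
    return n * (n - 1) / kappa**2 * x * (1.0 - x / kappa) ** (n - 2)

def total_new_distances(n, kappa, n_bins=100_000):
    """Midpoint-rule integral of pi_internal + pi_borders on [0, kappa]."""
    dx = kappa / n_bins
    x = (np.arange(n_bins) + 0.5) * dx
    return (pi_internal(x, n, kappa) + pi_borders(x, n, kappa)).sum() * dx

kappa = 1.0e4  # assumed duplication length in bp
totals = {n: total_new_distances(n, kappa) for n in (2, 3, 5, 10)}
print(totals)
```

Each entry of `totals` should be close to its key n, matching the picture of n cuts on a circular polymer.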


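The effective source function results:

$$
\begin{aligned}
q(x) &= \sum_{n=1}^{\infty} \frac{D^n}{n!}\, e^{-D} \left[\pi_n(x, \text{borders}) + \pi_n(x, \text{internal})\right] \\
&= \sum_{n=1}^{\infty} \frac{D^n}{n!}\, e^{-D} \left[\frac{n(n-1)}{\kappa^2}\, \kappa \left(1 - \frac{x}{\kappa}\right)^{n-2} + \delta(n-1)\, \delta(\kappa - x)\right] \\
&= \frac{D^2}{\kappa}\, e^{-Dx/\kappa} + D e^{-D}\, \delta(\kappa - x)
\end{aligned}
\tag{2.33}
$$

with x ≤ κ.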
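The last equality of Eq. (2.33) follows from resumming the exponential series. For the terms with n ≥ 2, shifting the index to m = n − 2 gives

$$
\sum_{n=2}^{\infty} \frac{D^n}{n!}\, e^{-D}\, \frac{n(n-1)}{\kappa} \left(1 - \frac{x}{\kappa}\right)^{n-2}
= \frac{D^2}{\kappa}\, e^{-D} \sum_{m=0}^{\infty} \frac{\left[D\left(1 - x/\kappa\right)\right]^m}{m!}
= \frac{D^2}{\kappa}\, e^{-D}\, e^{D(1 - x/\kappa)}
= \frac{D^2}{\kappa}\, e^{-Dx/\kappa},
$$

while the n = 1 term contributes $D e^{-D}\, \delta(\kappa - x)$.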

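Given a fixed ratio B(t)/L(t) = b, the source of duplication has essentially the functional form of a stick-breaking distribution:

$$
q(x) =
\begin{cases}
\left(\dfrac{B(t)}{L(t)}\right)^2 \kappa\, e^{-\frac{B(t)}{L(t)} x} & x < \kappa \\[3mm]
\dfrac{B(t)}{L(t)}\, \kappa\, e^{-\frac{B(t)}{L(t)} \kappa} & x = \kappa
\end{cases}
\tag{2.34}
$$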

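The substitution of Eq. (2.34) in Eq. (2.23) would lead to a source term ∝ e^{−D} and a second term proportional to

$$
\left(\frac{1}{x(t)}\right)^{1 + \varphi/\gamma\lambda} \left\{ \Gamma\left[1 + \frac{\varphi}{\gamma\lambda},\; \frac{D x(t)}{\kappa} \left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\varphi}\right] - \Gamma\left[1 + \frac{\varphi}{\gamma\lambda},\; \frac{D x(t)}{\kappa}\right] \right\} .
\tag{2.35}
$$

A series expansion of the incomplete gamma functions Γ[a, x], assuming $\left(\frac{L_0}{L(t)}\right)^{\gamma\lambda/\varphi} \approx 1$, gives an exponential dependence of the source term on x.

Including the effect of expansion on the density as in Eq. (2.30), the expanded source results:

$$
\frac{\mu}{x(t)} \int dt'\, q(x')\, x' = \frac{\mu}{\varphi - \gamma\lambda} \left(\frac{B(t)}{L(t)}\right)^2 x(t)\, \kappa\, e^{-\frac{B(t)}{L(t)} x(t)} \left(\frac{L(t)}{x(t)} - \frac{L_0}{x_0}\right) + \Theta\, \frac{\mu B(t)}{\gamma\lambda} \left(\frac{\kappa}{x(t)}\right)^{\varphi/\gamma\lambda} e^{-\frac{B(t)}{L(t)} x(t)},
\tag{2.36}
$$

where the symbol Θ indicates an appropriate Heaviside function.

In conclusion, the effect of random local duplication can be evaluated as an additional exponential contribution to the stick-breaking.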

Supposing that duplications are facilitated by the presence of repetitive sequences, we can expect that “dense” regions tend to be duplicated more.

It is possible to include the “actual” distribution as a source of duplicated distances by considering the function q(x, t) = f(x)c(x, t). Assuming that short distances are more likely to be duplicated, f(x) is a decreasing function. In this case, equation (2.23) can be integrated:

$$
\frac{dc(x(t),t)}{dt} = \left[-\gamma \frac{\lambda}{L(t)} + \mu f(x)\right] c(x(t),t)
\;\rightarrow\;
c(x(t),t) = c(x_0,0)\, e^{\int_0^t dt' \left(-\gamma \frac{\lambda}{L(t')} + \mu f(x(t'))\right)} = \frac{x_0}{x(t)}\, c(x_0,0)\, e^{\mu \int_0^t dt' f(x(t'))} .
$$

In the limit of large inter-REs distances x we would expect the duplication probability f(x) to vanish and $e^{\mu \int_0^t dt' f(x(t'))} \approx 1$, recovering the result obtained for the pure expansion model.

2.3 Source estimate

The solutions in the previous section, besides the arbitrary choice of the functional form of q(x), contain the unknown parameters κ and the ratio between expansion and duplication. While a distribution for duplication lengths can be roughly inferred (e.g. using data from Ref. [33]), estimating the parameters ϕ and γλ is more complex.

In order to estimate the fraction of duplicated REs and a functional form for the phenomenological source, we introduce a simplified model. According to the results in Sec. 2.2, the expansion process does not change the shape of the initial stick-breaking distribution.

As a consequence, relying on the assumption that the duplication process generates short inter-REs distances, the right tail of the total distribution should be well fitted by the stick-breaking function with the correct normalization.

Thus, we aim to fit the data with a function:

c(x,B,L) = θ c0(x;B0, LE) + (1− θ)q(x) (2.37)

where LE is the expanded genome size as in Eq. (2.25) and B0 indicates the initial number

of insertions (i.e. the original number of retrotransposition events).

[Figure 2.4: schematic plot of Frequency versus Distance. Labels: xmin, pure stick-breaking, stick-breaking + duplication.]

Figure 2.4: Schematic representation of source estimate procedure. The inter-REs distance distribution (orange dots; data do not refer to any particular subfamily) for values larger than a minimum value xmin (black dashed line) is well approximated by a stick-breaking distribution (blue line). The same distribution at shorter distances is the sum of the stick-breaking and the source. The parameters of the stick-breaking distribution and the threshold xmin can be optimized using a maximum likelihood approach.

[Figure 2.5: two log-log panels of Frequency versus Distance [bp], (a) Alu Y and (b) Alu Jr. Insets labeled “source”, with “exponential” and “double exp.” fits.]

Figure 2.5: The inter-REs distance distribution is well described by a stick-breaking model and a short-range source of distances. The figure shows two examples of inter-REs distance distribution (orange dots) and the best fit (continuous red line) obtained as the sum of a stick-breaking distribution and an external source (see main text). The insets show the normalized source distributions (orange dots) fitted using an exponential or double exponential function (continuous red line).

In other words, we posit that the tail of the observed distance distribution of B retrotransposons can be well explained by a stick-breaking solution with a smaller initial number of breaks B0. The short-distances region is a superposition of this stick-breaking process with an extra contribution captured by the rapidly decaying term q(x).

This approximation allows us to estimate the functional form of q(x) as the difference between the empirical data and the stick-breaking distribution.

Using a maximum likelihood approach we identified, for each subfamily, the threshold value xmin that gives the best possible fit of the tail of the distribution, obtaining the initial number of breaks B0 (see Appendix).

According to our results, the expanded source term is well described by an exponentially decreasing function for young subfamilies, while it is better approximated by a double exponential for the older ones. Considering that a high-density subfamily has a small typical distance between elements, this small distance could be preferentially added to the distribution through the duplication process. Such a mechanism could explain the apparently more than exponential trend of the data. Using an exponential function to fit the source we found that the mean value of q(x) is correlated with the density B/L of the corresponding subfamily. These observations are in agreement with the dependence we derived for the local duplication model.

[Figure 2.6: scatter plot of Source versus Avg. Distance, with points labeled by Alu subfamily: Jb, Jo, Jr4, Jr, Sc5, Sc8, Sc, Sg4, Sg7, Sg, Sp, Sq2, Sq, Sx1, Sx3, Sx4, Sx, Sz6, Sz, Ya5, Yc, Y, Yk2, Yk3, Ym1.]

Figure 2.6: The average source length correlates with inter-REs distances. The figure shows the estimated average source length as a function of the average distance L/B between REs for different Alu subfamilies. Different colors identify the three major Alu groups: Y (red), S (orange) and J (green).


Conclusions

Recently, the role of transposable elements in genome evolution has been increasingly recognized. These elements, widespread in different species, can be likened to parasites that evolve and compete with their host genome and with other transposable elements present in it.

In this thesis we presented a null model for the distribution of transposable elements based

on the hypothesis of random insertion in the human genome, focusing in particular on Alus and

L1s. The large number of possible insertion sites and data on recent retrotransposition events

suggest that, at a large scale, this is a reasonable assumption [20, 22, 35]. Obviously, the idea

of “large scale” is not easily defined, and we refer here in general to a genomic scale.

However, the stick-breaking null model does not reproduce correctly the inter-Alus and inter-L1s distance distributions. Assuming the null hypothesis, all the observed deviations are due to post-insertion events.

Different mechanisms are believed to be responsible for this discrepancy [22, 35, 53]. Our

intention was to investigate if selection must necessarily enter as a dominant force, or if a

neutral evolution model could be sufficient to explain the empirical data.

Selective elimination of Alus from GC-poor regions after their insertion is believed to be responsible for the bias of the Alus distribution towards GC-rich regions [22]. This mechanism, of course, is in contrast with the neutral evolution hypothesis.

However, it is easy to include this effect in a random placement model on two different genomes that mimic GC-rich and GC-poor regions with two rates of insertion. Hence, the selective elimination can be modeled as a non-retrotransposition event.

The insertion of new genetic material has been suggested to be at the basis of a progressive transition toward a power-law distribution of inter-REs distances [53]. We verified, both analytically and using simulations, that the expansion process due to insertion of new transposable elements and pseudogenes does not significantly affect the shape of the random placement distribution.

Including an external source of inter-REs distances in our model we were able to explain the

empirical data. The “effective” phenomenological source (that we suppose to be a source of

duplicated sequences) can be evaluated under the simple hypothesis that it generates short-range

distances.

The estimated amount of duplicated REs is quite large, especially for the older subfamilies (≈ 50%). Applying the estimate procedure separately on GC-rich and GC-poor regions these values improve, with the fraction of duplicated Alus in GC-rich regions decreasing to ≈ 30%.

The reliability of these estimates is not easy to evaluate. A first source of error is the contribution of non-expanding short inter-REs distances (see Eq. (2.16) or its discrete counterpart). This term depends on the characteristic insertion length and on the initial genome size, which are unknown. Moreover, large expansion events (e.g. large duplications of tens of kilobases) can affect the tail of the distribution. In fact, such events suddenly increase the distance between two REs and, as a consequence, the probability of insertion of new genetic material in the same region. Hence, large expansion events could create an effect similar to the one described by Sellis and coworkers [53]. According to simulated data, this could create a slight deviation from the null model predictions (see figure S11).

Concluding, we suggest that the observed distribution of recent retrotransposable elements

can be explained with a neutral evolution model of random insertion in the genome followed by

an expansion-duplication process. In the future, we plan to apply the same model to functional

sequences. The goal of this new research is to use the model presented here as a null model to

investigate the role of selective pressure on genomic functional elements, such as transcription

factor binding sites and enhancers.


Appendix - Part I

A1 Data analysis and selection

Data on the number of insertions and position of transposable elements (TEs) in human

genome (hg38) were downloaded from RepeatMasker [54] official website http://www.repeatmasker.org

(RepeatMasker open-4.0.5 - Repeat Library 20140131).

The human genome sequence (assembly hg38) was downloaded from UCSC database [55]

(last accessed October 2014). We considered only data referred to reference chromosomes 1-22,

X, Y.

Data analysis and plots were performed using Python, R, Gnuplot and Mathematica.

Estimate of the distance from the null model and the maximum likelihood estimate of the

source were performed on Alu and L1 subfamilies with more than 1000 elements, in order to

have sufficiently high statistics. We included in the analysis a total of 32 Alu and 107 L1

subfamilies∗.

About 3% of the elements identified are labeled by RepeatMasker as “broken” into pieces during evolution but originated by a single retrotransposition event, due to the insertion of other TEs inside the element or to genomic rearrangements.

∗We use, here and throughout the thesis, the term “family” referred to Alu and L1, while the term “subfamily” refers, in general, to any subgroup included in these families (e.g. L1M1, AluJb...)

These fragmentation events cannot be taken into account by our model, which considers TEs as “point-like” insertions. We collapsed the data according to the identification number (ID) assigned by RepeatMasker (see Fig. S1). This process led to a very small number of ambiguities. In a few cases it was not possible to assign a precise subfamily to the collapsed element.

We verified that neglecting or including these REs in our model does not significantly affect the inter-REs distance distribution and has no practical relevance for our model.

The inter-REs distance was calculated as the difference between the start coordinate in

the genome of an element and the stop position of the previous one.
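The distance definition above can be sketched in a few lines. This is an illustration only: the function name, the tuple layout `(chrom, start, end)` and the toy coordinates are assumptions, not the RepeatMasker file format, and the filtering of centromeric regions described next is not included.

```python
from collections import defaultdict

def inter_re_distances(elements):
    """Return, for each chromosome, the distances between consecutive elements:
    start of an element minus the stop position of the previous one.
    `elements` is an iterable of (chrom, start, end) tuples (assumed layout)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in elements:
        by_chrom[chrom].append((start, end))
    distances = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        distances[chrom] = [
            next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:])
        ]
    return distances

# Toy input (hypothetical coordinates).
example = [
    ("chr1", 1000, 1300),
    ("chr1", 100, 400),
    ("chr1", 5000, 5300),
    ("chr2", 50, 350),
]
result = inter_re_distances(example)
print(result)
```

Distances are computed within each chromosome only, so the segments between the first or last element and the chromosome ends never enter the result.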

The distance between the first (last) element and the start (end) of each chromosome is neglected. Moreover, we neglect those distances that fall in centromeric and pericentromeric regions, which generate unusually long distances in our dataset (of the order of a few Mbp).

We include in centromeric and pericentromeric regions the cytobands labeled as acen and gvar, according to the UCSC repository [55].

As a whole, during this “selection” process a total of about 50 inter-REs distances were discarded for each subfamily. The resulting genome size is reduced to ≈ 2.8 Gb, depending on the subfamily.

A2 GC-rich and GC-poor regions

We define the “GC content” of a sequence as the ratio between the number of Cs and Gs present in the sequence and the total number of bases.

We calculated the GC content in non-overlapping windows of 10^6 bp, neglecting those with more than 20% of masked bases. A window is considered GC-rich when its GC content is > 41%. Then, contiguous windows of the same class (GC-rich or GC-poor) are collapsed into single regions.

We obtained a total of 521 different regions, 263 defined as GC-rich and 258 defined as GC-poor. This procedure guarantees a sufficient “minimal” length of the regions in order to have a significant number of REs inside them.
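The windowing procedure can be sketched as follows. The function names and the toy sequence are assumptions; the thresholds (20% masked bases, 41% GC) are those quoted above, and the GC fraction is computed over non-masked bases, as stated in the caption of Supplementary Figure S3. Masked bases are assumed to be written as N.

```python
def gc_windows(seq, window, max_masked=0.20, threshold=0.41):
    """Label non-overlapping windows as 'rich' or 'poor'. Windows with more
    than max_masked of N bases get None; a trailing partial window is dropped."""
    labels = []
    for i in range(0, len(seq) - window + 1, window):
        w = seq[i:i + window].upper()
        masked = w.count("N")
        if masked > max_masked * window:
            labels.append(None)
            continue
        gc = (w.count("G") + w.count("C")) / (window - masked)
        labels.append("rich" if gc > threshold else "poor")
    return labels

def collapse(labels, window):
    """Merge adjacent windows with the same label into (start, end, label)."""
    regions = []
    for k, label in enumerate(labels):
        if label is None:
            continue
        start, end = k * window, (k + 1) * window
        if regions and regions[-1][2] == label and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end, label)
        else:
            regions.append((start, end, label))
    return regions

# Toy sequence with 10 bp windows (the thesis uses 10^6 bp windows).
seq = "GCGCGCGCGC" + "GCGCATATAT" + "ATATATATAT" + "NNNNNATATA" + "ATATATATAT"
labels = gc_windows(seq, window=10)
regions = collapse(labels, window=10)
print(labels, regions)
```

In this sketch a discarded window also breaks a region, so windows on either side of it are not merged; whether the original analysis did the same is not stated in the text.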

The procedure was repeated on the masked sequence of the human genome where the repeats

are masked by capital Ns [55]. In this case we neglected windows with more than 50% of masked

bases. The results obtained with the two methods are comparable.


A3 Jukes-Cantor and Kimura divergence

There exist different substitution models to measure the evolutionary divergence between

two DNA sequences.

Insertion time of a TE can be roughly determined as the ratio between the divergence d of

the TE from its consensus sequence, according to a given model, and the total substitution rate

r [56]. Substitution rate indicates the probability of mutation from a generic base (A, C, G, T)

to another one.

In general, substitution rate depends on the organism, the effective population size, the spe-

cific locus of the genome (e.g. coding regions of essential genes will tend to be more conserved),

it differs for synonymous and not-synonymous mutation and varies in time [57, 58, 59, 60, 61].

The overall substitution rate in human is estimated to be of the order of 2 · 10^−8 mutations per nucleotide per generation [58, 62]. Hence, assuming a generation time of ≈ 15 years the mutation rate is ≈ 1.3 · 10^−9 mutations per nucleotide per year.

Jukes and Cantor in 1969 [36, 38] proposed a simple substitution model for the divergence

between two sequences. In this model the mutation rate α is equal for all the four nucleotides.

Considering a reference sequence and a mutable one we can define qt as the fraction of bases conserved between the two sequences, and pt = 1 − qt as the fraction of mutated ones. The substitution rate is r = 3α, since all the three possible substitutions (e.g. A → C, A → G or A → T) have the same probability. For mismatching nucleotides, the probability of making the correct backward mutation is r/3, while the probability of mutating into another nucleotide still different from the original one is (2/3)r. Thus, the fractions of conserved and mutated nucleotides q and p evolve according to:

$$
q_{t+1} = q_t - r q_t + p_t \frac{r}{3}
\;\rightarrow\;
\begin{cases}
q_{t+1} - q_t = \dfrac{r}{3} \left(1 - 4 q_t\right) \\[2mm]
p_{t+1} - p_t = r \left(1 - \dfrac{4}{3} p_t\right)
\end{cases}
\tag{0.38}
$$

Imposing the initial conditions q(0) = 1 and p(0) = 0, one immediately obtains:

$$
q(t) = \frac{1}{4} \left(1 + 3 e^{-4rt/3}\right)
\;\leftrightarrow\;
p(t) = \frac{3}{4} \left(1 - e^{-4rt/3}\right)
\tag{0.39}
$$

Thus, the Jukes-Cantor distance between the two sequences is defined as:

$$
d = rt = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p(t)\right) .
\tag{0.40}
$$

Given the fraction of conserved and mutated nucleotides, the divergence time can be estimated as t = d/r [56].
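Eqs. (0.39) and (0.40) translate directly into code. The sketch below uses hypothetical function names, the per-year rate quoted above, and an arbitrary example age of 40 Myr; it generates p(t) and then recovers t from it.

```python
import math

def mutated_fraction(r, t):
    """p(t) from Eq. (0.39) for substitution rate r and elapsed time t."""
    return 0.75 * (1.0 - math.exp(-4.0 * r * t / 3.0))

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d of Eq. (0.40) from the mutated fraction p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the distance to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def insertion_time(p, r):
    """Estimated time t = d / r since insertion."""
    return jukes_cantor_distance(p) / r

r = 1.3e-9       # substitutions per nucleotide per year, as quoted in the text
t_true = 40.0e6  # assumed example age in years
p = mutated_fraction(r, t_true)
t_est = insertion_time(p, r)
print(p, t_est)
```

The guard on p reflects the saturation of the model: as t grows, p(t) approaches 3/4 and the logarithm diverges.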

Considering two sequences evolved from a common ancestor, equation (0.39) becomes

$$
q(t) = 1 - \frac{3}{4} \left(1 - e^{-8rt/3}\right)
\;\leftrightarrow\;
p(t) = \frac{3}{4} \left(1 - e^{-8rt/3}\right)
\tag{0.41}
$$

and their divergence results $d_2 = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p(t)\right)$. The divergence time is obtained as t = d2/(2r) because, if both the sequences mutate, they need half of the time to reach the same divergence level.

A more precise estimate of the divergence can be obtained using the Kimura two-parameter model [37, 38]. This model accounts for two different mutation rates, α and β, for transitions between two purines or two pyrimidines (i.e. A ↔ G and C ↔ T) and for transversions (e.g. A ↔ C). Hence, the total substitution rate is r = α + 2β. The Kimura divergence is given by

$$
d = -\frac{1}{2} \ln\left(1 - 2 p_1 - p_2\right) - \frac{1}{4} \ln\left(1 - 2 p_2\right)
\tag{0.42}
$$

where p1 and p2 are the fractions of transitions and transversions, respectively.
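A short sketch of Eq. (0.42), with hypothetical function names, is given below. As a consistency check it uses the case α = β, where one third of the observed differences are transitions and two thirds are transversions, so that the Kimura distance should reduce to the Jukes-Cantor one of Eq. (0.40). The CpG correction discussed next is not included.

```python
import math

def kimura_distance(p1, p2):
    """Kimura two-parameter distance of Eq. (0.42); p1 and p2 are the
    fractions of transitions and transversions."""
    a = 1.0 - 2.0 * p1 - p2
    b = 1.0 - 2.0 * p2
    if a <= 0.0 or b <= 0.0:
        raise ValueError("divergence is saturated: logarithm undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

def jukes_cantor_distance(p):
    """Jukes-Cantor distance of Eq. (0.40)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = 0.12  # assumed total fraction of mutated sites
d_kimura = kimura_distance(p / 3.0, 2.0 * p / 3.0)
d_jc = jukes_cantor_distance(p)
print(d_kimura, d_jc)
```

When transitions are over-represented relative to this one-third share, the two distances differ, which is the situation the two-parameter model is meant to handle.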

Moreover, it is possible to add a correction on these divergences accounting for the hypermutability of CpG dinucleotides [59, 63, 64]. This correction is particularly relevant for Alus since they have a high GC content. RepeatMasker implements this correction neglecting the mutated CpG dinucleotides. The resulting values of Kimura divergence for repetitive elements from their consensus sequences are reported in the RepeatMasker alignment file. We compared the average values of Kimura corrected divergence with Jukes-Cantor divergence for different subfamilies.

Kimura divergence is ≈ 1.5 times larger than Jukes-Cantor divergence for Alus. In the case of L1s, which are not enriched in CpG, the corrected Kimura divergence essentially coincides with the simple Jukes-Cantor divergence (see Fig. S4). Hence, the estimate of Alus age is highly influenced by the specific substitution model and by the CpG content correction. However, the linear correlation between Kimura corrected divergence and Jukes-Cantor values suggests that the relative distance between Alu subfamilies can be well described even using the Jukes-Cantor divergence.


A4 Maximum likelihood estimate

In Sec. 2.3 we suggested that the tail of the inter-REs distance distribution could be described

by the stick-breaking function, with the correct parameters.

We consider the probability distribution

$$
\begin{aligned}
p(x, x_{min}, B_0, L_E) &= \frac{1}{Z} \left(2\frac{B_0}{L_E} + \left(\frac{B_0}{L_E}\right)^2 (L_E - x)\right) e^{-\frac{B_0}{L_E} x} \\
Z &= \left(1 + \frac{B_0}{L_E} (L_E - x_{min})\right) e^{-\frac{B_0}{L_E} x_{min}}
\end{aligned}
\tag{0.43}
$$

where the factor Z guarantees the normalization of the stick-breaking function on the range [xmin, LE]†.

The log-likelihood function of this probability distribution was maximized numerically for different values of the threshold xmin and of the parameters {B0, LE}.

We selected the value of xmin (and, as a consequence, the corresponding optimal couple {B0, LE}) that minimizes the Kolmogorov-Smirnov distance between the stick-breaking function and the data.
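The fitting step can be illustrated on synthetic data. The sketch below is a simplification under stated assumptions, not the procedure actually used: LE is held fixed and only B0 is estimated by a grid search, xmin is given rather than scanned, and all names and values are hypothetical. The Kolmogorov-Smirnov distance is computed from the cumulative function implied by Eq. (0.43).

```python
import numpy as np

def log_likelihood(x, x_min, B0, LE):
    """Sum of log p(x) from Eq. (0.43) over the data x >= x_min."""
    b = B0 / LE
    log_num = np.log(2.0 * b + b**2 * (LE - x)) - b * x
    log_Z = np.log(1.0 + b * (LE - x_min)) - b * x_min
    return np.sum(log_num - log_Z)

def fit_B0(x, x_min, LE, grid):
    """Grid-search maximum likelihood estimate of B0 on the tail x >= x_min."""
    tail = x[x >= x_min]
    ll = np.array([log_likelihood(tail, x_min, B, LE) for B in grid])
    return grid[np.argmax(ll)], tail

def ks_distance(tail, x_min, B0, LE):
    """Kolmogorov-Smirnov distance between the tail data and the fitted model."""
    b = B0 / LE
    surv = lambda y: (1.0 + b * (LE - y)) * np.exp(-b * y)
    cdf = 1.0 - surv(np.sort(tail)) / surv(x_min)
    n = tail.size
    upper = np.arange(1, n + 1) / n - cdf
    lower = cdf - np.arange(0, n) / n
    return max(upper.max(), lower.max())

# Synthetic pure stick-breaking data (assumed values).
rng = np.random.default_rng(1)
B_true, LE, x_min = 2000, 1.0e6, 500.0
gaps = np.diff(np.sort(rng.uniform(0.0, LE, size=B_true)))
grid = np.arange(200, 10001, 10)
B_hat, tail = fit_B0(gaps, x_min, LE, grid)
ks = ks_distance(tail, x_min, B_hat, LE)
print(B_hat, tail.size, ks)
```

Repeating the fit over a range of xmin values and keeping the one with the smallest `ks` would mimic the selection rule described above.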

The maximum likelihood procedure described above was applied to all the 32 selected Alu subfamilies. For seven small subfamilies it was impossible to estimate the parameters (usually, we obtained xmin = 0 and B0 > Breal). For most of the other subfamilies xmin ≈ 10^5. The estimated sources are reported in Fig. S9.

†We are using the same notation as in the main text, where LE indicates the expanded genome.


Supplementary Figures.

[Supplementary Figure S1: two panels, Raw (left) and Collapsed (right), showing Frequency versus Distance [bp] for AluJr, AluJb and AluY.]

Supplementary Figure S1: Short inter-Alus distances are affected by insertion of TEs. The figure shows inter-Alus distances for three subfamilies calculated on raw data from RepeatMasker (left) and after collapsing repeats identified as single insertions (right). The peak at ≈ 300bp in the AluJb and AluJr distributions is due to insertion of recent Alus in elements of the older families. The black line on the right shows an indicative stick-breaking prediction (B = 10^5, L = 3 · 10^9) for the three subfamilies, which have almost the same density.

[Supplementary Figure S2: Counts versus Length [bp]. Legend: Alu, Mir (SINE); L1, L2 (LINE); DNA, LTR, Others.]

Supplementary Figure S2: Length distribution of transposable elements in human genome. Most of the TE insertions in the human genome are less than 1kbp long, including LINE elements. Data refer to collapsed transposable elements according to RepeatMasker ID (see Sec. A1). Satellite repeats are neglected. Bin size is 500 bp.

[Figure S3, panels (a1)-(a3): GC content maps of Chr 1 (a1), Chr 17 (a2) and Chr 18 (a3).]

(b)

Chr   %GC     %GC masked
1     0.417   0.417
2     0.402   0.397
3     0.397   0.388
4     0.382   0.365
5     0.395   0.386
6     0.396   0.386
7     0.407   0.400
8     0.402   0.395
9     0.413   0.412
10    0.416   0.415
11    0.416   0.420
12    0.408   0.399
13    0.385   0.371
14    0.408   0.404
15    0.421   0.425
16    0.447   0.453
17    0.455   0.463
18    0.398   0.391
19    0.480   0.509
20    0.439   0.448
21    0.409   0.406
22    0.470   0.498
X     0.395   0.390
Y     0.392   0.377

Supplementary Figure S3: GC content changes in different chromosomes. (a1-a3) The panels show the GC content in human chromosomes 1, 17, and 18. Blue refers to regions of low GC content, yellow/red indicates a high GC content (scale on the right). The script used to produce the figures is based on the one available at http://genomat.img.cas.cz/. (b) The table shows the average GC content (as percentage of G or C bases over all the non-masked bases) in different chromosomes, including (left column) and excluding (right column) repetitive regions (see Sec. A1).

[Figure S4, two panels: L1 and Alu. x-axis: Kimura div.; y-axis: Jukes-Cantor div. Best linear fits printed in the panels: y = 1.42x + 0.02 and y = 1.01x + 0.00.]

Supplementary Figure S4: Kimura and Jukes-Cantor divergence correlation. The figure shows the Jukes-Cantor divergence as a function of Kimura divergence (corrected for CpG hypermutability) for different Alu and L1 subfamilies. Best linear fit values are shown. While for L1 subfamilies the two divergences are essentially equivalent, for Alus the correction due to CpG hypermutability increases all the Jukes-Cantor values by a factor ≈ 1.5. The correlation coefficient is > 0.9 in both cases.

[Figure S5: scatter plot for the L1 family. x-axis: Jukes-Cantor Divergence (0.05 to 0.25); y-axis: Stick-breaking distance (0.2 to 0.8).]

Supplementary Figure S5: Divergence of inter-L1s distance distributions from null model correlates with the age. The figure shows the calculated distance from null model distribution for 107 L1 subfamilies as a function of their average Jukes-Cantor divergence. The correlation coefficient is ≈ 0.5.

Subfamily   B (no centr.)   L (no centr.)   DivSB GC-poor   DivSB GC-rich   DivSB total
AluJb       106131          2796266905      0.27            0.27            0.35
AluJo       42419           2812872175      0.28            0.28            0.39
AluJr4      11258           2795179391      0.20            0.18            0.25
AluJr       104562          2800128859      0.25            0.27            0.33
AluSc5      3927            2769968260      0.06            0.11            0.10
AluSc8      19429           2811262186      0.12            0.19            0.20
AluSc       36042           2811769081      0.11            0.18            0.18
AluSg4      11204           2801195939      0.13            0.23            0.26
AluSg7      5478            2787169000      0.10            0.15            0.17
AluSg       40788           2813571523      0.17            0.26            0.28
AluSp       61418           2808277749      0.21            0.33            0.35
AluSq10     1392            2643504441      0.09            0.17            0.15
AluSq2      49205           2811617633      0.17            0.24            0.27
AluSq4      1505            2648682617      0.08            0.12            0.12
AluSq       11613           2805403229      0.16            0.21            0.28
AluSx1      101443          2796168223      0.21            0.27            0.33
AluSx3      36617           2815790407      0.14            0.23            0.24
AluSx4      11185           2797095951      0.11            0.19            0.20
AluSx       115930          2794142909      0.21            0.29            0.34
AluSz6      38796           2811128834      0.22            0.23            0.30
AluSz       112224          2793895984      0.23            0.27            0.33
AluYa5      3615            2739162849      0.04            0.06            0.04
AluYb8      2521            2690164499      0.03            0.04            0.02
AluYc       6366            2791755286      0.03            0.10            0.08
AluY        93727           2802302678      0.10            0.24            0.20
AluYe5      1207            2617578240      0.06            0.08            0.07
AluYf1      1866            2694779102      0.11            0.12            0.12
AluYj4      2606            2747766668      0.03            0.09            0.08
AluYk2      6357            2798203017      0.06            0.14            0.14
AluYk3      5800            2773998664      0.07            0.13            0.12
AluYk4      1022            2601615948      0.04            0.11            0.05
AluYm1      4417            2769035437      0.07            0.14            0.12

Supplementary Figure S6: Parameters for the 32 analyzed Alu subfamilies. The table shows, for each subfamily, the number of insertions (B), the total genome size (L) and the distances from the null model (DivSB) in GC-poor and GC-rich regions as well as over the whole genome. Data refer to REs selected as described in Sec. A1, neglecting centromeric and telomeric regions.

[Figure S7, six log-log panels: AluJb, AluSx3, AluSq2, AluYf1, AluY - Chr 2, AluSc - Chr 1. x-axis: Distance [bp]; y-axis: Frequency.]

Supplementary Figure S7: Examples of inter-Alus distance distributions. The figure shows a few examples of inter-Alus distance distributions (orange dots) for different subfamilies. If the chromosome is not specified, data refer to the whole genome (hg38) distribution, excluding centromeric regions and high variability regions. Fragments of Alus detected by RepeatMasker as single insertion events are collapsed (see Sec. A1). The null-model prediction for each subfamily is shown (black dashed line).

[Figure S8, six log-log panels: AluJb, AluSx, AluSx4, AluY, AluYk3, AluSc. x-axis: Distance [bp]; y-axis: Counts. Series in each panel: GC-poor, GC-rich.]

Supplementary Figure S8: Examples of inter-Alus distance distributions in GC-rich and GC-poor regions. The figure shows a few examples of inter-Alus distance distributions for different subfamilies in GC-rich and GC-poor regions (different symbols as in legend, see Sec. A2). The stick-breaking prediction for each distribution is shown (black dashed line).

[Figure S9, 25 log-log panels, one per subfamily: AluJb, AluJo, AluJr4, AluJr, AluSc5, AluSc8, AluSc, AluSg4, AluSg7, AluSg, AluSp, AluSq2, AluSq, AluSx1, AluSx3, AluSx4, AluSx, AluSz6, AluSz, AluYa5, AluYc, AluY, AluYk2, AluYk3, AluYm1. x-axis: Distance [bp]; y-axis: Frequency.]

Supplementary Figure S9: Estimated sources for Alu subfamilies. The figure shows the result of the source estimate procedure for 25 Alu subfamilies (orange dots). Most of the sources are well described by an exponential function, while older subfamilies are better described by a double exponential (continuous red lines). All the data obtained as difference between the empirical distribution and the optimized stick-breaking are shown, including distances larger than $x_{min}$ (in most of the cases $x_{min} \approx 10^5$). The exponential fit was performed using only data in the range $x < x_{min}$.

[Figure S10: log-log plot. x-axis: Distance (Mbp), from 1 to 100; y-axis: Counts, from 1 to 100.]

Supplementary Figure S10: Example of inter-REs distance distribution for recent insertions. The figure shows the inter-REs distance distribution for 367 L1H elements detected in a sample of 25 individual human genomes (data from ref. [15]). These insertions are not fixed in the human population since they are not present in the reference genome. Hence, we can suppose that this pool of L1s is due to recent retrotransposition events. Comparison between the genomic distribution of these elements (orange dots) and the stick-breaking prediction (black dashed line) shows a good agreement.

[Figure S11, two log-log panels: AluJb and AluY. x-axis ticks from 1000 to 1000000; y-axis ticks from $10^0$ to $10^4$.]

Supplementary Figure S11: Effect of expansion on stick-breaking distribution. We wanted to evaluate the effect of a "realistic" expansion due to TE insertions and duplications on inter-REs distances. We calculated the size distribution of all the TEs with Jukes-Cantor divergence smaller than that of the AluY and AluJb subfamilies, respectively. These TEs are likely to be younger than the Alus belonging to the two subfamilies, and could have contributed to the expansion of inter-AluYs and inter-AluJbs distances. Data on genomic duplications in human larger than 1 kbp and with identity > 90% (genome build GRCh37) were downloaded from http://humanparalogy.gs.washington.edu/build37/build37.htm. We selected those duplications that do not contain any insertion of AluY and AluJb in the human genome (GRCh37). We simulated the random placement of a number of REs equal to the number of AluY and AluJb members present in the human genome (not shown). Genome size was reduced to 2.5 Gbp and 2 Gbp for AluY and AluJb, respectively, in order to obtain for both subfamilies a final $L_E \approx 2.8$ Gbp.
The figure shows the actual distribution of distances for AluY and AluJb elements (orange dots) and simulated data (red squares) of the expanded distribution after random insertion of the selected TEs and duplicated regions. In the case of the AluJb subfamily, duplicated regions are inserted twice, since data on duplications are likely to refer to events newer than the AluJb insertion.
Expanded distributions cannot reproduce the empirical data and are well described by the null model prediction (black dashed line). However, for AluJb we observed a small deviation of the tail from the null model distribution, possibly due to an effect similar to the one described by Sellis and coworkers [53]. This deviation is purely indicative, since the simulated expansion process does not reproduce a "real" expansion of the human genome.
Despite the many arbitrary choices we made, these results support the idea that a random expansion process is not sufficient to explain the empirical distributions.

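[Figure S12: schematic. Legend: Retrotransposon, DNA, Duplicated retrotransposon, Duplicated DNA. Top: inter-REs distances A, B, C, D with segments α, β, γ, δ. Bottom left (local duplication): A, B, C, D plus the new distance E between the copies of B and C. Bottom right (non-local duplication): copies of B and C flanked by ε1 and ε2, giving the new distances E1 and E2.]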

Supplementary Figure S12: Schematic representation of local duplication. (Top) During the duplication of a DNA sequence (delimited by black dashed lines), REs internal to the sequence can be passively copied. This process leads to the duplication of those inter-REs distances that are inside the sequence (distances B and C in figure), while it can modify the two inter-REs distances that are at the border of the sequence (A and D in figure). The local duplication process (bottom left) preserves the distances at the borders (A and D), adding a new inter-REs distance E equal to the sum of the distances between the first (the last) copied RE and the ends of the duplication (i.e. β and γ in figure). Non-local duplication (bottom right) implies the loss of an inter-REs distance $\varepsilon_1 + \varepsilon_2$ and the creation of two new distances $E_1 = \beta + \varepsilon_1$ and $E_2 = \gamma + \varepsilon_2$. Since the duplication is non-local, $\varepsilon_1$ and $\varepsilon_2$ do not have any relationship with the duplicated region, and it is impossible to establish a priori the size of $E_1$ and $E_2$.

Part II

Stochastic Models of Evolution in

Large Asexual Populations


Chapter 1

Mutations and fitness in asexual

populations

Thanks to contemporary technologies such as high-throughput sequencing and phenotypic

characterization, previously unachievable quantitative measurements of the results of controlled

laboratory evolution experiments are now possible. This is guiding theoretical investigations

and could make the validation and falsification of phenomenological theories feasible [65, 66, 67].

Moreover, the reproducibility of evolutionary outcomes can be studied in microbial populations

that are founded by the same ancestor and placed in identical environments [68].

Microorganisms have the advantage of being small and relatively easy to grow in the laboratory. Their short generation time, ranging from minutes to hours, allows evolution to be observed on a “laboratory” time scale. Moreover, studying prokaryotes makes it possible to bypass most of the complex regulatory features that characterize eukaryotic organisms.

Most experiments in microbial evolution are conceptually simple. Populations are estab-

lished (often from single clones), then propagated in a controlled and reproducible environment

for many generations. Samples of the ancestral population and from various time points in the

experiment are stored indefinitely (e.g. frozen) so that the ancestral and derived genotypes can

be compared with respect to any genetic or phenotypic properties of interest. This provides

information on the dynamics of the evolutionary process and the extent of evolutionary change.

In asexual (or rarely mating) populations, new genotypes arise from DNA mutations. If these events are not lethal for the offspring, the result is a mutant having a genotype different from that of the parent. A measure of the effect of a mutation is given by the increase or decrease of the fitness of an individual.

Fitness is a complicated trait, and from a biological point of view it can be defined as the

success of an individual in reproducing. In the context of experiments on asexual populations

it can be identified with the growth rate µ of a clone [68, 69]. Studying evolution phenomena

using statistical mechanics, fitness can be defined as the mean expected number of surviving

offspring of an individual.

The effect of a particular mutation can be classified as beneficial if it increases the reproductive success, the growth rate or the number of offspring, neutral if it has no effect, or deleterious if it decreases them. The increase (decrease) of fitness due to the acquisition of a mutation is called the advantage of that mutation.

Fitness depends not only on the genotype of an individual, but also on the phenotype and

on the environment in which it is measured. In fact, a mutation that is beneficial in a specific

context can be deleterious, for the same individual, if the environment changes.

In the specific case of large asexual populations of microorganisms, a high number of ben-

eficial mutations emerge in different clones, and cannot be mixed because of slow or absent

recombination. These beneficial mutations appearing in parallel coexist and compete to drive

adaptation. This phenomenon of concurrent beneficial mutations is related to the Fisher-Muller

hypothesis (or Hill-Robertson effect) for the advantage of recombination [70]. In general, bene-

ficial mutations also arise with a distribution of fitness advantages, which is generally believed

to be exponential [71].

Recent models have generally dealt with the competition between mutations of different

strengths and the competition between mutations that arise on different fitness backgrounds

separately. The first effect, the role of a distribution of fitness changes, is analyzed by mod-

els in which any individual is either the wild type or a mutant derived directly from the wild

type [69, 72, 73]. Thus, multiple mutations arising in the extant mutants are neglected. Con-

versely, models that explicitly deal with multiple mutations typically assume that all mutations

have the same effect [69, 74, 75, 76] (a recent work incorporating both effects [77] has shown

that for peaked distribution of advantage this is the correct effective theory). The latter kind

of model has the advantage of being simpler to treat and accessible analytically.


There appears to be one important discrepancy between the models described so far and the behavior of bacteria evolved in the laboratory for a long time (roughly, > 1000 generations). This discrepancy is well represented by the experimental sub-linear increase on long time-scales of the average population log-fitness (or fitness advantage) [78]. This is in contrast with the linear increase predicted theoretically by many models, even though they have been compared successfully with the diversity and adaptation speed of short-time laboratory evolution experiments [79].

Therefore, the core issue is to understand the evolutionary mechanisms at the basis of the

experimental slowing down of the adaptation process.

Furthermore, two recent experimental studies [80, 81] have shown a common trend in the

advantage of combined beneficial mutations occurring in different genes. In most of the cases

analyzed, the combined advantage is lower than the sum of that of individual mutations. In

other words, when mutations of loci in different genes accumulate, the effective advantage of

each of them is lower. This was shown by combinatorial genetics techniques, by constructing all

the possible configurations of a small set of mutations, and evaluating their advantage through

competition experiments. This decrease of the advantage carried by a mutation as the back-

ground fitness increases provides a possible mechanism at the basis of the observed sub-linear

increase of the average fitness in long-term evolutionary experiments [80].

This trend, referred to as diminishing returns epistasis, had been previously suggested the-

oretically on the basis of the general pattern of adaptation observed in long-term microbial

experiments [78], using a modeling framework that neglected concurrent or multiple mutations.

Another study predicts the same principle on the basis of a simple fitness landscape model

combined with the distribution of single mutation effects measured experimentally [82]. The

actual pattern in the fitness advantage associated to the same mutation in different backgrounds

observed by the two studies in ref. [80, 81] is complex, as, on top of the diminishing return ef-

fect, the advantage appears to depend on the mutation identity. Even more recent systematic

experiments [67] are unveiling a complex scenario where different mechanisms coexist for the in-

teractions of mutations between and within functional “blocks”, which can span multiple genes

along the genome.

However, the full experimental complexity is difficult to incorporate in a tractable model, and experimental data on linked mutations and interference between them are difficult to obtain. Thus, simplified descriptions, such as the multiple-mutations model, are useful to model evolving populations using a minimal quantity of information on mutations and fitness advantage.

Here, we take a simplified approach to study the diversity and speed of adaptation in the presence of diminishing returns. We define a framework that can account for multiple mutations, and incorporates the effect of diminishing return epistasis.

Namely, the fitness advantage of a mutation depends only on its order of appearance in a clone, and decreases with it. This generalizes the non-epistatic multiple-mutations model [75, 76] (recovered when the advantage does not decrease with the number of acquired mutations). We preserve the model assumption that evolution is driven by beneficial mutations which appear with constant rate.

Chapter 2

Models for microbial evolution in

long-term experiments

We build a minimal population-genetics model [73, 83, 84] including diminishing return epis-

tasis in presence of competition between beneficial mutations.

2.1 Model definition

The model describes a population of $N$ haploid individuals, or sequences. The evolutionary dynamics of the population is based on the well-known Wright-Fisher model [83, 84, 85]. Each individual at generation $t + 1$ is chosen as the offspring of an individual in class $i$ present at generation $t$ with probability $\chi_i/N$, where $\chi_i = w_i/\langle w\rangle$ is the relative fitness of class $i$. More intuitively, we can say that each individual of type $i$ produces a random number of offspring with average equal to its relative fitness $\chi_i$.
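As an illustration, one Wright-Fisher generation can be sketched in Python with multinomial sampling. This is a minimal sketch and not the simulation code used in this work: the function name `wright_fisher_step`, the representation of the population as per-class counts, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def wright_fisher_step(counts, fitness, rng):
    """One non-overlapping generation at constant population size N.

    counts[i]  : number of individuals in class i at generation t
    fitness[i] : fitness w_i of class i
    Returns the class counts at generation t + 1.
    """
    counts = np.asarray(counts, dtype=np.int64)
    fitness = np.asarray(fitness, dtype=float)
    n_total = int(counts.sum())
    mean_w = np.dot(counts, fitness) / n_total   # population mean fitness <w>
    chi = fitness / mean_w                       # relative fitness chi_i
    # Each individual of class i is the parent with probability chi_i / N,
    # so class i as a whole is chosen with probability n_i * chi_i / N.
    probs = counts * chi / n_total
    probs = probs / probs.sum()                  # guard against rounding errors
    return rng.multinomial(n_total, probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical two-class population: wild type and a 10% fitter mutant.
    print(wright_fisher_step([500, 500], [1.0, 1.1], rng))
```

Because the sampling uses only the relative fitness, multiplying all the fitness values by a common factor leaves the sampling probabilities unchanged.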

Inheritance is introduced by assigning the fitness of the parent to the offspring. Mutations change the mean fitness of the offspring $w'_i$ relative to the parental one $w_i$ according to the relation

$$w'_i = w_i(1 + s) \simeq w_i e^{s},$$

and we refer to $s$ as the advantage or selection coefficient of the mutation.

These prescriptions for the evolutionary dynamics assume non-overlapping generations, since at each generation there is a complete replacement of parents with progeny. The population size $N$ is kept constant, and there is no recombination move. Fitness differences in the population lead to natural selection, since classes of individuals with higher fitness generate increasingly larger fractions of the population, while classes with low fitness progressively disappear.

Since the rules defining the dynamics contain the relative fitness χk, the model is unaffected

by multiplication of all the wk by a common factor. Therefore, the fitness value w0 = 1 can be

arbitrarily assigned to the ancestral genotype with no mutation (or wild-type, WT).

While the fitness advantage associated to new mutations is a complex issue [86], in presence of

abundant beneficial mutations, deleterious mutations (negative effect on fitness, s < 0) do not

typically contribute to the adaptation of large populations and are customarily neglected [73, 75,

76]. For this reason, we consider only beneficial mutations with a positive selection coefficient

s, acquired by the population at a given rate.

Following this approach, in the model each offspring has a constant probability per genera-

tion Ub (the beneficial mutation rate) of acquiring a beneficial mutation.

Clonal interference regime The conditions for the emergence of the interference phe-

nomenon between sub-populations with different genotypes in a large population can be under-

stood with simple scaling arguments as a competition between processes occurring at different

time scales [73].

Since the dynamics is stochastic, it is possible to define for each mutation a “surviving” probability that is equal to $\pi(s) \simeq cs$, where $c$ is a constant factor that depends on the specific model used [87]. For the algorithm used here, $\pi(s) \sim 2s$ (see Fig. S4 and ref. [73] for a self-contained motivation). This is the probability that, when a new mutant with advantage $s$ arises, its lineage grows sufficiently in size to overcome genetic drift (stochastic fluctuations in the reproductive process).

When a certain mutation is carried by a number of individuals larger than $1/\pi(s)$, the mutation starts to expand deterministically in the population (it is “established”) and, in the absence of additional beneficial mutations, it will take over the population (go to “fixation”) logistically.

The establishment size can be explained using an intuitive argument. Consider a sub-population with $n_{mut}$ mutants in a population of $N$ individuals. The probability that at least one mutant goes to fixation (i.e. its lineage takes over the population) is

$$\Pi_{fix} = 1 - (1 - \pi(s))^{n_{mut}}. \qquad (2.1)$$

The size for which the fixation probability $\Pi_{fix}$ is around 1 is given by $(1-\pi(s))^{n_{mut}} \approx 0$, and, for small $\pi(s)$, we obtain $n_{mut} \approx \frac{1}{\pi(s)}$.
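The step from Eq. (2.1) to the establishment size can be made explicit with a first-order expansion of the logarithm for small $\pi(s)$. The lines below are a sketch of this standard argument, written with the same symbols as Eq. (2.1):

```latex
\begin{align}
(1-\pi(s))^{n_{mut}} &= e^{\,n_{mut}\ln(1-\pi(s))} \approx e^{-n_{mut}\,\pi(s)}, \\
\Pi_{fix}\Big|_{n_{mut}=1/\pi(s)} &\approx 1 - e^{-1} \approx 0.63 .
\end{align}
```

A lineage of size $1/\pi(s)$ therefore already has a probability of order one of escaping genetic drift, and this probability approaches 1 exponentially fast as $n_{mut}$ grows beyond $1/\pi(s)$. In this sense $1/\pi(s)$ sets the scale of the establishment size.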

For a population of size $N$, the scale of the fixation time can be estimated by imposing that $\frac{1}{2s}\exp(s\tau_{fix}) \simeq N$, giving a characteristic time $\tau_{fix} \simeq \frac{\ln(2Ns)}{s}$ [75].

On the other hand, the number of new mutations arising in the population at each generation is $NU_b$. Hence, the time scale for appearance and establishment of a new beneficial mutation is $\tau_{est} \simeq \frac{1}{NU_b\pi(s)}$.

Therefore, when $NU_b \ll \frac{1}{2\ln(2Ns)}$ (i.e. $\tau_{fix} \ll \tau_{est}$) a beneficial mutation can fix before any new mutation can establish, so that the evolutionary dynamics is driven by successive sweeps of new lineages arising in an essentially monoclonal population. This regime is called “selective sweeps” or “periodic selection”. Instead, a sufficiently large population with high beneficial mutation rate (i.e. $NU_b \gg 1$) evolves in the opposite regime, in which multiple mutations can establish before fixation of any of them and interfere with each other (clonal interference).

This is the regime considered here, which is believed to be relevant for laboratory evolution experiments with microorganisms.

In fact, the typical values of the selection coefficient $s$ in laboratory evolution experiments with bacteria are in the range $s \simeq 0.001 - 0.005$ [88, 89]. Exploring for the parameter $U_b$ the (experimentally relevant) range $10^{-10} - 10^{-3}$ [88, 89], and considering population sizes between $10^6$ and $10^{10}$ [65], we ensure that the population is in a clonal interference regime.
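The scaling argument above can be turned into a quick numerical check. The following Python sketch computes the two time scales and compares them; the helper names `time_scales` and `regime`, the default $c = 2$ in $\pi(s) \simeq cs$, and the use of a plain inequality in place of the strong inequalities of the text are assumptions of this example, so the result is only a rough indication of the regime.

```python
import math

def time_scales(N, Ub, s, c=2.0):
    """Return (tau_fix, tau_est) from the scaling estimates in the text.

    tau_fix ~ ln(2 N s) / s         : time for an established mutation to fix
    tau_est ~ 1 / (N * Ub * pi(s))  : waiting time for a new established mutation
    with pi(s) ~ c * s (c = 2 is assumed, as for the algorithm in the text).
    """
    pi_s = c * s
    tau_fix = math.log(2.0 * N * s) / s
    tau_est = 1.0 / (N * Ub * pi_s)
    return tau_fix, tau_est

def regime(N, Ub, s):
    """Crude classification: a plain comparison of the two time scales."""
    tau_fix, tau_est = time_scales(N, Ub, s)
    return "clonal interference" if tau_est < tau_fix else "selective sweeps"

if __name__ == "__main__":
    # Example values taken from the ranges quoted in the text.
    print(time_scales(N=1e6, Ub=1e-3, s=0.005))
    print(regime(N=1e6, Ub=1e-3, s=0.005))
```

For example, with $N = 10^6$, $U_b = 10^{-3}$ and $s = 0.005$ the establishment time is much shorter than the fixation time, so several mutations establish before any of them fixes.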


2.2 The diminishing return model

In the simplest scenario, we assume that each new beneficial mutation hits a new site on the genome, so that each offspring has a constant probability per generation $U_b$ of acquiring a new beneficial mutation. The model assumes that successive mutations do not lead to the same fitness gain, but that the fitness gain depends on the mutations that have already occurred in an individual [78]. This feature extends the non-epistatic model for multiple mutations [75, 76] and considers selection coefficients dependent on the number of mutations, i.e. $s = s_0 g'(k)$, where $g'(k)$ is a decreasing function of the number of acquired mutations $k$.

The total advantage is given by the sum of the effects of contributing mutations, i.e.

$$w(k) = e^{s_0 g(k)} \quad \text{and} \quad g(k) = \sum_{k'=1}^{k} g'(k').$$

Definition of the advantage functions

In order to fully specify the model, one has to choose a specific form for the advantage

function g′(k), describing the strength of the negative epistasis between mutations.

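A simple example is given by the choice of a fitness gain that depends on the number of the extant mutations $k$ as a power law.

In this case, the fitness is

$$w_k = \exp\left(\sum_{k'=1}^{k} s_0\,\alpha\,k'^{\,\alpha-1}\right), \qquad (2.2)$$

with $\alpha < 1$ for diminishing return, and $\alpha = 1$ for no epistasis.

In the general case $\alpha \neq 0$, one can write

$$w_k = e^{s_0 \alpha H_{k,1-\alpha}} \approx \exp\left(s_0\,\alpha\left(k^{\alpha-1}\left[\frac{k}{\alpha} + \frac{1}{2} + O(1/k)\right] + \zeta(1-\alpha)\right)\right), \qquad (2.3)$$

where $H_{k,1-\alpha} = \sum_{k'=1}^{k} k'^{\,\alpha-1}$ is the generalized harmonic number and $\zeta$ indicates the Riemann zeta function.

Thus, the relative fitness can be expressed as $\chi_k \approx e^{s_0(k^{\alpha} - \langle k^{\alpha}\rangle)}$.

For the particular case $\alpha = 0$ the fitness becomes

$$w_k = \exp\left(\sum_{k'=1}^{k} s_0\,k'^{-1}\right) \approx e^{s_0(\ln(k) + \gamma)}, \qquad (2.4)$$

which is obtained by truncating the harmonic number expansion (neglecting terms $O(1/k)$), and where $\gamma$ is the Euler-Mascheroni constant ($\approx 0.577$).

Under this assumption, the relative fitness becomes $\chi_k \approx e^{s_0 \ln\frac{k}{\langle k\rangle}}$.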
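The quality of the truncation used in Eq. (2.4) can be checked numerically by comparing the exact harmonic sum with $\ln(k) + \gamma$. The Python sketch below does this for the log-fitness; the function names and the value $s_0 = 0.5$ used in the demonstration are illustrative assumptions.

```python
import numpy as np

def harmonic(k):
    """Exact harmonic number H_k = sum_{k'=1}^{k} 1/k'."""
    return float(np.sum(1.0 / np.arange(1, k + 1)))

def log_fitness_exact(k, s0):
    """ln(w_k) from the exact sum in Eq. (2.4)."""
    return s0 * harmonic(k)

def log_fitness_approx(k, s0):
    """ln(w_k) from the truncated expansion s0 * (ln k + gamma)."""
    return s0 * (np.log(k) + np.euler_gamma)

if __name__ == "__main__":
    s0 = 0.5
    for k in (10, 100, 1000):
        exact = log_fitness_exact(k, s0)
        approx = log_fitness_approx(k, s0)
        # The difference shrinks roughly as s0 / (2k), i.e. the neglected O(1/k) term.
        print(k, exact, approx, exact - approx)
```

The leading neglected term of the expansion is $1/(2k)$, so the approximation is already accurate to a few percent after about ten mutations.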


These expressions show that we can essentially directly assume fitness functions of the form

$$w_k = \begin{cases} e^{s_0 k^{\alpha}} & \text{if } \alpha \neq 0 \\ e^{s_0 \ln(k+1)} & \text{if } \alpha = 0 \end{cases} \qquad (2.5)$$

which can be expressed, neglecting terms of order $1/k$, as a sum of powers of the number of accumulated mutations∗. The fitness function in Eq. (2.5), using $\ln(k+1)$ instead of $\ln(k)$, automatically includes the case $k = 0$ with the correct normalization condition $w(0) = 1$.

We will refer to the case $\alpha \neq 0$ as the power law model, while the case $\alpha = 0$ corresponds to the logarithmic model.

The third model we consider is characterized by a “geometric dependence” of the fitness advantage on the number of acquired mutations. In this case the advantage function is $g'(k) = q^{k-1}$ with $q < 1$ and the fitness is given by

$$w_k = \exp\left(\sum_{k'=1}^{k} s_0\,q^{k'-1}\right) = \exp\left(s_0\,\frac{1-q^{k}}{1-q}\right). \qquad (2.6)$$

In other words, the advantage accumulates following a geometric sum. As for the former models, considering only the final form of the advantage function $g(k) = \frac{1-q^{k}}{1-q}$ directly satisfies the condition $g(0) = 0$. The constant factor $1 - q$ could be absorbed in $s_0$, as for the power law model.

Note that, while in the power law and logarithmic models the fitness can increase indefinitely, the fitness for the geometric model is bounded from above: $\ln(w(k \to \infty)) = s_0\frac{1}{1-q}$.
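The three cumulative advantage functions can be collected in a few lines of Python. This is a sketch under the final forms of Eqs. (2.5) and (2.6); the function names `g_power`, `g_log`, `g_geometric` and `fitness` are hypothetical and chosen only for this example.

```python
import numpy as np

def g_power(k, alpha):
    """Power law model, Eq. (2.5): g(k) = k**alpha (0 < alpha <= 1)."""
    return np.asarray(k, dtype=float) ** alpha

def g_log(k):
    """Logarithmic model, Eq. (2.5): g(k) = ln(k + 1)."""
    return np.log1p(np.asarray(k, dtype=float))

def g_geometric(k, q):
    """Geometric model, Eq. (2.6): g(k) = (1 - q**k) / (1 - q), with q < 1."""
    k = np.asarray(k, dtype=float)
    return (1.0 - q ** k) / (1.0 - q)

def fitness(g_values, s0):
    """Fitness w(k) = exp(s0 * g(k)); the wild type (k = 0) has w = 1."""
    return np.exp(s0 * np.asarray(g_values, dtype=float))

if __name__ == "__main__":
    k = np.arange(0, 6)
    print(fitness(g_power(k, 0.5), s0=0.1))
    print(fitness(g_log(k), s0=0.1))
    print(fitness(g_geometric(k, 0.9), s0=0.1))
```

All three functions vanish at $k = 0$, so the wild type has fitness 1 in each model, and only the geometric one saturates, at $g = 1/(1-q)$.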

The diminishing return model: phenomenology.

The population is composed of classes of individuals with the same number of mutations

that are in one-to-one correspondence to fitness advantage classes (Fig. 1a and Fig. S2). Direct

simulation of the model shows that both the mean advantage 〈s0g(k)〉 and average number of

mutations 〈k〉 grow sub-linearly for intermediate to long times (Fig. 1b, 1c).

This trend is independent from α (or from the specific model of epistasis g′(k)) and is due

to the fact that decreasing advantage and the consequent rise of the establishment threshold

for clones together slow down adaptation.

∗This notation requires an appropriate rescaling of s0α → α in the case of α ≠ 0.

The time derivatives of $\langle s_0 g(k)\rangle$ and $\langle k\rangle$ estimate the adaptation speed $v_s$ and the mutation-accumulation speed $v_k$ of a typical realization. Figure 1d shows an average of $v_k$ over 100 realizations, plotted as a function of time. The simulations indicate that $v_k$ relaxes to a plateau which is close to the beneficial mutation rate $U_b$. Equivalently, for long times, the mean number of fixed mutations shows a linear behavior in time with a rate close to $U_b$ (red line in Fig. 1c). In the same long-time limit, the advantage of a mutation $s = s_0 g'(k)$ drops asymptotically to zero.

At long times there is a transition to an effectively neutral regime: the selection coefficient

s0g′(k) becomes too small to be relevant, and the fixation dynamics is driven solely by genetic

drift. The probability of fixation of an essentially neutral mutation is ∼ 1/N while the rate

of appearance of new mutations is UbN . Therefore, the pace at which new mutations are

accumulated is approximately vk ∼ Ub. This neutral long-time regime is outside of the limit of

applicability of the model, and has to be regarded as unphysical, since when vk = Ub, deleterious

mutations cannot be neglected [90]. Thus, for any finite N the asymptotic trend of 〈k〉 has to be

interpreted as an effective signature of a change of regime for both vk and vs, where beneficial

mutations should be in equilibrium with deleterious ones, which could possibly be captured by

a variant of the model including deleterious mutations [90].

Note also that realistically the beneficial mutation rate itself could decrease in the later

stages of evolution [73, 91], eventually giving a contribution to the adaptation dynamics (see

Discussion).

The diminishing return model: infinite population limit

In the limit of infinite population, $N \to \infty$, the dynamics of the model can be described using the following equation [73],

$$f(k, t) = (1 - U_b)\,f(k, t-1)\,\frac{w(k)}{\langle w\rangle_{t-1}} + U_b\,f(k-1, t-1)\,\frac{w(k-1)}{\langle w\rangle_{t-1}}, \qquad (2.7)$$

where $f(k, t)$ is the frequency of individuals with $k$ beneficial mutations at generation $t$, $w(k) = e^{s_0 g(k)}$, and $\langle w\rangle_t = \sum_k w_k f(k, t)$ is the mean fitness.
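Eq. (2.7) is a deterministic recursion and can be iterated directly. The Python sketch below shows one way to do it; the function name `iterate_frequencies`, the truncation of the mutation classes at a finite maximum $k$, and the parameter values in the demonstration are assumptions of this example and not the procedure used in this work.

```python
import numpy as np

def iterate_frequencies(f, w, Ub):
    """One generation of Eq. (2.7) for the class frequencies f(k, t).

    f  : array of frequencies f(k, t-1), k = 0 .. k_max
    w  : array of fitness values w(k)
    Ub : beneficial mutation rate
    Mutants produced by the last class k_max are dropped (truncation), so
    k_max must be large enough that f[k_max] stays negligible.
    """
    mean_w = np.dot(w, f)                 # <w>_{t-1}
    selected = f * w / mean_w             # frequencies after selection
    new_f = (1.0 - Ub) * selected         # offspring that do not mutate
    new_f[1:] += Ub * selected[:-1]       # offspring moving from k-1 to k
    return new_f

if __name__ == "__main__":
    s0, alpha, Ub = 0.5, 0.5, 1e-3        # illustrative values only
    k = np.arange(200)
    w = np.exp(s0 * k ** alpha)           # power law model, Eq. (2.5)
    f = np.zeros(k.size)
    f[0] = 1.0                            # start from a wild-type population
    for _ in range(50):
        f = iterate_frequencies(f, w, Ub)
    print("mean number of mutations:", np.dot(k, f))
```

Selection and mutation both conserve the total frequency, so the array stays normalized as long as the truncated top class remains empty.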

[Figure 1: four panels (a)-(d). Recoverable axis labels: (a) frequency $f(k,t)$ versus number of mutations $k$, and $f(s_0 g(k),t)$ versus advantage $s_0 g(k)$, with the widths $L_k$, $L_s$ and the speeds $v_k$, $v_s$ marked; (b)-(d) $\langle s_0 g(k)\rangle$, $\langle k\rangle$ and $v_k$ as functions of the number of generations.]

Figure 1: Basic features of the model. (a) Because of competition between beneficial mutations, the population is divided into sub-populations with different frequencies, defined by the number of mutations $k$ (see also Fig. S2). $L_k$ is the difference, in number of mutations, between the maximum number of mutations found in a clone, $k_{max}$, and the mean. This induces a distribution for the log-fitness $s_0 g(k)$. (b-c) Both distributions travel in time, driven by established beneficial mutations, with instantaneous velocities $v_k$ and $v_s$. The panels show the increase in time of $\langle k\rangle$ and of $\langle s_0 g(k)\rangle$ obtained by direct simulation of the diminishing return model. The plot in panel (c) is in log-log scale, and the data are compared with a reference straight line (dashed blue line, with slope $5\cdot10^{-3}$) to highlight the sublinear growth of $\langle k\rangle$. The continuous red line shows the asymptotic long-time linear behavior with slope corresponding to $U_b$. (d) Long-time behavior of the mean speed of fixed mutations $v_k$ (green symbols), averaged over different realizations. For long times, this quantity decreases (as a power law) towards the limit value $v_k = U_b$, where the assumptions of the model break down and deleterious mutations need to be accounted for [90]. This limit also corresponds to the limit value of $v_k$ obtained by an infinite-N estimate (see text). Simulations are carried out using the parameters $N = 5\cdot10^7$, $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1\cdot10^{-3}$. Averages are computed over 100 realizations (these averages are implied in the notations for the y-axis labels).

[Figure 2: two panels. (a) $\langle \mathrm{Var}_k\rangle_R$ versus $\langle \mathrm{Var}_{MF}\rangle_R$; (b) $\sigma_R(\mathrm{Var}_k)/\langle \mathrm{Var}_k\rangle_R$ versus $\log_{10}(N)$ at generations $10^5$, $10^6$ and $10^7$.]

Figure 2: The infinite-N approximation captures a relation between $v_k$ and the width of the fitness class distribution, which is valid at intermediate times for moderate N and until longer times for large population sizes. (a) Simulated variance of the mutation class distribution, shown as a function of the expected variance from the infinite-N estimate ($\mathrm{Var}_{inf} = (v_k - U_b)\langle k\rangle^{1-\alpha}(s_0\alpha)^{-1}$, see Eq. (2.9)). To avoid ambiguities, averages over realizations are indicated by a suffix R. The continuous red line represents the theoretical prediction $\mathrm{Var}_{inf} = \mathrm{Var}_k$. The error bars (standard deviations over realizations, $\sigma_R(\mathrm{Var}_k)$) become larger with $\langle \mathrm{Var}_k\rangle_R$. (b) While for increasing times $\sigma_R(\mathrm{Var}_k)$ diverges, the relative variability $\sigma_R(\mathrm{Var}_k)/\langle \mathrm{Var}_k\rangle_R$ over realizations decreases with increasing population size $N$, for any fixed time. This suggests that the infinite-N estimate is well-defined. Simulations are carried out using the parameters $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1\cdot10^{-3}$. Population size in panel (a) is $N = 10^7$.

Multiplying Eq. (2.7) by $k$ and summing over $k$ gives the following expression for the dynamics of the mean number of mutations $\langle k\rangle(t) = \sum_k k f(k,t)$,

$$\langle k\rangle(t+1) = \frac{\langle k\,w\rangle(t)}{\langle w\rangle(t)} + U_b . \qquad (2.8)$$

This expression can be further simplified assuming that the frequency distribution is narrowly

peaked around the mean (which travels in time) and that it can be expressed as f(k, t) ≈

δ(k; 〈k〉(t)). In this case it can be easily verified that vk = Ub.

A different, more instructive relation, which takes into account the width of $f(k,t)$, can be obtained starting from Eq. (2.8) and expanding $w(k)$ under the assumption that $D_k \equiv (k - \langle k\rangle) \ll \langle k\rangle$ for every index $k$ of non-empty classes. For the power law model, for example, the fitness becomes $w(k) = e^{s_0 k^\alpha} \approx e^{s_0 \langle k\rangle^\alpha}\left(1 + s_0\alpha\langle k\rangle^{\alpha-1} D_k\right)$.

The assumption $D_k \ll \langle k\rangle$ is verified by simulations (see Supplementary Fig. S5) and by further considerations on the finite-N width of the distribution given in the following sections.

Computing the averages 〈w〉 and 〈kw〉 to first order in Dk, and noticing that 〈Dk〉 = 0 and

〈kDk〉 = Vark, we obtain an expression for the speed of accumulated mutations as a function

of the variance of the fitness class distribution.

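Estimating $v_k$ as $d\langle k\rangle/dt \approx \langle k\rangle(t+1) - \langle k\rangle(t)$ gives, for the case $g(k) = k^\alpha$,

$$\frac{d\langle k\rangle}{dt} \approx s_0\alpha\,\langle k\rangle^{\alpha-1}(t)\,\mathrm{Var}_k(t) + U_b . \qquad (2.9)$$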

According to this equation, vk is driven by two terms, the increase of mutations due to the

beneficial mutation rate Ub and the selection of individuals with larger fitness. This result is

connected to Fisher’s fundamental theorem and the result obtained by Guess [92], which relate

the speed of adaptation to the variance of the fitness. In this case, the speed of accumulation of successive mutations is related to the width of the mutation class histogram, but rescaled by the factor $\langle k\rangle^{\alpha-1}(t)$, which decreases with time.

Note that in the limit α = 1 we recover the usual linear proportionality, since the advantage

is linear in the mutation class index [73].

The infinite-N limit of the model with $\alpha = 1$ has been previously addressed by Park et al. [73] with a moment generating function approach. In particular, they estimated the distribution variance as $\mathrm{Var}_k \simeq (1-U_b)/s_0$. This estimate, substituted in Eq. (2.9), gives their expression for the speed, $v_k \simeq 1$ (in the limit of small $U_b$), suggesting that our result is a consistent generalization to the epistatic case.
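The consistency check with the $\alpha = 1$ case can be reproduced with a few lines of arithmetic. The sketch below evaluates the right-hand side of Eq. (2.9) as a plain function; the function name and the numerical values of $s_0$ and $\langle k\rangle$ are illustrative assumptions, not quantities taken from the simulations.

```python
def mutation_accumulation_speed(s0, alpha, mean_k, var_k, ub):
    """Right-hand side of Eq. (2.9) for the power law advantage g(k) = k**alpha."""
    return s0 * alpha * mean_k ** (alpha - 1.0) * var_k + ub

# alpha = 1 with the variance estimate Var_k = (1 - Ub)/s0:
# the speed reduces to s0 * (1 - Ub)/s0 + Ub = 1.
s0, ub = 0.05, 1e-3            # illustrative values
var_linear = (1.0 - ub) / s0
v_linear = mutation_accumulation_speed(s0, 1.0, mean_k=10.0, var_k=var_linear, ub=ub)

# With alpha < 1 and a fixed variance, the selection term shrinks as <k> grows.
v_early = mutation_accumulation_speed(0.5, 0.02, mean_k=10.0, var_k=100.0, ub=1e-3)
v_late = mutation_accumulation_speed(0.5, 0.02, mean_k=1000.0, var_k=100.0, ub=1e-3)
```

Holding the variance fixed in the second comparison is a simplification made only to isolate the effect of the factor $\langle k\rangle^{\alpha-1}$.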


For the diminishing returns model, the increase in the width of the distribution of $k$ does not compensate for the factor $\langle k\rangle^{\alpha-1}$ (which tends to 0), and, for long times, Eq. (2.9) predicts the limit velocity $v_k = U_b$, as observed in simulations (Fig. 1d).

Simulated data agree well, at intermediate times, with the mean-field estimate for the variance $\mathrm{Var}_k$ of the mutation class distribution (Fig. 2). Additionally, $\mathrm{Var}_k/\langle k\rangle$ decreases quickly with time (see Fig. S5), justifying the assumption of small $D_k/\langle k\rangle$ †.

However, the variability of $\mathrm{Var}_k$ over different realizations increases quickly. In order to verify whether these fluctuations are well-behaved, we have evaluated $\gamma = \sigma_R(\mathrm{Var}_k)/\langle \mathrm{Var}_k\rangle_R$. Here $\mathrm{Var}_k$ indicates the variance of the mutation class distribution in a single realization (roughly analogous to $L_k^2$), and the suffix R indicates averages over realizations. Specifically, $\langle x\rangle_R$ indicates the average of the quantity $x$ over different realizations, while $\sigma_R(x)$ is its standard deviation. Therefore $\gamma$, plotted in Fig. 2b as a function of $N$, represents the relative variability over realizations of the variance $\mathrm{Var}_k$ of the distribution.
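For concreteness, $\gamma$ can be computed from a list of per-realization variances as follows. This is a minimal sketch: the function name is hypothetical, the input values are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not specify the normalization.

```python
from statistics import mean, pstdev

def relative_variability(var_k_per_realization):
    """gamma = sigma_R(Var_k) / <Var_k>_R over a list of realizations."""
    average = mean(var_k_per_realization)
    return pstdev(var_k_per_realization) / average

# Made-up variances from four hypothetical realizations at a fixed time.
gamma = relative_variability([95.0, 105.0, 100.0, 100.0])
```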

For any fixed time, this quantity decreases with $N$, suggesting that the mean-field limit is well-defined for infinite populations. Conversely, fixing $N$ and increasing $t$, $\gamma$ appears to reach finite values, hence $\mathrm{Var}_k$ (and hence $v_k$) seems to be non-self-averaging in time.

This effect and its extent are due to both genetic drift and to the shape of the fitness

landscape.

In summary, a mean-field description of the population dynamics does not work properly

for longer time-scales at finite N , as previously shown in the case of no epistasis (α = 1) [73].

However, the infinite-population limit is instructive for understanding the main mechanisms

driving the model, and provides reasonably good estimates at intermediate times even for finite

populations. Finally, the extreme diversity between realizations at finite population sizes might

have some empirical relevance since most of the experimental results concern a single or a few

evolutionary trajectories.

†It is expected that $|D_k| \lesssim (\mathrm{Var}_k)^{1/2}$ for every $k$.


The diminishing return model: behavior at finite population size.

Experimental conditions should be modeled accounting for finite-size populations at intermediate times (i.e. on the relevant experimental time scale of $\approx 10^2$ to $10^4$ generations, with $\langle k\rangle \approx 10$ to $10^2$). Direct simulations of the diminishing return model show that the frequency distributions of the mutation and advantage classes are approximately Gaussian (as in a constant advantage model), even when the advantage function has fairly high curvature (Fig. 3a, Fig. S7).

The mean-field estimate suggests that the widths of these histograms are related to the speed of adaptation and consequently of mutation accumulation. Given the discrete nature of these distributions, their widths are well represented by the distances $L_k$ and $L_s$ of the foremost bin from the average (shown in Fig. 1a). For a diminishing return model, $L_k$ is an increasing function of the mean number of mutations $\langle k\rangle$, while $L_s$ decreases with $\langle k\rangle$ (Fig. 3b).

The two speeds are connected by the decay of the advantage between k and k + 1, as well as

by the change in the width of the distributions with increasing 〈k〉. Indeed, in the long-time

limit, even if vk > 0, vs vanishes. The existence of a large number of sub-populations with

different k but essentially the same fitness, leads to the effectively neutral behavior discussed in

the previous section.

The “stochastic edge” estimates for the finite population adaptation speed available for the multiple-mutations model are based on the hypothesis that the only class subjected to substantial stochastic effects is the fittest one, i.e. that $\Delta s \gtrsim U_b$ [75, 90]. For the diminishing return model, this approximation in general fails when $\Delta s \approx U_b$. However, for experimentally relevant parameters we can always suppose that the stochastic edge approximation is valid. Supposing that $s_0 = 5\cdot10^{-1}$, $\alpha = 0.02$ (a much stronger epistatic effect than the one we estimate from experimental data, see Sec. 2.3) and that $L_k \approx 50$ (of the order of the values obtained from simulations with this parameter set and population sizes $N = 10^7$ to $10^{13}$), $\Delta s$ becomes close to a beneficial mutation rate $U_b \approx 10^{-3}$ (which can be considered very large [88, 89]) for $\langle k\rangle \approx 5\cdot10^2$. This exceeds the experimentally interesting range of $\langle k\rangle$ ($10^1$ to $10^2$).
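The order of magnitude quoted above can be checked directly. The sketch below assumes the approximation for the edge advantage given later in this section, $\Delta s \approx s_0\alpha L_k k^{\alpha-1}$, treats $k$ and $\langle k\rangle$ as interchangeable at this level of accuracy, and solves for the value of $k$ at which $\Delta s$ equals a target value; the function name is hypothetical.

```python
def k_where_edge_advantage_equals(target, s0, alpha, lk):
    """Solve s0 * alpha * lk * k**(alpha - 1) = target for k (requires alpha < 1)."""
    return (s0 * alpha * lk / target) ** (1.0 / (1.0 - alpha))

# Parameters quoted in the text: s0 = 0.5, alpha = 0.02, Lk ~ 50, Ub ~ 1e-3.
k_star = k_where_edge_advantage_equals(target=1e-3, s0=0.5, alpha=0.02, lk=50)

# Round trip: the edge advantage evaluated at k_star should return the target.
delta_s_at_k_star = 0.5 * 0.02 * 50 * k_star ** (0.02 - 1.0)
```

With these inputs `k_star` comes out at a few hundred, consistent with the estimate $\langle k\rangle \approx 5\cdot10^2$.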

Thus, since the simulated advantage and mutation class histograms are both nearly Gaussian (but not stationary in width), it is possible to generalize the estimates applied for the standard multiple-mutations model [75]. Supposing a slow increase of the width of the distribution with $k$, and assuming that the width of the histogram is stable during the establishment time $\tau_k$ necessary for the new class to reach the establishment threshold, the mean of the distribution moves from $\langle k\rangle_t$ to $\langle k\rangle_t + 1$ during this time and $L_k \approx L_{k+1}$. For $\langle k\rangle \gg L_k > 1$ (and $\langle k\rangle$ not too high, due to the condition of sufficiently large $\Delta s$), the advantage of the edge with respect to the mean class is $\Delta s_k = s_0(k^\alpha - \langle k\rangle^\alpha) \approx s_0\alpha L_k k^{\alpha-1}$.

[Figure 3: panels (a)-(c). (a) frequency versus number of mutations (top) and versus advantage (bottom), at generations from $10^3$ to $7.5\cdot10^4$; (b) $L_k$ (top) and $L_s$ (bottom) versus $\langle k\rangle$; (c) $v_k$ (top) and $v_s$ (bottom) versus $\langle k\rangle$.]

Figure 3: (Color online) The histograms of fitness advantage and mutation classes have nearly Gaussian forms. While adaptation slows down, the latter histogram expands while the former becomes increasingly peaked. (a) Histograms of the mutation classes (top) and advantage classes (bottom) obtained from simulations averaging over 200 realizations at different generations (different symbols, see legend). The parabolic form in the semi-log plot indicates that they are approximately Gaussian (solid lines connecting the symbols). The establishment size $1/\pi(s)$ is represented as a dashed line in the top panel. (b) Simulated data for the widths of the mutation class histogram, $L_k$ (green squares, top), and of the fitness advantage histogram, $L_s$ (blue circles, bottom), plotted as a function of the mean number of mutations $\langle k\rangle$. The continuous line represents the theoretical estimates of the width (see Eq. (2.10) in Appendix). (c) Plots of the speed of mutation accumulation $v_k$ (top, green squares) and of adaptation $v_s$ (bottom, blue circles), as a function of the mean number of mutations $\langle k\rangle$. Continuous lines are the corresponding theoretical estimates (see Eq. (S16) of Appendix). Note that since $\tau_k$ depends logarithmically on $L_k$, the estimates for both speeds are in satisfactory agreement with the simulated data even if $L_k$ is approximated more roughly. The parameters used in the simulations are $N = 10^9$, $U_b = 6\cdot10^{-6}$, $s_0 = 0.1$, $\alpha = 0.2$, compatible with those estimated (see Sec. 2.3). Averages are performed over 200 realizations.

Different estimates can be obtained depending on the approximations taken, giving implicit or explicit formulas. For example, assuming that $k \gg L_k$ and neglecting the logarithmic corrections in $L_k$, we obtain a closed expression for $L_k$,

$$L_k = \frac{2\log\left(N s_0 \alpha k^{\alpha-1}\right)}{\log\left(s_0\alpha k^{\alpha-1}/U_b\right)} . \qquad (2.10)$$
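Eq. (2.10) is straightforward to evaluate numerically. The sketch below is only an illustration: the function name is hypothetical, and the parameter values are those quoted in the caption of Fig. 3 ($N = 10^9$, $U_b = 6\cdot10^{-6}$, $s_0 = 0.1$, $\alpha = 0.2$), with $k = 20$ and $k = 60$ chosen arbitrarily inside the plotted range.

```python
import math

def edge_width(n, ub, s0, alpha, k):
    """Closed expression of Eq. (2.10) for the width Lk (natural logarithms)."""
    s_eff = s0 * alpha * k ** (alpha - 1.0)  # effective advantage s0*alpha*k^(alpha-1)
    return 2.0 * math.log(n * s_eff) / math.log(s_eff / ub)

lk_at_20 = edge_width(n=1e9, ub=6e-6, s0=0.1, alpha=0.2, k=20)
lk_at_60 = edge_width(n=1e9, ub=6e-6, s0=0.1, alpha=0.2, k=60)

# For alpha = 1 the effective advantage is simply s0 and the expression
# no longer depends on k.
lk_constant_advantage = edge_width(n=1e9, ub=6e-6, s0=0.1, alpha=1.0, k=20)
```

For these inputs the width is of order 5 and grows slowly with $k$, in line with the statement that $L_k$ is an increasing function of $\langle k\rangle$.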

Comparison with simulated data shows that the expression for $L_k$ in Eq. (2.10), including small-$L$ corrections, is a reasonably good estimate of the width of the distribution for empirically plausible parameters (Fig. 3b). In particular, the speeds of adaptation and mutation accumulation (Fig. 3c) are well captured by this analytical description.

A mathematical argument for the validity of this extension of the multiple-mutations model

is presented in the Appendix.

The approximations discussed above are valid as long as $\langle k\rangle$ is large enough to exceed $L_k$, but small enough not to make $\Delta s_k$ too small (one can e.g. impose that $\Delta s_k \gg 1/U_b$ to be far from the neutral limit).

One source of error in this estimate is the neglect of logarithmic corrections. It is possible to account for these corrections using the simulated values of $L_k$. We verified that this leads to a small underestimate of the width of the distribution (not shown). Additional sources of deviation are the neglect of non-leading mutation classes in estimating the establishment time, the use of an average growth rate, and the assumption of exponential growth of each class starting from the establishment size (see Appendix). For the empirically plausible values of $N$, $U_b$ and $s_0$, we estimate that the model should be valid for values of $\langle k\rangle$ up to $10^2$. Depending on the parameters and the specific model chosen, this range of validity can be far larger. Moreover, for even larger $k$ we verified that $v_k = 1/\tau_k$ tends to a constant ($U_b$, with the approximations taken), restoring the correct result for infinite population size.


2.3 Parameter-matching procedure for diminishing return model

We used a simple procedure for choosing a set of parameters and a functional form of the

advantage g(k) compatible with existing data.

We considered fitness/mutations data from two laboratory evolution experiments, using the three different variants of the diminishing return model defined in the previous section. The power law model, where the advantage is described by $s_0 g(k) = s_0 k^\alpha$, is the main case we presented, while the advantage functions of the other two models considered have a logarithmic and an exponential dependence on $k$.

Experiments analyzed

We considered data from two experiments. The first data set concerns Acinetobacter baylyi and was obtained from a chemostat experiment [93], while the second data set comes from the initial $2\cdot10^4$ generations of the well-known “Escherichia coli long-term evolution experiment” (LTEE) [66, 68, 94].

The A. baylyi experiment studied the population dynamics in a chemostat using a minimal medium supply for about four months (≈ 3000 generations), at a dilution rate $D \approx 0.7\,\mathrm{h}^{-1}$. The use of a chemostat made it possible to grow a large population ($N \approx 3\cdot10^{10}$) under controlled conditions for a fairly long time. Since the number of individuals is large, it is expected that different sub-populations will grow in parallel in the clonal interference regime. This assumption has been confirmed by population sequencing data [93]. Maximum growth-rate measurements were performed in batch on 21 isolated clones and on the original strain introduced in the chemostat (wild type), fitting the growth curve during the exponential phase.

Even if generations are not synchronous, the mean growth rate is held fixed by the dilution

rate in the chemostat. In fact, when a monoclonal population grows in the chemostat, its growth

rate equals the dilution rate D, and the generation time is defined as tgen = ln(2)/D.


In the presence of different sub-populations the dynamics becomes more complex [95]. The measured values of $\mu_{\max}$ differ from the effective growth rate that each population can reach in the chemostat because of competition between different sub-populations for a limited amount of nutrients.

For the sake of simplicity, and since the detailed dynamics is not experimentally accessible, we assume that the growth rate of all clones is fixed by the dilution rate over the whole experiment, while the measured values of $\mu_{\max}$ are taken to be indicative of the effective fitness of the different sub-populations inside the chemostat [93]. Thus, the different maximum growth rates of clones measured in batch are interpreted as different survival probabilities of their offspring inside the chemostat.

Fitness is defined as

$$w(i) = e^{m_i} = e^{\mu_{\max,i}\, t_{gen}} , \qquad (2.11)$$

where the index $i$ indicates the $i$-th sub-population, $m_i$ is the growth rate expressed in 1/generation, the experimental data for $\mu_{\max,i}$ have units of $\mathrm{h}^{-1}$, and $t_{gen} = \ln(2)/D \approx 1\,\mathrm{h}$. In simple words, the fitness of an individual is defined as the amount of offspring it produces, assuming it can grow at the maximum growth rate over an average generation defined by the chemostat dilution rate (hence, fitness is a dimensionless quantity).

Since the relevant quantity of the dynamics is the relative fitness, during the parameter-matching procedure we used the normalized fitness of clone $i$, $w_{exp}(i) = e^{\mu_{\max}(i) - \mu_{WT}}$, to infer the functional form of the advantage. Note that, since the reference fitness value is given by the ancestral growth rate, $w_{exp}(\mathrm{WT}) = 1$.
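The conversion from measured growth rates to fitness values can be sketched as follows. The function names and the growth-rate values are illustrative assumptions (they are not data from ref. [93]); the default generation time of 1 h in the normalized fitness reproduces the expression above, where the factor $t_{gen} \approx 1\,\mathrm{h}$ is left implicit.

```python
import math

def generation_time(dilution_rate_per_h):
    """t_gen = ln(2)/D, in hours, for a chemostat with dilution rate D (1/h)."""
    return math.log(2.0) / dilution_rate_per_h

def fitness(mu_max_per_h, t_gen_h):
    """Eq. (2.11): expected offspring over one generation at growth rate mu_max."""
    return math.exp(mu_max_per_h * t_gen_h)

def normalized_fitness(mu_max_per_h, mu_wt_per_h, t_gen_h=1.0):
    """Fitness relative to the wild type; t_gen_h = 1 matches the formula in the text."""
    return math.exp((mu_max_per_h - mu_wt_per_h) * t_gen_h)

t_gen = generation_time(0.7)           # D ~ 0.7 1/h, as in the experiment
mu_wt, mu_clone = 0.90, 1.10           # hypothetical growth rates in 1/h
w_clone = normalized_fitness(mu_clone, mu_wt)
w_wild_type = normalized_fitness(mu_wt, mu_wt)
```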

Whole-genome sequencing was performed on two single clones isolated at the end of the experiment (AB2800b and AB2800a) and on the ancestral strain, which served as a reference for identifying mutations in evolved clones. A total of 11 mutations were detected in the evolved clones, eight of them in common between the two.

Population sequencing was performed at three time points, confirming the presence of different subpopulations.

Additional sequencing was performed in the remaining selected clones on PCR fragments encompassing the mutated loci identified in the two end-point clones. This made it possible to reconstruct a sketch of the history (or the genealogy) of the mutation appearances. Thus, the number of accumulated mutations has to be considered as a lower bound for all the clones, except for the ancestor and the two fully sequenced clones, where it is measured directly.


However, the indications of the number of mutations from the population sequencing data are

compatible with the inferred values of k [93].

A comprehensive description of the experimental methods used for the experiments with

A. baylyi can be found in ref. [93].

The E. coli long-term evolution experiment concerns twelve E. coli populations evolved in parallel in batch for about twenty-five years, corresponding to more than $6\cdot10^4$ generations. A 1:100 serial dilution was performed daily, allowing ≈ 6.6 generations each day and an effective population size of $\approx 2\cdot10^7$ individuals [68]. The maximum population size ($\approx 5\cdot10^8$) is fixed by the total amount of nutrient in the medium. Interestingly, six populations became hypermutators after a few thousand generations, accumulating a large number of mutations. We used the mutation and fitness data of the population designated Ara-1 for the first 20000 generations (corresponding to ≈ 10 years), as given in ref. [66].
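The number of generations per day follows from the dilution factor, since regrowing a 1:100 diluted culture to its previous density requires $\log_2(100)$ doublings. The short sketch below checks this arithmetic; it assumes one transfer on each of 365 days per year without interruption, which is an idealization, and the function name is hypothetical.

```python
import math

def generations_per_transfer(dilution_factor):
    """Doublings needed to regrow a culture diluted by the given factor."""
    return math.log2(dilution_factor)

gens_per_day = generations_per_transfer(100)
gens_in_25_years = gens_per_day * 365 * 25   # idealized: one transfer every day
```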

Fitness was measured experimentally through competition experiments as the logarithm

of the increase in frequency. Competition experiments were performed between samples at

different times (every 1000 generations) and a spontaneous mutant of the ancestor, which is

easier to track visually and has been verified to have almost the same fitness [96]. The relative

log-fitness of population i with respect to the wild type is defined as

$$\varphi(i) = \frac{\log\left(f^{(1)}_i / f^{(0)}_i\right)}{\log\left(f^{(1)}_{WT} / f^{(0)}_{WT}\right)} , \qquad (2.12)$$

where $f^{(0)}$ and $f^{(1)} = f^{(0)} e^{\mu t}$ indicate the frequencies at the beginning and the end of the competition experiment [66, 96]. The ratio $\varphi(i)$ can be seen as the ratio between the growth rates of the two populations and corresponds to log-fitness in the model. Note again that the experimental data are dimensionless.

The difference between the growth rates can be approximately deduced as

$$(\varphi(i) - 1)\,\mu \approx (\mu_i - \mu_{WT}) ,$$

where $\mu$ is the mean growth rate [68]. Since we are interested in expressing the fitness as the mean number of offspring per generation, we use as mean growth rate $\mu = \log(2)\ \mathrm{generation}^{-1}$. Thus we define the normalized fitness as $w(i) = e^{(\varphi(i)-1)\log(2)}$. The fitness of the reference strain is again $w_{exp} = 1$.
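The two definitions above translate into a short computation. This is a sketch with hypothetical function names and made-up numbers, intended only to show how a competition-assay measurement maps onto the normalized fitness used in the model.

```python
import math

def relative_log_fitness(f0_i, f1_i, f0_wt, f1_wt):
    """Eq. (2.12): ratio of the log-increases of the evolved and reference strains."""
    return math.log(f1_i / f0_i) / math.log(f1_wt / f0_wt)

def normalized_fitness_from_phi(phi):
    """w = exp((phi - 1) * log 2), i.e. mean offspring per generation relative to WT."""
    return math.exp((phi - 1.0) * math.log(2.0))

# Made-up example: the evolved strain grows fourfold while the reference doubles.
phi = relative_log_fitness(f0_i=1.0, f1_i=4.0, f0_wt=1.0, f1_wt=2.0)
w = normalized_fitness_from_phi(phi)
w_reference = normalized_fitness_from_phi(1.0)
```

In this made-up case the evolved strain has twice the growth rate of the reference, so $\varphi = 2$ and the normalized fitness is 2.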


Genome sequencing was performed on samples from generations 2000, 5000, 10000, 15000 and 20000, as well as on the ancestor. A total of 45 mutations were found in the most evolved

strain, most of which were stable in later clones [66]. Detailed information about the long-term

evolution experiment with E. coli, and in particular about fitness measurements can be found

in refs. [66, 68, 94, 96].

Comparison of the experiments reveals some features in common and some remarkable differences between the two. Both experiments concern bacteria but were performed using distinct propagation techniques. Their durations are quite different, both in terms of generations ($2\cdot10^4$ and $3\cdot10^3$) and of experimental time (≈ 10 years and ≈ 4 months). However, in both cases the duration is long enough to observe a deceleration of fitness increase [78]. Another feature

shared by the two experiments is the large (effective) population size, which suggests that the

clonal interference regime might be relevant. The simultaneous presence of different genotypes

within the population has been verified in both the experiments [93, 97]. Moreover, the decrease

of the beneficial effect of the first five fixed mutations in the E. coli serial dilution experiment

has been demonstrated [80]. The effect is more complex than the description of diminishing

returns given here. However, as suggested by the authors, one can surmise that a simplified

model including epistatic interactions might be useful to roughly describe this phenomenon.

For all the analyzed clones in both experiments, it is possible to associate fitness values with

numbers of mutations, and thus bridge fitness with the parameter k in our model.

Note that in our modeling framework, as in most standard evolutionary models, the popula-

tion size is kept constant, all individuals are substituted by newly generated offspring at every

generation and we assume a constant time interval between generations, although none of these

assumptions are completely verified in the two experiments we considered. Since the speed of

evolution depends logarithmically on the population size, the error on the estimate of Ub given

by the constant population approximation should be very low.


Matching procedure and estimate of Ub

During the first step of the parameter-matching procedure, we found the best-fitting param-

eters for each functional form of the fitness advantage function g(k). This uses data on fitness

values and number of acquired mutations, using the definitions given in the previous section.

Since the number of experimental points is low (5 for the E. coli and 21 for the A. baylyi experiment, corresponding to 9 different values of $k$), it is possible to obtain good fits with different functional forms of the advantage $s_0 g(k)$. The results of the fits using the three functional forms considered here (power law, logarithmic and geometric) are shown in the top panel of Fig. 5.
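As an example of the first step, a power law advantage $s_0 k^\alpha$ can be fitted by a straight-line regression of log-advantage on $\log k$. The sketch below is one simple way of doing this and is not necessarily the fitting method used for the results shown here: it assumes unweighted least squares in log space, requires $k \geq 1$ and positive advantages, and is demonstrated on synthetic, noise-free points generated for illustration.

```python
import math

def fit_power_law(ks, advantages):
    """Least-squares fit of log(advantage) = log(s0) + alpha * log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(a) for a in advantages]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    s0 = math.exp(mean_y - alpha * mean_x)
    return s0, alpha

# Synthetic points (not experimental data) drawn from s0 = 0.22, alpha = 0.26.
ks = [1, 2, 5, 10, 20, 45]
advantages = [0.22 * k ** 0.26 for k in ks]
s0_fit, alpha_fit = fit_power_law(ks, advantages)
```

With only a handful of points, as in the experiments considered, other functional forms can fit comparably well, which is why the later steps of the procedure are needed to discriminate between models.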

In a second step of the procedure, simulations are repeated for a wide range of values of $U_b$. A qualitative estimate of this parameter can be obtained using the mean number of mutations at the end of the two experiments (Fig. 4). In our case, $\langle k\rangle = 11$ and $\langle k\rangle = 45$ for the A. baylyi experiment and the E. coli experiment respectively. In the model, for a fixed interval of time steps, the number of accumulated mutations increases significantly and monotonically with $U_b$, so that the number of time steps necessary to reach $\langle k\rangle = 11$ and $\langle k\rangle = 45$ decreases with $U_b$. Thus, for each model there is a single value of the parameter that satisfies $\langle k\rangle(t_{exp}) = \langle k\rangle_{exp}$ (Fig. 4, Fig. 5).
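The matching condition $\langle k\rangle(t_{exp}) = \langle k\rangle_{exp}$ is a one-dimensional root-finding problem in $U_b$. The sketch below shows one possible implementation using bisection on a logarithmic scale. It assumes that the time needed to reach the target number of mutations decreases monotonically with $U_b$; in practice `time_to_reach` would wrap a population simulation, while here a deliberately crude toy stand-in (time = $k/U_b$, the long-time infinite-N speed) is used only to make the example runnable. All names are hypothetical.

```python
import math

def match_ub(time_to_reach, t_exp, ub_low=1e-9, ub_high=1e-1, iterations=100):
    """Find Ub such that time_to_reach(Ub) equals t_exp by geometric bisection.

    Assumes time_to_reach is monotonically decreasing in Ub.
    """
    if not (time_to_reach(ub_low) > t_exp > time_to_reach(ub_high)):
        raise ValueError("t_exp is not bracketed by the Ub search range")
    for _ in range(iterations):
        ub_mid = math.sqrt(ub_low * ub_high)   # midpoint on a log scale
        if time_to_reach(ub_mid) > t_exp:
            ub_low = ub_mid                    # too slow: Ub must be larger
        else:
            ub_high = ub_mid                   # fast enough: Ub can be smaller
    return math.sqrt(ub_low * ub_high)

def toy_time_to_reach(ub, k_target=45):
    """Toy stand-in for a simulation: time to accumulate k_target mutations at speed Ub."""
    return k_target / ub

ub_matched = match_ub(toy_time_to_reach, t_exp=2e4)
```

The value returned for the toy stand-in has no biological meaning; with a real simulation in place of `toy_time_to_reach`, stochastic noise between runs would also have to be averaged out before bisecting.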

Note that referring to the experimental values of k as mean values we are assuming that

the sequenced clones are representative of the populations. For the A. baylyi experiment, the

uncertainty on the number of mutations present in the clones is a relevant source of error. For

the E. coli experiment, the number of mutations is referred to single clones, but the associated

fitness is a population mean.

The procedure described so far yields some estimated values of the beneficial mutation rate. However, these values depend on the advantage model, and vary by up to three orders of magnitude. Nevertheless, the estimates are roughly in the expected biological range ($10^{-8}$ to $10^{-5}$, see refs. [88, 89]).

The third step of the estimate procedure makes it possible to select among the advantage models, by comparing the simulated and experimentally measured dynamics of the fitness and of the number of mutations as a function of time (similarly to ref. [78]). Note that the latter function is not trivially equivalent to the fitness as a function of $k$, but contains the effects of the population dynamics in the presence of clonal interference, for which the model provides a description.

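[Figure 4: three-panel sketch. Top: estimate of the fitness advantage $s_0 g(k)$ from data (advantage versus number of mutations $k$). Middle: simulations with fixed $g(k)$ and varying $U_b$ ($\langle k\rangle$ versus time steps, with the estimated experimental $\langle k\rangle$ marked). Bottom: estimate of $U_b$ using the experimental time (time steps versus beneficial mutation rate $U_b$, with the experimental number of generations marked).]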

Figure 4: Sketch of the parameter matching procedure leading to an estimate of the beneficial mutation rate from experimental fitness data. Top: the experimental data for fitness as a function of the number of mutations (blue symbols) allow the advantage function $s_0 g(k)$ to be estimated from a fit (continuous green line). Middle: simulations are run using the estimated advantage function for different values of $U_b$, which remains undetermined. This leads to different predicted dynamics for the number of acquired mutations (dotted, dashed, and dashed-dotted lines) and the fitness increase in time. These predictions can be matched with the experiment to estimate $U_b$. In particular, the predicted number of time steps necessary to reach the final experimental number of mutations varies with the beneficial mutation rate (filled circles). Bottom: the estimate of $U_b$ is obtained by matching the predicted final time with the experimental one. The best value of $U_b$ (blue star) is obtained when the predicted number of time steps necessary to reach the final number of mutations (determined in the experiment) corresponds to the number of experimental generations (horizontal continuous red line). We applied this procedure to the three model variants for the diminishing return described in the main text.

[Figure 5: panels (a)-(d), each shown for A. baylyi (left) and E. coli (right). (a) advantage versus number of mutations; (b) time steps versus beneficial mutation rate; (c) advantage versus generations; (d) mutations versus generations. Legend: power law, logarithm, geometric, data.]

Figure 5: Estimate of the beneficial mutation rate from different experiments, using data from Jezequel et al. [93] and Barrick et al. [66]. (a) Estimate of the fitness advantage function $s_0 g(k)$ from data of advantage as a function of mutation number (red symbols), as described in the top panel of Fig. 4. The different lines correspond to the three model variants for the diminishing-return advantage (power-law dot-dashed purple lines, logarithmic long-dashed green lines, geometric blue dashed lines, as in legend). For the sake of simplicity, for the A. baylyi experiment (left panel) only the mean value of the advantage is shown for each $k$. Complete data are reported in ref. [93]. (b) Estimates of $U_b$ from the different model variants, obtained by matching the time for which $\langle k\rangle(t_{exp}) = \langle k\rangle_{exp}$, as predicted by simulations with a given $s_0 g(k)$, to the experimental number of generations, as described in the bottom panel of Fig. 4 (the different line styles refer to the model variants as above). For the A. baylyi experiment we considered a total of 2850 generations. (c-d) Comparison of the performance of the three model variants with the estimated values of $U_b$. For the Jezequel et al. data (left panels), the power law and geometric variants are quite close, but the power law model for the diminishing return gives the best agreement, especially considering the data on the number of acquired mutations as a function of time. This choice leads to the estimate $U_b \approx 10^{-7}$ to $10^{-6}$. For the Barrick et al. data, the logarithm and power-law models perform better, and give equivalent estimates of $U_b \approx 10^{-5}$.


A qualitative comparison between data and simulations indicates that for the A. baylyi experiment the power law and geometric advantage models (with parameters $s_0 = 0.42$, $\alpha = 0.19$ and $s_0 = 0.24$, $q = 0.63$ respectively) best describe the increase of the fitness with time (bottom panel of Fig. 5). Comparing the increase of the number of mutations as a function of time in the two models suggests that the power law model better resembles the data. This leads us to prefer the power law model for the comparison with data. With this choice the estimated value of the beneficial mutation rate is $U_b \approx 3\cdot10^{-7}$. In the E. coli experiment, the power law and logarithmic advantage models describe the data best, and are roughly equivalent (with parameters $s_0 = 0.22$, $\alpha = 0.26$ and $s_0 = 0.16$ respectively). Note that the beneficial effect of the first fixed mutation is comparable to the value $s_0 \approx 0.1$ estimated from experiments [68, 98]. The estimated values of the beneficial mutation rate are also essentially equivalent ($U_b \approx 1\cdot10^{-5}$ for the power law advantage model and $U_b \approx 6\cdot10^{-6}$ for the logarithmic advantage model). Note that the data points used in this step are much more abundant, since they include a value of fitness every 1000 generations. A logarithmic increase of the fitness advantage for the E. coli long-term evolution experiment is also suggested by a parallel work [61].

Finally, this simple procedure for comparing the model with data should not be affected by the loss of the self-averaging property found in the model, which becomes relevant in a later regime. For example, assuming the estimated parameters for the E. coli long-term evolution experiment, and using the model to explore the relative variability of the speed over realizations, $\sigma(v_k)/v_k$, at $t = 2\cdot10^4$ generations one gets an error of the order of 2%. This value is still relatively small. Considering the much larger number of accumulated mutations $k \approx 10^3$, corresponding to $t \approx 2\cdot10^6$, the speed of evolution is $\approx 3\cdot10^{-4}$ ($\gg U_b$), and its relative variability over the realizations is $\sigma(v_k)/v_k \approx 8\%$, which would still be under control.


Discussion and conclusions

Different laboratory evolution experiments show a decrease of the fitness advantage due

to newly acquired mutations and a decrease of the speed of evolution [65, 78, 99]. We considered

a simplified model, using a minimal number of parameters, which is a direct generalization of

the multiple-mutations model with constant advantage, but describes this feature in terms of

diminishing returns [78]. Specifically, it is assumed that the selective advantage of all individuals

having k beneficial mutations is identical, but decreases with k. We verified analytically and

with simulations that, in the infinite population limit, considering phenotypic fluctuations does

not affect our model (see Sect. A3).

We have shown that the basic phenomenology of the model entails a sublinear increase of the mean number of fixed mutations and an even slower, more strongly sublinear increase of the mean advantage. This is in qualitative agreement with previous results using a similar model applicable in a regime where concurrent mutations do not occur [78].

The evolutionary speeds of mutation accumulation and advantage are related to the width

of the distribution of coexisting advantage classes. We showed how a theoretical infinite-N

argument produces a relation between the speed of fixed mutations vk and the second moment of

the histogram of mutation classes, confirmed by simulations. Interestingly, simulations indicate

that for any finite N , different model realizations behave increasingly differently with time in

terms of both vk and width of the mutation class histogram. This non-self-averaging property

implies that even at intermediate times, the behavior of a realization can be quite different from

the average. However, as we have seen, this effect appears to become relevant in the model on

time scales longer than the experimental times considered here.

Finally, for finite population size, we were able to define through analytical arguments the

regime where the stochastic edge estimate of the adaptation and mutation speeds can be extended to the case of diminishing returns, provided that the advantage $s$ is substituted with the appropriate function $s_0 g'(\langle k \rangle)$.

While more complex and realistic descriptions exist (see ref. [100], which explicitly incorporates genetically linked multiple mutations), the advantage of this approach is that the model depends on few parameters. We performed a numerical experiment that gives a qualitative idea of how the model compares with data and allows rough estimates. We considered two different experimental data sets from long-term evolution experiments, and defined a simple procedure to compare model with data. Assuming the model, this procedure yields an order-of-magnitude estimate of the beneficial mutation rate $U_b$. The values obtained for the beneficial mutation rate fall within the range of the available measurements [88, 89]: on the order of $10^{-6}$ to $10^{-5}$ mutations per genome per generation for the E. coli long-term evolution experiment, and between $10^{-7}$ and $10^{-6}$ in the case of A. baylyi.

As previously mentioned, the constant-advantage model can be seen as an effective description

of a multiple-mutation framework with a distribution of advantages [77], provided an effective

advantage is used. This description also produces an effective beneficial mutation rate that, in

a diminishing return framework, would vary with time, and thus would need an extension of

the present model to be fully implemented.

It is possible to give a rough but quantitative estimate of the underlying beneficial mutation

rate in the simple case of an exponential distribution of the advantages using the rescaling

procedure proposed by Good and coworkers. This estimate (Appendix, Sec. A4) indicates that

the rescaling should not affect the current order-of-magnitude estimates.

A very recent work [98] examines fitness trajectories of the long-term evolution experiment up to 50k generations and matches them with a theoretical argument based on epistasis, but which neglects multiple mutations [69]. Their results are completely in line with ours: power-law behavior of the fitness and an estimated beneficial mutation rate of $\approx 10^{-6}$ or higher. Comparing the model to the first 20k generations of the new dataset (where both mutation data and fitness are available) gives $\alpha = 0.27$, which is in line with our previous estimate, and provides a measure of the intensity of the epistatic interactions. In their work, Wiser and coworkers [98] introduce epistasis using a parameter $g$ to express the decrease of the expected advantage of new mutations. They find that fitness, as a function of time, is proportional to $t^{1/(2g)}$. In order to obtain an equivalent expression for $w(t)$ in our model we can relate $k$ and $t$ through the establishment time (see Appendix Sec. A5), obtaining the approximate relation $w(k) = s_0 k^{\alpha} \propto s_0 t^{\alpha/(2-\alpha)}$. Comparing the two expressions allows us to map the parameters of the two models: $g = (2-\alpha)/(2\alpha)$. Substituting the estimated value for $\alpha$ gives $g \approx 3.2$, which is close to the range of values $g \approx 4$ to $9$ derived by Wiser and coworkers, and in particular similar to the value $g \approx 4$ obtained for the Ara-1 population analyzed here.
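The parameter mapping can be checked with a few lines of arithmetic. The snippet below is only an illustrative sketch: the function names `g_from_alpha` and `fitness_exponent` are hypothetical, and the code simply assumes the relation $g = (2-\alpha)/(2\alpha)$ and the exponent $\alpha/(2-\alpha)$ quoted in the text.

```python
def g_from_alpha(alpha):
    """Map the power-law exponent alpha to the epistasis parameter g,
    assuming g = (2 - alpha) / (2 * alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (2.0 - alpha) / (2.0 * alpha)


def fitness_exponent(alpha):
    """Exponent of the power law w ~ t**(alpha / (2 - alpha))."""
    return alpha / (2.0 - alpha)


alpha = 0.27                      # value estimated in the text
g = g_from_alpha(alpha)
print(round(g, 2))                # 3.2
# By construction, 1/(2g) coincides with the fitness exponent alpha/(2 - alpha).
print(abs(1.0 / (2.0 * g) - fitness_exponent(alpha)) < 1e-12)
```

The second printed check confirms that the two parametrizations describe the same power law, so comparing $g$ values is equivalent to comparing $\alpha$ values.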

Moreover our estimate of the epistatic interaction can be compared to the measurements

obtained in ref. [80] on the first five mutations acquired during the E. coli long-term evolution

experiment. A qualitative comparison shows accordance with our results. However, each single

mutation seems to have a specific value of s0, which would require a more complex modelling

approach.

The constant-advantage multiple-mutations model has also previously been applied to short-

term laboratory evolution experiments [79]. In those early stages, adaptation does not slow

down, and the assumption of constant advantage is justified. We can compare our results

to those obtained with the same procedure, using a constant advantage function g(k) = s0k,

applied to the increase in fitness during the late stages of both the experiments. We have

performed this test considering different “starting points”, i.e. initial generation in the empirical

data. The order-of-magnitude values of Ub obtained with the procedure are similar to the ones

quoted above. However, one is forced to discard the information about the initial mutations,

and, as can be expected, the outcome for Ub depends on the chosen starting point. We found

that it could vary by almost an order of magnitude for time intervals that appeared equally

reasonable to fit with a constant advantage model. By contrast, the diminishing return

model allows the use of data from all mutations, and does not leave this freedom. Additionally, it

includes the early mutations, where s varies much more, and presumably the relative accuracy

in its experimental measurement is higher.

Other scenarios have been proposed and possibly co-occur with diminishing returns epista-

sis [78, 80, 81], and, accordingly, different models have been formulated in this context. For

example, the speed of evolution could decrease because beneficial mutations with larger advan-

tage fix sooner in the population [100] and because the mutation rate or the number of possible

beneficial mutations decreases with time [90, 91].

We tested that, for a fixed advantage $s(k) = s_0 k$, a decrease of the beneficial mutation rate leads to a slowdown in adaptation. Applying a procedure similar to the one described in Sec. 2.3, we were able to estimate the beneficial mutation rate for the E. coli experiment using a power-law and an exponential model for the decrease of $U_b$. According to our results, during the first $5 \cdot 10^4$ generations of the E. coli experiment, $U_b$ varies in the reasonable range $10^{-6}$ to $10^{-5}$ for both models. Interestingly, we observed a decrease in the variability of the population in terms of mutation classes. The intuitive reasoning behind this phenomenon is that, as $U_b$ decreases, the edge of the distribution becomes less likely to gain new mutations, even if it reaches the establishment size, and the population tends to collapse into a single genotype. Thus, it is possible to distinguish between the diminishing return model and the diminishing beneficial mutation rate model using this parameter. Since experiments suggest an increase of the number of subpopulations in time, the diminishing return model seems to describe the experimental dynamics of evolution better.

The two models can be easily “collapsed” and the detailed dynamics depends on the chosen

parameters.

The different explanations proposed for the slowdown of adaptation are not necessarily mutually exclusive, and could be stratified in actual laboratory evolution experiments [67]. It is currently unclear whether the available experimental observables allow one to establish the relative weights of these distinct phenomena, or precisely which different experimental measurements would; we believe that simple and possibly falsifiable models could help explore these questions.

Appendix - Part II

A1 Self-consistency scaling argument estimating the adaptation speed.

This section discusses the generalization of the self-consistency considerations used to estimate the adaptation speed from the standard multiple-mutations model [75, 73, 76], applied to the case of diminishing returns with a power-law increase of fitness.

For a power-law model, the advantage of the edge with respect to the mean can be expressed as $s_{\mathrm{edge}} \approx s_0 \alpha L_k k^{\alpha-1}$.

Imposing the condition that one new mutation class is established at the edge of the histogram gives

$$1 = \int_0^{\tau_k} dt \left( \frac{U_b}{2 s_0 \alpha L_k k^{\alpha-1}}\, e^{s_0 \alpha (L_k - 1) k^{\alpha-1} t} \right) \left( 2 s_0 \alpha L_{k+1} (k+1)^{\alpha-1} \right). \tag{S13}$$

The last term in this integral is the establishment probability of a new fittest class, while the first is the rate of beneficial mutations from the previous fitness class, which is born with size $1/(2 s_0 \alpha L_k k^{\alpha-1})$ and grows exponentially.

An estimate of $\tau_k$ can be obtained by integrating the above expression and using the approximation $L_k \approx L_{k+1}$. Under the assumption that $k$ is not too large (i.e. the advantage of the fittest class is sufficiently high), the contribution of the integration boundary $t = 0$ can be neglected, giving

$$\tau_k = \frac{1}{s_0 \alpha (L_k - 1) k^{\alpha-1}} \log\left( \frac{s_0 \alpha (L_k - 1) k^{2(\alpha-1)}}{U_b (k+1)^{\alpha-1}} \right). \tag{S14}$$

The above expression can be further simplified assuming that both $k$ and $L_k$ are large enough so that $k + 1 \simeq k$ and $L_k - 1 \simeq L_k$, and neglecting the logarithmic term in $L_k$, leading to

$$\tau_k = \frac{1}{s_0 \alpha L_k k^{\alpha-1}} \log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right). \tag{S15}$$

This indicates that for intermediate $k$, the estimate of the standard multiple-mutations model is valid provided the diminishing return advantage function $s_0 k^{\alpha}$ is substituted for the constant advantage. For sufficiently large $k$, expansion of the exponential in Eq. (S13) gives $\tau_k \sim 1/U_b$, consistent with the infinite-N result.

The time $\tau_k$ is related to the instantaneous speed of the mutation class histogram $v_k$. The speed $v_s$ can be obtained knowing that during the time the mutation class histogram travels by one class, the advantage histogram has to move by the relative fitness between the newly added fittest class and the previous one, $s_0 \alpha k^{\alpha-1}$. Using the simplified Eq. (S15) (which neglects logarithmic terms in $L_k$), one obtains the speed

$$v_k = \frac{s_0 \alpha L_k k^{\alpha-1}}{\log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right)}; \qquad v_s = \frac{s_0 \alpha k^{\alpha-1}}{\tau_k}. \tag{S16}$$

The second part of the estimate involves the normalization condition. Assuming that the largest term of the histogram dominates, we need to evaluate the time $\tau'_k$ necessary for the fittest class to become the class with mean advantage, whose size is of order $N/2$. If the fittest class has $k$ mutations, its establishment size is $1/(2 s_0 \alpha L_k k^{\alpha-1})$. Its growth will be roughly exponential, with a rate that decreases while it gets closer to the mean. We estimate its growth by its mean growth rate during the time $\tau'_k$. Immediately after establishment, its relative growth rate will be $s_0 \alpha L_k k^{\alpha-1}$, while its rate will tend to zero when it gets close to the mean. Thus, on average, we can assume that it grows exponentially with rate $s_0 \alpha L_k k^{\alpha-1}/2$.

This argument leads to the equation

$$N/2 \approx \frac{1}{2 s_0 \alpha L_k k^{\alpha-1}}\, e^{\frac{s_0 \alpha L_k k^{\alpha-1}}{2} \tau'_k} \tag{S17}$$

which implies

$$\tau'_k = \frac{2}{s_0 \alpha L_k k^{\alpha-1}} \log\left( N s_0 \alpha L_k k^{\alpha-1} \right). \tag{S18}$$

In order to estimate $v_s$, we need to determine how much the histogram of fitness advantage has progressed during the time $\tau'_k$ from the establishment of the $k$-th mutation class. We assume that during this time $L_k$ is roughly constant, so that after time $\tau'_k$, $k + L_k$ mutations are established, and the advantage of the edge has reached $s_0 \alpha L_k (k + L_k)^{\alpha-1}$.

This allows us to estimate $v_s$ as the advantage gained divided by the time $\tau'_k$, i.e.

$$v_s = \frac{(s_0 \alpha L_k)^2}{2}\, \frac{k^{\alpha-1} (k + L_k)^{\alpha-1}}{\log\left( N s_0 \alpha L_k k^{\alpha-1} \right)}. \tag{S19}$$

Eqs. (S16) and (S19) together allow us to determine $L_k$, which can subsequently be used to obtain the speed of adaptation $v_s$, or of fixed mutations $v_k$. Assuming that $k \gg L_k$ and neglecting the logarithmic corrections in $L_k$, as in Eq. (S19), we can obtain the following closed expression for $L_k$,

$$L_k = \frac{2 \log\left( N s_0 \alpha k^{\alpha-1} \right)}{\log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right)}, \tag{S20}$$

which is Eq. (2.10) of the main text.
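The step leading to this closed form can be spelled out as a short worked derivation. It is a sketch under the approximations stated above: $k \gg L_k$, so that $(k + L_k)^{\alpha-1} \approx k^{\alpha-1}$, and the factor $L_k$ inside the logarithm of Eq. (S19) is dropped. The shorthand $\tilde{s}_k$ is introduced here only for compactness and does not appear in the original derivation.

```latex
\begin{align*}
\tilde{s}_k &\equiv s_0 \alpha k^{\alpha-1}, \\
v_s &= \tilde{s}_k\, v_k = \frac{\tilde{s}_k^{\,2}\, L_k}{\log(\tilde{s}_k / U_b)}
  && \text{from Eq. (S16)}, \\
v_s &\approx \frac{\tilde{s}_k^{\,2}\, L_k^{2}}{2 \log(N \tilde{s}_k)}
  && \text{from Eq. (S19)}, \\
\Rightarrow \quad L_k &\approx \frac{2 \log(N \tilde{s}_k)}{\log(\tilde{s}_k / U_b)} .
\end{align*}
```

Equating the two expressions for $v_s$ cancels the common factor $\tilde{s}_k^{\,2} L_k$, which is why $L_k$ depends on $k$ only through the two logarithms.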

Finally, substituting $L_k$ in the expression of $v_s$ gives an explicit expression for the velocity,

$$v_s = \frac{2 s_0^2 \alpha^2 k^{2(\alpha-1)} \log\left( N s_0 \alpha k^{\alpha-1} \right)}{\left[ \log\left( \frac{s_0 \alpha k^{\alpha-1}}{U_b} \right) \right]^2} \tag{S21}$$

which, compared to simulations, in general works rather well, despite the approximations made.
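The chain of estimates in this section (width, establishment time, speeds) can be evaluated numerically. The following is a minimal sketch, not the code used for the thesis: the function names are hypothetical, the width is taken from the closed form of Eq. (S20), the establishment time from Eq. (S15), and the two speeds from the relations $v_k = 1/\tau_k$ and $v_s = s_0 \alpha k^{\alpha-1}/\tau_k$ of Eq. (S16). The parameter values in the usage lines are those quoted for Fig. S6 and serve only as an example.

```python
import math


def unit_advantage(k, s0, alpha):
    """Advantage gained by one extra mutation at class k: s0 * alpha * k**(alpha - 1)."""
    return s0 * alpha * k ** (alpha - 1.0)


def width_closed_form(k, N, s0, alpha, Ub):
    """Width L_k of the mutation-class histogram, closed form of Eq. (S20)."""
    s = unit_advantage(k, s0, alpha)
    if s <= Ub or N * s <= 1.0:
        raise ValueError("estimate requires s > Ub and N * s > 1")
    return 2.0 * math.log(N * s) / math.log(s / Ub)


def establishment_time(k, N, s0, alpha, Ub):
    """Simplified establishment time tau_k of Eq. (S15), with L_k from Eq. (S20)."""
    s = unit_advantage(k, s0, alpha)
    L = width_closed_form(k, N, s0, alpha, Ub)
    return math.log(s / Ub) / (s * L)


def speeds(k, N, s0, alpha, Ub):
    """Return (v_k, v_s): speed of fixed mutations and speed of adaptation, Eq. (S16)."""
    tau = establishment_time(k, N, s0, alpha, Ub)
    v_k = 1.0 / tau
    v_s = unit_advantage(k, s0, alpha) * v_k
    return v_k, v_s


# Example usage with illustrative parameters (those quoted for Fig. S6).
v_k, v_s = speeds(k=10, N=1e9, s0=0.1, alpha=0.2, Ub=2e-6)
print(v_k, v_s)
```

For $\alpha = 1$ the unit advantage reduces to $s_0$ and the same functions return the constant-advantage estimate $v_s = 2 s_0^2 \log(N s_0)/\log^2(s_0/U_b)$, which is a convenient sanity check.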

More generally, from the estimated establishment times in Eqs. (S14) and (S18), taking into account the increase in fitness as described in the main text, it is possible to derive two expressions for the speed of adaptation $v_s$ as a function of $L_k$. Equating these two expressions generates an implicit estimate of the width $L_k$,

$$L_k = \frac{2 \log\left( N s_0 \alpha L_k k^{\alpha-1} \right)}{\log\left( \frac{s_0 \alpha (L_k - 1) k^{\alpha-1}}{U_b} \right)}\, \frac{L_k - 1}{L_k}. \tag{S22}$$

This expression gives a more precise estimate of Lk and vs, even in the case Lk is very small.
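Because Eq. (S22) is implicit in $L_k$, it has to be solved numerically. One simple possibility, shown here purely as an illustrative sketch, is a fixed-point iteration started from the closed form of Eq. (S20). The function names are hypothetical, the iteration is not guaranteed to converge for every parameter choice, and nothing in the source states which numerical method was actually used.

```python
import math


def unit_advantage(k, s0, alpha):
    """s0 * alpha * k**(alpha - 1), the advantage of one extra mutation at class k."""
    return s0 * alpha * k ** (alpha - 1.0)


def width_closed_form(k, N, s0, alpha, Ub):
    """Closed-form width of Eq. (S20), used as the starting guess."""
    s = unit_advantage(k, s0, alpha)
    return 2.0 * math.log(N * s) / math.log(s / Ub)


def width_rhs(L, k, N, s0, alpha, Ub):
    """Right-hand side of the implicit Eq. (S22) evaluated at a trial width L."""
    s = unit_advantage(k, s0, alpha)
    arg = s * (L - 1.0) / Ub
    if L <= 1.0 or arg <= 1.0:
        raise ValueError("L is outside the range where Eq. (S22) is defined")
    return 2.0 * math.log(N * s * L) / math.log(arg) * (L - 1.0) / L


def solve_width(k, N, s0, alpha, Ub, tol=1e-10, max_iter=500):
    """Fixed-point iteration L <- rhs(L), starting from the closed form."""
    L = width_closed_form(k, N, s0, alpha, Ub)
    for _ in range(max_iter):
        L_new = width_rhs(L, k, N, s0, alpha, Ub)
        if abs(L_new - L) < tol:
            return L_new
        L = L_new
    raise RuntimeError("fixed-point iteration did not converge")


# Example usage with illustrative parameters (those quoted for Fig. S6).
params = dict(k=10, N=1e9, s0=0.1, alpha=0.2, Ub=2e-6)
print(width_closed_form(**params), solve_width(**params))
```

For these example parameters the implicit solution is noticeably smaller than the closed form, which illustrates the size of the correction carried by the factor $(L_k - 1)/L_k$ when the width is only a few classes.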

Note that for $\alpha = 1$, the power-law return model reduces to the particular case of absence of epistasis (i.e. $g'(k) = 1$, hence $w_k = e^{k s_0}$) [101, 75].

A2 Simulation algorithm and effective parameters

Simulations used the algorithm of Park and Krug, as described in refs. [85, 73]. The simulation scheme is sketched in Fig. S3. In a typical initial configuration, all clones have $k = 0$ mutations. At each subsequent time step, the progeny of the individuals of all fitness classes are sampled from a multinomial distribution of parameters $\{p(k, t+1)\}_{k \in [k_{\min}, k_{\max}]}$. The parameters $p(k, t+1)$ take into account the frequency $f(k, t)$ of the class at the former step, the relative fitness $\chi_k$ computed at time $t$ and the contribution of beneficial mutations arising from the preceding class. Specifically

$$p(k, t+1) = (1 - U_b)\, f(k, t)\, \chi_k(t) + U_b\, f(k-1, t)\, \chi_{k-1}(t) \tag{S23}$$

(see also Eq. 2 of the main text). Together, these definitions are equivalent to a Wright-Fisher model with separate selection and mutation steps.

The multinomial random numbers are generated by iteratively drawing binomial random numbers with parameters $q(k, t+1) = p(k, t+1)/\sum_{k} p(k, t+1)$, starting from $k_{\max}$.
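One generation of this scheme can be sketched in a few lines. This is a hedged illustration rather than the simulation code of the thesis. All function names are hypothetical. The fitness $w_k = e^{s_0 g(k)}$ with $g(k) = k^{\alpha}$ is the power-law choice and is assumed here. The conditional probability used in each binomial draw (the class probability divided by the probability mass of the classes not yet drawn) is one reading of the sequential procedure. The naive binomial sampler only works for small populations, whereas a real simulation at $N \approx 10^9$ would need a library binomial generator.

```python
import math
import random


def class_probabilities(freqs, k_min, s0, alpha, Ub):
    """Eq. (S23): selection then mutation. freqs[i] is f(k_min + i, t).
    Returns p(k, t+1) for classes k_min .. k_max + 1."""
    # Assumed fitness: w_k = exp(s0 * k**alpha).
    w = [math.exp(s0 * (k_min + i) ** alpha) for i in range(len(freqs))]
    mean_w = sum(f * wi for f, wi in zip(freqs, w))
    selected = [f * wi / mean_w for f, wi in zip(freqs, w)]  # f(k,t) * chi_k(t)
    probs = []
    for i in range(len(freqs) + 1):
        stay = (1.0 - Ub) * selected[i] if i < len(freqs) else 0.0
        gain = Ub * selected[i - 1] if i > 0 else 0.0
        probs.append(stay + gain)
    return probs


def binomial(n, p, rng):
    """Naive binomial draw by summing Bernoulli trials (small n only)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def sample_multinomial(N, probs, rng):
    """Multinomial draw from sequential binomial draws, starting from k_max."""
    counts = [0] * len(probs)
    remaining_n, remaining_p = N, sum(probs)
    for i in range(len(probs) - 1, 0, -1):
        if remaining_n == 0:
            break
        q = min(1.0, max(0.0, probs[i] / remaining_p)) if remaining_p > 0 else 0.0
        counts[i] = binomial(remaining_n, q, rng)
        remaining_n -= counts[i]
        remaining_p -= probs[i]
    counts[0] = remaining_n  # the lowest class receives whatever is left
    return counts


def wright_fisher_step(counts, k_min, s0, alpha, Ub, rng):
    """Advance the class counts by one generation at fixed population size."""
    N = sum(counts)
    probs = class_probabilities([c / N for c in counts], k_min, s0, alpha, Ub)
    new_counts = sample_multinomial(N, probs, rng)
    while len(new_counts) > 1 and new_counts[-1] == 0:
        new_counts.pop()  # drop empty classes above the current edge
    return new_counts


rng = random.Random(1)
counts = [300]  # all clones start with k = 0 mutations
for _ in range(20):
    counts = wright_fisher_step(counts, 0, 0.1, 0.5, 1e-2, rng)
print(counts)
```

Drawing the classes one at a time keeps the population size exactly fixed, because each binomial draw only distributes the individuals that have not yet been assigned.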

The model defined above is invariant by suitable rescaling of time, provided the other model

parameters are also rescaled correctly. This feature is useful in comparison with experiments

(see following) in order to understand the correspondence between time steps in the model and

a generation in the experiment. It can also be useful in increasing the efficiency of simulations.

Suppose one wants to map a reference empirical time into the model time. This requires the rescaling $t_{\mathrm{model}} = r\, t_{\mathrm{emp}}$.

A simple choice is $r = 1$, implying a one-to-one correspondence between empirical generations and time steps in the model. Since an established clone with advantage $s$ grows as $e^{st}$, the advantage is proportional to the time scale, $s_{\mathrm{model}} = r\, s_{\mathrm{emp}}$ (this means that the value of the fitness in the model depends exponentially on $r$). Similarly, since the beneficial mutation rate is defined as the number of expected mutations per genome per generation, the map between time steps and empirical generations implies $U_{b,\mathrm{emp}} = U_{b,\mathrm{model}}/r$. Finally, the correct rescaled population size is $N_{\mathrm{model}} = N_{\mathrm{emp}}/r$. In practice $N$ can be rather large in a typical experiment (e.g. $\approx 10^9$), so that the rescaling of population size does not much affect the dynamics provided $r$ is not too large. The rescaling of the parameters described above can easily be rationalized keeping in mind that the basic time scales of the model are set by the products $sN$ and $NU_b$. In order to verify that the invariance discussed above is valid for our model, where the advantage $s$ varies with the number of mutations as $s_0 g(k)$, we ran some simulations choosing the model parameters ($N$, $s_0$, $U_b$) using different maps between the time units (see Fig. S6). The results indicate that for any practical purpose, the invariance is effective.
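The parameter maps quoted above can be written down directly, together with a check that the products $sN$ and $NU_b$ are left unchanged. The helper name `rescale_parameters` and the dictionary layout are assumptions made for this sketch. The empirical values are the ones quoted for Fig. S6 and are used only as an example.

```python
def rescale_parameters(N_emp, s_emp, Ub_emp, r):
    """Map empirical parameters to model parameters for a rescaling factor r,
    using s_model = r * s_emp, Ub_model = r * Ub_emp and N_model = N_emp / r."""
    if r <= 0:
        raise ValueError("r must be positive")
    return {"N": N_emp / r, "s": r * s_emp, "Ub": r * Ub_emp}


empirical = {"N": 1e9, "s": 0.1, "Ub": 2e-6}
for r in (1, 10, 100):
    model = rescale_parameters(empirical["N"], empirical["s"], empirical["Ub"], r)
    # The products s*N and N*Ub, which set the basic time scales, are unchanged.
    print(r, model["N"] * model["s"], model["N"] * model["Ub"])
```

The invariance of the two products is exact by construction. What the simulations of Fig. S6 test is whether the full dynamics, with an advantage that varies with $k$, is also effectively unchanged.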

A3 Phenotypic variability

The population genetics model presented in the main text works under the hypothesis that there is a one-to-one correspondence between the genotype (essentially $k$) and the phenotype (represented by the fitness and the number of offspring of an individual). However, as already mentioned, the “reproductive success” of an individual is a function of both genotype and environment. Moreover, even when we consider asexual populations evolving under controlled conditions, some variables will vary to some extent from one cell to another (e.g. the number or concentration of given proteins) [102, 103].

We can introduce phenotypic fluctuations in the framework of our model by expressing the fitness as

$$w_{\mathrm{tot}} = x_{\mathrm{gen}} \cdot y_{\mathrm{phen}}, \tag{S24}$$

where $x$ and $y$ indicate the two components (genotype and phenotype). Note that the phenotypic component can be a function of the genotype (e.g. a certain genotype could have lower or higher variability in the outcome compared to another one).

We define the advantage of the two fitness components as $s(k)$ and $p_k(l)$, where $l$ is the index of the “phenotypic class”, and $p_k(l)$ is the contribution to the fitness due to class $l$ given the genotype $k$. In general, a total value of fitness $w_{\mathrm{tot}}$ can be obtained using more than one combination of $k$ and $l_k$.

Considering the phenotypic term as a stochastic fluctuation, we can assume that $l_k$ is not inheritable ‡. If each individual of the class $k$ chooses a class $l_k$ independently from its ancestor, the model dynamics is essentially unaffected by phenotype. This is easily verified in the infinite population limit.

‡ This seems not to be completely true [103].

We define the fraction of individuals in the mutation class $k$ as $f_k$, and $f_{l_k}$ as the fraction of them in the phenotypic class $l_k$. The fraction of individuals with a given genotype and a given phenotype over the whole population is $f_{k,l'_k} = f_k f_{l'_k}$.

As a consequence, the mean fitness of the population is:

$$w_t = \sum_k \sum_{l'_k} \left( f_{k,l'_k}(t)\, w_{(k,l'_k)} \right) = \sum_k \sum_{l'_k} \left( f_k(t)\, f_{l'_k}(t)\, x_k\, y_{l'_k} \right) = \sum_k \left( f_k(t)\, w_k(t) \right), \tag{S25}$$

where $w_k(t) = x_k(t)\, y_k(t)$ is the mean fitness of all the individuals with genotype $k$. Note that the quantities $y_k(t)$ and $f_{l'_k}(t)$ are stochastic variables: they change in time but are not properly a function of time.

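In the infinite population limit we can consider $y_k(t) = y_k$ and the equation for the evolution of the frequency distribution can be rewritten as

$$f_{k,l_k}(t+1) = f_{l_k} \left[ \sum_{l'_k} \left( f_{k,l'_k}(t) \frac{w_{(k,l'_k)}}{w_t} \right) (1 - U_b) + \sum_{l'_{k-1}} \left( f_{k-1,l'_{k-1}}(t) \frac{w_{(k-1,l'_{k-1})}}{w_t} \right) U_b \right] = f_{l_k} \left[ f_k(t)\, w_k \frac{1}{w_t} (1 - U_b) + f_{k-1}(t)\, w_{k-1} \frac{1}{w_t} U_b \right].$$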

Essentially, we recover the standard result, where the fitness of a genotype is the average

fitness wk. Simulations confirm that this intuitive result holds also at finite population size

(data not shown).
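The infinite-population statement can be verified numerically for a small example. The sketch below makes several simplifying assumptions that are not in the original text: phenotypic classes are drawn from the same distribution `p_l` for every genotype, the fitness is the product $x_k y_l$, and the highest genotype class simply loses its mutants. All function names are hypothetical. Under these assumptions, summing the joint update over phenotypic classes reproduces the standard update with $w_k = x_k \langle y \rangle$.

```python
def update_with_phenotype(f_k, x, y, p_l, Ub):
    """Deterministic update of the joint frequencies f_{k,l}, with a
    non-inheritable phenotype class l redrawn with probabilities p_l."""
    n_k, n_l = len(f_k), len(y)
    mean_w = sum(f_k[k] * p_l[l] * x[k] * y[l]
                 for k in range(n_k) for l in range(n_l))
    new = []
    for k in range(n_k):
        stay = sum(f_k[k] * p_l[l] * x[k] * y[l] for l in range(n_l)) / mean_w
        gain = 0.0
        if k > 0:
            gain = sum(f_k[k - 1] * p_l[l] * x[k - 1] * y[l]
                       for l in range(n_l)) / mean_w
        total = (1.0 - Ub) * stay + Ub * gain
        # Offspring pick their phenotype class independently of the parent.
        new.append([p_l[l] * total for l in range(n_l)])
    return new


def update_standard(f_k, w, Ub):
    """Standard deterministic update using the mean genotype fitness w_k."""
    mean_w = sum(f * wk for f, wk in zip(f_k, w))
    new = []
    for k in range(len(f_k)):
        gain = f_k[k - 1] * w[k - 1] / mean_w if k > 0 else 0.0
        new.append((1.0 - Ub) * f_k[k] * w[k] / mean_w + Ub * gain)
    return new


f_k = [0.2, 0.5, 0.3]              # genotype frequencies
x = [1.0, 1.1, 1.2]                # genotypic fitness components
y = [0.8, 1.0, 1.3]                # phenotypic fitness components
p_l = [0.3, 0.4, 0.3]              # phenotype class probabilities
y_mean = sum(p * yy for p, yy in zip(p_l, y))

joint = update_with_phenotype(f_k, x, y, p_l, Ub=0.01)
marginal = [sum(row) for row in joint]
standard = update_standard(f_k, [xk * y_mean for xk in x], Ub=0.01)
print(all(abs(a - b) < 1e-12 for a, b in zip(marginal, standard)))
```

The common factor $\langle y \rangle$ cancels between numerator and mean fitness, which is why the phenotypic component leaves the genotype dynamics unchanged in this limit.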

A4 Estimate of the effective beneficial mutation rate

A recent study has shown how the constant-advantage model can be seen as an effective

description of a multiple-mutation framework with a distribution of advantages [77]. In princi-

ple, this procedure gives an underlying beneficial mutation rate Ub from the beneficial mutation

rate of the effective theory. The effective selection coefficient coincides with the advantage of

the most probable fixed mutation, and the effective mutation rate has to be rescaled by the

probability of observing that mutation under the original distribution. Therefore, the rescaling

factor (and the mutation advantage in the effective theory) is dependent on parameters related to the distribution of fitness effects. In a diminishing return scenario, this rescaling factor becomes time dependent. However, the assumptions made on the fitness advantage distribution

and on the effects that negative epistasis has on this distribution are relevant for this rescaling

procedure. Therefore, a precise estimate of the underlying beneficial mutation rate requires

several additional assumptions about the process, which need to be carefully considered and

verified with data.

Assuming for simplicity an exponential distribution $\rho(s) = \frac{1}{\sigma} e^{-s/\sigma}$ of fitness effects, it is possible to give a quantitative estimate of the underlying beneficial mutation rate. The relation between the effective beneficial mutation rate $U_{\mathrm{eff}}$ and the underlying rate $U_{\mathrm{real}}$ is

$$U_{\mathrm{real}} = \frac{U_{\mathrm{eff}}}{\sqrt{2\pi v_s}\, \rho(s^*)}, \tag{S26}$$

as given in Eq. 23 of Good et al. [77], where $s^*$ is the constant advantage used in the effective theory. In the diminishing return framework, the typical mutation advantage decreases with the number of accumulated mutations, $s^* = \delta s(k)$, and the speed of adaptation $v_s$ is slowing down as $v_s \propto (\delta s(k))^2\, \mathrm{Var}(k)$ (using the mean-field estimate described in the main text). Assuming that the effect of epistasis on the advantage distribution is such that the mean advantage $\sigma$ scales as the effective advantage with the number of mutations (i.e., $\sigma \propto \delta s(k)$), we have $\rho(s^*) \propto 1/s^* = 1/\delta s(k)$. In this case, the underlying beneficial mutation rate is simply $U_{\mathrm{real}} \approx U_{\mathrm{eff}}/\sqrt{\mathrm{Var}(k)}$. In other words, an estimated $U_{\mathrm{eff}}$ can be mapped to a time-dependent $U_{\mathrm{real}}$ with a scaling factor inversely proportional to the width of the distribution of mutation classes ($\sqrt{\mathrm{Var}(k)} \sim L_k$). On the experimental time scales, this procedure leads to small corrections to the mutation rate, so that the order-of-magnitude estimates of $U_b$ can still be considered acceptable.
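For concreteness, the rescaling can be evaluated for an exponential distribution of fitness effects. The snippet is a sketch under explicit assumptions: the square root in Eq. (S26) is read as covering the product $2\pi v_s$ (the reading consistent with the scaling $U_{\mathrm{real}} \approx U_{\mathrm{eff}}/\sqrt{\mathrm{Var}(k)}$), the function names are hypothetical, and the numerical inputs are arbitrary placeholders rather than fitted values.

```python
import math


def exponential_dfe(s, sigma):
    """Exponential distribution of fitness effects, rho(s) = exp(-s / sigma) / sigma."""
    return math.exp(-s / sigma) / sigma


def underlying_mutation_rate(U_eff, v_s, s_star, sigma):
    """U_real = U_eff / (sqrt(2 * pi * v_s) * rho(s_star)), one reading of Eq. (S26)."""
    return U_eff / (math.sqrt(2.0 * math.pi * v_s) * exponential_dfe(s_star, sigma))


# Placeholder inputs, chosen only to show how the function is called.
print(underlying_mutation_rate(U_eff=1e-6, v_s=1e-4, s_star=0.02, sigma=0.02))
```

With $s^* = \sigma$, as in this example, the density at $s^*$ is $e^{-1}/\sigma$, so the correction factor depends only on the ratio $\sqrt{v_s}/\sigma$.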

A5 Estimate of mean fitness as a function of time

In their work, Wiser and coworkers [98] show that mean fitness depends on time as a power law, with exponent $1/(2g)$, where $g$ can be obtained for all the available data or from single populations. To relate these findings to the present work one has to express fitness as a function of time. The $k$-th mutation typically appears at time $t_k$, which is the sum of the establishment times, so that $w(k) = w(t_k)$. Thus, time can be related to $k$ through

$$t_k = \sum_{k'=1}^{k} \tau_{k'} \approx \sum_{k'=1}^{k} \frac{\log^2\left( s_0 \alpha k'^{(\alpha-1)} U_b^{-1} \right)}{2 s_0 \alpha k'^{(\alpha-1)} \log\left( N s_0 \alpha k'^{(\alpha-1)} \right)}, \tag{S27}$$

where we used (S15) and (S20) to express the establishment time. This expression can be estimated as an integral, assuming that the dependence on $k$ inside the logarithms is weak and that they can be considered almost constant on the experimental time scale. This assumption is justified for the data under analysis, since varying the parameters in the estimated range for the E. coli evolution experiment (i.e. $N = 10^7$ to $10^8$, $U_b = 10^{-6}$ to $10^{-5}$, $s_0 \approx 0.2$ and $\alpha \approx 0.25$) the ratio between the logarithmic terms varies between 5 and 12 for $k$ in the range 1 to 100. Using this approximation, the integral can be solved obtaining $t \propto k^{2-\alpha}$. Hence, fitness should scale with time as $w \propto t^{\alpha/(2-\alpha)}$. This is in agreement with the results of Wiser and coworkers, and the relation between the parameters giving the strength of epistasis in the two models is $g = \frac{2-\alpha}{2\alpha}$.
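The scaling $t \propto k^{2-\alpha}$ can be checked numerically in the constant-logarithm approximation, where each establishment time is proportional to $k'^{\,1-\alpha}$. The snippet below is an illustrative sketch: the function names are hypothetical, and the prefactor containing the logarithms is set to one, which does not affect the exponent.

```python
import math


def cumulative_time(k_max, alpha):
    """t_k up to a constant prefactor: sum over k' = 1..k of k'**(1 - alpha),
    i.e. Eq. (S27) with the logarithmic terms treated as constants."""
    total, times = 0.0, []
    for k in range(1, k_max + 1):
        total += k ** (1.0 - alpha)
        times.append(total)
    return times


def local_exponent(times, k1, k2):
    """Log-log slope of t_k between mutation numbers k1 and k2."""
    return math.log(times[k2 - 1] / times[k1 - 1]) / math.log(k2 / k1)


alpha = 0.25
times = cumulative_time(2000, alpha)
# The measured slope should be close to 2 - alpha = 1.75.
print(local_exponent(times, 1000, 2000), 2.0 - alpha)
```

Inverting $t \propto k^{2-\alpha}$ and inserting it in $w \propto k^{\alpha}$ gives the exponent $\alpha/(2-\alpha)$ used in the comparison with Wiser and coworkers.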

Supplementary Figures.

[Figure S1: panels (a) and (b); vertical axes: frequency.]

Supplementary Figure S1: Clonal interference and selective sweep. The figure shows a schematic representation of the emergence of new genotypes from a monoclonal population in the regime of selective sweep (a) and clonal interference (b). In the first case the newly arising mutant can invade the population before the next one establishes. In the clonal interference regime, many different genotypes arise at the same time, coexist and compete. The establishment and fixation times are schematically represented in the case of the selective sweep regime.

[Figure S2: frequency f(k,t) versus number of mutations k; marked on the horizontal axis: ⟨k⟩, k_max(t) and k_max(t+Δt).]

Supplementary Figure S2: The dynamics is driven by beneficial mutations and the rate of establishment of new classes. The dashed rectangles represent the class frequency histogram at a subsequent time. The green arrows symbolize the growth and decrease of size of the classes depending on their relative fitness.

[Figure S3: flow diagram. The frequency f(k,t), the beneficial mutation rate U_b and the fitness w_k = e^{s_0 g(k)} give the surviving offspring probabilities p(k,t+1); f(k,t+1) is then drawn from a multinomial distribution.]

Supplementary Figure S3: Sketch of the algorithm used in the simulations (from ref. [73], see text). The instantaneous frequencies of mutation classes define, through Eq. S23 (which incorporates selection and mutation), the probability $p(k, t+1)$ that an individual with $k$ mutations is found at the subsequent generation. The frequencies at the subsequent generation are then sampled from a multinomial distribution with parameters $p(k, t+1)$. This procedure gives access to large population sizes.

[Figure S4: log-log plot of the fixation probability π(s) versus the advantage s.]

Supplementary Figure S4: The fixation probability is proportional to the fitness advantage. The figure shows the fixation probability $\pi(s)$ of a single clone that grows in a uniform background having advantage $s$ (symbols), measured from our simulations. The fixation probability is obtained as the percentage of realizations where the beneficial mutant fixes. We obtain $\pi(s) = 2s$ (continuous red line), in accordance with [85]. Simulations are performed using $N = 10^7$ over $10^8$ realizations.

[Figure S5: panels A, B and C versus generations; vertical axes: ⟨Var_k⟩_R/⟨k⟩_R (A), σ_R(L_k)/⟨L_k⟩_R (B), σ_R(v_k)/⟨v_k⟩_R (C).]

Supplementary Figure S5: The relative variance of the distribution decreases in time, while the relative variance of $L_k$ and $v_k$ follow an increasing trend. Panel A shows the ratio between the variance $\mathrm{Var}_k$ and the mean number of mutations $\langle k \rangle$ as a function of time. Simulation results confirm that, although $\mathrm{Var}_k$ increases in time, the mean number of mutations increases more rapidly. Since $D_k \le L_k$, simulation results confirm the hypothesis $D_k/\langle k \rangle \ll 1$ used in the main text (see also Fig. 3B). However, as illustrated in panels B and C, $v_k$ and $L_k$ can be quite different across realizations at the same time. The increase of the relative error (calculated over different realizations) of the width $L_k$ (panel B) and of the velocity (panel C) reflects both the increase of the variance of the distribution and the decrease of $v_k$. However, the error is sufficiently small not to affect the results on experimentally relevant time scales. Simulations are performed using the parameters $N = 10^7$, $s_0 = 0.5$, $\alpha = 0.02$, $U_b = 1 \cdot 10^{-3}$. Data are averaged over 100 realizations of the process.

[Figure S6: panels A (⟨k⟩) and B (⟨s_0 g(k)⟩/r) versus rescaled time steps; legend: r = 1, r = 10, r = 100.]

Supplementary Figure S6: The model is effectively invariant for rescaling of the parameters. The figure shows the mean number of mutations ($\langle k \rangle$, panel A) and the fitness advantage (panel B) obtained from simulations run using three different rescaling factors ($r$ = 1, 10, 100; different symbols as in legend). The simulated dynamics is almost unaffected by the rescaling procedure. Simulations are made using the parameters $N = 10^9$, $s_0 = 0.1$, $\alpha = 0.2$, $U_b = 2 \cdot 10^{-6}$. The data are averaged over 100 iterations, and error bars (standard error) are smaller than symbols.

[Figure S7: top and bottom panels versus number of mutations (0 to 40); legend: generations 250, 500, 2.5 · 10^3, 5 · 10^3, 5 · 10^4.]

Supplementary Figure S7: For empirically relevant parameters, the class population histograms at intermediate times are nearly Gaussian. The figure refers to the “geometric” advantage, $g(k) = (1 - q^k)/(1 - q)$, plotted in the top panel, and shows (bottom panel) the mean mutation class histogram at different generations (different symbols, see legend). The continuous lines in the bottom panel represent normalized Gaussian distributions with the same mean and standard deviation as the simulated histograms. The dashed line indicates the establishment size $1/\pi(s(k))$. Simulations are averaged over 250 realizations using the parameters $N = 10^9$, $s_0 = 0.1$, $U_b = 6 \cdot 10^{-6}$ and $q = 0.8$.

Bibliography

[1] B. McClintock, “The origin and behavior of mutable loci in maize,” Proc. Natl. Acad.

Sci., vol. 36, no. 6, 1950.

[2] B. McClintock, “Mutable loci in maize,” Carnegie Institution of Washington Yearbook,

vol. 50, 1951.

[3] E. S. Lander et al., “Initial sequencing and analysis of the human genome,” Nature,

vol. 409, Feb 2001.

[4] C. R. L. Huang, K. H. Burns, and J. D. Boeke, “Active transposition in genomes,” Annual

Review of Genetics, vol. 46, Dec 2012.

[5] R. Cordaux and M. A. Batzer, “The impact of retrotransposons on human genome evo-

lution,” Nature Reviews Genetics, vol. 10, Oct 2009.

[6] J. K. Pace and C. Feschotte, “The evolutionary history of human DNA transposons:

evidence for intense activity in the primate lineage,” Genome Research, vol. 17, Apr 2007.

[7] C. Feschotte and E. J. Pritham, “DNA transposons and the evolution of eukaryotic

genomes,” Annual Review of Genetics, vol. 41, 2007.

[8] H. H. Kazazian, “Mobile elements: drivers of genome evolution,” Science, vol. 303, Mar

2004.

[9] P. L. Deininger and M. A. Batzer, “Mammalian retroelements,” Genome Research, vol. 12,

Oct 2002.

[10] M. K. Konkel and M. A. Batzer, “A mobile threat to genome stability: the impact of

non-LTR retrotransposons upon the human genome,” Seminars in cancer biology, vol. 20,

no. 4, 2010.

[11] E. A. Bennett, L. E. Coleman, C. Tsui, W. S. Pittard, and S. E. Devine, “Natural genetic

variation caused by transposable elements in humans,” Genetics, vol. 168, Oct 2004.

[12] K. Ohshima, “RNA-mediated gene duplication and retroposons: retrogenes, LINEs,

SINEs, and sequence specificity,” International Journal of Evolutionary Biology, vol. 2013,

2013.

[13] T. Singer, M. J. McConnell, M. C. Marchetto, N. G. Coufal, and F. H. Gage, “LINE-

1 retrotransposons: mediators of somatic variation in neuronal genomes?,” Trends in

Neurosciences, vol. 33, Aug 2010.

[14] J. M. C. Tubio et al., “Extensive transduction of nonrepetitive DNA mediated by L1

retrotransposition in cancer genomes,” Science, vol. 345, Aug 2014.

[15] A. D. Ewing and H. H. J. Kazazian, “High-throughput sequencing reveals extensive vari-

ation in human-specific L1 content in individual human genomes,” Genome Research,

vol. 20, no. 9, 2010.

[16] J. Xing, Y. Zhang, K. Han, A. H. Salem, S. K. Sen, C. D. Huff, Q. Zhou, E. F. Kirkness,

S. Levy, M. A. Batzer, and L. B. Jorde, “Mobile elements create structural variation:

analysis of a complete human genome,” Genome Research, vol. 19, Sept 2009.

[17] A. M. Roy-Engel, “LINEs, SINEs and other retroelements: do birds of a feather flock

together?,” Front Biosci, vol. 17, Jan 2012.

[18] J. Jurka, “Sequence patterns indicate an enzymatic involvement in integration of mam-

malian retroposons,” Proc. Natl. Acad. Sci., vol. 94, Mar 1997.

[19] S. Boissinot, A. Entezam, P. J. Munson, L. Young, and A. V. Furano, “The insertional

history of an active family of L1 retrotransposons in humans,” Genome Research, vol. 14,

no. 7, 2004.

[20] I. Ovchinnikov, A. B. Troxel, and G. D. Swergold, “Genomic characterization of recent

human LINE-1 insertions: evidence supporting random insertion,” Genome Research,

vol. 11, Dec 2001.

[21] A. M. Weiner, “SINEs and LINEs: the art of biting the hand that feeds you,” Current

Opinion in Cell Biology, vol. 14, Jun 2002.

[22] M. Costantini, F. Auletta, and G. Bernardi, “The distributions of “new” and “old” Alu

sequences in the human genome: the solution of a “mystery”,” Molecular Biology and

Evolution, vol. 29, Nov 2012.

[23] K. R. Oliver and W. K. Greene, “Mobile DNA and the TE-thrust hypothesis: supporting

evidence from the primates,” Mobile DNA, vol. 2, no. 8, 2011.

[24] K. Kobayashi et al., “An ancient retrotransposal insertion causes fukuyama-type congen-

ital muscular dystrophy,” Nature, vol. 394, 1998.

[25] H. Hassoun, T. L. Coetzer, J. N. Vassiliadis, K. E. Sahr, G. J. Maalouf, S. T. Saad,

and J. Palek, “A novel mobile element inserted in the alpha spectrin gene: spectrin

dayton. A truncated alpha spectrin associated with hereditary elliptocytosis,” J. Clinical

Investigation, vol. 94, Aug 1994.

[26] H. J. Kazazian, C. Wong, H. Youssoufian, A. Scott, D. Phillips, and S. Antonarakis,

“Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism

for mutation in man,” Nature, vol. 332, Mar 1988.

[27] C. S. Lin, D. A. Goldthwait, and D. Samols, “Identification of Alu transposition in human

lung carcinoma cells,” Cell, vol. 54, Jun 1988.

[28] S. Solyom et al., “Extensive somatic L1 retrotransposition in colorectal tumors,” Genome

Research, vol. 22, Dec 2012.

[29] B. Chenais, “Transposable elements and human cancer: a causal relationship?,” Biochim-

ica et Biophysica Acta, vol. 1835, Jan 2013.

[30] F. E. Lock et al., “Distinct isoform of FABP7 revealed by screening for retroelement-

activated genes in diffuse large B-cell lymphoma,” Proc. Natl. Acad. Sci., vol. 111, Aug

2014.

[31] K. R. Upton et al., “Ubiquitous L1 mosaicism in hippocampal neurons,” Cell, vol. 161,

Apr 2015.

[32] T. Graham and S. Boissinot, “The genomic distribution of L1 elements: the role of inser-

tion bias and natural selection,” J. Biomed Biotechnol., vol. 1, 2006.

[33] J. A. Bailey, G. Liu, and E. E. Eichler, “An Alu transposition model for the origin and

expansion of human segmental duplications,” The American Journal of Human Genetics,

vol. 73, no. 4, 2003.

[34] M. Babcock, A. Pavlicek, E. Spiteri, C. D. Kashork, I. Ioshikhes, L. G. Shaffer, J. Jurka,

and B. E. Morrow, “Shuffling of genes within low-copy repeats on 22q11 (LCR22) by

Alu-mediated recombination events during evolution,” Genome Research, vol. 13, Dec

2003.

[35] J. Jurka, O. Kohany, A. Pavlicek, V. V. Kapitonov, and M. V. Jurka, “Duplication, co-

clustering, and selection of human Alu retrotransposons,” Proc. Natl. Acad. Sci., vol. 101,

Feb 2004.

[36] T. H. Jukes and C. R. Cantor, Evolution of protein molecules. Academic Press, 1969.

[37] M. Kimura, “A simple method for estimating evolutionary rates of base substitutions

through comparative studies of nucleotide sequences,” Journal of Molecular Evolution,

vol. 16, no. 2, 1980.

[38] M. Nei and S. Kumar, Molecular evolution and phylogenetics. Oxford University Press,

New York, 2000.

[39] B. J. Wagstaff, E. N. Kroutter, R. S. Derbes, V. P. Belancio, and A. M. Roy-Engel,

“Molecular reconstruction of extinct LINE-1 elements and their interaction with nonau-

tonomous elements,” Mol. Biol. Evol., vol. 30, Aug 2013.

[40] H. Khan, A. Smit, and S. Boissinot, “Molecular evolution and tempo of amplification of

human LINE-1 retrotransposons since the origin of primates,” Genome Research, vol. 16,

Jan 2006.

[41] A. Sookdeo, C. M. Hepp, M. A. McClure, and S. Boissinot, “Revisiting the evolution of

mouse LINE-1 in the genomic era,” Mobile DNA, vol. 4, Jan 2013.

[42] A. Le Rouzic and G. Deceliere, “Models of the population genetics of transposable ele-

ments,” Genetics Research, vol. 85, Jun 2005.

[43] A. Le Rouzic and P. Capy, “Population genetics models of competition between transpos-

able element subfamilies,” Genetics, vol. 174, Oct 2006.

[44] A. Le Rouzic, S. B. Thibaud, and P. Capy, “Long-term evolution of transposable ele-

ments,” Proc. Natl. Acad. Sci., vol. 104, 2007.

[45] A. Le Rouzic, T. Payen, and A. Hua-Van, “Reconstructing the evolutionary history of

transposable elements,” Genome Biol. Evol., vol. 5, no. 1, 2012.

[46] G. Abrusan and H.-J. Krambeck, “Competition may determine the diversity of transpos-

able elements,” Theoretical Population Biology, vol. 70, Nov 2006.

[47] O. Piskurek, H. Nishihara, and N. Okada, “The evolution of two partner LINE/SINE

families and a full-length chromodomain-containing Ty3/Gypsy LTR element in the first

reptilian genome of Anolis carolinensis,” Gene, vol. 441, Jul 2009.

[48] Y. Terai, K. Takahashi, and N. Okada, “SINE cousins: the 3’-end tails of the two oldest

and distantly related families of SINEs are descended from the 3’-ends of LINEs with the

same genealogical origin,” Mol. Biol. Evol., vol. 15, Nov 1998.

[49] R. M. Ziff and E. D. McGrady, “The kinetics of cluster fragmentation and depolymerisa-

tion,” J. Phys. A: Math. Gen., vol. 18, 1985.

[50] Z. Cheng and S. Redner, “Kinetics of fragmentation,” J. Phys. A: Math. Gen., vol. 23,

1990.

[51] J. D. Barrow, “Coagulation with fragmentation,” J. Phys. A: Math. Gen., vol. 14, no. 3,

1981.

[52] F. Massip and P. F. Arndt, “Neutral evolution of duplicated DNA: an evolutionary stick-

breaking process causes scale-invariant behavior,” Phys. Rev. Lett., vol. 110, no. 14, 2013.

[53] D. Sellis, A. Provata, and Y. Almirantis, “Alu and LINE1 distributions in the human

chromosomes: evidence of global genomic organization expressed in the form of power

laws,” Molecular Biology and Evolution, vol. 24, no. 11, 2007.

[54] A. Smit, R. Hubley, and P. Green, “RepeatMasker Open-4.0,” 2013-2015.

[55] “UCSC Genome Browser.” https://genome.ucsc.edu/.

[56] D. A. Ray, C. Feschotte, H. J. Pagan, J. D. Smith, E. J. Pritham, P. Arensburger, P. W.

Atkinson, and N. L. Craig, “Multiple waves of recent DNA transposon activity in the bat,

Myotis lucifugus,” Genome Research, vol. 18, May 2008.

[57] M. Kimura, “Evolutionary rate at the molecular level,” Nature, vol. 217, 1968.

[58] P. D. Keightley, “Rates and fitness consequences of new mutations in humans,” Genetics,

vol. 190, no. 2, 2012.

[59] A. Hodgkinson and A. Eyre-Walker, “Variation in the mutation rate across mammalian

genomes,” Nature Reviews Genetics, vol. 12, Nov 2011.

[60] S. Subramanian and S. Kumar, “Neutral substitutions occur at a faster rate in exons than

in noncoding DNA in primate genomes,” Genome Research, vol. 13, May 2003.

[61] S. Wielgoss, J. Barrick, O. Tenaillon, M. Wiser, W. Dittmar, S. Cruveiller, B. Chane-

Woon-Ming, C. Medigue, and R. E. Lenski, “Mutation rate dynamics in a bacterial

population reflect tension between adaptation and genetic load,” Proc. Natl. Acad. Sci.,

vol. 110, Jan 2013.

[62] M. W. Nachman and S. L. Crowell, “Estimate of the mutation rate per nucleotide in

humans,” Genetics, vol. 156, no. 1, 2000.

[63] G. E. Liu, C. Alkan, L. Jiang, S. Zhao, and E. E. Eichler, “Comparative analysis of Alu

repeats in primate genomes,” Genome Research, vol. 19, May 2009.

[64] L. Duret and P. F. Arndt, “The impact of recombination on nucleotide substitutions in

the human genome,” PLOS Genetics, vol. 4, no. 5, 2008.

[65] T. Hindre, C. Knibbe, G. Beslon, and D. Schneider, “New insights into bacterial adap-

tation through in vivo and in silico experimental evolution,” Nat Rev Microbiol, vol. 10,

May 2012.

[66] J. E. Barrick, D. S. Yu, S. H. Yoon, H. Jeong, T. K. Oh, D. Schneider, R. E. Lenski, and

J. F. Kim, “Genome evolution and adaptation in a long-term experiment with Escherichia

coli,” Nature, vol. 461, Oct 2009.

[67] O. Tenaillon, A. Rodríguez-Verdugo, R. L. Gaut, P. McDonald, A. F. Bennett, A. D. Long,

and B. S. Gaut, “The molecular diversity of adaptive convergence,” Science, vol. 335, Jan

2012.

[68] R. E. Lenski, M. R. Rose, S. C. Simpson, and S. C. Tadler, “Long-term experimental

evolution in Escherichia coli. I - adaptation and divergence during 2,000 generations,”

Am Nat, vol. 138, Dec 1991.

[69] P. J. Gerrish and R. E. Lenski, “The fate of competing beneficial mutations in an asexual

population,” Genetica, vol. 102-103, 1998.

[70] J. Felsenstein, “The evolutionary advantage of recombination,” Genetics, vol. 78, Oct

1974.

[71] H. A. Orr, “The distribution of fitness effects among beneficial mutations,” Genetics,

vol. 163, Apr 2003.

[72] C. O. Wilke, “The speed of adaptation in large asexual populations,” Genetics, vol. 167,

Aug 2004.

[73] S.-C. Park, D. Simon, and J. Krug, “The speed of evolution in large asexual populations,”

Journal of Statistical Physics, vol. 138, Feb 2010.

[74] L. S. Tsimring, H. Levine, and D. A. Kessler, “RNA virus evolution via a fitness-space

model,” Phys. Rev. Lett., vol. 76, Jun 1996.

[75] M. M. Desai and D. S. Fisher, “Beneficial mutation selection balance and the effect of

linkage on positive selection,” Genetics, vol. 176, Jul 2007.

[76] E. Brunet, I. M. Rouzine, and C. O. Wilke, “The stochastic edge in adaptive evolution,”

Genetics, vol. 179, May 2008.

[77] B. H. Good, I. M. Rouzine, D. J. Balick, O. Hallatschek, and M. M. Desai, “Distribution

of fixed beneficial mutations and the rate of adaptation in asexual populations,” Proc.

Natl. Acad. Sci., vol. 109, Mar 2012.

[78] S. Kryazhimskiy, G. Tkacik, and J. B. Plotkin, “The dynamics of adaptation on correlated

fitness landscapes,” Proc. Natl. Acad. Sci., vol. 106, Nov 2009.

[79] M. M. Desai, D. S. Fisher, and A. W. Murray, “The speed of evolution and maintenance

of variation in asexual populations,” Curr. Biol., vol. 17, Mar 2007.

[80] A. I. Khan, D. M. Dinh, D. Schneider, R. E. Lenski, and T. F. Cooper, “Negative epistasis

between beneficial mutations in an evolving bacterial population,” Science, vol. 332, Jun

2011.

[81] H.-H. Chou, H.-C. Chiu, N. F. Delaney, D. Segre, and C. J. Marx, “Diminishing returns

epistasis among beneficial mutations decelerates adaptation,” Science, vol. 332, Jun 2011.

[82] G. Martin, S. F. Elena, and T. Lenormand, “Distributions of epistasis in microbes fit

predictions from a fitness landscape model,” Nat Genet, vol. 39, Apr 2007.

[83] S. Wright, “Evolution in Mendelian populations,” Genetics, vol. 16, Mar 1931.

[84] R. Fisher, The genetical theory of natural selection. Clarendon Press, Oxford, 1930.

[85] S.-C. Park and J. Krug, “Clonal interference in large populations,” Proc. Natl. Acad. Sci.,

vol. 104, Nov 2007.

[86] A. Eyre-Walker and P. D. Keightley, “The distribution of fitness effects of new mutations,”

Nat Rev Genet, vol. 8, Aug 2007.

[87] J. B. S. Haldane, “A mathematical theory of natural and artificial selection, part V:

selection and mutation,” Proc. Camb. Philos. Soc., vol. 23, July 1927.

[88] M. Hegreness, N. Shoresh, D. Hartl, and R. Kishony, “An equivalence principle for the

incorporation of favorable mutations in asexual populations,” Science, vol. 311, Mar 2006.

[89] L. Perfeito, L. Fernandes, C. Mota, and I. Gordo, “Adaptive mutations in bacteria: high

rate and small effects,” Science, vol. 317, Aug 2007.

[90] I. Rouzine, E. Brunet, and C. O. Wilke, “The traveling-wave approach to asexual evolu-

tion: Muller’s ratchet and speed of adaptation,” Theor. Popul. Biol., vol. 73, Feb 2008.

[91] S.-C. Park and J. Krug, “Evolution in random fitness landscapes: the infinite sites model,”

Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 4, 2008.

[92] H. A. Guess, “Limit theorems for some stochastic evolution models,” The Annals of

Probability, vol. 2, Feb 1974.

[93] N. Jezequel, M. C. Lagomarsino, F. Heslot, and P. Thomen, “Long-term diversity and

genome adaptation of Acinetobacter baylyi in a minimal-medium chemostat,” Genome

Biology and Evolution, vol. 5, no. 1, 2013.

[94] “E. coli long-term experimental evolution project.” http://myxo.css.msu.edu/index.html.

[95] D. E. Dykhuizen and D. L. Hartl, “Selection in chemostats,” Microbiol. Mol. Biol. Rev.,

vol. 47, no. 2, 1983.

[96] J. A. G. M. de Visser and R. E. Lenski, “Long-term experimental evolution in Escherichia

coli. XI - rejection of non-transitive interactions as cause of declining rate of adaptation,”

BMC Evolutionary Biology, vol. 2, Oct 2002.

[97] S. F. Elena and R. E. Lenski, “Long-term experimental evolution in Escherichia coli. VII

- mechanisms maintaining genetic variability within populations,” Evolution, vol. 51, Aug

1997.

[98] M. J. Wiser, N. Ribeck, and R. E. Lenski, “Long-term dynamics of adaptation in asexual

populations,” Science Express, vol. 342, Dec 2013.

[99] S. F. Elena and R. E. Lenski, “Evolution experiments with microorganisms: the dynamics

and genetic bases of adaptation,” Nat Rev Genet, vol. 4, Jun 2003.

[100] S. Schiffels, G. Szollosi, V. Mustonen, and M. Lassig, “Emergent neutrality in adaptive

asexual evolution,” Genetics, vol. 189, Dec 2011.

[101] I. M. Rouzine, J. Wakeley, and J. M. Coffin, “The solitary wave of asexual evolution,”

Proc. Natl. Acad. Sci., vol. 100, Jan 2003.

[102] K. Sato, Y. Ito, T. Yomo, and K. Kaneko, “On the relation between fluctuation and

response in biological systems,” Proc. Natl. Acad. Sci., vol. 100, no. 24, 2003.

[103] Y. Ito, H. Toyota, K. Kaneko, and T. Yomo, “How selection affects phenotypic fluctua-

tion,” Molecular Systems Biology, vol. 5, no. 64, 2009.
