PROBABILISTIC MODELS FOR UNDERSTANDING
REGULATION OF TRANSLATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Cristina Pop
March 2015
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/fm548ck9534
© 2015 by Cristina Pop. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Daphne Koller, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jonathan Weissman
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
The process of translation, whereby RNA is converted to protein, is an essential
biosynthetic process requiring a large fraction of the cell’s resources. However, our
understanding of the regulatory mechanisms at this stage of gene expression is limited.
Recent high-throughput experimental techniques and our development of probabilistic
models for their analysis have allowed us to better explore translation efficiency, codon
preferences, and mRNA secondary structure, as well as the interplay between these
factors.
In this work, we first present a queuing-theory-based probabilistic model for ribo-
some profiling data to extract robust estimates of protein synthesis rates and trans-
lation rates per codon, which can vary across individual genes. We use this model to
show that local rates and translation efficiency are not affected by manipulations of
tRNA abundance in physiological conditions in yeast; this reverses the direction of
causality previously assumed to hold. Instead, we propose that initiation sequence
signals, such as mRNA structure, could drive translation. To further understand
varying translation rates, we also apply this model to human cells and present results
on allele-specific ribosome pausing.
Second, we delve deeper into RNA structure, which is important more broadly
throughout the pipeline of protein expression and in many aspects of regulation con-
trol. However, accurately determining RNA structure at large scale is difficult with
only experimental data or algorithmic methods. We present a conditional log-linear
model that can incorporate information from multiple structure probing assays, and,
although limited by the data quality, improves prediction accuracy over leading algo-
rithms. Our method can also be used to derive new insight into biological processes
influenced by RNA structure, such as translation.
Acknowledgements
First and foremost, I’d like to thank my advisor, Daphne Koller, for her mentorship
and inspiration. Your source of boundless knowledge and spot-on guidance steered
me throughout my academic growth. Your ambition and fearlessness in asking the
hard questions became my goalpost too. I thank you for so much of what I have
learned in research, both in skills and in independence.
I’d also like to thank Jonathan Pritchard, for the insight he provided during
the last part of my PhD and the very warm support. I am very appreciative to
Jonathan Weissman, for fruitful discussions on much of this work and an amazing
long-term collaboration. Thank you also to Serafim Batzoglou and Anshul Kundaje,
who provided useful comments and valuable questions.
Throughout my PhD, I have had the pleasure of collaborating with a number of
fantastic people. Thank you:
• Nick Ingolia and Silvi Rouskin, for in-depth discussions and also for your
patience in teaching me the biology I did not know.
• Chuan-Sheng Foo, for many vibrant and productive discussions, for the
late-night work sessions, and for being a great friend.
• Vlad Jojic, for your support in making my first year academically rich.
• Sara Mostafavi, for introducing me to a different part of computational biology.
• Members of DAGS throughout the years: Suchi Saria, Karen Sachs, Alexis
Battle, Ben Packer, Joni Laserson, Manfred Classen, Varun Ganapathi, Yoni
Donner, David Knowles, Yi Liu, Irene Kaplow, Pawan Kumar, Michael Stark,
Huayan Wang, Tianshi Gao, Clara Fannjiang, and Madiha Chan.
• Members of the Pritchard lab, the Weissman lab, and the Batzoglou lab with
whom I have had the pleasure of working.
Thank you Alex Sandra and members of the CS department for making my days
stressless. I’d also like to acknowledge the NSF Graduate Research Fellowship and
NSERC Postgraduate Scholarship.
Finally, I am grateful for many of the folks I have had a chance to share these
years with:
To the friends I met in grad school – your support always made my day (and my
PhD, and the rest of my life).
To my parents, Emil and Ana, for wisdom and strength. You are my pillars in
everything, always.
To my sister, Ana, who has an uncanny way of being whatever you need. You are
awesome.
And most of all, to my grandmothers – from one I learned joy and math, from
one I learned wit and philosophy. From both of you, I learned the enchantment of
never giving up. My genes owe you.
Contents
Abstract
Acknowledgements
1 Introduction
2 Background
2.1 Cell Biology
2.2 Analysis of Ribosome Profiling Data
2.3 Prediction of RNA Secondary Structure
3 A Model for Translation
3.1 Introduction
3.2 Results
3.2.1 Queuing Model for Elongation
3.2.2 Codon Translation and tRNA Manipulation
3.2.3 Translation Efficiency and tRNA Manipulation
3.2.4 Factors Correlating with Elongation Efficiency
3.2.5 Factors Correlating with Translation Efficiency
3.3 Discussion
3.4 Materials and Methods
3.5 Conclusions
4 Translation in Humans
4.1 Introduction
4.2 Results
4.2.1 Allele-Specific Ribosome Dwell Times
4.2.2 Codon Translation Rates Across Individuals
4.3 Discussion
4.4 Methods
4.5 Conclusions
5 RNA Secondary Structure Prediction
5.1 Introduction
5.2 Results
5.2.1 Improved Secondary Structure Predictions
5.2.2 The Value of Structure-Probing Data
5.2.3 Combining Data from Multiple Data Sources
5.2.4 Classification of RNA-Binding Protein Targets
5.2.5 Nucleotide-Level Structure Contexts for RNA-Binding Proteins
5.2.6 Structure and Translation Efficiency under Oxidative Stress
5.3 Discussion
5.4 Methods
5.4.1 The CONTRAfold-SE Model
5.4.2 Dataset Setup
5.5 Conclusions
6 Conclusions
6.1 Contributions
6.2 Going Forward
A Ribosome Profiling
A.1 Supplementary Methods
A.2 Supplementary Figures and Tables
B RNA Secondary Structure
B.1 Supplementary Methods
B.1.1 Model Specification
B.1.2 Dataset Setup
B.2 Supplementary Figures and Tables
Bibliography
List of Tables
5.1 F-measure of CONTRAfold-SE (C-SE) trained on Train-A(PARS) and evaluated on Test-SeqFold.
5.2 Performance of CONTRAfold-SE trained on Train-A and Train-B and evaluated on three general test sets.
5.3 Performance of CONTRAfold-SE trained on sets of varying compositions with PARS data and evaluated on two test sets.
A.1 Counts of tRNA in RPM (number of reads per million) in ACA-K and wild-type.
A.2 Eight categories of potential correlates to outlier strength.
A.3 Spearman correlation between outlier strength and features, separated by type and highlighted if significant.
A.4 Performance of TE regression model.
A.5 Summary of main results for model variations.
B.1 AUC for receiver-operating-characteristic curves classifying bound RBP genes.
B.2 Spearman correlation between CONTRAfold-SE and translation efficiency on in vivo data.
B.3 Spearman correlation between CONTRAfold-SE and translation efficiency on in vitro data.
B.4 Spearman correlation between CONTRAfold-SE and translation efficiency at earlier time point.
B.5 Spearman correlation between CONTRAfold-SE in vivo and various TE quantities.
List of Figures
2.1 Central dogma of biology.
2.2 Ribosome footprint density profile versus mRNA density profile.
2.3 Alternative splicing.
2.4 Common structure motifs.
3.1 Model of protein synthesis.
3.2 Correlation between codon translation rates and measures of codon usage bias.
3.3 Comparison between codon translation rates in wild-type and mutants.
3.4 Comparison between translation efficiency in wild-type and mutants.
3.5 All codons show negative correlation between outlier strength and proximity to gene start.
3.6 RNA structure energy and its relationship to translation efficiency.
3.7 Estimated Kozak motif for efficient genes.
4.1 Comparison of ribosome fragment counts between alleles at SNPs.
4.2 Comparison of inferred codon dwell times between four random pairs of human individuals.
5.1 Overview of CONTRAfold-SE.
5.2 CONTRAfold-SE performance using different data sources.
5.3 Classification of RNA binding protein targets into true bound versus false bound genes.
5.4 Nucleotide-level structure prediction for the true bound sequences of RNA binding protein FXR2 with motif WGGA.
5.5 Correlation between translation efficiency per gene and the accessibility in rolling windows of 40nt, as predicted by CONTRAfold-SE.
A.1 Correlation between experimental measures of protein abundance, and estimated flow and average footprint count (baseline).
A.2 Overexpression of tRNAArg(CCU) does not significantly alter amino acid charging levels.
A.3 The ratio between estimated mutant and wild-type rates.
A.4 The ratio of mutant to wild-type footprint count per codon.
A.5 The analysis of Figure A.2 repeated on flow instead of TE.
A.6 Distribution of three features among reduced TE genes and increased TE genes in ACA-K.
A.7 Correlation between log(TE) and gene-level features.
A.8 Dwell-corrected footprint counts normalized by flow.
A.9 Codon translation rates versus tAI.
A.10 Histograms of positions of slow outliers and non-outliers are similar.
A.11 Two different initializations of the parameters for the translation model.
B.1 Sensitivity-PPV curve for ASH1-E1 in Test-SeqFold.
B.2 Sensitivity-PPV curve for RDN58-2 in Test-SeqFold.
B.3 Sensitivity-PPV curve for p4p6 in Test-SeqFold.
B.4 Sensitivity-PPV curve for p9 in Test-SeqFold.
B.5 Sensitivity-PPV curve for snR10 in Test-SeqFold.
B.6 Sensitivity-PPV curve for snR33 in Test-SeqFold.
B.7 Sensitivity-PPV curve for snR37 in Test-SeqFold.
B.8 Sensitivity-PPV curve for snR46 in Test-SeqFold.
B.9 Sensitivity-PPV curve for snR53 in Test-SeqFold.
B.10 Sensitivity-PPV curve for snR81 in Test-SeqFold.
B.11 Structure profiles for human RNA binding proteins.
B.12 Learned noise model for structure probing data.
B.13 Correlation between learned parameters for different parameter initializations.
Chapter 1
Introduction
The expression of protein from DNA follows a complex series of steps. Classically,
genes within the DNA (a string of bases) are transcribed into RNA (a similar string
of bases) and translated into protein (a string of amino acids). We now know there is
further processing, post-modifications, and feedback at each stage, which makes the
linear process suggested by the central dogma much more complicated.
At the RNA and protein level, the cell contains multiple copies of each gene in
varying amounts (i.e. genes are expressed in varying amounts), with transcription
and translation changing the amount of each transcript in a process called gene reg-
ulation. The correlation between the level of RNA and the level of protein across the
genes in an organism is not perfect; it ranges from a Pearson r-value of 0.36 to 0.66
depending on the organism [78]. Knowing how much of each transcript is translated is
important for many levels of biological understanding, including determining how to
control translation, understanding how translation changes with disease states, and
deciphering the mechanisms behind differences between individuals.
Much research has been devoted toward understanding regulation of transcription
(converting DNA to RNA), namely, why some genes are expressed more than others.
However, only over the past few years have we developed high-throughput assays
and direct experiments to understand translation (converting RNA to protein) and
the factors that contribute to its regulation. Embedded in these data are insights
about the process of translation, but accessing them requires handling sparse data,
distinguishing noise from the true signal, and identifying the relationships between
the variables in the underlying biological process. These tasks require new analytical
frameworks designed for these new kinds of data. Probabilistic models are
particularly useful in this case, where we have prior information on the structure of the
process but not much ground truth data to learn from. These techniques also let us
infer missing values, smooth out noise, and learn biologically meaningful variables.
Consequently, in this work, we provide robust probabilistic methods for extracting
biologically meaningful parameters from high-throughput datasets in order to gain a
mechanistic understanding of the regulation of translation.
One of the key variables of interest in translation regulation is the relative amount
of protein produced between genes. Traditional techniques, mass spectrometry and
tagging with green fluorescent protein (GFP), can measure protein levels but suffer
from lower accuracy, especially for low-abundance proteins. More recently, a
high-throughput deep sequencing technique called ribosome profiling [57] has produced
a high-resolution snapshot of translation and a finer way for estimating ribosome
throughput, which is proportional to protein abundance. In addition, several studies
have appeared on properties of the RNA related to ribosome throughput or local
ribosome dynamics in order to understand the basis for what makes translation ef-
ficient in some proteins but not in others, and to tease apart causal factors from
correlated factors. Ribosome profiling also allows easier comparison between physiological
conditions and synthetic biology constructs, which we exploit in this work. One
feature of the RNA in particular, namely the structure of RNA (how the RNA folds
on itself), has garnered much attention, both as a potential regulating mechanism for
translation and in many other essential processes within the cell. Whereas computa-
tional methods and experimental methods have generally not been tightly integrated
in the goal of structure prediction, recent high-throughput datasets providing partial
structure-probing data have made this coupling easier.
In Chapter 2, we give a brief background on the process of translation, the experimental
assays we use in this work, the generated data that measure protein
abundance and RNA structure, and previous computational approaches for analyz-
ing such data. We then present, in Chapter 3, a probabilistic model for a ribosome
profiling dataset, which allows us to extract several variables of interest: the protein
synthesis rate and the rate at which each codon is translated. We also offer some bio-
logical factors influencing translation regulation in yeast. In Chapter 4, we extend our
analysis of translation to a human dataset, focusing on genetic variation and codon
translation rates. In Chapter 5, we return to RNA secondary structure as a poten-
tial regulator of translation and present a probabilistic model, CONTRAfold-SE, for
improved structure prediction using partial large-scale information. Finally, we sum-
marize the contributions of this thesis in Chapter 6 and reflect on our approaches
to modeling high-throughput data for better understanding regulation of translation
and its mechanistic basis.
Chapter 2
Background
2.1 Cell Biology
The cell consists of three major players: DNA, RNA, and protein. Although we have
gained much insight into the path from DNA to protein, there are many subtleties
and concepts left to discover. In this chapter, we will introduce some key biological
concepts specifically related to translation, the focus of this thesis. Translation is
a conversion from RNA to protein. Although much RNA is functional and a major
player in a number of key processes acting within the cell, proteins are often called the
building blocks. Each protein participates in a specific activity, including housekeep-
ing, regulation, or pairing with other proteins into complexes to achieve sophisticated
functions. Taking another step backward, proteins are derived from genes in DNA
– specific strings that undergo transcription (DNA to RNA), translation (RNA to
protein), and other processing before eventually becoming a functional protein. In
the following sections, we specifically focus on translation and the biological concepts
important for understanding its regulation. Figure 2.1 summarizes these concepts.
Further information can be found in [2].
Each gene is transcribed and translated into multiple copies of RNA and protein,
both an oriented concatenation of smaller units. This string of units identifies the
molecule, or the encoded gene. At a high level of abstraction, we refer to the RNA
transcript as the string that is directly translated into a protein (called mature mes-
senger RNA, or mRNA) and in the next section we will introduce an additional layer
of processing at the RNA level. Translation initiates at the 5’ end when the ribosome,
another large molecule, latches onto the RNA transcript, translocating towards the
3’ end. The ribosome converts the RNA string into the protein string, essentially
converting the string in series from one alphabet to another: the alphabet of RNA
(the four bases Adenine, Cytosine, Guanine, and Uracil, shortened to A, C, G, and U)
to the alphabet of proteins (the 20-22 amino acids, depending on the organism). Dur-
ing translation, the RNA string is grouped into consecutive triplets of bases, called
codons, each of which corresponds to a specific amino acid. This code is redundant;
with 64 codons and 22 amino acids, there are 1-6 codons coding for the same amino
acid. These are called synonymous codons, because substituting one for another will
not affect the final protein product. Synonymous codons are not uniformly used; this
preference phenomenon is called codon usage bias and its basis and influence is an
active topic of study and debate in translation literature.
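The grouping of the RNA string into codons and the redundancy of the code can be illustrated with a short sketch. It assumes the standard genetic code with 20 amino acids (organism-specific variants with additional amino acids are not modeled), and the names `CODON_TABLE`, `translate`, and `synonymous_codons` are introduced here purely for illustration.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Read consecutive triplets from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def synonymous_codons():
    """Group the sense codons by the amino acid they code for."""
    groups = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":
            groups[amino_acid].append(codon)
    return dict(groups)


if __name__ == "__main__":
    print(translate("GGAUGGCCUUAUAAGG"))
    for amino_acid, codons in sorted(synonymous_codons().items()):
        print(amino_acid, len(codons), codons)
```

In this table, methionine and tryptophan each have a single codon while leucine, serine, and arginine each have six, which is the 1-6 range of synonymous codons mentioned above.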
The first codon in the string, the start codon, is typically AUG and encodes
the amino acid methionine (Met). The ribosome begins initiation, starting translation
at the AUG and pausing to recruit other necessary helper molecules such as initia-
tion factors. Once the Met amino acid is added at the beginning of the novel protein
chain, the ribosome transitions to a stage called elongation, in which the ribosome
translocates to each successive codon, pausing at each one to recruit another impor-
tant molecule called tRNA. The tRNA molecule is specific to the codon currently in
the A-site (the active site of the ribosome), and holds the associated amino acid to be
added onto the growing protein chain. Similar to non-uniform codon usage, tRNAs
exist in varying amounts, floating in the cytoplasm until they are needed by the
ribosome. Typically, elongation is much less intensive than initiation. The final codon
is the stop codon, typically one of UAG, UAA, or UGA. At this stage, termination, the
ribosome subunits dissociate and the completed protein chain is released.

Figure 2.1: Central dogma of biology. DNA is transcribed into RNA and RNA is translated into protein. The ribosome initiates translation at the AUG codon, located after the 5’ UTR. The ribosome then enters the elongation stage, where it pauses at each codon to recruit the tRNA molecule that brings in the associated amino acid. The ribosome is large enough to cover 3 codon positions: the currently active one is in the A-site, the previously translated codon is in the P-site, and the second-last translated codon is in the E-site. The ribosome terminates translation at a stop codon.

There are several biological components involved in this process. We focus on
those that we will refer to in this thesis, and in particular those that have been
associated in literature with translation regulation:
Sequence Signals Beyond the codons themselves, the RNA transcript encodes
various other sequence signals that are important for initiation, elongation, and termination.
Upstream of (before) the AUG is the 5’ UTR (untranslated region) and,
similarly, downstream of (after) the stop codon is the 3’ UTR. These regions are, as
indicated, not typically translated by the ribosome into amino acids, but do act as
indicators for the start and stop of the gene, or regulate translation via, for example,
RNA structure.
RNA structure The RNA strand folds on itself in its native state. Since this
structure has to be unfolded while the ribosome elongates translation, this barrier
could intuitively affect the speed of translation [86]. It has also been shown that spe-
cific types of structures at the 5’ end can impact initiation and hence the efficiency of
translation [66, 63, 104]. We will focus on better determining structure via computa-
tional and experimental techniques in the last section of this thesis. Various motifs
or k-mers in the RNA (strings of length k) can also be recognized by other molecules
that can bind these regions in order to repress or help translation. A particular region
of interest around the start codon has seen much analysis [66, 109], but other parts
of the RNA could be affected by RNA binding proteins.
Protein folding Similarly to the RNA, the growing protein chain can also fold on
itself as it is translated (co-translational folding). Since this structure is often critical
to its function, the ribosome might need to pause at specific locations in order to
ensure a correct fold [140, 90].
Ribosome conformation The ribosome itself could also affect its speed of trans-
lation. Recently it was shown that the ribosome takes on two different conformations
[70], an area that we are now able to explore at a larger scale with the advent of
high-throughput techniques.
Other sequence signals Finally, there are many other signals that have been
suggested to regulate translation. The co-occurrence of specific codons (codon pairs)
could lead their amino acids to interact with each other in speed-impacting ways
within the ribosome [59, 20, 21]. This occurs because the ribosome actually houses
each codon through the A, P, and E sites, from amino acid recruitment to exit from
the ribosome tunnel and into the freed protein chain not protected by the ribosome.
The A-site codon is the one for which the ribosome recruits the amino acid, and
it is shifted over into the P site for further processing and to make room for the
next amino acid. Clusters of rare or “slow” codons could impede translation [141].
Specific codons upstream of the stop codon could affect translation [120] and AUG
sites upstream of the actual start location, within the UTR, could act as a regulatory
mechanism for initiation. Depending on the organism, the specific set and the sample
space of biological factors could vary greatly. For example, a particular motif in E.
coli is by far a significant repressor of translation [73], whereas that factor is not
observed in other organisms, like yeast.
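As a small illustration of one of the sequence signals above, upstream AUG sites within the 5’ UTR, the following sketch scans a transcript for AUG triplets that occur before the annotated start codon. The function name `upstream_augs`, the example sequence, and the in-frame flag are assumptions made for illustration, not part of any analysis described here.

```python
def upstream_augs(mrna, start_pos):
    """Return (position, in_frame) for every AUG located before the annotated start codon.

    `start_pos` is the 0-based index of the A in the annotated start codon;
    `in_frame` is True when the upstream AUG shares its reading frame.
    """
    hits = []
    pos = mrna.find("AUG")
    while pos != -1 and pos < start_pos:
        hits.append((pos, (start_pos - pos) % 3 == 0))
        pos = mrna.find("AUG", pos + 1)
    return hits


if __name__ == "__main__":
    # Hypothetical transcript whose annotated start codon begins at index 12.
    print(upstream_augs("GGCAUGCAUGCCAUGGCC", 12))
```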
2.2 Analysis of Ribosome Profiling Data
Experimental Techniques
Protein abundance has traditionally been measured by the standard techniques of
mass spectrometry and fluorescence-tagging (GFP), which give a relative abundance
level representing how many copies exist for each gene. A more recent technique, ribo-
some profiling [57], combines the concept of polysome profiling with deep sequencing
to extract information about translation at a codon-level resolution. In particular,
ribosomes are immobilized during translation using flash freezing (or, originally, with
a drug called cycloheximide), capturing the location of the active codon (the A site).
Since the ribosome is a large molecule, it also covers the region around the active
codon, around 30 nucleotides (30nt) in length. The RNA not covered by the ribosome
is digested, leaving only the ribosome-bound fragments. These RNA fragments are then
purified to remove the ribosome. In a manner similar to measuring abundance of
RNA, these fragments are then reverse transcribed into complementary DNA, am-
plified, and sequenced, so that they can be aligned to the genome. This final stage
reveals their location in the genome and, hence, which gene they correspond to. For
specific fragments, we can confidently and unambiguously identify the location of the
active codon on this footprint length (typically halfway in). Therefore, these data give
us a ribosome footprint density profile for every gene, representing how many counts
we observe for every codon on that gene, or how many ribosomes were translating
each codon at a given snapshot in time (Figure 2.2).
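The step from aligned footprints to a per-codon density profile can be sketched as follows. This is a minimal illustration, assuming each footprint is given as a (start, length) pair in transcript nucleotide coordinates and that the active codon lies halfway along the footprint, as described above; the function name and the `a_site_fraction` parameter are hypothetical.

```python
def footprint_profile(footprints, cds_start, n_codons, a_site_fraction=0.5):
    """Count footprints per codon by assigning each one to its estimated active codon.

    footprints      -- iterable of (start, length) pairs in nucleotide coordinates
    cds_start       -- nucleotide index of the first base of the start codon
    n_codons        -- number of codons in the coding sequence
    a_site_fraction -- assumed relative position of the active codon within a footprint
    """
    counts = [0] * n_codons
    for start, length in footprints:
        a_site_nt = start + int(length * a_site_fraction)
        offset = a_site_nt - cds_start
        if offset < 0:
            continue  # active site falls in the 5' UTR
        codon = offset // 3
        if codon < n_codons:
            counts[codon] += 1
    return counts


if __name__ == "__main__":
    reads = [(0, 30), (3, 30), (3, 30), (90, 30)]
    print(footprint_profile(reads, cds_start=0, n_codons=10))
```

Real pipelines typically calibrate the offset per footprint length, which is why the fraction is left as a parameter here rather than fixed.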

Figure 2.2: Ribosome footprint density profile versus mRNA density profile. These densities are over a sample gene of length 250 codons. The ribosome counts (top) have more variance than the mRNA counts.

Given a snapshot at steady-state, uniform translation speed by the ribosome across
the transcript, and sufficient sampling depth (sufficient footprints), we could average
the footprint counts in each gene profile to obtain an estimate of how many ribosomes
terminate translation per transcript. Ribosome throughput corresponds, up to a
factor that accounts for protein degradation, to how many proteins are produced
for each gene. Indeed, this averaging approach is applied when estimating RNA abundance
from RNA-seq data. In RNA-seq, the transcriptome is randomly fragmented (as
in ribosome-profiling, but with no ribosomes, and random cuts of the RNA) and
mapped back to the genome, this time giving RNA abundance profiles for each gene.
Averaging the fragment counts per gene in RNA-seq is a reasonable approach since
in that situation we expect a uniform coverage of all positions on the transcript if
fragments are randomly selected. However, during translation, the ribosome pauses
for varying amounts across a gene, and hence the footprints extracted from this
process are not uniform across a gene. Figure 2.2 indeed shows a comparison between
a typical ribosome footprint density profile and an RNA-seq profile. And so, in the
case of ribosome-profiling, we also obtain an estimate for each codon of how many
ribosomes were translating that location across all copies of the gene. These extracted
counts are proportional to the dwell time of the ribosome at that location.
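The proportionality between counts and dwell time can be made explicit with a short steady-state argument. The notation below is introduced here as an assumption for illustration: $c_{g,i}$ is the footprint count at codon $i$ of gene $g$, $\tau_{g,i}$ the mean dwell time there, $J_g$ the ribosome throughput per transcript, $m_g$ the number of transcript copies, $L_g$ the gene length in codons, and $s$ a sampling-depth factor. The sketch further assumes that every ribosome that initiates also terminates (no drop-off).

```latex
% Steady state: expected occupancy = throughput x mean residence time (Little's law)
\mathbb{E}[c_{g,i}] = s \, m_g \, J_g \, \tau_{g,i}
\qquad\Longrightarrow\qquad
\frac{\mathbb{E}[c_{g,i}]}{\mathbb{E}[c_{g,j}]} = \frac{\tau_{g,i}}{\tau_{g,j}}

% Averaging the counts over the L_g codons of gene g
\bar{c}_g = \frac{1}{L_g} \sum_{i=1}^{L_g} c_{g,i},
\qquad
\mathbb{E}[\bar{c}_g] = s \, m_g \, J_g \, \bar{\tau}_g,
\qquad
\bar{\tau}_g = \frac{1}{L_g} \sum_{i=1}^{L_g} \tau_{g,i}
```

Under these assumptions, the gene-level average tracks $m_g J_g$ only when the mean dwell time $\bar{\tau}_g$ is comparable across genes, which is why non-uniform pausing complicates simple averaging.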
Several ribosome profiling studies have now appeared in a variety of organisms
including yeast, E. coli, C. elegans, Arabidopsis, mouse, and human, in various conditions
including amino acid starvation, oxidative stress, and physiological conditions
[55]. In this work, we first focus on the model organism yeast and then move to a
higher-order organism, human.
Sample preparation is critical in each study and varies from dataset to dataset.
As previously discussed, cycloheximide can be used to halt translation, but it has
been shown to bias fragment extraction and produce artifacts in the fragment
counts [8]. Therefore, in this work, we use a modified
procedure with flash freezing instead of drug treatment (as in [56]).
Computational Techniques
In the setup described above, several ad-hoc methods simply take the average of the
counts in order to obtain an estimate of protein abundance. Similarly, in order to
obtain an estimate of the speed of the ribosome at a particular location, one could
divide the count at the position in question by the average of those in the window
around it. These approaches are complicated by the fact that the ribosome is not
translating each gene at uniform speed – particularly slow or fast positions can inflate
or deflate the average.
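As a minimal sketch of these ad-hoc estimates, the following Python snippet computes the per-gene average count and a windowed ratio for one position. The function names, the window size, and the choice to include the position itself in the window are assumptions made for illustration, not details taken from the cited methods.

```python
from statistics import mean

def naive_abundance(counts):
    """Per-gene average footprint count (the ad-hoc abundance estimate)."""
    return mean(counts)

def relative_dwell(counts, k, half_window=2):
    """Count at position k divided by the mean count in a window around k.

    The window includes position k itself (an assumption of this sketch).
    """
    lo = max(0, k - half_window)
    hi = min(len(counts), k + half_window + 1)
    window_mean = mean(counts[lo:hi])
    return counts[k] / window_mean if window_mean > 0 else float("nan")

# Toy profile: uniform counts except one slow codon at index 3.
profile = [10, 10, 10, 50, 10, 10, 10]
print(naive_abundance(profile))    # average is pulled up by the slow position
print(relative_dwell(profile, 3))  # the slow position also inflates its own window
```

In the toy profile, the single slow position raises both the gene average and the mean of its own window, which is the distortion described above.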
Another approach is to model the codons as sites on a 1D lattice and the movement
of the ribosome as a TASEP, a totally asymmetric simple exclusion process [102]. In
this approach, ribosomes enter the system with a certain rate and process each codon
(site) with a certain rate. This representation allows easy addition of components
such as ribosome drop-off via an exit rate at each unit. Several physics-based methods
treat variations of such a system by adjusting the boundary conditions, the input and
output rates per site, and/or the occupancy of each site in relation to those around it,
in order to represent physical properties like steric restrictions caused by ribosome
stacking due to a slow codon. However, the more complicated the model becomes,
the harder the analytical treatment. As such, these methods are forced to make
simplifying assumptions that make the model unrealistic (e.g. uniform translation
rate per gene) or forced to rely on simulations. These approaches use ribosome
profiling data either to apply system constraints or to find ideal parameter settings
from a series of simulations that attempt to re-generate the data.
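To make the lattice picture concrete, here is a small simulation sketch of a TASEP with random sequential updates. It is a deliberately simplified illustration: each ribosome occupies a single site (a real footprint spans about 10 codons), there is no drop-off, and the names `simulate_tasep`, `hop_rates`, and `init_rate` are hypothetical rather than taken from any cited method.

```python
import random

def simulate_tasep(hop_rates, init_rate, steps, seed=0):
    """Random-sequential-update TASEP on a 1D lattice of codons.

    hop_rates[k] is the probability that a ribosome at site k advances
    when that site is selected (for the last site, this is termination).
    Each site holds at most one ribosome. Returns the time-averaged
    occupancy per site and the number of completed proteins.
    """
    rng = random.Random(seed)
    n = len(hop_rates)
    occupied = [False] * n
    occupancy = [0] * n
    finished = 0
    for _ in range(steps):
        # Pick one candidate move: -1 is initiation, 0..n-1 are lattice sites.
        k = rng.randrange(-1, n)
        if k == -1:
            if not occupied[0] and rng.random() < init_rate:
                occupied[0] = True
        elif occupied[k] and rng.random() < hop_rates[k]:
            if k == n - 1:
                occupied[k] = False          # termination releases a protein
                finished += 1
            elif not occupied[k + 1]:        # exclusion: next site must be free
                occupied[k], occupied[k + 1] = False, True
        for i in range(n):
            occupancy[i] += occupied[i]
    return [o / steps for o in occupancy], finished

# One slow codon (index 4) on an otherwise fast message.
rates = [0.9] * 10
rates[4] = 0.05
occ, proteins = simulate_tasep(rates, init_rate=0.1, steps=50000)
print([round(o, 2) for o in occ], proteins)
```

With a slow site in the middle, ribosomes accumulate at and upstream of that site, which is the stacking behavior that makes analytical treatment of such models difficult.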
To the best of our knowledge, probabilistic models have not been used for analyz-
ing such data with high accuracy.
Translation in Higher-Order Organisms
Translation and other biological processes are considerably more complicated in
humans than in lower-order organisms like yeast. For example, humans are diploid
organisms, with two copies of each chromosome, with each copy potentially containing
different versions (alleles) of the genome at specific key sites called single-nucleotide
polymorphisms (SNPs). Even synonymous SNPs have been shown to induce differ-
ent phenotypes [106], but often the mechanism via which they act is not understood.
Is the speed of translation different for each allele? What biological factors affect
these translation-level changes? These are interesting questions toward understand-
ing the genetic basis behind translation (namely, what genome-level differences cause
associated changes in translation).
Another complication in higher-order organisms is that each gene can produce more
than one protein isoform (Figure 2.3). To describe this, we refine our definition of RNA. DNA
is first transcribed into pre-mRNA (pre messenger RNA). These transcripts encode
alternating regions of introns and exons, whereby the introns are removed via splicing
and only the exons are retained in the mature mRNA. This type of RNA, which we
simply refer to as mRNA or RNA in the remainder of the thesis, contains the exons
that are translated into a protein. However, in eukaryotes, the same template of exons
Figure 2.3: Alternative splicing. Splicing occurs when pre-mRNA produces different
mature mRNA copies. The diagram shows a pre-mRNA with alternating exon and intron
regions and two mature mRNAs, isoform 1 and isoform 2. Introns are always removed
from mature mRNA. The green exon is skipped in isoform 1, but kept in isoform 2.
Ribosome fragment counts (short black lines) that map to common exons can be mapped
to either isoform, but the green exon footprints can unambiguously be mapped to
isoform 2.
(the same gene) can be parsed differently to produce different proteins, or isoforms.
This process, called alternative splicing, can occur for example by skipping certain ex-
ons from an mRNA. Clearly, in a deep-sequencing context, when ribosome fragments
of lengths shorter than exons need to be mapped back to the genome, we encounter
identifiability issues. Properly attributing each ribosome-protected fragment to each
protein isoform is a difficult process and should be considered when interpreting the
data and model results.
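The identifiability issue can be illustrated with a small sketch that lists the isoforms compatible with a mapped fragment. The coordinates, the isoform names, and the function `compatible_isoforms` are hypothetical, and fragments that span an exon-exon junction are ignored for simplicity.

```python
def compatible_isoforms(frag_start, frag_end, isoforms):
    """Return names of isoforms with an exon that fully contains the fragment.

    Coordinates are genomic and half-open [start, end). `isoforms` maps a
    name to a list of (exon_start, exon_end) intervals.
    """
    hits = []
    for name, exons in isoforms.items():
        if any(s <= frag_start and frag_end <= e for s, e in exons):
            hits.append(name)
    return hits

isoforms = {
    "isoform1": [(0, 100), (300, 400)],              # middle exon skipped
    "isoform2": [(0, 100), (150, 250), (300, 400)],  # middle exon retained
}
print(compatible_isoforms(20, 50, isoforms))    # shared exon: ambiguous
print(compatible_isoforms(160, 190, isoforms))  # skipped exon: isoform2 only
```

A fragment in a shared exon is compatible with both isoforms, so its count cannot be attributed without further modeling, whereas a fragment in the alternatively spliced exon is unambiguous.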
2.3 Prediction of RNA Secondary Structure
As previously described, the structure of RNA is critical to its function. Structured
motifs in an RNA molecule permit or impede the binding of proteins and small
molecules, resulting in downstream effects on gene expression [129, 86]. For example,
presence of a pseudoknot (a specific structural motif) during elongation has been
shown to cause a shift in the reading frame of the ribosome, which disturbs the
parsed 3-codon periodicity and can lead to an amino acid mis-incorporation that
renders the protein non-functional [119].
High-accuracy experimental techniques for measuring RNA structure are typically
expensive, low-throughput, and can only be achieved in vitro, which doesn’t always
reflect the folding kinetics in a live organism. Consequently, computational meth-
ods have been developed to predict structure from the RNA sequence. While the
ultimate goal of RNA structure modelling methods is to determine a complete three-
dimensional structure, this is currently an extremely challenging task [108]. The 3D
structure includes many different forces beyond those at the “secondary structure”
level, such as long-range forces that play a role in the final structure but are not
well understood and are hard to model. As such, much effort has focused instead on the
more tractable problem of determining secondary structure: the set of intra-molecular
complementary Watson-Crick basepairs (A pairs with U, and C pairs with G). Suc-
cessful prediction of secondary structure is an important step towards a complete
three-dimensional model of an RNA molecule; many 3D structure prediction algo-
rithms use a putative secondary structure as a scaffold for determining higher order
tertiary interactions (e.g., pseudoknots) [108, 93, 100]. Despite many advances in
computational techniques, such as the use of machine learning, prediction accuracy has
mostly remained around 50-70%, varying with the class of structures, the length of
the RNA, and other factors. Besides the pseudoknot, there are several other common
motifs in RNA structure (Figure 2.4). In general, C-G basepairs are more energetically
stable (have lower free energy) than A-U basepairs.
When creating computational methods for secondary structure, there are three
competing axes we want to optimize on: speed, accuracy, and generality. Speed is
often described in terms of Big-O notation relative to the length of the RNA strand
in question. Accuracy is relative to the ground truth structure. Generality refers to
Figure 2.4: Common structure motifs (hairpin, stacked pair, stem, internal loop,
loop, and pseudoknot). The top row is a cartoon representation of the folded RNA.
The bottom row is an arc diagram where the bases are ordered from the 5' to the 3'
end of the region and connected by an arc if they are paired. A loop is a set of
unpaired bases and a stem is a set of paired bases.
which types of motifs are allowed in the structures. In this work, we will focus on
secondary structure motifs. We will refer to secondary structure, or simply structure,
as the set of Watson-Crick basepairs without pseudoknots. In the arc diagrams of
Figure 2.4, these are motifs without crossing arcs. Although we will not analyze
running time complexity in this work, it is important to note that as algorithms
handle more complex structures exactly, their running time often increases, which
makes long sequences (over 1000 nt) difficult to predict on or include in training sets.
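The "no crossing arcs" condition can be checked directly from a list of base pairs. The sketch below is illustrative only; the function name `has_pseudoknot` and the representation of a structure as a list of index pairs are assumptions.

```python
def has_pseudoknot(pairs):
    """True if any two base pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    # Normalize each pair so that the 5' index comes first, then sort by it.
    pairs = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            if i < k < j < l:
                return True
    return False

print(has_pseudoknot([(0, 9), (1, 8), (2, 7)]))  # nested stem
print(has_pseudoknot([(0, 5), (3, 8)]))          # crossing arcs
```

Nested or side-by-side pairs pass the check, while two pairs whose arcs cross are flagged as a pseudoknot and fall outside the class of structures considered here.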
Experimental Techniques
Individual RNA structures are most accurately determined through low-throughput
experimental means, such as NMR spectroscopy [42], X-ray crystallography [16, 60],
or chemical and enzymatic probing methods [37, 132]. The former two are both time-
consuming and expensive, but the recent combination of the latter methods with
high-throughput sequencing has led to the development of several genome-wide RNA
structure-probing assays [142, 62, 125, 77, 31, 105]. These assays reveal which nu-
cleotides are paired and which are not, but cannot determine specific pairing partners.
In this thesis, we will be focusing on the latter high-throughput assays, consisting of
three major structure-probing approaches: PARS, DMS, and SHAPE.
In the PARS assay [62], the RNA structure signal is obtained by treating RNA
with enzymes that preferentially cleave either paired or unpaired nucleotides. These
cleaved fragments are of different lengths depending on the location of the paired/un-
paired base and hence can be mapped back to the genome to reveal how likely that
position was to be paired or unpaired. These counts are combined to form a score
per base representing structured-ness.
The DMS-seq assay [105] relies on the reactivity of unpaired nucleotides to a
small molecule, dimethyl sulfate (DMS). Reactive positions block reverse
transcriptase, again leaving pieces which can be mapped back to the genome for a
score per base representing unstructured-ness. The DMS-seq assay was applied to
both renatured RNA and live yeast, giving us a glimpse into both in vitro and in vivo
settings.
Finally, SHAPE-seq [83] is a chemical-probing method using selective 2'-hydroxyl
acylation analyzed by primer extension. In this chemistry, a reagent reacts with
single-stranded sequence and similarly blocks reverse transcriptase. These data are
thought to be less biased [36], but a large-scale assay has yet to be developed.
Computational Techniques
RNA secondary structure prediction methods can be broadly classified into energy-
based methods and methods based on statistical models. Energy-based prediction
methods, or algorithms based on thermodynamic models [144, 80, 101], compute a
minimum free-energy (MFE) secondary structure using experimentally derived
energies for each template motif. For example, a stacked pair of A-U followed by C-G
contributes a certain free energy that has been measured in an experimental setting;
the combinations of these motifs are then explored via a dynamic programming
algorithm to derive the set of pairings with the lowest total energy.
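The dynamic programming idea can be sketched with a much simpler stand-in for a true MFE model: a Nussinov-style recursion that maximizes the number of nested Watson-Crick pairs instead of minimizing measured free energies. This is not the energy model used by the cited methods; the function name `max_pairs` and the minimum loop length of 3 are assumptions for illustration.

```python
def max_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested Watson-Crick pairs.

    best[i][j] holds the optimum for the subsequence seq[i..j]. A pair
    (k, j) requires at least `min_loop` unpaired bases between k and j.
    """
    can_pair = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j is unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with k
                if (seq[k], seq[j]) in can_pair:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(max_pairs("GGGAAACCC"))  # hairpin: three G-C pairs around an AAA loop
```

An MFE algorithm has the same recursive shape, but each case adds the measured energy of the motif it closes (stack, hairpin loop, internal loop) and the recursion takes a minimum.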
Methods based on statistical models, on the other hand, rely on data from a
training set of sequences and their known structures in order to learn a model of
secondary structure. In general, statistical methods for RNA secondary structure
prediction outperform energy-based methods [103, 97]. CONTRAfold [33] is an ex-
ample of one of the leading statistical algorithms for pseudoknot-free prediction, and
the one which we will extend in this work. CONTRAfold is a conditional log-linear
model modeling the probability of a structure given a sequence using a weighted sum
of features reflecting those included in MFE-based models. For example, a feature
could be the indicator that we see an (A-U, C-G) stack at positions (i, j), (i+1, j−1).
Similar to the dynamic programming approach for MFE models, we can write a set of
recursions that are solved via a version of the inside-outside algorithm commonly
used for stochastic context-free grammar models in natural language processing [35].
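For reference, a conditional log-linear model of this kind has the following general form. The notation is an assumption introduced here for illustration: x is the sequence, y a candidate structure, F(x, y) the vector of feature counts, w the learned weights, and Y(x) the set of pseudoknot-free structures for x.

```latex
P(y \mid x; w) \;=\;
  \frac{\exp\!\big(w^{\top} F(x, y)\big)}
       {\sum_{y' \in \mathcal{Y}(x)} \exp\!\big(w^{\top} F(x, y')\big)}
```

The denominator sums over exponentially many structures, which is why it is computed with the inside recursions rather than by enumeration.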
More recently, structure-probing data such as SHAPE and PARS have been used
in conjunction with computational methods in order to infer complete RNA structures
[81, 27, 99, 131, 92, 48]. Thus far, such methods have been heuristic derivatives of
thermodynamic models and do not explicitly model the structure-probing data.
Chapter 3
A Model for Translation
3.1 Introduction
The translation of RNA into protein is the nexus of decoding genetic information
into functional polypeptides and also a central biosynthetic process consuming a sub-
stantial fraction of the cell’s resources. Although apparently redundant nucleotide
sequences encode each protein, usage of different synonymous codons is highly bi-
ased [95]. These preferences are strongest in highly-expressed genes throughout di-
verse organisms [79, 51], suggesting selective pressure for the efficient use of the trans-
lational apparatus during the synthesis of abundant proteins. At the same time, less
common codons may be used in order to modulate translation, or may arise due
to competing sequence constraints such as mRNA secondary structure. While the
evolutionary signature of codon bias is clear, its biochemical basis remains unsettled.
Ribosome profiling [57] is an emerging technique for profiling translation in vivo
that is well suited to provide insights into the factors controlling the speed of transla-
tion as well as the amounts of each protein produced by the cell. Ribosome profiling
data comprise a set of ribosome-protected fragments (footprints) marking ribosome
density along mRNA transcripts with codon resolution. We can therefore extract
CHAPTER 3. A MODEL FOR TRANSLATION 19
from these data both the yield of each protein (protein synthesis rate) and the rate at
which each codon is translated (codon translation rate or elongation rate). However,
estimation of these two quantities is nontrivial, and ad-hoc approaches disregard dif-
ferences in elongation rates between genes or exclude mRNAs with sparse footprint
coverage. A number of studies with different analysis approaches present varying
hypotheses for the mechanisms underlying variation in elongation and translation ef-
ficiency in yeast and other organisms [123, 121, 58, 122, 113, 98, 17, 107, 136, 70, 43].
These include codon effects mediated by tRNA abundance or wobble base pairing, as
well as effects of mRNA structure and the nascent peptide on the ribosome.
Here, we present a rigorous statistical method that estimates, from ribosome pro-
filing data, both elongation rates and protein synthesis levels on individual transcripts;
as a byproduct, it also estimates translation efficiency (TE), the propensity of a tran-
script to generate complete protein, defined as the total amount of protein produced
from an mRNA message, and calculated here as our model-derived protein synthesis
rates divided by the mRNA levels. We use our robust modeling framework in con-
junction with new high-resolution data from wild-type yeast, along with three tRNA
mutants, to explore some of the conflicting views on the causality between codon
usage and elongation rate, as well as between codon usage and TE, in physiological
conditions at a genome-wide level.
We first apply our model to examine biological factors contributing to local trans-
lation kinetics. Due to differences in tRNA levels that correlate with synonymous
codon bias, variability in codon translation rates observed per gene is commonly
thought to be governed by the abundance of cognate tRNAs [126, 110]. However,
codon bias does not correlate with indirect measures of decoding speed, at least in bac-
teria [12, 25]. Similar to other observations in ribosome profiling datasets [73, 98, 17],
we find that codon usage bias is a poor predictor of elongation rate. We further test
for causal influence and illustrate that experimentally manipulating tRNA abundance
or body similarly does not affect the elongation rate when decoding with the manip-
ulated tRNA. In addition, our model identifies positions where elongation is slower
than expected based on codon identity and suggests that such pauses commonly occur
closer to the 5’ end but are unrelated to codon bias.
Finally, we use our model to disentangle the factors underlying message-specific
differences in translational efficiency. In physiological conditions, initiation rather
than elongation may largely determine overall protein production; initiation predom-
inates when it is slow relative to the time needed to elongate through the width of
one ribosome (∼10 codons), so that translating ribosomes rarely interfere with each
other, and when elongation is highly processive, so that most initiation events re-
sult in a protein [5, 13, 7, 68]. Analysis of our tRNA-perturbed mutant experiments
shows that efficiency is not causally affected by improving tRNA levels, leading us to
focus on initiation signals in understanding variation in translational efficiency across
different messages. Several causes for slow initiation have been proposed: codon bias
at the 5’ end [123, 122], secondary structure [67, 46, 62, 122, 61, 145], and gene
length [6, 69, 29]. We find that a Kozak-like initiation motif [65] and lack of structure
around the start codon are predictors of TE. Overall, our experimental and analytical
results provide support to a previously proposed model in which initiation is rate-
limiting in physiological conditions [13], in which initiation rate is affected largely
by mRNA sequence features, and where translational efficiency is not significantly
affected by codon usage [5, 13]. In contrast with experiments in non-physiological
conditions, our results endorse the resulting explanation that, in endogenous con-
ditions, perhaps in combination with other pressures, selection for efficient use of
ribosomes and associated factors in the synthesis of highly-translated proteins is a
potential driver of the observed codon usage biases.
This work was conducted in collaboration with Silvi Rouskin and Jonathan S.
Weissman at University of California, San Francisco (for the ribosome profiling ex-
periments) and Lu Han and Eric M. Phizicky at the University of Rochester Medical
Center (for the aminoacylation experiments). The computational methods and anal-
yses were conducted by myself under my advisor Daphne Koller.
3.2 Results
3.2.1 Queuing Model for Elongation
To extract high-quality estimates of protein synthesis rates and codon translation
rates from the ribosome footprint data, we model the process of ribosome flow, using
gene- and codon-dependent parameters, and the physical sampling that occurs in the
experimental protocol from which these data are derived. Our design choices are
motivated by potential biases in the data including sparse footprint counts for low
abundance genes, biases due to the position along the mRNA, and biases due to the
identity of the mRNA.
Our model inputs are the set of ribosome footprint counts d at each codon in the
genome, sparsely sampled (due to sequencing depth) from an (unobserved) steady-
state distribution π. In particular, dmk is the observed footprint count at position k in
mRNA message m and πmk encodes the fraction of ribosomes at (m, k). Consequently,
the distribution must satisfy flow conservation constraints: if ribosomes do not fall
off the message, then due to conservation of matter, the protein synthesis rate Jm for
message m (the ribosome flow out of the stop codon) must be the same as the flow
Jmk from any position k on m. If we define µmk as the dwell time of the ribosome at
(m, k), flow conservation also implies that rapidly translating positions (small µmk)
are occupied for a smaller fraction of time (small πmk) than positions that are slow
to translate. The dwell time µmk is the inverse of the rate at which the ribosome
Figure 3.1: Model of protein synthesis. Ribosomes initiate translation with a protein
synthesis rate or flow (J) of ribosomes. This is conserved across the strand, so that
at each residue (m, k) the flow depends on the dwell time of the ribosome (µ) and the
ribosome occupancy (proportional to footprint count d). Slower positions, for example
(m, 2) compared to (m, 1), can inflate the average footprint count per gene and must
be accounted for when estimating flow. Dwell times and flow are correlated with local
and global cis-features. Diagram legend: m = gene; k = position on gene; dmk =
ribosome footprint count at (m, k); Jm = flow per m; µmk = dwell time at (m, k). The
count at a position equals flow times dwell at that position, dmk = Jm µmk; in the
example profile, µm1 < µm2 > µm3 > µm4 < µm5.
elongates off of position (m, k) and so intuitively depends on the amount of time the
ribosome requires to perform one elongation step (recruit tRNA, form the peptide
bond, and translocate). Thus, at steady-state, flow Jmk is proportional (up to a
constant encoding the number of ribosomes in the system) to πmk/µmk, where we use
dmk throughout as our observed proxy for πmk. Figure 3.1 shows the relationship
between the variables.
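The relationship between counts, flow, and dwell time can be checked on a toy message. The numbers below are hypothetical and the proportionality constant is set to 1, so this is only a sketch of the flow conservation constraint, not of the estimation procedure.

```python
def expected_counts(flow, dwell_times):
    """Expected footprint count at each position: d_mk = J_m * mu_mk."""
    return [flow * mu for mu in dwell_times]

def flows_from_counts(counts, dwell_times):
    """Per-position flow J_mk = d_mk / mu_mk; constant under flow conservation."""
    return [d / mu for d, mu in zip(counts, dwell_times)]

dwell = [1.0, 3.0, 2.0, 0.5, 1.5]  # hypothetical dwell times mu_m1..mu_m5
J = 4.0                             # hypothetical flow J_m
d = expected_counts(J, dwell)
print(d)                            # counts track the dwell times
print(flows_from_counts(d, dwell))  # dwell-corrected counts recover J everywhere
print(sum(d) / len(d))              # uncorrected average = J * mean dwell, not J
```

The uncorrected average mixes flow with the average dwell time of the message, which is why messages with many slow positions would appear more highly translated under simple averaging.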
We use the counts {d} to estimate the quantities {µmk} and {Jm} in a novel
probabilistic regression accounting for flow conservation and assuming steady-state
and no ribosome fall-off. Briefly, we optimize over two terms:
$$\max_{\{\mu_{mc}\},\,\{\mu_c\}} \;\; \log \prod_{m,k} \mu_{mc}^{\,d_{mk}/J_m} \exp(-\mu_{mc}) \;-\; \Big[\sum_{m,c} w_{mc}\,\big(\log \mu_{mc} - \log \mu_c\big)^2\Big]$$
The first term is a standard likelihood term for the data, using a model encoding
flow conservation. Since a single ribosome profiling dataset does not contain enough
data to robustly infer a separate µmk for each (m, k), we use the same dwell time
µmc for every occurrence of the same codon c within message m, making µmc an
expected dwell time for codon c on message m. The second term additionally softly
constrains µmc to be similar to a global codon dwell µc, based on the intuition that
the same codon behaves similarly throughout the cell. To optimize the objective, we
(1) estimate the dwell times µmc and µc with flow Jm fixed and (2) set flow Jm to be
the average of the flows Jmk (namely, the dwell-corrected footprint counts dmk/µmk)
across each message: $J_m = \frac{1}{L_m} \sum_{k \in m} d_{mk}/\mu_{mk}$
(see Materials and Methods for details).
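The two quantities involved in this alternating scheme can be sketched for a single message as follows. This is an illustration under stated assumptions, not the thesis implementation: the likelihood term is assumed to have the Poisson-like form with exponent d/J, the weight is a single hypothetical constant, and the function names `update_flow` and `objective` are invented here. The optimization of the dwell times in step (1) is not shown.

```python
import math

def update_flow(counts, codons, dwell):
    """Step (2): J_m = (1 / L_m) * sum_k d_mk / mu_mk for one message.

    counts[k] is d_mk, codons[k] is the codon at position k, and dwell[c]
    is the message-specific dwell time mu_mc shared by all occurrences of c.
    """
    return sum(d / dwell[c] for d, c in zip(counts, codons)) / len(counts)

def objective(counts, codons, flow, dwell, global_dwell, weight=1.0):
    """Likelihood term minus the soft penalty, for a single message."""
    loglik = sum((d / flow) * math.log(dwell[c]) - dwell[c]
                 for d, c in zip(counts, codons))
    penalty = sum(weight * (math.log(dwell[c]) - math.log(global_dwell[c])) ** 2
                  for c in dwell)
    return loglik - penalty

counts = [8, 24, 8, 24]
codons = ["AAA", "AGG", "AAA", "AGG"]
dwell = {"AAA": 2.0, "AGG": 6.0}   # hypothetical mu_mc values
flow = update_flow(counts, codons, dwell)
print(flow, sum(counts) / len(counts))  # dwell-corrected flow vs naive average
print(objective(counts, codons, flow, dwell, global_dwell=dwell))
```

When the message-specific dwell times equal the global ones the penalty vanishes; moving them apart lowers the objective, which is how the second term shares information across messages.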
We ran our model on a ribosome profiling dataset gathered for Saccharomyces
cerevisiae in rich medium, using a flash-freezing technique as described before [56]
(see Materials and Methods). To verify the validity of our estimated parameters, we
compared our protein synthesis rate Jm to two external measures of protein
abundance: GFP-based levels from Newman et al. [88] and mass-spectrometry-based
levels from de Godoy et al. [26]. We obtained strong correlations (Pearson r = 0.789
and 0.680, respectively, p = 0). These improve on the protein abundance estimates from Ingolia et
al [57], computed as the simple average of (uncorrected) footprint counts per message
(Figure A.2). While correlation with these standard estimates of protein abundance
is reassuring, these methods have general limitations such as ascertainment bias for
less abundant proteins as well as technical limitations such as the impact of fusion
tags on protein levels. In addition, ribosome profiling measures translation and pro-
tein synthesis, but steady-state protein abundance is also affected by rates of protein
degradation.
While the protein synthesis flux is perhaps the most obvious interesting quantity
that can be extracted from profiling data, we can also derive other quantities of inter-
est from our learned model parameters. We compute translation efficiency TEm of a
given mRNA molecule m by dividing protein synthesis rate Jm by mRNA transcript
levels Mm, derived from mRNA fragment data collected separately in the ribosome
footprinting experiment. We can identify codon-dependent effects on translation from
differences in µc. By looking at footprint count deviation from expected dwell time at
each (m, k), we can also examine differences among codons on the same message. In
the following sections, using the parameters estimated under our robust probabilistic
framework, we perform a comprehensive analysis of the biological factors influencing
local and global dynamics of translation.
3.2.2 Codon Translation and tRNA Manipulation
A number of studies in Escherichia coli initially identified codon usage and the avail-
ability of tRNA as the dominant force for codon translation rate [126, 110]. Later
studies found no correlation between measured rates and tRNA abundance or codon
frequency [12, 25, 111]. However, all of these studies measured translation speed indi-
rectly, on individual and potentially idiosyncratic reporter systems. We explore these
competing hypotheses in the physiological conditions of our yeast data set. If tRNA
abundance were rate-limiting for elongation, we would expect a positive correlation
between codon translation rate and tRNA abundance. However, as shown in Figure
3.2, the correlation is not significant (Spearman r = 0.144, p = 0.380 for Cy5 and r
= 0.133, p = 0.417 for Cy3 from microarray tRNA measurements [32]). A similar
result (r = 0.210, p = 0.104) is also obtained when comparing to tAI, a measure
of codon bias based on tRNA gene copy number relative to the overall collection of
isoacceptor tRNAs [34]. If we restrict the analysis to the slowest synonymous codon
(in terms of tAI), to the fastest, or to the average per amino acid, the correlation to
tAI does not improve: r = -0.12 (p = 0.61), r = -0.29 (p = 0.22), and r = -0.32
(p = 0.18), respectively. Finally, the same insignificant correlation exists in the raw
footprint data (r = 0.112, p = 0.392; baseline method for rate described in Materials
and Methods) and was also observed in another analysis of the yeast data set from
Ingolia et al [57], in which codon dwell time was estimated as the ratio of observed
codon frequencies in the footprint data relative to expected codon frequencies in the
mRNA fragment data [98].
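The frequency-ratio estimate of codon dwell time described above can be sketched as follows. How the cited analysis assigns a codon to each footprint or fragment is not specified here, so the inputs are simply lists of codon labels, and the function names are hypothetical.

```python
from collections import Counter

def codon_frequencies(codon_observations):
    """Relative frequency of each codon in a list of observed codons."""
    total = len(codon_observations)
    return {c: n / total for c, n in Counter(codon_observations).items()}

def dwell_ratio(footprint_codons, mrna_codons):
    """Observed footprint frequency over expected (mRNA fragment) frequency."""
    fp = codon_frequencies(footprint_codons)
    rna = codon_frequencies(mrna_codons)
    return {c: fp.get(c, 0.0) / rna[c] for c in rna}

footprints = ["AAA"] * 3 + ["AGG"] * 1  # toy codon labels from footprints
fragments = ["AAA"] * 2 + ["AGG"] * 2   # toy codon labels from mRNA fragments
print(dwell_ratio(footprints, fragments))
```

A ratio above 1 means a codon is over-represented among footprints relative to its abundance in mRNA, which is read as a longer relative dwell time.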
Our analysis of elongation rates on endogenous mRNAs in the context of the co-
adapted cellular tRNA pool addresses the effects of codon usage in natural physiology,
but may be confounded by this co-adaptation and cannot directly test the causal
links between various correlated mRNA features. To measure the effect of tRNA
abundance on codon translation rate directly, we created three mutant yeast strains
to test whether (1) tRNA over-expression speeds up translation, (2) the tRNA body
itself causes the tRNA-dependent rate effect observed in other studies, or (3) depletion
of tRNA slows down ribosomes. In our first mutant, AGG-OE, the tRNA recognizing
AGG (namely, tRNAArg(CCU)) was over-expressed on a high-copy plasmid; in mutant
AGG-QC, the body sequence of the tRNA recognizing AGG was swapped with the
body of a more preferred tRNA (as measured by tAI); and in mutant ACA-K, 3
out of 4 copies of the tRNA recognizing ACA were deleted from the genome. The
AGG mutants had a URA marker and were compared against a wild-type sample
with a URA plasmid (see Materials and Methods). For ACA-K, we checked that
the abundance of the tRNA for ACA (namely, tRNAThr(UGU)) did decrease to about
30% of wild-type (Table A.2). In the AGG-OE mutant, we measured the amount of
total and aminoacylated tRNA for tRNAArg(CCU) (see Materials and Methods) and
verified that the tRNA was over-expressed by 13.8-fold (+/- 0.4), based on an analysis
of two independently derived RNA samples, and remained charged at a level similar
Figure 3.2: Correlation between codon translation rates and measures of codon usage
bias. Left: Insignificant Spearman correlation between estimated codon translation
rates (scaled up by a factor of 1000) and tRNA abundance from microarray measurements
using either fluorophore Cy3 or Cy5 [32] on 39 codons with measured levels (Cy5:
r=0.144, p=0.380; Cy3: r=0.133, p=0.417). Right: The same correlation but to tAI is
also not significant (r=0.210, p=0.104). Axes: codon translation rate versus tRNA
abundance (1000s), and codon translation rate versus tAI.
to wild-type (87%) (Figure A.2). For the AGG-QC mutant, we similarly verified
that the amount of charged tRNAArg(CCU) was similar to wild-type (Figure A.2).
We generated ribosome profiling data and ran our model on these mutants to test
whether AGG codons are translated faster in AGG-OE and AGG-QC and whether
ACA codons are translated more slowly in ACA-K. We observe no significant change in
the elongation rates of the affected codon in any of the three mutants compared to
wild-type (Figures 3.3 and A.2); the overall correlation between ACA-K and wild-type is not
as tight as for other mutants, but this is due to changes affecting all codons, not only
ACA. We verified the result by inspecting the footprint counts at the perturbed codon
relative to adjacent counts in the mutants compared to wild-type and saw no unusual
increase or decrease (Figure A.2). One prevailing hypothesis [133] is that the amount
of charged as opposed to total tRNA is the true predictor of codon elongation; our
measurements of aminoacylated tRNA suggest that these levels were manipulated as
expected and that this is not a confounding factor in the mutant samples. Hence, our
results suggest that several-fold changes in tRNA abundance do not affect ribosome
dwell time.
3.2.3 Translation Efficiency and tRNA Manipulation
One of the major goals of codon optimization in biotechnology is an increase in protein
yield. Studies done on transgenes expressed at a large fraction of cellular mRNA
abundance report increased protein abundance when the mRNA was optimized for
codon bias [47, 71, 14], suggesting that codon usage contributes to efficiency [118,
121]. However, other studies observed that optimizing codon adaptation of a reporter
does not significantly improve TE or protein yield [137, 67, 133, 50, 72, 107]. Our
experiments likewise provide support for the view that the TE of endogenous mRNAs
is unchanged by effective codon optimization achieved by changes in the tRNA pool
(Figure 3.2.3). We find that increasing tRNA abundance or replacing the tRNA body
Figure 3.3: Comparison between codon translation rates in wild-type and mutants.
Correlation between estimated codon translation rates in wild-type versus mutant
for the three mutant samples (the manipulated codon is highlighted in red). Rates
are normalized by the minimum one in each sample. Pearson correlations are nearly
exact, indicating that the mutant rates are generally unaffected. Panels (mutant
rate versus wild-type rate): AGG-OE, r=0.99, p=2e−55; AGG-QC, r=1.00, p=2e−62;
ACA-K, r=0.91, p=1e−24.
sequence by one with higher tAI does not improve efficiency: most genes remain
unchanged in TE between the wild-type and mutant samples (Pearson r = 0.96 for
AGG-OE and r = 0.95 for AGG-QC). Further, the top 200 genes that do deviate
most in TE relative to the wild-type sample have mutant TE that is both lower
(reduced TE genes) and higher (increased TE genes) compared to wild-type, with
bias towards reduced TE genes (123 reduced vs 77 increased for AGG-OE and 133 vs
67 for AGG-QC). In AGG-OE, we observe no correlation between the fraction of AGG
codons per message and the change between mutant and wild-type TE (Spearman
r = 0.00002, p = 0.99); we would expect a positive correlation if increasing tRNA
abundance increased TE. Further, despite the many-fold overexpression of tRNA,
the correlation between TE and fraction of codon per message for AGG is not higher
than the correlation for any of the other codons (Figure 3.2.3). AGG-QC behaves
similarly, such that manipulating the tRNA to be “faster” does not lead to a scenario
where AGG outperforms other codons in affecting translation efficiency. Finally, these
observations also hold if we look at protein synthesis rates instead of TE (Figure A.2).
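The per-codon comparison above can be illustrated with a minimal sketch. This is not the code used for this work: the function names and the `(cds, te_wt, te_mut)` tuple layout are hypothetical, and counting every in-frame codon (including the stop codon) in the denominator is an assumption. It computes the fraction of a given codon per message and its Spearman correlation with log(TE-mut/TE-wt).

```python
import math

def codon_fraction(cds, codon):
    """Fraction of in-frame codons in `cds` equal to `codon`."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return codons.count(codon) / len(codons)

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def te_shift_vs_codon_usage(genes, codon):
    """genes: list of (cds, te_wt, te_mut) tuples (hypothetical layout)."""
    fractions = [codon_fraction(cds, codon) for cds, _, _ in genes]
    log_ratios = [math.log(te_mut / te_wt) for _, te_wt, te_mut in genes]
    return spearman(fractions, log_ratios)
```

Repeating `te_shift_vs_codon_usage` for each of the codons would give one correlation per codon, which is the quantity compared across codons in the text; significance testing is omitted here.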
While improving codon optimization by changes in tRNA structure or abundance
does not seem to causally affect TE, we do see evidence for a modest impact from
tRNA depletion (Figure 3.2.3). Mutant and wild-type TEs are generally correlated in
the ACA-K mutant (Pearson r = 0.96). Although there are more reduced TE genes
than increased TE genes (127 versus 73), this difference is not significant via a per-
mutation test (see Materials and Methods). However, we find a negative correlation,
the lowest of all codons, between the fraction of ACA codons per message and the
change in TE between mutant and wild-type (Spearman r = -0.08, p < 10^-8), as
we would expect if decreasing tRNA abundance decreases TE through a direct effect
on its cognate codon. One explanation is that tRNA reduction could compromise
TE if the demand is higher than the supply: the number of ACA occurrences in the
genome is about the average number of occurrences over all codons, but we reduced
[Figure 3.4, left panels: log(TE-wt) (x-axis) against log(TE-ACA-K), log(TE-AGG-OE), and log(TE-AGG-QC) (y-axes). ACA-K: 73 increased, 127 reduced, r=0.96; AGG-OE: 77 increased, 123 reduced, r=0.96; AGG-QC: 67 increased, 133 reduced, r=0.95. Right panels: correlation between log(TE-mut/TE-wt) and % codon per gene, one point per codon, for each mutant.]
Figure 3.4: Comparison between translation efficiency in wild-type and mutants. Left: Wild-type TE compared to mutant TE for the three mutant samples. Strong Spearman correlations shown suggest TE is generally unaffected by tRNA manipulation. Right: Spearman correlation, for each codon, between the ratio of mutant TE to wild-type TE and the percent of codon per gene. Significant correlations are shown as filled dots. For AGG mutants, the correlation is not higher for the manipulated codon (highlighted) than for other codons, indicating that optimizing codon usage does not affect TE. For ACA-K, the correlation is negative for the ACA codon, suggesting a mild effect.
its levels below those of any other tRNA. However, if protein synthesis and thus TE
are controlled by initiation, this implies some feedback from slowed elongation on ini-
tiation, whereby affected ACA codons might stack ribosomes. In particular, reduced
TE genes compared to increased TE genes have slower-than-expected codons closer
to the 5’ end and stronger pausing in the first 100 codons (Figure A.2; significant un-
der Kolmogorov-Smirnov test; see next section for definition of slower-than-expected
codons as “outliers”). These confounding factors might contribute to the decrease
in TE for ACA-heavy genes. Alternatively, ribosome stacking at ACA codons could
induce fall-off and reduced processivity that manifests as decreased TE.
To situate our results in the context of many previous studies on codon bias and
tRNA abundance, we note that our observation focuses on endogenous messages with
physiological or near-physiological tRNA levels. When the tRNA pool is limited
compared to the number of free ribosomes, as in strong overexpression of transgenes,
simulations indeed show that large demand for tRNAs can be rate-limiting [22, 23,
107]. Experiments showing rate-limiting effects of tRNA abundance likely operated in
this non-physiological regime. In addition, manipulation of codon usage rather than
the tRNA abundance can perturb mRNA structure and other non-coding sequence
features; our experiment is less susceptible to those issues.
3.2.4 Factors Correlating with Elongation Efficiency
The notably modest effect of dramatic changes to the tRNA pool motivates the ques-
tion: what signals do affect elongation efficiency and translation efficiency? We first
take advantage of the ribosome profiling data to understand elongation efficiency (the
time for a ribosome to finish translating a transcript once initiated) by studying rate-
limiting elongation signals via inspection of outliers in the footprint counts. Based
on the observed footprint counts and our model parameters for expected codon dwell
time, we define slow outliers and fast outliers at each position k along a message m as
positions where ribosomes are stalled more or less than expected, respectively. We de-
note their deviation from expected dwell time as outlier strength ∆mk (see Materials
and Methods). We considered a broad array of potential correlates of ∆mk, based on
literature hypothesizing their association with variation in codon translation rate or
pausing, classified into eight categories (Table A.2): position on message, structure in
downstream windows, protein folding, wobble basepairs, reuse of tRNAs from nearby
codons, downstream RNA binding protein motifs, nascent peptide effects, and global
features. Table A.2 shows these correlations, which include significant features in the
position, structure, wobble, and nascent peptide categories. We discuss these below
and in Appendix A.
The strongest correlation to outlier strength for slow outliers is proximity to the 5’
end, with larger pauses occurring closer to the beginning of a message, even relative to
gene length or even when aligned by stop codon as opposed to start codon (position
from 5’ correlates to ∆mk with Spearman r = -0.043; position from 5’ per length
with r = -0.144; and position from 5’ end with r = 0.162, p ≈ 0 for all). Similar
observations of increased ribosome occupancy at the 5’ end have produced various
hypotheses for the causal basis. In the “ramp” model [123], the presence of more slow
codons (low tAI) at the beginning of a message is thought to separate ribosomes early
to avoid the wasteful expenditure of resources on stacked, idling ribosomes. However,
we observe a correlation between position from 5’ end and slow outlier strength even
when conditioning on the codon (Figure 3.2.4), and thereby controlling for differences
in codon usage at different positions within the gene, suggesting that there is an
initial low translation speed, regardless of codon usage, which gradually increases as
translation proceeds. Additionally, our model helps account for length, position, and
abundance biases when calculating outliers in a particular message in two ways: first,
we include message-specific codon dwell times, and, second, we exclude the first 100
codons from each gene during model learning (see Materials and Methods) to avoid
inflating or otherwise biasing the expected rates µmc and µc. Our analysis indicates
that pausing occurs at the 5’ end, even after accounting for major factors such as
codon bias and gene length.
Other explanatory signals have been suggested for pausing in ribosome profiling
datasets [113, 73, 17]. Our analysis shows a (mild) correlation between pausing and
computationally-predicted downstream mRNA secondary structure (Spearman r =
0.021, p ≈ 0 with structure measured by the density of stems). This correlation
is reproduced when considering experimentally derived in vivo structure data from
high-throughput DMS probing of unpaired A and C bases [105] (r = -0.033). It
is also maintained when we restrict our analysis to slow outliers in the first 100
codons (r = 0.015 for density of stems and a similarly reduced r = -0.026 for in
vivo energy, potentially due to genes with short UTRs and the decreased reliability of
DMS structure probing data at ≈20nt or less from the 5’ end), and so the effect is not
necessarily caused by structure elsewhere on the strand. Single molecule experiments
with bacterial ribosomes [19] found that some hairpin and pseudoknot constructs
at varying distances downstream of the active codon can slow down the ribosome;
structural energy could therefore potentially contribute to the excess ribosome density
at the 5’ end. We also see a positive correlation on that same order of magnitude
between slow outliers and the number of proline codons in the two sites upstream of
the active codon (r = 0.069, p ≈ 0), as observed in other organisms [58, 136]. Two
correlations that we observed are not expected on the basis of previous studies. A
study showing pausing specifically at CGA [72] suggests slower elongation on wobble
base pairs, whereas we observe the opposite correlation; this discrepancy might arise
because the wobble effect is limited to a few specific codons, or to repeated wobble
codons, or because of an incomplete characterization of codon / anticodon pairings
which limits our assignment of wobble decoding. The correlation to charge observed
by Charneski & Hurst [17] holds in sign but not in significance even when considering
[Figure 3.5: per-codon Spearman r (y-axis) between outlier strength and position per length from 5’ end for slow outliers, plotted against codon tAI (x-axis).]
Figure 3.5: All codons show negative correlation between outlier strength and proximity to gene start. Correlation between slow outlier strength and position per length from 5’ end, conditioned by the codon, plotted against codon tAI. For each codon c, we calculate the Spearman correlation for outlier strength ∆mk and position per length from 5’ end (k/Lm) but restricted to the (m, k) that satisfy codon(m, k) = c. All codons except one (hollow circle), which has the lowest abundance in the genome, have a significant negative correlation. This indicates that 5’ end outliers are slower even independent of codon bias.
the number of Arg and Lys residues in a window upstream of the active codon,
although this result was later attributed to technical artifacts relating to the strand
orientation [18].
3.2.5 Factors Correlating with Translation Efficiency
While elongation efficiency measures time required to synthesize a new protein, trans-
lation efficiency measures the throughput of protein synthesis. Besides codon adap-
tation, which we find to play little or no causal role in improving efficiency, other
significant correlates to TE include structural features and the sequence motif around
the start codon (Figure A.2).
Structure is reduced near the translation start site in many organisms [46, 143]
and, in combination with specific structural motifs downstream, can promote or halt
initiation [66, 63, 104]. We performed a sliding window analysis (see Materials and
Methods and Figure 3.2.5) to correlate TE with RNA secondary structure in 40nt
windows along the gene, for both experimental in vitro and in vivo structural en-
ergy [105]. The window near the start codon is most significant, as reported previ-
ously for computational and in vitro structure measurements [67, 62, 121, 61]; the
positive correlation indicates that increased TE corresponds to loose structure in this
region. Indeed, this is also the window with highest energy, corresponding to the
lowest structure, as averaged over all genes (first red line in Figure 3.2.5). Interest-
ingly, the correlation to TE for in vivo structure is less pronounced and the window
is shifted 3 codons downstream. We call this Window A.
Our attention was also drawn to the window downstream of the start codon at
∼60nt in vitro and ∼80nt in vivo (second red line in Figure 3.2.5) with the lowest
energy (more structure) compared to neighboring positions. We call this Window
B. The most likely role for this energy barrier seems to be a stalling mechanism.
Ribosome density is high nearby: at 132nt (approximately two to three ribosome
[Figure 3.6, left panels: DMS in vitro energy and DMS in vivo energy (y-axes; higher energy means less structure) against position in nt (x-axis). Right panels: correlation of log(TE) with DMS in vitro and in vivo energy against position in nt; positive values mean less structure goes with high TE; points are marked as significant or not significant.]
Figure 3.6: RNA structure energy and its relationship to translation efficiency. Left: Energy averaged in sliding windows of 40nt (see Materials and Methods) across all genes for in vitro and in vivo measures of energy via DMS probing [105]. The second red line corresponds to the first window with lowest energy (≈60nt for in vitro and ≈80nt in vivo). Right: Spearman correlation between the energy windows and TE. The first red line corresponds to the first window with significant correlation (9nt for in vitro and 18nt for in vivo).
footprints downstream), our model-estimated ribosome density has a notable peak
that is reduced when we exclude outliers, which capture positions where sufficient
pausing could stack ribosomes (Figure A.2). Although properly placed downstream
structure can improve the efficiency of initiation by stalling the scanning pre-initiation
complex [104], or might be selected for heavy structure in order to prevent other
regions (namely, around the start codon) from being paired, the lack of significant
correlation to TE for Window B suggests that ribosome flow control here optimizes
other aspects of translation besides throughput.
In addition to low structure at the start codon, initiation may be assisted by
recognition of a 12-mer motif around the start codon called the Kozak sequence in
eukaryotes [65], derived in yeast based on a sequence consensus from highly expressed
genes by Hamilton et al [49]. As expected, due to a tight correlation between mRNA
abundance and TE (Figure A.2), similarity to the Kozak motif correlates strongly
to TE (Spearman r = -0.21, p < 10^-45), measuring similarity by Kullback-Leibler
divergence to the position-weight matrix, where a divergence nearer 0 means a closer match.
The 3rd nucleotide preceding AUG is the most significant (Spearman r = -0.17, p <
10^-29), consistent with experimental measures of initiation efficiency after modifying
positions in the Kozak site [138, 75]. Using a linear regression model for predicting
TE based on a set of correlates suggested in literature (see Materials and Methods),
we learn a refined Kozak motif to reflect highly efficient genes (Figure 3.2.5). Our
learned Kozak motif reduces the error of our regression model predictions relative to
an equivalent model using the original motif (from 0.84 to 0.75, averaged over 100
test sets selected randomly, compared to a null model error of 0.97) (Table A.2). This
indicates that our refined motif better corresponds to highly translated genes, likely
because it was trained directly on translation efficiency measurements rather than on
a proxy such as mRNA abundance.
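To make the similarity measure concrete, the following is a minimal sketch, under stated assumptions, of a Kullback-Leibler divergence between one 12-mer site and a position-weight matrix. The direction of the divergence (one-hot site relative to the matrix), the probability floor, and the helper names are assumptions for illustration and are not taken from this work. The toy matrix is built from the published consensus WAMAMAATGTCY by spreading probability uniformly over the bases each IUPAC symbol allows; it is not the actual matrix of Hamilton et al.

```python
import math

# IUPAC symbols appearing in the consensus: W = A or T, M = A or C, Y = C or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "M": "AC", "Y": "CT"}

def pwm_from_consensus(consensus):
    """Toy PWM: uniform probability over the bases allowed at each position."""
    return [{b: 1.0 / len(IUPAC[s]) for b in IUPAC[s]} for s in consensus]

def kl_to_pwm(site, pwm, floor=1e-6):
    """KL divergence of the one-hot encoding of `site` from `pwm`, summed over positions.

    For a one-hot distribution the divergence at a position reduces to
    -log(p), where p is the PWM probability of the observed base.
    `floor` (an assumption) keeps the logarithm finite for unseen bases.
    """
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    total = 0.0
    for base, column in zip(site, pwm):
        p = max(column.get(base, 0.0), floor)
        total += -math.log(p)
    return total

pwm = pwm_from_consensus("WAMAMAATGTCY")
close_site = kl_to_pwm("AAAAAAATGTCT", pwm)   # matches the consensus everywhere
far_site = kl_to_pwm("GGGGGGATGGGG", pwm)     # matches only at the ATG
```

Under this definition a smaller value means a closer match, so a negative correlation between divergence and TE corresponds to higher TE for sites closer to the motif.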
Finally, we tested the correlation between translation efficiency and other mRNA
[Figure 3.7: per-position nucleotide probability (y-axis, 0 to 1) against position in the motif (x-axis, 1 to 12).]
Figure 3.7: Estimated Kozak motif for efficient genes. Estimated TE-driven Kozak motif based on a regression model (see Materials and Methods). The original Kozak consensus for yeast [49] is WAMAMAATGTCY.
features often discussed in literature (Figure A.2). We find a negative correlation to
evolutionary rate that is suggestive of the intuitive fact that more conserved genes
are more highly translated. The positive correlation we find with mRNA abundance
suggests a model of co-expression where the need for high protein abundance drives
high translation of abundant transcripts. Consistent with previous studies [57], we
observe a very small negative correlation to length. We also find a positive correlation
(although weaker than that for tAI) to the codon translation rates geometrically
averaged over the codons within a gene. Lastly, RNA-binding proteins (RBPs) have
recently received attention for their roles in post-transcription regulation, and we also
see high Spearman correlations between RBP occupancy and TE. When looking at
enrichment of 15 proteins, we find the expected correlation to translation efficiency (as
suggested by literature) in eight of ten cases. One of the two “unexpected” proteins,
scp160, was recently reported to be required for translational efficiency of particular
mRNAs in yeast [52], even though it correlates negatively to ribosome occupancy in
Hogan et al [54]; our analysis encouragingly suggests the former correlation. Appendix
A has further discussion.
3.3 Discussion
In this section, we presented a statistical model to extract codon translation rates
and protein synthesis levels from ribosome profiling data. Our model is designed to
account for the complexities of ribosome profiling data while keeping parameter esti-
mation tractable. Although average footprint density on a gene is well correlated to
protein abundance, outliers can pull the estimate provided by the mean away from the
true level, especially when ribosome stacking is common. Thus, properly accounting
for differential elongation rates can improve inference of protein synthesis levels from
this data. We maintain a simple translation model (for example, we do not explicitly
include a rate of ribosome falloff or an analytical treatment of codons being processed
in series), but our design choices trade this detail for model simplicity, algorithmic stability,
and smoothing of noisy data. Using one model parameter for all codon instances in a
gene, as opposed to an individual dwell per position, has several advantages: it aver-
ages out sequence biases in footprint fragments, makes the optimization algorithm less
susceptible to local minima and hence robust to parameter initialization, and allows
us to infer parameters even for low abundance genes by offsetting the lack of data
with soft prior constraints. We reassuringly find qualitatively similar results when
we replace our refined protein synthesis rates with a simple average of the footprints
per gene, while obtaining better quantitative estimates compared to existing protein
abundance datasets. More physics-based or simulation models [141, 102, 122] require
knowledge of the kinetic parameters of translation, can necessitate grossly simplify-
ing assumptions such as a single codon translation rate per gene, base certain model
quantities on a limited set of features, or directly assume that codon rate is correlated
to codon adaptation. In comparison, our method reduces the number of assumptions
made by directly modeling the experimental processing and fitting the model param-
eters to the data under the single concept of flow conservation. On the other hand,
methods that aggregate the data directly [98, 17, 43], similar to our baseline method
for calculating codon translation rates, do not readily lend themselves to computing
other quantities. For example, because we have an underlying model, detection of
outlier codon positions follows easily within our framework, whereas other works rely
on choosing an adjacent window of appropriate size to compare counts. Similarly,
we can easily study other potentially interesting effects, such as codon translation
rate variance within genes and among genes. Finally, our method would be particularly
useful in situations where ribosomal profiling data is scarce or noisy. By using a
probabilistic model, we infer rates of interest from the observed, noisy data without
needing to exclude genes with sparse information. With the growing usage of ribo-
some profiling, a robust framework for studying rates of elongation and synthesis is
essential.
The robust framework of our model allows us to shed new light on causality in
regulation of translation and characterize the features associated with efficient elon-
gation and translation. Although codon usage is a strong correlate to TE (Figure
A.2), our mutant experiments suggest (via the correlation between codon bias and
tRNA abundance) that codon usage may not causally influence efficiency. The direct
impact of codon usage on efficiency and the basis of the selective force underlying
codon bias has remained a topic of controversy for decades. Some authors have pro-
posed that codon optimization serves directly to enhance the translational efficiency
of specific genes, perhaps by speeding elongation on their mRNAs. Our work provides
direct experimental evidence against this view. Rather, our work is consistent with
an alternative model, aligned with previous results for Escherichia coli [67], in which
codon bias in highly translated genes results from selection to optimize utilization
of the translational machinery, whose abundance and production represent a major
limitation on cell growth [5, 13, 67]; this selection induces a correlation without im-
plying that increasing codon bias optimizes efficiency on individual genes [133]. In
this view, initiation is rate-limiting and thereby determines translational efficiency.
When the demand-supply balance for a tRNA is not compromised by extremely high
expression of a transgene not adapted to the host organism, we propose that selective
forces beyond the TEs of individual messages guide the distribution of codons. The
positive correlation between elongation rate and TE suggests a potential contributor,
namely, selection for efficient use of ribosomes and translation factors, and that this
selective force is strongest for high-expression, high-TE genes. Such selection pressure
is consistent with studies of overall cell growth and protein synthesis, which indicate
that the translational apparatus is rate-limiting for cell growth and that reduction in
the amount of ribosome time devoted to producing an abundant protein can speed
cell growth [5, 7, 67, 85]. As elongation rate is not the strongest correlate to TE,
other mechanisms also deserve further study. For example, there may be selective
pressures on the mRNA sequence itself (e.g., to induce certain secondary structures),
which in turn create pressure in the cell to ensure a sufficient supply of tRNAs for
efficient translation of the highly translated messages. Our results are also consistent
with the prevalent view that initiation is typically the rate-limiting step in protein
synthesis, which does not provide a clear mechanism for codon usage in the body
of a gene to affect its efficiency, and particularly not through increased elongation
rates. Instead, tRNA levels are likely forced to match the lack of disfavored codons
by selection against the cost of tRNA production or against poor decoding accuracy.
Our resulting analyses address the contributions of initiation versus elongation
to efficiency [7, 68, 107]. While efficient usage of ribosomes and elongation factors
influences the overall amount of protein produced from the whole genome, initiation
may dictate differences between genes [39]. We characterize two initiation signals
that could play a role in translation regulation via a two-stage metering-light model:
reduced structure around the start codon and favorable sequence context to promote
ribosome binding, followed by an increase in structure that could, in turn, serve
to reduce misfolding of the emergent polypeptide by allowing sufficient time for re-
cruitment of chaperones to the ribosome exit tunnel [40]. This barrier could reflect
the observed universal per-gene effect, independent of codon identity, whereby the
strengths of slow outlier positions correlate to 5’ end proximity. Since translation
is resource-heavy, requiring tRNAs, mRNAs, and ribosomes, with the latter being
especially costly to produce, we intuit that the cell must balance use of these finite
resources while at the same time producing functional protein products. Structure
around the 5’ end could be one of the key mechanisms through which the cell regulates
translation so as to avoid wasting resources.
The region of slow elongation at the 5’ end certainly merits further exploration. In
contrast to the slow-codon ramp proposed in Tuller et al [123], our model shows that
while there may be an abundance of low tAI codons near the 5’ end, these codons do
not cause slow elongation (Figure A.2). We find (mild) correlations between pausing
and downstream structure, and between tAI and downstream structure over the first 50
codons of all genes (Spearman r = -0.0055, p = 0.01 for stem density, and insignificant
for in vitro or in vivo structure), but not between codon usage and codon translation
rate. A study performed over diverse bacteria, controlling for GC content, proposes
that structure drives codon usage early at the 5’ end [11]; in yeast, there may be
similar selection whereby structure-related constraints induce a low-tAI ramp.
The impact of secondary structure on translation is complex. In addition to a
role in initiation, high structure regions could also act by influencing elongation [19].
Outliers in the high-variance ribosome profiling data can differ from expected dwell
times by a factor of 40, and are distributed throughout the message (Figure A.10).
One explanation is the presence of downstream structural features that create an
energy barrier to elongation; these correlate (more weakly) to outlier strength when
ignoring the first 100 codons (whole gene versus truncated gene has r = -0.033 versus
r = -0.034 for downstream in vivo energy and r = 0.021 versus r = 0.010 for density
of stems), precluding the possibility that high ribosome density (based on the 5’ end
as a proxy) drives the effect. In addition, mRNA-binding factors can interact with
structure [28], but whether structure performs any common genome-wide functions
is not yet established. One possibility is that secondary structure slows the ribosome
during elongation to promote correct folding of the nascent protein during its vectorial
synthesis by the ribosome.
The significant but mild correlation to structure suggests that other factors are
important in pausing. Experiments suggest that the wobble base in CGA causes sig-
nificant pausing [72, 113], clusters of slowly translated codons could stall ribosomes
more than the sum of their individual decoding times [140], and effects from the
nascent peptide stall elongation at prolines [58, 136]. It is likely that a compendium
of biological features interact to dictate elongation rate. Although our genome-wide
outlier analysis shows promising correlations to pausing, the small magnitude of cor-
relation could be improved by looking at more restrictive or genetically meaningful
sets of positions. The growing interest in ribosome profiling poses exciting directions
for further investigation of the interactions between these features and the changes
that may occur in different conditions. With this additional data and measurements
from single-molecule experiments [134, 124], our model could be extended to include
finer-grained parameters for codon translation rates, partitioned in various ways, in
order to better understand how rate changes over a transcript. Further analysis is
also needed into how structure and the sequence around the initiation site work with
or against each other. For example, heavy structure can promote initiation in spite
of weak initiation context, but the ways in which they interact are still unknown.
3.4 Materials and Methods
Ribosome Profiling Datasets All experiments were done on yeast strain 288C.
Cells were collected for ribosome profiling by filtering ∼250ml culture of OD = 0.6 and
immediately flash freezing on liquid nitrogen. For all ribosome-profiling experiments,
footprints were obtained as described before [56]. Three out of four copies of Threo-
nine tRNA (tT(UGU)G2, tT(UGU)H, tT(UGU)P), recognizing the ACA codon, were
knocked out using the standard technique of homologous recombination from a plas-
mid PCR product. The resulting strain was marked with nourseothricin, kanamycin,
and hygromycin B resistance respectively. Successfully transformed yeast were iden-
tified by check PCR. tRNA arginine (tR(CCU)J) recognizing the AGG codon was
overexpressed by cloning into a URA marked 2-micron plasmid (pRS426) and trans-
forming wild-type yeast using –URA selection. For the tRNA body swap, tRNA se-
quence from tR(UCU)B was mutated in the anticodon to CCU using QuikChange site-
directed mutagenesis kit (Stratagene) in order for the tRNA product from tR(UCU)B
to recognize the AGG codon. The mutated tRNA was then cloned in the 2-micron
plasmid pRS426 and transformed into 288C.
Ribosome-protected fragments were aligned against assembly R63 from the Sac-
charomyces Genome Database (SGD, http://www.yeastgenome.org) and we kept
uniquely mapped reads with no more than 2 mismatches and lengths between 28 and
31. To identify the active codon for ribosome-protected fragments, we let 0 be the
first nucleotide of the read and if the read begins on the first/last/middle nucleotide
of a codon, the active codon starts at nucleotide 15/16/17, respectively. An mRNA
fragment was mapped to a gene if it begins less than 16nt upstream of the start codon
and more than 16nt upstream of the stop codon. Genes were ignored if they did not
have an AUG start codon, had internal stop codons, had less than 50% of positions
on the coding sequence with at least one mapped mRNA count, or if all the footprint
counts were 0 over the gene length used in the translation model (see below), leaving
around 5000 genes in each sample. When comparing mutants to wild-type samples,
we used the intersection of the valid genes in each sample. The AGG mutants were
compared against the wild-type sample with a URA plasmid.
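The read filter and the active-codon rule described above can be sketched as follows. This is an illustration rather than the pipeline used for this work: the function names are hypothetical, read positions are assumed to be 0-based coordinates relative to the first nucleotide of the start codon (so reads beginning upstream have negative positions), and the length bounds 28 and 31 are assumed to be inclusive.

```python
def keep_read(length, mismatches, n_alignments):
    """Uniquely mapped, at most 2 mismatches, length between 28 and 31 (assumed inclusive)."""
    return n_alignments == 1 and mismatches <= 2 and 28 <= length <= 31

def active_codon_offset(read_start):
    """Offset within the read (first nucleotide = 0) where the active codon starts.

    read_start is the position of the read's first nucleotide relative to the
    first nucleotide of the start codon (hypothetical coordinate convention).
    Frame 0 = first, 1 = middle, 2 = last nucleotide of a codon.
    """
    frame = read_start % 3  # Python's % is non-negative, so upstream starts work too
    return {0: 15, 2: 16, 1: 17}[frame]

def active_codon_index(read_start):
    """0-based index of the active codon on the coding sequence."""
    return (read_start + active_codon_offset(read_start)) // 3
```

With the 15/16/17 offsets, the position `read_start + offset` is always a multiple of 3, so the active codon is always in frame with the start codon.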
Analysis of tRNA Charging and Relative RNA Levels For analysis of charg-
ing levels of tRNAs, duplicate samples of each strain were grown under conditions
used for ribosome profiling, followed by harvesting of ≈4 OD-ml of cells. Then, bulk
RNA was prepared from each pellet under acidic conditions (pH 4.5) using glass
beads, and RNA was resolved on a 6.5% acrylamide gel at pH 5 for 15 hours at 4°C,
transferred to Hybond N+ membrane, and hybridized with appropriate 5’-labeled
oligonucleotide probes, as described [3]. Charging levels were visualized on a Typhoon
PhosphorImager (GE Healthcare) and quantified using ImageQuant, and relative lev-
els of tRNAArg(CCU) were measured by normalization to levels of tRNALeu(CAA) in
the corresponding lane.
Feature Calculations tRNA gene copy numbers were obtained from the tRNAscan-
SE database [76]. To measure codon usage bias, we use tAI, which ranges from 0 to
1, with higher values for more preferred codons, calculated as in dos Reis et al [34] with refined weights
described in Tuller et al [123].
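For reference, the tAI of a gene is the geometric mean of the per-codon weights of its codons. The sketch below assumes the per-codon weights have already been computed; the `codon_weight` table and its values are hypothetical placeholders, and skipping codons without a weight (for example, stop codons) is an assumption.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def gene_tai(cds, codon_weight):
    """tAI of a coding sequence: geometric mean of its codons' weights.

    codon_weight maps codon -> weight in (0, 1]; codons without an entry
    are skipped (an assumption of this sketch).
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    weights = [codon_weight[c] for c in codons if c in codon_weight]
    return geometric_mean(weights)

toy_weights = {"AAA": 1.0, "GGG": 0.25}   # hypothetical values
toy_tai = gene_tai("AAAGGGTAA", toy_weights)
```

The same `geometric_mean` helper applies to any per-codon quantity averaged geometrically over a gene.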
Experimentally derived structure data from DMS probing [105] was normalized in
windows of size 150nt by the minimum count in the top 5% of A and C nucleotides, and
the top 5% of counts were set to 1. Windows with less than ten A and C nucleotides
in the top 5%, windows with a zero normalization constant, genes without data,
and genes without a characterized UTR [87] were ignored in analyses. In the sliding
window energy analysis, energy windows were normalized per gene by the mean over
windows on each gene. In the energy profile, normalized windows were then averaged
across positions without missing data, aligned by start codon. In the energy-TE
correlation profile, we applied a conservative Bonferroni correction by multiplying
the p-values by the number of windows (30 upstream of the start codon and 250
downstream, since this span covered the maximum number of genes). To calculate
the location of the dip in the energy profile, we identified global minimums within
spans of 90nt and took the first minimum.
The correlation between tAI and downstream energy is for tAI over windows of
3 codons in the first 50 codons of all genes and the associated average of the 40nt
energy windows 15nt downstream from each nucleotide in the tAI window. Energy
windows are calculated as above using the number of stems and DMS in vitro and in
vivo energy.
Translation Model As discussed in the main text, we optimize our objective over
the parameters µmc and µc and solve for Jm. Since individual footprint counts can
be noisy and sparse, we smooth the data in three ways. First, we use a single µmc
for every copy of codon c on message m. The dwells µcm for a specific c
over all genes m softly agree with the global µc in a weighted geometric average with
weight wcm: the number of codons c on gene m normalized by the number of codons
c over all genes. Hence, genes with more copies of codon c get a larger vote in the
average estimating µc. Second, we add a pseudo-count of 1 to all footprint counts
and use the logarithm of normalized counts in the Poisson term (similar to a more
robust geometric average as opposed to an arithmetic average that is easily skewed
by outliers), first scaling the flow-normalized counts by a single factor over all (m, k)
so that the lowest one is 1. We refer to these transformed counts as d’. Third, during
model training, we ignore the first 100 codons (or the first 25% for genes shorter than
100 codons) since this region may have unusual flow conservation properties. If it
doesn’t, excluding these codons should not affect the learned rates. We refer to these
restricted positions as k’. The second term in the objective function is multiplied by
a constant C = 100 so as to not be greatly outweighed by the data term. Altogether,
we solve the following optimization problem (where k′ is restricted and d′ are scaled
as described above):
\[
\max_{\mu_{cm},\,\mu_c} \; \log \prod_{m,k'} \mu_{cm}^{\,(d'_{mk}/J_m)} \exp(-\mu_{cm}) \;-\; C \Big[ \sum_{m,c} w_{cm} \,(\log \mu_{cm} - \log \mu_c)^2 \Big]
\]
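As an illustration only, the objective above could be evaluated for a given set of parameters roughly as follows. This is a minimal sketch rather than the code used in this work: the function name, the dictionary-based data layout, and the argument names are all assumptions made for the example.

```python
import math

def objective(d_scaled, codon_at, J, mu_mc, mu_c, w, C=100.0):
    """Evaluate the smoothed log-objective (hypothetical data layout).

    d_scaled[(m, k)] : transformed count d' at restricted position k' of gene m
    codon_at[(m, k)] : codon found at that position
    J[m]             : flow of gene m
    mu_mc[(m, c)]    : per-gene dwell time of codon c on gene m
    mu_c[c]          : global dwell time of codon c
    w[(m, c)]        : weight w_cm
    """
    data_term = 0.0
    for (m, k), d in d_scaled.items():
        mu = mu_mc[(m, codon_at[(m, k)])]
        # log of mu^(d'/J_m) * exp(-mu)
        data_term += (d / J[m]) * math.log(mu) - mu
    # weighted squared distance between per-gene and global log dwell times
    penalty = sum(
        w[(m, c)] * (math.log(mu) - math.log(mu_c[c])) ** 2
        for (m, c), mu in mu_mc.items()
    )
    return data_term - C * penalty
```

Working with the logarithm of the product turns the data term into a sum, which is what a numerical optimizer would typically be given.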
We verified that the constant C did not affect our results by running the main
analyses again – correlations for codon bias measures, protein abundance, and outliers
– on several other values (1, 10, 1000, 10000, 100000). We note no significant change
(Table A.2), except for some outlier correlations for 100000 (stemsGC-down15 is
now not significant; cluster-ArgLys-up-1 is significant) and for 1 and 10 (internal-
down is now significant). Similar to taking the limit of the constant to infinity, we
also considered a model with only µc parameters and no µmc (and hence no second
term in the objective function) (Table A.2). Again, no extreme change exists in the
correlation between codon translation rate and codon bias measures. Perhaps because
we have removed a layer of parameters, we do see a slight decrease in correlation to
protein abundance and some changes to outlier correlations: multi-down is no longer
significant but still shows a similar correlation strength; is-in-domain is significant,
suggesting that slow outliers lie outside of protein domains, and the upstream number
of Arg/Lys codons is now significant.
The optimization algorithm is as follows: Jm is fixed to $D_m = \sum_{k \in m} d_{mk} / L_m$ and
µmc and µc are initialized to dwells from the baseline method (see below), shifted in
log space so that the mean is log(7.2), plus a small random number. The value 7.2 is
the mean over all (m, k) of the flow-normalized counts normalized and smoothed as
described above for the wild-type sample. The appropriate mean value was replaced
for each of the mutant samples. The parameters are estimated via coordinate descent
by iterating through codons c and learning the associated µmc and µc. Optimization
per c used an L-BFGS method [15] in Matlab (Matlab wrapper from
http://www.cs.toronto.edu/~liam/software.shtml) with the following stopping criteria: max
number of iterations 5000; gradient tolerance 10−5; function tolerance 103. Coordinate
descent was stopped when the difference in weights was less than 5 ∗ 10−5 or the
difference in function value was less than 10−5. Codons not appearing in a particular
gene m did not have an associated µmc and we also excluded the stop codons. We
then compute $J_m = \frac{1}{L_m} \sum_{k \in m} d_{mk} / \mu_{mk} = \frac{1}{L_m} \sum_{k \in m} d_{mk} / \mu_{m,\,c=\mathrm{codon}(m,k)}$. The optimization is
not sensitive to initialization (Figure A.2).
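The final step, recovering the flow Jm from the learned dwell times, amounts to averaging count-to-dwell ratios along the gene. A minimal sketch follows; the function and argument names are hypothetical, and the inputs are assumed to be position-aligned lists for a single gene.

```python
def gene_flow(counts, codons, mu_for_gene):
    """J_m: mean over positions k of d_mk / mu_{m, codon(m, k)}.

    counts      : footprint counts d_mk along the gene
    codons      : codon at each position, aligned with counts
    mu_for_gene : dict mapping codon -> learned per-gene dwell time
    """
    ratios = [d / mu_for_gene[c] for d, c in zip(counts, codons)]
    return sum(ratios) / len(ratios)  # len(ratios) plays the role of L_m
```

For instance, counts of 2 and 4 at codons with dwell times 1 and 2 give ratios of 2 and 2, so the flow is 2.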
Although less robust, we also optimized a model with a separate dwell time µmk for
every (m, k) with the following initialization of weights: µmk = dmk/Dm, with 0 counts
replaced by the mean of all non-zero counts, shifted in log space so that the mean is
log(7.2); µc are dwells from the baseline method (see below) shifted in log space so
that the mean is log(7.2); all weights perturbed by a small random value. The value
7.2 was chosen as above. L-BFGS settings were as above. Coordinate descent was
stopped when the difference in weights was less than 10−2 or the difference in function
value was less than 10−1. The overall codon dwell times µc were well correlated to
those in the original model (Pearson r = 0.99, p < 10−74), but analyses based on
dwell times per (m, k) could be impacted, since these parameters are more sensitive
to initialization. We therefore verified that all qualitative observations presented still hold. The
correlation between codon translation rate and codon bias measures is insignificant
(r = 0.151, p = 0.359 for Cy5; r = 0.138, p = 0.401 for Cy3; r = 0.223, p = 0.084 for
tAI). Protein abundance estimates correlate similarly to external measures (r = 0.671
for de Godoy [26] data and r = 0.778 for Newman et al [88] data, p = 0 for both). In
the outlier analysis, all correlations still hold except for the following: among the
structure features, only the densities of stems 12nt and 9nt downstream are significant
(but the others are on the same order of magnitude); the protein domain feature is
significant for bases inside a domain; and the feature for upstream number of
Arg/Lys codons is significant.
Correlations between TE and gene-level features are similar except Kozak position -2
is now barely not significant, experimental in vitro energy for the mRNA sequence is
barely not significant, and Npl3 is significant (in the expected direction). The energy-
TE correlation profile is the same except the window at 18nt for in vivo energy is
barely not significant but still a peak. The ribosome density graph has the same
peak at 132nt and decreases when outliers are removed. The refined Kozak motif has
the same dominant bases except position 1 in Figure 3.2.5 has the non-dominant T
swapped with A. Finally, the error when replacing the learned Kozak motif with the
original similarly increases from 0.69 to 0.77.
Baseline Method for Codon Translation Rate To get dwell time per codon c
from the raw data, we average over counts (m, k) for which codon(m, k) = c, normal-
ized by the average per gene (Dm =∑
k∈m dmk/Lm). Rate is the reciprocal of dwell
time. As above, we first add a pseudo-count of one to each dmk and ignore the first
100 codons (or the first 25% for genes shorter than 100).
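The baseline can be sketched in a few lines. The sketch below is illustrative only: the function names and input layout are assumptions, and it further assumes that the per-gene average Dm is taken over the retained positions (after the initial codons are dropped), which the text does not state explicitly.

```python
from collections import defaultdict

def baseline_dwell_times(genes, skip=100, pseudo=1.0):
    """genes: list of (counts, codons) pairs, one per gene, aligned by position."""
    sums = defaultdict(float)
    nobs = defaultdict(int)
    for counts, codons in genes:
        n = len(counts)
        # drop the first `skip` codons, or the first 25% for shorter genes
        start = skip if n >= skip else n // 4
        d = [x + pseudo for x in counts[start:]]
        if not d:
            continue  # nothing left to average for this gene
        D_m = sum(d) / len(d)  # assumption: average over retained positions
        for value, codon in zip(d, codons[start:]):
            sums[codon] += value / D_m
            nobs[codon] += 1
    return {c: sums[c] / nobs[c] for c in sums}

def rates(dwell):
    """Rate is the reciprocal of dwell time."""
    return {c: 1.0 / t for c, t in dwell.items()}
```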
Analysis of Translation Efficiency in Mutants To test if the difference in the
number of reduced TE genes versus increased TE genes (127 versus 73) in ACA-
K is significant, we permuted the mutant TE values 1000 times and calculated the
number of reduced TE versus increased TE genes for each permutation. There were 0
cases where the difference was less than the original difference, indicating the original
difference is not statistically significant.
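A permutation procedure of this kind might look like the following sketch. The rule used here to call a gene "reduced" or "increased" (a two-fold change relative to wild type) is purely an assumption for illustration, as are the function names and the fixed random seed; the text above does not specify these details.

```python
import random

def reduced_minus_increased(wt, mut, fold=2.0):
    """Difference between the numbers of reduced-TE and increased-TE genes."""
    reduced = sum(1 for w, m in zip(wt, mut) if m * fold <= w)
    increased = sum(1 for w, m in zip(wt, mut) if m >= w * fold)
    return reduced - increased

def permutation_counts(wt, mut, n_perm=1000, fold=2.0, seed=0):
    """Return the observed difference and the number of permutations
    whose difference is smaller than the observed one."""
    rng = random.Random(seed)
    observed = reduced_minus_increased(wt, mut, fold)
    shuffled = list(mut)
    n_less = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign mutant TE values among genes
        if reduced_minus_increased(wt, shuffled, fold) < observed:
            n_less += 1
    return observed, n_less
```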
Model for Translation Efficiency We used a regression model to predict TE of
an mRNA message based on various features:
\[
\min_{w} \; \sum_{m} (TE_m - w^{T} f_m)^2 \;+\; \lambda_1 \sum_{p} |w_p| \;+\; \lambda_2 \sum_{p} w_p^2
\]
The first term fits an optimal set of weights w to the TE of a set of genes m using
a linear combination of the set of features fm. The last two terms enforce sparsity (so
that features that do not explain the data well receive a weight of 0) and shrinkage (so
that weights are kept at a small scale). Under a standard machine learning framework,
we divide the genes in our yeast dataset into a test set (size 400 genes) and a training
set (the remaining genes). The hyperparameters λ1 and λ2 are learned via cross-
validation: we further divide the training set into fifths, and evaluate the error for a
grid of hyperparameter values on each fifth of the training set. The weights w are then
learned on the whole training set with the best hyperparameters (with lowest cross-
validation error). Test set error is the squared norm difference between predicted
and actual TE, averaged over all genes in the test set. For reference, we create a
null model where the weights are learned from TEs randomly permuted among the
genes. The final weights are the average over all training/test combinations. The
features used are minimal in order to maximize the number of genes that have these
characterized: tAI of gene; computationally predicted energy of 5’ UTR, 3’ UTR,
mRNA, and window around the start codon with highest correlation to TE; length
of coding sequence; mRNA abundance; identity of bases overlapping the Kozak site
(genes without a characterized UTR [87] were excluded).
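To make the regression objective and the test-set error concrete, here is a small sketch that only evaluates them for given weights; it does not perform the minimization or the cross-validation. The function names are hypothetical and features are assumed to be plain lists of numbers.

```python
def predict(w, f):
    """Linear prediction w^T f for one gene."""
    return sum(wi * fi for wi, fi in zip(w, f))

def elastic_net_objective(w, features, te, lam1, lam2):
    """Squared-error fit plus L1 (sparsity) and L2 (shrinkage) penalties."""
    fit = sum((t - predict(w, f)) ** 2 for f, t in zip(features, te))
    l1 = sum(abs(x) for x in w)
    l2 = sum(x * x for x in w)
    return fit + lam1 * l1 + lam2 * l2

def heldout_error(w, features, te):
    """Squared difference between predicted and actual TE, averaged over genes."""
    return sum((t - predict(w, f)) ** 2 for f, t in zip(features, te)) / len(te)
```

In a cross-validation loop, one would minimize the objective on four fifths of the training set for each pair of hyperparameters on the grid and score the remaining fifth with the held-out error.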
To compute the weights for the refined Kozak site, we include a feature fk in f
defined as fk = 1/(1 + exp(x ∗ g)). The vector g has 36 indicators, 4 per each of
the 9 positions in the Kozak site (excludes the start codon). The vector x has the
corresponding weights for each indicator, is included in the shrinkage term, and is
learned iteratively with w. The refined Kozak motif in Figure 3.2.5 is the average
of the 100 values of x learned separately for each training set. To create a position-
weight matrix from these weights, we shift the weights for each position so that the
most negative value (if any) is 0 and normalize by the sum of the four weights at
that position. The sequence logo was generated by seqLogo (seqLogo: Sequence
logos for DNA sequence alignments, R package v1.28.0,
http://bioconductor.org/packages/release/bioc/html/seqLogo.html).
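The Kozak feature and the conversion of learned weights into a position-weight matrix can be sketched as follows. This is an illustrative sketch with hypothetical function names; the per-position layout of the weights and the uniform fallback for a position whose shifted weights sum to zero are assumptions not taken from the text.

```python
import math

def kozak_feature(x, g):
    """f_k = 1 / (1 + exp(x * g)) for weight vector x and indicator vector g."""
    dot = sum(xi * gi for xi, gi in zip(x, g))
    return 1.0 / (1.0 + math.exp(dot))

def weights_to_pwm(x_by_position):
    """x_by_position: one list of four weights per Kozak position."""
    pwm = []
    for weights in x_by_position:
        lowest = min(weights)
        # shift only if there is a negative value, so the most negative becomes 0
        shifted = [v - lowest for v in weights] if lowest < 0 else list(weights)
        total = sum(shifted)
        if total > 0:
            pwm.append([v / total for v in shifted])
        else:
            pwm.append([0.25] * 4)  # assumption: uniform column if all weights vanish
    return pwm
```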
To test whether the refined motif provides better TE predictions than the original
Kozak motif, within each of 100 training sets, we fix fk for each sequence with x set
to the original motif (scaling the weights so that the sum at each position matches
the sum of the learned motif) and learn the remaining weights as before. We then
compute accuracy on the corresponding test set.
Outlier Model The strength of an outlier ∆mk at position (m, k) is defined as
the difference between the observed count (dmk) and the expected count (Jmkµmc),
divided by smk, a standard deviation representing the variance in that count due to
the abundance of the gene and the codon it corresponds to. For smk, we divide the
genes into 32 quantiles by abundance and compute the standard deviation of the
counts in each bin per codon. Thirty-two was chosen as the maximum number that
still gave at least three counts in each bin per codon and no zero-valued smk. This
normalization helps distinguish true biological outliers from outliers arising due to
differential mRNA sampling and abundance depths across genes. Counts are as in
the optimization setup (dmk have a pseudo-count of 1 and Jmk are scaled by a single
factor). A slow outlier is an (m, k) with ∆mk > T for some threshold T. Non-outliers
are (m, k) with −1 < ∆mk < 1, excluding slow outliers.
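The definitions above translate directly into code. The following minimal sketch uses hypothetical function names and assumes the standard deviation smk has already been computed from the abundance bins.

```python
def outlier_strength(d_mk, J_m, mu_mc, s_mk):
    """(observed count - expected count) / standard deviation."""
    return (d_mk - J_m * mu_mc) / s_mk

def classify(delta, T=1.0):
    """Label a position as a slow outlier, a non-outlier, or neither."""
    if delta > T:
        return "slow"
    if -1.0 < delta < 1.0:
        return "non-outlier"  # reached only when delta <= T, so slow outliers are excluded
    return "other"
```

Checking the slow-outlier condition first is what implements "excluding slow outliers" for thresholds below 1, such as T = 0 or T = 0.5.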
Since there is a small uncertainty in the position of the active codon within
ribosome-protected fragments of certain lengths, what we might see as a fast out-
lier (a position (m, k) where ∆mk < −T and, for example, a wrongly-labeled count of
0) could actually have a fragment that was falsely associated with an adjacent slow
position. The opposite is much less likely; an observed slow outlier has many more
counts than expected, making it unlikely that so many fragments were wrongly at-
tributed and belong instead to an adjacent fast outlier. For that reason, we compare
slow outliers only to non-outliers.
When correlating features to outlier strength (Table A.2), we call features signifi-
cant only if they pass a stringent set of conditions: Pearson and Spearman correlations
must have the same sign for all slow outlier thresholds (T = 0, 0.5, 1, 1.5, 2, 2.5) and
be significant; the correlation when binned by codons must have at least 30 significant
codons; the sign of the correlation must match the direction suggested by the com-
parison of means for slow versus non-outliers. When referring to significant features
in Table A.2, we cite the correlation for T = 0 since all thresholds are significant. For
a more stringent set of outliers, we use T = 1 in analyses requiring a fixed T (Figure
A.2, Figure A.2, Figure A.10).
Data Availability Data is available at GEO Series accession number GSE63789.
The conclusions in this chapter are published in [96].
3.5 Conclusions
In this chapter, we presented a method that provides a rigorous framework for ana-
lyzing the increasing number of ribosome profiling data sets, and thereby addressing
the outstanding questions raised in the discussion about correlations between trans-
lation efficiency, codon bias, and RNA secondary structure. We illustrate the use of
the method in the context of one of these data sets to create a high-level view of
the mechanisms involved in initiation and elongation, to study the factors affecting
initiation as the rate-limiting step for translation, and to support a model in which
the direction of causality goes from translation efficiency to codon usage rather than
the opposite.
Chapter 4
Translation in Humans
4.1 Introduction
Translation in higher-order organisms is notoriously more difficult to model and un-
derstand. Alternative splicing complicates sequencing processes like RNA-seq, and
now also ribosome profiling. In particular, common exons cannot unambiguously be
mapped to the correct isoform without additional information or computational tech-
niques that are only now being tackled [84]. Nevertheless, several ribosome profiling
datasets exist in a variety of conditions and human tissues and while these additional
intricacies complicate data analysis, these data also yield valuable insight towards the
genetic basis for translation.
Previous studies have focused on understanding the impact of genetic variation on
expression levels, protein levels, or ribosome occupancy levels [1, 8, 10, 82]. However,
due to the difficulty of obtaining a clean signal from codon-resolution ribosome
fragment counts, few if any studies have looked at genetic variation affecting intermediate
signals during translation as opposed to the genome-wide ribosome throughput. In
this chapter, we will present the results of our translation model on a large dataset
of many human individuals. We will show that there exist SNPs associated with
significant differences in codon translation rates, suggesting that genetic variability
might cause variability during elongation.
This analysis was performed in collaboration with Jonathan Pritchard.
4.2 Results
4.2.1 Allele-Specific Ribosome Dwell Times
A recent ribosome profiling dataset on 71 human individuals [10] allows us to com-
pare translation rates and protein synthesis rates between sequence variants. In this
data, we do find that the reference allele and the alternate allele are not always as-
signed the same number of ribosome footprint counts during mapping to the genome.
For each (gene, codon) pair, we therefore calculate allele-specific ribosome fragment
counts, giving us an estimate of how often each version of that transcript is seen by
a translating ribosome. For example, if a codon AAA contains an A/T SNP at the
third base, we count all the ribosome fragments which map to an AAA and all the
ribosome fragments which map to an AAT. To calculate a score representing this ratio
of ribosome fragment counts per allele-pair (e.g. the AAA-AAT pair), we aggregate
over all instances of that SNP-induced pair in every gene and every individual (see
Methods). Our analysis reveals several allele pairs for which the ratio is significant
under a binomial test (Figure 4.1; bottom-left triangle of each square).
To clarify, these fragment counts correspond to a specific codon location – if one
or more SNPs modify a location within the codon, we examine all possible pairs. We note
also that this method weights genes with higher abundance more than those with lower
abundance, affording us some smoothing from using raw (sparse and noisy) ribosome
fragment counts, as opposed to taking an average of ratios. This score aggregates over
up to 30000 instances depending on the allele-pair. The number of ribosome fragment
counts in the numerator and denominator range from tens up to 20000 counts. We
remove several problematic genes from this analysis (roughly 30 instances from each
allele-pair), where we see a highly skewed ratio (e.g. zero counts for one allele versus
hundreds for the other) across all allele pairs. Since this skew occurs regardless of
the pair, these are likely due to artifacts in the experimental protocol. We also
remove any transcripts with high sequence similarity to another transcript in order
to alleviate the ambiguity caused by multiple isoforms (see Methods). Finally, we
focus on pairs with a difference in the wobble-base (the third base). These have been
implicated in translational control [114], although this analysis can be performed over
any allele-pair.
Figure 4.1: Comparison of ribosome fragment counts between alleles at SNPs. The bottom-left triangle is calculated for the score derived from raw counts and the top-right triangle is calculated for the score derived from outlier strengths inferred from our translation model. The values are p-values under a binomial test, with 0 representing a pair with significant differences between ribosome occupancy (or strength of the pausing) of the two alleles. [Image not reproduced: panel labels list the 64 codons, AAA through TTT, grouped by their first two bases; color bar: Binomial Test p-value, 0 to 1.]
As seen previously in this thesis, the raw ribosome counts are a noisy observation
of the true dwell times. We therefore applied our probabilistic method for analyzing
ribosome profiling data (described in the previous chapter) to this human dataset,
learning as before the dwell times per codon c over two different granularities: a
global dwell time µc and a per-gene dwell time µcm. Using these parameters, our
model can also extract outlier strengths, which measure how much longer the
ribosome dwells than expected. Repeating the ratio analysis as before, but now
substituting the raw ribosome fragment counts with the inferred outlier strengths,
we find similar results: several allele-pairs have a ratio that is significant under
a binomial test, but there are fewer pairs than when using the raw counts (fewer
p-values close to 0 or white cells in Figure 4.1; top-right triangle in each square).
This is not surprising since the raw counts are noisy observations of the true rate and
hence their high variance, especially in high abundance genes, can introduce noise
into the ratio. All except one of the allele-pairs that remain significant when using
outlier strength represent synonymous codons: for example, AAA/AAG code for
Lysine, AGA/AGG code for Arginine, and AGC/AGT code for Serine. Interestingly,
the ATA-ATG combination is significant, perhaps indicating that ATA is a potential
alternative start codon.
The raw count and the outlier strength represent different levels of granularity.
The raw count, the finest, is a noisy observation of the true phenomenon and hence
would be the most susceptible to perturbations in the system caused by artifacts
that don’t correspond to fluctuations in the true codon translation rates. The outlier
strengths, compared to the global dwell times also learned by the model, capture the
gene-specific dynamics but help alleviate some of the differences observed due purely
to noise. The outlier strength is therefore the more conservative estimate. With careful analysis
of the genes and the aggregation of counts, we therefore find the potential for variable
ribosome pausing rates associated with genetic variation.
4.2.2 Codon Translation Rates Across Individuals
Applying our translation model from the previous chapter to this human dataset
allows us to explore other aspects of translation. In line with the results on variable
ribosome pausing at different alleles, we see variability in codon translation rates
between individuals (Figure 4.2). However, this analysis is somewhat sensitive to the
complexities of a higher-order organism. When we include more genes in our model
(3000, 5000, 10000 genes), we have more data to learn from, but we also have more
ambiguous data due to alternatively-spliced transcripts. The correlations between
global codon dwell times between pairs of these models are Pearson r = 0.25, r = 0.35, and r
= 0.33. As such, we based all inferences throughout this chapter on the 3000 genes
with highest RNA levels and lowest similarity to other transcripts (see Methods) in
order to strike a balance between sufficient data and unambiguous data.
We also find, as in yeast and other organisms [98, 73, 96], that codon translation
rates do not correlate well to tAI [34], a measure of codon bias. Spearman r-values
range from -0.38 to 0.11, depending on the individual, with all p-values except two
being greater than 0.01.
Figure 4.2: Comparison of inferred codon dwell times between four random pairs of human individuals. [Scatter plots not reproduced. Panels, x-axis versus y-axis: Individual 24 versus Individual 19; Individual 40 versus Individual 38; Individual 66 versus Individual 26; Individual 2 versus Individual 18.]
4.3 Discussion
In this work, we showed that allele-specific differences in ribosome counts and ribosome
dwell times exist at SNP locations for several allele-pairs. Recent studies in yeast and human
[1, 8, 10, 82] have surprisingly shown that eQTLs – quantitative trait loci associated
with an effect on RNA expression levels – have a significantly reduced effect size
on protein levels. It then follows to ask whether SNPs are potentially associated
with variation at an elongation level as opposed to a protein synthesis level. In
particular, it might be the case that genetic variation acts on codon translation rates
in order to affect other mechanisms beyond ribosome throughput. We also showed
that translation rates between individuals can differ. The source of this can be,
for example, accumulated differences in per-allele rates from accumulated genetic
differences between these individuals, although a more thorough investigation of other
biological factors, or potentially confounding factors, would be interesting to perform.
Through comparison of ribosome profiling datasets on several individuals, we illus-
trated the potential impact of genetic variation on ribosome pausing, but the mecha-
nism behind this variability deserves further exploration. It was recently shown that
genetic variability is also associated with differences in structure via a PARS assay on
three human individuals [130]. This biological feature, as well as those suggested in
the previous chapter in yeast, are candidates for biophysical characteristics that can
act via the genome to affect translation and eventually create a phenotype of interest.
It would be interesting to combine this analysis with datasets on complex
diseases to gauge whether elongation-level SNPs can help explain those phenotypes
or boost predictive power.
Finally, the ribosome profiling dataset analyzed in this work provides many differ-
ent signals of interest. We can extract other per-gene quantities potentially associated
with QTLs (replacing the “e” with other letters). For example, we could look at pe-
riodicity of the ribosome occupancy signal in order to understand whether shifts in
frame, potentially caused by RNA secondary structure, have a genetic basis. Over-
lapping loci associated with different signals and with different biological features
could lead to an elucidating cross-comparison of QTL effects, and would be useful in
elevating our understanding of translation in humans.
4.4 Methods
Ribosome Profiling Datasets Ribosome profiling data was gathered for 71 hu-
man individuals in Yoruba lymphoblastoid cell lines (LCLs) [10]. RNA-seq data and
genome-wide genotypes were obtained from [94].
Ribosome fragment counts were mapped to the genome and the active codon
was determined as explained for our translation model in the Methods section of the
previous chapter. When choosing the genes to train over, we computed a score as
follows: RNA*RNA/similarity. Here, RNA represents the average RNA level per gene
and similarity represents how similar the RNA sequence is to any other transcript (a
value of X means similarity to X other transcripts). We chose the first 3000 genes
with highest score. For comparison, we also looked at a model over the first 5000 and
10000 genes.
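The gene-selection score can be sketched as below. The function name and the dictionary inputs are hypothetical, and the sketch assumes every similarity value is positive so that the division is defined.

```python
def select_genes(rna_level, similarity, n_top=3000):
    """Rank genes by RNA * RNA / similarity and keep the n_top best.

    rna_level[g]  : average RNA level of gene g
    similarity[g] : number of other transcripts gene g is similar to (assumed > 0)
    """
    scores = {g: rna_level[g] ** 2 / similarity[g] for g in rna_level}
    return sorted(scores, key=scores.get, reverse=True)[:n_top]
```

Squaring the RNA level means that abundance dominates the ranking, with similarity acting as a secondary penalty: a gene with twice the RNA level still outranks another unless it is similar to more than four times as many transcripts.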
Ratio Score To aggregate the ratios of different allele pairs, suppose we have a
position on the genome where a SNP gives us the following genetic variants: AAA
and AAT. We scan every codon on every gene, looking at SNP locations which have
this specific pair. We then keep a running sum of the ribosome counts at every
AAA and at every AAT. Lastly, we take the ratio of the sum of AAA counts to
the sum of AAT counts. In the subsequent analyses, we replace the count with the
outlier strength learned from our translation model. This ratio is assessed with
a binomial test to obtain a p-value representing the significance of ratios that differ
from one. For example, a large ratio means that the AAA allele is translated more slowly
than the AAT allele.
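For raw counts, the aggregation and test could be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used here: it assumes a two-sided exact binomial test with a null probability of 0.5 for each allele, and the function names are hypothetical.

```python
import math

def aggregate_counts(instances):
    """instances: list of (count_allele_1, count_allele_2), one per SNP instance."""
    a = sum(x for x, _ in instances)
    b = sum(y for _, y in instances)
    return a, b

def binomial_two_sided_p(a, b, p=0.5):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    n = a + b

    def log_pmf(k):
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1.0 - p))

    observed = log_pmf(a)
    total = sum(math.exp(log_pmf(k)) for k in range(n + 1)
                if log_pmf(k) <= observed + 1e-9)
    return min(1.0, total)
```

Summing counts before taking the ratio, rather than averaging per-instance ratios, is what gives high-abundance genes more weight. The log-gamma form keeps the probabilities finite for totals in the tens of thousands.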
Since several genes exhibited artifacts in the experimental protocol, where we saw
zero ribosome fragment counts despite the specified genotype, we informally scanned
for such anomalies (where we saw these genes creating the same skew for over half of
the allele-pairs) and removed these genes from the analysis. This eliminated on
average only 30 instances from an allele-pair (a relatively small fraction).
The global codon dwell times and the outliers were calculated according to the
previous chapter. In calculating the standard deviation for normalizing the outlier
residual, we used the reference allele at each position.
4.5 Conclusions
In this chapter, we presented an analysis of ribosome profiling data for 71 human
individuals. By aggregating various measures of ribosome pausing, we illustrated
the potential impact of genetic variation on variable codon translation rates. This
analysis can be harnessed for asking other questions: how are codon translation rates
affected amongst synonymous allele-pairs, how are elongation-level features related
to variable translation rates, and how are expression-level, protein-level, and
ribosome-occupancy-level QTLs related to elongation-level variation?
Chapter 5
RNA Secondary Structure
Prediction
5.1 Introduction
The development of genome-wide RNA secondary structure-probing assays has en-
abled new insight into the role of secondary structure in gene regulation, and has
spawned new computational methods that leverage these data for RNA secondary
structure prediction.
To date, RNA secondary structure prediction methods incorporating structure-
probing data extend energy-based methods, generally using the data to constrain
the space of possible structures considered by the algorithm. [142] and [73] adopt the
straightforward approach of enforcing hard constraints that particular nucleotides are
paired or unpaired in the MFE computation. [27] and [48] use the data as soft con-
straints, biasing the energy model to pair or unpair nucleotides based on their probing
signal. [131] estimate a perturbation to the energy model to encourage agreement be-
tween the basepairings predicted by the energy model and those inferred from the
experimental data. This perturbed energy model is then used to predict structures
using an MFE algorithm. [99] and [92] use the probing data in a post-processing
step to select amongst structures sampled from the structure ensemble defined by the
energy model [30].
In this chapter, we build upon the success of statistical methods and present a novel
method, CONTRAfold-SE, that incorporates multiple structure-probing datasets to
achieve improved prediction accuracy on diverse RNA sequences. A statistical ap-
proach provides two key advantages. First, it obviates the need for heuristic treat-
ments of the probing data (as in existing methods), such as thresholding the data
to a binary value (reflecting whether a base is paired or not) or incorporating it in
the energy model as a pseudo-energy term. Second, a statistical approach provides
a principled framework for combining data from multiple structure-probing experi-
ments. Each probing strategy has specific biases, and combining data obtained from
small-scale experiments using different strategies has been shown to improve predic-
tion accuracy [24].
Our method, CONTRAfold-SE, extends the statistical model of CONTRAfold
[33], one of the best-performing secondary structure prediction methods [103, 97], to
model the structure-probing data as observations of possibly unknown secondary
structures. This model can be learned from datasets containing only structure-
probing data, or a mix of known structures and probing data. CONTRAfold-SE can
then generate predictions on novel sequences from this learned model. By contrast,
CONTRAfold requires a set of complete structures to learn a model. We evaluated
CONTRAfold-SE using three genome-wide structure-probing datasets in yeast, based
on two different probing techniques that are performed in two different conditions.
We show that when predicting the structure of a novel sequence, CONTRAfold-SE is
competitive with current methods, and slightly outperforms CONTRAfold, on several
test sets of known RNA structures, whether using structure-probing data available
for the novel sequence or just the sequence itself. We find that combining datasets
in different probing conditions can have an adverse effect on performance, but other-
wise allows for cross-correction of errors in the data. CONTRAfold-SE outperforms
competing methods in predicting genes bound by RNA binding proteins (RBPs),
and is able to identify specific structural motifs bound by RBPs. Surprisingly, while
CONTRAfold-SE outperforms the existing state-of-the-art method SeqFold [92], we
find that its gains over CONTRAfold are modest, suggesting that using accurate sta-
tistical prediction models is an important supplement to current structure-probing
data. This method was developed in collaboration with Chuan-Sheng Foo.
5.2 Results
CONTRAfold-SE is a probabilistic model for single-sequence RNA secondary struc-
ture prediction that can utilize structure-probing data both for training the model
and in making predictions. For a single RNA sequence x, CONTRAfold-SE mod-
els the conditional probability of secondary structure y and structure-probing data
from sources d1, ..., dn (when available), given x. Secondary structure y is modelled
using the CONTRAfold statistical model. Briefly, the CONTRAfold model is a con-
ditional log-linear model for secondary structure given sequence, that uses features
analogous to structural motifs used in energy-based models; model parameters are
trained on a set of RNAs with known structures. Structure-probing data d are mod-
elled as observations of the (often unobserved) secondary structure. We refer to these
two components as the structure model (with associated CONTRAfold parameters)
and the data model (with parameters representing the distribution over the data).
CONTRAfold-SE takes as input {(x, y, d)}: a training set of RNA sequences with
associated secondary structures and one or more sources of structure-probing data
(if available). Having both known structure and probing data is not necessary, but
nonetheless desirable.
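The two components described above can be combined into a single conditional distribution. The following is a sketch consistent with the equations shown in Figure 5.1, under the illustrative assumption that probing measurements are conditionally independent across positions k and sources s given the structure; the indexing d_{s,k} and \theta_{s,\cdot} is our notation, and the exact parameterization is the one given in Methods and Appendix B.

```latex
\begin{align*}
P(y \mid x; w) &= \frac{\exp\!\big(w^{\top} F(x, y)\big)}{\sum_{y'} \exp\!\big(w^{\top} F(x, y')\big)}, \\
P(d_1, \dots, d_n \mid x, y; \theta) &= \prod_{s=1}^{n} \prod_{k} \mathrm{Gamma}\big(d_{s,k};\ \theta_{s,\,x_k,\,\mathrm{paired}(k, y)}\big), \\
P(y, d_1, \dots, d_n \mid x; w, \theta) &= P(y \mid x; w)\; P(d_1, \dots, d_n \mid x, y; \theta).
\end{align*}
```

Here F(x, y) is the CONTRAfold feature vector, x_k is the nucleotide at position k, and paired(k, y) indicates whether position k is paired in structure y.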
To estimate the parameters of the structure model and the data model, we maxi-
mize the model likelihood on the given training set. These estimated parameters can
then be used to perform predictions on arbitrary RNA sequences, with or without
supporting structure-probing data. If structure-probing data is not available or is too
noisy to be used, we can predict a structure based on the structure model alone, as
in CONTRAfold. If structure-probing data is available, it can be incorporated by
first updating model parameters in an additional round of training with the query
sequence as a single training example and then using the updated structure model
as before. The tradeoffs between these two prediction modes are discussed in the
following sections, but unless otherwise specified, prediction is done without using
structure-probing data. Figure 5.1 summarizes the components of CONTRAfold-SE,
and a detailed description of the model, estimation, and inference procedures is
found in the Methods section and Appendix B.
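The likelihood being maximized treats the two kinds of training examples differently: for a sequence with a known structure, the structure is scored directly, while for a data-only sequence the unobserved structure is summed out. The Python sketch below illustrates this on a toy problem, under assumptions that are ours rather than part of the method as described: structures come from a small explicit candidate list (the actual model sums over all structures; see Appendix B), and the Gamma parameters depend only on the paired/unpaired state rather than also on the nucleotide. All function names are hypothetical.

```python
import math

def gamma_logpdf(d, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at d > 0."""
    return ((shape - 1.0) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def structure_log_probs(scores):
    """Log-linear structure model: normalize scores w^T F(x, y) over candidates."""
    log_z = log_sum_exp(scores)
    return [s - log_z for s in scores]

def data_log_lik(data, paired_mask, theta):
    """Sum of per-position Gamma log densities.

    theta maps the paired state (True/False) to a (shape, scale) tuple.
    """
    total = 0.0
    for d, is_paired in zip(data, paired_mask):
        shape, scale = theta[is_paired]
        total += gamma_logpdf(d, shape, scale)
    return total

def example_log_lik(scores, masks, theta, known_index=None, data=None):
    """Log-likelihood contribution of one training example.

    scores: w^T F(x, y) for each candidate structure (toy enumeration).
    masks: per-candidate list of booleans, True where the base is paired.
    known_index: index of the known structure, or None if unobserved.
    data: per-position probing values, or None if unavailable.
    """
    log_p_y = structure_log_probs(scores)
    if known_index is not None:
        ll = log_p_y[known_index]
        if data is not None:
            ll += data_log_lik(data, masks[known_index], theta)
        return ll
    # Data-only example: marginalize over the unobserved structure.
    return log_sum_exp([lp + data_log_lik(data, m, theta)
                        for lp, m in zip(log_p_y, masks)])

# Toy usage: two candidate structures over a two-base "sequence".
theta = {True: (2.0, 1.0), False: (2.0, 3.0)}
masks = [[True, True], [False, False]]
structure_only = example_log_lik([1.0, 0.0], masks, theta, known_index=0)
data_only = example_log_lik([1.0, 0.0], masks, theta, data=[0.5, 4.0])
```

Summing such contributions over the training set gives the objective; in this sketch a data-only example influences the structure scores only through the marginal term.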
5.2.1 Improved Secondary Structure Predictions
We use the following notation: CONTRAfold-SE trained on “Train(DataSource)”
represents training on the specific set of sequences represented by “Train” with
structure-probing data from the source(s) labeled “DataSource”. We considered com-
binations of the largest high-throughput assays in yeast – the parallel analysis of RNA
structure (PARS) [62], and DMS-seq assays [105]. In the PARS assay, the RNA struc-
ture signal is obtained by treating RNA with enzymes that preferentially cleave either
paired or unpaired nucleotides. The DMS-seq assay relies instead on the reactivity of
unpaired nucleotides to the dimethyl-sulfate chemical; the DMS-seq assay was applied
to both renatured RNA and live yeast. We denote these sources PARS, DMS-vitro,
and DMS-vivo, respectively. The Methods section summarizes the training and test
sets we used. When comparing CONTRAfold-SE to CONTRAfold, we simply ex-
clude from the training set the data-only sequences and keep the same structure-only
[Figure 5.1 (schematic). Panels: TRAINING, using structure-only sequences and data-only sequences (bases marked unpaired or paired) to obtain the learned model weights w, θ; PREDICTION (no data); PREDICTION (with data). Structure model: y | x, w ~ \exp(w^T F(x, y)) / \sum_{y'} \exp(w^T F(x, y')). Data model: d_k | x, y, θ ~ Gamma_{x_k, paired(k, y)}(d_k; θ). The graphical model relates x, y, and d_1, ..., d_S.]
Figure 5.1: Overview of CONTRAfold-SE. During training, we learn the model parameters w and θ from a training set consisting of sequences with only known structure or only structure-probing data (or a combination of both, although in practice there are few such sequences available for both training and testing). At prediction, we use the model parameters to predict the structure of a new sequence (prediction without data). If data is available, we can also predict by incorporating this information (prediction with data).
sequences for fair comparison. We evaluate methods based on F-measure (calcu-
lated from sensitivity and positive predictive value (PPV)), accuracy, and AUC (see
Methods).
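As a concrete reference for the first metric, the sketch below computes sensitivity, PPV, and F-measure from sets of predicted and true base pairs. It assumes the standard definition of F-measure as the harmonic mean of sensitivity and PPV; the function names are hypothetical and the exact evaluation protocol is the one described in Methods.

```python
def f_measure(sensitivity, ppv):
    """Harmonic mean of sensitivity and PPV (0.0 if both are zero)."""
    if sensitivity + ppv == 0.0:
        return 0.0
    return 2.0 * sensitivity * ppv / (sensitivity + ppv)

def pair_metrics(predicted_pairs, true_pairs):
    """Return (sensitivity, PPV, F-measure) for two collections of base pairs.

    Each pair is a tuple (i, j) of base positions; order within a pair is ignored.
    """
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(predicted & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv, f_measure(sensitivity, ppv)

# Toy usage: 2 of 4 true pairs recovered, 2 of 3 predictions correct.
sens, ppv, f = pair_metrics([(1, 10), (2, 9), (3, 8)],
                            [(1, 10), (2, 9), (4, 7), (5, 6)])
```

The same `f_measure` helper can turn a published (sensitivity, PPV) point into an F-measure, which is how a single comparison number can be obtained for a method that reports only those two quantities.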
We first demonstrate how CONTRAfold-SE performs in comparison to CON-
TRAfold and SeqFold, the current state-of-the-art algorithm incorporating probing
data, on the small set of sequences with known structures presented in [92] (denoted by
Test-SeqFold). Table 5.1 shows that CONTRAfold-SE trained on Train-A(PARS),
a combination of structure-only sequences and data-only yeast mRNA sequences,
does at least as well as SeqFold on 6 of 10 of the sequences and at least as well as
CONTRAfold on all of the sequences. No probing data was included during predic-
tion and performance was measured by F-measure (to allow comparison to SeqFold).
Notably, CONTRAfold-SE achieves the same performance as CONTRAfold on 4
of 10 of the sequences, indicating that structure-probing data does not necessarily
contribute significantly to prediction quality. CONTRAfold-SE, like CONTRAfold,
offers a sensitivity-PPV tradeoff via a hyperparameter γ, which essentially adds more
basepairs with increasing γ; we select γ based on cross-validation (see Methods) when
comparing to other methods (such as SeqFold) with a single point on the sensitivity-
PPV curve; the full sensitivity-PPV curves are shown in Figures B.1 - B.10.
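To illustrate how a single parameter γ can trade sensitivity against PPV, the sketch below decodes a nested structure from base-pairing probabilities by maximizing an expected-accuracy score in which each paired base is weighted by γ. This is a simplified illustration in the spirit of maximum-expected-accuracy decoding, not the implementation used here: the objective, the Nussinov-style recursion, the `min_loop` constraint, and all names are assumptions made for the example.

```python
def mea_structure(pair_probs, gamma, min_loop=3):
    """Decode a nested structure from pairing probabilities.

    pair_probs: n x n matrix; pair_probs[i][j] for i < j is the posterior
        probability that bases i and j pair (only the upper triangle is read).
    gamma: weight on paired bases; larger values yield more base pairs.
    min_loop: minimum number of unpaired bases enclosed by a pair.
    Returns a sorted list of (i, j) pairs.
    """
    n = len(pair_probs)
    # Probability that each base is unpaired.
    unpaired = []
    for i in range(n):
        paired_mass = sum(pair_probs[min(i, j)][max(i, j)]
                          for j in range(n) if j != i)
        unpaired.append(max(0.0, 1.0 - paired_mass))
    # best[i][j]: best score for the half-open interval [i, j).
    best = [[0.0] * (n + 1) for _ in range(n + 1)]
    partner = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # Option 1: base i is unpaired.
            score, arg = unpaired[i] + best[i + 1][j], None
            # Option 2: base i pairs with some k inside the interval.
            for k in range(i + min_loop + 1, j):
                cand = (2.0 * gamma * pair_probs[i][k]
                        + best[i + 1][k] + best[k + 1][j])
                if cand > score:
                    score, arg = cand, k
            best[i][j], partner[i][j] = score, arg
    # Trace back the chosen pairs.
    pairs, stack = [], [(0, n)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        k = partner[i][j]
        if k is None:
            stack.append((i + 1, j))
        else:
            pairs.append((i, k))
            stack.append((i + 1, k))
            stack.append((k + 1, j))
    return sorted(pairs)

# Toy usage: one confident pair (0, 9) and one uncertain pair (1, 8).
n = 10
probs = [[0.0] * n for _ in range(n)]
probs[0][9] = 0.9
probs[1][8] = 0.3
conservative = mea_structure(probs, gamma=1.0)
permissive = mea_structure(probs, gamma=4.0)
```

In this toy case the uncertain pair is added only at the larger γ, which is the qualitative behavior described above: sweeping γ traces out a sensitivity-PPV curve.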
CONTRAfold-SE can also incorporate available structure-probing data for the
query sequences during prediction. In this prediction mode, a data tuning parameter
shifts the structure model either closer to or farther from the distribution represented
by the data of the query sequence. Table 5.1 shows the F-measure on Test-SeqFold
where structures are predicted using the cross-validated data tuning parameter (see
Methods). In one case (snR81), prediction with data improves the F-measure over
prediction without data by a substantial 8% (0.82 to 0.89), but, surprisingly, the
F-measure is unchanged in four cases and drops in five. The data quality for a single query sequence is likely poor
enough or diverse enough that without pushing the prediction toward the (smoothed)
Sequence    C-SE    C-SE (query data)    SeqFold    CONTRAfold
ASH1-E1     0.58    0.58                 0.79       0.52
RDN58-2     0.54    0.54                 0.52       0.53
p4p6        0.90    0.84                 0.82       0.90
p9          1.00    0.98                 0.98       1.00
snR10       0.72    0.66                 0.83       0.69
snR33       0.89    0.73                 0.76       0.89
snR37       0.73    0.72                 0.94       0.71
snR46       0.75    0.75                 0.88       0.74
snR53       0.67    0.67                 0.56       0.67
snR81       0.82    0.89                 0.77       0.80

Table 5.1: F-measure of CONTRAfold-SE (C-SE) trained on Train-A(PARS) and evaluated on Test-SeqFold. The presented structures for CONTRAfold-SE and CONTRAfold are based on a hyperparameter γ selected by cross-validation (see Methods). SeqFold F-measure is calculated from the sensitivity and PPV presented in [92]. CONTRAfold-SE with query data (C-SE (query data)) incorporates the data per test sequence during prediction based on a cross-validated data tuning parameter δ (see Methods) and a fixed γ (the average cross-validated γ from CONTRAfold-SE). CONTRAfold-SE is competitive with SeqFold on 6 of 10 sequences, even without data at prediction. Bolded numbers indicate the algorithm with highest F-measure across all algorithms.
structure model parameters, the algorithm cannot sufficiently correct it.
Since CONTRAfold alone does at least as well as SeqFold on 6 of 10 sequences,
and is only marginally worse than CONTRAfold-SE, we will largely focus the subse-
quent results on how CONTRAfold-SE compares to CONTRAfold. We next demon-
strate that structure-probing data can be beneficial for learning general-purpose sec-
ondary structure prediction models, by evaluating the learned models on two test
sets with a diverse set of RNA structures, as compiled in [103] (denoted by
Test-Tornado-TestSetA and Test-Tornado-TestSetB). In this experiment (Table 5.2),
CONTRAfold-SE trained on Train-A(PARS) outperforms CONTRAfold on all three
metrics. The overall performance differences are fairly small, possibly because the
sequences are short (means of 192 nt and 121 nt, respectively) and hence presumably easier to
predict in the first place. These sequences also cover a wide range of RNA families
that may not be reflected in the training set.
Training set                  AUC       F-measure   Accuracy

Test-Tornado-TestSetA
Train-A with CONTRAfold       0.7110    0.7122      0.7610
Train-A(PARS)                 0.7169    0.7177      0.7640
Train-B(PARS)                 0.7209    0.7203      0.7662
Train-B(DMS-vitro)            0.7126    0.7128      0.7616
Train-B(DMS-vivo)             0.7096    0.7105      0.7604
Train-B(PARS,DMS-vitro)       0.7240    0.7214      0.7662

Test-Tornado-TestSetB
Train-A with CONTRAfold       0.6178    0.6478      0.7498
Train-A(PARS)                 0.6236    0.6537      0.7519
Train-B(PARS)                 0.6256    0.6554      0.7535
Train-B(DMS-vitro)            0.6201    0.6499      0.7514
Train-B(DMS-vivo)             0.6172    0.6483      0.7498
Train-B(PARS,DMS-vitro)       0.6252    0.6558      0.7551

Test-mRNA
Train-A with CONTRAfold       0.7158    0.7152      0.7597
Train-A(PARS)                 0.7158    0.7141      0.7578
Train-B(PARS)                 0.7129    0.7097      0.7549
Train-B(DMS-vitro)            0.7189    0.7164      0.7608
Train-B(DMS-vivo)             0.7159    0.7147      0.7600
Train-B(PARS,DMS-vitro)       0.7170    0.7149      0.7576

Table 5.2: Performance of CONTRAfold-SE trained on Train-A and Train-B and evaluated on three general test sets. CONTRAfold performance is shown for reference. Performance metrics are explained in Methods.
We hence extended our evaluation with an additional test set that better reflects
the training set, Test-mRNA, which includes 188 highly conserved mRNA sequences
[105] (see Methods). However, we find that incorporating data during training with
CONTRAfold-SE does not result in significant improvement, potentially because the
“ground truth” structures in this case may differ from the ones reflected by the data.
That is, true structures are calculated here from phylogenetic conservation of secondary
structure, in which basepairs covary between species. Table 5.2 shows that
there is no significant improvement of CONTRAfold-SE trained on Train-A(PARS)
compared with CONTRAfold.
5.2.2 The Value of Structure-Probing Data
Although probabilistic models can help account for uncertainty and noise in measure-
ments, they still require data with strong signal for robust parameter estimation. In
[99], the noise level of the data was manipulated in a synthetic train-test environ-
ment in order to explore the sensitivity of RNA secondary structure prediction to
noise. Here, we instead manipulate the noise level by varying the number of probing
data-only instances in the training set.
We first explored how performance changes when increasing the fraction of se-
quences with only PARS structure-probing data from 50% (Train-A50%) up to 100%
(Train-A100%) while keeping the total number of sequences in the training set constant.
Performance on Test-SeqFold degrades rapidly as we rely more on structure-probing
data (Table 5.3). This indicates that having a substantial set of known
structures is important so that the noise in the experimental data does not over-
whelm the signal in the known structures. To explore how much value the probing
data brings to model estimation, we next fix the number of sequences with known
structure and increase the number of sequences with probing data (Train-A75 to
Train-A100). We again see that performance is harmed (though less rapidly) as we
rely more on experimental information. The training set sequences with structure-
probing data (mRNAs) are not necessarily representative of the test set (mainly
rRNAs), and so as the number of these sequences in the training set increases, the
algorithm could be learning a different set of conformational rules. The same trends
hold for Test-mRNA (Table 5.3): increasing the number of data-only sequences in
the training set still degrades performance (perhaps slightly less so), even though
both the training and test sets contain mRNA sequences. This result
highlights the difficulty of learning from incomplete data, and suggests that a deeper
understanding of the biases in the data is required in order to improve the data model.
Training set            (#known,#data)   AUC       Accuracy   F-measure

Test-SeqFold
Train-A CONTRAfold      (119, 0)         0.8665    0.8158     0.8466
Train-A50%              (119, 119)       0.8751    0.8323     0.8580
Train-A75%              (60, 178)        0.8533    0.8223     0.8468
Train-A100%             (0, 238)         0.2883    0.3586     0.5614
Train-A75               (119, 178)       0.8734    0.8309     0.8581
Train-A100              (119, 238)       0.8662    0.8274     0.8558

Test-mRNA
Train-A CONTRAfold      (119, 0)         0.7158    0.7152     0.7597
Train-A50%              (119, 119)       0.7158    0.7141     0.7578
Train-A75%              (60, 178)        0.7030    0.7081     0.7514
Train-A100%             (0, 238)         0.2808    0.3683     0.5508
Train-A75               (119, 178)       0.7142    0.7106     0.7559
Train-A100              (119, 238)       0.7109    0.7075     0.7519

Table 5.3: Performance of CONTRAfold-SE trained on sets of varying compositions with PARS data and evaluated on two test sets. In each test set, the first row gives the CONTRAfold performance. The next three rows maintain the same total number of training sequences (238) while changing the fraction of sequences with only structure-probing data. Train-A50% is exactly the training set Train-A(PARS) (and is the same as Train-A50). The last two rows maintain the same number of training sequences with known structure (119) while increasing the number of sequences with structure-probing data. Performance metrics are explained in Methods.
[Figure 5.2 (bar chart). Title: CONTRAfold-SE performance on Test-SeqFold, trained on Train-B. Metrics on the x-axis: AUC, F-measure, Accuracy; y-axis range 0.75 to 0.9. Legend: CONTRAfold; PARS; DMS-vitro; DMS-vivo; PARS + DMS-vitro; PARS + DMS-vivo; DMS-vivo + DMS-vitro; PARS + DMS-vitro + DMS-vivo.]
Figure 5.2: CONTRAfold-SE performance using different data sources. Models are trained on Train-B with all combinations of the PARS, DMS-vitro, and DMS-vivo data sources and evaluated on Test-SeqFold.
5.2.3 Combining Data from Multiple Data Sources
To our knowledge, CONTRAfold-SE is the first algorithm that can incorporate mul-
tiple data sources, using its probabilistic framework to combine them in a principled
way, and enabling cross-correction of errors. Figure 5.2 shows prediction performance
on Test-SeqFold using all possible combinations of the PARS, DMS-vitro, and DMS-vivo
data sources for Train-B. Combining the two in vitro assays, DMS-vitro and
PARS (green bar), yields the best performance, and boosts it above that of each
assay alone, thereby demonstrating how we can compensate for errors in one source
with a second, complementary data source.
Overall, each combination of two data sources performs better than the individual
sources alone. However, interestingly, using all three data sources (maroon bar)
does not outperform the PARS/DMS-vitro combination. This is consistent with the
observation in [105] that many sequences have different structures in vivo as compared
to in vitro. Indeed, in the combinations with two sources, the ones including an in
vivo data source perform worst, even when the same DMS-seq assay is used (red bar).
These conflicts between in vivo and in vitro sources, in conjunction with the fact that
we are evaluating on “true” structures measured in an in vitro setting, can reduce the
ability of training to generalize to unseen test sequences, thereby decreasing prediction
accuracy when training on all three sources. On the sets Test-Tornado-TestSetA
and Test-Tornado-TestSetB, CONTRAfold-SE trained on Train-B(PARS,DMS-vitro)
generally does better than when trained on either dataset alone (Table 5.2). On
Test-mRNA, we observe that the PARS/DMS-vitro combination does worse than
DMS-vitro alone (Table 5.2), perhaps because our “ground truth” in this set may not
reflect the true structure.
5.2.4 Classification of RNA-Binding Protein Targets
Although many RBPs recognize specific sequence motifs situated in single-stranded
RNA, the secondary structure context near the motif plays an important role in target
recognition [9, 91, 53]. [41] developed CapR, a method to analyze binding data from
RNA-protein interaction assays, specifically cross-linking immunoprecipitation high-
throughput sequencing (CLIP-Seq), to determine if RBPs bind specific structural
motifs. In addition, structure prediction with SeqFold [92] is shown to better classify
RBP targets determined by such an assay than simply using a threshold on the
number of motifs to distinguish bound and unbound targets.
Here, we give further validation of CONTRAfold-SE as a tool for predicting sec-
ondary structure as it affects regulatory functions. Specifically, we will show that
CONTRAfold-SE accurately distinguishes RBP sequence motifs that are bound from
those that are unbound, and suggest an associated structure specificity profile for
each RBP. We again use the RIP-chip study in yeast [54] in a setup similar to that in
[92] (see Methods), and predict the structure (specifically, the probability that each
base is paired) for each mRNA using CONTRAfold-SE trained on Train-B(PARS,
DMS-vitro) (i.e. the best-performing data combination). In addition, we evaluate
performance of CONTRAfold-SE trained on Train-B(DMS-vivo), since RBP binding
only occurs in vivo.
By aggregating the accessibility over motifs per gene, we obtain a score that we can
threshold to predict whether the gene is truly bound, thereby generating a receiver
operating characteristic (ROC) curve from different thresholds. Figure 5.3 compares
the ROC curves for CONTRAfold-SE on Train-B(PARS, DMS-vitro) computed for
10 motifs. The accessibility per gene is calculated using the same sum over motif
instances as in [92], but modified to account for gene length (see Methods), which is
a better measure than the sum itself (Table B.1). Compared to a similarly computed
score for SeqFold and the motif count baseline, CONTRAfold-SE yields better AUC
on 8 and 9 motifs, respectively, out of 10. CONTRAfold-SE also outperforms
CONTRAfold itself (for 6 motifs), as well as the motif count divided by gene
length (for 8 motifs). However, CONTRAfold-SE trained on Train-B(DMS-vivo)
does not generally outperform the algorithm trained on Train-B(PARS, DMS-vitro),
and the differences between CONTRAfold-SE and CONTRAfold are again small.
Finally, the same qualitative comparisons hold when using the aggregate score not
normalized by gene length (Table B.1).
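The classification setup above can be sketched in a few lines. The example below assumes, for illustration only, that the per-gene score is the sum over motif instances of the mean accessibility (one minus pairing probability) within each instance, optionally divided by gene length; the exact formula is the one given in Methods. The AUC is computed through its equivalence with the Mann-Whitney statistic. Function names are hypothetical.

```python
def gene_score(pair_prob, motif_starts, motif_len, normalize=True):
    """Aggregate accessibility over motif instances in one gene.

    pair_prob: per-base pairing probabilities for the transcript.
    motif_starts: 0-based start positions of motif instances.
    normalize: if True, divide the sum by the gene length.
    """
    total = 0.0
    for start in motif_starts:
        window = pair_prob[start:start + motif_len]
        total += sum(1.0 - p for p in window) / len(window)
    return total / len(pair_prob) if normalize else total

def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a bound gene
    outscores an unbound gene, counting ties as one half."""
    positives = [s for s, bound in zip(scores, labels) if bound]
    negatives = [s for s, bound in zip(scores, labels) if not bound]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy usage: four genes, the first two truly bound.
auc = roc_auc([0.9, 0.2, 0.3, 0.1], [True, True, False, False])
```

Sweeping a threshold over the gene scores yields the ROC curve itself; the pairwise form above gives the same area without constructing the curve.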
5.2.5 Nucleotide-Level Structure Contexts for RNA-Binding Proteins
In addition to classification of RBP targets, CONTRAfold-SE can be used to study
the specific pairing partners of bases involved in bound RBP motifs. We return to the
[Figure 5.3 (ten ROC panels): PUF4-1, PUB1-1, PUF2-1, PAB1-1, KHD1-1, NAB2-1, YLL032C-1, VTS1-1, PIN4-1, NRD1-1. x-axis: False Positive Rate; y-axis: True Positive Rate. Legend: CONTRAfold-SE(PARS,DMS-vitro); SeqFold; # Motifs.]
Figure 5.3: Classification of RNA binding protein targets into true bound versus false bound genes. The receiver operating characteristic (ROC) curve uses a thresholded sum of accessibilities over motifs, normalized by gene length (see Methods). CONTRAfold-SE outperforms SeqFold and a baseline of the motif count per gene.
more complex human setting and compute the structure profile and potential pairing
partners within and around the motif region for ten human RBPs. In a setup similar
to the evaluation of CapR [41], we use a CLIP-seq dataset to identify true bound
and false bound targets. CONTRAfold-SE is trained on PARS data from human
lymphoblastoid cell lines [130] (see Methods). Figure 5.4 shows our predictions on
binding protein FXR2 with motif WGGA. The top panel gives the average structure
profile for the true bound versus false bound regions. The significance of the separation
is measured by a Mann-Whitney-Wilcoxon test (middle panel). More interestingly,
we can compute a dotplot (bottom panel) showing the pairing probabilities for each
base partner. Although CapR is able to predict the probability that a specific RBP
sequence motif lies in various structure contexts (e.g. a hairpin region or a stem
region; Figure 4 in [41]), our algorithm predicts the exact pairing partners (i.e. the
coordinates of the stem itself). For FXR2 in particular, CapR reflects an affinity for
stems near the start of the motif; indeed, we also show a high affinity for pairedness,
and in addition, can identify that it most likely runs through positions 35-39 paired
with 24-20. Our results qualitatively agree with the four other motifs presented in [41]
(Figure B.11). Interestingly, for SF2ASF, CapR predicts the motif to be unpaired,
except with reduced uncertainty in the middle of the motif. We similarly show a drop
in pairedness in the middle of the motif, surrounded by higher accessibility, although
we find two possible scenarios: that the motif region is unpaired and flanked by two
stems (bottom left and top right of the dotplot) or lies in the hairpin loop of a stem
(top left and bottom right of the dotplot). CONTRAfold-SE can therefore be used
as a tool for nucleotide-level analysis of putative structure profiles determined by a
CapR screen.
[Figure 5.4 (three panels), FXR2 (WGGA). Top: Pairing Probability by sequence position for true versus false bound motifs. Middle: Mann-Whitney-Wilcoxon −log10 p-value by sequence position. Bottom: Pairing Partners Heat Map for True Bound Genes; both axes: Sequence Position (1 to 40); color scale 0 to 7 × 10^-3.]
Figure 5.4: Nucleotide-level structure prediction for the true bound sequences of RNA binding protein FXR2 with motif WGGA. The top panel shows the average pairing probability for the true bound versus false bound motifs. Dashed red lines mark the position of the motif within the 40bp region around it. The middle panel shows the −log10 p-value from a Mann-Whitney-Wilcoxon test on the separation of the structure profile distributions at each position. The bottom panel shows the dotplot of probabilities of specific basepairings in the region, such that darker squares correspond to a higher probability that those two base positions are paired.
5.2.6 Structure and Translation Efficiency under Oxidative Stress
As we showed in previous chapters, in eukaryotes (and other species), secondary struc-
ture is thought to play an important role in regulation of translation [86], especially
during initiation [112]. In particular, PARS structure-probing data in endogenous
conditions [62] has been shown to correlate most strongly with ribosome density at
translation initiation, and in this thesis and in [96] we showed a similar result with DMS-seq data.
With our new method, CONTRAfold-SE, we can calculate this correlation at a more
reliable resolution. Indeed, we find an even stronger correlation (Spearman r = 0.2
versus r = 0.07 or r = 0.15) for a larger region around the AUG when we replace
accessibility calculated from raw DMS data with accessibility calculated from the
pairing probabilities of CONTRAfold-SE (trained on Train-B with DMS-vitro and
DMS-vivo; see Methods) (Figure 5.5). Notably, with CONTRAfold-SE, unlike with
the (sparse) raw DMS data, we can include many more genes in this analysis.
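The windowed correlation analysis can be sketched as follows: for each gene, accessibility is averaged over a window of the predicted pairing probabilities, and the resulting per-gene values are rank-correlated with translation efficiency. This is a minimal illustration with hypothetical function names; the window placement, the definition of accessibility as one minus pairing probability, and the handling of ties are assumptions, and the actual procedure is described in Methods.

```python
import math

def window_accessibility(pair_prob, start, width=40):
    """Mean accessibility (1 - pairing probability) in [start, start + width)."""
    window = pair_prob[start:start + width]
    return sum(1.0 - p for p in window) / len(window)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Toy usage: three genes, one window starting at position 0 of each.
profiles = [[0.9] * 50, [0.5] * 50, [0.1] * 50]
accessibility = [window_accessibility(p, start=0) for p in profiles]
log_te = [-1.0, 0.2, 0.8]
r = spearman(accessibility, log_te)
```

Repeating the calculation with the window start shifted along the transcript gives one correlation per position, as plotted in Figure 5.5.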
Under stress conditions, the translation mechanism becomes even more compli-
cated, with genes relying, for example, on structured elements called internal ribo-
some entry sites (IRESs) to bypass cap-dependent recruitment of the mRNA. Several
features have been studied in correlation with varying protein expression and mRNA
levels in oxidative stress [127]; here we use CONTRAfold-SE to study the effect of sec-
ondary structure on these dynamics. Using ribosome profiling data gathered in yeast
at different time points under oxidative stress [44], we obtain estimates of transla-
tion efficiency (TE), which we correlate to the pairing probability over several regions
of interest, calculated from CONTRAfold-SE trained on Train-B(DMS-vivo) (Table
B.2). Average structure in the coding region correlates significantly with the change
in efficiency between baseline and 30 minutes (Spearman p = 8 × 10^-25). The correlation
is positive (Spearman r = 0.19), suggesting that more structured regions are
[Figure 5.5 (two panels). x-axis: position [nt] (0 to 250). y-axes: Log Translation Efficiency Correlation with CONTRAfold-SE (DMS-vitro) Accessibility (top) and with CONTRAfold-SE (DMS-vivo) Accessibility (bottom), range −0.05 to 0.2. Legend: not significant; significant.]
Figure 5.5: Correlation between translation efficiency per gene and the accessibility in rolling windows of 40nt, as predicted by CONTRAfold-SE.
associated with a greater drop in efficiency under heavier stress conditions. Protein
expression measurements gathered via mass spectrometry also indicate that genes
with low protein expression under stress show enrichment for structure in the coding
region [127]. Interestingly, the correlation for the coding sequence feature is higher
than that for the 5’ UTR (r = 0.13, p = 6 × 10^-13) and much higher than that for the
3’ UTR (r = 0.05, p = 3 × 10^-3), indicating these regions may play different roles in
translation. Reassuringly, other metrics for aggregating pairing probability in each
region are similarly significant, and the same correlations hold for CONTRAfold-SE
on Train-B(PARS, DMS-vitro) (Table B.2). One possible explanation for this finding
might be that unfolding of highly structured RNA during elongation requires more
energy and cellular resources that are not available under stress conditions, although
the mechanism behind this hypothesis requires further investigation. Indeed, the
correlation is stronger for the 30min time point compared to the 15min time point
(Table B.2). Furthermore, the correlation to change in TE is stronger than to each
individual TE (Table B.2). Although the correlation to initial TE is strong (p-values
< 1 × 10^-7 for the coding sequence and 5’ UTR features), the correlation to change
in TE remains significant (p-value < 0.05) when conditioning on the initial TE (for
all features in replicate 2 and all features except the coding sequence in replicate 1),
indicating that structure may play a role during stress.
5.3 Discussion
Our method, CONTRAfold-SE, shows improved performance over existing methods,
whether or not structure-probing data is provided at test time. Strikingly, the CON-
TRAfold method alone is highly competitive in all our experiments. In many cases,
CONTRAfold even outperforms methods that utilize the structure-probing data. This
suggests that much of the previously demonstrated improvements in prediction per-
formance when using structure-probing data may be more simply obtained through
the use of more accurate statistical models. Our results suggest that incorporating probing
data into these models only provides a relatively modest improvement.
There are several potential reasons why the probing data did not provide a sig-
nificant boost in performance. Firstly, the probing data are very sparse: there is
typically no signal for many bases in an RNA transcript. The bases that are mea-
sured therefore may not provide further information to the already accurate statistical
models. Deeper sequencing of the probing libraries would help reduce data sparsity.
Secondly, and more importantly, there are complex structure-dependent biases in the
probing data used [37, 135] that we (and many others) do not account for. While
data from the selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE)
[83] method for structure-probing is thought to be less biased [36], a large-scale assay
for it has yet to be developed. As such, we were unable to evaluate the improve-
ment from data generated using this method, since CONTRAfold-SE requires at
least 50-100 sequences with probing data to learn a good model. In addition, [131]
observe that the probing data from current SHAPE chemistry still has a dependence
on structure context, suggesting that the expected improvement with SHAPE may
not be substantially greater than with the DMS and PARS datasets we used. A third
possible reason is that since RNA typically folds into several structures that co-exist,
the probing data is in fact derived from a mixture of structures. This would violate
the assumption in the CONTRAfold-SE model that the observed probing data orig-
inates from a single structure. Extending CONTRAfold-SE to account for the fact
that multiple structures could give rise to the probing data is not trivial, and is an
interesting topic for future work.
A key benefit of our probabilistic framework is its modularity, allowing the re-
placement of individual components with small changes to the learning and prediction
algorithms. One could, for instance, replace the structure model with an alternative
that allows higher-order structures such as pseudoknots, which are involved in var-
ious important functions, such as frameshifting during translation [115]. One could
also replace the Gamma distributions used in the data model with Poisson distribu-
tions to directly model the count nature of the sequencing data. In fact, it would
be straightforward to use any data model in the exponential family, a broad class
that covers most commonly used distributions. A more involved extension would be
to model the structure-dependent biases in the probing data, for instance, by having
separate models for bases in stacked pairs and at the end of hairpin loops, as done
in [117]. Finally, one could integrate other information about the structure of the
sequence as another data source. For example, information from solvent accessibility
models could be used to incorporate the fact that if an unpaired base is inaccessible, it
may falsely appear to be paired in a DMS probing assay [105].
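To make the remark about exponential-family data models concrete, both the Gamma and the Poisson distributions can be written in the common form $p(d \mid \eta) = h(d)\,\exp\big(\eta^{\top} T(d) - A(\eta)\big)$. The parameter names below (shape $\alpha$, rate $\beta$, mean $\lambda$) are chosen for illustration and need not match the parameterization in Methods.

```latex
\begin{align*}
\text{Gamma:}\quad
p(d \mid \alpha, \beta) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, d^{\alpha - 1} e^{-\beta d}
 = \exp\!\big( (\alpha - 1)\log d - \beta d - [\log \Gamma(\alpha) - \alpha \log \beta] \big), \\
&\quad \eta = (\alpha - 1,\ -\beta), \quad T(d) = (\log d,\ d), \quad
 A(\eta) = \log \Gamma(\alpha) - \alpha \log \beta, \quad h(d) = 1, \\[4pt]
\text{Poisson:}\quad
p(c \mid \lambda) &= \frac{\lambda^{c} e^{-\lambda}}{c!}
 = \frac{1}{c!}\, \exp\!\big( c \log \lambda - \lambda \big), \\
&\quad \eta = \log \lambda, \quad T(c) = c, \quad A(\eta) = e^{\eta} = \lambda, \quad h(c) = \frac{1}{c!}.
\end{align*}
```

Swapping one for the other changes only the sufficient statistics $T$ and the log-normalizer $A$, which is why such a replacement would leave the rest of the learning and prediction procedure largely unchanged.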
Our approach is most similar in spirit to that of [131], in that we adapt an ex-
isting model to the structure-probing data in a principled way; the method in [131]
learns a perturbation to the thermodynamic energies to try and match the posterior
probabilities of base pairs in the thermodynamic ensemble to the probing data. Like
[24] and [117], we make use of probabilistic models for the experimental data, and
integrate data from multiple sources. We use the probing data to update our probabilistic
model via Bayes' rule, which is reminiscent of SeqFold’s heuristic approach
of picking the peak in the structure distribution that is closest to the observed data
[92]. However, unlike these other works, we have unified these various components in
a probabilistic framework, thus enabling additional synergies amongst its parts. For
instance, because CONTRAfold-SE jointly estimates the data and structure mod-
els, the method can learn which probing data sources are more reliable and reduce
its reliance on noisier data sources in estimating the structure model. By contrast,
standalone estimation of data models (e.g. in [24, 131, 92, 117]) requires associated
ground-truth structures, and it is not immediately obvious how multiple data sources
can be combined. Furthermore, if the data are extremely noisy, fitting their distri-
bution alone on a small set of ground-truth structures could still lead to inaccurate
results when used in conjunction with a prediction algorithm that uses the likelihoods
as in [116, 36]. A recent review [36] provides a comprehensive critique of algorithms
that use structure-probing data for secondary structure prediction and describes an
alternative method that integrates structure-probing data in a probabilistic way. Our
method was developed independently and differs from the proposals in [36], as we
also enable parameter learning for the structure model using the probing data.
In an analysis capacity, we demonstrated two benefits of our method. First, we
showed that by incorporating one or more probing datasets in a principled proba-
bilistic framework, CONTRAfold-SE can help mitigate limitations in the structure-
probing data. Combining multiple datasets effectively increases data density, and
allows the model to perform some correction for the biases in the individual datasets.
These benefits are seen in our experiments, where combining PARS and DMS-vitro
data led to improved prediction performance beyond that of using either data alone.
We also found that combining datasets from in vitro and in vivo studies led to de-
graded prediction performance, consistent with the findings in [105] that in vitro
and in vivo RNA structures differ. Second, once trained on (one or more) structure-
probing datasets, CONTRAfold-SE can be used to provide per-base accessibility esti-
mates, filling in the many gaps in the sparse probing data. Indeed, we can explore the
structure profiles of RBP bound sites and classify RBP targets with CONTRAfold-SE.
If we had to rely on structure-probing data alone, these analyses would be hampered
by the need to throw out many sequences without sufficient data, and reduced signal-
to-noise in the noisy raw data. Similarly, sparsity of coverage in specific parts of a
sequence, such as at the 5’ end for DMS-seq, reduces the numbers of data points over
which we can test hypotheses about the effect of RNA structure on gene regulation,
such as about translation efficiency.
A key requirement for obtaining good performance with statistical methods for
RNA secondary structure prediction is the use of large, diverse sets of training data
[103]. However, most available structures are those of RNA found in bacteria and
viruses. Few structures are available for structural elements in the 5’ and 3’ UTRs of
mammalian mRNAs, which are increasingly found to play critical roles in regulating
gene expression. We suggest that genome-wide RNA structure-probing data can
plug this data gap, and will allow greatly improved prediction performance on this
important class of structures. Indeed, CONTRAfold-SE’s gains in performance over
CONTRAfold vary with the test set. The growing number of structure-probing
datasets will provide a rich source of data for elevating the performance of statistical
methods for RNA secondary structure prediction to the next level, by allowing the
effective training of more sophisticated structure models [139, 103]. Improvement
in the prediction of mammalian RNA structures, particularly of regulatory RNAs
and regulatory regions of mRNAs, and its integration into downstream discovery
applications, such as translational dynamics and functional elements, will certainly
lead to an expanded understanding of the role of RNA structure in gene regulation.
5.4 Methods
5.4.1 The CONTRAfold-SE Model
Our model assumes that structure-probing data are available at per-base resolution
– that there is a probing signal for some set of bases in a given RNA sequence.
The processed DMS-seq and PARS signals at each base are modelled using Gamma
distributions. While our probabilistic framework can accommodate any distribution
in the exponential family (see Appendix B), we chose the Gamma distribution as
a flexible family of distributions for modelling the continuous, unbounded, probing
signal. In addition, the Gamma distribution models the data well (Figure B.12),
and has been previously used for DMS-seq, SHAPE and CMCT reactivities [24]. We
assume that bases are independently modified, so that the resultant probing signals
are independent of the actual location within the RNA sequence. However, as the
reactivities of different bases could differ based on their identity and whether they are
paired (for instance, DMS preferentially modifies unpaired adenines and cytosines),
we have incorporated a separate distribution for each combination of base identity
(A, C, T or G) and pairedness state (paired or unpaired) for a total of 8 separate
Gamma distributions in our data model.
Formally, for an RNA sequence $x$ of length $L$ with secondary structure $y$ and
associated structure-probing data $d = (d_1, \ldots, d_L)$, the distribution for the probing
signal $d_k$ at base $k$ in the sequence is given by
$$d_k \mid x_k, y \sim \mathrm{Gamma}\left(\alpha_{x_k,\mathrm{paired}(k,y)},\ \beta_{x_k,\mathrm{paired}(k,y)}\right) \qquad (5.1)$$
where $x_k \in \{A, C, T, G\}$ is the identity of base $k$ in the sequence, $\mathrm{paired}(k, y)$ denotes
whether base $k$ in structure $y$ is paired, and the Gamma density is defined as
$$\frac{x^{\alpha-1}}{\Gamma(\alpha)\,\beta^{\alpha}} \exp(-x/\beta), \quad \text{for } x \sim \mathrm{Gamma}(\alpha, \beta).$$
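As a minimal sketch of the data model in Equation 5.1, the Gamma log-density can be evaluated per base with one (shape, scale) pair for each combination of base identity and pairedness. The parameter values in `GAMMA_PARAMS` and the function names are hypothetical and chosen only for illustration; the fitted values are not given in this section.

```python
import math

# Hypothetical (shape alpha, scale beta) values, for illustration only.
# One entry per (base identity, is_paired) combination: 8 in total.
GAMMA_PARAMS = {
    ("A", True): (2.0, 0.5), ("A", False): (2.0, 1.5),
    ("C", True): (2.0, 0.5), ("C", False): (2.0, 1.5),
    ("G", True): (2.0, 1.0), ("G", False): (2.0, 1.0),
    ("T", True): (2.0, 1.0), ("T", False): (2.0, 1.0),
}

def gamma_logpdf(d, alpha, beta):
    """Log of d^(alpha-1) * exp(-d/beta) / (Gamma(alpha) * beta^alpha)."""
    if d <= 0:
        raise ValueError("the Gamma density is defined for positive signals")
    return ((alpha - 1.0) * math.log(d) - d / beta
            - math.lgamma(alpha) - alpha * math.log(beta))

def signal_logpdf(d, base, is_paired, params=GAMMA_PARAMS):
    """Log-density of probing signal d at a base of the given identity and state."""
    alpha, beta = params[(base, is_paired)]
    return gamma_logpdf(d, alpha, beta)
```

With these illustrative parameters, a large signal at an adenine is more likely under the unpaired distribution than under the paired one, which is the kind of asymmetry the separate distributions are meant to capture.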
Model Specification Let $x$ be an RNA sequence of length $L$ with structure $y$ and $S$
associated structure-probing datasets $d$. We denote by $d^{(j)}_k$ the probing signal for the
$j$th data source at base $k$ in the sequence. CONTRAfold-SE models the conditional
joint probability of the structure and probing data given sequence as
$$P(y, d \mid x; w, \theta) = P(y \mid x; w) \prod_{j=1}^{S} \prod_{k=1}^{L} P(d^{(j)}_k \mid x_k, y; \theta^{(j)}). \qquad (5.2)$$
Here, the structure model $P(y \mid x; w)$ is given by the conditional log-linear model of
CONTRAfold with parameters $w$, and $P(d^{(j)}_k \mid x_k, y; \theta^{(j)})$ is the Gamma distribution as
defined in Equation 5.1, with $\theta^{(j)}$ being the vector of parameters for the 8 Gamma
distributions for dataset $j$. In the absence of structure-probing data, the CONTRAfold-SE
model reduces to the CONTRAfold model.
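The factorization in Equation 5.2 can be sketched in log space as the structure log-probability plus a sum of per-base Gamma log-densities over all datasets. This is an illustrative sketch, not the CONTRAfold-SE implementation: the function names, the use of `None` for bases without data, and the convention of skipping (base, state) combinations that have no data model are assumptions.

```python
import math

def gamma_logpdf(d, alpha, beta):
    """Log-density of Gamma(shape=alpha, scale=beta) at d > 0."""
    return ((alpha - 1.0) * math.log(d) - d / beta
            - math.lgamma(alpha) - alpha * math.log(beta))

def data_loglik(seq, paired, datasets, thetas):
    """Sum of log P(d_k^(j) | x_k, y; theta^(j)) over datasets j and bases k.

    seq      : string over A, C, G, T
    paired   : list of bools; paired[k] is True if base k is paired in y
    datasets : list of S lists; datasets[j][k] is a signal, or None if missing
    thetas   : list of S dicts mapping (base, is_paired) -> (alpha, beta)
    """
    total = 0.0
    for signals, theta in zip(datasets, thetas):
        for base, is_paired, d in zip(seq, paired, signals):
            if d is None or (base, is_paired) not in theta:
                continue  # assumed: positions without data add no factor
            alpha, beta = theta[(base, is_paired)]
            total += gamma_logpdf(d, alpha, beta)
    return total

def joint_logprob(structure_logprob, seq, paired, datasets, thetas):
    """log P(y, d | x) = log P(y | x; w) + data log-likelihood."""
    return structure_logprob + data_loglik(seq, paired, datasets, thetas)
```

When `datasets` is empty the second term vanishes and the joint log-probability equals the structure log-probability, mirroring the reduction to the CONTRAfold model.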
Parameter Estimation Given a training set, we estimate parameters $w$ and $\theta$ by
maximizing the conditional log-likelihood. For a training set $D = D_S \cup D_P \cup D_{S+P}$
of sequences with: i) only known structures and no probing data ($D_S$), ii) only probing
data but unknown (missing) structure ($D_P$), and iii) both known structure and
probing data ($D_{S+P}$), we find $w, \theta$ that maximize the (regularized) conditional
log-likelihood
$$\sum_{(x,y) \in D_S} \log P(y \mid x; w) + \lambda \cdot \sum_{(x,d) \in D_P} \log \sum_{y} P(y, d \mid x; w, \theta) + \sum_{(x,y,d) \in D_{S+P}} \log P(y, d \mid x; w, \theta) + \log P(w) + \log P(\theta) \qquad (5.3)$$
The hyperparameter λ controls the weighting of data-only training instances as
compared to instances with known structure, thus mitigating the adverse effects of
noisy, partial data on model estimation; this strategy is common in the machine
learning literature [89]. λ is set by cross-validation (Appendix B). We used the L-
BFGS algorithm [74] to find a local maximum of the likelihood. The key technical
challenge is that the gradient computations for the second term in the sum (the
likelihood for sequences with unknown structures) require inference. Fortunately,
the log-linear form of the structure model allows the data model to be represented as
additional base-level features in the structure model. [36] independently presents a
similar observation in the context of thermodynamic models for RNA structure, which
have a similar log-linear form. We thus adapted the existing inference algorithms
in CONTRAfold to efficiently compute the required gradients (see Appendix B for
further details). While the log-likelihood for the CONTRAfold-SE model is non-
convex, in practice our gradient-based parameter estimation algorithm achieves stable
parameter estimates (Appendix B).
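To make the structure of the objective in Equation 5.3 concrete, the following toy sketch evaluates it when the candidate structures of each data-only sequence can be listed explicitly. This is only an assumption for illustration: in practice the sum over structures is exponentially large and is handled by the adapted CONTRAfold inference algorithms, and all function names here are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginal_data_logprob(candidates):
    """log sum_y P(y|x;w) P(d|x,y;theta) over an explicit list of structures.

    candidates : list of (log P(y|x;w), log P(d|x,y;theta)) pairs
    """
    return logsumexp([s + d for s, d in candidates])

def objective(structure_only, data_only, both, lam, log_prior=0.0):
    """Value of the regularized conditional log-likelihood.

    structure_only : list of log P(y|x;w), one per sequence in D_S
    data_only      : list of candidate lists, one per sequence in D_P
    both           : list of log P(y,d|x;w,theta), one per sequence in D_{S+P}
    lam            : weight on the data-only term
    log_prior      : log P(w) + log P(theta)
    """
    term_s = sum(structure_only)
    term_p = lam * sum(marginal_data_logprob(c) for c in data_only)
    term_sp = sum(both)
    return term_s + term_p + term_sp + log_prior
```

Setting `lam` below 1 down-weights sequences that have probing data but no known structure, which is the role of the hyperparameter λ.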
Predicting Secondary Structures CONTRAfold-SE has two options for gener-
ating predictions on query sequences: 1) prediction without probing data – ignoring
any probing data associated with the test example and returning the structure that
maximizes the expected accuracy based on the structure model, as in CONTRAfold,
and 2) prediction with probing data – re-estimating the model parameters based on
the single example of the query sequence with structure-probing data, initialized with
the parameters learned on the training set. In option 2, a data tuning parameter δ
controls a regularization term that shifts the model parameters either toward (small
δ) or away (large δ) from the data, and hence away or toward the learned parame-
ters on the training set. This allows us to control how much the algorithm relies on
(potentially) noisy data in the test sequence.
5.4.2 Dataset Setup
Training and Test Sets [103] show that the careful construction of training and
test sets is necessary for proper evaluation of statistical methods for structure pre-
diction. We follow their procedures to ensure that our training sets do not contain
significant similarity to our test sets. We briefly describe the construction procedure
(see Appendix B for more details).
Train-A has two components making up 238 sequences: 119 sequences with only
known secondary structure and 119 sequences with only structure-probing data (where
we chose 119 in order to obtain a 50%-50% split). We exclude any sequences that
share an RFAM match with the test sets or the yeast mRNA genes (so that these
can be included as sequences with structure-probing data for greater diversity in the
training set). This ensures that there is no family similarity between the training
and test sets. The data-only sequences are the shortest, most data-dense sequences,
namely those with the lowest data-sparsity score (see Appendix B). Train-A50% is
the same as Train-A. Train-A75% and Train-A100% are constructed similarly, but
contain either 75% or 100% sequences with structure-probing data, at the same total
number of sequences. Train-A75 contains the same sequences with known structure
as Train-A, but an additional set of data-only sequences at the same amount as in
Train-A75%; Train-A100 is constructed similarly.
Train-B contains the same 119 structure-only sequences as Train-A and the first
119 sequences with lowest data-sparsity scores, cycling through DMS-vitro, DMS-
vivo, and PARS data. To clarify, the sequences included are the same amongst Train-
B(PARS), Train-B(DMS-vitro), Train-B(DMS-vivo), etc., but the data that the model
is trained on is only PARS in the first case, only DMS-vitro in the second case,
only DMS-vivo in the third case, etc. Using the same set of sequences allows fair
comparison of these different data sources.
Test-SeqFold is the set of sequences in Table 1 of [92]. Test-Tornado-TestSetA and
Test-Tornado-TestSetB are TestSetA and TestSetB, respectively, from [103]. Test-
mRNA has the sequences presented in the conservation analysis of [105].
Running and Evaluating CONTRAfold-SE CONTRAfold training allows spec-
ification of several hyperparameters, set as described in Appendix B.
For evaluating performance, we define, in standard fashion, sensitivity, $\frac{TP}{TP+FN}$,
and positive predictive value (PPV), $\frac{TP}{TP+FP}$, where the number of true positives (TP)
is the number of correctly predicted basepairs, the number of false positives (FP) is
the number of basepairs predicted but not in the true structure, and the number of
false negatives (FN) is the number of true basepairs predicted to be unpaired. In
general, we desire both high sensitivity and high PPV. To combine sensitivity and
PPV into a single value, we use F-measure, $\frac{2 \times \mathrm{sensitivity} \times \mathrm{PPV}}{\mathrm{sensitivity} + \mathrm{PPV}}$. Accuracy is the number
of correctly predicted basepairs.
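The evaluation metrics can be sketched as follows for structures represented as sets of base pairs. The function name and the pair representation are assumptions, and this sketch simplifies by counting every true base pair absent from the prediction as a false negative.

```python
def basepair_metrics(predicted, true):
    """Sensitivity, PPV and F-measure for collections of (i, j) base pairs."""
    predicted = {tuple(sorted(p)) for p in predicted}
    true = {tuple(sorted(p)) for p in true}
    tp = len(predicted & true)   # correctly predicted base pairs
    fp = len(predicted - true)   # predicted but not in the true structure
    fn = len(true - predicted)   # true base pairs that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + ppv
    f_measure = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f_measure
```

For example, with three predicted pairs of which two are correct, and four true pairs, sensitivity is 2/4, PPV is 2/3 and the F-measure is 4/7.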
For each sequence in Table 5.2.1, we report, in standard cross-validation fashion
for CONTRAfold-SE and CONTRAfold, F-measure for the γ that gives the best
average F-measure over the remaining sequences (γ = 2 for most sequences). For
CONTRAfold-SE with data at prediction, we therefore fix γ = 2 and cross-validate
the data tuning parameter δ by reporting F-measure for the tuning parameter which
gives the best average F-measure over all remaining sequences.
For each sequence in the test sets in Tables 5.2.1 and 5.2.2, we calculate three
metrics over varying γ: AUC over the sensitivity and PPV, maximum accuracy, and
maximum F-measure. We then average across the sequences in the test set.
Structure-Probing Data DMS structure-probing data for yeast was obtained
from [105]. Since the assay described in [105] only modifies A and C bases, we
have no data model for G and T. Similar to [105], raw DMS counts were normalized
in windows of 250nt per gene. In each window, the A positions were normalized by
the median of the top 5% of A positions; the C positions were similarly normalized.
If the median was zero, we normalized by the mean. We ignored zero counts, as well
as G and T positions.
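A minimal sketch of this normalization is given below. Several details are assumptions not stated in the text: windows are taken as non-overlapping tiles, the "top 5%" is rounded to at least one position, the scale is computed over all A (or C) positions in the window including zeros, and the fallback mean is taken over those same positions. Function names are hypothetical.

```python
from statistics import mean, median

def normalize_window(counts, bases, target):
    """Normalize raw counts for one base type ("A" or "C") within one window.

    Returns {position_in_window: normalized value} for nonzero target positions.
    """
    idx = [k for k, b in enumerate(bases) if b == target]
    if not idx:
        return {}
    values = sorted((counts[k] for k in idx), reverse=True)
    n_top = max(1, round(0.05 * len(values)))   # top 5%, at least one position
    scale = median(values[:n_top])
    if scale == 0:
        scale = mean(values)                    # fallback when the median is zero
    if scale == 0:
        return {}                               # window has no signal for this base
    # Zero counts are ignored rather than reported as zero reactivity.
    return {k: counts[k] / scale for k in idx if counts[k] > 0}

def normalize_gene(counts, seq, window=250):
    """Apply the per-window normalization to A and C positions of a gene."""
    out = {}
    for start in range(0, len(seq), window):
        c = counts[start:start + window]
        b = seq[start:start + window]
        for target in ("A", "C"):
            for k, v in normalize_window(c, b, target).items():
                out[start + k] = v
    return out
```

G and T positions never appear in the output, matching the fact that the assay modifies only A and C bases.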
PARS structure-probing data for yeast was obtained from [62]. We ignored po-
sitions with a zero PARS score, and added 8 to all scores to make them positive.
Human PARS structure-probing data was obtained from the GM12878 strain mea-
sured in [130]. We again ignored positions with a zero PARS score, and added 15 to
all scores to make them positive.
RNA-Binding Protein Data For yeast RBP analysis, we identified bound and
unbound transcripts as in [92] and [54]. In the RIP-chip dataset of [54], we defined true
bound transcripts as those having FDR < 1%; for Ssd1, Khd1, Puf1-5 we used local
FDR < 1%; for She2 we used identified targets. The remaining transcripts identified
in the RIP-chip data are deemed to be false bound. Both true and false bound tar-
gets must contain at least one instance of the RBP motif. For all CONTRAfold-style
algorithms, we predict the structure over each yeast gene using parameters learned
on Train-B, and calculate the probability that a base is paired from the “posterior”
output mode of CONTRAfold (and CONTRAfold-SE) by summing the pairing prob-
abilities associated with each pairing partner for that base. The accessibility is 1
minus this pairing probability. For SeqFold pairing probability, we used 1 minus the
accessibility predicted by SeqFold.
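The accessibility computation can be sketched as below, assuming the base-pairing posteriors are available as a sparse mapping from index pairs to probabilities. This input format and the function name are hypothetical and do not reflect the actual output format of the CONTRAfold "posterior" mode.

```python
def accessibility(pair_probs, length):
    """Per-base accessibility from sparse base-pairing posteriors.

    pair_probs : dict {(i, j): p}, 0-based indices, p = P(base i pairs with base j)
    length     : sequence length
    """
    paired = [0.0] * length
    for (i, j), p in pair_probs.items():
        paired[i] += p   # each pair contributes to both of its partners
        paired[j] += p
    # Accessibility is one minus the total pairing probability of the base.
    return [1.0 - p for p in paired]
```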
For aggregating the accessibility over all motifs in a gene, we sum the individual
accessibility per position per motif per instance of motif on the gene, and then divide
by the gene length. As an alternative equivalent to [92], we compute the same score
but do not normalize by the gene length.
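The two aggregate scores can be sketched as follows. Exact string matching of motifs is an assumption made for brevity (motifs with degenerate positions would need pattern matching), and the function names are hypothetical.

```python
def find_motif(seq, motif):
    """Start indices of all (possibly overlapping) exact motif matches."""
    return [s for s in range(len(seq) - len(motif) + 1)
            if seq[s:s + len(motif)] == motif]

def motif_accessibility_score(seq, access, motifs, normalize=True):
    """Sum accessibility over every position of every motif instance.

    seq       : gene sequence
    access    : per-base accessibility, same length as seq
    motifs    : list of motif strings
    normalize : divide by gene length if True, else return the raw sum
    """
    total = 0.0
    for motif in motifs:
        for start in find_motif(seq, motif):
            total += sum(access[start:start + len(motif)])
    return total / len(seq) if normalize else total
```

Calling the function with `normalize=False` gives the variant that is not divided by the gene length.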
For the classification task, we choose the 14 highlighted RBPs (or, more precisely,
RBPs with specific motifs) from Table S4 of [54] and filter out the ones with fewer than
200 instances of the motif per gene, leaving 10 RBPs.
For the human RBP analysis, we use a procedure similar to the evaluation in
[41] to identify bound and unbound transcripts. We obtain CLIP-seq data from the
doRiNA database [4] on the following human RBPs and their associated sequence
motifs: Pum2 (UGUANAUA), SRSF1 (GAAGAA), FXR1 (ACUK, WGGA), FXR2
(ACUK, WGGA), FMR1 7 (ACUK, WGGA) and FMR1 1 (ACUK, WGGA), where
we excluded QKI since there were fewer than 10 sequences in the true bound set
(described below). We use the RefSeq genes on assembly hg19 to identify transcripts
with at least one motif instance. From these, true bound motifs are those with a CLIP-
seq peak that starts within the motif, and false bound motifs are those such that there
is no CLIP-seq peak within 1000bp of the start of the motif. Since there are typically
many more false bound motifs than true bound motifs, we randomly sample the same
number of false bound motifs as true bound motifs. For each instance in the true and
false bound sets, we predict the structure of the 200bp window centered around the
start of the motif instance using parameters learned over the following training set:
the first 119 (structure-only) sequences in S151 [33] and the 119 sequences with lowest
data-sparsity score for the PARS data for the GM12878 strain measured in [130] over
sequences from UCSC RefSeq and Gencode v12 (hg19 assembly). The cross-validated
λ here was 1. We plot the structure profile on 20bp upstream and downstream
of the motif start. For one motif, FXR2(WGGA), we additionally check that two
modifications do not significantly affect the structure profile or the qualitative results:
we predict the structure of the 500bp window around the motif to show that the
window size does not matter, and we include all motifs in the false bound set to show
that subsampling does not matter.
Oxidative Stress Analysis To find the correlation between accessibility and trans-
lation efficiency around the AUG, as in [96], we first calculate the accessibility at each
position (namely, the pairing probability predicted by CONTRAfold-SE) and average
in sliding windows of 40nt. Windows are normalized by the mean over windows on
that gene, and correlated with the translation efficiency from [96] of all genes that
cover that position.
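The windowing step can be sketched as below. The step size of one nucleotide between windows and the handling of genes shorter than one window are assumptions; the function name is hypothetical.

```python
def windowed_accessibility(access, window=40):
    """Mean accessibility in each full sliding window, scaled by the gene mean.

    access : per-base accessibility for one gene
    window : window length in nucleotides
    """
    if len(access) < window:
        return []
    running = sum(access[:window])
    means = [running / window]
    for k in range(window, len(access)):
        running += access[k] - access[k - window]   # slide by one position
        means.append(running / window)
    gene_mean = sum(means) / len(means)
    # Normalize each window by the mean over all windows on this gene.
    return [m / gene_mean for m in means] if gene_mean else means
```

The normalized value at a given window position can then be correlated across genes with their translation efficiencies.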
For oxidative stress data, we used the average ribosome footprint levels (FP) and
mRNA levels from the ribosome profiling dataset from [44] to calculate translation
efficiency as FP / mRNA. Transcripts were restricted to those with FP ≥ 10 RPKM
for all three time points.
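A minimal sketch of the translation efficiency calculation and the transcript filter follows. What is averaged (for instance, replicate measurements) is not specified here, so the functions simply average whatever values are supplied; the names are hypothetical.

```python
def translation_efficiency(fp_values, mrna_values):
    """Ratio of average footprint level to average mRNA level (both in RPKM).

    Returns None when the average mRNA level is zero.
    """
    mean_fp = sum(fp_values) / len(fp_values)
    mean_mrna = sum(mrna_values) / len(mrna_values)
    return mean_fp / mean_mrna if mean_mrna > 0 else None

def passes_filter(fp_per_timepoint, min_fp=10.0):
    """Keep a transcript only if FP >= min_fp RPKM at every time point."""
    return all(fp >= min_fp for fp in fp_per_timepoint)
```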
Software Availability and References CONTRAfold-SE is freely available for
non-commercial use and can be downloaded at http://www.cs.stanford.edu/~cpop/
contrafoldse.html.
5.5 Conclusions
In this work, we explored the benefits of a fully probabilistic method for RNA
secondary structure prediction that incorporates high-throughput structure-probing
data. CONTRAfold-SE outperforms existing methods in terms of prediction accu-
racy, and also has several other features which make it a valuable tool for the analysis
of structure probing data.
CONTRAfold-SE’s ability to combine multiple probing datasets allows us to de-
rive insights about which types of data should be combined for optimal performance.
Another distinguishing feature of our computational method is that it augments our
knowledge about the structure of a sequence, allowing analyses that would not be
possible using the raw data alone where information is noisy or missing. For exam-
ple, we notably showed that we can obtain a better correlation between structure
and translation efficiency at the start codon when using CONTRAfold-SE to fill in
structure predictions for the bases that do not have sufficient or reliable data. At
the same time, as the quality and coverage of structure-probing data increase, our
experiments with different training set compositions show the exciting potential of a
statistical method like CONTRAfold-SE to yield further improvements in prediction
performance. Finally, CONTRAfold-SE is an invaluable tool in downstream applications
where structure informs function. We predicted nucleotide-level structural
contexts that define binding sites for RNA-binding proteins (RBPs), classified bound
versus unbound genes, and showed that RBP-bound sites are more accessible due to
active unfolding.
Chapter 6
Conclusions
6.1 Contributions
In this thesis, we presented two sets of high-throughput, genome-wide assays repre-
senting two biological processes defining translation and the factors involved in its
regulation. We built probabilistic models specific to each dataset in order to extract
useful information about biologically meaningful variables of interest from noisy and
sparse data – a process that would have otherwise relied on ad-hoc tuning parameters,
would have been less extensible and modular, and would have excluded a lot of sparse
data (positions or even whole genes).
We first described the concept of ribosome profiling. Ribosome profiling data re-
ports on the distribution of translating ribosomes, at steady-state, with codon-level
resolution. We presented a robust method to extract codon translation rates and
protein synthesis rates from these data, and identify causal features associated with
elongation and translation efficiency in physiological conditions in yeast. We showed
that neither elongation rate nor translational efficiency is improved by experimental
manipulation of the abundance or body sequence of the rare AGG tRNA. Deletion
of three of the four copies of the heavily used ACA tRNA showed a modest efficiency
decrease that could be explained by other rate-reducing signals at gene start. This
suggests that correlation between codon bias and efficiency arises as selection for
codons to utilize translation machinery efficiently in highly translated genes. We also
showed a correlation between efficiency and RNA structure calculated both computa-
tionally and from recent structure probing data, as well as the Kozak initiation motif,
which may comprise a mechanism to regulate initiation.
Second, we explored ribosome pausing in a higher-order organism. By comparing
ribosome fragment counts for different alleles at the same SNP location, we found
the potential for a genetic basis for ribosome pausing. Other measures of ribosome
pausing – dwell times and slow outlier strength estimated from our translation model
– also indicate that variation between individuals could be driven by elongation-level
changes. In conjunction with other, albeit noisy, measures of biological features that
could play a role during elongation, such as RNA secondary structure, this work
points to an exciting direction for understanding the mechanism behind phenotypic
changes.
The strong correlation we observed between translation efficiency and RNA sec-
ondary structure motivated our last section: determining secondary structure with
higher accuracy. We presented CONTRAfold-SE, a probabilistic method for RNA
secondary structure prediction that incorporates structure-probing data by build-
ing on the CONTRAfold structure model and representing structure-probing data
as observations of the underlying (possibly unknown) structure. Our probabilistic
framework allows us to use any of the growing number of structure-probing datasets
that provide per-base measurements, and combine them together in a principled way.
Evaluated on benchmark datasets, CONTRAfold-SE outperforms competing meth-
ods even when our method does not have structure-probing data available for test
sequences. Importantly, we showed that using CONTRAfold-SE reveals a stronger
correlation between structure at the 5’ end and translation efficiency, and extended
the analysis to ribosome profiling datasets in stress conditions, where structure is also
important.
In summary, we presented a tool for analyzing ribosome profiling datasets, pre-
sented a tool for using structure-probing data in highly accurate structure prediction,
and demonstrated how these probabilistic methods can uncover important and novel
biological results.
6.2 Going Forward
Our models for translation and RNA secondary structure are important for under-
standing the latest experimental assays and integrating them with leading proba-
bilistic methods. Our translation model parallels RNA-seq analysis, dealing with the
intricacies of variable ribosome pausing and inter-gene differences. Our secondary
structure model can be trained on different datasets depending on the application
and can be used with high accuracy in both prediction tasks and application tasks
where structure is an input. With these models as a foundation, going forward we
can explore several interesting technical and biological directions.
In terms of our model for translation, a probabilistic framework for ribosome profil-
ing data alleviates many concerns of previous methods. The technique we present here
can therefore be used as a standard method for easily comparing different datasets.
Our analysis of both yeast and human ribosome profiling data can be extended to
other biological features that could be related to translation efficiency and ribosome
pausing – signals that could even be included as features in our model for a more
cohesive analysis pipeline. It would be interesting to use this integrated approach to
understand the relative importance of these features between organisms. The relation-
ship between biological features must also be better understood, perhaps by looking
at correlations between our estimated parameters and feature interaction terms.
The translation model itself could be extended in several ways. Incorporating
other biological processes, such as ribosome drop-off, is certainly possible as another
rate in the model. Since this might create unidentifiability in the parameters, it
would be useful to incorporate this either as a global prior or as a feature derived
from experimental measurements when these become available.
The location of the active codon within a ribosome fragment is traditionally de-
termined by looking for enrichment of the AUG codon in fragments at the 5’ end.
While this is usually sufficient for a reasonable sequencing depth, we can improve
this estimation by including it into our model – for example, by using or learning
a weighted average of the ribosome fragment counts potentially contributing to that
active location. Learning this weight would require careful construction of the model
parameters, and would benefit from additional experimental information as a ground-
truth training set.
One of the other major concerns we saw was in dealing with alternative splicing.
This situation is similar to sequencing in multiploid genomes. There, sequence frag-
ments can map to regions that look similar but originate from distinct copies. The
goal is to identify the correct distribution. We can potentially draw upon ideas from
this adjacent area or from effects like frameshifting that impact ribosome distribution
[84]. For example, we can consider an EM-style algorithm that alternates between
inferring the correct fragment attribution and learning the remaining parameters,
perhaps using non-ambiguous fragment information as training data.
In terms of our structure model, by jointly modelling structure-probing data and
RNA secondary structure in a probabilistic framework, CONTRAfold-SE is a first
step towards describing the sequence and structure biases of various probing reagents.
Integrating multiple genome-wide structure-probing datasets with CONTRAfold-SE
allows for cross-correction of errors, and reveals principles on how datasets should be
combined. With the growing number of structure-probing datasets, we can exploit
the flexibility and modularity of this probabilistic model to include information about
biases in different contexts (e.g. ability to capture dynamics at ends of stems com-
pared to the middle of stems). Because CONTRAfold-SE can learn from a training
set where full structures are not available, we can also apply it to learn class-specific
structures where experimental methods are lacking. Finally, CONTRAfold-SE would
be useful in various applications that depend on function. It would be an excellent
tool for uncovering structural preferences in vivo, where structure-probing data alone
is sparse or noisy, but where understanding the underlying mechanism is essential
for physiological models of the cell. This tool could also be used in genome-wide
experiments where full experimental data is not available, but where we want to ex-
plore structural changes due to genetic variation, potentially identifying a mechanism
associated with an observed phenotype.
We still have much more to explore in the landscape of translation regulation. We
showed an initial set of results in human, a much more complex setting for under-
standing the interaction between RNA sequence and protein sequence. The multitude
of ribosome profiling datasets afford us a scaffold for comparison between species for
understanding mechanism differences, comparison between conditions for understand-
ing synthetic and physiological conditions, and comparison between individuals for
genetic variation analysis. In conjunction with other datasets that inform associated
mechanisms, such as genome-wide RNA structure datasets, we can make even more
informed connections between the causes and effects of translation in vivo.
Appendix A
Ribosome Profiling
A.1 Supplementary Methods
Feature Calculations for Outlier Analysis
Computationally predicted mRNA secondary structures and associated energies were
computed using Unafold v3.6 [80] with the default settings. In the outlier analy-
sis (feature “energy-down”), we ignored downstream regions with energies of 0 and
above. Structural features (e.g. stems) were counted based on the structure of the
whole mRNA strand, including characterized UTRs [87]. Genes without a charac-
terized UTR were ignored for all energy-related features. For experimentally derived
structure from the PARS method [62], we used the PARS score; genes without a
PARS score were ignored.
Protein domain boundaries were based on Pfam-A domains from Pfam [38]. Wob-
ble codons were set to be those with mismatches to the anticodon and those with an
“I” base in the tRNA that can recognize either a C or a U.
For RNA binding protein enrichment features in the outlier analysis, we computed
the Kullback-Leibler (KL) divergence between each of the 60 motifs from Table S4 in
[54] and positions along each coding sequence. We then calculated the mean/mini-
mum KL divergence in 3-codon windows 5 codons downstream of the active site and
took the mean/minimum score over all motifs.
Feature Calculations for Translation Efficiency
Evolutionary rate is adjusted dN/dS from [128]. The Kozak site motif is from [49];
we ignored this in genes without characterized UTRs. Energies are calculated as
described in the outlier analysis section above. Energies near the start codon are
those with the most significant Spearman correlation (as calculated by looking at
global maximums in spans of 20nt and taking the first such maximum). These energies
are corrected for multiple hypothesis testing as described in the sliding window energy
analysis. The tAI per gene or per window is the weighted average of all codons in
that range, excluding stop codons.
The RNA binding protein enrichment features are the scores reported from the
Significance Analysis of Microarrays algorithm in Dataset S3 of [54]. We selected
the fifteen RBPs with the largest number of RNA targets from Table S2. Suggested
“true” correlations between RNA binding protein enrichment and translation
efficiency are drawn from ribosome occupancy correlations using polysome profiling
(Table S3 in [54]), where possible. In other cases, we use additional literature: Puf4 is
most commonly studied in mRNA stability and localization and is also likely a player
in translation regulation [45]. As noted in the main text, Scp160 has an additional
contradictory source indicating a positive role in translational efficiency [52]. Ypl184c
was proposed to repress translation due to its association with Pab1 and mRNAs
under translational control [54]. The proteins Cbc2, Gbp2, Nab3, and Nop56 do not
seem to have documented direct associations with translation.
A.2 Supplementary Figures and Tables
[Figure A.1 image: four scatter plots of model flow and of baseline average counts against measured protein abundance. Against Newman et al.: model flow, Pearson r = 0.7885; baseline average counts, Pearson r = 0.7755. Against de Godoy et al.: model flow, Pearson r = 0.6802; baseline average counts, Pearson r = 0.6704.]
Figure A.1: Correlation between experimental measures of protein abundance, and estimated flow and average footprint count (baseline).
[Figure A.2 image: panels (A) and (B) show blots for the WT, QC, and OE strains alongside deacylated controls (Deacyl), probed for tL(CAA), tT(UGU), and tR(CCU); lanes are annotated with % charged and μg of RNA.]
Figure A.2: Overexpression of tRNAArg(CCU) does not significantly alter amino acid charging levels.
Bulk RNAs from strains as indicated were resolved at pH 5 by PAGE, transferred, and hybridized with oligonucleotide probes specific for tRNA species as indicated, and relative tRNAArg(CCU) levels and charging levels were evaluated as described in Materials and Methods. Solid arrows show deacylated tRNAs; dashed arrows show charged tRNAs; % charged refers to tRNAArg(CCU).
[Figure A.3 image: three panels showing, for every codon, the ratios rate AGG-OE / rate wt, rate AGG-QC / rate wt, and rate ACA-K / rate wt; the ratio axes span roughly 0.85 to 1.35.]
Figure A.3: The ratio between estimated mutant and wild-type rates.
The mean (solid black line) and standard deviation (dashed line) are shown. ACA-K has a larger spread, but the manipulated codon (shown in red) is not an outlier in any sample. Codons are grouped and sorted by amino acid.
[Figure A.4 image: six panels (mutants ACA-K, AGG-OE, and AGG-QC; raw and dwell-corrected counts) showing the normalized footprint ratio for mut/wt averaged over occurrences 1 to 5 of each codon, for a subset of codons. Legend: window before, window after, window around; high tAI, low tAI, mid tAI, of interest.]
Figure A.4: The ratio of mutant to wild-type footprint count per codon.
Counts are averaged over the first 5 occurrences of the codon per gene over all genes and presented for the three mutant samples. Counts are normalized by the average in the 15-codon window before (red line), after (green line), or around (blue line) the codon. We show a subset of the codons: the 5 with lowest tAI (dots), the 5 with highest tAI (squares), and the 6 with middle tAI (stars), in addition to the two codons ACA and AGG (diamonds). In each case, if the manipulated codon of interest induces a change in speed under the common hypothesis (lower for ACA-K and higher for AGG-OE and AGG-QC), we expect a corresponding peak or valley, respectively, in the presented ratio. However, the ratios at ACA and AGG are not significantly higher than 1 standard deviation (dotted line) or than the other representative codons. Left: Counts are raw footprint counts. Right: Counts are dwell-corrected footprint counts.
[Figure A.5 image: top row, scatter plots of log(PA-wt) against log(PA-ACA-K) (62 increased, 138 reduced, r = 0.99), log(PA-AGG-OE) (88 increased, 112 reduced, r = 0.99), and log(PA-AGG-QC) (95 increased, 105 reduced, r = 0.99). Bottom row, for each mutant (ACA-K, AGG-OE, AGG-QC), the correlation between log(PA-mut/PA-wt) and % codon per gene, shown per codon.]
Figure A.5: The analysis of Figure A.2 repeated on flow instead of TE.
As before, wild-type and mutant flows generally agree. Correlations between the ratio of mutant flow to wild-type flow and the percent of codon per gene are not higher for the manipulated codons compared to other codons, despite the dramatic change in tRNA abundance.
[Figure A.6 image: paired histograms for reduced TE genes and increased TE genes over three features. Position per length from 5’ end of slow outliers: p = 3.24e−07, means 0.4052 and 0.4517. Strength of slow outliers in first 100 codons: p = 1.73e−01, means 2.6864 and 2.9256. Position per length from 5’ end of ACA slow outliers: p = 7.59e−01, means 0.4214 and 0.4118.]
Figure A.6: Distribution of three features among reduced TE genes and increased TE genes in ACA-K.
Distributions are skewed for reduced TE genes (with lower TE in mutant compared to wild-type) toward initiation signals that could confound the TE decrease. Slower-than-expected codons with an excess number of ribosome counts are defined formally as “outliers” (see Materials and Methods). Each feature distribution is calculated over all positions in the genes in the specified gene set (either reduced TE genes or increased TE genes) satisfying the specified criteria (a position that is a slow outlier, a position that is a slow outlier in the first 100 codons, or a position that is a slow outlier and an ACA codon). The feature distributions for reduced TE versus increased TE genes are distinct (p-values shown are calculated to be significant under a Kolmogorov-Smirnov test). Outlier positions are calculated in the ACA-K mutant.
[Figure A.7 image: two bar charts of Spearman correlation to log(TE). Cis-features: tAI (coding sequence); average elongation rate; energy exp vivo (5’ UTR, 3’ UTR, mRNA sequence, win 11 to 50); energy exp vitro (5’ UTR, 3’ UTR, mRNA sequence, win −11 to 28); energy (5’ UTR, 3’ UTR, mRNA sequence, win −16 to 23); KL divergence to Kozak (overall and at positions −6 to −1 and 3 to 5); length (coding sequence); mRNA abundance; evolutionary rate. Legend: significant, not significant. RNA binding protein enrichment: Khd1, Scp160, Bfr1, Nab2, Pab1, Pub1, Puf4, Cbc2, Gbp2, Nab3, Nop56, Npl3, Nrd1, Nsr1, Ypl184c. Legend: expected; not expected; expected, not sig; not expected, not sig; unknown.]
Figure A.7: Correlation between log(TE) and gene-level features.
Cis-features and RNA binding protein enrichment are described in Materials and Methods. Significance threshold is p = 0.05. (See Appendix A for how expected correlations for the RNA binding proteins were determined.)
[Figure A.8 image: dwell-corrected, flow-normalized counts averaged per position across genes (scale ×10^−4) against position [codon] from 0 to 1000. Legend: all positions, no slow outliers; all positions; uniform.]
Figure A.8: Dwell-corrected footprint counts normalized by flow.
Counts are geometrically averaged per position over all genes aligned by start codon (ignoring 0 footprint counts). Removing slow outliers (red curve) reduces the peak in density at ≈44 codons (132 nt).
[Figure A.9 image: “Rates and tAI in 17-codon windows”; codon translation rate and tAI plotted against position [codon] from 0 to 400.]
Figure A.9: Codon translation rates versus tAI.
The tAI in sliding windows of 17 codons is averaged across all the genes aligned by start codon (red curve). The same analysis with our estimated codon translation rates (scaled up by 1000) (black curve) shows that rates at the 5’ end are not lower compared to the rest of the gene.
[Figure A.10 image: two histograms against position [nt], one of the number of non-outliers (scale ×10^5) and one of the number of slow outliers (scale ×10^4).]
Figure A.10: Histograms of positions of slow outliers and non-outliers are similar.
Figure A.11: Two different initializations of the parameters for the translation model.
Estimated parameters are nearly identical, demonstrating that the model is robust to initialization.
tRNA (anticodon)   RPM (ACA-K)   RPM (wt)   RPM (ACA-K) / RPM (wt)
tK(UUU)D                    78         80   0.98
tY(GUA)F1                   19         16   1.19
tM(CAU)C                    11         11   1.00
tD(GUC)B                   218        251   0.87
tE(UUC)B                   582        428   1.36
tN(GUU)C                   225        166   1.36
tS(UGA)P                   122        148   0.82
tP(AGG)N                    24         27   0.89
tC(GCA)B                    82         58   1.41
tQ(UUG)B                   103        106   0.97
tW(CCA)G1                   35         44   0.80
tG(UCC)O                   143         96   1.49
tT(UGU)G1                   25         75   0.33
tR(UCU)E                   138        172   0.80
tA(AGC)D                    72         42   1.71
tT(CGU)K                     9          9   1.00
tV(AAC)E1                  129         82   1.57
tQ(CUG)M                   166        138   1.20
tA(UGC)Q                     3          3   1.00
tL(UAA)J                    72         81   0.89
tI(AAU)B                    98         46   2.13
tH(GUG)E1                  328        266   1.23
tT(AGU)B                   152        141   1.08
tF(GAA)B                   124        115   1.08
tK(CUU)C                  1328       1914   0.69

Table A.1: Counts of tRNA in RPM (number of reads per million) in ACA-K and wild-type.
The threonine tRNA recognizing the ACA codon (highlighted) is reduced to 1/3 of the wild-type level.
Category          Features
Position          Distance from 5’ end (pos)
                  Distance from 5’ end per length (pos-per-len)
                  Distance from 3’ end (pos-from-end)
Structure         Minimum free energy (energy-down)
                  In vitro energy (vitroDMS-energy-down) [105]
                  In vivo energy (vivoDMS-energy-down) [105]
                  In vitro inverse-energy (PARS-invenergy-down) [62]
                  Number of hairpins (hairpins-down)
                  Number of internal loops (internal-down)
                  Number of multi-loops (multi-down)
                  Number of stems (stems-down15)
                  Number of GC pairs in stems (stemsGC-down15)
                  Number of stems 12nt downstream (stems-down12)
                  Number of stems 9nt downstream (stems-down9)
Protein folding   Active site is inside a protein domain (is-in-domain)
                  Domain ends 30 codons upstream (is-end-domain-up-30)
Wobble bases      Is wobble base at P-site (is-wobble)
tRNAs Reuse       Distance from same codon upstream (dist-prev-codon)
                  Distance from upstream iso-accepting tRNA (dist-prev-trna)
                  Is codon in window upstream (is-prev-codon-close)
                  Is iso-accepting tRNA in window upstream (is-prev-trna-close)
RBPs              KL divergence combined via mean (rbp-mean)
                  KL divergence combined via min (rbp-min)
Peptide           Charge of active codon (charge)
                  Mean charge in window upstream (cluster-charge-up-1)
                  Arg/Lys fraction in window upstream (cluster-ArgLys-up-1)
                  Pro fraction in P, E sites (pair-Pro-up)
                  Pro fraction downstream (pair-Pro-down)
Global            Length (len)
                  Abundance (abund)

Table A.2: Eight categories of potential correlates to outlier strength.
Distances are relative to active codon; upstream windows are 10 codons long. Structure is calculated in 25nt windows 15nt downstream and, unless indicated, derived computationally. RBP (RNA binding protein) motifs [54] are aggregated by KL-divergence in 3-codon windows 5 codons downstream.
Feature                 r-value   p-value   Mean (Slow)   Std (Slow)   Mean (Non)   Std (Non)
pos                      -0.046     0          355.02       371.58       415.29      406.38
pos-per-len              -0.148     0            0.47         0.29         0.52        0.28
pos-from-end              0.126     0          396.23       398.43       381.56      388.88
energy-down               0         0.6         -2.65         1.75        -2.62        1.72
vitroDMS-energy-down     -0.013     0            0.49         0.15         0.48        0.15
vivoDMS-energy-down      -0.028     0            0.51         0.17         0.50        0.17
PARS-invenergy-down      -0.017     0            0.32         0.55         0.31        0.54
hairpins-down             0.024     0            5.92         5.00         5.79        4.98
internal-down             0.017     0            1.13         1.33         1.10        1.32
multi-down                0.023     0            0.18         0.43         0.17        0.42
stems-down15              0.024     0            5.92         5.00         5.79        4.98
stemsGC-down15            0.021     0            2.33         2.28         2.26        2.26
stems-down12              0.025     0            5.94         5.00         5.78        4.98
stems-down9               0.027     0            5.97         5.01         5.77        4.97
is-in-domain             -0.022     0            0.72         0.44         0.73        0.44
is-end-domain-up-30      -0.005     0            0.004        0.06         0.004       0.06
is-wobble                -0.032     0            0.42         0.49         0.46        0.49
dist-prev-codon          -0.031     0           43.48        57.95        46.96       63.11
dist-prev-trna           -0.024     0           35.58        47.50        37.72       50.44
is-prev-codon-close       0.012     0            0.26         0.44         0.25        0.43
is-prev-trna-close        0.007     0            0.30         0.45         0.29        0.45
rbp-mean                 -0.001     0.3         11.62         0.69        11.60        0.69
rbp-min                  -0.001     0.2          2.57         1.11         2.56        1.11
charge                   -0.006     0            0.009        0.52         0.02        0.50
cluster-charge-up-1       0.017     0            0.01         0.18         0.01        0.17
cluster-ArgLys-up-1       0.024     0            0.12         0.11         0.11        0.10
pair-Pro-up               0.092     0            0.05         0.15         0.04        0.13
pair-Pro-down            -0.01      0            0.04         0.14         0.04        0.14
len                       0.061     0          750.25       541.11       795.85      572.19
abund                     0.016     0           13.41        67.88         9.54       51.27

Table A.3: Spearman correlation between outlier strength and features, separated by type and highlighted if significant.
Outliers (slow and non) are calculated for a threshold of 0. See Appendix A for more discussion.
                 Regression          Regression - Kozak    Null Model
                 Mean      Std       Mean      Std         Mean      Std
Error            0.7549    0.0508    0.8443    0.0486      0.9674    0.0581
Error (Train)    0.7499    0.0057    0.8438    0.0051      0.9569    0.0066
Spearman r       0.6614    0.0278    0.5161    0.0382      0.0385    0.0491
Spearman p       0.0000    0.0000    0.0000    0.0000      0.4325    0.3022
Pearson r        0.6224    0.0329    0.5094    0.0381      0.0307    0.0483
Pearson p        0.0000    0.0000    0.0000    0.0000      0.4587    0.3028

Table A.4: Performance of TE regression model.
Error (should be low) and correlation (should be high) between predicted and actual TE is measured on 100 random test sets of genes not used during model training. Performance drops in a null model learned on randomized TE labels (last column). Performance also drops when using the original Kozak motif (middle column). Error on the training set is included to show that our model generalizes to genes not used in training (it is close to test set error). See Materials and Methods for further details.
Result              c=1       c=10      c=1000    c=10000   c=100000   No µcm
µc (c=100)     r    1.000     1.000     1.000     1.000     1.000      1.000
               p    10^−202   10^−206   10^−150   10^−105   10^−95     10^−96
µcm (c=100)    r    1.000     1.000     1.000     0.983     0.838      NA
               p    0         0         0         0         0          NA
Jm (c=100)     r    1.000     1.000     1.000     1.000     0.999      0.994
               p    0         0         0         0         0          0
tAI            r    0.210     0.210     0.210     0.213     0.217      0.211
               p    0.104     0.104     0.104     0.100     0.094      0.103
tRNA (Cy5)     r    0.144     0.144     0.140     0.140     0.140      0.133
               p    0.380     0.4380    0.393     0.393     0.393      0.420
tRNA (Cy3)     r    0.144     0.144     0.140     0.140     0.140      0.133
               p    0.417     0.417     0.429     0.429     0.429      0.456
PA [88]        r    0.7885    0.7885    0.7886    0.7889    0.7882     0.7782
PA [26]        r    0.6802    0.6802    0.6802    0.6802    0.6786     0.6710

Table A.5: Summary of main results for model variations.
The first five columns are models with different constants for the second term in the objective function and the last column is a model without µcm parameters (see Materials and Methods). Rows 1-3 represent correlation between our parameters in our model and in the model variation. Rows 4-6 represent correlation between codon translation rates in model variations and codon bias measures. Rows 7-8 represent correlation between protein synthesis rates in model variation and protein abundance measures. Results are similar to the ones reported for the model used throughout the paper (const c = 100).
Appendix B
RNA Secondary Structure
B.1 Supplementary Methods
B.1.1 Model Specification
Let x be an RNA sequence of length $L_x$ with structure y. Let $S_x$ be the set of indices
of available structure-probing datasets for sequence x, so that $S_x \subseteq \{1, \dots, S\}$, where
S is the total number of structure-probing datasets. We denote the collection of
probing signals as d, where $d_k^{(j)}$ is the probing signal in the jth data source at base k
in the sequence. CONTRAfold-SE models the conditional joint probability of the
structure and probing data given sequence as

$$P(y, d \mid x; w, \theta) = P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \tag{B.1}$$

In this equation,
• $P(y \mid x; w)$ is given by the conditional log-linear model of CONTRAfold with
parameters w,
• $P(d_k^{(j)} \mid x_k, y; \theta^{(j)})$ is the Gamma distribution for the probing data for dataset j,
• $\theta^{(j)} = \{\alpha_{b,p}^{(j)}, \beta_{b,p}^{(j)} \mid b \in \{A, C, T, G\},\ p \in \{\text{paired}, \text{unpaired}\}\}$ is the set of Gamma
parameters for dataset j,
• $\theta = \cup_{j=1}^{S} \theta^{(j)}$ is the set of Gamma parameters over all datasets.
In the absence of structure-probing data, the CONTRAfold-SE model reduces to the
CONTRAfold model.
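As an illustration of the data term in Equation B.1, the following is a minimal sketch of how the probing-data log-likelihood could be accumulated for a fixed structure. It assumes a shape/scale Gamma parameterization and strictly positive signals; the function names and the dictionary layout of `theta` and `signals` are hypothetical and are not taken from the CONTRAfold-SE code.

```python
import math

def gamma_logpdf(d, alpha, beta):
    """Log-density of a Gamma(shape=alpha, scale=beta) distribution at d > 0."""
    return ((alpha - 1.0) * math.log(d) - math.lgamma(alpha)
            - alpha * math.log(beta) - d / beta)

def probing_loglik(seq, paired, signals, theta):
    """Sum of log P(d_k^(j) | x_k, y; theta^(j)) over datasets j and bases k.

    seq     : string over A, C, G, T (the sequence x)
    paired  : list of bool, True if base k is paired in structure y
    signals : dict mapping dataset index j -> list of positive signals d^(j)
              (only datasets available for this sequence, i.e. j in S_x)
    theta   : dict mapping j -> {(base, state): (alpha, beta)}  (hypothetical layout)
    """
    total = 0.0
    for j, d in signals.items():
        for k, base in enumerate(seq):
            state = "paired" if paired[k] else "unpaired"
            alpha, beta = theta[j][(base, state)]
            total += gamma_logpdf(d[k], alpha, beta)
    return total
```

With an empty `signals` dictionary the function returns 0, so the joint log-probability is just the structure term, which mirrors the reduction to the CONTRAfold model noted above.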
Parameter Estimation
The parameters of the CONTRAfold-SE model, w and θ, are estimated by maximiz-
ing the conditional log-likelihood of the known structures and probing data, given
sequence. Formally, for a training set D = DS ∪ DP ∪ DS+P of sequences with: i)
only known structures and no probing data (DS), ii) only probing data but unknown
(missing) structure (DP), and iii) both known structure and probing data (DS+P),
we find w, θ that maximize the (regularized) conditional log-likelihood
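$$\ell(w, \theta; \mathcal{D}) = \sum_{(x,y) \in \mathcal{D}_S} \log P(y \mid x; w) + \lambda \cdot \sum_{(x,d) \in \mathcal{D}_P} \log \sum_{y} P(y, d \mid x; w, \theta) + \sum_{(x,y,d) \in \mathcal{D}_{S+P}} \log P(y, d \mid x; w, \theta)$$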
The hyperparameter λ is added as in [89] to temper the use of partial evidence
against ground truth. The main difficulty with solving this optimization problem is
that the likelihood for training instances with missing structures requires summing the
probability in equation B.1 over all possible structures. In addition, the parameters for
the Gamma distributions ($\alpha_{b,p}^{(j)}, \beta_{b,p}^{(j)}$) are constrained to be non-negative. We handle
this constraint by parameterizing these Gamma parameters in terms of unconstrained
variables $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$ such that $\alpha_{b,p}^{(j)} = \exp(\tilde{\alpha}_{b,p}^{(j)})$, $\beta_{b,p}^{(j)} = \exp(\tilde{\beta}_{b,p}^{(j)})$. We then solve the
optimization problem over these new variables.
Gradient Computation
We use the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm
for parameter estimation, and the key technical detail is how the gradient of the
conditional log-likelihood is computed. We will discuss the gradient computation for
the likelihood over the training examples in DS ,DS+P , and DP in turn; the complete
gradient is simply the sum of these three gradients.
Gradient for Examples in DS
The gradient for each term in the sum $\sum_{(x,y) \in \mathcal{D}_S} \log P(y \mid x; w)$ is simply the gradient,
for the particular training example, of the conditional log-likelihood of the original
CONTRAfold model. As

$$P(y \mid x; w) = \frac{\exp(w^T F(x, y))}{\sum_{y'} \exp(w^T F(x, y'))}$$

(the features are RNA structural motifs whose descriptions may be found in the
original CONTRAfold paper), this is given by

$$\begin{aligned}
\nabla_w \log P(y \mid x; w) &= \nabla_w \Big[ w^T F(x, y) - \log \sum_{y'} \exp(w^T F(x, y')) \Big] \\
&= F(x, y) - \sum_{y'} P(y' \mid x; w) F(x, y') \\
&= F(x, y) - E[F(x, y)]
\end{aligned}$$
A detailed description of how this gradient (in particular, the feature expectations
with respect to the model E[F (x, y)]) may be computed efficiently via dynamic
programming is found in the Supplementary Material of the original CONTRAfold
manuscript [33].
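The "observed features minus expected features" form of this gradient can be checked on a toy problem. The sketch below enumerates a small, explicitly listed set of candidate structures instead of using dynamic programming, so it is only an illustration of the identity, not of how CONTRAfold computes expectations; all names are hypothetical.

```python
import math

def _scores(w, feats):
    # w^T F(x, y') for every candidate structure y'
    return [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]

def loglik(w, feats, observed):
    """log P(y | x; w) for a log-linear model over an enumerated candidate set."""
    s = _scores(w, feats)
    m = max(s)
    log_z = m + math.log(sum(math.exp(v - m) for v in s))
    return s[observed] - log_z

def loglinear_grad(w, feats, observed):
    """Gradient F(x, y) - E[F(x, y')] with the expectation taken under P(. | x; w)."""
    s = _scores(w, feats)
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    probs = [v / z for v in e]
    expected = [sum(p * f[i] for p, f in zip(probs, feats)) for i in range(len(w))]
    return [feats[observed][i] - expected[i] for i in range(len(w))]

def numeric_grad(w, feats, observed, eps=1e-6):
    """Central finite differences of loglik, used only as a sanity check."""
    grad = []
    for i in range(len(w)):
        up = list(w); up[i] += eps
        dn = list(w); dn[i] -= eps
        grad.append((loglik(up, feats, observed) - loglik(dn, feats, observed)) / (2 * eps))
    return grad
```

For example, with three candidate structures with feature vectors [1, 0], [0, 1], [1, 1], zero weights, and the first structure observed, the gradient is [1/3, -2/3].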
Gradient for Examples in DS+P
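Substituting the expression for the CONTRAfold-SE model (Equation B.1), we see
that each term in the sum $\sum_{(x,y,d) \in \mathcal{D}_{S+P}} \log P(y, d \mid x; w, \theta)$ decomposes as:

$$\log P(y, d \mid x; w, \theta) = \log P(y \mid x; w) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)})$$

The first term is the original conditional log-likelihood of the CONTRAfold model.
The second term is the sum of log-likelihoods of the various probing data Gamma
distributions, for which gradients may be computed analytically by straightforward
differentiation. Let $\alpha = \alpha_{x_k, \text{paired}(k,y)}^{(j)}$, $\beta = \beta_{x_k, \text{paired}(k,y)}^{(j)}$, then

$$\begin{aligned}
\log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= (\alpha - 1) \log d_k^{(j)} - \log \Gamma(\alpha) - \alpha \log \beta - \frac{d_k^{(j)}}{\beta} \\
\frac{\partial}{\partial \alpha} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= \log d_k^{(j)} - \psi(\alpha) - \log \beta \\
\frac{\partial}{\partial \beta} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= -\frac{\alpha}{\beta} + \frac{d_k^{(j)}}{\beta^2}
\end{aligned}$$

We can then use the chain rule to compute the gradients with respect to the
unconstrained variables $\tilde{\alpha} \equiv \tilde{\alpha}_{x_k, \text{paired}(k,y)}^{(j)}$, $\tilde{\beta} \equiv \tilde{\beta}_{x_k, \text{paired}(k,y)}^{(j)}$. As $\frac{\partial \alpha}{\partial \tilde{\alpha}} = \alpha$, $\frac{\partial \beta}{\partial \tilde{\beta}} = \beta$,

$$\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= \big( \log d_k^{(j)} - \psi(\alpha) - \log \beta \big)\, \alpha \\
\frac{\partial}{\partial \tilde{\beta}} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= \Big( -\frac{\alpha}{\beta} + \frac{d_k^{(j)}}{\beta^2} \Big)\, \beta = -\alpha + \frac{d_k^{(j)}}{\beta}
\end{aligned}$$

Note that the gradients with respect to the other Gamma parameters (i.e., $\tilde{\alpha} \neq \tilde{\alpha}_{x_k, \text{paired}(k,y)}^{(j)}$, $\tilde{\beta} \neq \tilde{\beta}_{x_k, \text{paired}(k,y)}^{(j)}$) will be 0. Therefore, more generally, for any of the
unconstrained Gamma distribution parameters $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$, we have that

$$\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}_{b,p}^{(j)}} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p] \big( \log d_k^{(j)} - \psi(\alpha_{b,p}^{(j)}) - \log \beta_{b,p}^{(j)} \big)\, \alpha_{b,p}^{(j)} \\
\frac{\partial}{\partial \tilde{\beta}_{b,p}^{(j)}} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) &= I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p] \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big)
\end{aligned}$$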
Here, I[condition] is the indicator variable that is 1 when condition is true, and 0
otherwise.
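The chain-rule gradients with respect to the unconstrained (log-scale) Gamma parameters can be verified numerically. The sketch below is an illustration under stated assumptions: a shape/scale Gamma density, a single observation d > 0, and a small hand-written digamma approximation (recurrence plus asymptotic series) so that the example needs only the standard library. The function names are hypothetical.

```python
import math

def digamma(x):
    """Approximate digamma function psi(x) for x > 0 (illustrative accuracy)."""
    result = 0.0
    while x < 6.0:            # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)         # asymptotic series for large x
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def gamma_logpdf(d, alpha, beta):
    """Log-density of Gamma(shape=alpha, scale=beta) at d > 0."""
    return ((alpha - 1.0) * math.log(d) - math.lgamma(alpha)
            - alpha * math.log(beta) - d / beta)

def grad_unconstrained(d, a_tilde, b_tilde):
    """Gradient of the log-density w.r.t. a_tilde = log(alpha), b_tilde = log(beta)."""
    alpha, beta = math.exp(a_tilde), math.exp(b_tilde)
    g_a = (math.log(d) - digamma(alpha) - math.log(beta)) * alpha
    g_b = -alpha + d / beta
    return g_a, g_b

def numeric_grad(d, a_tilde, b_tilde, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    def f(a, b):
        return gamma_logpdf(d, math.exp(a), math.exp(b))
    g_a = (f(a_tilde + eps, b_tilde) - f(a_tilde - eps, b_tilde)) / (2 * eps)
    g_b = (f(a_tilde, b_tilde + eps) - f(a_tilde, b_tilde - eps)) / (2 * eps)
    return g_a, g_b
```

Because the optimizer works on `a_tilde` and `b_tilde`, any real value maps to a valid positive shape and scale, which is why no explicit constraint handling is needed.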
Gradient for Examples in DP
Consider a single term in the (outer) sum $\sum_{(x,d) \in \mathcal{D}_P} \log \sum_{y} P(y, d \mid x; w, \theta)$, which
corresponds to the log-likelihood for a single training example in $\mathcal{D}_P$. This is the
challenging case due to the sum over exponentially many possible structures y. Following
the argument in Theorem 19.6 in [64], or by directly differentiating this log-likelihood
term, we find that the gradient of the log-likelihood is equal to the gradient of the
expected log-likelihood, where the expectation is taken over the posterior distribution
over unknown structures y given the observed probing data d, $Q(y) = P(y \mid d, x; w, \theta)$.
Note that Q(y) is the posterior evaluated at the particular parameter values w, θ for
which we wish to compute a gradient, and therefore Q(y) is treated as a constant with
respect to w and θ when differentiating. More formally,

$$\nabla_{w,\theta} \log \sum_{y} P(y, d \mid x; w, \theta) = \nabla_{w,\theta}\, E_Q[\log P(y, d \mid x; w, \theta)]$$

Expanding the expected log-likelihood given by the model in Equation B.1,

$$\begin{aligned}
E_Q[\log P(y, d \mid x; w, \theta)] &= \sum_{y} Q(y) \log P(y, d \mid x; w, \theta) \\
&= \sum_{y} Q(y) \log \Big[ P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \Big] \\
&= \sum_{y} Q(y) \log P(y \mid x; w) + \sum_{y} Q(y) \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \\
&= \sum_{y} Q(y) \log P(y \mid x; w) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \sum_{y} Q(y) \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)})
\end{aligned}$$

we see that the expected log-likelihood decomposes (additively) over the CONTRAfold
model and over each of the separate Gamma distributions. We will describe the gradient
computation for each of these components in turn. Conceptually, these are similar
to the gradient computation when structures are known, except that terms involving
the sufficient statistics (e.g., features F(x, y)) will be replaced by expected sufficient
statistics; the challenge is to compute these efficiently.
Gradient over w The required gradient is given by

$$\sum_{y} Q(y) \cdot \nabla_w \log P(y \mid x; w) = \sum_{y} Q(y) F(x, y) - \sum_{y'} P(y' \mid x; w) F(x, y')$$
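The gradient over w for a data-only example is a difference of two feature expectations: one under the posterior Q(y) and one under the CONTRAfold model P(y | x; w). The following toy sketch checks this against a finite-difference gradient of the marginal log-likelihood. It enumerates a handful of candidate structures explicitly and takes the per-structure data log-likelihoods as given numbers, both of which are simplifying assumptions for illustration; the names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def expected_features(probs, feats):
    return [sum(p * f[i] for p, f in zip(probs, feats)) for i in range(len(feats[0]))]

def marginal_loglik(w, feats, data_loglik):
    """log sum_y P(y | x; w) P(d | x, y), with data_loglik[y] = log P(d | x, y)."""
    prior = softmax([dot(w, f) for f in feats])
    return math.log(sum(p * math.exp(l) for p, l in zip(prior, data_loglik)))

def marginal_grad(w, feats, data_loglik):
    """E_Q[F] - E_P[F], where Q(y) is proportional to exp(w^T F(x, y) + log P(d | x, y))."""
    scores = [dot(w, f) for f in feats]
    prior = softmax(scores)
    posterior = softmax([s + l for s, l in zip(scores, data_loglik)])
    e_q = expected_features(posterior, feats)
    e_p = expected_features(prior, feats)
    return [a - b for a, b in zip(e_q, e_p)]
```

When the data are equally likely under every structure, the posterior equals the prior and the gradient vanishes, which is a quick sanity check that the data term alone drives this part of the update.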
We can compute the required feature expectations over Q(y) (the first term) by
adapting the existing routines in CONTRAfold for computing feature expectations
over the CONTRAfold model (the second term). We rewrite Q(y) in terms of known
quantities: the model probabilities in Equation B.1 and the form of the CONTRAfold
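log-linear model, $P(y \mid x; w) = \frac{\exp(w^T F(x, y))}{\sum_{y'} \exp(w^T F(x, y'))}$:

$$\begin{aligned}
Q(y) &= P(y \mid d, x; w, \theta) \\
&= \frac{P(y, d \mid x; w, \theta)}{\sum_{y'} P(y', d \mid x; w, \theta)} \\
&= \frac{P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y; \theta^{(j)})}{\sum_{y'} P(y' \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y'; \theta^{(j)})} \\
&= \frac{\exp(w^T F(x, y)) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y; \theta^{(j)})}{\sum_{y'} \exp(w^T F(x, y')) \prod_{j \in S_x} \prod_{k=1}^{L_x} P(d_k^{(j)} \mid x_k, y'; \theta^{(j)})} \\
&= \frac{\exp\Big( w^T F(x, y) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \Big)}{\sum_{y'} \exp\Big( w^T F(x, y') + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P(d_k^{(j)} \mid x_k, y'; \theta^{(j)}) \Big)}
\end{aligned}$$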
We see that Q(y) is also a log-linear model like CONTRAfold, but with additional
features for each base in the sequence given by the densities of the structure–probing
data. This means that we can simply modify the dynamic programming recurrences in
CONTRAfold to add the appropriate density terms whenever a base-pair or unpaired-
base is scored.
Gradient over θ Similar to the case for examples in $\mathcal{D}_{S+P}$, the partial derivatives
with respect to the unconstrained Gamma distribution parameters $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$ are given
by

$$\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}_{b,p}^{(j)}} \Big( \sum_{y} Q(y) \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \Big)
&= \sum_{y} Q(y)\, I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p] \big( \log d_k^{(j)} - \psi(\alpha_{b,p}^{(j)}) - \log \beta_{b,p}^{(j)} \big)\, \alpha_{b,p}^{(j)} \\
&= I[j \in S_x]\, I[x_k = b] \big( \log d_k^{(j)} - \psi(\alpha_{b,p}^{(j)}) - \log \beta_{b,p}^{(j)} \big)\, \alpha_{b,p}^{(j)} \sum_{y} Q(y)\, I[\text{paired}(k, y) = p] \\
\frac{\partial}{\partial \tilde{\beta}_{b,p}^{(j)}} \Big( \sum_{y} Q(y) \log P(d_k^{(j)} \mid x_k, y; \theta^{(j)}) \Big)
&= \sum_{y} \Big[ Q(y)\, I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p] \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big) \Big] \\
&= I[j \in S_x]\, I[x_k = b] \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big) \sum_{y} Q(y)\, I[\text{paired}(k, y) = p]
\end{aligned}$$

The required expectations $\sum_{y} Q(y)\, I[\text{paired}(k, y) = p]$ can be computed by adapting
the existing CONTRAfold routines for computing base pairing posteriors. Specifically,
we can adapt the CONTRAfold routine to compute the posterior probability
$p_{i,j}$ that base i pairs with base j under Q(y) instead of the original CONTRAfold
model (as previously described). Then, we can compute $\sum_{y} Q(y)\, I[\text{paired}(k, y) = p]$
by summing the posteriors over the appropriate positions. For example, if we wish
to find the sum over all structures for sequence x where paired(1, y) = paired, then
we compute $\sum_{j} p_{1,j}$. If we wish to find the sum where paired(1, y) = unpaired, then
we compute the sum as $1 - \sum_{j} p_{1,j}$.
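The last step, turning base-pair posteriors into per-base paired and unpaired probabilities, can be sketched as follows. The sketch assumes the posteriors are available as a full symmetric matrix (a hypothetical representation; the actual routine may store only the upper triangle) with zeros on the diagonal.

```python
def paired_unpaired_posteriors(pair_probs):
    """Per-base posterior probability of being paired or unpaired.

    pair_probs[i][j] is the posterior probability that base i pairs with
    base j (assumed symmetric, with pair_probs[i][i] == 0).
    """
    n = len(pair_probs)
    paired = [sum(pair_probs[k][j] for j in range(n) if j != k) for k in range(n)]
    unpaired = [1.0 - p for p in paired]
    return paired, unpaired
```

Since each base pairs with at most one partner in any single structure, the row sum is the total probability that the base is paired, and its complement is the probability that it is unpaired.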
B.1.2 Dataset Setup
Parameter Optimization CONTRAfold-SE is a gradient-based method that requires
an initialization for the model parameters. For different initializations, we find
that the metrics at consecutive gradient steps during parameter training converge,
and that the accuracy for different parameter initializations is also consistent. In
addition, the learned parameters are highly correlated across initializations
(Figure B.13). Since performance is largely insensitive to initialization, we select one
initialization style (described below) and use it throughout all experiments. We also
find that the accuracy across iterations saturates, and hence the number of iterations
at which optimization is stopped does not play a major role in practice.
Running and Evaluating CONTRAfold-SE Unless specified, we use the following
settings throughout: regularize = 1, maxiter = 1000. For initial weights
(“initweight”), we concatenate the original CONTRAfold parameters (for the structure
model) with 16 parameters specifying the natural logarithm of the shape and
scale parameters of 8 Gamma distributions, one for each paired or unpaired base
A, C, G, T (for the data model). Throughout, unless otherwise noted, we initialize
the parameters as follows: the structure model parameters are set to the optimal
ones given in CONTRAfold v2.02 (available at
http://contra.stanford.edu/contrafold/contrafold_v2_02.tar.gz); the data model
parameters are initialized by fitting a Gamma distribution to all bases in the first
2000 sequences that are data-dense and short, determined as described in the section
on training sets. This corresponds to “init0” in Figure B.13. In addition, we check two
other initializations: 1) the structure model parameters are as above and the data
model parameters are randomly set (init1); and 2) the non-zero structure model
parameters are set to a random value between -1 and 1 and the data model parameters
are randomly set (init2). For the parameter γ, we run on a grid from 0.000001 to 1024
(namely: 1e-4, 2e-4, 3e-4, 4e-4, 5e-4, 8e-4, 1e-3, 2e-3, 5e-3, 6e-3, 8e-3, 1e-2, 2e-2,
and 2^-5 through 2^10, incrementing the power). This tuning parameter roughly
controls the number of bases included in the final structure and affects specificity or
sensitivity.
We select the optimal λ using 10-fold cross-validation over a grid of values (0.001,
0.01, 0.05, 0.1, 0.5, 1): we divide the set of known-structure sequences into 10 sets,
evaluate AUC (see below) on each set (trained on the remaining structure-only and
all data-only sequences), and average across all 10 sets for each possible λ. We then
set λ to the value with the highest average AUC and learn the parameters over the
complete training set. We perform this procedure for Train-A and Train-B. Selected
λ are typically near 0.05. Train-A75%, Train-A100%, Train-A75, and Train-A100 use
the same λ as Train-A (namely, 0.05), since we are interested in seeing how adjusting
the data composition affects performance, and λ also modulates that.
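The cross-validation loop for choosing λ can be sketched as below. This is only an outline of the selection logic: `train_and_score` is a hypothetical callback that is assumed to train on the given structure-only sequences (plus all data-only sequences, which it would have to supply itself) and return the AUC on the held-out fold; nothing here reflects the actual CONTRAfold-SE command-line interface.

```python
def make_folds(items, k):
    """Split items into k folds by round-robin assignment."""
    return [items[i::k] for i in range(k)]

def select_lambda(folds, grid, train_and_score):
    """Return (best_lambda, best_mean_auc) over the grid.

    folds           : list of lists of known-structure sequences
    grid            : iterable of candidate lambda values
    train_and_score : hypothetical callable (lam, train_seqs, heldout_seqs) -> AUC
    """
    best_lam, best_auc = None, float("-inf")
    for lam in grid:
        aucs = []
        for i, heldout in enumerate(folds):
            # Train on every fold except the held-out one.
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            aucs.append(train_and_score(lam, train, heldout))
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc > best_auc:
            best_lam, best_auc = lam, mean_auc
    return best_lam, best_auc
```

After the best λ is found this way, the final model would be retrained once on the complete training set with that value, as described above.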
Training and Test Sets Train-A has two components making up 238 sequences:
sequences with only known secondary structure and sequences with only structure-
probing data. For the first component, we take the first 119 sequences from the 151
training set sequences compiled from RFAM for the CONTRAfold training set (set
S151 in [33]), after excluding any sequences that share an RFAM match with the test
sets described below or any of the yeast mRNA genes. For the second component, we
assign to each yeast mRNA sequence with structure-probing data a data-sparsity score
calculated as the length divided by the number of non-zero data counts per length.
We select the first 119 sequences with smallest score, again excluding any that share
an RFAM match with the test sets. This ensures that we are first using sequences that
are both short (faster running time) and have dense data (more structural information
for the algorithm to use). Train-B is constructed similarly but using the data-sparsity
score cycling through DMS-vitro, DMS-vivo, and PARS data instead.
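One reading of the data-sparsity score (length divided by the fraction of positions with non-zero counts) can be sketched as follows. The exact definition and tie-breaking used in the original experiments are not specified beyond the sentence above, so the formula, the handling of sequences with no data, and the function names are assumptions; the RFAM-based exclusion step is not modeled.

```python
def data_sparsity_score(signal):
    """Length divided by (non-zero counts per length); lower is shorter and denser."""
    length = len(signal)
    nonzero = sum(1 for v in signal if v != 0)
    if nonzero == 0:
        return float("inf")   # assumption: sequences with no data rank last
    density = nonzero / length
    return length / density

def select_training(signals, n):
    """Names of the n sequences with the smallest data-sparsity score.

    signals : dict mapping sequence name -> list of probing counts
    """
    ranked = sorted(signals, key=lambda name: data_sparsity_score(signals[name]))
    return ranked[:n]
```

Under this reading the score grows quadratically with length for a fixed number of covered positions, so short, densely probed sequences are selected first.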
B.2 Supplementary Figures and Tables
Figure B.1: Sensitivity-PPV curve for ASH1-E1 in Test-SeqFold.
Figure B.2: Sensitivity-PPV curve for RDN58-2 in Test-SeqFold.
Figure B.3: Sensitivity-PPV curve for p4p6 in Test-SeqFold.
Figure B.4: Sensitivity-PPV curve for p9 in Test-SeqFold.
Figure B.5: Sensitivity-PPV curve for snR10 in Test-SeqFold.
Figure B.6: Sensitivity-PPV curve for snR33 in Test-SeqFold.
Figure B.7: Sensitivity-PPV curve for snR37 in Test-SeqFold.
Figure B.8: Sensitivity-PPV curve for snR46 in Test-SeqFold.
Figure B.9: Sensitivity-PPV curve for snR53 in Test-SeqFold.
Figure B.10: Sensitivity-PPV curve for snR81 in Test-SeqFold.
[Figure B.11 image: one row of panels for each of Pum2, SF2ASF, FMR1_1 (ACUK), and FMR1_1 (WGGA). Each row shows the pairing probability for true versus false sites, the Mann-Whitney-Wilcoxon −log10 p-value, and a “Pairing Partners Heat Map for True Bound Genes” (sequence position against sequence position, positions 5 to 40; color scale ×10^−3).]
Figure B.11: Structure profiles for human RNA binding proteins.
Figure B.12: Learned noise model for structure probing data.
[Figure B.13 panels: pairwise scatter plots of the parameters learned from initializations init0 through init3. Structure parameters: init0 vs. init1 r=0.98; init0 vs. init2 r=0.97; init0 vs. init3 r=0.99; init1 vs. init2 r=0.97; init1 vs. init3 r=0.98; init2 vs. init3 r=0.97. Data parameters: r=1.00 for all six pairs.]
Figure B.13: Correlation between learned parameters for different parameter initializations.
Motif AUC for RNA binding protein classification

Not normalized by gene length
             C-SE(P,D-vitro)  C-SE(D-vivo)  C      SeqFold  # Motifs
PUF4-1       0.682            0.687         0.688  0.660    0.625
PUB1-1       0.606            0.606         0.598  0.598    0.598
PUF2-1       0.767            0.751         0.757  0.751    0.749
PAB1-1       0.678            0.677         0.682  0.642    0.665
KHD1-1       0.499            0.497         0.497  0.509    0.491
NAB2-1       0.549            0.540         0.547  0.545    0.548
YLL032C-1    0.708            0.683         0.699  0.706    0.670
VTS1-1       0.446            0.448         0.439  0.514    0.547
PIN4-1       0.956            0.974         0.975  0.947    0.948
NRD1-1       0.527            0.561         0.552  0.502    0.557

Normalized by gene length
             C-SE(P,D-vitro)  C-SE(D-vivo)  C      SeqFold  # Motifs
PUF4-1       0.664            0.682         0.695  0.609    0.534
PUB1-1       0.595            0.600         0.597  0.603    0.590
PUF2-1       0.695            0.682         0.673  0.662    0.676
PAB1-1       0.493            0.488         0.494  0.489    0.509
KHD1-1       0.507            0.503         0.508  0.513    0.499
NAB2-1       0.507            0.499         0.503  0.511    0.505
YLL032C-1    0.662            0.612         0.634  0.649    0.550
VTS1-1       0.403            0.399         0.410  0.472    0.489
PIN4-1       0.896            0.759         0.847  0.692    0.644
NRD1-1       0.489            0.533         0.516  0.463    0.494
Table B.1: AUC for receiver-operating-characteristic curves classifying bound RBP genes. C-SE represents CONTRAfold-SE and C represents CONTRAfold, trained on Train-B with the following data: PARS (P), DMS-vitro (D-vitro), DMS-vivo (D-vivo). The top half uses the aggregate accessibility of motifs and motif count normalized by gene length; the bottom half has no normalization (see Methods).
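As a reminder of what an AUC value measures, the following minimal Python sketch computes the area under the ROC curve as the probability that a bound gene receives a higher score than an unbound gene, with ties counted as one half. It is an illustration only: the function name `roc_auc` and its inputs (one score per gene, such as an aggregate motif accessibility, and a bound/unbound label) are assumptions, not the pipeline used to produce Table B.1.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (bound, unbound) gene pairs in which the bound
    gene has the higher score; tied pairs contribute 0.5.
    """
    bound = [s for s, is_bound in zip(scores, labels) if is_bound]
    unbound = [s for s, is_bound in zip(scores, labels) if not is_bound]
    if not bound or not unbound:
        raise ValueError("need at least one bound and one unbound gene")
    wins = 0.0
    for b in bound:
        for u in unbound:
            if b > u:
                wins += 1.0
            elif b == u:
                wins += 0.5
    return wins / (len(bound) * len(unbound))
```

An AUC near 0.5 therefore means the score does not separate bound from unbound genes, and values below 0.5 mean the ranking is inverted.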
                 rep1 (r)  rep1 (p)  rep2 (r)  rep2 (p)
5’ UTR  mean     0.13      5.7e-13   0.14      4.3e-14
        min      0.06      4.8e-04   0.04      2.9e-02
        max      0.09      3.4e-07   0.14      7.1e-14
        median   0.14      7.3e-14   0.13      4.3e-12
        std      0.07      9.7e-05   0.11      1.3e-08
        CV       -0.08     6.7e-06   -0.04     4.1e-02
CDS     mean     0.19      7.7e-25   0.23      2.0e-34
        min      -0.18     5.9e-24   -0.22     4.0e-31
        max      0.34      1.1e-82   0.44      1.2e-137
        median   0.18      1.6e-23   0.22      1.1e-31
        std      0.14      2.0e-14   0.18      6.7e-22
        CV       -0.13     2.1e-12   -0.15     6.5e-16
3’ UTR  mean     0.05      3.4e-03   0.03      1.0e-01
        min      0.06      6.7e-04   0.07      4.7e-04
        max      0.02      3.1e-01   0.02      4.1e-01
        median   0.06      5.8e-04   0.04      6.1e-02
        std      0.02      2.9e-01   0.01      6.8e-01
        CV       -0.05     3.9e-03   -0.03     9.2e-02
All     mean     0.24      1.9e-39   0.28      1.4e-51
        min      -0.15     1.9e-16   -0.19     1.2e-23
        max      0.33      4.5e-75   0.43      1.1e-127
        median   0.24      4.7e-39   0.27      1.5e-49
        std      0.15      6.0e-17   0.20      8.5e-26
        CV       -0.18     9.8e-23   -0.20     1.3e-26
Table B.2: Spearman correlation between CONTRAfold-SE and translation efficiency on in vivo data. Correlation is between CONTRAfold-SE pairing probabilities trained on Train-B(DMS-vivo) and log fold change in translation efficiency at time 0 and time 30 minutes: log(initial TE) - log(TE at 30min). Pairing probability is calculated over different regions and metrics (CV is coefficient of variation = std / mean). Efficiency is calculated over two replicates (see Methods).
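The quantities in this table can be illustrated with a short Python sketch: summarize the per-nucleotide pairing probabilities of a region with the six metrics, then rank-correlate one metric across genes with log(initial TE) - log(TE at 30min). This is a sketch under stated assumptions rather than the analysis code: the function names are hypothetical, the standard deviation is taken as the population form (the text does not say which form was used), and p-values are omitted.

```python
import math
from statistics import mean, median, pstdev
from typing import Dict, List, Sequence


def summarize(pairing: Sequence[float]) -> Dict[str, float]:
    """Six summary metrics of pairing probability over one region."""
    m = mean(pairing)
    s = pstdev(pairing)  # assumption: population standard deviation
    return {
        "mean": m, "min": min(pairing), "max": max(pairing),
        "median": median(pairing), "std": s,
        "CV": s / m if m != 0 else float("nan"),  # CV = std / mean
    }


def log_fold_change(te_initial: float, te_later: float) -> float:
    """log(initial TE) - log(TE at the later time point)."""
    return math.log(te_initial) - math.log(te_later)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For one row of the table, `x` would hold a single metric (for example the CDS mean) for every gene and `y` the corresponding log fold changes for one replicate.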
                 rep1 (r)  rep1 (p)  rep2 (r)  rep2 (p)
5’ UTR  mean     0.13      7.5e-13   0.17      2.5e-19
        min      0.10      1.3e-08   0.08      1.4e-05
        max      0.09      2.1e-06   0.15      4.1e-16
        median   0.14      3.5e-15   0.16      1.6e-17
        std      0.05      1.5e-02   0.10      1.5e-07
        CV       -0.11     5.6e-10   -0.08     3.4e-05
CDS     mean     0.20      3.2e-29   0.27      8.1e-47
        min      -0.26     4.2e-47   -0.29     3.7e-54
        max      0.31      1.3e-67   0.41      2.4e-115
        median   0.20      1.9e-27   0.25      2.8e-42
        std      0.05      6.3e-03   0.05      1.0e-02
        CV       -0.12     2.8e-10   -0.17     9.8e-20
3’ UTR  mean     0.02      1.9e-01   0.01      7.4e-01
        min      0.06      1.3e-03   0.06      1.8e-03
        max      -0.01     4.4e-01   -0.00     8.2e-01
        median   0.03      8.6e-02   -0.00     9.8e-01
        std      -0.01     6.9e-01   -0.00     8.4e-01
        CV       -0.04     3.9e-02   -0.01     5.0e-01
All     mean     0.31      1.2e-65   0.38      4.2e-96
        min      -0.22     1.7e-32   -0.24     2.7e-38
        max      0.30      1.6e-60   0.39      2.3e-104
        median   0.31      8.0e-65   0.37      1.7e-92
        std      0.05      3.6e-03   0.07      4.6e-04
        CV       -0.20     3.7e-28   -0.26     1.6e-43
Table B.3: Spearman correlation between CONTRAfold-SE and translation efficiency on in vitro data. Correlation is between CONTRAfold-SE pairing probability trained on Train-B(PARS,DMS-vitro) and log fold change in translation efficiency at time 0 and time 30 minutes: log(initial TE) - log(TE at 30min). Quantities are calculated as in Table B.2.
                 rep1 (r)  rep1 (p)  rep2 (r)  rep2 (p)
5’ UTR  mean     0.17      7.3e-20   0.15      2.0e-15
        min      0.02      3.0e-01   0.01      7.5e-01
        max      0.17      7.8e-20   0.17      4.8e-20
        median   0.15      1.1e-16   0.13      4.0e-12
        std      0.13      6.3e-13   0.13      8.7e-12
        CV       -0.04     2.4e-02   -0.02     2.7e-01
CDS     mean     0.09      2.7e-06   0.19      1.4e-23
        min      -0.20     3.8e-29   -0.14     5.4e-13
        max      0.31      2.1e-68   0.31      1.6e-64
        median   0.07      9.7e-05   0.17      7.8e-19
        std      0.13      2.7e-13   0.19      2.4e-25
        CV       -0.02     4.0e-01   -0.10     9.5e-08
3’ UTR  mean     0.02      2.2e-01   0.03      7.8e-02
        min      0.04      3.2e-02   0.04      5.0e-02
        max      -0.00     8.0e-01   0.01      5.9e-01
        median   0.03      1.5e-01   0.04      3.0e-02
        std      0.01      6.2e-01   0.01      5.1e-01
        CV       -0.03     6.4e-02   -0.03     1.1e-01
All     mean     0.13      2.4e-13   0.22      9.0e-34
        min      -0.18     1.0e-22   -0.11     1.1e-09
        max      0.30      1.4e-62   0.30      2.6e-58
        median   0.12      1.5e-11   0.21      6.7e-29
        std      0.14      2.1e-14   0.20      9.3e-26
        CV       -0.06     6.5e-04   -0.14     8.8e-14
Table B.4: Spearman correlation between CONTRAfold-SE and translation efficiency at earlier time point. Correlation is between CONTRAfold-SE pairing probability trained on Train-B(DMS-vivo) and log fold change in translation efficiency at time 0 and time 15 minutes: log(initial TE) - log(TE at 15min). Quantities are calculated as in Table B.2.
                   rep1 (r)  rep1 (p)  rep2 (r)  rep2 (p)
log(TE at 30min)
  5’ UTR  mean     -0.23     5.5e-38   -0.29     1.2e-54
  CDS     mean     -0.00     9.6e-01   0.02      3.8e-01
  3’ UTR  mean     -0.10     1.5e-07   -0.09     4.9e-06
  All     mean     -0.06     1.0e-03   -0.05     7.0e-03
log(initial TE)
  5’ UTR  mean     -0.10     6.9e-08   -0.09     7.1e-07
  CDS     mean     0.18      1.2e-23   0.15      3.8e-16
  3’ UTR  mean     -0.03     8.5e-02   -0.04     5.5e-02
  All     mean     0.17      1.6e-21   0.15      6.8e-15
log(initial TE) - log(TE at 30min) conditioned on log(initial TE)
  5’ UTR  mean     0.24      2.0e-38   0.21      2.4e-30
  CDS     mean     0.03      1.4e-01   0.11      1.0e-09
  3’ UTR  mean     0.04      2.1e-02   0.05      4.0e-03
  All     mean     0.09      3.9e-06   0.16      5.2e-18
Table B.5: Spearman correlation between CONTRAfold-SE in vivo and various TE quantities. CONTRAfold-SE is trained on Train-B(DMS-vivo) and other quantities are calculated as in Table B.2.
Bibliography
[1] Frank W Albert, Dale Muzzey, Jonathan S Weissman, and Leonid Kruglyak.
Genetic influences on translation in yeast. PLoS genetics, 10(10):e1004692,
October 2014.
[2] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts,
and Peter Walter. Molecular Biology of the Cell, 2002.
[3] Andrei Alexandrov, Irina Chernyakov, Weifeng Gu, Shawna L Hiley, Timo-
thy R Hughes, Elizabeth J Grayhack, and Eric M Phizicky. Rapid tRNA decay
can result from lack of nonessential modifications. Molecular cell, 21(1):87–96,
January 2006.
[4] Gerd Anders, Sebastian D Mackowiak, Marvin Jens, Jonas Maaskola, Andreas
Kuntzagk, Nikolaus Rajewsky, Markus Landthaler, and Christoph Dieterich.
doRiNA: a database of RNA interactions in post-transcriptional regulation.
Nucleic acids research, 40(Database issue):D180–6, January 2012.
[5] S G Andersson and C G Kurland. Codon preferences in free-living microorgan-
isms. Microbiological reviews, 54(2):198–210, June 1990.
[6] Yoav Arava, F Edward Boas, Patrick O Brown, and Daniel Herschlag. Dissect-
ing eukaryotic translation and its control by ribosome density mapping. Nucleic
acids research, 33(8):2421–32, January 2005.
[7] Yoav Arava, Yulei Wang, John D Storey, Chih Long Liu, Patrick O Brown,
and Daniel Herschlag. Genome-wide analysis of mRNA translation profiles in
Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences of
the United States of America, 100(7):3889–94, April 2003.
[8] Carlo G Artieri and Hunter B Fraser. Evolution at two levels of gene expression
in yeast. Genome research, 24(3):411–21, March 2014.
[9] Tzvi Aviv, Zhen Lin, Giora Ben-Ari, Craig A Smibert, and Frank Sicheri.
Sequence-specific recognition of RNA hairpins by the SAM domain of Vts1p.
Nature structural & molecular biology, 13(2):168–76, February 2006.
[10] A. Battle, Z. Khan, S. H. Wang, A. Mitrano, M. J. Ford, J. K. Pritchard,
and Y. Gilad. Impact of regulatory variation from RNA to protein. Science,
347(6222):664–667, December 2014.
[11] Kajetan Bentele, Paul Saffert, Robert Rauscher, Zoya Ignatova, and Nils
Bluthgen. Efficient translation initiation dictates codon usage at gene start.
Molecular systems biology, 9:675, January 2013.
[12] F Bonekamp, H Dalbøge, T Christensen, and K F Jensen. Translation rates of
individual codons are not correlated with tRNA abundances or with frequen-
cies of utilization in Escherichia coli. Journal of bacteriology, 171(11):5812–6,
November 1989.
[13] M Bulmer. The selection-mutation-drift theory of synonymous codon usage.
Genetics, 129(3):897–907, November 1991.
[14] Nicola A Burgess-Brown, Sujata Sharma, Frank Sobott, Christoph Loenarz,
Udo Oppermann, and Opher Gileadi. Codon optimization can improve expres-
sion of human genes in Escherichia coli: A multi-gene study. Protein expression
and purification, 59(1):94–102, May 2008.
[15] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A Limited
Memory Algorithm for Bound Constrained Optimization. SIAM Journal on
Scientific Computing, 16(5):1190–1208, September 1995.
[16] J H Cate and J A Doudna. Solving large RNA structures by X-ray crystallog-
raphy. Methods in enzymology, 317:169–80, January 2000.
[17] Catherine A Charneski and Laurence D Hurst. Positively charged residues are
the major determinants of ribosomal velocity. PLoS biology, 11(3):e1001508,
January 2013.
[18] Catherine A Charneski and Laurence D Hurst. Positive charge loading at pro-
tein termini is due to membrane protein topology, not a translational ramp.
Molecular biology and evolution, 31(1):70–84, January 2014.
[19] Chunlai Chen, Haibo Zhang, Steven L Broitman, Michael Reiche, Ian Farrell,
Barry S Cooperman, and Yale E Goldman. Dynamics of translation by single
ribosomes through mRNA secondary structures. Nature structural & molecular
biology, 20(5):582–8, May 2013.
[20] L Cheng and E Goldman. Absence of effect of varying Thr-Leu codon pairs on
protein synthesis in a T7 system. Biochemistry, 40(20):6102–6, May 2001.
[21] Fabienne F V Chevance, Soazig Le Guyon, and Kelly T Hughes. The effects
of codon context on in vivo translation speed. PLoS genetics, 10(6):e1004392,
June 2014.
[22] Dominique Chu, David J Barnes, and Tobias von der Haar. The role of tRNA
and ribosome competition in coupling the expression of different mRNAs in
Saccharomyces cerevisiae. Nucleic acids research, 39(15):6705–14, August 2011.
[23] Dominique Chu and Tobias von der Haar. The architecture of eukaryotic trans-
lation. Nucleic acids research, 40(20):10098–106, November 2012.
[24] Pablo Cordero, Wipapat Kladwang, Christopher C VanLang, and Rhiju Das.
Quantitative dimethyl sulfate mapping for automated RNA secondary structure
inference. Biochemistry, 51(36):7037–9, September 2012.
[25] J F Curran and M Yarus. Rates of aminoacyl-tRNA selection at 29 sense codons
in vivo. Journal of molecular biology, 209(1):65–77, September 1989.
[26] Lyris M F de Godoy, Jesper V Olsen, Jurgen Cox, Michael L Nielsen, Nina C
Hubner, Florian Frohlich, Tobias C Walther, and Matthias Mann. Comprehen-
sive mass-spectrometry-based proteome quantification of haploid versus diploid
yeast. Nature, 455(7217):1251–4, October 2008.
[27] Katherine E Deigan, Tian W Li, David H Mathews, and Kevin M Weeks.
Accurate SHAPE-directed RNA structure determination. Proceedings of the
National Academy of Sciences of the United States of America, 106(1):97–102,
January 2009.
[28] Elizabeth A Dethoff, Jeetender Chugh, Anthony M Mustoe, and Hashim M
Al-Hashimi. Functional complexity and regulation through RNA dynamics.
Nature, 482(7385):322–30, February 2012.
[29] Yang Ding, Premal Shah, and Joshua B Plotkin. Weak 5’-mRNA secondary
structures in short eukaryotic genes. Genome biology and evolution, 4(10):1046–
53, January 2012.
[30] Ye Ding and Charles E. Lawrence. A statistical sampling algorithm for RNA
secondary structure prediction. Nucleic Acids Research, 31(24):7280–7301, De-
cember 2003.
[31] Yiliang Ding, Yin Tang, Chun Kit Kwok, Yu Zhang, Philip C Bevilacqua, and
Sarah M Assmann. In vivo genome-wide profiling of RNA secondary structure
reveals novel regulatory features. Nature, November 2013.
[32] Kimberly A Dittmar, Evelyn M Mobley, Agnes Jancso Radek, and Tao Pan.
Exploring the regulation of tRNA distribution on the genomic scale. Journal
of molecular biology, 337(1):31–47, March 2004.
[33] Chuong B Do, Daniel A Woods, and Serafim Batzoglou. CONTRAfold: RNA
secondary structure prediction without physics-based models. Bioinformatics,
22(14):e90–e98, July 2006.
[34] Mario dos Reis, Renos Savva, and Lorenz Wernisch. Solving the riddle of codon
usage preferences: a test for translational selection. Nucleic acids research,
32(17):5036–44, January 2004.
[35] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Bio-
logical Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, Cambridge, UK, 1998.
[36] Sean R Eddy. Computational analysis of conserved RNA secondary structure
in transcriptomes and genomes. Annual Review of Biophysics, 43:433–456, Jan-
uary 2014.
[37] Chantal Ehresmann, Florence Baudin, Marylene Mougel, Pascale Romby, Jean-
Pierre Ebel, and Bernard Ehresmann. Probing the structure of RNAs in solu-
tion. Nucleic Acids Research, 15(22):9109–9128, November 1987.
[38] Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eber-
hardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina
Mistry, Erik L L Sonnhammer, John Tate, and Marco Punta. Pfam: the protein
families database. Nucleic acids research, 42(Database issue):D222–30, January
2014.
[39] Helena Firczuk, Shichina Kannambath, Jurgen Pahle, Amy Claydon, Robert
Beynon, John Duncan, Hans Westerhoff, Pedro Mendes, and John Eg Mc-
Carthy. An in vivo control map for the eukaryotic mRNA translation machinery.
Molecular systems biology, 9:635, January 2013.
[40] Kurt Fredrick and Michael Ibba. How the sequence of a gene can tune its
translation. Cell, 141(2):227–9, April 2010.
[41] Tsukasa Fukunaga, Haruka Ozaki, Goro Terai, Kiyoshi Asai, Wataru Iwasaki,
and Hisanori Kiryu. CapR: revealing structural specificities of RNA-binding
protein target recognition using CLIP-seq data. Genome Biology, 15(1):R16,
January 2014.
[42] Boris Furtig, Christian Richter, Jens Wohnert, and Harald Schwalbe. NMR
spectroscopy of RNA. ChemBioChem, 4(10):936–962, October 2003.
[43] Justin Gardin, Rukhsana Yeasmin, Alisa Yurovsky, Ying Cai, Steve Skiena, and
Bruce Futcher. Measurement of average decoding rates of the 61 sense codons
in vivo. eLife, 3, January 2014.
[44] Maxim V Gerashchenko, Alexei V Lobanov, and Vadim N Gladyshev. Genome-
wide ribosome profiling reveals complex translational regulation in response to
oxidative stress. Proceedings of the National Academy of Sciences of the United
States of America, 109(43):17394–9, October 2012.
[45] Aaron C Goldstrohm, Brad A Hook, Daniel J Seay, and Marvin Wickens. PUF
proteins bind Pop2p to regulate messenger RNAs. Nature structural & molecular
biology, 13:533–539, 2006.
[46] Wanjun Gu, Tong Zhou, and Claus O Wilke. A universal trend of reduced
mRNA stability near the translation-initiation site in prokaryotes and eukary-
otes. PLoS computational biology, 6(2):e1000664, February 2010.
[47] Claes Gustafsson, Sridhar Govindarajan, and Jeremy Minshull. Codon bias
and heterologous protein expression. Trends in biotechnology, 22(7):346–53,
July 2004.
[48] Christine E Hajdin, Stanislav Bellaousov, Wayne Huggins, Christopher W
Leonard, David H Mathews, and Kevin M Weeks. Accurate SHAPE-directed
RNA secondary structure modeling, including pseudoknots. Proceedings of the
National Academy of Sciences of the United States of America, 110(14):5498–
5503, April 2013.
[49] R Hamilton, C K Watanabe, and H A de Boer. Compilation and comparison of
the sequence context around the AUG startcodons in Saccharomyces cerevisiae
mRNAs. Nucleic acids research, 15(8):3581–93, April 1987.
[50] Winfried Hense, Nathan Anderson, Stephan Hutter, Wolfgang Stephan, John
Parsch, and David B Carlini. Experimentally increased codon bias in the
Drosophila Adh gene leads to an increase in larval, but not adult, alcohol de-
hydrogenase activity. Genetics, 184(2):547–55, February 2010.
[51] Ruth Hershberg and Dmitri A Petrov. Selection on codon bias. Annual review
of genetics, 42:287–99, January 2008.
[52] Wolf D Hirschmann, Heidrun Westendorf, Andreas Mayer, Gina Cannarozzi,
Patrick Cramer, and Ralf-Peter Jansen. Scp160p is required for translational
efficiency of codon-optimized mRNAs in yeast. Nucleic acids research, pages
gkt1392–, January 2014.
[53] Jessica I Hoell, Erik Larsson, Simon Runge, Jeffrey D Nusbaum, Sujitha Dug-
gimpudi, Thalia A Farazi, Markus Hafner, Arndt Borkhardt, Chris Sander, and
Thomas Tuschl. RNA targets of wild-type and mutant FET family proteins.
Nature structural & molecular biology, 18(12):1428–31, December 2011.
[54] Daniel J Hogan, Daniel P Riordan, Andre P Gerber, Daniel Herschlag, and
Patrick O Brown. Diverse RNA-binding proteins interact with functionally
related sets of RNAs, suggesting an extensive regulatory system. PLoS Biology,
6(10):e255, October 2008.
[55] Nicholas T. Ingolia. Ribosome profiling: new views of translation, from single
codons to genome scale. Nature Reviews Genetics, 15(3):205–213, January 2014.
[56] Nicholas T Ingolia, Gloria A Brar, Silvia Rouskin, Anna M McGeachy, and
Jonathan S Weissman. The ribosome profiling strategy for monitoring transla-
tion in vivo by deep sequencing of ribosome-protected mRNA fragments. Nature
protocols, 7(8):1534–50, August 2012.
[57] Nicholas T Ingolia, Sina Ghaemmaghami, John R S Newman, and Jonathan S
Weissman. Genome-wide analysis in vivo of translation with nucleotide reso-
lution using ribosome profiling. Science (New York, N.Y.), 324(5924):218–23,
April 2009.
[58] Nicholas T Ingolia, Liana F Lareau, and Jonathan S Weissman. Ribosome
profiling of mouse embryonic stem cells reveals the complexity and dynamics of
mammalian proteomes. Cell, 147(4):789–802, November 2011.
[59] B. Irwin, J. D. Heck, and G. W. Hatfield. Codon Pair Utilization Biases In-
fluence Translational Elongation Step Times. Journal of Biological Chemistry,
270(39):22801–22806, September 1995.
[60] Ailong Ke and Jennifer A Doudna. Crystallization of RNA and RNA-protein
complexes. Methods (San Diego, Calif.), 34(3):408–14, November 2004.
[61] Thomas E Keller, S David Mis, Kevin E Jia, and Claus O Wilke. Reduced
mRNA secondary-structure stability near the start codon indicates functional
genes in prokaryotes. Genome biology and evolution, 4(2):80–8, January 2012.
[62] Michael Kertesz, Yue Wan, Elad Mazor, John L Rinn, Robert C Nutter,
Howard Y Chang, and Eran Segal. Genome-wide measurement of RNA sec-
ondary structure in yeast. Nature, 467(7311):103–7, September 2010.
[63] Alex V Kochetov, Andrey Palyanov, Igor I Titov, Dmitry Grigorovich, Akinori
Sarai, and Nikolay A Kolchanov. AUG hairpin: prediction of a downstream
secondary structure influencing the recognition of a translation start site. BMC
bioinformatics, 8:318, January 2007.
[64] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles
and Techniques. Adaptive Computation and Machine Learning. MIT Press,
Cambridge, Massachusetts, 2009.
[65] M Kozak. Possible role of flanking nucleotides in recognition of the AUG ini-
tiator codon by eukaryotic ribosomes. Nucleic acids research, 9(20):5233–52,
October 1981.
[66] M Kozak. Downstream secondary structure facilitates recognition of initiator
codons by eukaryotic ribosomes. Proceedings of the National Academy of Sci-
ences of the United States of America, 87(21):8301–5, November 1990.
[67] Grzegorz Kudla, Andrew W Murray, David Tollervey, and Joshua B Plotkin.
Coding-sequence determinants of gene expression in Escherichia coli. Science
(New York, N.Y.), 324(5924):255–8, April 2009.
[68] Daniel H Lackner and Jurg Bahler. Translational control of gene expression
from transcripts to transcriptomes. International review of cell and molecular
biology, 271:199–251, January 2008.
[69] Daniel H. Lackner, Traude H. Beilharz, Samuel Marguerat, Juan Mata, Stephen
Watt, Falk Schubert, Thomas Preiss, and Jurg Bahler. A Network of Multiple
Regulatory Layers Shapes Gene Expression in Fission Yeast. Molecular Cell,
26(1):145–155, April 2007.
[70] Liana F. Lareau, Dustin H. Hite, Gregory J. Hogan, and Patrick O. Brown.
Distinct stages of the translation elongation cycle revealed by sequencing
ribosome-protected mRNA fragments. eLife, 3, 2014.
[71] Yizhar Lavner and Daniel Kotlar. Codon bias as a factor in regulating expres-
sion via translation rate in the human genome. Gene, 345(1):127–38, January
2005.
[72] Daniel P Letzring, Kimberly M Dean, and Elizabeth J Grayhack. Control
of translation efficiency in yeast by codon-anticodon interactions. RNA (New
York, N.Y.), 16(12):2516–28, December 2010.
[73] Fan Li, Qi Zheng, Paul Ryvkin, Isabelle Dragomir, Yaanik Desai, Subhadra
Aiyer, Otto Valladares, Jamie Yang, Shelly Bambina, Leah R Sabin, John I
Murray, Todd Lamitina, Arjun Raj, Sara Cherry, Li-San Wang, and Brian D
Gregory. Global analysis of RNA secondary structure in two metazoans. Cell
Reports, 1(1):69–82, January 2012.
[74] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for
large scale optimization. Mathematical Programming, 45(1-3):503–528, August
1989.
[75] A C Looman and J A Kuivenhoven. Influence of the three nucleotides up-
stream of the initiation codon on expression of the Escherichia coli lacZ gene in
Saccharomyces cerevisiae. Nucleic acids research, 21(18):4268–71, September
1993.
[76] T M Lowe and S R Eddy. tRNAscan-SE: a program for improved detection of
transfer RNA genes in genomic sequence. Nucleic acids research, 25(5):955–64,
March 1997.
[77] Julius B Lucks, Stefanie A Mortimer, Cole Trapnell, Shujun Luo, Sharon Avi-
ran, Gary P Schroth, Lior Pachter, Jennifer A Doudna, and Adam P Arkin.
Multiplexed RNA structure characterization with selective 2’-hydroxyl acyla-
tion analyzed by primer extension sequencing (SHAPE-Seq). Proceedings of the
National Academy of Sciences of the United States of America, 108(27):11063–8,
July 2011.
[78] Tobias Maier, Marc Guell, and Luis Serrano. Correlation of mRNA and protein
in complex biological samples. FEBS letters, 583(24):3966–73, December 2009.
[79] Orna Man and Yitzhak Pilpel. Differential translation efficiency of orthologous
genes is involved in phenotypic divergence of yeast species. Nature genetics,
39(3):415–21, March 2007.
[80] Nicholas R Markham and Michael Zuker. UNAFold: software for nucleic acid
folding and hybridization. Methods in molecular biology (Clifton, N.J.), 453:3–
31, January 2008.
[81] David H Mathews, Matthew D Disney, Jessica L Childs, Susan J Schroeder,
Michael Zuker, and Douglas H Turner. Incorporating chemical modification
constraints into a dynamic programming algorithm for prediction of RNA sec-
ondary structure. Proceedings of the National Academy of Sciences of the United
States of America, 101(19):7287–92, May 2004.
[82] C Joel McManus, Gemma E May, Pieter Spealman, and Alan Shteyman. Ribo-
some profiling reveals post-transcriptional buffering of divergent gene expression
in yeast. Genome research, 24(3):422–30, March 2014.
[83] Edward J Merino, Kevin A Wilkinson, Jennifer L Coughlan, and Kevin M
Weeks. RNA structure analysis at single nucleotide resolution by selective 2’-
hydroxyl acylation and primer extension (SHAPE). Journal of the American
Chemical Society, 127(12):4223–31, March 2005.
[84] Audrey M Michel, Kingshuk Roy Choudhury, Andrew E Firth, Nicholas T
Ingolia, John F Atkins, and Pavel V Baranov. Observation of dually decoded
regions of the human genome using ribosome profiling data. Genome research,
22(11):2219–29, November 2012.
[85] Namiko Mitarai and Steen Pedersen. Control of ribosome traffic by position-
dependent choice of synonymous codons. Physical biology, 10(5):056011, Octo-
ber 2013.
[86] Stefanie A Mortimer, Mary Anne Kidwell, and Jennifer A Doudna. Insights
into RNA structure and function from genome-wide studies. Nature reviews.
Genetics, 15(7):469–79, July 2014.
[87] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish
Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of
the yeast genome defined by RNA sequencing. Science (New York, N.Y.),
320(5881):1344–9, June 2008.
[88] John R S Newman, Sina Ghaemmaghami, Jan Ihmels, David K Breslow,
Matthew Noble, Joseph L DeRisi, and Jonathan S Weissman. Single-cell pro-
teomic analysis of S. cerevisiae reveals the architecture of biological noise. Na-
ture, 441(7095):840–6, June 2006.
[89] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom
Mitchell. Text Classification from Labeled and Unlabeled Documents using
EM. Machine Learning, 39(2-3):103–134, May 2000.
[90] Daniel A Nissley and Edward P O’Brien. Timing is everything: unifying codon
translation rates and nascent proteome behavior. Journal of the American
Chemical Society, 136(52):17892–8, December 2014.
[91] Florian C Oberstrass, Albert Lee, Richard Stefl, Michael Janis, Guillaume
Chanfreau, and Frederic H-T Allain. Shape-specific recognition in the structure
of the Vts1p SAM domain with RNA. Nature structural & molecular biology,
13(2):160–7, February 2006.
[92] Zhengqing Ouyang, Michael P Snyder, and Howard Y Chang. SeqFold: genome-
scale reconstruction of RNA secondary structure integrating high-throughput
sequencing data. Genome Research, 23(2):377–87, February 2013.
[93] Marc Parisien and Francois Major. The MC-Fold and MC-Sym pipeline infers
RNA structure from sequence data. Nature, 452(7183):51–5, March 2008.
[94] Joseph K Pickrell, John C Marioni, Athma A Pai, Jacob F Degner, Barbara E
Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens,
Yoav Gilad, and Jonathan K Pritchard. Understanding mechanisms underlying
human gene expression variation with RNA sequencing. Nature, 464(7289):768–
72, April 2010.
[95] Joshua B Plotkin and Grzegorz Kudla. Synonymous but not the same: the
causes and consequences of codon bias. Nature reviews. Genetics, 12(1):32–42,
January 2011.
[96] Cristina Pop, Silvi Rouskin, Nicholas T Ingolia, Lu Han, Eric M Phizicky,
Jonathan S Weissman, and Daphne Koller. Causal signals between codon bias,
mRNA structure, and the efficiency of translation and elongation. Molecular
systems biology, 10(12):770, January 2014.
[97] Tomasz Puton, Lukasz P Kozlowski, Kristian M Rother, and Janusz M Bujnicki.
CompaRNA: a server for continuous benchmarking of automated methods for
RNA secondary structure prediction. Nucleic Acids Research, 41(7):4307–4323,
April 2013.
[98] Wenfeng Qian, Jian-Rong Yang, Nathaniel M Pearson, Calum Maclean, and
Jianzhi Zhang. Balanced codon usage optimizes eukaryotic translational effi-
ciency. PLoS genetics, 8(3):e1002603, January 2012.
[99] Scott Quarrier, Joshua S Martin, Lauren Davis-Neulander, Arthur Beauregard,
and Alain Laederach. Evaluation of the information content of RNA structure
mapping data for secondary structure prediction. RNA (New York, N.Y.),
16(6):1108–17, June 2010.
[100] Vladimir Reinharz, Francois Major, and Jerome Waldispuhl. Towards 3D struc-
ture prediction of large RNA molecules: an integer programming framework to
insert local 3D motifs in RNA secondary structure. Bioinformatics (Oxford,
England), 28(12):i207–14, June 2012.
[101] Jessica S Reuter and David H Mathews. RNAstructure: software for RNA
secondary structure prediction and analysis. BMC Bioinformatics, 11(1):129,
January 2010.
[102] Shlomi Reuveni, Isaac Meilijson, Martin Kupiec, Eytan Ruppin, and Tamir
Tuller. Genome-scale analysis of translation elongation with a ribosome flow
model. PLoS computational biology, 7(9):e1002127, September 2011.
[103] Elena Rivas, Raymond Lang, and Sean R Eddy. A range of complex probabilis-
tic models for RNA secondary structure prediction that includes the nearest-
neighbor model and more. RNA, 18(2):193–212, February 2012.
[104] A Robbins-Pianka, M D Rice, and M P Weir. The mRNA landscape at yeast
translation initiation sites. Bioinformatics (Oxford, England), 26(21):2651–5,
November 2010.
[105] Silvi Rouskin, Meghan Zubradt, Stefan Washietl, Manolis Kellis, and
Jonathan S. Weissman. Genome-wide probing of RNA structure reveals ac-
tive unfolding of mRNA structures in vivo. Nature, December 2013.
[106] Zuben E Sauna and Chava Kimchi-Sarfaty. Understanding the contribution of
synonymous mutations to human disease. Nature reviews. Genetics, 12(10):683–
91, October 2011.
[107] Premal Shah, Yang Ding, Malwina Niemczyk, Grzegorz Kudla, and Joshua B
Plotkin. Rate-limiting steps in yeast protein translation. Cell, 153(7):1589–601,
June 2013.
[108] Bruce A Shapiro, Yaroslava G Yingling, Wojciech Kasprzak, and Eckart Binde-
wald. Bridging the gap in RNA structure prediction. Current opinion in struc-
tural biology, 17(2):157–65, April 2007.
[109] J Shine and L Dalgarno. The 3’-terminal sequence of Escherichia coli 16S
ribosomal RNA: complementarity to nonsense triplets and ribosome binding
sites. Proceedings of the National Academy of Sciences of the United States of
America, 71(4):1342–6, April 1974.
[110] M A Sørensen, C G Kurland, and S Pedersen. Codon usage determines
translation rate in Escherichia coli. Journal of molecular biology, 207(2):365–77, May
1989.
[111] M A Sørensen and S Pedersen. Absolute in vivo translation rates of individual
codons in Escherichia coli. The two glutamic acid codons GAA and GAG are
translated with a threefold difference in rate. Journal of molecular biology,
222(2):265–80, November 1991.
[112] Keith A Spriggs, Martin Bushell, and Anne E Willis. Translational regulation
of gene expression during conditions of cell stress. Molecular cell, 40(2):228–37,
October 2010.
[113] Michael Stadler and Andrew Fire. Wobble base-pairing slows in vivo translation
elongation in metazoans. RNA (New York, N.Y.), 17(12):2063–73, December
2011.
[114] Michael Stadler and Andrew Fire. Wobble base-pairing slows in vivo translation
elongation in metazoans. RNA (New York, N.Y.), 17(12):2063–73, December
2011.
[115] David W Staple and Samuel E Butcher. Pseudoknots: RNA structures with
diverse functions. PLoS Biology, 3(6):e213, June 2005.
[116] Zsuzsanna Sükösd, Bjarne Knudsen, Jørgen Kjems, and Christian N S Pedersen.
PPfold 3.0: fast RNA secondary structure prediction using phylogeny
and auxiliary data. Bioinformatics (Oxford, England), 28(20):2691–2, October
2012.
[117] Zsuzsanna Sükösd, M Shel Swenson, Jørgen Kjems, and Christine E Heitsch.
Evaluating the accuracy of SHAPE-directed RNA secondary structure predic-
tions. Nucleic acids research, 41(5):2807–16, March 2013.
[118] Fran Supek and Tomislav Šmuc. On relevance of codon usage to expression of
synthetic and natural genes in Escherichia coli. Genetics, 185(3):1129–34, July
2010.
[119] Jesper Tholstrup, Lene B Oddershede, and Michael A Sørensen. mRNA pseudoknot
structures can act as ribosomal roadblocks. Nucleic acids research,
40(1):303–13, January 2012.
[120] T. Tuller and H. Zur. Multiple roles of the coding sequence 5’ end in gene
expression regulation. Nucleic Acids Research, gku1313, December 2014.
[121] Tamir Tuller, Asaf Carmi, Kalin Vestsigian, Sivan Navon, Yuval Dorfan, John
Zaborske, Tao Pan, Orna Dahan, Itay Furman, and Yitzhak Pilpel. An evolu-
tionarily conserved mechanism for controlling the efficiency of protein transla-
tion. Cell, 141(2):344–54, April 2010.
[122] Tamir Tuller, Isana Veksler-Lublinsky, Nir Gazit, Martin Kupiec, Eytan Rup-
pin, and Michal Ziv-Ukelson. Composite effects of gene determinants on the
translation speed and density of ribosomes. Genome biology, 12(11):R110, Jan-
uary 2011.
[123] Tamir Tuller, Yedael Y Waldman, Martin Kupiec, and Eytan Ruppin. Trans-
lation efficiency is determined by both codon bias and folding energy. Pro-
ceedings of the National Academy of Sciences of the United States of America,
107(8):3645–50, February 2010.
[124] Sotaro Uemura, Colin Echeverría Aitken, Jonas Korlach, Benjamin A Flusberg,
Stephen W Turner, and Joseph D Puglisi. Real-time tRNA transit on single
translating ribosomes at codon resolution. Nature, 464(7291):1012–7, April
2010.
[125] Jason G Underwood, Andrew V Uzilov, Sol Katzman, Courtney S Onodera,
Jacob E Mainzer, David H Mathews, Todd M Lowe, Sofie R Salama, and David
Haussler. FragSeq: transcriptome-wide RNA structure probing using high-
throughput sequencing. Nature Methods, 7(12):995–1001, December 2010.
[126] S Varenne, J Buc, R Lloubes, and C Lazdunski. Translation is a non-uniform
process. Effect of tRNA availability on the rate of elongation of nascent polypep-
tide chains. Journal of molecular biology, 180(3):549–76, December 1984.
[127] Christine Vogel, Gustavo Monteiro Silva, and Edward M Marcotte. Protein
expression regulation under oxidative stress. Molecular & cellular proteomics:
MCP, 10(12):M111.009217, December 2011.
[128] Dennis P Wall, Aaron E Hirsh, Hunter B Fraser, Jochen Kumm, Guri Giaever,
Michael B Eisen, and Marcus W Feldman. Functional genomic analysis of the
rates of protein evolution. Proceedings of the National Academy of Sciences of
the United States of America, 102:5483–5488, 2005.
[129] Yue Wan, Michael Kertesz, Robert C Spitale, Eran Segal, and Howard Y Chang.
Understanding the transcriptome through RNA structure. Nature Reviews Ge-
netics, 12(9):641–655, September 2011.
[130] Yue Wan, Kun Qu, Qiangfeng Cliff Zhang, Ryan A Flynn, Ohad Manor,
Zhengqing Ouyang, Jiajing Zhang, Robert C Spitale, Michael P Snyder, Eran
Segal, and Howard Y Chang. Landscape and variation of RNA secondary struc-
ture across the human transcriptome. Nature, 505(7485):706–9, January 2014.
[131] Stefan Washietl, Ivo L Hofacker, Peter F Stadler, and Manolis Kellis. RNA
folding with soft constraints: reconciliation of probing data and thermodynamic
secondary structure prediction. Nucleic Acids Research, 40(10):4261–72, May
2012.
[132] Kevin M Weeks. Advances in RNA structure analysis by chemical probing.
Current Opinion in Structural Biology, 20(3):295–304, June 2010.
[133] Mark Welch, Sridhar Govindarajan, Jon E Ness, Alan Villalobos, Austin Gur-
ney, Jeremy Minshull, and Claes Gustafsson. Design parameters to control
synthetic gene expression in Escherichia coli. PLoS ONE, 4(9):e7002, January
2009.
[134] Jin-Der Wen, Laura Lancaster, Courtney Hodges, Ana-Carolina Zeri, Shige H
Yoshimura, Harry F Noller, Carlos Bustamante, and Ignacio Tinoco. Following
translation by single ribosomes one codon at a time. Nature, 452(7187):598–603,
April 2008.
[135] Kevin A Wilkinson, Suzy M Vasa, Katherine E Deigan, Stefanie A Mortimer,
Morgan C Giddings, and Kevin M Weeks. Influence of nucleotide identity on
ribose 2’-hydroxyl reactivity in RNA. RNA (New York, N.Y.), 15(7):1314–21,
July 2009.
[136] Christopher J Woolstenhulme, Shankar Parajuli, David W Healey, Diana P
Valverde, E Nicholas Petersen, Agata L Starosta, Nicholas R Guydosh, W Evan
Johnson, Daniel N Wilson, and Allen R Buskirk. Nascent peptides that block
protein synthesis in bacteria. Proceedings of the National Academy of Sciences
of the United States of America, 110(10):E878–87, March 2013.
[137] Xiaoqiu Wu, Hans Jörnvall, Kurt D Berndt, and Udo Oppermann. Codon
optimization reveals critical factors for high level expression of two rare codon
genes in Escherichia coli: RNA stability and secondary structure but not tRNA
abundance. Biochemical and biophysical research communications, 313(1):89–
96, January 2004.
[138] D F Yun, T M Laz, J M Clements, and F Sherman. mRNA sequences in-
fluencing translation and the selection of AUG initiator codons in the yeast
Saccharomyces cerevisiae. Molecular microbiology, 19(6):1225–39, March 1996.
[139] Shay Zakov, Yoav Goldberg, Michael Elhadad, and Michal Ziv-Ukelson. Rich
parameterization improves RNA structure prediction. Journal of Computational
Biology, 18(11):1525–1542, November 2011.
[140] Gong Zhang, Magdalena Hubalewska, and Zoya Ignatova. Transient ribosomal
attenuation coordinates protein synthesis and co-translational folding. Nature
structural & molecular biology, 16(3):274–80, March 2009.
[141] S Zhang, E Goldman, and G Zubay. Clustering of low usage codons and ribo-
some movement. Journal of theoretical biology, 170(4):339–54, October 1994.
[142] Qi Zheng, Paul Ryvkin, Fan Li, Isabelle Dragomir, Otto Valladares, Jamie
Yang, Kajia Cao, Li-San Wang, and Brian D Gregory. Genome-wide double-
stranded RNA sequencing reveals the functional significance of base-paired
RNAs in Arabidopsis. PLoS genetics, 6(9):e1001141, September 2010.
[143] Tong Zhou and Claus O Wilke. Reduced stability of mRNA secondary structure
near the translation-initiation site in dsDNA viruses. BMC evolutionary biology,
11(1):59, January 2011.
[144] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA
sequences using thermodynamics and auxiliary information. Nucleic Acids Re-
search, 9(1):133–148, January 1981.
[145] Hadas Zur and Tamir Tuller. Strong association between mRNA folding
strength and protein abundance in S. cerevisiae. EMBO reports, 13(3):272–
7, March 2012.