
PROBABILISTIC MODELS FOR UNDERSTANDING

REGULATION OF TRANSLATION

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Cristina Pop

March 2015

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/fm548ck9534

© 2015 by Cristina Pop. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Daphne Koller, Primary Adviser


Serafim Batzoglou


Jonathan Weissman

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

The process of translation, whereby RNA is converted to protein, is an essential

biosynthetic process requiring a large fraction of the cell's resources. However, our

understanding of the regulatory mechanisms at this stage of gene expression is limited.

Recent high-throughput experimental techniques and our development of probabilistic

models for their analysis have allowed us to better explore translation efficiency, codon

preferences, and mRNA secondary structure, as well as the interplay between these

factors.

In this work, we first present a queuing-theory-based probabilistic model for ribo-

some profiling data to extract robust estimates of protein synthesis rates and trans-

lation rates per codon, which can vary across individual genes. We use this model to

show that local rates and translation efficiency are not affected by manipulations of

tRNA abundance in physiological conditions in yeast; this reverses the direction of

causality previously assumed to hold. Instead, we propose that initiation sequence

signals, such as mRNA structure, could drive translation. To further understand

varying translation rates, we also apply this model to human cells and present results

on allele-specific ribosome pausing.

Second, we delve deeper into RNA structure, which is important more broadly

throughout the pipeline of protein expression and in many aspects of regulation con-

trol. However, accurately determining RNA structure at large scale is difficult with

only experimental data or algorithmic methods. We present a conditional log-linear


model that can incorporate information from multiple structure probing assays, and,

although limited by the data quality, improves prediction accuracy over leading algo-

rithms. Our method can also be used to derive new insight into biological processes

influenced by RNA structure, such as translation.


Acknowledgements

First and foremost, I’d like to thank my advisor, Daphne Koller, for her mentorship

and inspiration. Your source of boundless knowledge and spot-on guidance steered

me throughout my academic growth. Your ambition and fearlessness in asking the

hard questions became my goalpost too. I thank you for so much of what I have

learned in research, both in skills and in independence.

I’d also like to thank Jonathan Pritchard, for the insight he provided during

the last part of my PhD and the very warm support. I am very appreciative to

Jonathan Weissman, for fruitful discussions on much of this work and an amazing

long-term collaboration. Thank you also to Serafim Batzoglou and Anshul Kundaje,

who provided useful comments and valuable questions.

Throughout my PhD, I have had the pleasure of collaborating with a number of

fantastic people. Thank you:

• Nick Ingolia and Silvi Rouskin, for in-depth discussions and also for your patience in teaching me the biology I did not know.

• Chuan-Sheng Foo, for many vibrant and productive discussions, for the late-night work sessions, and for being a great friend.

• Vlad Jojic, for your support in making my first year academically rich.

• Sara Mostafavi, for introducing me to a different part of computational biology.

• Members of DAGS throughout the years: Suchi Saria, Karen Sachs, Alexis Battle, Ben Packer, Joni Laserson, Manfred Classen, Varun Ganapathi, Yoni Donner, David Knowles, Yi Liu, Irene Kaplow, Pawan Kumar, Michael Stark, Huayan Wang, Tianshi Gao, Clara Fannjiang, and Madiha Chan.

• Members of the Pritchard lab, the Weissman lab, and the Batzoglou lab with whom I have had the pleasure of working.

Thank you Alex Sandra and members of the CS department for making my days

stressless. I’d also like to acknowledge the NSF Graduate Research Fellowship and

NSERC Postgraduate Scholarship.

Finally, I am grateful for many of the folks I have had a chance to share these

years with:

To the friends I met in grad school – your support always made my day (and my

PhD, and the rest of my life).

To my parents, Emil and Ana, for wisdom and strength. You are my pillars in

everything, always.

To my sister, Ana, who has an uncanny way of being whatever you need. You are

awesome.

And most of all, to my grandmothers – from one I learned joy and math, from

one I learned wit and philosophy. From both of you, I learned the enchantment of

never giving up. My genes owe you.


Contents

Abstract
Acknowledgements

1 Introduction

2 Background
2.1 Cell Biology
2.2 Analysis of Ribosome Profiling Data
2.3 Prediction of RNA Secondary Structure

3 A Model for Translation
3.1 Introduction
3.2 Results
3.2.1 Queuing Model for Elongation
3.2.2 Codon Translation and tRNA Manipulation
3.2.3 Translation Efficiency and tRNA Manipulation
3.2.4 Factors Correlating with Elongation Efficiency
3.2.5 Factors Correlating with Translation Efficiency
3.3 Discussion
3.4 Materials and Methods
3.5 Conclusions

4 Translation in Humans
4.1 Introduction
4.2 Results
4.2.1 Allele-Specific Ribosome Dwell Times
4.2.2 Codon Translation Rates Across Individuals
4.3 Discussion
4.4 Methods
4.5 Conclusions

5 RNA Secondary Structure Prediction
5.1 Introduction
5.2 Results
5.2.1 Improved Secondary Structure Predictions
5.2.2 The Value of Structure-Probing Data
5.2.3 Combining Data from Multiple Data Sources
5.2.4 Classification of RNA-Binding Protein Targets
5.2.5 Nucleotide-Level Structure Contexts for RNA-Binding Proteins
5.2.6 Structure and Translation Efficiency under Oxidative Stress
5.3 Discussion
5.4 Methods
5.4.1 The CONTRAfold-SE Model
5.4.2 Dataset Setup
5.5 Conclusions

6 Conclusions
6.1 Contributions
6.2 Going Forward

A Ribosome Profiling
A.1 Supplementary Methods
A.2 Supplementary Figures and Tables

B RNA Secondary Structure
B.1 Supplementary Methods
B.1.1 Model Specification
B.1.2 Dataset Setup
B.2 Supplementary Figures and Tables

Bibliography

List of Tables

5.1 F-measure of CONTRAfold-SE (C-SE) trained on Train-A(PARS) and evaluated on Test-SeqFold.
5.2 Performance of CONTRAfold-SE trained on Train-A and Train-B and evaluated on three general test sets.
5.3 Performance of CONTRAfold-SE trained on sets of varying compositions with PARS data and evaluated on two test sets.

A.1 Counts of tRNA in RPM (number of reads per million) in ACA-K and wild-type.
A.2 Eight categories of potential correlates to outlier strength.
A.3 Spearman correlation between outlier strength and features, separated by type and highlighted if significant.
A.4 Performance of TE regression model.
A.5 Summary of main results for model variations.

B.1 AUC for receiver-operating-characteristic curves classifying bound RBP genes.
B.2 Spearman correlation between CONTRAfold-SE and translation efficiency on in vivo data.
B.3 [...] efficiency on in vitro data.
B.4 [...] efficiency at earlier time point.
B.5 Spearman correlation between CONTRAfold-SE in vivo and various TE quantities.

List of Figures

2.1 Central dogma of biology.
2.2 Ribosome footprint density profile versus mRNA density profile.
2.3 Alternative splicing.
2.4 Common structure motifs.

3.1 Model of protein synthesis.
3.2 Correlation between codon translation rates and measures of codon usage bias.
3.3 Comparison between codon translation rates in wild-type and mutants.
3.4 Comparison between translation efficiency in wild-type and mutants.
3.5 All codons show negative correlation between outlier strength and proximity to gene start.
3.6 RNA structure energy and its relationship to translation efficiency.
3.7 Estimated Kozak motif for efficient genes.

4.1 Comparison of ribosome fragment counts between alleles at SNPs.
4.2 Comparison of inferred codon dwell times between four random pairs of human individuals.

5.1 Overview of CONTRAfold-SE.
5.2 CONTRAfold-SE performance using different data sources.
5.3 Classification of RNA binding protein targets into true bound versus false bound genes.
5.4 Nucleotide-level structure prediction for the true bound sequences of RNA binding protein FXR2 with motif WGGA.
5.5 Correlation between translation efficiency per gene and the accessibility in rolling windows of 40nt, as predicted by CONTRAfold-SE.

A.1 Correlation between experimental measures of protein abundance, and estimated flow and average footprint count (baseline).
A.2 Overexpression of tRNAArg(CCU) does not significantly alter amino acid charging levels.
A.3 The ratio between estimated mutant and wild-type rates.
A.4 The ratio of mutant to wild-type footprint count per codon.
A.5 The analysis of Figure A.2 repeated on flow instead of TE.
A.6 Distribution of three features among reduced TE genes and increased TE genes in ACA-K.
A.7 Correlation between log(TE) and gene-level features.
A.8 Dwell-corrected footprint counts normalized by flow.
A.9 Codon translation rates versus tAI.
A.10 Histograms of positions of slow outliers and non-outliers are similar.
A.11 Two different initializations of the parameters for the translation model.

B.1 Sensitivity-PPV curve for ASH1-E1 in Test-SeqFold.
B.2 Sensitivity-PPV curve for RDN58-2 in Test-SeqFold.
B.3 Sensitivity-PPV curve for p4p6 in Test-SeqFold.
B.4 Sensitivity-PPV curve for p9 in Test-SeqFold.
B.5 Sensitivity-PPV curve for snR10 in Test-SeqFold.
B.11 Structure profiles for human RNA binding proteins.
B.12 Learned noise model for structure probing data.
B.13 Correlation between learned parameters for different parameter initializations.

Chapter 1

Introduction

The expression of protein from DNA follows a complex series of steps. Classically,

genes within the DNA (a string of bases) are transcribed into RNA (a similar string

of bases) and translated into protein (a string of amino acids). We now know there is

further processing, post-modifications, and feedback at each stage, which makes the

linear process suggested by the central dogma much more complicated.

At the RNA and protein level, the cell contains multiple copies of each gene in

varying amounts (i.e. genes are expressed in varying amounts), with transcription

and translation changing the amount of each transcript in a process called gene reg-

ulation. The correlation between the level of RNA and the level of protein across the

genes in an organism is not perfect; it ranges from a Pearson r-value of 0.36 to 0.66

depending on the organism [78]. Knowing how much of each transcript is translated is

important for many levels of biological understanding, including determining how to

control translation, understanding how translation changes with disease states, and

deciphering the mechanisms behind differences between individuals.

Much research has been devoted toward understanding regulation of transcription

(converting DNA to RNA), namely, why some genes are expressed more than others.

However, only over the past few years have we developed high-throughput assays


and direct experiments to understand translation (converting RNA to protein) and

the factors that contribute to its regulation. Embedded in these data are insights

about the process of translation, but accessing them requires handling sparse data,

distinguishing noise from the true signal, and identifying the relationships between

the variables in the underlying biological process. These tasks require new analytical

frameworks designed for these new kinds of data. Probabilistic models are particularly

useful in this case when we have prior information on the structure of the

process but not much ground truth data to learn from. These techniques also let us

infer missing values, smooth out noise, and learn biologically meaningful variables.

Consequently, in this work, we provide robust probabilistic methods for extracting

biologically meaningful parameters from high-throughput datasets in order to gain a

mechanistic understanding of the regulation of translation.

One of the key variables of interest in translation regulation is the relative amount

of protein produced between genes. Traditional techniques, mass spectrometry and

tagging using green fluorescent protein, are able to measure protein levels but suffer

from lower accuracy, especially for low-abundance levels. More recently, a high-throughput

deep sequencing technique called ribosome profiling [57] has produced

a high-resolution snapshot of translation and a finer way for estimating ribosome

throughput, which is proportional to protein abundance. In addition, several studies

have appeared on properties of the RNA related to ribosome throughput or local

ribosome dynamics in order to understand the basis for what makes translation ef-

ficient in some proteins but not in others, and to tease apart causal factors from

correlated factors. This technique also allows easier comparison between physiologi-

cal conditions and synthetic biology constructs, which we exploit in this work. One

feature of the RNA in particular, namely the structure of RNA (how the RNA folds

on itself), has garnered much attention, both as a potential regulating mechanism for


translation and in many other essential processes within the cell. Whereas computa-

tional methods and experimental methods have generally not been tightly integrated

in the goal of structure prediction, recent high-throughput datasets providing partial

structure-probing data have made this coupling easier.

In Chapter 2, we give a brief background on the process of translation, the ex-

perimental assays we use in this work, the generated data that measures protein

abundance and RNA structure, and previous computational approaches for analyz-

ing such data. We then present, in Chapter 3, a probabilistic model for a ribosome

profiling dataset, which allows us to extract several variables of interest: the protein

synthesis rate and the rate at which each codon is translated. We also offer some bio-

logical factors influencing translation regulation in yeast. In Chapter 4, we extend our

analysis of translation to a human dataset, focusing on genetic variation and codon

translation rates. In Chapter 5, we return to RNA secondary structure as a poten-

tial regulator of translation and present a probabilistic model, CONTRAfold-SE, for

improved structure prediction using partial large-scale information. Finally, we sum-

marize the contributions of this thesis in Chapter 6 and reflect on our approaches

to modeling high-throughput data for better understanding regulation of translation

and its mechanistic basis.

Chapter 2

Background

2.1 Cell Biology

The cell consists of three major players: DNA, RNA, and protein. Although we have

gained much insight into the path from DNA to protein, there are many subtleties

and concepts left to discover. In this chapter, we will introduce some key biological

concepts specifically related to translation, the focus of this thesis. Translation is

a conversion from RNA to protein. Although much RNA is functional and a major

player in a number of key processes acting within the cell, proteins are often called the

building blocks. Each protein participates in a specific activity, including housekeep-

ing, regulation, or pairing with other proteins into complexes to achieve sophisticated

functions. Taking another step backward, proteins are derived from genes in DNA

– specific strings that undergo transcription (DNA to RNA), translation (RNA to

protein), and other processing before eventually becoming a functional protein. In

the following sections, we specifically focus on translation and the biological concepts

important for understanding its regulation. Figure 2.1 summarizes these concepts.

Further information can be found in [2].

Each gene is transcribed and translated into multiple copies of RNA and protein,


both an oriented concatenation of smaller units. This string of units identifies the

molecule, or the encoded gene. At a high level of abstraction, we refer to the RNA

transcript as the string that is directly translated into a protein (called mature mes-

senger RNA, or mRNA) and in the next section we will introduce an additional layer

of processing at the RNA level. Translation initiates at the 5’ end when the ribosome,

another large molecule, latches onto the RNA transcript, translocating towards the

3’ end. The ribosome converts the RNA string into the protein string, essentially

converting the string in series from one alphabet to another: the alphabet of RNA

(the four bases Adenine, Cytosine, Guanine, and Uracil, shortened to A, C, G, and U)

to the alphabet of proteins (the 20-22 amino acids, depending on the organism). Dur-

ing translation, the RNA string is grouped into consecutive triplets of bases, called

codons, each of which corresponds to a specific amino acid. This code is redundant;

with 64 codons and 22 amino acids, there are 1-6 codons coding for the same amino

acid. These are called synonymous codons, because substituting one for another will

not affect the final protein product. Synonymous codons are not uniformly used; this

preference phenomenon is called codon usage bias and its basis and influence is an

active topic of study and debate in translation literature.
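
The triplet grouping and the redundancy of the code can be illustrated with a short sketch. The function names and the deliberately partial codon table below are illustrative assumptions, not part of this work; the full table has 64 entries.

```python
# Partial standard genetic code (illustrative subset; the full table has 64 codons).
CODON_TABLE = {
    "AUG": "Met",                                             # start codon
    "UUU": "Phe", "UUC": "Phe",                               # two synonymous codons
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",   # four synonymous codons
    "UGG": "Trp",                                             # encoded by a single codon
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def split_codons(rna):
    """Group an RNA string into consecutive, non-overlapping triplets."""
    usable = len(rna) - len(rna) % 3      # drop an incomplete trailing triplet
    return [rna[i:i + 3] for i in range(0, usable, 3)]

def translate(rna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for codon in split_codons(rna):
        amino_acid = CODON_TABLE[codon]   # raises KeyError outside the subset
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

# Swapping synonymous codons (GGU -> GGC, UUU -> UUC) leaves the protein unchanged.
assert translate("AUGGGUUUUUAA") == translate("AUGGGCUUCUAA") == ["Met", "Gly", "Phe"]
```

Because the two sequences differ only in synonymous codons, any difference in how they are translated would have to come from factors such as codon usage bias rather than from the protein product itself.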

The first codon in the string, the start codon, is typically AUG, which encodes

the amino acid methionine (Met). The ribosome begins initiation, starting translation

at the AUG and pausing to recruit other necessary helper molecules such as initia-

tion factors. Once the Met amino acid is added at the beginning of the novel protein

chain, the ribosome transitions to a stage called elongation, in which the ribosome

translocates to each successive codon, pausing at each one to recruit another impor-

tant molecule called tRNA. The tRNA molecule is specific to the codon currently in

the A-site (the active site of the ribosome), and holds the associated amino acid to be

added onto the growing protein chain. Similar to non-uniform codon usage, tRNAs


[Figure 2.1 diagram. Labels: DNA (ATG…TAA); RNA with 5’ end, 5’ UTR, AUG, UAA, 3’ UTR, 3’ end; mRNA structure; codon; ribosome with sites A, P, E; tRNA; amino acid; protein; stages initiation, elongation, termination.]

Figure 2.1: Central dogma of biology. DNA is transcribed into RNA and RNA is translated into protein. The ribosome initiates translation at the AUG codon, located after the 5’ UTR. The ribosome then enters the elongation stage, where it pauses at each codon to recruit the tRNA molecule that brings in the associated amino acid. The ribosome is large enough to cover 3 codon positions: the currently active one is in the A-site, the previously translated codon is in the P-site, and the second-last translated codon is in the E-site. The ribosome terminates translation at a stop codon.


exist in varying amounts, floating in the cytoplasm until they are needed by the ri-

bosome. Typically, elongation is much less intensive than initiation. The final codon

is the stop codon, typically one of UAG, UAA, UGA. At this stage, termination, the

ribosome subunits dissociate and the completed protein chain is released.

There are several biological components involved in this process. We focus on

those that we will refer to in this thesis, and in particular those that have been

associated in literature with translation regulation:

Sequence Signals Beyond the codons themselves, the RNA transcript encodes

various other sequence signals that are important for initiation, elongation, and termination.

Upstream of (before) the AUG is the 5’ UTR (untranslated region) and,

similarly, downstream (after) the stop codon is the 3’ UTR. These regions are, as

indicated, not typically translated by the ribosome into amino acids, but do act as

indicators for the start and stop of the gene, or regulate translation via, for example,

RNA structure.

RNA structure The RNA strand folds on itself in its native state. Since this

structure has to be unfolded while the ribosome elongates translation, this barrier

could intuitively affect the speed of translation [86]. It has also been shown that spe-

cific types of structures at the 5’ end can impact initiation and hence the efficiency of

translation [66, 63, 104]. We will focus on better determining structure via computa-

tional and experimental techniques in the last section of this thesis. Various motifs

or k-mers in the RNA (strings of length k) can also be recognized by other molecules

that can bind these regions in order to repress or help translation. A particular region

of interest around the start codon has seen much analysis [66, 109], but other parts

of the RNA could be affected by RNA binding proteins.


Protein folding Similarly to the RNA, the growing protein chain can also fold on

itself as it is translated (co-translational folding). Since this structure is often critical

to its function, the ribosome might need to pause at specific locations in order to

ensure a correct fold [140, 90].

Ribosome conformation The ribosome itself could also affect its speed of trans-

lation. Recently it was shown that the ribosome takes on two different conformations

[70], an area that we are now able to explore at a larger-scale with the advent of

high-throughput techniques.

Other sequence signals Finally, there are many other signals that have been

suggested to regulate translation. The co-occurrence of specific codons (codon pairs)

could lead their amino acids to interact with each other in speed-impacting ways

within the ribosome [59, 20, 21]. This occurs because the ribosome actually houses

each codon through the A, P, and E sites, from amino acid recruitment to exit from

the ribosome tunnel and into the freed protein chain not protected by the ribosome.

The A-site codon is the one for which the ribosome recruits the amino acid, and

it is shifted over into the P site for further processing and to make room for the

next amino acid. Clusters of rare or “slow” codons could impede translation [141].

Specific codons upstream of the stop codon could affect translation [120] and AUG

sites upstream of the actual start location, within the 5’ UTR, could act as a regulatory

mechanism for initiation. Depending on the organism, the specific set and the sample

space of biological factors could vary greatly. For example, a particular motif in E.

coli is by far a significant repressor of translation [73], whereas that factor is not

observed in other organisms, like yeast.


2.2 Analysis of Ribosome Profiling Data

Experimental Techniques

Protein abundance has traditionally been measured by the standard techniques of

mass spectrometry and fluorescence-tagging (GFP), which give a relative abundance

level representing how many copies exist for each gene. A more recent technique, ribo-

some profiling [57], combines the concept of polysome profiling with deep sequencing

to extract information about translation at a codon-level resolution. In particular,

ribosomes are immobilized during translation using flash freezing (or, originally also

a drug called cycloheximide), capturing the location of the active codon (the A site).

Since the ribosome is a large molecule, it also covers the region around the active

codon, around 30 nucleotides (30nt) in length. The RNA not covered by the ribo-

some is digested, leaving only the ribosome-bound fragments. These RNA are then

purified to remove the ribosome. In a manner similar to measuring abundance of

RNA, these fragments are then reverse transcribed into complementary DNA, am-

plified, and sequenced, so that they can be aligned to the genome. This final stage

reveals their location in the genome and, hence, which gene they correspond to. For

specific fragments, we can confidently and unambiguously identify the location of the

active codon on this footprint length (typically halfway in). Therefore, this data gives

us a ribosome footprint density profile for every gene, representing how many counts

we observe for every codon on that gene, or how many ribosomes were translating

each codon at a given snapshot in time (Figure 2.2).
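
A minimal sketch of how such a profile could be assembled, assuming footprints are already aligned to a transcript and that the A-site codon sits a fixed number of nucleotides from the footprint's 5' end. The function name, the coordinate convention, and the 15 nt offset (a stand-in for "about halfway into a roughly 30 nt footprint") are illustrative assumptions rather than the procedure used in this work.

```python
from collections import Counter

def footprint_density(footprint_starts, cds_start, n_codons, a_site_offset=15):
    """Count footprints per codon of one gene.

    footprint_starts: 0-based transcript coordinates of each footprint's 5' end.
    cds_start: transcript coordinate of the first base of the start codon.
    n_codons: number of codons in the coding sequence.
    a_site_offset: assumed distance (nt) from the footprint 5' end to the A site.
    """
    counts = Counter()
    for start in footprint_starts:
        rel = start + a_site_offset - cds_start   # A-site position relative to the CDS
        if rel < 0:
            continue                              # A site falls in the 5' UTR
        codon_index = rel // 3
        if codon_index < n_codons:                # ignore A sites past the last codon
            counts[codon_index] += 1
    return [counts[i] for i in range(n_codons)]

# Toy gene: CDS starts at transcript position 20 and is 4 codons long.
profile = footprint_density([0, 5, 6, 8, 11, 40], cds_start=20, n_codons=4)
assert profile == [2, 1, 1, 0]
```

Each entry of the returned list is the number of footprints whose inferred A site falls on that codon, which is the per-codon count described above.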

Given a snapshot at steady-state, uniform translation speed by the ribosome across

the transcript, and sufficient sampling depth (sufficient footprints), we could average

the footprint counts in each gene profile to obtain an estimate of how many ribosomes

terminate translation per transcript. Ribosome throughput corresponds, up to a

factor that accounts for protein degradation, to how many proteins are produced


[Figure 2.2 plot. Panels: Ribosome Footprint Density Profile and mRNA Density Profile; x-axis: gene position [codon].]

Figure 2.2: Ribosome footprint density profile versus mRNA density profile. These densities are over a sample gene of length 250 codons. The ribosome counts (top) have more variance than the mRNA counts.

for each gene. Indeed, this technique is applied when estimating RNA abundance

from RNA-seq data. In RNA-seq, the transcriptome is randomly fragmented (as

in ribosome-profiling, but with no ribosomes, and random cuts of the RNA) and

mapped back to the genome, this time giving RNA abundance profiles for each gene.

Averaging the fragment counts per gene in RNA-seq is a reasonable approach since

in that situation we expect a uniform coverage of all positions on the transcript if

fragments are randomly selected. However, during translation, the ribosome pauses

for varying amounts across a gene, and hence the footprints extracted from this

process are not uniform across a gene. Figure 2.2 indeed shows a comparison between

a typical ribosome footprint density profile and an RNA-seq profile. And so, in the

case of ribosome-profiling, we also obtain an estimate for each codon of how many

ribosomes were translating that location across all copies of the gene. These extracted

counts are proportional to the dwell time of the ribosome at that location.


Several ribosome profiling studies have now appeared in a variety of organisms

including yeast, E. coli, C. elegans, Arabidopsis, mouse, and human, in various conditions

including amino acid starvation, oxidative stress, and physiological conditions

[55]. In this work, we first focus on the model organism yeast and then move to a

higher-order organism, human.

Sample preparation is critical in each study and varies from dataset

to dataset. As previously discussed, cycloheximide can be used to halt

translation, but it has been shown that it can bias fragment extraction and produce

artifacts in the fragment counts [8]. Therefore, in this work, we use a modified

procedure with flash freezing instead of drug treatment (as in [56]).

Computational Techniques

In the setup described above, several ad-hoc methods simply take the average of the

counts in order to obtain an estimate of protein abundance. Similarly, in order to

obtain an estimate of the speed of the ribosome at a particular location, one could

divide the count at the position in question by the average of those in the window

around it. These approaches are complicated by the fact that the ribosome is not

translating each gene at uniform speed – particularly slow or fast positions can inflate

or deflate the average.
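The two ad-hoc estimates described above can be sketched as follows. This is a minimal illustration, not code from this work: the function names, the window size, and the example counts are hypothetical, and the window is assumed to include the position itself.

```python
from statistics import mean

def naive_gene_level(counts):
    """Ad-hoc abundance estimate: plain average of footprint counts over a gene."""
    return mean(counts)

def naive_relative_dwell(counts, k, half_window=2):
    """Ad-hoc local speed proxy: count at position k divided by the average
    count in a window centred on k (the window is clipped at the gene ends)."""
    lo = max(0, k - half_window)
    hi = min(len(counts), k + half_window + 1)
    window_avg = mean(counts[lo:hi])
    return counts[k] / window_avg if window_avg > 0 else float("nan")

# Hypothetical footprint counts for a short gene; position 3 is a pause site.
counts = [10, 10, 10, 50, 10, 10, 10]
gene_level = naive_gene_level(counts)          # pulled upward by the pause site
pause_ratio = naive_relative_dwell(counts, 3)  # 50 divided by the window mean
print(gene_level, pause_ratio)
```

In this toy gene, a single slow position raises the per-gene average well above the count seen at every other position, which is the distortion the paragraph above describes.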

Another approach is to model the codons as sites on a 1D lattice and the movement

of the ribosome as a TASEP, a totally asymmetric simple exclusion process [102]. In

this approach, ribosomes enter the system with a certain rate and process each codon

(site) with a certain rate. This representation allows easy addition of components

such as ribosome drop-off via an exit rate at each unit. Several physics-based methods

treat variations of such a system by adjusting the boundary conditions, the input and

output rates per site, and/or the occupancy of each site in relation to those around it,

in order to represent physical properties like steric restrictions caused by ribosome


stacking due to a slow codon. However, the more complicated the model becomes,

the harder the analytical treatment. As such, these methods are forced to make

simplifying assumptions that make the model unrealistic (e.g. uniform translation

rate per gene) or forced to rely on simulations. These approaches use ribosome

profiling data either to apply system constraints or to find ideal parameter settings

from a series of simulations that attempt to re-generate the data.
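To make the TASEP picture concrete, the following is a minimal simulation sketch using random sequential updates. It is an illustration under simplifying assumptions, not one of the cited methods: each ribosome occupies a single site (real ribosomes cover roughly ten codons), there is no drop-off, and all names and rate values are hypothetical.

```python
import random

def simulate_tasep(rates, init_rate, steps, seed=0):
    """Random-sequential-update TASEP on a 1D lattice of codons.

    rates[i]  : probability per attempt of hopping out of site i
                (for the last site this is the termination step)
    init_rate : probability per attempt that a ribosome enters at site 0
    Returns (mean occupancy per site, number of completed proteins).
    """
    rng = random.Random(seed)
    n = len(rates)
    occupied = [False] * n
    occupancy_sum = [0] * n
    completed = 0
    for _ in range(steps):
        i = rng.randrange(-1, n)          # -1 means "attempt initiation"
        if i == -1:
            if not occupied[0] and rng.random() < init_rate:
                occupied[0] = True
        elif occupied[i] and rng.random() < rates[i]:
            if i == n - 1:                # termination: ribosome leaves
                occupied[i] = False
                completed += 1
            elif not occupied[i + 1]:     # exclusion: next site must be free
                occupied[i] = False
                occupied[i + 1] = True
        for j in range(n):
            occupancy_sum[j] += occupied[j]
    return [s / steps for s in occupancy_sum], completed

# Hypothetical gene of 10 codons with one slow codon at index 4.
rates = [0.9] * 10
rates[4] = 0.1
occ, done = simulate_tasep(rates, init_rate=0.5, steps=50000)
print([round(o, 2) for o in occ], done)
```

The slow site accumulates occupancy, and sites downstream of it are sparsely occupied, which mirrors how slow codons produce peaks in footprint density.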

To the best of our knowledge, probabilistic models have not been used for analyz-

ing such data with high accuracy.

Translation in Higher-Order Organisms

Translation and other biological processes in humans are considerably more complicated

than in lower-order organisms like yeast. For example, humans are diploid

organisms, with two copies of each chromosome, with each copy potentially containing

different versions (alleles) of the genome at specific key sites called single-nucleotide

polymorphisms (SNPs). Even synonymous SNPs have been shown to induce differ-

ent phenotypes [106], but often the mechanism via which they act is not understood.

Is the speed of translation different for each allele? What biological factors affect

these translation-level changes? These are interesting questions toward understand-

ing the genetic basis behind translation (namely, what genome-level differences cause

associated changes in translation).

Another complication in higher-order organisms is that each gene can give rise to more

than one protein isoform (Figure 2.3). To describe this, we refine our definition of RNA. DNA

is first transcribed into pre-mRNA (pre messenger RNA). These transcripts encode

alternating regions of introns and exons, whereby the introns are removed via splicing

and only the exons are retained in the mature mRNA. This type of RNA, which we

simply refer to as mRNA or RNA in the remainder of the thesis, contains the exons

that are translated into a protein. However, in eukaryotes, the same template of exons


[Figure 2.3 diagram: a pre-mRNA with alternating exons and introns (exon, intron, exon, intron, exon) and two mature mRNA isoforms, isoform 1 and isoform 2.]

Figure 2.3: Alternative splicing. Splicing occurs when pre-mRNA produces different mature mRNA copies. Introns are always removed from mature mRNA. The green exon is skipped in isoform 1, but kept in isoform 2. Ribosome fragment counts (short black lines) that map to common exons can be mapped to either isoform, but the green exon footprints can unambiguously be mapped to isoform 2.

(the same gene) can be parsed differently to produce different proteins, or isoforms.

This process, called alternative splicing, can occur for example by skipping certain ex-

ons from an mRNA. Clearly, in a deep-sequencing context, when ribosome fragments

of lengths shorter than exons need to be mapped back to the genome, we encounter

identifiability issues. Properly attributing each ribosome-protected fragment to each

protein isoform is a difficult process and should be considered when interpreting the

data and model results.

2.3 Prediction of RNA Secondary Structure

As previously described, the structure of RNA is critical to its function. Structured

motifs in an RNA molecule permit or impede the binding of proteins and small

molecules, resulting in downstream effects on gene expression [129, 86]. For example,

presence of a pseudoknot (a specific structural motif) during elongation has been


shown to cause a shift in the reading frame of the ribosome, which disturbs the

parsed 3-codon periodicity and can lead to an amino acid mis-incorporation that

renders the protein non-functional [119].

High-accuracy experimental techniques for measuring RNA structure are typically

expensive, low-throughput, and can only be achieved in vitro, which doesn’t always

reflect the folding kinetics in a live organism. Consequently, computational meth-

ods have been developed to predict structure from the RNA sequence. While the

ultimate goal of RNA structure modelling methods is to determine a complete three-

dimensional structure, this is currently an extremely challenging task [108]. The 3D

structure includes many different forces beyond those at the “secondary structure”

level – such as long-range forces that play a role in the final structure but are not

well-understood and hard to model. As such, much effort has focused instead on the

more tractable problem of determining secondary structure: the set of intra-molecular

complementary Watson-Crick basepairs (A pairs with U, and C pairs with G). Suc-

cessful prediction of secondary structure is an important step towards a complete

three-dimensional model of an RNA molecule; many 3D structure prediction algo-

rithms use a putative secondary structure as a scaffold for determining higher order

tertiary interactions (e.g., pseudoknots) [108, 93, 100]. After many advances in com-

putational techniques, such as the use of machine learning, prediction accuracy has

mostly remained around 50-70%, varying with the class of structures, the length of

the RNA, and other factors. Besides the pseudoknot, there are several other common

motifs in RNA structure (Figure 2.4). In general, C-G basepairs are more energetically

stable (have lower free energy) than A-U basepairs.

When creating computational methods for secondary structure, there are three

competing axes we want to optimize on: speed, accuracy, and generality. Speed is

often described in terms of Big-O notation relative to the length of the RNA strand

in question. Accuracy is relative to the ground truth structure. Generality refers to


[Figure 2.4 diagram: labeled motifs are hairpin, stacked pair, stem, internal loop, pseudoknot, and loop.]

Figure 2.4: Common structure motifs. The top row is a cartoon representation of the folded RNA. The bottom row is an arc diagram where the bases are ordered from the 5’ to the 3’ end of the region and connected by an arc if they are paired. A loop is a set of unpaired bases and a stem is a set of paired bases.

which types of motifs are allowed in the structures. In this work, we will focus on

secondary structure motifs. We will refer to secondary structure, or simply structure,

as the set of Watson-Crick basepairs without pseudoknots. In the arc diagrams of

Figure 2.4, these are motifs without crossing arcs. Although we will not analyze

running time complexity in this work, it is important to note that as algorithms

handle more complex structures exactly, their running time often increases, which

makes long sequences over 1000nt difficult to predict on or include in training sets.
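The "no crossing arcs" condition can be checked directly on a list of base pairs. The sketch below is illustrative; the function name and the example structures are hypothetical, and positions are 0-based indices from the 5’ end.

```python
def has_crossing_arcs(pairs):
    """Return True if any two base pairs (i, j) and (k, l) cross,
    i.e. i < k < j < l, which is the signature of a pseudoknot."""
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            if i < k < j < l:
                return True
    return False

hairpin = [(0, 9), (1, 8), (2, 7)]      # nested arcs only
pseudoknot = [(0, 6), (1, 5), (3, 9)]   # (3, 9) crosses both earlier pairs
print(has_crossing_arcs(hairpin), has_crossing_arcs(pseudoknot))
```

A structure passes as secondary structure in the sense used here exactly when this check returns False.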

Experimental Techniques

Individual RNA structures are most accurately determined through low-throughput

experimental means, such as NMR spectroscopy [42], X-ray crystallography [16, 60],


or chemical and enzymatic probing methods [37, 132]. The former two are both time-

consuming and expensive, but the recent combination of the latter methods with

high-throughput sequencing has led to the development of several genome-wide RNA

structure-probing assays [142, 62, 125, 77, 31, 105]. These assays reveal which nu-

cleotides are paired and which are not, but cannot determine specific pairing partners.

In this thesis, we will be focusing on the latter high-throughput assays, consisting of

three major structure-probing approaches: PARS, DMS, and SHAPE.

In the PARS assay [62], the RNA structure signal is obtained by treating RNA

with enzymes that preferentially cleave either paired or unpaired nucleotides. These

cleaved fragments are of different lengths depending on the location of the paired/un-

paired base and hence can be mapped back to the genome to reveal how likely that

position was to be paired or unpaired. These counts are combined to form a score

per base representing structured-ness.

The DMS-seq assay [105] relies on the reactivity of unpaired nucleotides to a

small molecule, dimethyl sulfate (DMS). Reactive positions block reverse

transcriptase, again leaving pieces which can be mapped back to the genome for a

score per base representing unstructured-ness. The DMS-seq assay was applied to

both renatured RNA and live yeast, giving us a glimpse into both in vitro and in vivo

settings.

Finally, SHAPE-seq [83] is a chemical-probing method using selective 2′-hydroxyl

acylation analyzed by primer extension. In this chemistry, a reagent reacts with single-

stranded sequence and similarly blocks reverse transcriptase. This data is thought to

be less biased [36], but a large-scale assay for it has yet to be developed.

Computational Techniques

RNA secondary structure prediction methods can be broadly classified into energy-

based methods and methods based on statistical models. Energy-based prediction


methods or algorithms based on thermodynamic models [144, 80, 101] compute a

minimum free-energy (MFE) secondary structure using experimentally derived ener-

gies for each template motif (for example, a stacked pair of A-U followed by C-G

emits a certain free energy that has been measured in an experimental setting; the

combination of these motifs is then explored via a dynamic programming algorithm

to derive the set of pairings that emit the lowest energy).
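The dynamic programming idea can be illustrated with a much simpler scoring scheme than a thermodynamic model. The sketch below is a Nussinov-style recursion that maximizes the number of nested Watson-Crick pairs; it is a stand-in for exposition only, since real MFE methods score stacks and loops with measured energies. The function name and the minimum loop length are assumptions.

```python
def max_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested Watson-Crick pairs.

    Maximizing pair count is a crude proxy for minimizing free energy.
    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    can_pair = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                      # base i left unpaired
            for k in range(i + min_loop + 1, j + 1):    # base i pairs with k
                if (seq[i], seq[k]) in can_pair:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(max_pairs("GGGAAACCC"))  # three G-C pairs closing an AAA loop
```

The triple loop gives cubic running time in the sequence length, which is the kind of cost referred to when speed is described in Big-O terms.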

Methods based on statistical models, on the other hand, rely on data from a

training set of sequences and their known structures in order to learn a model of

secondary structure. In general, statistical methods for RNA secondary structure

prediction outperform energy-based methods [103, 97]. CONTRAfold [33] is an ex-

ample of one of the leading statistical algorithms for pseudoknot-free prediction, and

the one which we will extend in this work. CONTRAfold is a conditional log-linear

model modeling the probability of a structure given a sequence using a weighted sum

of features reflecting those included in MFE-based models. For example, a feature

could be the indicator that we see an (A-U, C-G) stack at positions (i, j), (i+1, j−1).

Similar to the dynamic programming approach for MFE models, we can write a set of

recursions that are solved via a version of the inside-outside algorithm common

in stochastic context-free grammar models from natural language processing [35].
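The general form of a conditional log-linear model can be written compactly. The notation here is assumed for illustration and is not taken from the CONTRAfold paper: $x$ is the sequence, $y$ a candidate structure, $\mathbf{F}(x, y)$ a vector of feature counts, $\mathbf{w}$ the learned weights, and $\mathcal{Y}(x)$ the set of pseudoknot-free structures for $x$.

```latex
P(y \mid x; \mathbf{w}) \;=\;
  \frac{\exp\!\big(\mathbf{w}^{\top}\mathbf{F}(x, y)\big)}
       {\sum_{y' \in \mathcal{Y}(x)} \exp\!\big(\mathbf{w}^{\top}\mathbf{F}(x, y')\big)}
```

The denominator sums over exponentially many structures, which is why inside-outside style recursions are needed to compute it.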

More recently, structure-probing data such as SHAPE and PARS have been used

in conjunction with computational methods in order to infer complete RNA structures

[81, 27, 99, 131, 92, 48]. Thus far, such methods have been heuristic derivatives of

thermodynamic models and do not explicitly model the structure-probing data.

Chapter 3

A Model for Translation

3.1 Introduction

The translation of RNA into protein is the nexus of decoding genetic information

into functional polypeptides and also a central biosynthetic process consuming a sub-

stantial fraction of the cell’s resources. Although apparently redundant nucleotide

sequences encode each protein, usage of different synonymous codons is highly bi-

ased [95]. These preferences are strongest in highly-expressed genes throughout di-

verse organisms [79, 51], suggesting selective pressure for the efficient use of the trans-

lational apparatus during the synthesis of abundant proteins. At the same time, less

common codons may be used in order to modulate translation, or may arise due

to competing sequence constraints such as mRNA secondary structure. While the

evolutionary signature of codon bias is clear, its biochemical basis remains unsettled.

Ribosome profiling [57] is an emerging technique for profiling translation in vivo

that is well suited to provide insights into the factors controlling the speed of transla-

tion as well as the amounts of each protein produced by the cell. Ribosome profiling

data comprise a set of ribosome-protected fragments (footprints) marking ribosome

density along mRNA transcripts with codon resolution. We can therefore extract

from these data both the yield of each protein (protein synthesis rate) and the rate at

which each codon is translated (codon translation rate or elongation rate). However,

estimation of these two quantities is nontrivial, and ad-hoc approaches disregard dif-

ferences in elongation rates between genes or exclude mRNAs with sparse footprint

coverage. A number of studies with different analysis approaches present varying

hypotheses for the mechanisms underlying variation in elongation and translation ef-

ficiency in yeast and other organisms [123, 121, 58, 122, 113, 98, 17, 107, 136, 70, 43].

These include codon effects mediated by tRNA abundance or wobble base pairing, as

well as effects of mRNA structure and the nascent peptide on the ribosome.

Here, we present a rigorous statistical method that estimates, from ribosome pro-

filing data, both elongation rates and protein synthesis levels on individual transcripts;

as a byproduct, it also estimates translation efficiency (TE), the propensity of a tran-

script to generate complete protein, defined as the total amount of protein produced

from an mRNA message, and calculated here as our model-derived protein synthesis

rates divided by the mRNA levels. We use our robust modeling framework in con-

junction with new high-resolution data from wild-type yeast, along with three tRNA

mutants, to explore some of the conflicting views on the causality between codon

usage and elongation rate, as well as between codon usage and TE, in physiological

conditions at a genome-wide level.

We first apply our model to examine biological factors contributing to local trans-

lation kinetics. Due to differences in tRNA levels that correlate with synonymous

codon bias, variability in codon translation rates observed per gene is commonly

thought to be governed by the abundance of cognate tRNAs [126, 110]. However,

codon bias does not correlate with indirect measures of decoding speed, at least in bac-

teria [12, 25]. Similar to other observations in ribosome profiling datasets [73, 98, 17],

we find that codon usage bias is a poor predictor of elongation rate. We further test

for causal influence and illustrate that experimentally manipulating tRNA abundance


or body similarly does not affect the elongation rate when decoding with the manip-

ulated tRNA. In addition, our model identifies positions where elongation is slower

than expected based on codon identity and suggests that such pauses commonly occur

closer to the 5’ end but are unrelated to codon bias.

Finally, we use our model to disentangle the factors underlying message-specific

differences in translational efficiency. In physiological conditions, initiation rather

than elongation may largely determine overall protein production; initiation predom-

inates when it is slow relative to the time needed to elongate through the width of

one ribosome (∼10 codons), so that translating ribosomes rarely interfere with each

other, and when elongation is highly processive, so that most initiation events re-

sult in a protein [5, 13, 7, 68]. Analysis of our tRNA-perturbed mutant experiments

shows that efficiency is not causally affected by improving tRNA levels, leading us to

focus on initiation signals in understanding variation in translational efficiency across

different messages. Several causes for slow initiation have been proposed: codon bias

at the 5’ end [123, 122], secondary structure [67, 46, 62, 122, 61, 145], and gene

length [6, 69, 29]. We find that a Kozak-like initiation motif [65] and lack of structure

around the start codon are predictors of TE. Overall, our experimental and analytical

results provide support to a previously proposed model in which initiation is rate-

limiting in physiological conditions [13], in which initiation rate is affected largely

by mRNA sequence features, and where translational efficiency is not significantly

affected by codon usage [5, 13]. In contrast with experiments in non-physiological

conditions, our results endorse the resulting explanation that, in endogenous con-

ditions, perhaps in combination with other pressures, selection for efficient use of

ribosomes and associated factors in the synthesis of highly-translated proteins is a

potential driver of the observed codon usage biases.

This work was conducted in collaboration with Silvi Rouskin and Jonathan S.


Weissman at University of California, San Francisco (for the ribosome profiling ex-

periments) and Lu Han and Eric M. Phizicky at the University of Rochester Medical

Center (for the aminoacylation experiments). The computational methods and anal-

yses were conducted by myself under my advisor Daphne Koller.

3.2 Results

3.2.1 Queuing Model for Elongation

To extract high-quality estimates of protein synthesis rates and codon translation

rates from the ribosome footprint data, we model the process of ribosome flow, using

gene- and codon-dependent parameters, and the physical sampling that occurs in the

experimental protocol from which these data are derived. Our design choices are

motivated by potential biases in the data including sparse footprint counts for low

abundance genes, biases due to the position along the mRNA, and biases due to the

identity of the mRNA.

Our model inputs are the set of ribosome footprint counts d at each codon in the

genome, sparsely sampled (due to sequencing depth) from an (unobserved) steady-

state distribution π. In particular, dmk is the observed footprint count at position k in

mRNA message m and πmk encodes the fraction of ribosomes at (m, k). Consequently,

the distribution must satisfy flow conservation constraints: if ribosomes do not fall

off the message, then due to conservation of matter, the protein synthesis rate Jm for

message m (the ribosome flow out of the stop codon) must be the same as the flow

Jmk from any position k on m. If we define µmk as the dwell time of the ribosome at

(m, k), flow conservation also implies that rapidly translating positions (small µmk)

are occupied for a smaller fraction of time (small πmk) than positions that are slow

to translate. The dwell time µmk is the inverse of the rate at which the ribosome


[Figure 3.1 diagram: ribosome footprint density profile along RNA positions of a message, with counts dm1, ..., dm5, flow Jm entering and leaving the message, and dwell times µm1 < µm2 > µm3 > µm4 < µm5. Legend: m = gene; k = position on gene; dmk = ribosome footprint count at (m, k); Jm = flow per m; µmk = dwell time at (m, k). Count at position = flow × dwell at position: dmk = Jm µmk.]

Figure 3.1: Model of protein synthesis. Ribosomes initiate translation with a protein synthesis rate or flow (J) of ribosomes. This is conserved across the strand, so that at each residue (m, k) the flow depends on the dwell time of the ribosome (µ) and the ribosome occupancy (proportional to footprint count d). Slower positions, for example (m, 2) compared to (m, 1), can inflate the average footprint count per gene and must be accounted for when estimating flow. Dwell times and flow are correlated with local and global cis-features.

elongates off of position (m, k) and so intuitively depends on the amount of time the

ribosome requires to perform one elongation step (recruit tRNA, form the peptide

bond, and translocate). Thus, at steady-state, flow Jmk is proportional (up to a

constant encoding the number of ribosomes in the system) to πmk/µmk, where we use

dmk throughout as our observed proxy for πmk. Figure 3.1 shows the relationship

between the variables.
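The consequence of flow conservation for the naive per-gene average can be spelled out in a short derivation. Two symbols are assumptions introduced here for illustration: $\kappa$ is a proportionality constant absorbing sequencing depth and the number of ribosomes in the system, and $L_m$ is the number of codon positions in message $m$.

```latex
\begin{align*}
J_{mk} &\propto \frac{\pi_{mk}}{\mu_{mk}}, \qquad J_{mk} = J_m \ \text{for all } k
  && \text{(flow conservation)} \\
\Rightarrow\quad \pi_{mk} &\propto J_m\,\mu_{mk}, \qquad d_{mk} \approx \kappa\,J_m\,\mu_{mk} \\
\Rightarrow\quad \frac{1}{L_m}\sum_{k \in m} d_{mk} &\approx
  \kappa\,J_m \cdot \frac{1}{L_m}\sum_{k \in m} \mu_{mk}
\end{align*}
```

The average footprint count therefore confounds the flow $J_m$ with the mean dwell time on the message, so two genes with equal flow but different codon content would receive different naive estimates.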

We use the counts {d} to estimate the quantities {µmk} and {Jm} in a novel

probabilistic regression accounting for flow conservation and assuming steady-state


and no ribosome fall-off. Briefly, we optimize over two terms:

$$\max_{\mu_{mc},\,\mu_c}\;\log \prod_{m,k} \mu_{mc}^{\,d_{mk}/J_m}\,\exp(-\mu_{mc}) \;-\; \Big[\sum_{m,c} w_{mc}\,\big(\log \mu_{mc} - \log \mu_c\big)^2\Big]$$

The first term is a standard likelihood term for the data, using a model encoding

flow conservation. Since a single ribosome profiling dataset does not contain enough

data to robustly infer a separate µmk for each (m, k), we use the same dwell time

µmc for every occurrence of the same codon c within message m, making µmc an

expected dwell time for codon c on message m. The second term additionally softly

constrains µmc to be similar to a global codon dwell µc, based on the intuition that

the same codon behaves similarly throughout the cell. To optimize the objective, we

(1) estimate the dwell times µmc and µc with flow Jm fixed and (2) set flow Jm to be

the average of the flows Jmk (namely, the dwell-corrected footprint counts dmk/µmk)

across each message: $J_m = \frac{1}{L_m}\sum_{k \in m} d_{mk}/\mu_{mk}$ (see Materials and Methods for details).
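The alternating scheme can be sketched in a simplified form. This is not the objective above: the likelihood step is replaced by a plain average of d_mk / J_m, and the log-scale penalty is replaced by linear shrinkage toward the global codon dwell, with `shrink` playing the role of a uniform weight. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def fit_flow_and_dwell(genes, n_iter=100, shrink=1.0, eps=1e-9):
    """genes maps a message name to a list of (codon, footprint_count).

    Simplified alternating scheme:
      1. with J fixed, estimate mu_mc as the mean of d_mk / J_m over the
         occurrences of codon c in message m, shrunk toward a global mu_c;
      2. with mu fixed, set J_m to the mean of d_mk / mu_mk.
    J and mu are only identified up to a common factor, so the global dwell
    times are rescaled to average 1 in every iteration.
    """
    J = {m: max(sum(d for _, d in pos) / len(pos), eps) for m, pos in genes.items()}
    mu_c, mu_mc = {}, {}
    for _ in range(n_iter):
        per_msg = {}                    # (m, c) -> list of d_mk / J_m
        per_codon = defaultdict(list)   # c -> same ratios pooled over messages
        for m, pos in genes.items():
            for c, d in pos:
                per_msg.setdefault((m, c), []).append(d / J[m])
                per_codon[c].append(d / J[m])
        mu_c = {c: sum(v) / len(v) for c, v in per_codon.items()}
        scale = len(mu_c) / max(sum(mu_c.values()), eps)
        mu_c = {c: max(v * scale, eps) for c, v in mu_c.items()}
        # `shrink` acts like that many pseudo-observations located at mu_c.
        mu_mc = {
            (m, c): max((scale * sum(v) + shrink * mu_c[c]) / (len(v) + shrink), eps)
            for (m, c), v in per_msg.items()
        }
        J = {
            m: max(sum(d / mu_mc[(m, c)] for c, d in pos) / len(pos), eps)
            for m, pos in genes.items()
        }
    return J, mu_mc, mu_c

# Noise-free toy data: codon B is three times slower than A; g2 has 2.5x the flow of g1.
genes = {"g1": [("A", 2), ("A", 2), ("B", 6)],
         "g2": [("A", 5), ("B", 15), ("B", 15)]}
J, mu_mc, mu_c = fit_flow_and_dwell(genes)
print(J, mu_c)
```

On this toy input the naive averages (10/3 and 35/3) give a flow ratio of 3.5 because g2 contains more slow codons, whereas the dwell-corrected estimates recover the ratio of 2.5 used to generate the counts.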

We ran our model on a ribosome profiling dataset gathered for Saccharomyces

cerevisiae in rich medium, using a flash-freezing technique as described before [56]

(see Materials and Methods). To verify the validity of our estimated parameters, we

compared our protein synthesis rate Jm to two external measures of protein abundance:

GFP-based levels from Newman et al. [88] and mass-spectrometry-based levels

from de Godoy et al. [26], and obtained strong correlations (Pearson r = 0.789 and 0.680,

respectively, p = 0). These improve on the protein abundance estimates from Ingolia et

al. [57], computed as the simple average of (uncorrected) footprint counts per message

(Figure A.2). While correlation with these standard estimates of protein abundance

is reassuring, these methods have general limitations such as ascertainment bias for

less abundant proteins as well as technical limitations such as the impact of fusion

tags on protein levels. In addition, ribosome profiling measures translation and pro-

tein synthesis, but steady-state protein abundance is also affected by rates of protein


degradation.

While the protein synthesis flux is perhaps the most obvious interesting quantity

that can be extracted from profiling data, we can also derive other quantities of inter-

est from our learned model parameters. We compute translation efficiency TEm of a

given mRNA molecule m by dividing protein synthesis rate Jm by mRNA transcript

levels Mm, derived from mRNA fragment data collected separately in the ribosome

footprinting experiment. We can identify codon-dependent effects on translation from

differences in µc. By looking at footprint count deviation from expected dwell time at

each (m, k), we can also examine differences among codons on the same message. In

the following sections, using the parameters estimated under our robust probabilistic

framework, we perform a comprehensive analysis of the biological factors influencing

local and global dynamics of translation.
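The translation efficiency computation is a per-message ratio. A minimal sketch follows, assuming the synthesis rates and mRNA levels are held in dictionaries keyed by message name; the function name, the log option, and the choice to skip non-positive values are assumptions.

```python
import math

def translation_efficiency(J, M, log=True):
    """TE_m = J_m / M_m for messages present in both inputs.

    Messages with non-positive J_m or M_m are skipped, since the ratio
    (and its logarithm) is undefined or uninformative there.
    """
    te = {}
    for m in J.keys() & M.keys():
        if J[m] > 0 and M[m] > 0:
            ratio = J[m] / M[m]
            te[m] = math.log(ratio) if log else ratio
    return te

# Hypothetical values: message "b" has no detectable mRNA and is dropped.
te = translation_efficiency({"a": 4.0, "b": 1.0}, {"a": 2.0, "b": 0.0}, log=False)
print(te)
```

Working on the log scale makes fold changes in either direction symmetric, which is convenient when comparing TE between samples.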

3.2.2 Codon Translation and tRNA Manipulation

A number of studies in Escherichia coli initially identified codon usage and the avail-

ability of tRNA as the dominant force for codon translation rate [126, 110]. Later

studies found no correlation between measured rates and tRNA abundance or codon

frequency [12, 25, 111]. However, all of these studies measured translation speed indi-

rectly, on individual and potentially idiosyncratic reporter systems. We explore these

competing hypotheses in the physiological conditions of our yeast data set. If tRNA

abundance were rate-limiting for elongation, we would expect a positive correlation

between codon translation rate and tRNA abundance. However, as shown in Figure

3.2, the correlation is not significant (Spearman r = 0.144, p = 0.380 for Cy5 and r

= 0.133, p = 0.417 for Cy3 from microarray tRNA measurements [32]). A similar

result (r = 0.210, p = 0.104) is also obtained when comparing to tAI, a measure

of codon bias based on tRNA gene copy number relative to the overall collection of

isoacceptor tRNAs [34]. If we restrict the analysis to the slowest synonymous codon


(in terms of tAI), to the fastest, or to the average per amino acid, the correlation to

tAI does not improve: r = -0.12 (p = 0.61), r = -0.29 (p = 0.22), and r = -0.32

(p = 0.18), respectively. Finally, the same insignificant correlation exists in the raw

footprint data (r = 0.112, p = 0.392; baseline method for rate described in Materials

and Methods) and was also observed in another analysis of the yeast data set from

Ingolia et al [57], in which codon dwell time was estimated as the ratio of observed

codon frequencies in the footprint data relative to expected codon frequencies in the

mRNA fragment data [98].
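The rank correlations reported above are Spearman coefficients, that is, Pearson correlations computed on ranks. The sketch below shows the computation in plain Python with average ranks for ties; it does not compute p-values, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Any monotone increasing relationship gives a coefficient of 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only ranks enter the calculation, the coefficient is insensitive to the scaling applied to the rates or to the tRNA measurements.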

Our analysis of elongation rates on endogenous mRNAs in the context of the co-

adapted cellular tRNA pool addresses the effects of codon usage in natural physiology,

but may be confounded by this co-adaptation and cannot directly test the causal

links between various correlated mRNA features. To measure the effect of tRNA

abundance on codon translation rate directly, we created three mutant yeast strains

to test whether (1) tRNA over-expression speeds up translation, (2) the tRNA body

itself causes the tRNA-dependent rate effect observed in other studies, or (3) depletion

of tRNA slows down ribosomes. In our first mutant, AGG-OE, the tRNA recognizing

AGG (namely, tRNAArg(CCU)) was over-expressed on a high-copy plasmid; in mutant

AGG-QC, the body sequence of the tRNA recognizing AGG was swapped with the

body of a more preferred tRNA (as measured by tAI); and in mutant ACA-K, 3

out of 4 copies of the tRNA recognizing ACA were deleted from the genome. The

AGG mutants had a URA marker and were compared against a wild-type sample

with a URA plasmid (see Materials and Methods). For ACA-K, we checked that

the abundance of the tRNA for ACA (namely, tRNAThr(UGU)) did decrease to about

30% of wild-type (Table A.2). In the AGG-OE mutant, we measured the amount of

total and aminoacylated tRNA for tRNAArg(CCU) (see Materials and Methods) and

verified that the tRNA was over-expressed by 13.8-fold (+/- 0.4), based on an analysis

of two independently derived RNA samples, and remained charged at a level similar


[Figure 3.2 panels: tAI versus codon translation rate (r=0.210, p=0.104); tRNA abundance (1000s) versus codon translation rate (Cy5: r=0.144, p=0.380; Cy3: r=0.133, p=0.417).]

Figure 3.2: Correlation between codon translation rates and measures of codon usage bias. Left: Insignificant Spearman correlation between estimated codon translation rates (scaled up by a factor of 1000) and tRNA abundance from microarray measurements using either fluorophore Cy3 or Cy5 [32] on 39 codons with measured levels. Right: The same correlation but to tAI is also not significant.


to wild-type (87%) (Figure A.2). For the AGG-QC mutant, we similarly verified

that the amount of charged tRNAArg(CCU) was similar to wild-type (Figure A.2).

We generated ribosome profiling data and ran our model on these mutants to test

whether AGG codons are translated faster in AGG-OE and AGG-QC and whether

ACA codons are translated slower in ACA-K. We observe no significant change in the

elongation rates of the affected codon in any of the three mutants compared to wild-type

(Figures 3.3 and A.2); the overall correlation between ACA-K and wild-type is not

as tight as for other mutants, but this is due to changes affecting all codons, not only

ACA. We verified the result by inspecting the footprint counts at the perturbed codon

relative to adjacent counts in the mutants compared to wild-type and saw no unusual

increase or decrease (Figure A.2). One prevailing hypothesis [133] is that the amount

of charged as opposed to total tRNA is the true predictor of codon elongation; our

measurements of aminoacylated tRNA suggest that these levels were manipulated as

expected and that this is not a confounding factor in the mutant samples. Hence, our

results suggest that several-fold changes in tRNA abundance do not affect ribosome

dwell time.

3.2.3 Translation Efficiency and tRNA Manipulation

One of the major goals of codon optimization in biotechnology is an increase in protein

yield. Studies done on transgenes expressed at a large fraction of cellular mRNA

abundance report increased protein abundance when the mRNA was optimized for

codon bias [47, 71, 14], suggesting that codon usage contributes to efficiency [118,

121]. However, other studies observed that optimizing codon adaptation of a reporter

does not significantly improve TE or protein yield [137, 67, 133, 50, 72, 107]. Our

experiments likewise provide support for the view that the TE of endogenous mRNAs

is unchanged by effective codon optimization achieved by changes in the tRNA pool

(Figure 3.4). We find that increasing tRNA abundance or replacing the tRNA body


[Figure 3.3 panels: codon translation rate in AGG-OE (r=0.99, p=2e−55), AGG-QC (r=1.00, p=2e−62), and ACA-K (r=0.91, p=1e−24), each plotted against the wild-type rate.]

Figure 3.3: Comparison between codon translation rates in wild-type and mutants. Correlation between estimated codon translation rates in wild-type versus mutant for the three mutant samples (the manipulated codon is highlighted in red). Rates are normalized by the minimum one in each sample. Pearson correlations are nearly exact, indicating that the mutant rates are generally unaffected.


sequence by one with higher tAI does not improve efficiency: most genes remain

unchanged in TE between the wild-type and mutant samples (Pearson r = 0.96 for

AGG-OE and r = 0.95 for AGG-QC). Further, the top 200 genes that do deviate

most in TE relative to the wild-type sample have mutant TE that is both lower

(reduced TE genes) and higher (increased TE genes) compared to wild-type, with

bias towards reduced TE genes (123 reduced vs 77 increased for AGG-OE and 133 vs

67 for AGG-QC). In AGG-OE, we observe no correlation between the fraction of AGG

codons per message and the change between mutant and wild-type TE (Spearman

r = 0.00002, p = 0.99); we would expect a positive correlation if increasing tRNA

abundance increased TE. Further, despite the many-fold overexpression of tRNA,

the correlation between TE and fraction of codon per message for AGG is not higher

than the correlation for any of the other codons (Figure 3.4). AGG-QC behaves

similarly, such that manipulating the tRNA to be “faster” does not lead to a scenario

where AGG outperforms other codons in affecting translation efficiency. Finally, these

observations also hold if we look at protein synthesis rates instead of TE (Figure A.2).
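The reduced-versus-increased tally among the most deviating genes can be sketched as follows. How deviation is ranked is not specified in this section, so the absolute log ratio of mutant to wild-type TE is an assumption, as are the function name and the toy values.

```python
import math

def split_top_deviating(te_wt, te_mut, top_n=200):
    """Rank genes by |log(TE_mut / TE_wt)| and count, among the top_n,
    how many have reduced versus increased TE in the mutant."""
    shared = [g for g in te_wt if g in te_mut and te_wt[g] > 0 and te_mut[g] > 0]
    change = {g: math.log(te_mut[g] / te_wt[g]) for g in shared}
    top = sorted(shared, key=lambda g: abs(change[g]), reverse=True)[:top_n]
    reduced = sum(1 for g in top if change[g] < 0)
    increased = sum(1 for g in top if change[g] > 0)
    return reduced, increased

# Hypothetical TE values (linear scale) for four genes.
te_wt = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
te_mut = {"a": 0.1, "b": 4.0, "c": 1.1, "d": 0.5}
print(split_top_deviating(te_wt, te_mut, top_n=3))
```

With `top_n=200` on genome-wide data, the two counts correspond to pairs such as the 123 reduced versus 77 increased genes reported for AGG-OE.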

While improving codon optimization by changes in tRNA structure or abundance

does not seem to causally affect TE, we do see evidence for a modest impact from

tRNA depletion (Figure 3.4). Mutant and wild-type TEs are generally correlated in

the ACA-K mutant (Pearson r = 0.96). Although there are more reduced TE genes

than increased TE genes (127 versus 73), this difference is not significant via a per-

mutation test (see Materials and Methods). However, we find a negative correlation,

the lowest of all codons, between the fraction of ACA codons per message and the

change in TE between mutant and wild-type (Spearman r = -0.08, p < 10^-8), as

we would expect if decreasing tRNA abundance decreases TE through a direct effect

on its cognate codon. One explanation is that tRNA reduction could compromise

TE if the demand is higher than the supply: the number of ACA occurrences in the

genome is about the average number of occurrences over all codons, but we reduced


[Figure 3.4 plots. Left: scatter plots of log(TE-wt) against log(TE-ACA-K) (73 increased, 127 reduced, r = 0.96), log(TE-AGG-OE) (77 increased, 123 reduced, r = 0.96), and log(TE-AGG-QC) (67 increased, 133 reduced, r = 0.95). Right: per-codon correlation between log(TE-mut/TE-wt) and % codon per gene for the ACA-K, AGG-OE, and AGG-QC mutants.]

Figure 3.4: Comparison between translation efficiency in wild-type and mutants. Left: Wild-type TE compared to mutant TE for the three mutant samples. Strong Spearman correlations shown suggest TE is generally unaffected by tRNA manipulation. Right: Spearman correlation, for each codon, between the ratio of mutant TE to wild-type TE and the percent of codon per gene. Significant correlations are shown as filled dots. For AGG mutants, the correlation is not higher for the manipulated codon (highlighted) than for other codons, indicating that optimizing codon usage does not affect TE. For ACA-K, the correlation is negative for the ACA codon, suggesting a mild effect.


its levels below those of any other tRNA. However, if protein synthesis and thus TE

are controlled by initiation, this implies some feedback from slowed elongation on ini-

tiation, whereby affected ACA codons might stack ribosomes. In particular, reduced

TE genes compared to increased TE genes have slower-than-expected codons closer

to the 5’ end and stronger pausing in the first 100 codons (Figure A.2; significant un-

der Kolmogorov-Smirnov test; see next section for definition of slower-than-expected

codons as “outliers”). These confounding factors might contribute to the decrease

in TE for ACA-heavy genes. Alternatively, ribosome stacking at ACA codons could

induce fall-off and reduced processivity that manifests as decreased TE.

To situate our results in the context of many previous studies on codon bias and

tRNA abundance, we note that our observation focuses on endogenous messages with

physiological or near-physiological tRNA levels. When the tRNA pool is limited

compared to the number of free ribosomes, as in strong overexpression of transgenes,

simulations indeed show that large demand for tRNAs can be rate-limiting [22, 23,

107]. Experiments showing rate-limiting effects of tRNA abundance likely operated in

this non-physiological regime. In addition, manipulation of codon usage rather than

the tRNA abundance can perturb mRNA structure and other non-coding sequence

features; our experiment is less susceptible to those issues.

3.2.4 Factors Correlating with Elongation Efficiency

The notably modest effect of dramatic changes to the tRNA pool motivates the ques-

tion: what signals do affect elongation efficiency and translation efficiency? We first

take advantage of the ribosome profiling data to understand elongation efficiency (the

time for a ribosome to finish translating a transcript once initiated) by studying

rate-limiting elongation signals via inspection of outliers in the footprint counts. Based

on the observed footprint counts and our model parameters for expected codon dwell

time, we define slow outliers and fast outliers at each position k along a message m as


positions where ribosomes are stalled more or less than expected, respectively. We de-

note their deviation from expected dwell time as outlier strength ∆mk (see Materials

and Methods). We considered a broad array of potential correlates of ∆mk, based on

literature hypothesizing their association with variation in codon translation rate or

pausing, classified into eight categories (Table A.2): position on message, structure in

downstream windows, protein folding, wobble basepairs, reuse of tRNAs from nearby

codons, downstream RNA binding protein motifs, nascent peptide effects, and global

features. Table A.2 shows these correlations, which include significant features in the

position, structure, wobble, and nascent peptide categories. We discuss these below

and in Appendix A.

The strongest correlation to outlier strength for slow outliers is proximity to the 5’

end, with larger pauses occurring closer to the beginning of a message, even relative to

gene length or even when aligned by stop codon as opposed to start codon (position

from 5’ correlates to ∆mk with Spearman r = -0.043; position from 5’ per length

with r = -0.144; and position from 3’ end with r = 0.162, p ≈ 0 for all). Similar

observations of increased ribosome occupancy at the 5’ end have produced various

hypotheses for the causal basis. In the “ramp” model [123], the presence of more slow

codons (low tAI) at the beginning of a message is thought to separate ribosomes early

to avoid the wasteful expenditure of resources on stacked, idling ribosomes. However,

we observe a correlation between position from 5’ end and slow outlier strength even

when conditioning on the codon (Figure 3.2.4), and thereby controlling for differences

in codon usage at different positions within the gene, suggesting that there is an

initial low translation speed, regardless of codon usage, which gradually increases as

translation proceeds. Additionally, our model helps account for length, position, and

abundance biases when calculating outliers in a particular message in two ways: first,

we include message-specific codon dwell times, and, second, we exclude the first 100

codons from each gene during model learning (see Materials and Methods) to avoid


inflating or otherwise biasing the expected rates µmc and µc. Our analysis indicates

that pausing occurs at the 5’ end, even after accounting for major factors such as

codon bias and gene length.
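
The conditioning on codon described above can be sketched as follows: outlier positions are grouped by the codon at that position, and within each group the Spearman correlation between relative position k/Lm and outlier strength is computed. This is a minimal illustration under stated assumptions: the input tuples, the toy values, the min_points threshold, and the helper names are hypothetical, and the outlier strengths are taken as given rather than derived from the model.

```python
from collections import defaultdict

def _ranks(v):
    # Average 1-based ranks; tied values share the mean rank.
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def _spearman(x, y):
    # Pearson correlation of the ranks; undefined if either input is constant.
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def per_codon_correlation(outliers, min_points=3):
    """outliers: iterable of (codon, k, gene_length_in_codons, strength).

    Returns codon -> Spearman r between k / length and strength, using only
    the positions that carry that codon. Codons with fewer than `min_points`
    outliers are skipped (an assumption made here to keep r meaningful).
    """
    groups = defaultdict(lambda: ([], []))
    for codon, k, length, strength in outliers:
        groups[codon][0].append(k / length)
        groups[codon][1].append(strength)
    return {c: _spearman(pos, s)
            for c, (pos, s) in groups.items() if len(pos) >= min_points}

# Hypothetical slow outliers: strength decreases away from the 5' end.
toy = [
    ("AAA", 5, 100, 3.0), ("AAA", 40, 100, 2.0), ("AAA", 90, 100, 1.0),
    ("GCT", 10, 200, 1.5), ("GCT", 100, 200, 1.0), ("GCT", 180, 200, 0.5),
    ("CGA", 7, 50, 2.0),
]
result = per_codon_correlation(toy)
```

Because each correlation is computed within a single codon, differences in codon usage along the gene cannot by themselves produce a negative value.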

Other explanatory signals have been suggested for pausing in ribosome profiling

datasets [113, 73, 17]. Our analysis shows a (mild) correlation between pausing and

computationally predicted downstream mRNA secondary structure (Spearman r =

0.021, p ≈ 0 with structure measured by the density of stems). This correlation

is reproduced when considering experimentally derived in vivo structure data from

high-throughput DMS probing of unpaired A and C bases [105] (r = -0.033). It

is also maintained when we restrict our analysis to slow outliers in the first 100

codons (r = 0.015 for density of stems and a similarly reduced r = -0.026 for in

vivo energy, potentially due to genes with short UTRs and the decreased reliability of

DMS structure probing data at ≈20nt or less from the 5’ end), and so the effect is not

necessarily caused by structure elsewhere on the strand. Single molecule experiments

with bacterial ribosomes [19] found that some hairpin and pseudoknot constructs

at varying distances downstream of the active codon can slow down the ribosome;

structural energy could therefore potentially contribute to the excess ribosome density

at the 5’ end. We also see a positive correlation on that same order of magnitude

between slow outliers and the number of proline codons in the two sites upstream of

the active codon (r = 0.069, p ≈ 0), as observed in other organisms [58, 136]. Two

correlations that we observed are not expected on the basis of previous studies. A

study showing pausing specifically at CGA [72] suggests slower elongation on wobble

base pairs, whereas we observe the opposite correlation; this discrepancy might arise

because the wobble effect is limited to a few specific codons, or to repeated wobble

codons, or because of an incomplete characterization of codon / anticodon pairings

which limits our assignment of wobble decoding. The correlation to charge observed

by Charneski & Hurst [17] holds in sign but not in significance even when considering


[Figure 3.5 plot. Title: Correlation per codon between outlier strength and position per length from 5’ end for slow outliers. x-axis: tAI (0 to 1); y-axis: Spearman r (−0.25 to 0).]

Figure 3.5: All codons show negative correlation between outlier strength and proximity to gene start. Correlation between slow outlier strength and position per length from 5’ end, conditioned by the codon, plotted against codon tAI. For each codon c, we calculate the Spearman correlation for outlier strength ∆mk and position per length from 5’ end (k/Lm) but restricted to the (m, k) that satisfy codon(m, k) = c. All codons except one (hollow circle), which has the lowest abundance in the genome, have a significant negative correlation. This indicates that 5’ end outliers are slower even independent of codon bias.


the number of Arg and Lys residues in a window upstream of the active codon,

although this result was later attributed to technical artifacts relating to the strand

orientation [18].

3.2.5 Factors Correlating with Translation Efficiency

While elongation efficiency measures time required to synthesize a new protein, trans-

lation efficiency measures the throughput of protein synthesis. Besides codon adap-

tation, which we find to play little or no causal role in improving efficiency, other

significant correlates to TE include structural features and the sequence motif around

the start codon (Figure A.2).

Structure is reduced near the translation start site in many organisms [46, 143]

and, in combination with specific structural motifs downstream, can promote or halt

initiation [66, 63, 104]. We performed a sliding window analysis (see Materials and

Methods and Figure 3.2.5) to correlate TE with RNA secondary structure in 40nt

windows along the gene, for both experimental in vitro and in vivo structural en-

ergy [105]. The window near the start codon is most significant, as reported previ-

ously for computational and in vitro structure measurements [67, 62, 121, 61]; the

positive correlation indicates that increased TE corresponds to loose structure in this

region. Indeed, this is also the window with highest energy, corresponding to the

lowest structure, as averaged over all genes (first red line in Figure 3.2.5). Interest-

ingly, the correlation to TE for in vivo structure is less pronounced and the window

is shifted 3 codons downstream. We call this Window A.

Our attention was also drawn to the window downstream of the start codon at

∼60nt in vitro and ∼80nt in vivo (second red line in Figure 3.2.5) with the lowest

energy (more structure) compared to neighboring positions. We call this Window

B. The most likely role for this energy barrier seems to be a stalling mechanism.

Ribosome density is high nearby: at 132nt (approximately two to three ribosome


[Figure 3.6 plots. Left: DMS in vitro energy and DMS in vivo energy against position [nt] (0 to 250); higher energy means less structure. Right: correlation log(TE) ~ DMS in vitro energy and log(TE) ~ DMS in vivo energy against position [nt] (0 to 250); positive values mean less structure ~ high TE; points are marked as significant or not significant.]

Figure 3.6: RNA structure energy and its relationship to translation efficiency. Left: Energy averaged in sliding windows of 40nt (see Materials and Methods) across all genes for in vitro and in vivo measures of energy via DMS probing [105]. The second red line corresponds to the first window with lowest energy (≈60nt for in vitro and ≈80nt in vivo). Right: Spearman correlation between the energy windows and TE. The first red line corresponds to the first window with significant correlation (9nt for in vitro and 18nt for in vivo).


footprints downstream), our model-estimated ribosome density has a notable peak

that is reduced when we exclude outliers, which capture positions where sufficient

pausing could stack ribosomes (Figure A.2). Although properly placed downstream

structure can improve the efficiency of initiation by stalling the scanning pre-initiation

complex [104], or might be selected for heavy structure in order to prevent other

regions (namely, around the start codon) from being paired, the lack of significant

correlation to TE for Window B suggests that ribosome flow control here optimizes

other aspects of translation besides throughput.

In addition to low structure at the start codon, initiation may be assisted by

recognition of a 12-mer motif around the start codon called the Kozak sequence in

eukaryotes [65], derived in yeast based on a sequence consensus from highly expressed

genes by Hamilton et al. [49]. As expected, due to a tight correlation between mRNA

abundance and TE (Figure A.2), similarity to the Kozak motif correlates strongly

to TE (Spearman r = -0.21, p < 10^-45), measuring similarity by Kullback-Leibler

divergence to the position-weight matrix, where lower divergence means a closer match.

The 3rd nucleotide preceding AUG is the most significant (Spearman r = -0.17,

p < 10^-29), consistent with experimental measures of initiation efficiency after modifying

positions in the Kozak site [138, 75]. Using a linear regression model for predicting

TE based on a set of correlates suggested in literature (see Materials and Methods),

we learn a refined Kozak motif to reflect highly efficient genes (Figure 3.2.5). Our

learned Kozak motif reduces the error of our regression model predictions relative to

an equivalent model using the original motif (from 0.84 to 0.75, averaged over 100

test sets selected randomly, compared to a null model error of 0.97) (Table A.2). This

indicates that our refined motif better corresponds to highly translated genes, likely

because it was trained directly on translation efficiency measurements rather than on

a proxy such as mRNA abundance.
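
One plausible reading of the Kozak similarity score above can be sketched in Python. For a single 12-mer, each position is a one-hot distribution over bases, so the Kullback-Leibler divergence to a position-weight matrix column reduces to the negative log probability of the observed base, summed over positions. The toy matrix below is built from the consensus WAMAMAATGTCY using IUPAC codes and an arbitrary pseudocount; that construction, the pseudocount value, and the function names are assumptions made for illustration, and the actual matrix from [49] is not reproduced here.

```python
import math

# IUPAC nucleotide codes used in the consensus (W = A/T, M = A/C, Y = C/T).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "M": "AC", "Y": "CT"}

def pwm_from_consensus(consensus, pseudo=0.05):
    """Toy position-weight matrix: every base gets `pseudo`, and the allowed
    bases share the remaining probability mass equally (illustrative only)."""
    pwm = []
    for sym in consensus:
        allowed = IUPAC[sym]
        col = {b: pseudo for b in "ACGT"}
        for b in allowed:
            col[b] += (1 - 4 * pseudo) / len(allowed)
        pwm.append(col)
    return pwm

def kl_to_pwm(site, pwm):
    """Sum over positions of KL(one-hot(observed base) || PWM column),
    which equals -log p(observed base). Lower means a closer match."""
    return sum(-math.log(col[b]) for b, col in zip(site, pwm))

pwm = pwm_from_consensus("WAMAMAATGTCY")
good = kl_to_pwm("AAAAAAATGTCT", pwm)   # matches the consensus at every position
bad = kl_to_pwm("GGGGGGATGGGG", pwm)    # matches only the ATG
```

Under this reading, a negative correlation between divergence and TE corresponds to a positive association between motif similarity and TE.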

Finally, we tested the correlation between translation efficiency and other mRNA


[Figure 3.7 plot. Sequence motif; x-axis: Position (1 to 12); y-axis: Probability (0 to 1).]

Figure 3.7: Estimated Kozak motif for efficient genes. Estimated TE-driven Kozak motif based on a regression model (see Materials and Methods). The original Kozak consensus for yeast [49] is WAMAMAATGTCY.

features often discussed in literature (Figure A.2). We find a negative correlation to

evolutionary rate that is suggestive of the intuitive fact that more conserved genes

are more highly translated. The positive correlation we find with mRNA abundance

suggests a model of co-expression where the need for high protein abundance drives

high translation of abundant transcripts. Consistent with previous studies [57], we

observe a very small negative correlation to length. We also find a positive correlation

(although weaker than that for tAI) to the codon translation rates geometrically

averaged over the codons within a gene. Lastly, RNA-binding proteins (RBPs) have

recently received attention for their roles in post-transcriptional regulation, and we also

see high Spearman correlations between RBP occupancy and TE. When looking at

enrichment of 15 proteins, we find the expected correlation to translation efficiency (as

suggested by literature) in eight of ten cases. One of the two “unexpected” proteins,

scp160, was recently reported to be required for translational efficiency of particular


mRNAs in yeast [52], even though it correlates negatively to ribosome occupancy in

Hogan et al. [54]; our analysis encouragingly suggests the former correlation. Appendix

A has further discussion.

3.3 Discussion

In this section, we presented a statistical model to extract codon translation rates

and protein synthesis levels from ribosome profiling data. Our model is designed to

account for the complexities of ribosome profiling data while keeping parameter esti-

mation tractable. Although average footprint density on a gene is well correlated to

protein abundance, outliers can pull the estimate provided by the mean away from the

true level, especially when ribosome stacking is common. Thus, properly accounting

for differential elongation rates can improve inference of protein synthesis levels from

this data. We maintain a simple translation model (for example, we do not explicitly

include a rate of ribosome falloff or an analytical treatment of codons being processed

in series), but our design choices favor model simplicity, algorithmic stability,

and smoothing of noisy data. Using one model parameter for all codon instances in a

gene, as opposed to an individual dwell per position, has several advantages: it aver-

ages out sequence biases in footprint fragments, makes the optimization algorithm less

susceptible to local minima and hence robust to parameter initialization, and allows

us to infer parameters even for low abundance genes by offsetting the lack of data

with soft prior constraints. We reassuringly find qualitatively similar results when

we replace our refined protein synthesis rates with a simple average of the footprints

per gene, while obtaining better quantitative estimates compared to existing protein

abundance datasets. More physics-based or simulation models [141, 102, 122] require

knowledge of the kinetic parameters of translation, can necessitate grossly simplify-

ing assumptions such as a single codon translation rate per gene, base certain model


quantities on a limited set of features, or directly assume that codon rate is correlated

to codon adaptation. In comparison, our method reduces the number of assumptions

made by directly modeling the experimental processing and fitting the model param-

eters to the data under the single concept of flow conservation. On the other hand,

methods that aggregate the data directly [98, 17, 43], similar to our baseline method

for calculating codon translation rates, do not readily lend themselves to computing

other quantities. For example, because we have an underlying model, detection of

outlier codon positions follows easily within our framework, whereas other works rely

on choosing an adjacent window of appropriate size to compare counts. Similarly,

we can easily study other potentially interesting effects, such as codon translation

rate variance within genes and among genes. Finally, our method would be particularly

useful in situations where ribosome profiling data is scarce or noisy. By using a

probabilistic model, we infer rates of interest from the observed, noisy data without

needing to exclude genes with sparse information. With the growing usage of ribo-

some profiling, a robust framework for studying rates of elongation and synthesis is

essential.

The robust framework of our model allows us to shed new light on causality in

regulation of translation and characterize the features associated with efficient elon-

gation and translation. Although codon usage is a strong correlate to TE (Figure

A.2), our mutant experiments suggest (via the correlation between codon bias and

tRNA abundance) that codon usage may not causally influence efficiency. The direct

impact of codon usage on efficiency and the basis of the selective force underlying

codon bias have remained topics of controversy for decades. Some authors have

proposed that codon optimization serves directly to enhance the translational efficiency

of specific genes, perhaps by speeding elongation on their mRNAs. Our work provides

direct experimental evidence against this view. Rather, our work is consistent with

an alternative model, aligned with previous results for Escherichia coli [67], in which


codon bias in highly translated genes results from selection to optimize utilization

of the translational machinery, whose abundance and production represent a major

limitation on cell growth [5, 13, 67]; this selection induces a correlation without im-

plying that increasing codon bias optimizes efficiency on individual genes [133]. In

this view, initiation is rate-limiting and thereby determines translational efficiency.

When the demand-supply balance for a tRNA is not compromised by extremely high

expression of a transgene not adapted to the host organism, we propose that selective

forces beyond the TEs of individual messages guide the distribution of codons. The

positive correlation between elongation rate and TE suggests a potential contributor,

namely, selection for efficient use of ribosomes and translation factors, and that this

selective force is strongest for high-expression, high-TE genes. Such selection pressure

is consistent with studies of overall cell growth and protein synthesis, which indicate

that the translational apparatus is rate-limiting for cell growth and that reduction in

the amount of ribosome time devoted to producing an abundant protein can speed

cell growth [5, 7, 67, 85]. As elongation rate is not the strongest correlate to TE,

other mechanisms also deserve further study. For example, there may be selective

pressures on the mRNA sequence itself (e.g., to induce certain secondary structures),

which in turn create pressure in the cell to ensure a sufficient supply of tRNAs for

efficient translation of the highly translated messages. Our results are also consistent

with the prevalent view that initiation is typically the rate-limiting step in protein

synthesis, which does not provide a clear mechanism for codon usage in the body

of a gene to affect its efficiency, and particularly not through increased elongation

rates. Instead, tRNA levels are likely forced to match the lack of disfavored codons

by selection against the cost of tRNA production or against poor decoding accuracy.

Our resulting analyses address the contributions of initiation versus elongation

to efficiency [7, 68, 107]. While efficient usage of ribosomes and elongation factors

influences the overall amount of protein produced from the whole genome, initiation


may dictate differences between genes [39]. We characterize two initiation signals

that could play a role in translation regulation via a two-stage metering-light model:

reduced structure around the start codon and favorable sequence context to promote

ribosome binding, followed by an increase in structure that could, in turn, serve

to reduce misfolding of the emergent polypeptide by allowing sufficient time for re-

cruitment of chaperones to the ribosome exit tunnel [40]. This barrier could reflect

the observed universal per-gene effect, independent of codon identity, whereby the

strengths of slow outlier positions correlate to 5’ end proximity. Since translation

is resource-heavy, requiring tRNAs, mRNAs, and ribosomes, with the latter being

especially costly to produce, we intuit that the cell must balance use of these finite

resources while at the same time producing functional protein products. Structure

around the 5’ end could be one of the key mechanisms through which the cell regulates

translation so as to avoid wasting resources.

The region of slow elongation at the 5’ end certainly merits further exploration. In

contrast to the slow-codon ramp proposed in Tuller et al. [123], our model shows that

while there may be an abundance of low tAI codons near the 5’ end, these codons do

not cause slow elongation (Figure A.2). We find (mild) correlations between pausing

and downstream structure, and between tAI and downstream structure over the first 50

codons of all genes (Spearman r = -0.0055, p = 0.01 for stem density, and insignificant

for in vitro or in vivo structure), but not between codon usage and codon translation

rate. A study performed over diverse bacteria, controlling for GC content, proposes

that structure drives codon usage early at the 5’ end [11]; in yeast, there may be

similar selection whereby structure-related constraints induce a low-tAI ramp.

The impact of secondary structure on translation is complex. In addition to a

role in initiation, high structure regions could also act by influencing elongation [19].

Outliers in the high-variance ribosome profiling data can differ from expected dwell

times by a factor of 40, and are distributed throughout the message (Figure A.10).


One explanation is the presence of downstream structural features that create an

energy barrier to elongation; these correlate (more weakly) to outlier strength when

ignoring the first 100 codons (whole gene versus truncated gene has r = -0.033 versus

r = -0.034 for downstream in vivo energy and r = 0.021 versus r = 0.010 for density

of stems), precluding the possibility that high ribosome density (based on the 5’ end

as a proxy) drives the effect. In addition, mRNA-binding factors can interact with

structure [28], but whether structure performs any common genome-wide functions

is not yet established. One possibility is that secondary structure slows the ribosome

during elongation to promote correct folding of the nascent protein during its vectorial

synthesis by the ribosome.

The significant but mild correlation to structure suggests that other factors are

important in pausing. Experiments suggest that the wobble base in CGA causes sig-

nificant pausing [72, 113], clusters of slowly translated codons could stall ribosomes

more than the sum of their individual decoding times [140], and effects from the

nascent peptide stall elongation at prolines [58, 136]. It is likely that a compendium

of biological features interact to dictate elongation rate. Although our genome-wide

outlier analysis shows promising correlations to pausing, the small magnitude of cor-

relation could be improved by looking at more restrictive or genetically meaningful

sets of positions. The growing interest in ribosome profiling poses exciting directions

for further investigation of the interactions between these features and the changes

that may occur in different conditions. With this additional data and measurements

from single-molecule experiments [134, 124], our model could be extended to include

finer-grained parameters for codon translation rates, partitioned in various ways, in

order to better understand how rate changes over a transcript. Further analysis is

also needed into how structure and the sequence around the initiation site work with

or against each other. For example, heavy structure can promote initiation in spite

of weak initiation context, but the ways in which they interact are still unknown.


3.4 Materials and Methods

Ribosome Profiling Datasets. All experiments were done on yeast strain 288C.

Cells were collected for ribosome profiling by filtering ∼250ml culture of OD = 0.6 and

immediately flash freezing on liquid nitrogen. For all ribosome-profiling experiments,

footprints were obtained as described before [56]. Three out of four copies of Threo-

nine tRNA (tT(UGU)G2, tT(UGU)H, tT(UGU)P), recognizing the ACA codon, were

knocked out using the standard technique of homologous recombination from a plas-

mid PCR product. The resulting strain was marked with nourseothricin, kanamycin,

and hygromycin B resistance, respectively. Successfully transformed yeast were

identified by check PCR. tRNA arginine (tR(CCU)J) recognizing the AGG codon was

overexpressed by cloning into a URA marked 2-micron plasmid (pRS426) and trans-

forming wild-type yeast using –URA selection. For the tRNA body swap, tRNA se-

quence from tR(UCU)B was mutated in the anticodon to CCU using QuikChange site-

directed mutagenesis kit (Stratagene) in order for the tRNA product from tR(UCU)B

to recognize the AGG codon. The mutated tRNA was then cloned in the 2-micron

plasmid pRS426 and transformed into 288C.

Ribosome-protected fragments were aligned against assembly R63 from the Sac-

charomyces Genome Database (SGD, http://www.yeastgenome.org) and we kept

uniquely mapped reads with no more than 2 mismatches and lengths between 28 and

31. To identify the active codon for ribosome-protected fragments, we let 0 be the

first nucleotide of the read and if the read begins on the first/last/middle nucleotide

of a codon, the active codon starts at nucleotide 15/16/17, respectively. An mRNA

fragment was mapped to a gene if it begins less than 16nt upstream of the start codon

and more than 16nt upstream of the stop codon. Genes were ignored if they did not

have an AUG start codon, had internal stop codons, had less than 50% of positions

on the coding sequence with at least one mapped mRNA count, or if all the footprint


counts were 0 over the gene length used in the translation model (see below), leaving

around 5000 genes in each sample. When comparing mutants to wild-type samples,

we used the intersection of the valid genes in each sample. The AGG mutants were

compared against the wild-type sample with a URA plasmid.
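
The active-codon rule above can be written as a small lookup. In this minimal sketch, read_start is assumed to be the 0-based position of the first nucleotide of the read relative to the first nucleotide of the start codon, so that read_start modulo 3 gives the position within a codon; that coordinate convention and the function names are assumptions for illustration. The read filter reproduces only the length and mismatch criteria stated in the text.

```python
# Offset from the first nucleotide of the read (position 0) to the first
# nucleotide of the active codon, keyed by where in a codon the read begins:
# 0 = first nucleotide, 2 = last nucleotide, 1 = middle nucleotide.
OFFSET_BY_FRAME = {0: 15, 2: 16, 1: 17}

def active_codon_index(read_start):
    """0-based index of the active codon for a read starting at `read_start`.

    `read_start` is measured from the first nucleotide of the start codon and
    may be negative for reads that begin in the 5' UTR (assumed convention).
    """
    frame = read_start % 3                      # Python's % is non-negative here
    active_nt = read_start + OFFSET_BY_FRAME[frame]
    return active_nt // 3                       # active_nt is always in frame 0

def keep_read(length, mismatches):
    """Length between 28 and 31 and at most 2 mismatches, as stated above.
    Unique mapping is assumed to be checked by the aligner."""
    return 28 <= length <= 31 and mismatches <= 2
```

Each of the three offsets lands on the first nucleotide of a codon (for example, a read starting on the middle nucleotide at position 1 maps to nucleotide 18, the start of codon 6), which is why integer division by 3 recovers the codon index.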

Analysis of tRNA Charging and Relative RNA Levels. For analysis of charging

levels of tRNAs, duplicate samples of each strain were grown under conditions

used for ribosome profiling, followed by harvesting of ≈4 OD-ml of cells. Then, bulk

RNA was prepared from each pellet under acidic conditions (pH 4.5) using glass

beads, and RNA was resolved on a 6.5% acrylamide gel at pH 5 for 15 hours at 4 °C,

transferred to Hybond N+ membrane, and hybridized with appropriate 5’-labeled

oligonucleotide probes, as described [3]. Charging levels were visualized on a Typhoon

PhosphorImager (GE Healthcare) and quantified using ImageQuant, and relative levels

of tRNA^Arg(CCU) were measured by normalization to levels of tRNA^Leu(CAA) in

the corresponding lane.

Feature Calculations. tRNA gene copy numbers were obtained from the tRNAscan-SE

database [76]. To measure codon usage bias, we use tAI, which ranges from 0 to 1,

with values near 1 for more preferred codons, calculated as in dos Reis et al. [34]

with refined weights described in Tuller et al. [123].
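
For orientation, the tAI of a gene is the geometric mean of the per-codon adaptiveness weights of its codons. The sketch below assumes the weights are already available as a dictionary; the two weight values shown are hypothetical and are not the refined weights of [123], and skipping codons without a weight (for example the stop codon) is an assumption of this illustration.

```python
import math

def gene_tai(cds, weights):
    """Geometric mean of codon weights over the in-frame codons of `cds`.

    `weights` maps codon -> relative adaptiveness in (0, 1]. Codons absent
    from `weights` are skipped (an assumption made in this sketch).
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scored = [weights[c] for c in codons if c in weights]
    return math.exp(sum(math.log(w) for w in scored) / len(scored))

# Hypothetical weights for two codons only.
toy_weights = {"AGG": 0.25, "GCT": 1.0}
tai = gene_tai("AGGGCT", toy_weights)
```

The same geometric averaging applies when per-codon weights are replaced by model-estimated codon translation rates.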

Experimentally derived structure data from DMS probing [105] was normalized in

windows of size 150nt by the minimum count in the top 5% of A and C nucleotides, and

the top 5% of counts were set to 1. Windows with less than ten A and C nucleotides

in the top 5%, windows with a zero normalization constant, genes without data,

and genes without a characterized UTR [87] were ignored in analyses. In the sliding

window energy analysis, energy windows were normalized per gene by the mean over

windows on each gene. In the energy profile, normalized windows were then averaged


across positions without missing data, aligned by start codon. In the energy-TE

correlation profile, we applied a conservative Bonferroni correction by multiplying

the p-values by the number of windows (30 upstream of the start codon and 250

downstream, since this span covered the maximum number of genes). To calculate

the location of the dip in the energy profile, we identified global minima within

spans of 90nt and took the first minimum.

The correlation between tAI and downstream energy is for tAI over windows of

3 codons in the first 50 codons of all genes and the associated average of the 40nt

energy windows 15nt downstream from each nucleotide in the tAI window. Energy

windows are calculated as above using the number of stems and DMS in vitro and in

vivo energy.

Translation Model As discussed in the main text, we optimize our objective over

the parameters µmc and µc and solve for Jm. Since individual footprint counts can

be noisy and sparse, we smooth the data in three ways. First, we use a single µmc

for every copy of codon c on message m. The dwells µcm for a specific c

over all genes m softly agree with the global µc in a weighted geometric average with

weight wcm: the number of codons c on gene m normalized by the number of codons

c over all genes. Hence, genes with more copies of codon c get a larger vote in the

average estimating µc. Second, we add a pseudo-count of 1 to all footprint counts

and use the logarithm of normalized counts in the Poisson term (similar to a more

robust geometric average as opposed to an arithmetic average that is easily skewed

by outliers), first scaling the flow-normalized counts by a single factor over all (m, k)

so that the lowest one is 1. We refer to these transformed counts as d’. Third, during

model training, we ignore the first 100 codons (or the first 25% for genes shorter than

100 codons) since this region may have unusual flow conservation properties. If it

doesn’t, excluding these codons should not affect the learned rates. We refer to these


restricted positions as k’. The second term in the objective function is multiplied by

a constant C = 100 so as to not be greatly outweighed by the data term. Altogether,

we solve the following optimization problem (where k′ is restricted and d′ are scaled

as described above):

$$\max_{\mu_{cm},\,\mu_c}\; \log \prod_{m,k'} \mu_{cm}^{\,(d'_{mk}/J_m)} \exp(-\mu_{cm}) \;-\; C\Big[\sum_{m,c} w_{cm}\,(\log \mu_{cm} - \log \mu_c)^2\Big]$$
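
The objective above can be evaluated directly for a given set of parameters. The following is a minimal sketch in Python, not the Matlab implementation used in this work. It assumes the transformed counts d′ and the flows Jm are supplied, that the weights wcm are computed over the retained positions k′, and that all container and function names are hypothetical.

```python
import math
from collections import Counter


def objective(codons, d_prime, J, mu_cm, mu_c, C=100.0):
    """Evaluate the penalized objective for given dwell-time parameters.

    codons[m]     : codons at the retained positions k' of gene m
    d_prime[m]    : transformed counts d'_mk at the same positions
    J[m]          : flow of gene m
    mu_cm[(c, m)] : per-gene dwell time of codon c on gene m
    mu_c[c]       : global dwell time of codon c
    (All names are illustrative assumptions, not from the original code.)
    """
    # Data term: log prod mu^(d'/J) exp(-mu) = sum[(d'/J) log mu - mu]
    data = 0.0
    for m, seq in codons.items():
        for c, d in zip(seq, d_prime[m]):
            mu = mu_cm[(c, m)]
            data += (d / J[m]) * math.log(mu) - mu

    # w_cm: copies of codon c on gene m divided by copies of c over all genes
    per_gene = {m: Counter(seq) for m, seq in codons.items()}
    total = Counter()
    for counts in per_gene.values():
        total.update(counts)

    # Penalty term: weighted squared distance in log space
    penalty = 0.0
    for m, counts in per_gene.items():
        for c, n in counts.items():
            w = n / total[c]
            diff = math.log(mu_cm[(c, m)]) - math.log(mu_c[c])
            penalty += w * diff ** 2

    return data - C * penalty
```

When every µcm equals its global µc the penalty vanishes and only the data term remains, which is a convenient sanity check on any implementation.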

We verified that the constant C did not affect our results by running the main

analyses again – correlations for codon bias measures, protein abundance, and outliers

– on several other values (1, 10, 1000, 10000, 100000). We note no significant change

(Table A.2), except for some outlier correlations for 100000 (stemsGC-down15 is

now not significant; cluster-ArgLys-up-1 is significant) and for 1 and 10 (internal-

down is now significant). Similar to taking the limit of the constant to infinity, we

also considered a model with only µc parameters and no µmc (and hence no second

term in the objective function) (Table A.2). Again, no extreme change exists in the

correlation between codon translation rate and codon bias measures. Perhaps because

we have removed a layer of parameters, we do see a slight decrease in correlation to

protein abundance and some changes to outlier correlations: multi-down is no longer

significant but still shows a similar correlation strength; is-in-domain is significant,

suggesting that slow outliers lie outside of protein domains, and the upstream number

of Arg/Lys codons is now significant.

The optimization algorithm is as follows: Jm is fixed to $D_m = \sum_{k \in m} d_{mk}/L_m$ and

µmc and µc are initialized to dwells from the baseline method (see below), shifted in

log space so that the mean is log(7.2), plus a small random number. The value 7.2 is

the mean over all (m, k) of the flow-normalized counts normalized and smoothed as

described above for the wild-type sample. The appropriate mean value was replaced

for each of the mutant samples. The parameters are estimated via coordinate descent


by iterating through codons c and learning the associated µmc and µc. Optimization

per c used an L-BFGS method [15] in Matlab (Matlab wrapper from http://www.cs.toronto.edu/~liam/software.shtml), with the following stopping criteria: max

number of iterations 5000; gradient tolerance 10−5; function tolerance 103. Coordinate

descent was stopped when the difference in weights was less than 5 ∗ 10−5 or the

difference in function value was less than 10−5. Codons not appearing in a particular

gene m did not have an associated µmc and we also excluded the stop codons. We

then compute $J_m = \frac{\sum_{k \in m} d_{mk}/\mu_{mk}}{L_m} = \frac{\sum_{k \in m} d_{mk}/\mu_{m,\,c=\mathrm{codon}(m,k)}}{L_m}$. The optimization is

not sensitive to initialization (Figure A.2).
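
As a small worked illustration of the two per-gene quantities used here, the sketch below computes the mean footprint density Dm (the initial value of Jm) and the flow Jm recomputed from learned dwell times. It is a hedged example in Python; the function names and the dictionary layout for the dwell times are assumptions made for illustration.

```python
def mean_density(counts):
    """D_m: footprint counts on gene m averaged over its L_m positions."""
    return sum(counts) / len(counts)


def flow(counts, codons, mu_for_gene):
    """J_m: average over positions of d_mk divided by the dwell time of the
    codon at that position (mu_for_gene maps codon -> dwell time on gene m)."""
    return sum(d / mu_for_gene[c] for d, c in zip(counts, codons)) / len(counts)
```

With all dwell times equal to 1, the flow reduces to the mean density, matching the initialization of Jm to Dm.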

Although less robust, we also optimized a model with a separate dwell time µmk for

every (m, k) with the following initialization of weights: µmk = dmk/Dm, with 0 counts

replaced by the mean of all non-zero counts, shifted in log space so that the mean is

log(7.2); µc are dwells from the baseline method (see below) shifted in log space so

that the mean is log(7.2); all weights perturbed by a small random value. The value

7.2 was chosen as above. L-BFGS settings were as above. Coordinate descent was

stopped when the difference in weights was less than 10−2 or the difference in function

value was less than 10−1. The overall codon dwell times µc were well correlated to

those in the original model (Pearson r = 0.99, p < 10−74), but analyses based on

dwell times per (m, k) could be impacted, since these parameters are more sensitive

to initialization. We therefore verified that all qualitative observations presented still hold. The

correlation between codon translation rate and codon bias measures is insignificant

(r = 0.151, p = 0.359 for Cy5; r = 0.138, p = 0.401 for Cy3; r = 0.223, p = 0.084 for

tAI). Protein abundance estimates correlate similarly to external measures (r = 0.671

for de Godoy [26] data and r = 0.778 for Newman et al [88] data, p = 0 for both). In

the outlier analysis, all correlations still hold, with a few exceptions: among the structure features, only the

densities of stems 12nt and 9nt downstream are significant (although the others are on the

same order of magnitude), the protein domain feature is significant for bases inside



a domain, and the feature for upstream number of Arg/Lys codons is significant.

Correlations between TE and gene-level features are similar except Kozak position -2

is now barely not significant, experimental in vitro energy for the mRNA sequence is

barely not significant, and Npl3 is significant (in the expected direction). The energy-

TE correlation profile is the same except the window at 18nt for in vivo energy is

barely not significant but still a peak. The ribosome density graph has the same

peak at 132nt and decreases when outliers are removed. The refined Kozak motif has

the same dominant bases except position 1 in Figure 3.2.5 has the non-dominant T

swapped with A. Finally, the error when replacing the learned Kozak motif with the

original similarly increases from 0.69 to 0.77.

Baseline Method for Codon Translation Rate To get dwell time per codon c

from the raw data, we average over counts (m, k) for which codon(m, k) = c, normalized

by the average per gene ($D_m = \sum_{k \in m} d_{mk}/L_m$). Rate is the reciprocal of dwell

time. As above, we first add a pseudo-count of one to each dmk and ignore the first

100 codons (or the first 25% for genes shorter than 100).
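
The baseline estimate can be sketched as follows. This is an illustrative Python version under stated assumptions: the per-gene average Dm is taken over the retained positions after adding the pseudo-count, the 25% cutoff is rounded down, and the input layout (a dictionary from gene to a pair of codon list and count list) is hypothetical.

```python
from collections import defaultdict


def retained_start(n_codons, n_skip=100, frac=0.25):
    """Index of the first retained codon: skip the first n_skip codons,
    or the first frac of the gene when it is shorter than n_skip."""
    if n_codons < n_skip:
        return int(frac * n_codons)  # rounding down is an assumption
    return n_skip


def baseline_dwell_times(genes, n_skip=100, frac=0.25):
    """genes maps a gene name to (codons, counts). Returns codon -> dwell."""
    sums = defaultdict(float)
    nums = defaultdict(int)
    for codons, counts in genes.values():
        start = retained_start(len(codons), n_skip, frac)
        kept_codons = codons[start:]
        kept = [d + 1 for d in counts[start:]]  # pseudo-count of one
        if not kept:
            continue  # no retained positions on this gene
        D = sum(kept) / len(kept)  # per-gene average density
        for c, d in zip(kept_codons, kept):
            sums[c] += d / D
            nums[c] += 1
    return {c: sums[c] / nums[c] for c in sums}


def baseline_rates(dwell):
    """Rate is the reciprocal of dwell time."""
    return {c: 1.0 / t for c, t in dwell.items()}
```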

Analysis of Translation Efficiency in Mutants To test if the difference in the

number of reduced TE genes versus increased TE genes (127 versus 73) in ACA-

K is significant, we permuted the mutant TE values 1000 times and calculated the

number of reduced TE versus increased TE genes for each permutation. There were 0

cases where the difference was less than the original difference, indicating the original

difference is not statistically significant.


Model for Translation Efficiency We used a regression model to predict TE of

an mRNA message based on various features:

$$\min_{w}\; \sum_m (TE_m - w^T f_m)^2 + \lambda_1 \sum_p |w_p| + \lambda_2 \sum_p w_p^2$$

The first term fits an optimal set of weights w to the TE of a set of genes m using

a linear combination of the set of features fm. The last two terms enforce sparsity (so

that features that do not explain the data well receive a weight of 0) and shrinkage (so

that weights are kept at a small scale). Under a standard machine learning framework,

we divide the genes in our yeast dataset into a test set (size 400 genes) and a training

set (the remaining genes). The hyperparameters λ1 and λ2 are learned via cross-

validation: we further divide the training set into fifths, and evaluate the error for a

grid of hyperparameter values on each fifth of the training set. The weights w are then

learned on the whole training set with the best hyperparameters (with lowest cross-

validation error). Test set error is the squared norm difference between predicted

and actual TE, averaged over all genes in the test set. For reference, we create a

null model where the weights are learned from TEs randomly permuted among the

genes. The final weights are the average over all training/test combinations. The

features used are minimal in order to maximize the number of genes that have these

characterized: tAI of gene; computationally predicted energy of 5’ UTR, 3’ UTR,

mRNA, and window around the start codon with highest correlation to TE; length

of coding sequence; mRNA abundance; identity of bases overlapping the Kozak site

(genes without a characterized UTR [87] were excluded).
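
The regression objective above combines an L1 and an L2 penalty. One way to minimize an objective of this form is cyclic coordinate descent with soft-thresholding, sketched below in plain Python. The solver actually used in this work is not stated, so this is only an illustration; there is no intercept term, the hyperparameter grid search is omitted, and all function names are hypothetical.

```python
def soft_threshold(a, t):
    """Shrink a toward zero by t, returning 0 inside [-t, t]."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def penalized_loss(F, y, w, lam1, lam2):
    """Squared error plus lam1 * sum|w_p| plus lam2 * sum w_p^2."""
    err = sum((yi - sum(wj * fj for wj, fj in zip(w, fi))) ** 2
              for fi, yi in zip(F, y))
    return err + lam1 * sum(abs(v) for v in w) + lam2 * sum(v * v for v in w)


def fit_weights(F, y, lam1, lam2, n_iter=200):
    """Cyclic coordinate descent. F is a list of feature rows f_m and
    y the list of TE values."""
    n, p = len(F), len(F[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_wo_j = sum(w[q] * F[i][q] for q in range(p) if q != j)
                rho += F[i][j] * (y[i] - pred_wo_j)
                z += F[i][j] ** 2
            denom = z + lam2
            # Setting the subgradient of the objective in w_j to zero
            w[j] = soft_threshold(rho, lam1 / 2.0) / denom if denom > 0 else 0.0
    return w
```

A sufficiently large λ1 drives a weight exactly to 0 (sparsity), while λ2 only scales it down (shrinkage), which mirrors the roles of the two penalty terms described above.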

To compute the weights for the refined Kozak site, we include a feature fk in f

defined as fk = 1/(1 + exp(x ∗ g)). The vector g has 36 indicators, 4 per each of

the 9 positions in the Kozak site (excludes the start codon). The vector x has the

corresponding weights for each indicator, is included in the shrinkage term, and is


learned iteratively with w. The refined Kozak motif in Figure 3.2.5 is the average

of the 100 values of x learned separately for each training set. To create a position-

weight matrix from these weights, we shift the weights for each position so that the

most negative value (if any) is 0 and normalize by the sum of the four weights at

that position. The sequence logo was generated by seqLogo (seqLogo: Sequence

logos for DNA sequence alignments, R package v1.28.0, http://bioconductor.org/

packages/release/bioc/html/seqLogo.html).
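
The Kozak feature and the conversion of learned weights into a position-weight matrix can be sketched as below. This Python example assumes a position-major layout of the 36 indicators with bases ordered A, C, G, T, and it falls back to a uniform column when all shifted weights at a position are zero; both choices are assumptions made for illustration and are not specified in the text.

```python
import math

BASES = "ACGT"  # assumed ordering of the four indicators per position


def kozak_feature(x, site):
    """f_k = 1 / (1 + exp(x . g)), where g one-hot encodes the bases in
    site (one base per Kozak position, start codon excluded)."""
    dot = sum(x[4 * i + BASES.index(b)] for i, b in enumerate(site))
    return 1.0 / (1.0 + math.exp(dot))


def weights_to_pwm(x, n_pos=9):
    """Shift each position so its most negative weight (if any) becomes 0,
    then normalize by the sum of the four weights at that position."""
    pwm = []
    for i in range(n_pos):
        col = list(x[4 * i:4 * i + 4])
        low = min(col)
        if low < 0:
            col = [v - low for v in col]
        total = sum(col)
        # Uniform fallback for an all-zero column (assumption)
        pwm.append([v / total for v in col] if total > 0 else [0.25] * 4)
    return pwm
```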

To test whether the refined motif provides better TE predictions than the original

Kozak motif, within each of 100 training sets, we fix fk for each sequence with x set

to the original motif (scaling the weights so that the sum at each position matches

the sum of the learned motif) and learn the remaining weights as before. We then

compute accuracy on the corresponding test set.

Outlier Model The strength of an outlier ∆mk at position (m, k) is defined as

the difference between the observed count (dmk) and the expected count (Jmkµmc),

divided by smk, a standard deviation representing the variance in that count due to

the abundance of the gene and the codon it corresponds to. For smk, we divide the

genes into 32 quantiles by abundance and compute the standard deviation of the

counts in each bin per codon. Thirty-two was chosen as the maximum number that

still gave at least three counts in each bin per codon and no zero-valued smk. This

normalization helps distinguish true biological outliers from outliers arising due to

differential mRNA sampling and abundance depths across genes. Counts are as in

the optimization setup (dmk have a pseudo-count of 1 and Jmk are scaled by a single

factor). A slow outlier is an (m, k) with ∆mk > T for some threshold T. Non-outliers

are (m, k) with −1 < ∆mk < 1, excluding slow outliers.
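
The definitions above translate into a short sketch. The Python below takes the observed count, the expected count, and the standard deviation smk as given (the quantile binning used to obtain smk is not reproduced), and the function names and category labels are hypothetical.

```python
def outlier_strength(observed, expected, s):
    """Delta_mk: (observed count - expected count) / standard deviation."""
    return (observed - expected) / s


def classify(delta, T):
    """Label a position for a slow-outlier threshold T.

    Slow outliers have delta > T. Non-outliers have -1 < delta < 1 and are
    not slow outliers. Everything else is left unlabeled here as 'other'.
    """
    if delta > T:
        return "slow"
    if -1 < delta < 1:
        return "non-outlier"
    return "other"
```

Checking the slow condition first implements the exclusion of slow outliers from the non-outlier set, which matters for thresholds below 1 (for example T = 0 or T = 0.5).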

Since there is a small uncertainty in the position of the active codon within



ribosome-protected fragments of certain lengths, what we might see as a fast out-

lier (a position (m, k) where ∆mk < −T and, for example, a wrongly-labeled count of

0) could actually have a fragment that was falsely associated with an adjacent slow

position. The opposite is much less likely; an observed slow outlier has many more

counts than expected, making it unlikely that so many fragments were wrongly at-

tributed and belong instead to an adjacent fast outlier. For that reason, we compare

slow outliers only to non-outliers.

When correlating features to outlier strength (Table A.2), we call features signifi-

cant only if they pass a stringent set of conditions: Pearson and Spearman correlations

must have the same sign for all slow outlier thresholds (T = 0, 0.5, 1, 1.5, 2, 2.5) and

be significant; the correlation when binned by codons must have at least 30 significant

codons; the sign of the correlation must match the direction suggested by the com-

parison of means for slow versus non-outliers. When referring to significant features

in Table A.2, we cite the correlation for T = 0 since all thresholds are significant. For

a more stringent set of outliers, we use T = 1 in analyses requiring a fixed T (Figure

A.2, Figure A.2, Figure A.10).

Data Availability Data is available at GEO Series accession number GSE63789.

The conclusions in this chapter are published in [96].

3.5 Conclusions

In this chapter, we presented a method that provides a rigorous framework for ana-

lyzing the increasing number of ribosome profiling data sets, and thereby addressing

the outstanding questions raised in the discussion about correlations between trans-

lation efficiency, codon bias, and RNA secondary structure. We illustrate the use of

the method in the context of one of these data sets to create a high-level view of


the mechanisms involved in initiation and elongation, to study the factors affecting

initiation as the rate-limiting step for translation, and to support a model in which

the direction of causality goes from translation efficiency to codon usage rather than

the opposite.

Chapter 4

Translation in Humans

4.1 Introduction

Translation in higher-order organisms is notoriously more difficult to model and un-

derstand. Alternative splicing complicates sequencing processes like RNA-seq, and

now also ribosome profiling. In particular, common exons cannot unambiguously be

mapped to the correct isoform without additional information or computational tech-

niques that are only now being tackled [84]. Nevertheless, several ribosome profiling

datasets exist in a variety of conditions and human tissues and while these additional

intricacies complicate data analysis, these data also yield valuable insight towards the

genetic basis for translation.

Previous studies have focused on understanding the impact of genetic variation on

expression levels, protein levels, or ribosome occupancy levels [1, 8, 10, 82]. However,

due to the difficulty of obtaining a clean signal from codon-resolution ribosome fragment

counts, few if any studies have looked at genetic variation affecting intermediate

signals during translation as opposed to the genome-wide ribosome throughput. In

this section, we will present the results of our translation model on a large dataset

of many human individuals. We will show that there exist SNPs associated with


significant differences in codon translation rates, suggesting that genetic variability

might cause variability during elongation.

This analysis was performed in collaboration with Jonathan Pritchard.

4.2 Results

4.2.1 Allele-Specific Ribosome Dwell Times

A recent ribosome profiling dataset on 71 human individuals [10] allows us to com-

pare translation rates and protein synthesis rates between sequence variants. In this

data, we do find that the reference allele and the alternate allele are not always as-

signed the same number of ribosome footprint counts during mapping to the genome.

For each (gene, codon) pair, we therefore calculate allele-specific ribosome fragment

counts, giving us an estimate of how often each version of that transcript is seen by

a translating ribosome. For example, if a codon AAA contains an A/T SNP at the

third base, we count all the ribosome fragments which map to an AAA and all the

ribosome fragments which map to an AAT. To calculate a score representing this ratio

of ribosome fragment counts per allele-pair (e.g. the AAA-AAT pair), we aggregate

over all instances of that SNP-induced pair in every gene and every individual (see

Methods). Our analysis reveals several allele pairs for which the ratio is significant

under a binomial test (Figure 4.1; bottom-left triangle of each square).

To clarify, these fragment counts correspond to a specific codon location – if one or

more SNPs modify a location within the codon, we examine all possible pairs. We note

also that this method weighs genes with higher abundance more than those with lower

abundance, affording us some smoothing from using raw (sparse and noisy) ribosome

fragment counts, as opposed to taking an average of ratios. This score aggregates over

up to 30000 instances depending on the allele-pair. The number of ribosome fragment


[Figure 4.1 image: grid over codon pairs, with both axes labeled by codon (AAA through TTT); color scale: binomial test p-value, from 0 to 1.]

Figure 4.1: Comparison of ribosome fragment counts between alleles at SNPs. The bottom-left triangle is calculated for the score derived from raw counts and the top-right triangle is calculated for the score derived from outlier strengths inferred from our translation model. The values are p-values under a binomial test, with 0 representing a pair with significant differences between ribosome occupancy (or strength of the pausing) of the two alleles.


counts in the numerator and denominator range from tens up to 20000 counts. We

remove several problematic genes from this analysis (roughly 30 instances from each

allele-pair), where we see a highly skewed ratio (e.g. zero counts for one allele versus

hundreds for the other) across all allele pairs. Since this skew occurs regardless of

the pair, these are likely due to artifacts in the experimental protocol. We also

remove any transcripts with high sequence similarity to another transcript in order

to alleviate the ambiguity caused by multiple isoforms (see Methods). Finally, we

focus on pairs with a difference in the wobble-base (the third base). These have been

implicated in translational control [114], although this analysis can be performed over

any allele-pair.

As seen previously in this thesis, the raw ribosome counts are a noisy observation

of the true dwell times. We therefore applied our probabilistic method for analyzing

ribosome profiling data (described in the previous chapter) to this human dataset,

learning as before the dwell times per codon c over two different granularities: a

global dwell time µc and a per-gene dwell time µcm. Using these parameters, our

model can also extract outlier strengths, which measure how much more the

ribosome dwells than expected. Repeating the ratio analysis as before, but now

substituting the raw ribosome fragment counts with the inferred outlier strengths,

we find similar results: several allele-pairs have a ratio that is significant compared

to a binomial test, but there are fewer pairs than when using the raw counts (fewer

p-values close to 0 or white cells in Figure 4.1; top-right triangle in each square).

This is not surprising since the raw counts are noisy observations of the true rate and

hence their high variance, especially in high abundance genes, can introduce noise

into the ratio. All except one of the allele-pairs that remain significant when using

outlier strength represent synonymous codons: for example, AAA/AAG code for

Lysine, AGA/AGG code for Arginine, and AGC/AGT code for Serine. Interestingly,

the ATA-ATG combination is significant, perhaps indicating that ATA is a potential


alternative start codon.

The raw count and the outlier strength represent different levels of granularity.

The raw count, the finest, is a noisy observation of the true phenomenon and hence

would be the most susceptible to perturbations in the system caused by artifacts

that don’t correspond to fluctuations in the true codon translation rates. The outlier

strengths, compared to the global dwell times also learned by the model, capture the

gene-specific dynamics but help alleviate some of the differences observed due purely

to noise. The latter is therefore a more conservative estimate. With careful analysis

of the genes and the aggregation of counts, we therefore find the potential for variable

ribosome pausing rates associated with genetic variation.

4.2.2 Codon Translation Rates Across Individuals

Applying our translation model from the previous chapter to this human dataset

allows us to explore other aspects of translation. In line with the results on variable

ribosome pausing at different alleles, we see variability in codon translation rates

between individuals (Figure 4.2). However, this analysis is somewhat sensitive to the

complexities of a higher-order organism. When we include more genes in our model

(3000, 5000, 10000 genes), we have more data to learn from, but we also have more

ambiguous data due to alternatively-spliced transcripts. The correlations between

global codon dwell times across pairs of these models are Pearson r = 0.25, r = 0.35, and r

= 0.33. As such, we based all inferences throughout this chapter on the 3000 genes

with highest RNA levels and lowest similarity to other transcripts (see Methods) in

order to strike a balance between sufficient data and unambiguous data.

We also find, as in yeast and other organisms [98, 73, 96], that codon translation

rates do not correlate well to tAI [34], a measure of codon bias. Spearman r-values

range from -0.38 to 0.11, depending on the individual, with all p-values except two

being greater than 0.01.


[Figure 4.2 image: four panels, each plotting one individual against another: Individual 24 vs. Individual 19; Individual 40 vs. Individual 38; Individual 66 vs. Individual 26; Individual 2 vs. Individual 18.]

Figure 4.2: Comparison of inferred codon dwell times between four random pairs of human individuals.


4.3 Discussion

In this work, we showed that allele-specific ribosome counts and ribosome dwell times

exist at SNP locations for several allele-pairs. Recent studies in yeast and human

[1, 8, 10, 82] have surprisingly shown that eQTLs – quantitative trait loci associated

with an effect on RNA expression levels – have a significantly reduced effect size

on protein levels. It then follows to ask whether SNPs are potentially associated

with variation at an elongation level as opposed to a protein synthesis level. In

particular, it might be the case that genetic variation acts on codon translation rates

in order to affect other mechanisms beyond ribosome throughput. We also showed

that translation rates between individuals can differ. The source of this can be,

for example, accumulated differences in per-allele rates from accumulated genetic

differences between these individuals, although a more thorough investigation of other

biological factors, or potentially confounding factors, would be interesting to perform.

Through comparison of ribosome profiling datasets on several individuals, we illus-

trated the potential impact of genetic variation on ribosome pausing, but the mecha-

nism behind this variability deserves further exploration. It was recently shown that

genetic variability is also associated with differences in structure via a PARS assay on

three human individuals [130]. This biological feature, as well as those suggested in

the previous chapter in yeast, are candidates for biophysical characteristics that can

act via the genome to affect translation and eventually create a phenotype of interest.

It would be interesting to combine this analysis with datasets on complex

diseases to gauge whether elongation-level SNPs can help explain those phenotypes

or boost predictive power.

Finally, the ribosome profiling dataset analyzed in this work provides many differ-

ent signals of interest. We can extract other per-gene quantities potentially associated


with QTLs (replacing the “e” with other letters). For example, we could look at pe-

riodicity of the ribosome occupancy signal in order to understand whether shifts in

frame, potentially caused by RNA secondary structure, have a genetic basis. Over-

lapping loci associated with different signals and with different biological features

could lead to an elucidating cross-comparison of QTL effects, and would be useful in

elevating our understanding of translation in humans.

4.4 Methods

Ribosome Profiling Datasets Ribosome profiling data was gathered for 71 hu-

man individuals in Yoruba lymphoblastoid cell lines (LCLs) [10]. RNA-seq data and

genome-wide genotypes were obtained from [94].

Ribosome fragment counts were mapped to the genome and the active codon

was determined as explained for our translation model in the Methods section of the

previous chapter. When choosing the genes to train over, we computed a score as

follows: RNA*RNA/similarity. Here, RNA represents the average RNA level per gene

and similarity represents how similar the RNA sequence is to any other transcript (a

value of X means similarity to X other transcripts). We chose the first 3000 genes

with highest score. For comparison, we also looked at a model over the first 5000 and

10000 genes.
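
The gene selection step can be written as a short sketch. The Python below assumes a hypothetical mapping from gene name to a pair (average RNA level, similarity value) and simply ranks genes by RNA*RNA/similarity.

```python
def selection_score(rna, similarity):
    """Score = RNA * RNA / similarity: favors high RNA level and
    low similarity to other transcripts."""
    return rna * rna / similarity


def top_genes(stats, n=3000):
    """stats maps gene -> (average RNA level, similarity).
    Returns the n genes with the highest score."""
    ranked = sorted(stats, key=lambda g: selection_score(*stats[g]), reverse=True)
    return ranked[:n]
```

Squaring the RNA level makes abundance dominate the ranking, so a gene must be similar to many other transcripts before its score drops below that of a less abundant but unambiguous gene.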

Ratio Score To aggregate the ratios of different allele pairs, suppose we have a

position on the genome where a SNP gives us the following genetic variants: AAA

and AAT. We scan every codon on every gene, looking at SNP locations which have

this specific pair. We then keep a running sum of the ribosome counts at every

AAA and at every AAT. Lastly, we take the ratio of the sum of AAA counts to

the sum of AAT counts. In the subsequent analyses, we replace the count with the


outlier strength learned from our translation model. This ratio is compared against

a binomial test to obtain a p-value representing the significance of ratios that differ

from one. For example, a large ratio means that the AAA allele is translated slower

than the AAT allele.
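
The aggregation and test can be sketched as follows for raw integer counts. The text does not state the null hypothesis or the sidedness of the binomial test, so this Python example assumes a null in which both alleles are equally likely (p = 0.5) and an exact two-sided test; how the real-valued outlier strengths were handled is likewise not specified and is not covered here. All function names are hypothetical.

```python
import math


def aggregate_ratio(instances):
    """instances: iterable of (count_first_allele, count_second_allele),
    one pair per SNP instance across genes and individuals.
    Returns the two running sums and the ratio of the sums."""
    instances = list(instances)
    sum_a = sum(a for a, _ in instances)
    sum_b = sum(b for _, b in instances)
    ratio = sum_a / sum_b if sum_b else math.inf
    return sum_a, sum_b, ratio


def _log_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def binomial_two_sided(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed one."""
    obs = _log_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        lp = _log_pmf(i, n, p)
        if lp <= obs + 1e-9:  # small tolerance for floating-point ties
            total += math.exp(lp)
    return min(1.0, total)
```

In this sketch, the test for an allele pair would be called as binomial_two_sided(sum_a, sum_a + sum_b). Taking the ratio of sums rather than an average of per-instance ratios is what gives more weight to highly abundant genes.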

Since several genes illustrated artifacts in the experimental protocol where we saw

zero ribosome fragment counts despite the specified genotype, we anecdotally scanned

for such anomalies (where we saw these genes creating the same skew for over half of

the allele-pairs), and removed these genes from the analysis. This only eliminated on

average 30 instances from an allele-pair (a relatively small fraction).

The global codon dwell times and the outliers were calculated according to the

previous chapter. In calculating the standard deviation for normalizing the outlier

residual, we used the reference allele at each position.

4.5 Conclusions

In this section, we presented an analysis of ribosome profiling data for 71 human

individuals. By aggregating various measures of ribosome pausing, we illustrated

the potential impact of genetic variation on variable codon translation rates. This

analysis can be harnessed for asking other questions: how are codon translation rates

affected amongst synonymous allele-pairs, how are elongation-level features related

to variable translation rates, and how are expression-level, protein-level, and ribosome-

occupancy-level QTLs related to elongation-level variation.

Chapter 5

RNA Secondary Structure

Prediction

5.1 Introduction

The development of genome-wide RNA secondary structure-probing assays has en-

abled new insight into the role of secondary structure in gene regulation, and has

spawned new computational methods that leverage these data for RNA secondary

structure prediction.

To date, RNA secondary structure prediction methods incorporating structure-

probing data extend energy-based methods, generally using the data to constrain

the space of possible structures considered by the algorithm. [142] and [73] adopt the

straightforward approach of enforcing hard constraints that particular nucleotides are

paired or unpaired in the MFE computation. [27] and [48] use the data as soft con-

straints, biasing the energy model to pair or unpair nucleotides based on their probing

signal. [131] estimate a perturbation to the energy model to encourage agreement be-

tween the basepairings predicted by the energy model and those inferred from the

experimental data. This perturbed energy model is then used to predict structures


using an MFE algorithm. [99] and [92] use the probing data in a post-processing

step to select amongst structures sampled from the structure ensemble defined by the

energy model [30].

In this section, we build upon the success of statistical methods and present a novel

method, CONTRAfold-SE, that incorporates multiple structure-probing datasets to

achieve improved prediction accuracy on diverse RNA sequences. A statistical ap-

proach provides two key advantages. First, it obviates the need for heuristic treat-

ments of the probing data (as in existing methods), such as thresholding the data

to a binary value (reflecting whether a base is paired or not) or incorporating it in

the energy model as a pseudo-energy term. Second, a statistical approach provides

a principled framework for combining data from multiple structure-probing experi-

ments. Each probing strategy has specific biases, and combining data obtained from

small-scale experiments using different strategies has been shown to improve predic-

tion accuracy [24].

Our method, CONTRAfold-SE, extends the statistical model of CONTRAfold

[33], one of the best-performing secondary structure prediction methods [103, 97], to

model the structure-probing data as observations of possibly unknown secondary

structures. This model can be learned from datasets containing only structure-

probing data, or a mix of known structures and probing data. CONTRAfold-SE can

then generate predictions on novel sequences from this learned model. By contrast,

CONTRAfold requires a set of complete structures to learn a model. We evaluated

CONTRAfold-SE using three genome-wide structure-probing datasets in yeast, based

on two different probing techniques that are performed in two different conditions.

We show that when predicting the structure of a novel sequence, CONTRAfold-SE is

competitive with current methods, and slightly outperforms CONTRAfold, on several

test sets of known RNA structures, whether using structure-probing data available

for the novel sequence or just the sequence itself. We find that combining datasets


in different probing conditions can have an adverse effect on performance, but other-

wise allows for cross-correction of errors in the data. CONTRAfold-SE outperforms

competing methods in predicting genes bound by RNA binding proteins (RBPs),

and is able to identify specific structural motifs bound by RBPs. Surprisingly, while

CONTRAfold-SE outperforms the existing state-of-the-art method SeqFold [92], we

find that its gains over CONTRAfold are modest, suggesting that using accurate sta-

tistical prediction models is an important supplement to current structure-probing

data. This method was developed in collaboration with Chuan-Sheng Foo.

5.2 Results

CONTRAfold-SE is a probabilistic model for single-sequence RNA secondary struc-

ture prediction that can utilize structure-probing data both for training the model

and in making predictions. For a single RNA sequence x, CONTRAfold-SE mod-

els the conditional probability of secondary structure y and structure-probing data

from sources d1, . . . , dn (when available), given x. Secondary structure y is modelled

using the CONTRAfold statistical model. Briefly, the CONTRAfold model is a con-

ditional log-linear model for secondary structure given sequence, that uses features

analogous to structural motifs used in energy-based models; model parameters are

trained on a set of RNAs with known structures. Structure-probing data d are mod-

elled as observations of the (often unobserved) secondary structure. We refer to these

two components as the structure model (with associated CONTRAfold parameters)

and the data model (with parameters representing the distribution over the data).

CONTRAfold-SE takes as input {(x, y, d)}: a training set of RNA sequences with

associated secondary structures and one or more sources of structure-probing data

(if available). Having both known structure and probing data is not necessary, but

nonetheless desirable.
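
The two components can be written compactly as follows. This is a sketch based on the distributions shown in Figure 5.1; the factorization over data sources (that is, conditional independence of d1, . . . , dn given the structure) is an assumption made here for illustration, and the exact form is given in the Methods section and Appendix B. The last line marginalizes over structures, which is what is needed for sequences whose structure is unobserved.

```latex
\begin{align}
P(y \mid x; w) &= \frac{\exp\!\big(w^\top F(x, y)\big)}{\sum_{y'} \exp\!\big(w^\top F(x, y')\big)} \\
P(y, d_1, \dots, d_n \mid x; w, \theta) &= P(y \mid x; w) \prod_{s=1}^{n} P(d_s \mid x, y; \theta) \\
P(d_1, \dots, d_n \mid x; w, \theta) &= \sum_{y} P(y \mid x; w) \prod_{s=1}^{n} P(d_s \mid x, y; \theta)
\end{align}
```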


To estimate the parameters of the structure model and the data model, we maxi-

mize the model likelihood on the given training set. These estimated parameters can

then be used to perform predictions on arbitrary RNA sequences, with or without

supporting structure-probing data. If structure-probing data is not available or is too

noisy to be used, we can predict a structure based on the structure model alone, as

in CONTRAfold. If structure-probing data is available, it can be incorporated by

first updating model parameters in an additional round of training with the query

sequence as a single training example and then using the updated structure model

as before. The tradeoff between these two prediction methods is discussed in the

following sections, but unless otherwise specified, prediction is done without using

structure-probing data. Figure 5.1 summarizes the components of CONTRAfold-SE,

and a detailed description of the model, estimation, and inference procedures is

found in the Methods section and Appendix B.

5.2.1 Improved Secondary Structure Predictions

We use the following notation: CONTRAfold-SE trained on “Train(DataSource)”

represents training on the specific set of sequences represented by “Train” with

structure-probing data from the source(s) labeled “DataSource”. We considered com-

binations of the largest high-throughput assays in yeast – the parallel analysis of RNA

structure (PARS) [62], and DMS-seq assays [105]. In the PARS assay, the RNA struc-

ture signal is obtained by treating RNA with enzymes that preferentially cleave either

paired or unpaired nucleotides. The DMS-seq assay relies instead on the reactivity of

unpaired nucleotides to the dimethyl-sulfate chemical; the DMS-seq assay was applied

to both renatured RNA and live yeast. We denote these sources PARS, DMS-vitro,

and DMS-vivo, respectively. The Methods section summarizes the training and test

sets we used. When comparing CONTRAfold-SE to CONTRAfold, we simply ex-

clude from the training set the data-only sequences and keep the same structure-only


[Figure 5.1 panels: TRAINING (structure-only sequences and data-only sequences feed the STRUCTURE MODEL and the DATA MODEL, yielding the learned model weights w, θ); PREDICTION (no data); PREDICTION (with data). Structure model: $y \mid x, w \sim \exp(w^\top F(x, y)) / \sum_{y'} \exp(w^\top F(x, y'))$. Data model: $d_k \mid x, y, \theta \sim \mathrm{Gamma}_{x_k,\,\mathrm{paired}(k, y)}(d_k; \theta)$, with data sources $d_1, \dots, d_S$.]

Figure 5.1: Overview of CONTRAfold-SE. During training, we learn the model parameters w and θ from a training set consisting of sequences with only known structure or only structure-probing data (or a combination of both, although in practice there are few such sequences available for both training and testing). At prediction, we use the model parameters to predict the structure of a new sequence (prediction without data). If data is available, we can also predict by incorporating this information (prediction with data).


sequences for fair comparison. We evaluate methods based on F-measure (calcu-

lated from sensitivity and positive predictive value (PPV)), accuracy, and AUC (see

Methods).
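
As a concrete illustration of the first metric, the sketch below computes sensitivity, PPV, and F-measure over sets of base pairs. It assumes the standard definition of the F-measure as the harmonic mean of sensitivity and PPV; the function name pair_metrics and the toy structures are hypothetical, and the exact evaluation protocol is the one described in Methods.

```python
def pair_metrics(predicted, reference):
    """Return (sensitivity, ppv, f_measure) for two collections of base pairs."""
    # Normalize each pair so that (i, j) and (j, i) are treated as the same pair.
    predicted = {tuple(sorted(p)) for p in predicted}
    reference = {tuple(sorted(p)) for p in reference}
    true_pos = len(predicted & reference)
    sensitivity = true_pos / len(reference) if reference else 0.0
    ppv = true_pos / len(predicted) if predicted else 0.0
    if sensitivity + ppv == 0:
        return sensitivity, ppv, 0.0
    # Assumed definition: harmonic mean of sensitivity and PPV.
    f_measure = 2 * sensitivity * ppv / (sensitivity + ppv)
    return sensitivity, ppv, f_measure


# Toy example: two of three reference pairs are recovered, one prediction is wrong.
reference = [(0, 9), (1, 8), (2, 7)]
predicted = [(0, 9), (1, 8), (3, 6)]
sens, ppv, f = pair_metrics(predicted, reference)
```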

We first demonstrate how CONTRAfold-SE performs in comparison to CON-

TRAfold and SeqFold, the current state-of-the-art algorithm incorporating probing

data, on the small set of sequences with known structures presented in [92] (denoted by

Test-SeqFold). Table 5.1 shows that CONTRAfold-SE trained on Train-A(PARS),

a combination of structure-only sequences and data-only yeast mRNA sequences,

does at least as well as SeqFold on 6 of 10 of the sequences and at least as well as

CONTRAfold on all of the sequences. No probing data was included during predic-

tion and performance was measured by F-measure (to allow comparison to SeqFold).

Notably, CONTRAfold-SE achieves the same performance as CONTRAfold on 4

of 10 of the sequences, indicating that structure-probing data does not necessarily

contribute significantly to prediction quality. CONTRAfold-SE, like CONTRAfold,

offers a sensitivity-PPV tradeoff via a hyperparameter γ: increasing γ essentially adds

more basepairs to the prediction. When comparing to methods that report a single

point on the sensitivity-PPV curve (such as SeqFold), we select γ by cross-validation

(see Methods); the full sensitivity-PPV curves are shown in Figures B.1-B.10.
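
The effect of γ can be illustrated with a small decoding sketch. It assumes a maximum-expected-accuracy style objective in which each predicted pair (i, j) is rewarded by 2γ times its posterior pairing probability and each unpaired base by its posterior probability of being unpaired; the function mea_fold, the min_loop constraint, and the toy posteriors are hypothetical and are not taken from the CONTRAfold-SE implementation.

```python
def mea_fold(n, pair_prob, gamma, min_loop=3):
    """Return the nested base pairs maximizing a gamma-weighted expected accuracy.

    pair_prob maps (i, j) with i < j to the posterior probability that i pairs with j.
    """
    # Posterior probability that each base is unpaired.
    unpaired = [1.0] * n
    for (i, j), p in pair_prob.items():
        unpaired[i] -= p
        unpaired[j] -= p

    # best[i][j]: best score for the half-open span [i, j); empty spans score 0.
    best = [[0.0] * (n + 1) for _ in range(n + 1)]
    partner = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # Option 1: leave base i unpaired.
            best[i][j] = unpaired[i] + best[i + 1][j]
            # Option 2: pair base i with some k in the span, enclosing >= min_loop bases.
            for k in range(i + min_loop + 1, j):
                p = pair_prob.get((i, k), 0.0)
                score = 2 * gamma * p + best[i + 1][k] + best[k + 1][j]
                if p > 0 and score > best[i][j]:
                    best[i][j] = score
                    partner[i][j] = k

    # Trace back the chosen pairs.
    pairs, stack = [], [(0, n)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        k = partner[i][j]
        if k is None:
            stack.append((i + 1, j))
        else:
            pairs.append((i, k))
            stack.append((i + 1, k))
            stack.append((k + 1, j))
    return sorted(pairs)


# Toy posteriors for an 8-nt sequence: two nested candidate pairs.
posteriors = {(0, 7): 0.4, (1, 6): 0.3}
by_gamma = {g: mea_fold(8, posteriors, g) for g in (1, 2, 4)}
```

In this toy example no pairs are predicted at γ = 1, one pair at γ = 2, and both pairs at γ = 4, which mirrors how larger γ trades PPV for sensitivity.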

CONTRAfold-SE can also incorporate available structure-probing data for the

query sequences during prediction. In this prediction mode, a data tuning parameter

shifts the structure model either closer to or farther from the distribution represented

by the data of the query sequence. Table 5.1 shows the F-measure on Test-SeqFold

where structures are predicted using the cross-validated data tuning parameter δ (see

Methods). In one case (snR81), prediction with data improves on prediction without

data by a substantial 8%, but performance surprisingly remains the same in four

cases and drops in five cases. The data quality for a single query sequence is likely poor

enough or diverse enough that without pushing the prediction toward the (smoothed)


Sequence   C-SE   C-SE (query data)   SeqFold   CONTRAfold
ASH1-E1    0.58   0.58                0.79      0.52
RDN58-2    0.54   0.54                0.52      0.53
p4p6       0.90   0.84                0.82      0.90
p9         1.00   0.98                0.98      1.00
snR10      0.72   0.66                0.83      0.69
snR33      0.89   0.73                0.76      0.89
snR37      0.73   0.72                0.94      0.71
snR46      0.75   0.75                0.88      0.74
snR53      0.67   0.67                0.56      0.67
snR81      0.82   0.89                0.77      0.80

Table 5.1: F-measure of CONTRAfold-SE (C-SE) trained on Train-A(PARS) and evaluated on Test-SeqFold. The presented structure for CONTRAfold-SE and CONTRAfold are based on a hyperparameter γ selected by cross-validation (see Methods). SeqFold F-measure is calculated from the sensitivity and PPV presented in [92]. CONTRAfold-SE with query data (C-SE (query data)) incorporates the data per test sequence during prediction based on a cross-validated data tuning parameter δ (see Methods) and a fixed γ (the average cross-validated γ from CONTRAfold-SE). CONTRAfold-SE is competitive with SeqFold on 6 of 10 sequences, even without data at prediction. Bolded numbers indicate the algorithm with highest F-measure across all algorithms.

structure model parameters, the algorithm cannot sufficiently correct it.

Since CONTRAfold alone does at least as well as SeqFold on 6 of 10 sequences,

and is only marginally worse than CONTRAfold-SE, we will largely focus the subse-

quent results on how CONTRAfold-SE compares to CONTRAfold. We next demon-

strate that structure-probing data can be beneficial for learning general-purpose sec-

ondary structure prediction models, by evaluating the learned models on two test

sets with a diverse set of RNA structures, as compiled in [103] (denoted by Test-

Tornado-TestSetA and Test-Tornado-TestSetB). In this experiment (Table 5.2),

CONTRAfold-SE trained on Train-A(PARS) outperforms CONTRAfold on all three

metrics. The overall performance differences are fairly small, possibly because the

sequences are short (means of 192 nt and 121 nt, respectively) and hence presumably easier to


predict in the first place. These sequences also cover a wide range of RNA families

that may not be reflected in the training set.

Training set                 AUC      F-measure   Accuracy

Test-Tornado-TestSetA
Train-A with CONTRAfold      0.7110   0.7122      0.7610
Train-A(PARS)                0.7169   0.7177      0.7640
Train-B(PARS)                0.7209   0.7203      0.7662
Train-B(DMS-vitro)           0.7126   0.7128      0.7616
Train-B(DMS-vivo)            0.7096   0.7105      0.7604
Train-B(PARS,DMS-vitro)      0.7240   0.7214      0.7662

Test-Tornado-TestSetB
Train-A with CONTRAfold      0.6178   0.6478      0.7498
Train-A(PARS)                0.6236   0.6537      0.7519
Train-B(PARS)                0.6256   0.6554      0.7535
Train-B(DMS-vitro)           0.6201   0.6499      0.7514
Train-B(DMS-vivo)            0.6172   0.6483      0.7498
Train-B(PARS,DMS-vitro)      0.6252   0.6558      0.7551

Test-mRNA
Train-A with CONTRAfold      0.7158   0.7152      0.7597
Train-A(PARS)                0.7158   0.7141      0.7578
Train-B(PARS)                0.7129   0.7097      0.7549
Train-B(DMS-vitro)           0.7189   0.7164      0.7608
Train-B(DMS-vivo)            0.7159   0.7147      0.7600
Train-B(PARS,DMS-vitro)      0.7170   0.7149      0.7576

Table 5.2: Performance of CONTRAfold-SE trained on Train-A and Train-B and evaluated on three general test sets. CONTRAfold performance is shown for reference. Performance metrics are explained in Methods.

We hence extended our evaluation with an additional test set that better reflects

the training set, Test-mRNA, which includes 188 highly conserved mRNA sequences

[105] (see Methods). However, we find that incorporating data during training with

CONTRAfold-SE does not result in significant improvement, potentially because the


“ground truth” structures in this case may differ from the ones reflected by the data.

That is, the “true” structures here are inferred from phylogenetic conservation of sec-

ondary structure, in which basepairs covary between species. Table 5.2 shows that

there is no significant improvement of CONTRAfold-SE trained on Train-A(PARS)

compared with CONTRAfold.

5.2.2 The Value of Structure-Probing Data

Although probabilistic models can help account for uncertainty and noise in measure-

ments, they still require data with strong signal for robust parameter estimation. In

[99], the noise level of the data was manipulated in a synthetic train-test environ-

ment in order to explore the sensitivity of RNA secondary structure prediction to

noise. Here, we instead manipulate the noise level by varying the number of probing

data-only instances in the training set.

We first explored how performance changes when increasing the fraction of se-

quences with only PARS structure-probing data from 50% (Train-A50%) up to 100%

(Train-A100%) while keeping the total number of sequences in the training set constant.

Performance on Test-SeqFold degrades rapidly as we rely more on structure-probing

data (Table 5.3). This indicates that having a substantial set of known

structures is important so that the noise in the experimental data does not over-

whelm the signal in the known structures. To explore how much value the probing

data brings to model estimation, we next fix the number of sequences with known

structure and increase the number of sequences with probing data (Train-A75 to

Train-A100). We again see that performance is harmed (though less rapidly) as we

rely more on experimental information. The training set sequences with structure-

probing data (mRNAs) are not necessarily representative of the test set (mainly

rRNAs), and so as the number of these sequences in the training set increases, the


algorithm could be learning a different set of conformational rules. The same re-

sults hold for Test-mRNA (Table 5.3): increasing the number of data-only

sequences in the training set still degrades performance (perhaps slightly less so), even

though both the training and test sets contain mRNA sequences. This result

highlights the difficulty of learning from incomplete data, and suggests that a deeper

understanding of the biases in the data is required in order to improve the data model.

Training set           (#known, #data)   AUC      Accuracy   F-measure

Test-SeqFold
Train-A CONTRAfold     (119, 0)          0.8665   0.8158     0.8466
Train-A50%             (119, 119)        0.8751   0.8323     0.8580
Train-A75%             (60, 178)         0.8533   0.8223     0.8468
Train-A100%            (0, 238)          0.2883   0.3586     0.5614
Train-A75              (119, 178)        0.8734   0.8309     0.8581
Train-A100             (119, 238)        0.8662   0.8274     0.8558

Test-mRNA
Train-A CONTRAfold     (119, 0)          0.7158   0.7152     0.7597
Train-A50%             (119, 119)        0.7158   0.7141     0.7578
Train-A75%             (60, 178)         0.7030   0.7081     0.7514
Train-A100%            (0, 238)          0.2808   0.3683     0.5508
Train-A75              (119, 178)        0.7142   0.7106     0.7559
Train-A100             (119, 238)        0.7109   0.7075     0.7519

Table 5.3: Performance of CONTRAfold-SE trained on sets of varying compositions with PARS data and evaluated on two test sets. In each test set, the first row gives the CONTRAfold performance. The next three rows maintain the same total number of training sequences (238) while changing the fraction of sequences with only structure-probing data. Train-A50% is exactly the training set Train-A(PARS) (and is the same as Train-A50). The last two rows maintain the same number of training sequences with known structure (119) while increasing the number of sequences with structure-probing data. Performance metrics are explained in Methods.


[Figure 5.2: bar chart titled "CONTRAfold-SE performance on Test-SeqFold, trained on Train-B"; metrics on the x-axis: AUC, F-measure, Accuracy; y-axis range 0.75 to 0.9; legend: CONTRAfold, PARS, DMS-vitro, DMS-vivo, PARS + DMS-vitro, PARS + DMS-vivo, DMS-vivo + DMS-vitro, PARS + DMS-vitro + DMS-vivo.]

Figure 5.2: CONTRAfold-SE performance using different data sources. Evaluation is on Test-SeqFold trained on all combinations of PARS, DMS-vitro, and DMS-vivo data sources for Train-B.

5.2.3 Combining Data from Multiple Data Sources

To our knowledge, CONTRAfold-SE is the first algorithm that can incorporate mul-

tiple data sources, using its probabilistic framework to combine them in a principled

way, and enabling cross-correction of errors. Figure 5.2 shows prediction performance

on Test-SeqFold using all possible combinations of the PARS, DMS-vitro, and DMS-

vivo data sources for Train-B. Combining the two in vitro assays, DMS-vitro and

PARS (green bar), yields the best performance, and boosts it above that of each

assay alone, thereby demonstrating how we can compensate for errors in one source

with a second, complementary data source.
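
To make the idea of cross-correction concrete, the sketch below combines readings from two sources for a single base via Bayes' rule, using Gamma likelihoods as in the data model of Figure 5.1. It is a deliberately simplified, per-base illustration: the actual model reasons over whole structures, and the source names, the (shape, scale) parameters, and the convention that low readings indicate pairing are all hypothetical values chosen for the example.

```python
import math


def gamma_logpdf(d, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at d > 0."""
    return ((shape - 1) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))


# Hypothetical (shape, scale) parameters per source and pairing state.
PARAMS = {
    "source_1": {"paired": (2.0, 1.0), "unpaired": (2.0, 3.0)},
    "source_2": {"paired": (2.0, 1.0), "unpaired": (2.0, 3.0)},
}


def posterior_paired(readings, prior_paired=0.5):
    """Posterior probability that a base is paired, given readings from several sources."""
    log_paired = math.log(prior_paired)
    log_unpaired = math.log(1.0 - prior_paired)
    # Sources are assumed conditionally independent given the pairing state.
    for source, d in readings.items():
        log_paired += gamma_logpdf(d, *PARAMS[source]["paired"])
        log_unpaired += gamma_logpdf(d, *PARAMS[source]["unpaired"])
    # Normalize in log space for numerical stability.
    top = max(log_paired, log_unpaired)
    z = math.exp(log_paired - top) + math.exp(log_unpaired - top)
    return math.exp(log_paired - top) / z


one = posterior_paired({"source_1": 1.0})
agree = posterior_paired({"source_1": 1.0, "source_2": 1.0})
conflict = posterior_paired({"source_1": 1.0, "source_2": 8.0})
```

When the two sources agree, the posterior is sharper than with either alone; when they conflict, the second source can overturn a misleading reading from the first.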

Overall, each combination of two data sources performs better than the individual

sources alone. However, interestingly, using all three data sources (maroon bar)


does not outperform the PARS/DMS-vitro combination. This is consistent with the

observation in [105] that many sequences have different structures in vivo as compared

to in vitro. Indeed, in the combinations with two sources, the ones including an in

vivo data source perform worst, even when the same DMS-seq assay is used (red bar).

These conflicts between in vivo and in vitro sources, in conjunction with the fact that

we are evaluating on “true” structures measured in an in vitro setting, can reduce the

ability of training to generalize to unseen test sequences, thereby decreasing prediction

accuracy when training on all three sources. On the sets Test-Tornado-TestSetA

and Test-Tornado-TestSetB, CONTRAfold-SE trained on Train-B(PARS,DMS-vitro)

generally does better than when trained on each individual dataset (Table 5.2). When using Test-

mRNA, we observe that the PARS/DMS-vitro combination does worse than DMS-

vitro alone (Table 5.2), perhaps because our “ground truth” in this set may not

reflect the true structure.

5.2.4 Classification of RNA-Binding Protein Targets

Although many RBPs recognize specific sequence motifs situated in single-stranded

RNA, the secondary structure context near the motif plays an important role in target

recognition [9, 91, 53]. [41] developed CapR, a method to analyze binding data from

RNA-protein interaction assays, specifically cross-linking immunoprecipitation high-

throughput sequencing (CLIP-Seq), to determine if RBPs bind specific structural

motifs. In addition, structure prediction with SeqFold [92] is shown to better classify

RBP targets determined by such an assay than simply using a threshold on the

number of motifs to distinguish bound and unbound targets.

Here, we give further validation of CONTRAfold-SE as a tool for predicting sec-

ondary structure as it affects regulatory functions. Specifically, we will show that

CONTRAfold-SE accurately distinguishes RBP sequence motifs that are bound from

those that are unbound, and suggest an associated structure specificity profile for


each RBP. We again use the RIP-chip study in yeast [54] in a setup similar to that in

[92] (see Methods), and predict the structure (specifically, the probability that each

base is paired) for each mRNA using CONTRAfold-SE trained on Train-B(PARS,

DMS-vitro) (i.e. the best-performing data combination). In addition, we evaluate

performance of CONTRAfold-SE trained on Train-B(DMS-vivo), since RBP binding

only occurs in vivo.

By aggregating the accessibility over motifs per gene, we obtain a score that we can

threshold to predict whether the gene is truly bound, thereby generating a receiver

operating characteristic (ROC) curve from different thresholds. Figure 5.3 compares

the ROC curves for CONTRAfold-SE on Train-B(PARS, DMS-vitro) computed for

10 motifs. The accessibility per gene is calculated using the same sum over motif

instances as in [92], but modified to account for gene length (see Methods), which is

a better measure than the sum itself (Table B.1). Compared to a similarly computed

score for SeqFold and the motif count baseline, CONTRAfold-SE yields better AUC

on 8 and 9 motifs, respectively, out of 10. CONTRAfold-SE also outperforms

CONTRAfold itself (for 6 motifs), as well as the motif count divided by gene

length (for 8 motifs). However, CONTRAfold-SE trained on Train-B(DMS-vivo)

does not generally outperform the algorithm trained on Train-B(PARS, DMS-vitro),

and the differences between CONTRAfold-SE and CONTRAfold are again small.

Finally, the same qualitative comparisons hold when using the aggregate score not

normalized by gene length (Table B.1).
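
The scoring and evaluation steps can be sketched as follows. The exact length normalization is described in Methods; here we simply assume that the per-gene score is the sum of accessibilities over motif instances divided by gene length. The functions gene_score and auc and the toy numbers are hypothetical; the AUC is computed as the fraction of (bound, unbound) gene pairs ranked correctly, which equals the area under the ROC curve.

```python
def gene_score(motif_accessibilities, gene_length):
    """Sum of accessibilities over motif instances, divided by gene length (assumed form)."""
    return sum(motif_accessibilities) / gene_length


def auc(bound_scores, unbound_scores):
    """Area under the ROC curve via pairwise comparisons; ties count one half."""
    wins = 0.0
    for b in bound_scores:
        for u in unbound_scores:
            if b > u:
                wins += 1.0
            elif b == u:
                wins += 0.5
    return wins / (len(bound_scores) * len(unbound_scores))


# Toy genes: (accessibility of each motif instance, gene length in nt).
bound = [gene_score([0.9, 0.8], 1000), gene_score([0.7], 500)]
unbound = [gene_score([0.2], 800),
           gene_score([0.9, 0.8, 0.1], 1000),
           gene_score([0.3], 1200)]
area = auc(bound, unbound)
```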

5.2.5 Nucleotide-Level Structure Contexts for RNA-Binding

Proteins

In addition to classification of RBP targets, CONTRAfold-SE can be used to study

the specific pairing partners of bases involved in bound RBP motifs. We return to the


[Figure 5.3: ten ROC panels, one per motif (PUF4-1, PUB1-1, PUF2-1, PAB1-1, KHD1-1, NAB2-1, YLL032C-1, VTS1-1, PIN4-1, NRD1-1); x-axis: False Positive Rate; y-axis: True Positive Rate; legend: CONTRAfold-SE(PARS,DMS-vitro), SeqFold, # Motifs.]

Figure 5.3: Classification of RNA binding protein targets into true bound versus false bound genes. The receiver operating characteristic (ROC) curve uses a thresholded sum of accessibilities over motifs, normalized by gene length (see Methods). CONTRAfold-SE outperforms SeqFold and a baseline of the motif count per gene.


more complex human setting and compute the structure profile and potential pairing

partners within and around the motif region for ten human RBPs. In a setup similar

to the evaluation of CapR [41], we use a CLIP-seq dataset to identify true bound

and false bound targets. CONTRAfold-SE is trained on PARS data from human

lymphoblastoid cell lines [130] (see Methods). Figure 5.4 shows our predictions for the

RNA binding protein FXR2 with motif WGGA. The top panel gives the average structure

profile for the true bound versus false bound regions. The significance of the separation

is measured by a Mann-Whitney-Wilcoxon test (middle panel). More interestingly,

we can compute a dotplot (bottom panel) showing the pairing probabilities for each

base partner. Although CapR is able to predict the probability that a specific RBP

sequence motif lies in various structure contexts (e.g. a hairpin region or a stem

region; Figure 4 in [41]), our algorithm predicts the exact pairing partners (i.e. the

coordinates of the stem itself). For FXR2 in particular, CapR reflects an affinity for

stems near the start of the motif; indeed, we also show a high affinity for pairedness,

and in addition, can identify that it most likely runs through positions 35-39 paired

with 24-20. Our results qualitatively agree with the four other motifs presented in [41]

(Figure B.11). Interestingly, for SF2ASF, CapR predicts the motif to be unpaired,

except with reduced uncertainty in the middle of the motif. We similarly show a drop

in pairedness in the middle of the motif, surrounded by higher accessibility, although

we find two possible scenarios: that the motif region is unpaired and flanked by two

stems (bottom left and top right of the dotplot) or lies in the hairpin loop of a stem

(top left and bottom right of the dotplot). CONTRAfold-SE can therefore be used

as a tool for nucleotide-level analysis of putative structure profiles determined by a

CapR screen.
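
The relationship between the dotplot and the structure profile can be sketched in a few lines: summing a base's pairing probabilities over all partners gives its overall pairing probability, while the largest entry identifies its most likely partner. The function pairing_profile and the toy dotplot are hypothetical; in the analysis above these quantities are additionally averaged over the true bound sequences.

```python
def pairing_profile(n, pair_prob):
    """Per-base pairing probability and most likely partner from a dotplot.

    pair_prob maps (i, j) with i < j to the probability that bases i and j pair.
    """
    paired = [0.0] * n
    best_partner = [None] * n
    best_prob = [0.0] * n
    for (i, j), p in pair_prob.items():
        # Each entry contributes to both bases of the pair.
        for a, b in ((i, j), (j, i)):
            paired[a] += p
            if p > best_prob[a]:
                best_prob[a] = p
                best_partner[a] = b
    return paired, best_partner


# Toy dotplot for a 10-nt region.
dotplot = {(0, 9): 0.6, (1, 8): 0.7, (1, 7): 0.2, (2, 7): 0.5}
paired, partner = pairing_profile(10, dotplot)
```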


[Figure 5.4 panels for FXR2 (WGGA): top, Pairing Probability for true versus false bound motifs; middle, Mann-Whitney-Wilcoxon −log10 p-value; bottom, Pairing Partners Heat Map for True Bound Genes (Sequence Position versus Sequence Position, positions 5 to 40, color scale 0 to $7 \times 10^{-3}$).]

Figure 5.4: Nucleotide-level structure prediction for the true bound sequences of RNA binding protein FXR2 with motif WGGA. The top panel shows the average pairing probability for the true bound versus false bound motifs. Dashed red lines mark the position of the motif within the 40bp region around it. The middle panel shows the significance (−log10 p-value) from a Mann-Whitney-Wilcoxon test on the separation of the structure profile distributions at each position. The bottom panel shows the dotplot of probabilities of specific basepairings in the region, such that darker squares correspond to a higher probability that those two base positions are paired.


5.2.6 Structure and Translation Efficiency under Oxidative

Stress

As we showed in previous chapters, in eukaryotes (and other species), secondary struc-

ture is thought to play an important role in regulation of translation [86], especially

during initiation [112]. In particular, PARS structure-probing data in endogenous con-

ditions [62] has been shown to correlate most strongly with ribosome density at translation

initiation, and in this thesis and in [96] we showed a similar result with DMS-seq data.

With our new method, CONTRAfold-SE, we can calculate this correlation at a more

reliable resolution. Indeed, we find an even stronger correlation (Spearman r = 0.2

versus r = 0.07 or r = 0.15) for a larger region around the AUG when we replace

accessibility calculated from raw DMS data with accessibility calculated from the

pairing probabilities of CONTRAfold-SE (trained on Train-B with DMS-vitro and

DMS-vivo; see Methods) (Figure 5.5). Notably, with CONTRAfold-SE, unlike with

the (sparse) raw DMS data, we can include many more genes in this analysis.
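
A minimal sketch of this kind of analysis is given below, assuming that accessibility is one minus the predicted pairing probability, averaged over a 40-nt window, and that the correlation is a Spearman rank correlation across genes. The helper names and the toy genes are hypothetical, and the sketch does not handle the degenerate case of constant inputs.

```python
def window_accessibility(pair_prob_per_base, start, width=40):
    """Mean accessibility (1 - pairing probability) over one window."""
    window = pair_prob_per_base[start:start + width]
    return sum(1.0 - p for p in window) / len(window)


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            result[idx] = mean_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy genes: (per-base pairing probabilities, log translation efficiency).
genes = [([0.9] * 60, -1.0), ([0.5] * 60, 0.2), ([0.1] * 60, 1.5)]
access = [window_accessibility(probs, start=0) for probs, _ in genes]
log_te = [te for _, te in genes]
r = spearman(access, log_te)
```

Repeating the computation for each window start position along the transcript would give one correlation value per position.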

Under stress conditions, the translation mechanism becomes even more compli-

cated, with genes relying, for example, on structured elements called internal ribo-

some entry sites (IRESs) to bypass cap-dependent recruitment of the mRNA. Several

features have been studied in correlation with varying protein expression and mRNA

levels in oxidative stress [127]; here we use CONTRAfold-SE to study the effect of sec-

ondary structure on these dynamics. Using ribosome profiling data gathered in yeast

at different time points under oxidative stress [44], we obtain estimates of transla-

tion efficiency (TE), which we correlate to the pairing probability over several regions

of interest, calculated from CONTRAfold-SE trained on Train-B(DMS-vivo) (Table

B.2). Average structure in the coding region correlates significantly with the change

in efficiency between baseline and 30 minutes (Spearman p = $8 \times 10^{-25}$). The corre-

lation is positive (Spearman r = 0.19), suggesting that more structured regions are


[Figure 5.5: two panels; x-axis: position [nt], 0 to 250; y-axis: Log Translation Efficiency Correlation with CONTRAfold-SE Accessibility, −0.05 to 0.2, for DMS-vitro (first panel) and DMS-vivo (second panel).]

Figure 5.5: Correlation between translation efficiency per gene and the accessibility in rolling windows of 40nt, as predicted by CONTRAfold-SE.


associated with a greater drop in efficiency under heavier stress conditions. Protein

expression measurements gathered via mass spectrometry also indicate that genes

with low protein expression under stress show enrichment for structure in the coding

region [127]. Interestingly, the correlation for the coding sequence feature is higher

than that for the 5’ UTR (r = 0.13, p = $6 \times 10^{-13}$) and much higher than that for the

3’ UTR (r = 0.05, p = $3 \times 10^{-3}$), indicating these regions may play different roles in

translation. Reassuringly, other metrics for aggregating pairing probability in each

region are similarly significant, and the same correlations hold for CONTRAfold-SE

on Train-B(PARS, DMS-vitro) (Table B.2). One possible explanation for this finding

might be that unfolding of highly structured RNA during elongation requires more

energy and cellular resources that are not available under stress conditions, although

the mechanism behind this hypothesis requires further investigation. Indeed, the

correlation is stronger for the 30min time point compared to the 15min time point

(Table B.2). Furthermore, the correlation to change in TE is stronger than to each

individual TE (Table B.2). Although the correlation to initial TE is strong (p-values

< $1 \times 10^{-7}$ for the coding sequence and 5’ UTR features), the correlation to change

in TE remains significant (p-value < 0.05) when conditioning on the initial TE (for

all features in replicate 2 and all features except the coding sequence in replicate 1),

indicating that structure may play a role during stress.

5.3 Discussion

Our method, CONTRAfold-SE, shows improved performance over existing methods,

whether or not structure-probing data is provided at test time. Strikingly, the CON-

TRAfold method alone is highly competitive in all our experiments. In many cases,

CONTRAfold even outperforms methods that utilize the structure-probing data. This


suggests that many of the previously demonstrated improvements in prediction per-

formance when using structure-probing data may be more simply obtained through

the use of more accurate statistical models. Our results suggest that incorporating probing

data into these models provides only a relatively modest improvement.

There are several potential reasons why the probing data did not provide a sig-

nificant boost in performance. Firstly, the probing data are very sparse: there is

typically no signal for many bases in an RNA transcript. The bases that are mea-

sured therefore may not provide further information to the already accurate statistical

models. Deeper sequencing of the probing libraries would help reduce data sparsity.

Secondly, and more importantly, there are complex structure-dependent biases in the

probing data used [37, 135], that we (and many others) do not account for. While

data from the selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE)

[83] method for structure-probing is thought to be less biased [36], a large-scale assay

for it has yet to be developed. As such, we were unable to evaluate the improve-

ment from data generated using this method, since CONTRAfold-SE requires at

least 50-100 sequences with probing data to learn a good model. In addition, [131]

observe that the probing data from current SHAPE chemistry still has a dependence

on structure context, suggesting that the expected improvement with SHAPE may

not be substantially greater than with the DMS and PARS datasets we used. A third

possible reason is that since RNA typically folds into several structures that co-exist,

the probing data is in fact derived from a mixture of structures. This would violate

the assumption in the CONTRAfold-SE model that the observed probing data orig-

inates from a single structure. Extending CONTRAfold-SE to account for the fact

that multiple structures could give rise to the probing data is not trivial, and is an

interesting topic for future work.

A key benefit of our probabilistic framework is its modularity, allowing the re-

placement of individual components with small changes to the learning and prediction


algorithms. One could, for instance, replace the structure model with an alternative

that allows higher-order structures such as pseudoknots, which are involved in var-

ious important functions, such as frameshifting during translation [115]. One could

also replace the Gamma distributions used in the data model with Poisson distribu-

tions to directly model the count nature of the sequencing data. In fact, it would

be straightforward to use any data model in the exponential family, a broad class

that covers most commonly used distributions. A more involved extension would be

to model the structure-dependent biases in the probing data, for instance, by having

separate models for bases in stacked pairs and at the end of hairpin loops, as done

in [117]. Finally, one could integrate other information about the structure of the

sequence as another data source. For example, information from solvent accessibility

models could be used to incorporate the fact that if an unpaired base is inaccessible, it

may falsely appear to be paired in a DMS probing assay [105].
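
The modularity argument can be illustrated with a small sketch in which the data model is just a log-likelihood function that can be swapped without touching the surrounding code. This is a per-base toy, not the CONTRAfold-SE implementation: the function names and all parameter values are hypothetical, and the Poisson variant assumes integer read counts as input.

```python
import math


def gamma_loglik(d, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at a real-valued signal d > 0."""
    return ((shape - 1) * math.log(d) - d / scale
            - math.lgamma(shape) - shape * math.log(scale))


def poisson_loglik(count, rate):
    """Log probability of an integer read count under a Poisson(rate) distribution."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def log_odds_paired(observation, loglik, paired_params, unpaired_params):
    """Log-likelihood ratio (paired vs. unpaired) under any per-base data model."""
    return loglik(observation, *paired_params) - loglik(observation, *unpaired_params)


# Same calling code, two different data models (hypothetical parameters).
gamma_odds = log_odds_paired(1.0, gamma_loglik, (2.0, 1.0), (2.0, 3.0))
poisson_odds = log_odds_paired(1, poisson_loglik, (1.0,), (5.0,))
```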

Our approach is most similar in spirit to that of [131], in that we adapt an ex-

isting model to the structure-probing data in a principled way; the method in [131]

learns a perturbation to the thermodynamic energies to try and match the posterior

probabilities of base pairs in the thermodynamic ensemble to the probing data. Like

[24] and [117], we make use of probabilistic models for the experimental data, and

integrate data from multiple sources. We use the probing data to update our prob-

abilistic model via Bayes' rule, which is reminiscent of SeqFold’s heuristic approach

of picking the peak in the structure distribution that is closest to the observed data

[92]. However, unlike these other works, we have unified these various components in

a probabilistic framework, thus enabling additional synergies amongst its parts. For

instance, because CONTRAfold-SE jointly estimates the data and structure mod-

els, the method can learn which probing data sources are more reliable and reduce

its reliance on noisier data sources in estimating the structure model. By contrast,

standalone estimation of data models (e.g. in [24, 131, 92, 117]) requires associated


ground-truth structures, and it is not immediately obvious how multiple data sources

can be combined. Furthermore, if the data are extremely noisy, fitting their distri-

bution alone on a small set of ground-truth structures could still lead to inaccurate

results when used in conjunction with a prediction algorithm that uses the likelihoods

as in [116, 36]. A recent review [36] provides a comprehensive critique of algorithms

that use structure-probing data for secondary structure prediction and describes an

alternative method that integrates structure-probing data in a probabilistic way. Our

method was developed independently and differs from the proposals in [36], as we

also enable parameter learning for the structure model using the probing data.

In an analysis capacity, we demonstrated two benefits of our method. First, we

showed that by incorporating one or more probing datasets in a principled proba-

bilistic framework, CONTRAfold-SE can help mitigate limitations in the structure-

probing data. Combining multiple datasets effectively increases data density, and

allows the model to perform some correction for the biases in the individual datasets.

These benefits are seen in our experiments, where combining PARS and DMS-vitro

data led to improved prediction performance beyond that of using either dataset alone.

We also found that combining datasets from in vitro and in vivo studies led to de-

graded prediction performance, consistent with the findings in [105] that in vitro

and in vivo RNA structures differ. Second, once trained on (one or more) structure-

probing datasets, CONTRAfold-SE can be used to provide per-base accessibility esti-

mates, filling in the many gaps in the sparse probing data. Indeed, we can explore the

structure profiles of RBP-bound sites and classify RBP targets with CONTRAfold-SE.

If we had to rely on structure-probing data alone, these analyses would be hampered

by the need to throw out many sequences without sufficient data, and reduced signal-

to-noise in the noisy raw data. Similarly, sparsity of coverage in specific parts of a

sequence, such as at the 5’ end for DMS-seq, reduces the number of data points over

which we can test hypotheses about the effect of RNA structure on gene regulation,

such as its effect on translation efficiency.

A key requirement for obtaining good performance with statistical methods for

RNA secondary structure prediction is the use of large, diverse sets of training data

[103]. However, most available structures are those of RNA found in bacteria and

viruses. Few structures are available for structural elements in the 5’ and 3’ UTRs of

mammalian mRNAs, which are increasingly found to play critical roles in regulating

gene expression. We suggest that genome-wide RNA structure-probing data can

plug this data gap, and will allow greatly improved prediction performance on this

important class of structures. Indeed, CONTRAfold-SE’s gains in performance over

CONTRAfold vary with the test set. The growing number of structure-probing

datasets will provide a rich source of data for elevating the performance of statistical

methods for RNA secondary structure prediction to the next level, by allowing the

effective training of more sophisticated structure models [139, 103]. Improvement

in the prediction of mammalian RNA structures, particularly of regulatory RNAs

and regulatory regions of mRNAs, and its integration into downstream discovery

applications, such as translational dynamics and functional elements, will certainly

lead to an expanded understanding of the role of RNA structure in gene regulation.

5.4 Methods

5.4.1 The CONTRAfold-SE Model

Our model assumes that structure-probing data are available at per-base resolution

– that there is a probing signal for some set of bases in a given RNA sequence.

The processed DMS-seq and PARS signals at each base are modelled using Gamma

distributions. While our probabilistic framework can accommodate any distribution

in the exponential family (see Appendix B), we chose the Gamma distribution as


a flexible family of distributions for modelling the continuous, unbounded, probing

signal. In addition, the Gamma distribution models the data well (Figure B.12),

and has been previously used for DMS-seq, SHAPE and CMCT reactivities [24]. We

assume that bases are independently modified, so that the resultant probing signals

are independent of the actual location within the RNA sequence. However, as the

reactivities of different bases could differ based on their identity and whether they are

paired (for instance, DMS preferentially modifies unpaired adenines and cytosines),

we have incorporated a separate distribution for each combination of base identity

(A, C, T or G) and pairedness state (paired or unpaired) for a total of 8 separate

Gamma distributions in our data model.

Formally, for an RNA sequence x of length L with secondary structure y and

associated structure-probing data $d = (d_1, \ldots, d_L)$, the distribution for the probing

signal $d_k$ at base k in the sequence is given by

$$ d_k \mid x_k, y \sim \mathrm{Gamma}\big(\alpha_{x_k,\mathrm{paired}(k,y)},\ \beta_{x_k,\mathrm{paired}(k,y)}\big) \tag{5.1} $$

where $x_k \in \{A, C, T, G\}$ is the identity of base k in the sequence, paired(k, y) denotes

whether base k in structure y is paired, and the Gamma density is defined as

$$ \frac{x^{\alpha-1}}{\Gamma(\alpha)\,\beta^{\alpha}} \exp(-x/\beta), \quad \text{for } x \sim \mathrm{Gamma}(\alpha, \beta). $$
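
As an illustration of the data model in Equation 5.1, the following minimal Python sketch evaluates the Gamma log-density for one probing dataset, with one (α, β) pair per combination of base identity and pairedness. The names `GAMMA_PARAMS`, `gamma_log_density` and `data_log_likelihood` are hypothetical, the numeric parameter values are placeholders rather than fitted values, and representing missing signal as `None` is an assumption made for the example.

```python
import math

# Hypothetical parameter table: (base, is_paired) -> (alpha, beta), where alpha is
# the shape and beta the scale of the Gamma density defined in the text.
# The numbers are placeholders for illustration, not fitted values.
GAMMA_PARAMS = {
    (base, paired): (2.0, 1.5 if paired else 3.0)
    for base in "ACGT"
    for paired in (True, False)
}


def gamma_log_density(d, alpha, beta):
    """Log of d^(alpha-1) exp(-d/beta) / (Gamma(alpha) beta^alpha), for d > 0."""
    if d <= 0:
        raise ValueError("the Gamma density is defined for positive signals only")
    return ((alpha - 1.0) * math.log(d) - d / beta
            - math.lgamma(alpha) - alpha * math.log(beta))


def data_log_likelihood(sequence, paired, signal, params=GAMMA_PARAMS):
    """Sum of per-base log densities; positions whose signal is None are skipped."""
    total = 0.0
    for base, is_paired, d in zip(sequence, paired, signal):
        if d is None:  # assumption: None marks a base without probing data
            continue
        alpha, beta = params[(base, is_paired)]
        total += gamma_log_density(d, alpha, beta)
    return total
```

Because bases are treated as independently modified, the log-likelihood of a whole sequence is a plain sum over the bases that carry data, which is what `data_log_likelihood` computes.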

Model Specification Let x be an RNA sequence of length L with structure y and S

associated structure-probing datasets d. We denote by $d^{(j)}_k$ the probing signal for the

jth data source at base k in the sequence. CONTRAfold-SE models the conditional

joint probability of the structure and probing data given sequence as

$$ P(y, d \mid x; w, \theta) = P(y \mid x; w) \prod_{j=1}^{S} \prod_{k=1}^{L} P\big(d^{(j)}_k \mid x_k, y; \theta^{(j)}\big). \tag{5.2} $$

Here, the structure model $P(y \mid x; w)$ is given by the conditional log-linear model of

CONTRAfold with parameters w, and $P(d^{(j)}_k \mid x_k, y; \theta^{(j)})$ is the Gamma distribution as

defined in Equation 5.1, with $\theta^{(j)}$ being the vector of parameters for the 8 Gamma

distributions for dataset j. In the absence of structure-probing data, the

CONTRAfold-SE model reduces to the CONTRAfold model.

Parameter Estimation Given a training set, we estimate parameters w and θ by

maximizing the conditional log-likelihood. For a training set $D = D_S \cup D_P \cup D_{S+P}$

of sequences with: i) only known structures and no probing data ($D_S$), ii) only probing

data but unknown (missing) structure ($D_P$), and iii) both known structure and

probing data ($D_{S+P}$), we find w, θ that maximize the (regularized) conditional

log-likelihood

$$ \sum_{(x,y) \in D_S} \log P(y \mid x; w) \;+\; \lambda \cdot \sum_{(x,d) \in D_P} \log \sum_{y} P(y, d \mid x; w, \theta) \;+\; \sum_{(x,y,d) \in D_{S+P}} \log P(y, d \mid x; w, \theta) \;+\; \log P(w) + \log P(\theta) \tag{5.3} $$

The hyperparameter λ controls the weighting of data-only training instances as

compared to instances with known structure, thus mitigating the adverse effects of

noisy, partial data on model estimation; this strategy is common in the machine

learning literature [89]. λ is set by cross-validation (Appendix B). We used the L-

BFGS algorithm [74] to find a local maximum of the likelihood. The key technical

challenge is that the gradient computations for the second term in the sum (the

likelihood for sequences with unknown structures) require inference. Fortunately,

the log-linear form of the structure model allows the data model to be represented as

additional base-level features in the structure model. [36] independently presents a

similar observation in the context of thermodynamic models for RNA structure, which


have a similar log-linear form. We thus adapted the existing inference algorithms

in CONTRAfold to efficiently compute the required gradients (see Appendix B for

further details). While the log-likelihood for the CONTRAfold-SE model is non-

convex, in practice our gradient-based parameter estimation algorithm achieves stable

parameter estimates (Appendix B).
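
To see why the data model can be represented as additional base-level features, it may help to expand the Gamma log-density of Equation 5.1 for a single base. The expansion below is an illustrative sketch in the notation of this section; the shorthand s for the pairedness state is introduced here for brevity, and the exact feature parameterization is the one given in Appendix B, not necessarily this one.

```latex
\log P(d_k \mid x_k, y)
  = (\alpha_{x_k,s} - 1)\,\log d_k \;-\; \frac{d_k}{\beta_{x_k,s}}
    \;-\; \log\Gamma(\alpha_{x_k,s}) \;-\; \alpha_{x_k,s}\,\log\beta_{x_k,s},
\qquad s = \mathrm{paired}(k, y).
```

For fixed θ and observed $d_k$, the right-hand side depends on the structure y only through whether base k is paired. Each base with data therefore contributes one score for the paired state and one for the unpaired state, and these scores add to the log-linear score of the structure model, so the same dynamic-programming inference applies.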

Predicting Secondary Structures CONTRAfold-SE has two options for gener-

ating predictions on query sequences: 1) prediction without probing data – ignoring

any probing data associated with the test example and returning the structure that

maximizes the expected accuracy based on the structure model, as in CONTRAfold,

and 2) prediction with probing data – re-estimating the model parameters based on

the single example of the query sequence with structure-probing data, initialized with

the parameters learned on the training set. In option 2, a data tuning parameter δ

controls a regularization term that shifts the model parameters either toward (small

δ) or away (large δ) from the data, and hence away or toward the learned parame-

ters on the training set. This allows us to control how much the algorithm relies on

(potentially) noisy data in the test sequence.

5.4.2 Dataset Setup

Training and Test Sets [103] show that the careful construction of training and

test sets is necessary for proper evaluation of statistical methods for structure pre-

diction. We follow their procedures to ensure that our training sets do not contain

significant similarity to our test sets. We briefly describe the construction procedure

(see Appendix B for more details).

Train-A has two components making up 238 sequences: 119 sequences with only

known secondary structure and 119 sequences with only structure-probing data (where

we chose 119 in order to obtain a 50%-50% split). We exclude any sequences that


share an RFAM match with the test sets or the yeast mRNA genes (so that these

can be included as sequences with structure-probing data for greater diversity in the

training set). This ensures that there is no family similarity between the training

and test sets. The data-only sequences are the shortest, most data-dense sequences,

namely those with the lowest data-sparsity score (see Appendix B). Train-A50% is

the same as Train-A. Train-A75% and Train-A100% are constructed similarly, but

contain either 75% or 100% sequences with structure-probing data, at the same total

number of sequences. Train-A75 contains the same sequences with known structure

as Train-A, but an additional set of data-only sequences at the same amount as in

Train-A75%; Train-A100 is constructed similarly.

Train-B contains the same 119 structure-only sequences as Train-A and the first

119 sequences with lowest data-sparsity scores, cycling through DMS-vitro,

DMS-vivo, and PARS data. To clarify, the sequences included are the same amongst

Train-B(PARS), Train-B(DMS-vitro), Train-B(DMS-vivo), etc, but the data that the model

is trained on is only PARS in the first case, only DMS-vitro in the second case,

only DMS-vivo in the third case, etc. Using the same set of sequences allows fair

comparison of these different data sources.

Test-SeqFold is the set of sequences in Table 1 of [92]. Test-Tornado-TestSetA and

Test-Tornado-TestSetB are TestSetA and TestSetB, respectively, from [103]. Test-

mRNA has the sequences presented in the conservation analysis of [105].

Running and Evaluating CONTRAfold-SE CONTRAfold training allows spec-

ification of several hyperparameters, set as described in Appendix B.

For evaluating performance, we define, in standard fashion, sensitivity, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$,

and positive predictive value (PPV), $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, where the number of true positives (TP)

is the number of correctly predicted basepairs, the number of false positives (FP) is

the number of basepairs predicted but not in the true structure, and the number of

false negatives (FN) is the number of true basepairs predicted to be unpaired. In

general, we desire both high sensitivity and high PPV. To combine sensitivity and

PPV into a single value, we use F-measure, $(2 \times \mathrm{sensitivity} \times \mathrm{PPV})/(\mathrm{sensitivity} + \mathrm{PPV})$. Accuracy is the number

of correctly predicted basepairs.
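
The three measures can be computed directly from sets of base pairs, as in the short Python sketch below. The function name `pair_metrics` is hypothetical, and the sketch assumes that a false negative is any true base pair missing from the prediction and that undefined ratios are reported as 0.

```python
def pair_metrics(predicted_pairs, true_pairs):
    """Return (sensitivity, PPV, F-measure) for two collections of base pairs (i, j)."""
    # Store each pair as a sorted tuple so that (i, j) and (j, i) compare equal.
    pred = {tuple(sorted(p)) for p in predicted_pairs}
    true = {tuple(sorted(p)) for p in true_pairs}
    tp = len(pred & true)   # correctly predicted base pairs
    fp = len(pred - true)   # predicted pairs absent from the true structure
    fn = len(true - pred)   # true pairs missing from the prediction (assumption)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + ppv
    f_measure = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f_measure
```

For example, with predicted pairs {(1, 10), (2, 9), (3, 7)} and true pairs {(1, 10), (2, 9), (3, 8), (4, 6)}, there are 2 true positives, 1 false positive and 2 false negatives, giving sensitivity 1/2, PPV 2/3 and F-measure 4/7.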

For each sequence in Table 5.2.1, we report, in standard cross-validation fashion

for CONTRAfold-SE and CONTRAfold, F-measure for the γ that gives the best

average F-measure over the remaining sequences (γ = 2 for most sequences). For

CONTRAfold-SE with data at prediction, we therefore fix γ = 2 and cross-validate

the data tuning parameter δ by reporting F-measure for the tuning parameter which

gives the best average F-measure over all remaining sequences.

For each sequence in the test sets in Tables 5.2.1 and 5.2.2, we calculate three

metrics over varying γ: AUC over the sensitivity and PPV, maximum accuracy, and

maximum F-measure. We then average across the sequences in the test set.

Structure-Probing Data DMS structure-probing data for yeast was obtained

from [105]. Since the assay described in [105] only modifies A and C bases, we

have no data model for G and T. Similar to [105], raw DMS counts were normalized

in windows of 250nt per gene. In each window, the A positions were normalized by

the median of the top 5% of A positions; the C positions were similarly normalized.

If the median was zero, we normalized by the mean. We ignored zero counts, as well

as G and T positions.
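
The normalization just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for the results: the function name `normalize_dms` is hypothetical, windows are taken to be non-overlapping, the fallback mean is taken over all positions of the same base in the window, and `None` marks positions without data.

```python
from statistics import mean, median

WINDOW = 250  # window length in nucleotides, as in the text


def normalize_dms(bases, counts, window=WINDOW, top_fraction=0.05):
    """Return per-base normalized DMS signal; None marks positions without data."""
    signal = [None] * len(counts)
    for start in range(0, len(counts), window):  # assumption: windows tile the gene
        stop = min(start + window, len(counts))
        for base in "AC":  # G and T positions are ignored
            idx = [i for i in range(start, stop) if bases[i] == base]
            if not idx:
                continue
            ranked = sorted((counts[i] for i in idx), reverse=True)
            n_top = max(1, round(top_fraction * len(ranked)))
            scale = median(ranked[:n_top])  # median of the top 5% of positions
            if scale == 0:
                # Assumption: the fallback mean is over the same base in this window.
                scale = mean(ranked)
            if scale == 0:
                continue  # every count is zero, so nothing can be normalized
            for i in idx:
                if counts[i] > 0:  # zero counts are treated as missing
                    signal[i] = counts[i] / scale
    return signal
```

Normalizing A and C positions separately keeps the different reactivities of the two bases from dominating the scale within a window.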

PARS structure-probing data for yeast was obtained from [62]. We ignored po-

sitions with a zero PARS score, and added 8 to all scores to make them positive.

Human PARS structure-probing data was obtained from the GM12878 strain mea-

sured in [130]. We again ignored positions with a zero PARS score, and added 15 to

all scores to make them positive.


RNA-Binding Protein Data For yeast RBP analysis, we identified bound and

unbound transcripts as in [92] and [54]. In the RIP-chip dataset of [54], we defined true

bound transcripts as those having FDR < 1%; for Ssd1, Khd1, Puf1-5 we used local

FDR < 1%; for She2 we used identified targets. The remaining transcripts identified

in the RIP-chip data are deemed to be false bound. Both true and false bound tar-

gets must contain at least one instance of the RBP motif. For all CONTRAfold-style

algorithms, we predict the structure over each yeast gene using parameters learned

on Train-B, and calculate the probability that a base is paired from the “posterior”

output mode of CONTRAfold (and CONTRAfold-SE) by summing the pairing prob-

abilities associated with each pairing partner for that base. The accessibility is 1

minus this pairing probability. For SeqFold pairing probability, we used 1 minus the

accessibility predicted by SeqFold.

For aggregating the accessibility over all motifs in a gene, we sum the individual

accessibility per position per motif per instance of motif on the gene, and then divide

by the gene length. As an alternative equivalent to [92], we compute the same score

but do not normalize by the gene length.
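
The accessibility computation and the per-gene aggregation can be sketched as follows. The function names and the dictionary representation of the pairing posteriors are assumptions made for the example; they do not describe the output format of CONTRAfold or CONTRAfold-SE.

```python
def accessibility(pair_probs, length):
    """Per-base accessibility: 1 minus the summed pairing probability.

    pair_probs maps 0-based index pairs (i, j) to the posterior probability
    that bases i and j are paired (a hypothetical representation).
    """
    paired = [0.0] * length
    for (i, j), p in pair_probs.items():
        paired[i] += p  # each pair contributes to both of its partners
        paired[j] += p
    return [1.0 - p for p in paired]


def motif_accessibility_score(access, motif_starts, motif_length, normalize=True):
    """Sum accessibility over every position of every motif instance in a gene."""
    total = sum(
        access[pos]
        for start in motif_starts
        for pos in range(start, min(start + motif_length, len(access)))
    )
    # normalize=True divides by the gene length; normalize=False gives the
    # unnormalized alternative described in the text.
    return total / len(access) if normalize else total
```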

For the classification task, we choose the 14 highlighted RBPs (or, more precisely,

RBPs with specific motifs) from Table S4 of [54] and filter out the ones with fewer than

200 instances of the motif per gene, leaving 10 RBPs.

For the human RBP analysis, we use a procedure similar to the evaluation in

[41] to identify bound and unbound transcripts. We obtain CLIP-seq data from the

doRiNA database [4] on the following human RBPs and their associated sequence

motifs: Pum2 (UGUANAUA), SRSF1 (GAAGAA), FXR1 (ACUK, WGGA), FXR2

(ACUK, WGGA), FMR1 7 (ACUK, WGGA) and FMR1 1 (ACUK, WGGA), where

we excluded QKI since there were fewer than 10 sequences in the true bound set

(described below). We use the RefSeq genes on assembly hg19 to identify transcripts


with at least one motif instance. From these, true bound motifs are those with a CLIP-

seq peak that starts within the motif, and false bound motifs are those such that there

is no CLIP-seq peak within 1000bp of the start of the motif. Since there are typically

many more false bound motifs than true bound motifs, we randomly sample the same

number of false bound motifs as true bound motifs. For each instance in the true and

false bound sets, we predict the structure of the 200bp window centered around the

start of the motif instance using parameters learned over the following training set:

the first 119 (structure-only) sequences in S151 [33] and the 119 sequences with lowest

data-sparsity score for the PARS data for the GM12878 strain measured in [130] over

sequences from UCSC RefSeq and Gencode v12 (hg19 assembly). The cross-validated

λ here was 1. We plot the structure profile on 20bp upstream and downstream

of the motif start. For one motif, FXR2(WGGA), we additionally check that two

modifications do not significantly affect the structure profile or the qualitative results:

we predict the structure of the 500bp window around the motif to show that the

window size does not matter, and we include all motifs in the false bound set to show

that subsampling does not matter.
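
The labelling of motif instances can be illustrated with the short Python sketch below. It is a simplified reading of the procedure, not the code used for the analysis: the function name `label_motifs` is hypothetical, coordinates are assumed to be on a single transcript, motif intervals are half-open, and the 1000bp criterion is applied to peak start coordinates.

```python
import random


def label_motifs(motifs, peak_starts, flank=1000, seed=0):
    """Split motif instances into true bound and subsampled false bound sets.

    motifs: list of (start, end) half-open intervals on one transcript.
    peak_starts: start coordinates of CLIP-seq peaks on the same transcript.
    """
    true_bound, false_bound = [], []
    for start, end in motifs:
        if any(start <= p < end for p in peak_starts):
            true_bound.append((start, end))   # a peak starts within the motif
        elif all(abs(p - start) > flank for p in peak_starts):
            false_bound.append((start, end))  # no peak within `flank` of the start
        # Motifs with a nearby peak that does not start inside them are left out.
    rng = random.Random(seed)
    n = min(len(true_bound), len(false_bound))
    return true_bound, rng.sample(false_bound, n)  # balance the two classes
```

Fixing the random seed makes the subsample reproducible, which is convenient when checking that the qualitative results do not depend on the subsampling.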

Oxidative Stress Analysis To find the correlation between accessibility and trans-

lation efficiency around the AUG, as in [96], we first calculate the accessibility at each

position (namely, the pairing probability predicted by CONTRAfold-SE) and average

in sliding windows of 40nt. Windows are normalized by the mean over windows on

that gene, and correlated with the translation efficiency from [96] of all genes that

cover that position.
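
The windowing step can be sketched as below. The function name `windowed_accessibility` is hypothetical, and the sketch assumes a step size of one nucleotide and that only full-length windows are used; neither detail is specified above.

```python
def windowed_accessibility(access, window=40):
    """Mean accessibility in each full sliding window, divided by the gene-wide
    mean over windows."""
    if len(access) < window:
        return []
    running = sum(access[:window])
    means = [running / window]
    for i in range(window, len(access)):
        running += access[i] - access[i - window]  # slide the window by one base
        means.append(running / window)
    gene_mean = sum(means) / len(means)
    # Leave the values unscaled if the gene-wide mean is zero.
    return [m / gene_mean for m in means] if gene_mean else means
```

Dividing by the gene-wide mean puts genes with different overall accessibility on a common scale before window values at the same position are correlated with translation efficiency across genes.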

For oxidative stress data, we used the average ribosome footprint levels (FP) and

mRNA levels from the ribosome profiling dataset from [44] to calculate translation

efficiency as FP / mRNA. Transcripts were restricted to those with FP ≥ 10 RPKM

for all three time points.
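
As a small worked example of this filter and ratio, the Python sketch below computes translation efficiency per gene and time point. The function name and the dictionary inputs are hypothetical, and skipping genes with a zero mRNA level is an added assumption to avoid division by zero.

```python
def translation_efficiency(footprints, mrna, min_fp=10.0):
    """TE = FP / mRNA per gene, keeping genes with FP >= min_fp at every time point.

    footprints, mrna: dicts mapping gene -> list of RPKM values, one per time point.
    """
    te = {}
    for gene, fp in footprints.items():
        levels = mrna.get(gene)
        if levels is None or len(levels) != len(fp):
            continue  # no matching mRNA measurements for this gene
        # Assumption: genes with a zero mRNA level at any time point are skipped.
        if all(f >= min_fp for f in fp) and all(m > 0 for m in levels):
            te[gene] = [f / m for f, m in zip(fp, levels)]
    return te
```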


Software Availability and References CONTRAfold-SE is freely available for

non-commercial use and can be downloaded at

http://www.cs.stanford.edu/~cpop/contrafoldse.html.

5.5 Conclusions

In this work, we explored the benefits of a fully probabilistic method for RNA

secondary structure prediction that incorporates high-throughput structure-probing

data. CONTRAfold-SE outperforms existing methods in terms of prediction accu-

racy, and also has several other features which make it a valuable tool for the analysis

of structure probing data.

CONTRAfold-SE’s ability to combine multiple probing datasets allows us to de-

rive insights about which types of data should be combined for optimal performance.

Another distinguishing feature of our computational method is that it augments our

knowledge about the structure of a sequence, allowing analyses that would not be

possible using the raw data alone where information is noisy or missing. For exam-

ple, we notably showed that we can obtain a better correlation between structure

and translation efficiency at the start codon when using CONTRAfold-SE to fill in

structure predictions for the bases that do not have sufficient or reliable data. At

the same time, as the quality and coverage of structure-probing data increases, our

experiments with different training set compositions show the exciting potential of a

statistical method like CONTRAfold-SE to yield further improvements in prediction

performance. Finally, CONTRAfold-SE is an invaluable tool in downstream

applications where structure informs function. We predicted nucleotide-level structural

contexts that define binding sites for RNA-binding proteins (RBPs), classified bound

versus unbound genes, and showed that RBP-bound sites are more accessible due to

active unfolding.

Chapter 6

Conclusions

6.1 Contributions

In this thesis, we presented two sets of high-throughput, genome-wide assays repre-

senting two biological processes defining translation and the factors involved in its

regulation. We built probabilistic models specific to each dataset in order to extract

useful information about biologically meaningful variables of interest from noisy and

sparse data – a process that would have otherwise relied on ad-hoc tuning parameters,

would have been less extensible and modular, and would have excluded a lot of sparse

data (positions or even whole genes).

We first described the concept of ribosome profiling. Ribosome profiling data re-

ports on the distribution of translating ribosomes, at steady-state, with codon-level

resolution. We presented a robust method to extract codon translation rates and

protein synthesis rates from these data, and identify causal features associated with

elongation and translation efficiency in physiological conditions in yeast. We showed

that neither elongation rate nor translational efficiency is improved by experimental

manipulation of the abundance or body sequence of the rare AGG tRNA. Deletion

of three of the four copies of the heavily used ACA tRNA showed a modest efficiency

decrease that could be explained by other rate-reducing signals at gene start. This

suggests that the correlation between codon bias and efficiency arises from selection for

codons to utilize translation machinery efficiently in highly translated genes. We also

showed a correlation between efficiency and RNA structure calculated both computa-

tionally and from recent structure probing data, as well as the Kozak initiation motif,

which may comprise a mechanism to regulate initiation.

Second, we explored ribosome pausing in a higher-order organism. By comparing

ribosome fragment counts for different alleles at the same SNP location, we found

the potential for a genetic basis for ribosome pausing. Other measures of ribosome

pausing – dwell times and slow outlier strength estimated from our translation model

– also indicate that variation between individuals could be driven by elongation-level

changes. In conjunction with other, albeit noisy, measures of biological features that

could play a role during elongation, such as RNA secondary structure, this work

points to an exciting direction for understanding the mechanism behind phenotypic

changes.

The strong correlation we observed between translation efficiency and RNA sec-

ondary structure motivated our last section: determining secondary structure with

higher accuracy. We presented CONTRAfold-SE, a probabilistic method for RNA

secondary structure prediction that incorporates structure-probing data by build-

ing on the CONTRAfold structure model and representing structure-probing data

as observations of the underlying (possibly unknown) structure. Our probabilistic

framework allows us to use any of the growing number of structure-probing datasets

that provide per-base measurements, and combine them together in a principled way.

Evaluated on benchmark datasets, CONTRAfold-SE outperforms competing meth-

ods even when our method does not have structure-probing data available for test

sequences. Importantly, we showed that using CONTRAfold-SE reveals a stronger

correlation between structure at the 5’ end and translation efficiency, and extended


the analysis to ribosome profiling datasets in stress conditions, where structure is also

important.

In summary, we presented a tool for analyzing ribosome profiling datasets, pre-

sented a tool for using structure-probing data in highly accurate structure prediction,

and demonstrated how these probabilistic methods can uncover important and novel

biological results.

6.2 Going Forward

Our models for translation and RNA secondary structure are important for under-

standing the latest experimental assays and integrating them with leading proba-

bilistic methods. Our translation model parallels RNA-seq analysis, dealing with the

intricacies of variable ribosome pausing and inter-gene differences. Our secondary

structure model can be trained on different datasets depending on the application

and can be used with high accuracy in both prediction tasks and application tasks

where structure is an input. With these models as a foundation, going forward we

can explore several interesting technical and biological directions.

In terms of our model for translation, a probabilistic framework for ribosome profil-

ing data alleviates many concerns of previous methods. The technique we present here

can therefore be used as a standard method for easily comparing different datasets.

Our analysis of both yeast and human ribosome profiling data can be extended to

other biological features that could be related to translation efficiency and ribosome

pausing – signals that could even be included as features in our model for a more

cohesive analysis pipeline. It would be interesting to use this integrated approach to

understand the relative importance of these features between organisms. The relation-

ship between biological features must also be better understood, perhaps by looking

at correlations between our estimated parameters and feature interaction terms.


The translation model itself could be extended in several ways. Incorporating

other biological processes, such as ribosome drop-off, is certainly possible as another

rate in the model. Since this might create unidentifiability in the parameters, it

would be useful to incorporate this either as a global prior or as a feature derived

from experimental measurements when these become available.

The location of the active codon within a ribosome fragment is traditionally de-

termined by looking for enrichment of the AUG codon in fragments at the 5’ end.

While this is usually sufficient for a reasonable sequencing depth, we can improve

this estimation by incorporating it into our model – for example, by using or learning

a weighted average of the ribosome fragment counts potentially contributing to that

active location. Learning this weight would require careful construction of the model

parameters, and would benefit from additional experimental information as a ground-

truth training set.

One of the other major concerns we saw was in dealing with alternative splicing.

This situation is similar to sequencing in polyploid genomes. There, sequence

fragments can map to regions that look similar but originate from distinct copies. The

goal is to identify the correct distribution. We can potentially draw upon ideas from

this adjacent area or from effects like frameshifting that impact ribosome distribution

[84]. For example, we can consider an EM-style algorithm that alternates between

inferring the correct fragment attribution and learning the remaining parameters,

perhaps using non-ambiguous fragment information as training data.

In terms of our structure model, by jointly modelling structure-probing data and

RNA secondary structure in a probabilistic framework, CONTRAfold-SE is a first

step towards describing the sequence and structure biases of various probing reagents.

Integrating multiple genome-wide structure-probing datasets with CONTRAfold-SE

allows for cross-correction of errors, and reveals principles on how datasets should be

combined. With the growing number of structure-probing datasets, we can exploit


the flexibility and modularity of this probabilistic model to include information about

biases in different contexts (e.g. ability to capture dynamics at ends of stems com-

pared to the middle of stems). Because CONTRAfold-SE can learn from a training

set where full structures are not available, we can also apply it to learn class-specific

structures where experimental methods are lacking. Finally, CONTRAfold-SE would

be useful in various applications that depend on function. It would be an excellent

tool for uncovering structural preferences in vivo, where structure-probing data alone

is sparse or noisy, but where understanding the underlying mechanism is essential

for physiological models of the cell. This tool could also be used in genome-wide

experiments where full experimental data is not available, but where we want to ex-

plore structural changes due to genetic variation, potentially identifying a mechanism

associated with an observed phenotype.

We still have much more to explore in the landscape of translation regulation. We

showed an initial set of results in human, a much more complex setting for under-

standing the interaction between RNA sequence and protein sequence. The multitude

of ribosome profiling datasets affords us a scaffold for comparison between species for

understanding mechanism differences, comparison between conditions for understand-

ing synthetic and physiological conditions, and comparison between individuals for

genetic variation analysis. In conjunction with other datasets that inform associated

mechanisms, such as genome-wide RNA structure datasets, we can make even more

informed connections between the causes and effects of translation in vivo.

Appendix A

Ribosome Profiling

A.1 Supplementary Methods

Feature Calculations for Outlier Analysis

Computationally predicted mRNA secondary structures and associated energies were

computed using Unafold v3.6 [80] with the default settings. In the outlier analy-

sis (feature “energy-down”), we ignored downstream regions with energies of 0 and

above. Structural features (e.g. stems) were counted based on the structure of the

whole mRNA strand, including characterized UTRs [87]. Genes without a charac-

terized UTR were ignored for all energy-related features. For experimentally derived

structure from the PARS method [62], we used the PARS score; genes without a

PARS score were ignored.

Protein domain boundaries were based on Pfam-A domains from Pfam [38]. Wob-

ble codons were set to be those with mismatches to the anticodon and those with an

“I” base in the tRNA that can recognize either a C or a U.

For RNA binding protein enrichment features in the outlier analysis, we computed

the Kullback-Leibler (KL) divergence between each of the 60 motifs from Table S4 in

[54] and positions along each coding sequence. We then calculated the mean/mini-

mum KL divergence in 3-codon windows 5 codons downstream of the active site and

took the mean/minimum score over all motifs.

Feature Calculations for Translation Efficiency

Evolutionary rate is adjusted dN/dS from [128]. The Kozak site motif is from [49];

we ignored this in genes without characterized UTRs. Energies are calculated as

described in the outlier analysis section above. Energies near the start codon are

those with the most significant Spearman correlation (as calculated by looking at

global maximums in spans of 20nt and taking the first such maximum). These energies

are corrected for multiple hypothesis testing as described in the sliding window energy

analysis. The tAI per gene or per window is the weighted average of all codons in

that range, excluding stop codons.

The RNA binding protein enrichment features are the scores reported from the

Significance Analysis of Microarray algorithm in Dataset S3 of [54]. We selected

the top fifteen RBPs with the largest number of RNA targets from Table S2.

Suggested “true” correlations between RNA binding protein enrichment and translation

efficiency are drawn from ribosome occupancy correlations using polysome profiling

(Table S3 in [54]), where possible. In other cases, we use additional literature: Puf4 is

most commonly studied in mRNA stability and localization and is also likely a player

in translation regulation [45]. As noted in the main text, scp160 has an additional

contradictory source indicating a positive role in translational efficiency [52]. Ypl184c

was proposed to repress translation due to its association with Pab1 and mRNAs un-

der translational control [54]. The proteins Cbc2, Gbp2, Nab3, and Nop56 do not

seem to have documented direct associations with translation.


A.2 Supplementary Figures and Tables

[Figure A.1: scatter panels of model flow and baseline average counts against measured protein abundance from Newman et al. (Pearson r = 0.7885 and r = 0.7755, respectively) and de Godoy et al. (Pearson r = 0.6802 and r = 0.6704, respectively).]

Figure A.1: Correlation between experimental measures of protein abundance, and estimated flow and average footprint count (baseline).


[Figure A.2: blot panels (A) and (B) with lanes labeled WT, QC, OE, and Deacyl, probed for tR(CCU), tL(CAA), and tT(UGU), and annotated with % charged and μg loaded.]

Figure A.2: Overexpression of tRNAArg(CCU) does not significantly alter amino acid charging levels. Bulk RNAs from strains as indicated were resolved at pH 5 by PAGE, transferred, and hybridized with oligonucleotide probes specific for tRNA species as indicated, and relative tRNAArg(CCU) levels and charging levels were evaluated as described in Materials and Methods. Solid arrows show deacylated tRNAs; dashed arrows show charged tRNAs; % charged refers to tRNAArg(CCU).


[Figure A.3: three panels showing, for each codon, the ratios rate AGG-OE / rate wt, rate AGG-QC / rate wt, and rate ACA-K / rate wt (vertical range 0.85 to 1.35).]

Figure A.3: The ratio between estimated mutant and wild-type rates. The mean (solid black line) and standard deviation (dashed line) are shown. ACA-K has a larger spread, but the manipulated codon (shown in red) is not an outlier in any sample. Codons are grouped and sorted by amino acid.


[Figure A.4: panels for mut ACA-K, mut AGG-OE, and mut AGG-QC showing the normalized footprint ratio for mut/wt averaged over occurrences 1 to 5 of each codon. Legend: window before, window after, window around; high tAI, low tAI, mid tAI, codon of interest.]

Figure A.4: The ratio of mutant to wild-type footprint count per codon. Counts are averaged over the first 5 occurrences of the codon per gene over all genes and presented for the three mutant samples. Counts are normalized by the average in the 15-codon window before (red line), after (green line), or around (blue line) the codon. We show a subset of the codons: the 5 with lowest tAI (dots), the 5 with highest tAI (squares), and the 6 with middle tAI (stars), in addition to the two codons ACA and AGG (diamonds). In each case, if the manipulated codon of interest induces a change in speed under the common hypothesis (lower for ACA-K and higher for AGG-OE and AGG-QC), we expect a corresponding peak or valley, respectively, in the presented ratio. However, the ratios at ACA and AGG are not significantly higher than 1 standard deviation (dotted line) or than the other representative codons. Left: Counts are raw footprint counts. Right: Counts are dwell-corrected footprint counts.


[Figure A.5: scatter panels of log(PA-wt) against log(PA-ACA-K) (62 increased, 138 reduced, r = 0.99), log(PA-AGG-OE) (88 increased, 112 reduced, r = 0.99), and log(PA-AGG-QC) (95 increased, 105 reduced, r = 0.99), with per-codon panels titled "Correlation between log(PA-mut/PA-wt) and % codon per gene" for mut ACA-K, mut AGG-OE, and mut AGG-QC.]

Figure A.5: The analysis of Figure A.2 repeated on flow instead of TE. As before, wild-type and mutant flows generally agree. Correlations between the ratio of mutant flow to wild-type flow and the percent of codon per gene are not higher for the manipulated codons compared to other codons, despite the dramatic change in tRNA abundance.


[Figure A.6: histograms for reduced TE genes and increased TE genes of three features: position per length from 5' end of slow outliers (means 0.4052 and 0.4517, p = 3.24e-07), strength of slow outliers in first 100 codons (means 2.6864 and 2.9256, p = 1.73e-01), and position per length from 5' end of ACA slow outliers (means 0.4214 and 0.4118, p = 7.59e-01).]

Figure A.6: Distribution of three features among reduced TE genes and increased TE genes in ACA-K. Distributions are skewed for reduced TE genes (with lower TE in mutant compared to wild-type) toward initiation signals that could confound the TE decrease. Slower-than-expected codons with an excess number of ribosome counts are defined formally as “outliers” (see Materials and Methods). Each feature distribution is calculated over all positions in the genes in the specified gene set (either reduced TE genes or increased TE genes) satisfying the specified criteria (a position that is a slow outlier, a position that is a slow outlier in the first 100 codons, or a position that is a slow outlier and an ACA codon). The feature distributions for reduced TE versus increased TE genes are distinct (p-values shown are calculated to be significant under a Kolmogorov-Smirnov test). Outlier positions are calculated in the ACA-K mutant.


[Figure A.7: two bar panels of Spearman correlation to log(TE). Cis-features: tAI (coding sequence); average elongation rate; energy exp vivo (5' UTR, 3' UTR, mRNA sequence, win 11 to 50); energy exp vitro (5' UTR, 3' UTR, mRNA sequence, win -11 to 28); energy (5' UTR, 3' UTR, mRNA sequence, win -16 to 23); KL divergence to Kozak (overall and at positions -6 to -1 and 3 to 5); length (coding sequence); mRNA abundance; evolutionary rate. Legend: significant, not significant. RNA binding protein enrichment: Khd1, Scp160, Bfr1, Nab2, Pab1, Pub1, Puf4, Cbc2, Gbp2, Nab3, Nop56, Npl3, Nrd1, Nsr1, Ypl184c. Legend: expected; not expected; expected, not sig; not expected, not sig; unknown.]

Figure A.7: Correlation between log(TE) and gene-level features. Cis-features and RNA binding protein enrichment are described in Materials and Methods. Significance threshold is p = 0.05. (See Appendix A for how expected correlations for the RNA binding proteins were determined.)


[Figure A.8: dwell-corrected, flow-normalized counts averaged per position across genes (vertical scale ×10^-4) against position [codon] (0 to 1000). Curves: all positions, no slow outliers; all positions; uniform.]

Figure A.8: Dwell-corrected footprint counts normalized by flow. Counts are geometrically averaged per position over all genes aligned by start codon (ignoring 0 footprint counts). Removing slow outliers (red curve) reduces the peak in density at ≈44 codons (132 nt).


[Figure A.9: "Rates and tAI in 17-codon windows": codon translation rate and tAI against position [codon] (0 to 400).]

Figure A.9: Codon translation rates versus tAI. The tAI in sliding windows of 17 codons is averaged across all the genes aligned by start codon (red curve). The same analysis with our estimated codon translation rates (scaled up by 1000) (black curve) shows that rates at the 5' end are not lower compared to the rest of the gene.


[Figure A.10: two histograms against position [nt] (up to 4500): number of non-outliers (vertical scale ×10^5) and number of slow outliers (vertical scale ×10^4).]

Figure A.10: Histograms of positions of slow outliers and non-outliers are similar.


Figure A.11: Two different initializations of the parameters for the translation model. Estimated parameters are nearly identical, demonstrating the model is robust to initialization.


tRNA (anticodon)   RPM (ACA-K)   RPM (wt)   RPM (ACA-K) / RPM (wt)
tK(UUU)D                    78         80     0.98
tY(GUA)F1                   19         16     1.19
tM(CAU)C                    11         11     1.00
tD(GUC)B                   218        251     0.87
tE(UUC)B                   582        428     1.36
tN(GUU)C                   225        166     1.36
tS(UGA)P                   122        148     0.82
tP(AGG)N                    24         27     0.89
tC(GCA)B                    82         58     1.41
tQ(UUG)B                   103        106     0.97
tW(CCA)G1                   35         44     0.80
tG(UCC)O                   143         96     1.49
tT(UGU)G1                   25         75     0.33
tR(UCU)E                   138        172     0.80
tA(AGC)D                    72         42     1.71
tT(CGU)K                     9          9     1.00
tV(AAC)E1                  129         82     1.57
tQ(CUG)M                   166        138     1.20
tA(UGC)Q                     3          3     1.00
tL(UAA)J                    72         81     0.89
tI(AAU)B                    98         46     2.13
tH(GUG)E1                  328        266     1.23
tT(AGU)B                   152        141     1.08
tF(GAA)B                   124        115     1.08
tK(CUU)C                  1328       1914     0.69

Table A.1: Counts of tRNA in RPM (number of reads per million) in ACA-K and wild-type. The threonine tRNA recognizing the ACA codon (highlighted; tT(UGU)G1) is reduced to 1/3 of the wild-type level.


Category          Features

Position          Distance from 5' end (pos)
                  Distance from 5' end per length (pos-per-len)
                  Distance from 3' end (pos-from-end)

Structure         Minimum free energy (energy-down)
                  In vitro energy (vitroDMS-energy-down) [105]
                  In vivo energy (vivoDMS-energy-down) [105]
                  In vitro inverse-energy (PARS-invenergy-down) [62]
                  Number of hairpins (hairpins-down)
                  Number of internal loops (internal-down)
                  Number of multi-loops (multi-down)
                  Number of stems (stems-down15)
                  Number of GC pairs in stems (stemsGC-down15)
                  Number of stems 12nt downstream (stems-down12)
                  Number of stems 9nt downstream (stems-down9)

Protein folding   Active site is inside a protein domain (is-in-domain)
                  Domain ends 30 codons upstream (is-end-domain-up-30)

Wobble bases      Is wobble base at P-site (is-wobble)

tRNAs Reuse       Distance from same codon upstream (dist-prev-codon)
                  Distance from upstream iso-accepting tRNA (dist-prev-trna)
                  Is codon in window upstream (is-prev-codon-close)
                  Is iso-accepting tRNA in window upstream (is-prev-trna-close)

RBPs              KL divergence combined via mean (rbp-mean)
                  KL divergence combined via min (rbp-min)

Peptide           Charge of active codon (charge)
                  Mean charge in window upstream (cluster-charge-up-1)
                  Arg/Lys fraction in window upstream (cluster-ArgLys-up-1)
                  Pro fraction in P, E sites (pair-Pro-up)
                  Pro fraction downstream (pair-Pro-down)

Global            Length (len)
                  Abundance (abund)

Table A.2: Eight categories of potential correlates to outlier strength. Distances are relative to active codon; upstream windows are 10 codons long. Structure is calculated in 25nt windows 15nt downstream and, unless indicated, derived computationally. RBP (RNA binding protein) motifs [54] are aggregated by KL divergence in 3-codon windows 5 codons downstream.


Feature                 r-value   p-value   Mean (Slow)   Std (Slow)   Mean (Non)   Std (Non)
pos                      -0.046     0          355.02       371.58       415.29      406.38
pos-per-len              -0.148     0            0.47         0.29         0.52        0.28
pos-from-end              0.126     0          396.23       398.43       381.56      388.88
energy-down               0         0.6         -2.65         1.75        -2.62        1.72
vitroDMS-energy-down     -0.013     0            0.49         0.15         0.48        0.15
vivoDMS-energy-down      -0.028     0            0.51         0.17         0.50        0.17
PARS-invenergy-down      -0.017     0            0.32         0.55         0.31        0.54
hairpins-down             0.024     0            5.92         5.00         5.79        4.98
internal-down             0.017     0            1.13         1.33         1.10        1.32
multi-down                0.023     0            0.18         0.43         0.17        0.42
stems-down15              0.024     0            5.92         5.00         5.79        4.98
stemsGC-down15            0.021     0            2.33         2.28         2.26        2.26
stems-down12              0.025     0            5.94         5.00         5.78        4.98
stems-down9               0.027     0            5.97         5.01         5.77        4.97
is-in-domain             -0.022     0            0.72         0.44         0.73        0.44
is-end-domain-up-30      -0.005     0            0.004        0.06         0.004       0.06
is-wobble                -0.032     0            0.42         0.49         0.46        0.49
dist-prev-codon          -0.031     0           43.48        57.95        46.96       63.11
dist-prev-trna           -0.024     0           35.58        47.50        37.72       50.44
is-prev-codon-close       0.012     0            0.26         0.44         0.25        0.43
is-prev-trna-close        0.007     0            0.30         0.45         0.29        0.45
rbp-mean                 -0.001     0.3         11.62         0.69        11.60        0.69
rbp-min                  -0.001     0.2          2.57         1.11         2.56        1.11
charge                   -0.006     0            0.009        0.52         0.02        0.50
cluster-charge-up-1       0.017     0            0.01         0.18         0.01        0.17
cluster-ArgLys-up-1       0.024     0            0.12         0.11         0.11        0.10
pair-Pro-up               0.092     0            0.05         0.15         0.04        0.13
pair-Pro-down            -0.01      0            0.04         0.14         0.04        0.14
len                       0.061     0          750.25       541.11       795.85      572.19
abund                     0.016     0           13.41        67.88         9.54       51.27

Table A.3: Spearman correlation between outlier strength and features, separated by type and highlighted if significant. Outliers (slow and non) are calculated for a threshold of 0. See Appendix A for more discussion.


                 Regression          Regression - Kozak    Null Model
                 Mean      Std       Mean      Std         Mean      Std
Error            0.7549    0.0508    0.8443    0.0486      0.9674    0.0581
Error (Train)    0.7499    0.0057    0.8438    0.0051      0.9569    0.0066
Spearman r       0.6614    0.0278    0.5161    0.0382      0.0385    0.0491
Spearman p       0.0000    0.0000    0.0000    0.0000      0.4325    0.3022
Pearson r        0.6224    0.0329    0.5094    0.0381      0.0307    0.0483
Pearson p        0.0000    0.0000    0.0000    0.0000      0.4587    0.3028

Table A.4: Performance of TE regression model. Error (should be low) and correlation (should be high) between predicted and actual TE are measured on 100 random test sets of genes not used during model training. Performance drops in a null model learned on randomized TE labels (last column). Performance also drops when using the original Kozak motif (middle column). Error on the training set is included to show that our model generalizes to genes not used in training (it is close to test set error). See Materials and Methods for further details.


Result               c=1       c=10      c=1000    c=10000   c=100000   No µcm
µc (c=100)      r    1.000     1.000     1.000     1.000     1.000      1.000
                p    10^-202   10^-206   10^-150   10^-105   10^-95     10^-96
µcm (c=100)     r    1.000     1.000     1.000     0.983     0.838      NA
                p    0         0         0         0         0          NA
Jm (c=100)      r    1.000     1.000     1.000     1.000     0.999      0.994
                p    0         0         0         0         0          0
tAI             r    0.210     0.210     0.210     0.213     0.217      0.211
                p    0.104     0.104     0.104     0.100     0.094      0.103
tRNA (Cy5)      r    0.144     0.144     0.140     0.140     0.140      0.133
                p    0.380     0.4380    0.393     0.393     0.393      0.420
tRNA (Cy3)      r    0.144     0.144     0.140     0.140     0.140      0.133
                p    0.417     0.417     0.429     0.429     0.429      0.456
PA [88]         r    0.7885    0.7885    0.7886    0.7889    0.7882     0.7782
PA [26]         r    0.6802    0.6802    0.6802    0.6802    0.6786     0.6710

Table A.5: Summary of main results for model variations. The first five columns are models with different constants for the second term in the objective function and the last column is a model without µcm parameters (see Materials and Methods). Rows 1-3 represent correlation between the parameters in our model and in the model variation. Rows 4-6 represent correlation between codon translation rates in model variations and codon bias measures. Rows 7-8 represent correlation between protein synthesis rates in model variations and protein abundance measures. Results are similar to the ones reported for the model used throughout the paper (const c = 100).

Appendix B

RNA Secondary Structure

B.1 Supplementary Methods

B.1.1 Model Specification

Let $x$ be an RNA sequence of length $L_x$ with structure $y$. Let $S_x$ be the set of indices of available structure-probing datasets for sequence $x$, so that $S_x \subseteq \{1, \ldots, S\}$, where $S$ is the total number of structure-probing datasets. We denote the collection of probing signals as $d$, where $d_k^{(j)}$ is the probing signal in the $j$th data source at base $k$ in the sequence. CONTRAfold-SE models the conditional joint probability of the structure and probing data given sequence as

\[
P(y, d \mid x; w, \theta) = P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) \tag{B.1}
\]

In this equation,

• $P(y \mid x; w)$ is given by the conditional log-linear model of CONTRAfold with parameters $w$,

• $P(d_k^{(j)} \mid x_k, y; \theta^{(j)})$ is the Gamma distribution for the probing data for dataset $j$,

• $\theta^{(j)} = \{\alpha_{b,p}^{(j)}, \beta_{b,p}^{(j)} \mid b \in \{A, C, T, G\},\ p \in \{\text{paired}, \text{unpaired}\}\}$ is the set of Gamma parameters for dataset $j$,

• $\theta = \cup_{j=1}^{S} \theta^{(j)}$ is the set of Gamma parameters over all datasets.

In the absence of structure-probing data, the CONTRAfold-SE model reduces to the

CONTRAfold model.
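The data term of Equation B.1 for a single probing dataset can be sketched as below. This is an illustrative sketch, not the CONTRAfold-SE implementation: it assumes the Gamma distribution is parameterized by shape α and scale β, that all probing signals are strictly positive, and that the structure is summarized by a per-base paired/unpaired flag. The function names and the `theta` dictionary layout are hypothetical.

```python
import math

def gamma_logpdf(d, alpha, beta):
    """Log density of Gamma(shape=alpha, scale=beta) at d > 0."""
    return ((alpha - 1.0) * math.log(d) - math.lgamma(alpha)
            - alpha * math.log(beta) - d / beta)

def data_loglik(seq, paired, signal, theta):
    """Sum over bases k of log P(d_k | x_k, y; theta) for one dataset.

    seq:    string of bases, e.g. "ACGT"
    paired: list of bool, True where the base is paired in structure y
    signal: list of positive probing values, one per base
    theta:  dict mapping (base, "paired" | "unpaired") -> (alpha, beta)
    """
    total = 0.0
    for base, is_paired, d in zip(seq, paired, signal):
        state = "paired" if is_paired else "unpaired"
        alpha, beta = theta[(base, state)]
        total += gamma_logpdf(d, alpha, beta)
    return total
```

Adding this quantity for every dataset in $S_x$ to $\log P(y \mid x; w)$ gives the log of Equation B.1; with no datasets the sum is empty and only the CONTRAfold term remains.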

Parameter Estimation

The parameters of the CONTRAfold-SE model, $w$ and $\theta$, are estimated by maximizing the conditional log-likelihood of the known structures and probing data, given sequence. Formally, for a training set $\mathcal{D} = \mathcal{D}_S \cup \mathcal{D}_P \cup \mathcal{D}_{S+P}$ of sequences with: i) only known structures and no probing data ($\mathcal{D}_S$), ii) only probing data but unknown (missing) structure ($\mathcal{D}_P$), and iii) both known structure and probing data ($\mathcal{D}_{S+P}$), we find $w, \theta$ that maximize the (regularized) conditional log-likelihood

\[
\ell(w, \theta; \mathcal{D}) = \sum_{(x,y) \in \mathcal{D}_S} \log P(y \mid x; w)
+ \lambda \cdot \sum_{(x,d) \in \mathcal{D}_P} \log \sum_{y} P(y, d \mid x; w, \theta)
+ \sum_{(x,y,d) \in \mathcal{D}_{S+P}} \log P(y, d \mid x; w, \theta)
\]

The hyperparameter $\lambda$ is added as in [89] to temper the use of partial evidence against ground truth. The main difficulty with solving this optimization problem is that the likelihood for training instances with missing structures requires summing the probability in Equation B.1 over all possible structures. In addition, the parameters for the Gamma distributions ($\alpha_{b,p}^{(j)}, \beta_{b,p}^{(j)}$) are constrained to be positive. We handle this constraint by parameterizing these Gamma parameters in terms of unconstrained variables $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$ such that $\alpha_{b,p}^{(j)} = \exp(\tilde{\alpha}_{b,p}^{(j)})$, $\beta_{b,p}^{(j)} = \exp(\tilde{\beta}_{b,p}^{(j)})$. We then solve the optimization problem over these new variables.

Gradient Computation

We use the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm

for parameter estimation, and the key technical detail is how the gradient of the

conditional log-likelihood is computed. We will discuss the gradient computation for

the likelihood over the training examples in DS ,DS+P , and DP in turn; the complete

gradient is simply the sum of these three gradients.

Gradient for Examples in DS

The gradient for each term in the sum $\sum_{(x,y) \in \mathcal{D}_S} \log P(y \mid x; w)$ is simply the gradient, for the particular training example, of the conditional log-likelihood of the original CONTRAfold model. As

\[
P(y \mid x; w) = \frac{\exp(w^T F(x, y))}{\sum_{y'} \exp(w^T F(x, y'))}
\]

(the features are RNA structural motifs whose descriptions may be found in the original CONTRAfold paper), this is given by

\[
\begin{aligned}
\nabla_w \log P(y \mid x; w) &= \nabla_w \Big[ w^T F(x, y) - \log \sum_{y'} \exp(w^T F(x, y')) \Big] \\
&= F(x, y) - \sum_{y'} P(y' \mid x; w) F(x, y') \\
&= F(x, y) - E[F(x, y)]
\end{aligned}
\]

A detailed description of how this gradient (in particular, the feature expectations

with respect to the model E[F (x, y)]) may be computed efficiently via dynamic

programming is found in the Supplementary Material of the original CONTRAfold

manuscript [33].
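The identity above can be checked on a toy log-linear model in which the candidate structures are enumerated explicitly. This is only a sketch for intuition: CONTRAfold computes the expectation by dynamic programming rather than enumeration, and the function names and feature vectors here are hypothetical.

```python
import math

def _scores(w, feats):
    """Score w^T F(x, y') for every enumerated candidate structure."""
    return [sum(wi * fi for wi, fi in zip(w, f)) for f in feats]

def log_prob(w, feats, idx):
    """log P(y | x; w) for the candidate at position idx."""
    scores = _scores(w, feats)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_z

def grad_log_prob(w, feats, idx):
    """Gradient F(x, y) - E[F(x, y')] with the expectation under P(. | x; w)."""
    scores = _scores(w, feats)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    expected = [sum(e / z * f[i] for e, f in zip(exps, feats))
                for i in range(len(w))]
    return [feats[idx][i] - expected[i] for i in range(len(w))]
```

Comparing `grad_log_prob` with a central finite difference of `log_prob` is a quick way to confirm the derivation on small examples.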


Gradient for Examples in DS+P

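Substituting the expression for the CONTRAfold-SE model (Equation B.1), we see that each term in the sum $\sum_{(x,y,d) \in \mathcal{D}_{S+P}} \log P(y, d \mid x; w, \theta)$ decomposes as:

\[
\log P(y, d \mid x; w, \theta) = \log P(y \mid x; w) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big)
\]

The first term is the original conditional log-likelihood of the CONTRAfold model. The second term is the sum of log-likelihoods of the various probing data Gamma distributions, for which gradients may be computed analytically by straightforward differentiation. Let $\alpha = \alpha^{(j)}_{x_k, \text{paired}(k,y)}$, $\beta = \beta^{(j)}_{x_k, \text{paired}(k,y)}$, then

\[
\begin{aligned}
\log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= (\alpha - 1) \log d_k^{(j)} - \log \Gamma(\alpha) - \alpha \log \beta - \frac{d_k^{(j)}}{\beta} \\
\frac{\partial}{\partial \alpha} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= \log d_k^{(j)} - \psi(\alpha) - \log \beta \\
\frac{\partial}{\partial \beta} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= -\frac{\alpha}{\beta} + \frac{d_k^{(j)}}{\beta^2}
\end{aligned}
\]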

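We can then use the chain rule to compute the gradients with respect to the unconstrained variables $\tilde{\alpha} \equiv \tilde{\alpha}^{(j)}_{x_k, \text{paired}(k,y)}$, $\tilde{\beta} \equiv \tilde{\beta}^{(j)}_{x_k, \text{paired}(k,y)}$. As $\frac{\partial \alpha}{\partial \tilde{\alpha}} = \alpha$, $\frac{\partial \beta}{\partial \tilde{\beta}} = \beta$,

\[
\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= \big( \log d_k^{(j)} - \psi(\alpha) - \log \beta \big)\, \alpha \\
\frac{\partial}{\partial \tilde{\beta}} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= \Big( -\frac{\alpha}{\beta} + \frac{d_k^{(j)}}{\beta^2} \Big)\, \beta = -\alpha + \frac{d_k^{(j)}}{\beta}
\end{aligned}
\]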

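Note that the gradients with respect to the other Gamma parameters (i.e., $\tilde{\alpha} \neq \tilde{\alpha}^{(j)}_{x_k, \text{paired}(k,y)}$, $\tilde{\beta} \neq \tilde{\beta}^{(j)}_{x_k, \text{paired}(k,y)}$) will be 0. Therefore, more generally, for any of the unconstrained Gamma distribution parameters $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$, we have that

\[
\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}_{b,p}^{(j)}} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p]\, \Big( \log d_k^{(j)} - \psi\big(\alpha_{b,p}^{(j)}\big) - \log \beta_{b,p}^{(j)} \Big)\, \alpha_{b,p}^{(j)} \\
\frac{\partial}{\partial \tilde{\beta}_{b,p}^{(j)}} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) &= I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p]\, \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big)
\end{aligned}
\]

Here, $I[\text{condition}]$ is the indicator variable that is 1 when condition is true, and 0 otherwise.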
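The gradients with respect to the unconstrained Gamma parameters can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it uses the shape/scale Gamma density from the derivation, treats a single observation with the indicators already equal to 1, and approximates the digamma function ψ with a standard recurrence plus asymptotic series (a helper written for this example, not part of CONTRAfold-SE). All function names are hypothetical.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via recurrence and asymptotic series."""
    result = 0.0
    while x < 6.0:           # shift x upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    result += (math.log(x) - 0.5 * inv
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result

def gamma_logpdf_unconstrained(d, a_t, b_t):
    """Gamma log density with alpha = exp(a_t), beta = exp(b_t)."""
    alpha, beta = math.exp(a_t), math.exp(b_t)
    return ((alpha - 1.0) * math.log(d) - math.lgamma(alpha)
            - alpha * math.log(beta) - d / beta)

def grad_unconstrained(d, a_t, b_t):
    """Analytic gradient with respect to (a_t, b_t), following the chain rule."""
    alpha, beta = math.exp(a_t), math.exp(b_t)
    g_a = (math.log(d) - digamma(alpha) - math.log(beta)) * alpha
    g_b = -alpha + d / beta
    return g_a, g_b
```

Because the optimizer works on `a_t` and `b_t`, any real value maps to a valid positive shape and scale, which is the purpose of the exponential reparameterization.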

Gradient for Examples in DP

Consider a single term in the (outer) sum $\sum_{(x,d) \in \mathcal{D}_P} \log \sum_y P(y, d \mid x; w, \theta)$, which corresponds to the log-likelihood for a single training example in $\mathcal{D}_P$. This is the challenging case due to the sum over exponentially many possible structures $y$. Following the argument in Theorem 19.6 in [64], or by directly differentiating this log-likelihood term, we find that the gradient of the log-likelihood is equal to the gradient of the expected log-likelihood, where the expectation is taken over the posterior distribution over unknown structures $y$ given the observed probing data $d$, $Q(y) = P(y \mid d, x; w, \theta)$. Note that $Q(y)$ is the posterior evaluated at the particular parameter values $w, \theta$ for which we wish to compute a gradient, and it is then held fixed: $Q(y)$ is treated as a constant when differentiating with respect to $w$ or $\theta$. More formally,

\[
\nabla_{w,\theta} \log \sum_y P(y, d \mid x; w, \theta) = \nabla_{w,\theta}\, E_Q[\log P(y, d \mid x; w, \theta)]
\]


Expanding the expected log-likelihood given by the model in Equation B.1,

\[
\begin{aligned}
E_Q[\log P(y, d \mid x; w, \theta)] &= \sum_y Q(y) \log P(y, d \mid x; w, \theta) \\
&= \sum_y Q(y) \log \Big[ P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) \Big] \\
&= \sum_y Q(y) \log P(y \mid x; w) + \sum_y Q(y) \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) \\
&= \sum_y Q(y) \log P(y \mid x; w) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \sum_y Q(y) \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big)
\end{aligned}
\]

we see that the likelihood decomposes (additively) over the CONTRAfold model and over each of the separate Gamma distributions. We will describe the gradient computation for each of these components in turn. Conceptually, these are similar to the gradient computation when structures are known, except that terms involving the sufficient statistics (e.g., features $F(x, y)$) will be replaced by expected sufficient statistics; the challenge is to compute these efficiently.

Gradient over w The required gradient is given by

\[
\sum_y Q(y) \cdot \nabla_w \log P(y \mid x; w) = \sum_y Q(y) F(x, y) - \sum_{y'} P(y' \mid x; w) F(x, y')
\]

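We can compute the required feature expectations over $Q(y)$ (the first term) by adapting the existing routines in CONTRAfold for computing feature expectations over the CONTRAfold model (the second term). We rewrite $Q(y)$ in terms of known quantities: the model probabilities in Equation B.1 and the form of the CONTRAfold log-linear model, $P(y \mid x; w) = \exp(w^T F(x, y)) / \sum_{y'} \exp(w^T F(x, y'))$:

\[
\begin{aligned}
Q(y) &= P(y \mid d, x; w, \theta) \\
&= \frac{P(y, d \mid x; w, \theta)}{\sum_{y'} P(y', d \mid x; w, \theta)} \\
&= \frac{P(y \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big)}{\sum_{y'} P(y' \mid x; w) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y'; \theta^{(j)}\big)} \\
&= \frac{\exp(w^T F(x, y)) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big)}{\sum_{y'} \exp(w^T F(x, y')) \prod_{j \in S_x} \prod_{k=1}^{L_x} P\big(d_k^{(j)} \mid x_k, y'; \theta^{(j)}\big)} \\
&= \frac{\exp\Big(w^T F(x, y) + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big)\Big)}{\sum_{y'} \exp\Big(w^T F(x, y') + \sum_{j \in S_x} \sum_{k=1}^{L_x} \log P\big(d_k^{(j)} \mid x_k, y'; \theta^{(j)}\big)\Big)}
\end{aligned}
\]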

We see that Q(y) is also a log-linear model like CONTRAfold, but with additional

features for each base in the sequence given by the densities of the structure–probing

data. This means that we can simply modify the dynamic programming recurrences in

CONTRAfold to add the appropriate density terms whenever a base-pair or unpaired-

base is scored.
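The structure of this gradient can be illustrated on a toy model with explicitly enumerated structures. The sketch assumes each candidate structure comes with a feature vector and a precomputed data log-likelihood (the sum of Gamma log densities for that structure); the real computation replaces the enumeration with CONTRAfold's dynamic programming. Function names and inputs are hypothetical.

```python
import math

def _dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def _softmax(values):
    lse = _logsumexp(values)
    return [math.exp(v - lse) for v in values]

def log_marginal(w, feats, data_ll):
    """log sum_y P(y | x; w) * exp(data_ll[y]) over enumerated structures."""
    scores = [_dot(w, f) for f in feats]
    joint = [s + l for s, l in zip(scores, data_ll)]
    return _logsumexp(joint) - _logsumexp(scores)

def grad_w_log_marginal(w, feats, data_ll):
    """E_Q[F] - E_P[F]: Q adds the data log-likelihood to each structure's score."""
    scores = [_dot(w, f) for f in feats]
    prior = _softmax(scores)                                  # P(y | x; w)
    post = _softmax([s + l for s, l in zip(scores, data_ll)]) # Q(y)
    return [sum((q - p) * f[i] for q, p, f in zip(post, prior, feats))
            for i in range(len(w))]
```

The posterior is obtained from the same softmax as the prior with the data term added to each score, which mirrors the observation that $Q(y)$ is itself log-linear.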

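Gradient over θ Similar to the case for examples in $\mathcal{D}_{S+P}$, the partial derivatives with respect to the unconstrained Gamma distribution parameters $\tilde{\alpha}_{b,p}^{(j)}, \tilde{\beta}_{b,p}^{(j)}$ are given by

\[
\begin{aligned}
\frac{\partial}{\partial \tilde{\alpha}_{b,p}^{(j)}} \Big( \sum_y Q(y) \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) \Big)
&= \sum_y Q(y)\, I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p]\, \Big( \log d_k^{(j)} - \psi\big(\alpha_{b,p}^{(j)}\big) - \log \beta_{b,p}^{(j)} \Big)\, \alpha_{b,p}^{(j)} \\
&= I[j \in S_x]\, I[x_k = b]\, \Big( \log d_k^{(j)} - \psi\big(\alpha_{b,p}^{(j)}\big) - \log \beta_{b,p}^{(j)} \Big)\, \alpha_{b,p}^{(j)} \sum_y Q(y)\, I[\text{paired}(k, y) = p] \\
\frac{\partial}{\partial \tilde{\beta}_{b,p}^{(j)}} \Big( \sum_y Q(y) \log P\big(d_k^{(j)} \mid x_k, y; \theta^{(j)}\big) \Big)
&= \sum_y \Big[ Q(y)\, I[j \in S_x]\, I[x_k = b]\, I[\text{paired}(k, y) = p]\, \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big) \Big] \\
&= I[j \in S_x]\, I[x_k = b]\, \Big( -\alpha_{b,p}^{(j)} + \frac{d_k^{(j)}}{\beta_{b,p}^{(j)}} \Big) \sum_y Q(y)\, I[\text{paired}(k, y) = p]
\end{aligned}
\]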

The required expectations $\sum_y Q(y)\, I[\text{paired}(k, y) = p]$ can be computed by adapting the existing CONTRAfold routines for computing base pairing posteriors. Specifically, we can adapt the CONTRAfold routine to compute the posterior probability $p_{i,j}$ that base $i$ pairs with base $j$ under $Q(y)$ instead of the original CONTRAfold model (as previously described). Then, we can compute $\sum_y Q(y)\, I[\text{paired}(k, y) = p]$ by summing the posteriors over the appropriate positions. For example, if we wish to find the sum over all structures for sequence $x$ where $\text{paired}(1, y) = \text{paired}$, then we compute $\sum_j p_{1,j}$. If we wish to find the sum where $\text{paired}(1, y) = \text{unpaired}$, then we compute the sum as $1 - \sum_j p_{1,j}$.
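As a small illustration of this last step, the paired and unpaired probabilities for a base follow directly from a matrix of base-pairing posteriors. The sketch assumes a symmetric matrix `pair_post` with zero diagonal and 0-based indexing (so base 1 in the text is index 0); the function name is hypothetical.

```python
def paired_state_probs(pair_post, k):
    """Return (P(base k is paired), P(base k is unpaired)) under Q(y).

    pair_post[i][j] is the posterior probability that base i pairs with base j.
    """
    p_paired = sum(pair_post[k])
    return p_paired, 1.0 - p_paired
```

The two returned values are exactly the expectations that multiply the Gamma-parameter gradients for p = paired and p = unpaired.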

B.1.2 Dataset Setup

Parameter Optimization CONTRAfold-SE is a gradient-based method that requires an initialization for the model parameters. For different initializations, we find that the metrics at consecutive gradient steps during parameter training converge, and that the accuracy for different parameter initializations is also consistent. In addition, the learned parameters are also highly correlated for different initializations (Figure B.13). Since performance is mostly agnostic to initialization, we select one initialization style (described below) and use that throughout all experiments. We also find that the accuracy across iterations saturates and hence the number of iterations at which optimization was stopped also does not play a major role in practice.

Running and Evaluating CONTRAfold-SE Unless specified, we use the fol-

lowing settings throughout: regularize = 1, maxiter = 1000. For initial weights

(“initweight”), we concatenate the original CONTRAfold parameters (for the struc-

ture model) with 16 parameters specifying the natural logarithm of the shape and

scale parameters of 8 Gamma distributions, one for each paired or unpaired base

A, C, G, T (for the data model). Throughout, unless otherwise noted, we initialize the parameters as follows: the structure model parameters are set to the optimal ones given in CONTRAfold v2.02 (available at http://contra.stanford.edu/contrafold/contrafold_v2_02.tar.gz); the data model parameters are initialized

by fitting a Gamma distribution to all bases in the first 2000 sequences that are

data-dense and short, determined as described in the section on training sets. This

corresponds to “init0” in Figure B.13. In addition, we check two other initializations:

1) the structure model parameters are as above and the data model parameters are

randomly set (init1); and 2) the non-zero structure model parameters are set to a

random value between -1 and 1 and the data model parameters are randomly set

(init2). For the parameter γ, we run on a grid from 0.000001 to 1024 (namely: 1e-4, 2e-4, 3e-4, 4e-4, 5e-4, 8e-4, 1e-3, 2e-3, 5e-3, 6e-3, 8e-3, 1e-2, 2e-2, and 2^-5 through 2^10, incrementing the power). This tuning parameter roughly controls the number of bases included in the final structure and thereby trades off specificity against sensitivity.

We select the optimal λ using 10-fold cross-validation over a grid of values (0.001,



0.01, 0.05, 0.1, 0.5, 1): we divide the set of known-structure sequences into 10 sets,

evaluate AUC (see below) on each set (trained on the remaining structure-only and

all data-only sequences), and average across all 10 sets for each possible λ. We then

set the λ to that with the highest average AUC and learn the parameters over the

complete training set. We perform this procedure for Train-A and Train-B. Selected

λ are typically near 0.05. Train-A75%, Train-A100%, Train-A75, and Train-A100 use

the same λ as Train-A (namely, 0.05), since we are interested in seeing how adjusting

the data composition affects performances and λ also modulates that.
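The cross-validation procedure for λ can be sketched as follows. This is an outline under assumptions, not the original pipeline: `train_and_score` is a hypothetical callback that trains CONTRAfold-SE on the given structure sequences (plus all data-only sequences) with the given λ and returns the AUC on the held-out fold, and the folds are formed by simple striding.

```python
def select_lambda(structure_ids, lambdas, train_and_score, n_folds=10):
    """Pick the lambda with the highest mean held-out AUC over n_folds folds.

    structure_ids:   list of identifiers of known-structure sequences
    lambdas:         candidate values, e.g. [0.001, 0.01, 0.05, 0.1, 0.5, 1]
    train_and_score: hypothetical callback (train_ids, test_ids, lam) -> AUC
    """
    folds = [structure_ids[i::n_folds] for i in range(n_folds)]
    best_lam, best_auc = None, float("-inf")
    for lam in lambdas:
        aucs = []
        for held_out in folds:
            held = set(held_out)
            train = [s for s in structure_ids if s not in held]
            aucs.append(train_and_score(train, held_out, lam))
        mean_auc = sum(aucs) / len(aucs)
        if mean_auc > best_auc:
            best_lam, best_auc = lam, mean_auc
    return best_lam, best_auc
```

After selection, the final parameters would be learned once more on the complete training set with the chosen λ, as described above.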

Training and Test Sets Train-A has two components making up 238 sequences:

sequences with only known secondary structure and sequences with only structure-

probing data. For the first component, we take the first 119 sequences from the 151

training set sequences compiled from RFAM for the CONTRAfold training set (set

S151 in [33]), after excluding any sequences that share an RFAM match with the test

sets described below or any of the yeast mRNA genes. For the second component, we

assign to each yeast mRNA sequence with structure-probing data a data-sparsity score

calculated as the length divided by the number of non-zero data counts per length.

We select the first 119 sequences with smallest score, again excluding any that share

an RFAM match with the test sets. This ensures that we are first using sequences that

are both short (faster running time) and have dense data (more structural information

for the algorithm to use). Train-B is constructed similarly but using the data-sparsity

score cycling through DMS-vitro, DMS-vivo, and PARS data instead.
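The data-sparsity ranking used to pick the probing sequences can be sketched as below. The score follows the wording above literally (length divided by the number of non-zero counts per length); the handling of sequences with no non-zero counts, the function names, and the `probing` dictionary are assumptions made for illustration, and the RFAM exclusion step is omitted.

```python
def sparsity_score(counts):
    """Length divided by (number of non-zero counts per length); lower is denser."""
    length = len(counts)
    nonzero = sum(1 for c in counts if c != 0)
    if nonzero == 0:
        return float("inf")  # assumption: sequences with no signal rank last
    return length / (nonzero / length)

def select_densest(probing, n):
    """Return the n sequence names with the smallest data-sparsity score.

    probing: dict mapping sequence name -> list of per-base probing counts
    """
    ranked = sorted(probing, key=lambda name: sparsity_score(probing[name]))
    return ranked[:n]
```

Because the score grows with length and shrinks with the fraction of covered bases, the ranking favors sequences that are both short and data-dense.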


B.2 Supplementary Figures and Tables

Figure B.1: Sensitivity-PPV curve for ASH1-E1 in Test-SeqFold.

Figure B.2: Sensitivity-PPV curve for RDN58-2 in Test-SeqFold.


Figure B.3: Sensitivity-PPV curve for p4p6 in Test-SeqFold.

Figure B.4: Sensitivity-PPV curve for p9 in Test-SeqFold.


Figure B.5: Sensitivity-PPV curve for snR10 in Test-SeqFold.



[Figure B.11: panels for Pum2, SF2ASF, FMR1_1 (ACUK), and FMR1_1 (WGGA). Each shows pairing probability for true versus false sites, together with a map of the Mann-Whitney-Wilcoxon -log10 p-value over sequence position (1 to 40) against sequence position (1 to 40).]

Figure B.11: Structure profiles for human RNA binding proteins.


Figure B.12: Learned noise model for structure probing data.


[Figure B.13 panels: pairwise scatter plots comparing initializations init0 to init3. Panels labeled "structure": init0 vs. init1, r=0.98; init0 vs. init2, r=0.97; init0 vs. init3, r=0.99; init1 vs. init2, r=0.97; init1 vs. init3, r=0.98; init2 vs. init3, r=0.97. Panels labeled "data": r=1.00 for all six pairs.]

Figure B.13: Correlation between learned parameters for different parameter initializations.


Motif AUC for RNA binding protein classification

Not normalized by gene length

            C-SE(P,D-vitro)  C-SE(D-vivo)  C      SeqFold  # Motifs
PUF4-1      0.682            0.687         0.688  0.660    0.625
PUB1-1      0.606            0.606         0.598  0.598    0.598
PUF2-1      0.767            0.751         0.757  0.751    0.749
PAB1-1      0.678            0.677         0.682  0.642    0.665
KHD1-1      0.499            0.497         0.497  0.509    0.491
NAB2-1      0.549            0.540         0.547  0.545    0.548
YLL032C-1   0.708            0.683         0.699  0.706    0.670
VTS1-1      0.446            0.448         0.439  0.514    0.547
PIN4-1      0.956            0.974         0.975  0.947    0.948
NRD1-1      0.527            0.561         0.552  0.502    0.557

Normalized by gene length

            C-SE(P,D-vitro)  C-SE(D-vivo)  C      SeqFold  # Motifs
PUF4-1      0.664            0.682         0.695  0.609    0.534
PUB1-1      0.595            0.600         0.597  0.603    0.590
PUF2-1      0.695            0.682         0.673  0.662    0.676
PAB1-1      0.493            0.488         0.494  0.489    0.509
KHD1-1      0.507            0.503         0.508  0.513    0.499
NAB2-1      0.507            0.499         0.503  0.511    0.505
YLL032C-1   0.662            0.612         0.634  0.649    0.550
VTS1-1      0.403            0.399         0.410  0.472    0.489
PIN4-1      0.896            0.759         0.847  0.692    0.644
NRD1-1      0.489            0.533         0.516  0.463    0.494

Table B.1: AUC for receiver-operating-characteristic curves classifying bound RBP genes. C-SE represents CONTRAfold-SE and C represents CONTRAfold, trained on Train-B with the following data: PARS (P), DMS-vitro (D-vitro), DMS-vivo (D-vivo). The top half uses the aggregate accessibility of motifs and motif count normalized by gene length; the bottom half has no normalization (see Methods).
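
For readers unfamiliar with the AUC values reported in Table B.1, the sketch below computes the area under a receiver-operating-characteristic curve directly from its rank interpretation: the probability that a randomly chosen bound gene receives a higher score than a randomly chosen unbound gene, with ties counted as one half. This is a generic illustration in Python, not the evaluation code used for the table. The scores stand in for a per-gene predictor such as aggregate motif accessibility, and all names and values are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: one numeric score per gene (higher means "more likely bound").
    labels: truthy for bound genes, falsy for unbound genes.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Toy scores for five genes, two of which are bound (hypothetical values).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
print(f"AUC = {auc:.3f}")
```

The quadratic loop is chosen for clarity. An AUC of 0.5 corresponds to a predictor that ranks bound and unbound genes no better than chance, which is a useful reference point when reading the table.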


                 rep1 (r)  rep1 (p)  rep2 (r)  rep2 (p)
5’ UTR  mean      0.13     5.7e-13    0.14     4.3e-14
5’ UTR  min       0.06     4.8e-04    0.04     2.9e-02
5’ UTR  max       0.09     3.4e-07    0.14     7.1e-14
5’ UTR  median    0.14     7.3e-14    0.13     4.3e-12
5’ UTR  std       0.07     9.7e-05    0.11     1.3e-08
5’ UTR  CV       -0.08     6.7e-06   -0.04     4.1e-02

CDS     mean      0.19     7.7e-25    0.23     2.0e-34
CDS     min      -0.18     5.9e-24   -0.22     4.0e-31
CDS     max       0.34     1.1e-82    0.44     1.2e-137

All     mean      0.24     1.9e-39    0.28     1.4e-51
All     min      -0.15     1.9e-16   -0.19     1.2e-23
All     max       0.33     4.5e-75    0.43     1.1e-127

Table B.2: Spearman correlation between CONTRAfold-SE and translation efficiency on in vivo data. Correlation is between CONTRAfold-SE pairing probabilities trained on Train-B(DMS-vivo) and log fold change in translation efficiency at time 0 and time 30 minutes: log(initial TE) - log(TE at 30min). Pairing probability is calculated over different regions and metrics (CV is coefficient of variation = std / mean). Efficiency is calculated over two replicates (see Methods).
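
The quantities behind Table B.2 can be sketched in a few lines of Python: per-region summary metrics of pairing probability (including CV = std / mean), the log fold change log(initial TE) - log(TE at 30min), and a Spearman correlation computed as the Pearson correlation of tie-averaged ranks. This is a minimal illustration under stated assumptions, not the analysis code behind the table: the use of the population standard deviation is an assumption, p-values are not computed, and all gene names and values are hypothetical.

```python
import math
import statistics


def region_metrics(pairing_probs):
    """Summary metrics of pairing probability over one region of one gene."""
    mean = statistics.fmean(pairing_probs)
    std = statistics.pstdev(pairing_probs)  # assumption: population std
    return {
        "mean": mean,
        "min": min(pairing_probs),
        "max": max(pairing_probs),
        "median": statistics.median(pairing_probs),
        "std": std,
        "CV": std / mean if mean else math.nan,
    }


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy CDS pairing probabilities and TE measurements (hypothetical values).
cds_probs = {"g1": [0.2, 0.4, 0.6], "g2": [0.5, 0.5, 0.8],
             "g3": [0.1, 0.2, 0.3], "g4": [0.6, 0.7, 0.9]}
initial_te = {"g1": 2.0, "g2": 4.0, "g3": 1.0, "g4": 8.0}
te_30min = {"g1": 1.0, "g2": 1.0, "g3": 1.0, "g4": 1.0}

genes = sorted(cds_probs)
mean_pairing = [region_metrics(cds_probs[g])["mean"] for g in genes]
log_fold_change = [math.log(initial_te[g]) - math.log(te_30min[g]) for g in genes]
rho = spearman(mean_pairing, log_fold_change)
print(f"Spearman r = {rho:.2f}")
```

Swapping `"mean"` for any other key of `region_metrics` gives the corresponding row of the table for that region.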







3’ UTR  mean      0.02     1.9e-01    0.01     7.4e-01
3’ UTR  min       0.06     1.3e-03    0.06     1.8e-03
3’ UTR  max      -0.01     4.4e-01   -0.00     8.2e-01
3’ UTR  median    0.03     8.6e-02   -0.00     9.8e-01
3’ UTR  std      -0.01     6.9e-01   -0.00     8.4e-01
3’ UTR  CV       -0.04     3.9e-02   -0.01     5.0e-01



Table B.3: Spearman correlation between CONTRAfold-SE and translation efficiency on in vitro data. Correlation is between CONTRAfold-SE pairing probability trained on Train-B(PARS,DMS-vitro) and log fold change in translation efficiency at time 0 and time 30 minutes: log(initial TE) - log(TE at 30min). Quantities are calculated as in Figure B.2.







3’ UTR  mean      0.02     2.2e-01    0.03     7.8e-02
3’ UTR  min       0.04     3.2e-02    0.04     5.0e-02
3’ UTR  max      -0.00     8.0e-01    0.01     5.9e-01




Table B.4: Spearman correlation between CONTRAfold-SE and translation efficiency at an earlier time point. Correlation is between CONTRAfold-SE on Train-B(DMS-vivo) pairing probability and log fold change in translation efficiency at time 0 and time 15 minutes: log(initial TE) - log(TE at 15min). Quantities are calculated as in Figure B.2.



log(TE at 30min)

5’ UTR mean   -0.23   5.5e-38   -0.29   1.2e-54
CDS mean      -0.00   9.6e-01    0.02   3.8e-01
3’ UTR mean   -0.10   1.5e-07   -0.09   4.9e-06
All mean      -0.06   1.0e-03   -0.05   7.0e-03

log(initial TE)

5’ UTR mean   -0.10   6.9e-08   -0.09   7.1e-07
CDS mean       0.18   1.2e-23    0.15   3.8e-16
3’ UTR mean   -0.03   8.5e-02   -0.04   5.5e-02
All mean       0.17   1.6e-21    0.15   6.8e-15

log(initial TE) - log(TE at 30min) conditioned on log(initial TE)

5’ UTR mean    0.24   2.0e-38    0.21   2.4e-30
CDS mean       0.03   1.4e-01    0.11   1.0e-09
3’ UTR mean    0.04   2.1e-02    0.05   4.0e-03
All mean       0.09   3.9e-06    0.16   5.2e-18

Table B.5: Spearman correlation between CONTRAfold-SE in vivo and various TE quantities. CONTRAfold-SE is trained on Train-B(DMS-vivo) and other quantities are calculated as in Figure B.2.

Bibliography

[1] Frank W Albert, Dale Muzzey, Jonathan S Weissman, and Leonid Kruglyak.

Genetic influences on translation in yeast. PLoS genetics, 10(10):e1004692,

October 2014.

[2] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts,

and Peter Walter. Molecular Biology of the Cell, 2002.

[3] Andrei Alexandrov, Irina Chernyakov, Weifeng Gu, Shawna L Hiley, Timo-

thy R Hughes, Elizabeth J Grayhack, and Eric M Phizicky. Rapid tRNA decay

can result from lack of nonessential modifications. Molecular cell, 21(1):87–96,

January 2006.

[4] Gerd Anders, Sebastian D Mackowiak, Marvin Jens, Jonas Maaskola, Andreas

Kuntzagk, Nikolaus Rajewsky, Markus Landthaler, and Christoph Dieterich.

doRiNA: a database of RNA interactions in post-transcriptional regulation.

Nucleic acids research, 40(Database issue):D180–6, January 2012.

[5] S G Andersson and C G Kurland. Codon preferences in free-living microorgan-

isms. Microbiological reviews, 54(2):198–210, June 1990.

[6] Yoav Arava, F Edward Boas, Patrick O Brown, and Daniel Herschlag. Dissect-

ing eukaryotic translation and its control by ribosome density mapping. Nucleic

acids research, 33(8):2421–32, January 2005.

[7] Yoav Arava, Yulei Wang, John D Storey, Chih Long Liu, Patrick O Brown,

and Daniel Herschlag. Genome-wide analysis of mRNA translation profiles in

Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences of

the United States of America, 100(7):3889–94, April 2003.

[8] Carlo G Artieri and Hunter B Fraser. Evolution at two levels of gene expression

in yeast. Genome research, 24(3):411–21, March 2014.

[9] Tzvi Aviv, Zhen Lin, Giora Ben-Ari, Craig A Smibert, and Frank Sicheri.

Sequence-specific recognition of RNA hairpins by the SAM domain of Vts1p.

Nature structural & molecular biology, 13(2):168–76, February 2006.

[10] A. Battle, Z. Khan, S. H. Wang, A. Mitrano, M. J. Ford, J. K. Pritchard,

and Y. Gilad. Impact of regulatory variation from RNA to protein. Science,

347(6222):664–667, December 2014.

[11] Kajetan Bentele, Paul Saffert, Robert Rauscher, Zoya Ignatova, and Nils

Bluthgen. Efficient translation initiation dictates codon usage at gene start.

Molecular systems biology, 9:675, January 2013.

[12] F Bonekamp, H Dalbøge, T Christensen, and K F Jensen. Translation rates of

individual codons are not correlated with tRNA abundances or with frequen-

cies of utilization in Escherichia coli. Journal of bacteriology, 171(11):5812–6,

November 1989.

[13] M Bulmer. The selection-mutation-drift theory of synonymous codon usage.

Genetics, 129(3):897–907, November 1991.

[14] Nicola A Burgess-Brown, Sujata Sharma, Frank Sobott, Christoph Loenarz,

Udo Oppermann, and Opher Gileadi. Codon optimization can improve expres-

sion of human genes in Escherichia coli: A multi-gene study. Protein expression

and purification, 59(1):94–102, May 2008.

[15] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A Limited

Memory Algorithm for Bound Constrained Optimization. SIAM Journal on

Scientific Computing, 16(5):1190–1208, September 1995.

[16] J H Cate and J A Doudna. Solving large RNA structures by X-ray crystallog-

raphy. Methods in enzymology, 317:169–80, January 2000.

[17] Catherine A Charneski and Laurence D Hurst. Positively charged residues are

the major determinants of ribosomal velocity. PLoS biology, 11(3):e1001508,

January 2013.

[18] Catherine A Charneski and Laurence D Hurst. Positive charge loading at pro-

tein termini is due to membrane protein topology, not a translational ramp.

Molecular biology and evolution, 31(1):70–84, January 2014.

[19] Chunlai Chen, Haibo Zhang, Steven L Broitman, Michael Reiche, Ian Farrell,

Barry S Cooperman, and Yale E Goldman. Dynamics of translation by single

ribosomes through mRNA secondary structures. Nature structural & molecular

biology, 20(5):582–8, May 2013.

[20] L Cheng and E Goldman. Absence of effect of varying Thr-Leu codon pairs on

protein synthesis in a T7 system. Biochemistry, 40(20):6102–6, May 2001.

[21] Fabienne F V Chevance, Soazig Le Guyon, and Kelly T Hughes. The effects

of codon context on in vivo translation speed. PLoS genetics, 10(6):e1004392,

June 2014.

[22] Dominique Chu, David J Barnes, and Tobias von der Haar. The role of tRNA

and ribosome competition in coupling the expression of different mRNAs in

Saccharomyces cerevisiae. Nucleic acids research, 39(15):6705–14, August 2011.

[23] Dominique Chu and Tobias von der Haar. The architecture of eukaryotic trans-

lation. Nucleic acids research, 40(20):10098–106, November 2012.

[24] Pablo Cordero, Wipapat Kladwang, Christopher C VanLang, and Rhiju Das.

Quantitative dimethyl sulfate mapping for automated RNA secondary structure

inference. Biochemistry, 51(36):7037–9, September 2012.

[25] J F Curran and M Yarus. Rates of aminoacyl-tRNA selection at 29 sense codons

in vivo. Journal of molecular biology, 209(1):65–77, September 1989.

[26] Lyris M F de Godoy, Jesper V Olsen, Jurgen Cox, Michael L Nielsen, Nina C

Hubner, Florian Frohlich, Tobias C Walther, and Matthias Mann. Comprehen-

sive mass-spectrometry-based proteome quantification of haploid versus diploid

yeast. Nature, 455(7217):1251–4, October 2008.

[27] Katherine E Deigan, Tian W Li, David H Mathews, and Kevin M Weeks.

Accurate SHAPE-directed RNA structure determination. Proceedings of the

National Academy of Sciences of the United States of America, 106(1):97–102,

January 2009.

[28] Elizabeth A Dethoff, Jeetender Chugh, Anthony M Mustoe, and Hashim M

Al-Hashimi. Functional complexity and regulation through RNA dynamics.

Nature, 482(7385):322–30, February 2012.

[29] Yang Ding, Premal Shah, and Joshua B Plotkin. Weak 5’-mRNA secondary

structures in short eukaryotic genes. Genome biology and evolution, 4(10):1046–

53, January 2012.

[30] Ye Ding and Charles E. Lawrence. A statistical sampling algorithm for RNA

secondary structure prediction. Nucleic Acids Research, 31(24):7280–7301, De-

cember 2003.

[31] Yiliang Ding, Yin Tang, Chun Kit Kwok, Yu Zhang, Philip C Bevilacqua, and

Sarah M Assmann. In vivo genome-wide profiling of RNA secondary structure

reveals novel regulatory features. Nature, November 2013.

[32] Kimberly A Dittmar, Evelyn M Mobley, Agnes Jancso Radek, and Tao Pan.

Exploring the regulation of tRNA distribution on the genomic scale. Journal

of molecular biology, 337(1):31–47, March 2004.

[33] Chuong B Do, Daniel A Woods, and Serafim Batzoglou. CONTRAfold: RNA

secondary structure prediction without physics-based models. Bioinformatics,

22(14):e90–e98, July 2006.

[34] Mario dos Reis, Renos Savva, and Lorenz Wernisch. Solving the riddle of codon

usage preferences: a test for translational selection. Nucleic acids research,

32(17):5036–44, January 2004.

[35] Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Bio-

logical Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

Cambridge University Press, Cambridge, Massachusetts, 1998.

[36] Sean R Eddy. Computational analysis of conserved RNA secondary structure

in transcriptomes and genomes. Annual Review of Biophysics, 43:433–456, Jan-

uary 2014.

[37] Chantal Ehresmann, Florence Baudin, Marylene Mougel, Pascale Romby, Jean-

Pierre Ebel, and Bernard Ehresmann. Probing the structure of RNAs in solu-

tion. Nucleic Acids Research, 15(22):9109–9128, November 1987.

[38] Robert D Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y Eber-

hardt, Sean R Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina

Mistry, Erik L L Sonnhammer, John Tate, and Marco Punta. Pfam: the protein

families database. Nucleic acids research, 42(Database issue):D222–30, January

2014.

[39] Helena Firczuk, Shichina Kannambath, Jurgen Pahle, Amy Claydon, Robert

Beynon, John Duncan, Hans Westerhoff, Pedro Mendes, and John Eg Mc-

Carthy. An in vivo control map for the eukaryotic mRNA translation machinery.

Molecular systems biology, 9:635, January 2013.

[40] Kurt Fredrick and Michael Ibba. How the sequence of a gene can tune its

translation. Cell, 141(2):227–9, April 2010.

[41] Tsukasa Fukunaga, Haruka Ozaki, Goro Terai, Kiyoshi Asai, Wataru Iwasaki,

and Hisanori Kiryu. CapR: revealing structural specificities of RNA-binding

protein target recognition using CLIP-seq data. Genome Biology, 15(1):R16,

January 2014.

[42] Boris Furtig, Christian Richter, Jens Wohnert, and Harald Schwalbe. NMR

spectroscopy of RNA. ChemBioChem, 4(10):936–962, October 2003.

[43] Justin Gardin, Rukhsana Yeasmin, Alisa Yurovsky, Ying Cai, Steve Skiena, and

Bruce Futcher. Measurement of average decoding rates of the 61 sense codons

in vivo. eLife, 3, January 2014.

[44] Maxim V Gerashchenko, Alexei V Lobanov, and Vadim N Gladyshev. Genome-

wide ribosome profiling reveals complex translational regulation in response to

oxidative stress. Proceedings of the National Academy of Sciences of the United

States of America, 109(43):17394–9, October 2012.

[45] Aaron C Goldstrohm, Brad A Hook, Daniel J Seay, and Marvin Wickens. PUF

proteins bind Pop2p to regulate messenger RNAs. Nature structural & molecular

biology, 13:533–539, 2006.

[46] Wanjun Gu, Tong Zhou, and Claus O Wilke. A universal trend of reduced

mRNA stability near the translation-initiation site in prokaryotes and eukary-

otes. PLoS computational biology, 6(2):e1000664, February 2010.

[47] Claes Gustafsson, Sridhar Govindarajan, and Jeremy Minshull. Codon bias

and heterologous protein expression. Trends in biotechnology, 22(7):346–53,

July 2004.

[48] Christine E Hajdin, Stanislav Bellaousov, Wayne Huggins, Christopher W

Leonard, David H Mathews, and Kevin M Weeks. Accurate SHAPE-directed

RNA secondary structure modeling, including pseudoknots. Proceedings of the

National Academy of Sciences of the United States of America, 110(14):5498–

5503, April 2013.

[49] R Hamilton, C K Watanabe, and H A de Boer. Compilation and comparison of

the sequence context around the AUG startcodons in Saccharomyces cerevisiae

mRNAs. Nucleic acids research, 15(8):3581–93, April 1987.

[50] Winfried Hense, Nathan Anderson, Stephan Hutter, Wolfgang Stephan, John

Parsch, and David B Carlini. Experimentally increased codon bias in the

Drosophila Adh gene leads to an increase in larval, but not adult, alcohol de-

hydrogenase activity. Genetics, 184(2):547–55, February 2010.

[51] Ruth Hershberg and Dmitri A Petrov. Selection on codon bias. Annual review

of genetics, 42:287–99, January 2008.

[52] Wolf D Hirschmann, Heidrun Westendorf, Andreas Mayer, Gina Cannarozzi,

Patrick Cramer, and Ralf-Peter Jansen. Scp160p is required for translational

efficiency of codon-optimized mRNAs in yeast. Nucleic acids research, pages

gkt1392–, January 2014.

[53] Jessica I Hoell, Erik Larsson, Simon Runge, Jeffrey D Nusbaum, Sujitha Dug-

gimpudi, Thalia A Farazi, Markus Hafner, Arndt Borkhardt, Chris Sander, and

Thomas Tuschl. RNA targets of wild-type and mutant FET family proteins.

Nature structural & molecular biology, 18(12):1428–31, December 2011.

[54] Daniel J Hogan, Daniel P Riordan, Andre P Gerber, Daniel Herschlag, and

Patrick O Brown. Diverse RNA-binding proteins interact with functionally

related sets of RNAs, suggesting an extensive regulatory system. PLoS Biology,

6(10):e255, October 2008.

[55] Nicholas T. Ingolia. Ribosome profiling: new views of translation, from single

codons to genome scale. Nature Reviews Genetics, 15(3):205–213, January 2014.

[56] Nicholas T Ingolia, Gloria A Brar, Silvia Rouskin, Anna M McGeachy, and

Jonathan S Weissman. The ribosome profiling strategy for monitoring transla-

tion in vivo by deep sequencing of ribosome-protected mRNA fragments. Nature

protocols, 7(8):1534–50, August 2012.

[57] Nicholas T Ingolia, Sina Ghaemmaghami, John R S Newman, and Jonathan S

Weissman. Genome-wide analysis in vivo of translation with nucleotide reso-

lution using ribosome profiling. Science (New York, N.Y.), 324(5924):218–23,

April 2009.

[58] Nicholas T Ingolia, Liana F Lareau, and Jonathan S Weissman. Ribosome

profiling of mouse embryonic stem cells reveals the complexity and dynamics of

mammalian proteomes. Cell, 147(4):789–802, November 2011.

[59] B. Irwin, J. D. Heck, and G. W. Hatfield. Codon Pair Utilization Biases In-

fluence Translational Elongation Step Times. Journal of Biological Chemistry,

270(39):22801–22806, September 1995.

[60] Ailong Ke and Jennifer A Doudna. Crystallization of RNA and RNA-protein

complexes. Methods (San Diego, Calif.), 34(3):408–14, November 2004.

[61] Thomas E Keller, S David Mis, Kevin E Jia, and Claus O Wilke. Reduced

mRNA secondary-structure stability near the start codon indicates functional

genes in prokaryotes. Genome biology and evolution, 4(2):80–8, January 2012.

[62] Michael Kertesz, Yue Wan, Elad Mazor, John L Rinn, Robert C Nutter,

Howard Y Chang, and Eran Segal. Genome-wide measurement of RNA sec-

ondary structure in yeast. Nature, 467(7311):103–7, September 2010.

[63] Alex V Kochetov, Andrey Palyanov, Igor I Titov, Dmitry Grigorovich, Akinori

Sarai, and Nikolay A Kolchanov. AUG hairpin: prediction of a downstream

secondary structure influencing the recognition of a translation start site. BMC

bioinformatics, 8:318, January 2007.

[64] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles

and Techniques. Adaptive Computation and Machine Learning. MIT Press,

Cambridge, Massachusetts, 2009.

[65] M Kozak. Possible role of flanking nucleotides in recognition of the AUG ini-

tiator codon by eukaryotic ribosomes. Nucleic acids research, 9(20):5233–52,

October 1981.

[66] M Kozak. Downstream secondary structure facilitates recognition of initiator

codons by eukaryotic ribosomes. Proceedings of the National Academy of Sci-

ences of the United States of America, 87(21):8301–5, November 1990.

[67] Grzegorz Kudla, Andrew W Murray, David Tollervey, and Joshua B Plotkin.

Coding-sequence determinants of gene expression in Escherichia coli. Science

(New York, N.Y.), 324(5924):255–8, April 2009.

[68] Daniel H Lackner and Jurg Bahler. Translational control of gene expression

from transcripts to transcriptomes. International review of cell and molecular

biology, 271:199–251, January 2008.

[69] Daniel H. Lackner, Traude H. Beilharz, Samuel Marguerat, Juan Mata, Stephen

Watt, Falk Schubert, Thomas Preiss, and Jurg Bahler. A Network of Multiple

Regulatory Layers Shapes Gene Expression in Fission Yeast. Molecular Cell,

26(1):145–155, April 2007.

[70] Liana F. Lareau, Dustin H. Hite, Gregory J. Hogan, and Patrick O. Brown. Dis-

tinct stages of the translation elongation cycle revealed by sequencing ribosome-

protected mRNA fragments. eLife, 2014, 2014.

[71] Yizhar Lavner and Daniel Kotlar. Codon bias as a factor in regulating expres-

sion via translation rate in the human genome. Gene, 345(1):127–38, January

2005.

[72] Daniel P Letzring, Kimberly M Dean, and Elizabeth J Grayhack. Control

of translation efficiency in yeast by codon-anticodon interactions. RNA (New

York, N.Y.), 16(12):2516–28, December 2010.

[73] Fan Li, Qi Zheng, Paul Ryvkin, Isabelle Dragomir, Yaanik Desai, Subhadra

Aiyer, Otto Valladares, Jamie Yang, Shelly Bambina, Leah R Sabin, John I

Murray, Todd Lamitina, Arjun Raj, Sara Cherry, Li-San Wang, and Brian D

Gregory. Global analysis of RNA secondary structure in two metazoans. Cell

Reports, 1(1):69–82, January 2012.

[74] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for

large scale optimization. Mathematical Programming, 45(1-3):503–528, August

1989.

[75] A C Looman and J A Kuivenhoven. Influence of the three nucleotides up-

stream of the initiation codon on expression of the Escherichia coli lacZ gene in

Saccharomyces cerevisiae. Nucleic acids research, 21(18):4268–71, September

1993.

[76] T M Lowe and S R Eddy. tRNAscan-SE: a program for improved detection of

transfer RNA genes in genomic sequence. Nucleic acids research, 25(5):955–64,

March 1997.

[77] Julius B Lucks, Stefanie A Mortimer, Cole Trapnell, Shujun Luo, Sharon Avi-

ran, Gary P Schroth, Lior Pachter, Jennifer A Doudna, and Adam P Arkin.

Multiplexed RNA structure characterization with selective 2’-hydroxyl acyla-

tion analyzed by primer extension sequencing (SHAPE-Seq). Proceedings of the

National Academy of Sciences of the United States of America, 108(27):11063–8,

July 2011.

[78] Tobias Maier, Marc Guell, and Luis Serrano. Correlation of mRNA and protein

in complex biological samples. FEBS letters, 583(24):3966–73, December 2009.

[79] Orna Man and Yitzhak Pilpel. Differential translation efficiency of orthologous

genes is involved in phenotypic divergence of yeast species. Nature genetics,

39(3):415–21, March 2007.

[80] Nicholas R Markham and Michael Zuker. UNAFold: software for nucleic acid

folding and hybridization. Methods in molecular biology (Clifton, N.J.), 453:3–

31, January 2008.

[81] David H Mathews, Matthew D Disney, Jessica L Childs, Susan J Schroeder,

Michael Zuker, and Douglas H Turner. Incorporating chemical modification

constraints into a dynamic programming algorithm for prediction of RNA sec-

ondary structure. Proceedings of the National Academy of Sciences of the United

States of America, 101(19):7287–92, May 2004.

[82] C Joel McManus, Gemma E May, Pieter Spealman, and Alan Shteyman. Ribo-

some profiling reveals post-transcriptional buffering of divergent gene expression

in yeast. Genome research, 24(3):422–30, March 2014.

[83] Edward J Merino, Kevin A Wilkinson, Jennifer L Coughlan, and Kevin M

Weeks. RNA structure analysis at single nucleotide resolution by selective 2’-

hydroxyl acylation and primer extension (SHAPE). Journal of the American

Chemical Society, 127(12):4223–31, March 2005.

[84] Audrey M Michel, Kingshuk Roy Choudhury, Andrew E Firth, Nicholas T

Ingolia, John F Atkins, and Pavel V Baranov. Observation of dually decoded

regions of the human genome using ribosome profiling data. Genome research,

22(11):2219–29, November 2012.

[85] Namiko Mitarai and Steen Pedersen. Control of ribosome traffic by position-

dependent choice of synonymous codons. Physical biology, 10(5):056011, Octo-

ber 2013.

[86] Stefanie A Mortimer, Mary Anne Kidwell, and Jennifer A Doudna. Insights

into RNA structure and function from genome-wide studies. Nature reviews.

Genetics, 15(7):469–79, July 2014.

[87] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish

Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of

the yeast genome defined by RNA sequencing. Science (New York, N.Y.),

320(5881):1344–9, June 2008.

[88] John R S Newman, Sina Ghaemmaghami, Jan Ihmels, David K Breslow,

Matthew Noble, Joseph L DeRisi, and Jonathan S Weissman. Single-cell pro-

teomic analysis of S. cerevisiae reveals the architecture of biological noise. Na-

ture, 441(7095):840–6, June 2006.

[89] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom

Mitchell. Text Classification from Labeled and Unlabeled Documents using

EM. Machine Learning, 39(2-3):103–134, May 2000.

[90] Daniel A Nissley and Edward P O’Brien. Timing is everything: unifying codon

translation rates and nascent proteome behavior. Journal of the American

Chemical Society, 136(52):17892–8, December 2014.

[91] Florian C Oberstrass, Albert Lee, Richard Stefl, Michael Janis, Guillaume

Chanfreau, and Frederic H-T Allain. Shape-specific recognition in the structure

of the Vts1p SAM domain with RNA. Nature structural & molecular biology,

13(2):160–7, February 2006.

[92] Zhengqing Ouyang, Michael P Snyder, and Howard Y Chang. SeqFold: genome-

scale reconstruction of RNA secondary structure integrating high-throughput

sequencing data. Genome Research, 23(2):377–87, February 2013.

[93] Marc Parisien and Francois Major. The MC-Fold and MC-Sym pipeline infers

RNA structure from sequence data. Nature, 452(7183):51–5, March 2008.

[94] Joseph K Pickrell, John C Marioni, Athma A Pai, Jacob F Degner, Barbara E

Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens,

Yoav Gilad, and Jonathan K Pritchard. Understanding mechanisms underlying

human gene expression variation with RNA sequencing. Nature, 464(7289):768–

72, April 2010.

[95] Joshua B Plotkin and Grzegorz Kudla. Synonymous but not the same: the

causes and consequences of codon bias. Nature reviews. Genetics, 12(1):32–42,

January 2011.

[96] Cristina Pop, Silvi Rouskin, Nicholas T Ingolia, Lu Han, Eric M Phizicky,

Jonathan S Weissman, and Daphne Koller. Causal signals between codon bias,

mRNA structure, and the efficiency of translation and elongation. Molecular

systems biology, 10(12):770, January 2014.

[97] Tomasz Puton, Lukasz P Kozlowski, Kristian M Rother, and Janusz M Bujnicki.

CompaRNA: a server for continuous benchmarking of automated methods for

RNA secondary structure prediction. Nucleic Acids Research, 41(7):4307–4323,

April 2013.

[98] Wenfeng Qian, Jian-Rong Yang, Nathaniel M Pearson, Calum Maclean, and

Jianzhi Zhang. Balanced codon usage optimizes eukaryotic translational effi-

ciency. PLoS genetics, 8(3):e1002603, January 2012.

[99] Scott Quarrier, Joshua S Martin, Lauren Davis-Neulander, Arthur Beauregard,

and Alain Laederach. Evaluation of the information content of RNA structure

mapping data for secondary structure prediction. RNA (New York, N.Y.),

16(6):1108–17, June 2010.

[100] Vladimir Reinharz, Francois Major, and Jerome Waldispuhl. Towards 3D struc-

ture prediction of large RNA molecules: an integer programming framework to

insert local 3D motifs in RNA secondary structure. Bioinformatics (Oxford,

England), 28(12):i207–14, June 2012.

[101] Jessica S Reuter and David H Mathews. RNAstructure: software for RNA

secondary structure prediction and analysis. BMC Bioinformatics, 11(1):129,

January 2010.

[102] Shlomi Reuveni, Isaac Meilijson, Martin Kupiec, Eytan Ruppin, and Tamir

Tuller. Genome-scale analysis of translation elongation with a ribosome flow

model. PLoS computational biology, 7(9):e1002127, September 2011.

[103] Elena Rivas, Raymond Lang, and Sean R Eddy. A range of complex probabilis-

tic models for RNA secondary structure prediction that includes the nearest-

neighbor model and more. RNA, 18(2):193–212, February 2012.

[104] A Robbins-Pianka, M D Rice, and M P Weir. The mRNA landscape at yeast

translation initiation sites. Bioinformatics (Oxford, England), 26(21):2651–5,

November 2010.

[105] Silvi Rouskin, Meghan Zubradt, Stefan Washietl, Manolis Kellis, and

Jonathan S. Weissman. Genome-wide probing of RNA structure reveals ac-

tive unfolding of mRNA structures in vivo. Nature, December 2013.

[106] Zuben E Sauna and Chava Kimchi-Sarfaty. Understanding the contribution of

synonymous mutations to human disease. Nature reviews. Genetics, 12(10):683–

91, October 2011.

[107] Premal Shah, Yang Ding, Malwina Niemczyk, Grzegorz Kudla, and Joshua B

Plotkin. Rate-limiting steps in yeast protein translation. Cell, 153(7):1589–601,

June 2013.

[108] Bruce A Shapiro, Yaroslava G Yingling, Wojciech Kasprzak, and Eckart Binde-

wald. Bridging the gap in RNA structure prediction. Current opinion in struc-

tural biology, 17(2):157–65, April 2007.

[109] J Shine and L Dalgarno. The 3’-terminal sequence of Escherichia coli 16S

ribosomal RNA: complementarity to nonsense triplets and ribosome binding

sites. Proceedings of the National Academy of Sciences of the United States of

America, 71(4):1342–6, April 1974.

[110] M A Sørensen, C G Kurland, and S Pedersen. Codon usage determines

translation rate in Escherichia coli. Journal of molecular biology, 207(2):365–77, May

1989.

[111] M A Sørensen and S Pedersen. Absolute in vivo translation rates of individual

codons in Escherichia coli. The two glutamic acid codons GAA and GAG are

translated with a threefold difference in rate. Journal of molecular biology,

222(2):265–80, November 1991.

[112] Keith A Spriggs, Martin Bushell, and Anne E Willis. Translational regulation

of gene expression during conditions of cell stress. Molecular cell, 40(2):228–37,

October 2010.

[113] Michael Stadler and Andrew Fire. Wobble base-pairing slows in vivo translation

elongation in metazoans. RNA (New York, N.Y.), 17(12):2063–73, December

2011.

[114] Michael Stadler and Andrew Fire. Wobble base-pairing slows in vivo translation

elongation in metazoans. RNA (New York, N.Y.), 17(12):2063–73, December

2011.

[115] David W Staple and Samuel E Butcher. Pseudoknots: RNA structures with

diverse functions. PLoS Biology, 3(6):e213, June 2005.

[116] Zsuzsanna Sükösd, Bjarne Knudsen, Jørgen Kjems, and Christian N S

Pedersen. PPfold 3.0: fast RNA secondary structure prediction using phylogeny

and auxiliary data. Bioinformatics (Oxford, England), 28(20):2691–2, October

2012.

[117] Zsuzsanna Sükösd, M Shel Swenson, Jørgen Kjems, and Christine E Heitsch.

Evaluating the accuracy of SHAPE-directed RNA secondary structure predic-

tions. Nucleic acids research, 41(5):2807–16, March 2013.

[118] Fran Supek and Tomislav Smuc. On relevance of codon usage to expression of

synthetic and natural genes in Escherichia coli. Genetics, 185(3):1129–34, July

2010.

[119] Jesper Tholstrup, Lene B Oddershede, and Michael A Sørensen. mRNA

pseudoknot structures can act as ribosomal roadblocks. Nucleic acids research,

40(1):303–13, January 2012.

[120] T. Tuller and H. Zur. Multiple roles of the coding sequence 5’ end in gene

expression regulation. Nucleic Acids Research, pages gku1313–, December 2014.

[121] Tamir Tuller, Asaf Carmi, Kalin Vestsigian, Sivan Navon, Yuval Dorfan, John

Zaborske, Tao Pan, Orna Dahan, Itay Furman, and Yitzhak Pilpel. An evolu-

tionarily conserved mechanism for controlling the efficiency of protein transla-

tion. Cell, 141(2):344–54, April 2010.

[122] Tamir Tuller, Isana Veksler-Lublinsky, Nir Gazit, Martin Kupiec, Eytan Rup-

pin, and Michal Ziv-Ukelson. Composite effects of gene determinants on the

translation speed and density of ribosomes. Genome biology, 12(11):R110, Jan-

uary 2011.

[123] Tamir Tuller, Yedael Y Waldman, Martin Kupiec, and Eytan Ruppin. Translation efficiency is determined by both codon bias and folding energy. Proceedings of the National Academy of Sciences of the United States of America, 107(8):3645–50, February 2010.

[124] Sotaro Uemura, Colin Echeverría Aitken, Jonas Korlach, Benjamin A Flusberg, Stephen W Turner, and Joseph D Puglisi. Real-time tRNA transit on single translating ribosomes at codon resolution. Nature, 464(7291):1012–7, April 2010.

[125] Jason G Underwood, Andrew V Uzilov, Sol Katzman, Courtney S Onodera, Jacob E Mainzer, David H Mathews, Todd M Lowe, Sofie R Salama, and David Haussler. FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nature Methods, 7(12):995–1001, December 2010.

[126] S Varenne, J Buc, R Lloubes, and C Lazdunski. Translation is a non-uniform process. Effect of tRNA availability on the rate of elongation of nascent polypeptide chains. Journal of Molecular Biology, 180(3):549–76, December 1984.

[127] Christine Vogel, Gustavo Monteiro Silva, and Edward M Marcotte. Protein expression regulation under oxidative stress. Molecular & Cellular Proteomics: MCP, 10(12):M111.009217, December 2011.

[128] Dennis P Wall, Aaron E Hirsh, Hunter B Fraser, Jochen Kumm, Guri Giaever, Michael B Eisen, and Marcus W Feldman. Functional genomic analysis of the rates of protein evolution. Proceedings of the National Academy of Sciences of the United States of America, 102:5483–5488, 2005.

[129] Yue Wan, Michael Kertesz, Robert C Spitale, Eran Segal, and Howard Y Chang. Understanding the transcriptome through RNA structure. Nature Reviews Genetics, 12(9):641–655, September 2011.

[130] Yue Wan, Kun Qu, Qiangfeng Cliff Zhang, Ryan A Flynn, Ohad Manor, Zhengqing Ouyang, Jiajing Zhang, Robert C Spitale, Michael P Snyder, Eran Segal, and Howard Y Chang. Landscape and variation of RNA secondary structure across the human transcriptome. Nature, 505(7485):706–9, January 2014.

[131] Stefan Washietl, Ivo L Hofacker, Peter F Stadler, and Manolis Kellis. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction. Nucleic Acids Research, 40(10):4261–72, May 2012.

[132] Kevin M Weeks. Advances in RNA structure analysis by chemical probing. Current Opinion in Structural Biology, 20(3):295–304, June 2010.

[133] Mark Welch, Sridhar Govindarajan, Jon E Ness, Alan Villalobos, Austin Gurney, Jeremy Minshull, and Claes Gustafsson. Design parameters to control synthetic gene expression in Escherichia coli. PLoS ONE, 4(9):e7002, January 2009.

[134] Jin-Der Wen, Laura Lancaster, Courtney Hodges, Ana-Carolina Zeri, Shige H Yoshimura, Harry F Noller, Carlos Bustamante, and Ignacio Tinoco. Following translation by single ribosomes one codon at a time. Nature, 452(7187):598–603, April 2008.

[135] Kevin A Wilkinson, Suzy M Vasa, Katherine E Deigan, Stefanie A Mortimer, Morgan C Giddings, and Kevin M Weeks. Influence of nucleotide identity on ribose 2’-hydroxyl reactivity in RNA. RNA (New York, N.Y.), 15(7):1314–21, July 2009.

[136] Christopher J Woolstenhulme, Shankar Parajuli, David W Healey, Diana P Valverde, E Nicholas Petersen, Agata L Starosta, Nicholas R Guydosh, W Evan Johnson, Daniel N Wilson, and Allen R Buskirk. Nascent peptides that block protein synthesis in bacteria. Proceedings of the National Academy of Sciences of the United States of America, 110(10):E878–87, March 2013.

[137] Xiaoqiu Wu, Hans Jornvall, Kurt D Berndt, and Udo Oppermann. Codon optimization reveals critical factors for high level expression of two rare codon genes in Escherichia coli: RNA stability and secondary structure but not tRNA abundance. Biochemical and Biophysical Research Communications, 313(1):89–96, January 2004.

[138] D F Yun, T M Laz, J M Clements, and F Sherman. mRNA sequences influencing translation and the selection of AUG initiator codons in the yeast Saccharomyces cerevisiae. Molecular Microbiology, 19(6):1225–39, March 1996.

[139] Shay Zakov, Yoav Goldberg, Michael Elhadad, and Michal Ziv-Ukelson. Rich parameterization improves RNA structure prediction. Journal of Computational Biology, 18(11):1525–1542, November 2011.

[140] Gong Zhang, Magdalena Hubalewska, and Zoya Ignatova. Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nature Structural & Molecular Biology, 16(3):274–80, March 2009.

[141] S Zhang, E Goldman, and G Zubay. Clustering of low usage codons and ribosome movement. Journal of Theoretical Biology, 170(4):339–54, October 1994.

[142] Qi Zheng, Paul Ryvkin, Fan Li, Isabelle Dragomir, Otto Valladares, Jamie Yang, Kajia Cao, Li-San Wang, and Brian D Gregory. Genome-wide double-stranded RNA sequencing reveals the functional significance of base-paired RNAs in Arabidopsis. PLoS Genetics, 6(9):e1001141, September 2010.

[143] Tong Zhou and Claus O Wilke. Reduced stability of mRNA secondary structure near the translation-initiation site in dsDNA viruses. BMC Evolutionary Biology, 11(1):59, January 2011.

[144] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, January 1981.

[145] Hadas Zur and Tamir Tuller. Strong association between mRNA folding strength and protein abundance in S. cerevisiae. EMBO Reports, 13(3):272–7, March 2012.