FACULTY OF SCIENCE
UNIVERSITY OF COPENHAGEN
PhD thesis
Łukasz Jan Kiełpiński
High-throughput sequencing based methods of RNA structure investigation
Academic advisors:
Associate Professor Jeppe Vinther
Associate Professor Jan Christiansen
Submitted: 14/02/2014
This thesis has been submitted to
the PhD School of The Faculty of Science,
University of Copenhagen
Contents
1 Summary
2 Dansk resumé
3 Streszczenie po polsku
4 Acknowledgments
5 Abstract
6 Objectives
7 Description of the research project
7.1 Background information
7.1.1 Ribonucleic acid
7.1.2 RNA structure
7.1.3 Interactions between RNA and antisense oligonucleotides
7.1.4 Massive parallel sequencing
7.1.5 Application of the massive parallel sequencing for RNA structure determination
7.2 Project motivations
8 Summary of the results in the papers and their relation to the international state-of-the-art
8.1 Paper 1
8.2 Paper 2
8.3 Paper 3
8.4 Paper 4
9 Conclusions and perspectives
10 References
11 Papers
11.1 Paper 1: Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing
11.2 Paper 2: Massive parallel sequencing based hydroxyl radical probing of RNA accessibility
11.3 Paper 3: Transcriptome-wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)
11.4 Paper 4: The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)
1 Summary
RNA exists in cells in the form of dynamic, three-dimensional entities, but to assist its description
researchers resort to studying its primary (sequence), secondary (base pairing) and finally tertiary
(three-dimensional) structure. Traditional methods of studying the secondary and tertiary structures are
labor intensive and require analyzing every molecule of interest separately. Since the emergence
of massive parallel sequencing, the RNA structure determination field has been undergoing rapid changes
that immensely increase the throughput of experiments and introduce new ways of analyzing data. This
thesis consists of four manuscripts that describe developments within this methodological shift by
presenting and validating novel experimental and computational approaches that harness
next-generation sequencing for RNA structural studies.
The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and
Massive Parallel Sequencing”) presents a flexible, easy-to-follow method of preparing Illumina
sequencing libraries that allows massive identification of reverse transcription termination sites
(RTTS), called RTTS-Seq. Depending on the experimental design, the detection of RTTS can be used to
investigate various RNA properties, such as mapping 5’ ends, measuring susceptibility to certain
treatments (e.g. structure probing) or detecting base modifications. Apart from
describing the detailed experimental protocol, we provide a data analysis workflow suitable for
researchers without bioinformatics expertise. The experience from the RTTS-Seq method was
used in the second paper (“Massive parallel sequencing based hydroxyl radical probing of RNA
accessibility”) for probing the tertiary structure of RNA. There the method was extended with a technique
that tackles PCR bias and combined with a normalization scheme that takes into consideration local
coverage and background reverse transcription terminations, as assessed by a control reaction. The method
allows probing multiple long molecules simultaneously, and the obtained signal correlates well with
backbone solvent accessibility for both assayed molecules (the RNase P specificity domain and the 16S
ribosomal RNA). Another included paper (“The search for functional RNA secondary structures within 3’
untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)”)
presents a method of RNA secondary structure probing that is again a modification of RTTS-Seq but is
compatible with nuclease-based (P1 and V1) probing. In this protocol we ligated the adapter at the
RNA level, as opposed to the cDNA-level ligation in the RTTS-Seq approach. Moreover, we
performed reverse transcription anchored at the poly(A) tail border, focusing the assay on
the 3’ untranslated regions. This set-up required establishing a new data normalization workflow that
incorporates the signal decay from the 3’ ends of molecules. We performed the experiments with
liver RNA from three species, which allows us to combine the nuclease probing data with a structure
conservation analysis, creating an information-rich dataset. We validate the method by comparing the
nuclease signal with the known structures for three classes of RNA molecules. The search for novel
functional structures is ongoing.
In parallel to studying RNA structure, we investigated the interactions between RNA and an
oligonucleotide with therapeutic potential (“Transcriptome-wide detection of binding sites of Locked
Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method,
LNA-Stop-Seq, that can detect hybridization sites on a transcriptome-wide scale. We characterize
and optimize various steps in the procedure and propose strategies for enriching cDNA molecules
terminated upon reaching the crosslinked oligonucleotide. Finally, the sequencing results confirm that
the enrichment works, but the unexpected signal distribution requires additional data analysis efforts.
The methods presented in this thesis are capable of providing a holistic view of RNA: its primary,
secondary and tertiary structure, as well as its interactions with oligonucleotides. We expect that the
advances made in the experimental and computational methods, together with the gathered results, should
allow a better understanding of the RNA structure-function relationship as well as better and
simpler design of antisense drugs.
2 Dansk resumé (Danish summary)
RNA exists in cells in the form of dynamic, three-dimensional entities, but to ease the description of
these forms, researchers resort to studying the primary (the sequence), the secondary (base pairings) and
finally the tertiary (three-dimensional) structure. Traditional methods for studying the
secondary and tertiary structure are time-consuming and limited to analyzing each molecule
individually. Since massive parallel sequencing was introduced, the research field concerned
with determining RNA structure has changed rapidly; the efficiency of experiments has increased immensely
and new ways of analyzing data have been developed. This thesis consists of four manuscripts that
describe improvements within this methodological shift by introducing and validating new
experimental and computational approaches that apply next-generation sequencing to RNA
structures.
The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and
Massive Parallel Sequencing”) demonstrates a flexible, easy-to-understand method for constructing Illumina
sequencing libraries that allows massive identification of reverse transcription termination
sites (RTTS), called RTTS-Seq. The identification of RTTS can be used to investigate various RNA
properties, from mapping 5’ ends and sensitivity to particular treatments (e.g. structure
probing) to detecting base modifications, all depending on the experimental design. In addition to
describing the detailed experimental protocol, we explain the data analysis workflow,
which can be used by researchers without expertise in bioinformatics. The experience from the RTTS-Seq method
was used in the second paper (“Massive parallel sequencing based hydroxyl radical probing of RNA
accessibility”) for tertiary structure probing. The method was extended with a technique for handling
systematic errors in PCR and combined with a normalization that accounts for local coverage and
for background arising from termination in reverse transcription, estimated using the control
reaction. The method allows probing of several long molecules simultaneously, and the signal correlates well
with the solvent accessibility of the ribose-phosphate chain (backbone) of both studied molecules (the RNase
P specificity domain and 16S ribosomal RNA). Another included paper (“The search for
functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver
transcripts from multiple species (FragSeq2)”) presents the method for probing RNA secondary
structure, which is again a modification of RTTS-Seq but is compatible with nuclease-based (P1 and V1)
probing. In this protocol we ligated the adapter at the RNA level, as opposed to ligation at the cDNA
level in the RTTS-Seq approach. Furthermore, we performed reverse transcription anchored in the poly(A)
tail in order to focus the analysis on 3’ untranslated regions. This set-up necessitated the
development of a new data normalization workflow that incorporates decay of the signal from the 3’ ends of
molecules. We performed the experiments with liver RNA from three species, which gives us the opportunity to combine
nuclease resistance from the probing data with structure conservation analysis and create an information-rich
dataset. We validate the method by comparing the nuclease signal with the known structures for three
classes of RNA molecules. The hunt for new functional structures is ongoing.
In parallel with the RNA structure studies, we investigated the interplay between RNA and an oligonucleotide with
therapeutic potential (“Transcriptome-wide detection of binding sites of Locked Nucleic Acid
containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method, LNA-Stop-Seq, that can detect
hybridization sites on a transcriptome scale. We characterize and optimize various
steps in the procedure and propose strategies for enriching cDNA molecules that terminated after
reaching the crosslinked oligonucleotide. Sequencing results confirm that the enrichment works, but the
unexpected distribution of the signal requires further data analysis.
The methods presented in this thesis can contribute to a holistic view of RNA, with its
primary, secondary and tertiary structure, as well as interactions with oligonucleotides. We expect that
the improvements in the experimental and computational methods, together with the collected results, should
enable a better understanding of the RNA structure-function relationship, in addition to better and simpler
design of antisense drugs.
3 Streszczenie po polsku (Polish summary)
RNA occurs in cells in the form of dynamic, three-dimensional entities, but to describe it
scientists resort to studying its primary (nucleotide sequence), secondary (base pairing) and
finally tertiary (three-dimensional) structure. Traditional methods for studying secondary and
tertiary structures are laborious and require a separate analysis of each molecule. Since the advent of
massively parallel sequencing, the field of RNA structure determination has been changing rapidly,
significantly increasing the throughput of experiments and proposing new ways of analyzing data. This
doctoral thesis consists of four papers describing advances within this methodological leap,
presenting and validating new experimental and computational methods that use
next-generation sequencing for structural studies of RNA.
The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and
Massive Parallel Sequencing”) contains a flexible, easy-to-follow way of
preparing a library for sequencing with Illumina technology, which allows massive
identification of reverse transcription termination sites (RTTS), hereafter called RTTS-Seq. Detection of RTTS
can be used to study various RNA properties, such as mapping the 5’ end,
mapping RNA sensitivity to particular treatments (e.g. structure probing), detecting
modified nucleotides or others, depending on the experimental design. Besides giving a
detailed experimental protocol, we also present a data analysis process suited
to scientists without bioinformatics skills. The experience gathered from the
RTTS-Seq method was used in the second paper (“Massive parallel sequencing based hydroxyl radical
probing of RNA accessibility”) for probing the tertiary structure of RNA. The method was extended with a
technique that addresses the bias arising from the PCR reaction and combined with a normalization scheme that
takes into account the local coverage level and the background of reverse transcription terminations assessed from
the control reaction. The method enables probing many long molecules simultaneously and yielded
a signal that correlates well with the solvent accessibility of the RNA backbone for both
tested molecules (the RNase P specificity domain and 16S ribosomal RNA). The next included
paper (“The search for functional RNA secondary structures within 3’ untranslated regions by
enzymatic probing of liver transcripts from multiple species (FragSeq2)”)
presents a method of probing RNA secondary structure that is a modification of
RTTS-Seq compatible with nuclease-based (P1 and V1) probing. In this protocol adapter ligation
is performed at the RNA level, in contrast to ligation at the cDNA level in RTTS-Seq.
Moreover, the reverse transcription was anchored at the border of the poly(A) tail, focusing
our analysis on 3’ untranslated regions. This configuration required developing a new
data normalization method that accounts for signal decay from the 3’ end. We performed
experiments with liver RNA from three species, which allows us to combine nuclease probing
data with an analysis of the evolutionary conservation of structures, creating an information-rich dataset. To
validate the presented method, the nuclease cleavage signal was compared with known
structures for RNA molecules from three different classes. The search for new functional structures is
in progress.
In parallel with studying RNA structures, we investigated interactions between RNA and an oligonucleotide with
therapeutic potential (“Transcriptome-wide detection of binding sites of Locked Nucleic Acid
containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method, LNA-Stop-Seq, that
detects hybridization sites on the scale of the whole transcriptome. We characterize and
optimize various steps in the procedure and propose strategies for enriching cDNA molecules
stopped at bound oligonucleotides. Finally, sequencing results confirm
that the enrichment method works, but the unexpected signal distribution requires additional data analysis.
The methods presented in this work can provide a comprehensive view of RNA, its primary,
secondary and tertiary structure, as well as its interactions with oligonucleotides. We expect that advances
in experimental and computational methods, as well as the collected results, should allow a better
understanding of the relationship between RNA structure and function and, moreover, better and simpler design of drugs based
on antisense therapy.
4 Acknowledgments
The results presented in this thesis were only possible thanks to the wide support that I
received during and before my PhD studies. First of all, I would like to thank my supervisor, Prof. Jeppe
Vinther, for guiding my scientific growth over the last 3.5 years, for the opportunities to openly discuss and
try new ideas, for keeping me healthily motivated to work on them and for the constructive feedback
regarding this thesis. I am also very grateful to Prof. Jan Christiansen, my co-supervisor, who always had
time to talk about science, sports and life, and who was very helpful in coping with the administrative
processes.
I owe many thanks to our lab technicians, Amal Al-Chaer and Lena Bjørn Johansson, for ensuring an
efficiently functioning laboratory with a great atmosphere, and to my coworkers Jakob Lewin Rukov,
Signe Olivarius, Line Dahl Poulsen (thanks for translating the summary!), Christel Hougård Petersen,
Yanping Feng, Sidsel Kramshøj Adolph and Heidi Theil Hansen for fruitful discussions, for being helpful and
for keeping the University a place that one wants to come back to. Many thanks to the section leader,
Prof. Anders Krogh, for scientific and social engagement and to Henriette Husum Bak-Jensen for
passionate organizational support.
I would especially like to thank Sofie Salama and the whole Haussler Lab for the great and productive
time during my academic stay in Santa Cruz, as well as Jakob Skou Pedersen and his lab for the valuable
collaboration. Special thanks go to the representatives of Santaris Pharma – Morten Lindow and Peter
Hagedorn, whose enthusiasm and expert insight gave the momentum to our joint projects.
I am greatly indebted for the high-quality education I received prior to my doctoral studies at the
Poznań University of Life Sciences and at the Saint Mary Magdalene High School in Poznań. I owe
particular thanks to Dr. Tomasz Pniewski for guiding me through my first research venture and to Prof.
Włodzimierz Krzyżosiak and his lab for the very important, scientifically formative experience during the work
on my master's project.
I would like to thank the Department of Biology for funding my scholarship and The Danish Council for
Strategic Research for funding most of the remaining expenses and the stay abroad.
Finally, I would especially like to thank all my friends living here in Denmark and my friends in Poland,
my girlfriend Gillian and my whole family.
Thank you, my Parents, for your love, your support and an example of life worth following.
5 Abstract
In this thesis we describe the development of four related methods for RNA structure probing that
utilize massive parallel sequencing. Using them, we were able to gather structural data for multiple long
molecules simultaneously. First, we established an easy-to-follow experimental and computational
protocol for detecting reverse transcription termination sites (RTTS-Seq). This protocol was
subsequently applied to hydroxyl radical footprinting of three-dimensional RNA structures to give a
probing signal that correlates well with RNA backbone solvent accessibility. Moreover, we applied
RTTS-Seq to detect antisense oligonucleotide binding sites within a transcriptome. In this case, we
applied an enrichment strategy to greatly reduce the background. Finally, we modified
RTTS-Seq to study the secondary structure of 3’ untranslated regions with nuclease probing in combination
with a study of the evolutionary conservation of structure. In the course of this thesis we describe several
computational methods: one that alleviates PCR bias by estimating the number of unique molecules existing
before amplification, and two methods for data normalization, one applicable when paired-end
sequencing is performed and the other for single-read sequencing with known priming
sites.
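The idea of alleviating PCR bias by counting the molecules present before amplification can be illustrated with a minimal sketch. It assumes that every cDNA carries a random barcode attached before PCR, so that reads sharing the same termination position and the same barcode are treated as copies of one original molecule. The function name, the layout of the read tuples and the absence of any correction for barcode collisions are illustrative assumptions, not the procedure described in the papers.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Estimate pre-amplification molecule counts per termination site.

    `reads` is an iterable of (transcript_id, termination_position, barcode)
    tuples (a hypothetical layout). Reads with the same site and barcode are
    assumed to be PCR duplicates of a single molecule.
    """
    barcodes_seen = defaultdict(set)
    for transcript_id, position, barcode in reads:
        barcodes_seen[(transcript_id, position)].add(barcode)
    # The number of distinct barcodes approximates the number of molecules.
    return {site: len(barcodes) for site, barcodes in barcodes_seen.items()}


reads = [
    ("16S", 120, "ACGTAC"),
    ("16S", 120, "ACGTAC"),  # same site and barcode: counted once
    ("16S", 120, "TTGACA"),
    ("16S", 305, "ACGTAC"),
]
unique_counts = count_unique_molecules(reads)
```

With few distinct barcodes relative to the number of molecules per site, two different molecules may receive the same barcode, so a simple count of this kind would tend to underestimate the true number.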
6 Objectives
The overall objective of my thesis is to further the understanding of RNA biology and to facilitate the design of
antisense oligonucleotide drugs by developing methods for studying RNA properties in a
transcriptome-wide manner with the use of massive parallel sequencing. This overarching objective
was split into the following working goals:
- establishing a generic method of detecting reverse transcription termination sites (which
  can originate from RNA structure probing or from the signal of other RNA properties) with the Illumina
  sequencing technology (Paper 1),
- devising an experimental and computational workflow for studying the RNA tertiary structure
  (Paper 2),
- characterizing interactions between RNA and a specific oligonucleotide with therapeutic potential
  (Paper 3),
- describing the secondary structure of 3’ untranslated regions of mRNA molecules in an
  evolutionary context (Paper 4).
7 Description of the research project
7.1 Background information
7.1.1 Ribonucleic acid
Ribonucleic acid (RNA) carries out multiple functions and contributes almost 4% of the dry weight (DW) of
Escherichia coli and 20% DW of a typical mammalian cell (Alberts, 2002). RNA molecules are
polymers composed of ordered adenosine (A), cytidine (C), guanosine (G) and uridine (U)
monophosphates, with the chemical repertoire extended by nucleotide modifications (Cantara et al.,
2011; Limbach et al., 1994). Traditionally, RNAs have been categorized as coding and non-coding, with
the main function of the coding molecules being an intermediate in the flow of genetic information from
DNA to proteins (Crick, 1970). Among non-coding RNA molecules (ncRNA) we observe an astounding
variety of functions, including catalysis, delivery of amino acids, detection of small molecules
(Serganov and Nudler, 2013) or temperature (Kortmann and Narberhaus, 2012), involvement in
reactions acting on other RNA molecules (Matera et al., 2007), genome management (Froberg et al.,
2013), telomere synthesis (Gesteland et al., 2006) and the increasingly appreciated involvement in
post-transcriptional regulation of gene expression (Carthew and Sontheimer, 2009; Ulitsky and Bartel, 2013),
among others.
7.1.2 RNA structure
Although RNA molecules are linear polymers, they exist as three-dimensional entities whose structure is
dictated by their sequence, the history of the molecule, solvent properties and molecular interactors. Under
physiological conditions the main interactions dictating the structure are base stacking and base pairing,
with many other forces shaping the final molecular organization. The importance of RNA folding into a
specific three-dimensional structure is easiest to appreciate for catalytic RNA
molecules, ribozymes, with known involvement in RNA processing and in protein synthesis, the latter
performed by an especially interesting catalytic RNA, ribosomal RNA (Doudna and Cech, 2002). Ribosomal
RNA is the most abundant class of RNA in living cells, accounting for approximately 80% of RNA
mass, and its structure has been heavily studied since the mid-20th century (Bakowska-Zywicka and
Tyczewska, 2009). Solving its three-dimensional structure at the beginning of the 21st century with the
help of X-ray crystallography allowed full appreciation of the importance of RNA folding for its
function (Steitz, 2008). We used the RNA of the small ribosomal subunit as a benchmark for the method
of probing three-dimensional RNA structures described in this thesis (Paper 2). Representatives of the
second most abundant class of RNA molecules, tRNAs, also require folding into specific
three-dimensional structures to be charged with an amino acid by their particular aminoacyl-tRNA synthetases and
to deliver it to the ribosome (Perona and Hadd, 2012). Apart from those well-studied models there are
numerous known examples of RNA folding being important for function. For instance, RNA folded into
a hairpin is a substrate in the microRNA biogenesis pathway (Kim, 2005), the structure (secondary and
tertiary) of microRNA target sites can modulate silencing efficiency (Gan and Gunsalus, 2013; Kertesz et
al., 2007; Wan et al., 2014), structures within pre-mRNA are involved in the regulation of alternative splicing
(McManus and Graveley, 2011) and some RNA-protein interactions require a specific RNA fold (Lunde et
al., 2007). More examples of the roles of RNA structure are summarized in (Wan et al., 2011).
7.1.2.1 Hierarchy of RNA structure
To better understand RNA structure, researchers describe it in terms of secondary and tertiary
structure models. Secondary structure describes the pattern of base pairing, which forms structural features
such as stems, internal loops, hairpin loops, multi-loops, bulges and pseudoknots (see (Andronescu et
al., 2008) for an explanation). Thanks to base stacking and hydrogen bonding, secondary structure
contributes most of the negative free energy of structure formation and, assuming a hierarchical
folding model, it forms the basis for the tertiary organization of RNA molecules (Tinoco and Bustamante,
1999). The tertiary structure of RNA describes the three-dimensional coordinates of its constituent atoms.
Observed patterns form a very rich repertoire, including the A-form helix, coaxial stacking, helix
junctions, interactions between a nucleotide and the helix minor groove (A-minor), kink-turns, hook turns,
S-turns, tetraloops and tetraloop receptors, intercalations, triple-stranded RNA, G-quadruplexes, ribose
zippers and interactions involving base pairing (hence sometimes considered secondary structure
features) such as kissing loops and pseudoknots; see (Butcher and Pyle, 2011) for a more detailed
description. Moreover, apart from Watson-Crick base pairs, the spectrum of possible hydrogen bonding
between bases is enriched by non-canonical pairs (Leontis and Westhof, 2001). Overall, the relative
simplicity of determining RNA secondary structure, together with its importance, has directed more effort
towards solving it than towards building three-dimensional models.
7.1.2.2 Secondary structure determination
There are various approaches to investigating the secondary structure of a given RNA molecule. One is
to use energy minimization programs such as Mfold (Zuker, 2003), RNAstructure (Reuter and
Mathews, 2010) or many others, which take the primary RNA sequence as input and output folding
patterns with calculated energies. Their predictions rely on thermodynamic parameters to find
the secondary structure with the lowest free energy. The obtained structures are not guaranteed to be
actually present in solution, nor do they tell us whether the structure is biologically relevant. On average their
accuracy is 73%, and their high-probability predictions are generally correct (Mathews, 2004).
Alternatives to free energy minimization include statistical learning algorithms (Do et al., 2006) and
statistical sampling from the ensemble (Ding et al., 2004), among others.
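The dynamic programming idea behind such folding programs can be illustrated with a deliberately simplified sketch. Instead of minimizing free energy with thermodynamic parameters, as Mfold or RNAstructure do, the code below maximizes the number of nested base pairs (a Nussinov-style recursion). The function name, the scoring of one point per pair and the minimum hairpin loop length of three nucleotides are assumptions made for illustration only.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in `seq` (Nussinov-style DP).

    A simplified stand-in for free energy minimization: every Watson-Crick
    or G-U pair scores 1, and a hairpin loop must contain at least
    `min_loop` unpaired nucleotides.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


hairpin_pairs = max_base_pairs("GGGAAAUCC")
```

For the toy hairpin above the optimum consists of three pairs enclosing the AAA loop. Real energy models additionally weigh stacking and loop contributions, which this sketch ignores, and pseudoknots are excluded by the nested recursion.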
Accuracy can be further increased by constraining the predictions with the results of a structure probing
experiment. Its outline is to 1) fold the RNA molecule in the appropriate folding buffer and thermal
conditions (preferably establishing whether the molecule is functional), 2) treat it with the probing reagent, and 3)
detect reactive sites either by direct electrophoresis of the previously labeled RNA molecule or by
performing reverse transcription with a labeled primer followed by cDNA electrophoresis (slab-gel or capillary).
Commonly used probing reagents include structure-sensitive endonucleases such as the single-strand-specific
nucleases A, I, P1, S1, T2 or mung bean, among others (Gite and Shankar, 1995; Ziehler and
Engelke, 2001), the double-strand-specific nuclease V1 (Ziehler and Engelke, 2001), metal ions, especially
Pb2+ (Kirsebom and Ciesiolka, 2008), and other chemical reagents such as DMS, SHAPE reagents, kethoxal,
CMCT or hydroxyl radicals (Weeks, 2010). Chemical probing is often preferred over enzymatic cleavage
because of its better-defined behavior and because small probing reagents are less prone to steric clashes
with the RNA than enzymes. Moreover, lead(II) ions, some SHAPE reagents, X-ray-generated hydroxyl radicals
and DMS are applicable to in vivo probing (Adilakshmi, 2006; Ding et al., 2013; Lindell et al., 2002;
Rouskin et al., 2013; Spitale et al., 2013; Wells et al., 2000), which is considered superior to in vitro
probing because it provides information about RNA molecules in their natural setting.
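When reactive sites are read out by reverse transcription, the raw signal at each position is the number of cDNAs that end there, and it is usually compared with an untreated control. The sketch below shows one simple way this could be done, assuming per-position counts of terminations and of read-through coverage are already available for a treated and a control sample. The function names, the input lists and the choice to subtract the control rate and clip at zero are illustrative assumptions, not the normalization developed in the papers.

```python
def termination_rates(stops, coverage):
    """Fraction of cDNAs reaching each position that terminate there."""
    return [s / c if c > 0 else 0.0 for s, c in zip(stops, coverage)]


def background_subtracted_signal(treated_stops, treated_coverage,
                                 control_stops, control_coverage):
    """Per-position probing signal with control terminations removed.

    Negative differences (more termination in the control than in the
    treated sample) are clipped to zero.
    """
    treated = termination_rates(treated_stops, treated_coverage)
    control = termination_rates(control_stops, control_coverage)
    return [max(0.0, t - c) for t, c in zip(treated, control)]


# Hypothetical counts for three positions of one transcript.
signal = background_subtracted_signal(
    treated_stops=[5, 20, 0], treated_coverage=[100, 100, 50],
    control_stops=[5, 2, 1], control_coverage=[100, 100, 50],
)
```

Dividing by coverage keeps positions in highly and weakly covered regions comparable, while the control accounts for terminations that occur without any probing reagent.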
Researchers studying RNA molecules with conserved structures are in a privileged position, since they
can apply the gold-standard method of secondary structure prediction: building comparative structure
models. It relies on supporting the structure hypothesis with observations of compensatory mutations,
which change the primary sequence but preserve the secondary structure. It gives results of very high
quality even for large molecules (Gutell et al., 2002), and since it implies that the structure has been
preserved in evolution, it strongly suggests that the structure is functional. When only a few aligned
sequences are available, it is often beneficial to combine thermodynamic optimization with comparative models, as
described in (Seetin and Mathews, 2012). As more and more genomes are sequenced,
comparative methods open the possibility of genome-wide searches for conserved structures, as applied in
EvoFold (Pedersen et al., 2006). In Paper 4 we describe the development of a method that aims to
combine massive parallel sequencing based nuclease structure probing with the evolutionary
approach.
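The logic of compensatory mutations can be made concrete with a small sketch. Given an alignment and two columns proposed to pair, it checks that every sequence can form a canonical or G-U pair at these columns and that at least two of the observed pairs differ at both positions, which is what a compensatory change looks like. The function name, the toy alignment and the all-or-nothing criterion are assumptions for illustration; tools such as EvoFold use probabilistic models instead.

```python
from itertools import combinations

CANONICAL_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def has_compensatory_support(alignment, i, j):
    """True if columns i and j pair in all sequences and covary.

    `alignment` is a list of equal-length, gap-free RNA strings
    (a simplifying assumption).
    """
    observed = [seq[i] + seq[j] for seq in alignment]
    if not all(pair in CANONICAL_PAIRS for pair in observed):
        return False
    # A compensatory change alters both partners yet keeps them paired.
    return any(p[0] != q[0] and p[1] != q[1]
               for p, q in combinations(set(observed), 2))


alignment = [
    "GGAAACC",
    "GCAAAGC",
    "GAAAAUC",
]
supported = has_compensatory_support(alignment, 1, 5)   # GC, CG, AU
invariant = has_compensatory_support(alignment, 0, 6)   # always GC
unpaired = has_compensatory_support(alignment, 2, 4)    # AA cannot pair
```

An invariant pair such as columns 0 and 6 is compatible with the structure but gives no evolutionary evidence for it, which is why it is not counted as support here.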
7.1.2.3 Tertiary structure determination
Just as the comparative structural model of the RNA secondary structure was first built for tRNA molecules
(Madison et al., 1966), the tertiary RNA structure determination was also pioneered using the structure‐
conservation approach (Levitt, 1969) and was soon after mastered with X‐ray crystallography (Kim
et al., 1974). X‐ray crystallography is now a method of choice for studying the tertiary structure of
biological molecules, including complex RNA (Ban et al., 2000; Wimberly et al., 2000). Although it is capable
of producing data of very high resolution, the method has many drawbacks. Producing
RNA crystals is time consuming and requires specialized equipment and skills, success is not guaranteed and
the molecules are observed under artificial conditions. Moreover, producing suitable crystals often requires
molecular engineering to stabilize the structures (Ke and Doudna, 2004).
Another experimental method borrowed from studying the three dimensional structures of proteins is
NMR spectroscopy. Its advantage is that it provides information about the molecules in solution, but
like X‐ray crystallography it places high demands on equipment and skills and is additionally
limited to molecules of up to 100 nt (Furtig et al., 2003).
Automated prediction of RNA tertiary structure has been approached with different methods. Since
molecular dynamics simulations are prohibitively computationally demanding, alternative methods have
been developed. Structures have been built using phylogenetic information (Michel and Westhof,
1990), a simplified energy function (Das and Baker, 2007), assembly from nucleotide cyclic motifs
(Parisien and Major, 2008) or probabilistic modeling (Frellsen et al., 2009). Despite ongoing
improvements, automatically generated predictions mostly deviate substantially from the experimentally
solved structures (Laing and Schlick, 2010).
Considering the difficulties of obtaining high resolution structural data experimentally
and the limits of computational predictions, it is advantageous to use low resolution experimental data,
which are easier to obtain, to guide molecular modeling. One of the approaches relies on small‐angle X‐ray
scattering (SAXS), which can be applied in the native conditions of RNA molecules to obtain
low‐resolution electron density maps, which are especially useful for studying conformational changes
(Lipfert and Doniach, 2007). Another approach, requiring only standard molecular biology laboratory
equipment, is to measure the hydroxyl radical reactivity of different nucleotides and use the measurements as
guides for 3D modeling refinement (Ding et al., 2012). Hydroxyl radical probing coupled with next
generation sequencing is the method described in Paper 2.
7.1.3 Interactions between RNA and antisense oligonucleotides
One of the reasons for studying RNA structure is its influence on the efficiency of antisense drugs. Interactions
between antisense oligonucleotides (ASOs) and RNA depend on multiple parameters such as sequence,
solvent parameters (usually physiological), RNA structure and bound proteins (Kedde et al., 2007). The
term RNA accessibility (not to be confused with the backbone solvent accessibility measured with the
hydroxyl radical probing) is used, which can be broadly defined as ability of RNA “to form stable
complexes with complementary oligonucleotides” (Allawi et al., 2001). Various experiments were
proposed for assessing RNA accessibility such as measuring the oligonucleotide‐RNA association with
dialysis, arrays of oligonucleotides or detection by enzymatic reaction (RNAse H or reverse
transcriptase), as summarized in (Allawi et al., 2001). Importantly, ASOs targeted towards accessible
regions downregulate gene expression more efficiently (Allawi et al., 2001).
Apart from experimental methods, several computational approaches for assessing RNA accessibility
have been described. Some of them calculated accessibility as a difference between energy of ASO‐RNA
hybridization and probe intramolecular folding energy (Luebke et al., 2003) or RNA intramolecular
folding (cost of removing pairs in a given region) (Lu and Mathews, 2008). Others predict RNA structure
locally in the sliding window and assess the probability that a given region is base paired (Tafer et al.,
2008). Interestingly, the local structure prediction approach has been shown to be superior to the global one
(Lange et al., 2012). Paper 3 concerns the study of RNA‐ASO interactions.
7.1.4 Massive parallel sequencing
Recent years brought a revolution in DNA sequencing with so‐called High‐Throughput or Next‐Generation
Sequencing (NGS) technologies. Various NGS systems currently compete on the market, but
all of them are based on sequencing short stretches of many DNA molecules
simultaneously (hence the name massive parallel sequencing), yielding up to 4 billion reads per instrument per
run (Illumina HiSeq 2500). This unprecedented technological advance facilitated the emergence of entirely
new methods, such as genome sequencing, exome sequencing, RNA sequencing (Ozsolak and Milos,
2011), microRNA sequencing, crosslinking and immunoprecipitation sequencing (Hafner et al., 2010;
Konig et al., 2010; Licatalosi et al., 2008), chromatin immunoprecipitation and sequencing (Furey, 2012),
ribosome profiling (Ingolia et al., 2009), sequencing based RNA structure probing (Kertesz et al., 2010;
Underwood et al., 2010) and many other methods.
Before sequencing with the Illumina technology (which was used throughout
the thesis), samples must be transformed into suitable sequencing libraries that can bind a flow cell, generate
clusters in a bridge PCR amplification (with primers covalently attached to the flow cell) and hybridize
with the sequencing primers. The sequencing can be performed sequentially with three different
primers: the first for the first sequencing read, an optional second for the second sequencing read if paired‐end
sequencing is performed, and a third primer, which reads out the sample specific index and allows for
distinguishing different samples in multiplexed sequencing. The sequencing reaction is based on a
sequencing‐by‐synthesis approach. In each cycle, primers hybridized to the clustered amplicons, each cluster being
derived from a single molecule in the library, are extended by one nucleotide bearing a fluorescently
labeled extension terminator, with the fluorescent group being nucleotide‐specific. Next, the flow cell is
scanned for the colors of clusters, and the identity of the nucleotide attached to each cluster is saved
together with a quality score. Following scanning, the terminators are removed and the cycle is repeated.
The final result of the sequencing is a FASTQ file that for each cluster contains information about its
position within the flow cell, its sequence and the quality at each nucleotide.
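To make the content of such a file concrete, the following sketch reads four-line FASTQ records and extracts the cluster position from the read header. It assumes the colon-separated header layout used by recent Illumina software (instrument, run, flow cell, lane, tile, x, y) and Phred+33 quality encoding; the record shown is invented for the example.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, qualities) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        # Phred+33 encoding (assumed): quality score = ASCII code - 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


# A made-up record; the identifiers are placeholders, not real run data.
example = io.StringIO(
    "@INSTR:1:FLOWCELL:1:1101:1234:5678 1:N:0:ACGTAC\n"
    "GATTACA\n"
    "+\n"
    "IIIIHH#\n"
)
for header, sequence, qualities in read_fastq(example):
    fields = header.split(" ")[0].split(":")
    lane, tile, x, y = fields[3:7]
    print(lane, tile, x, y, sequence, qualities)
```

The parser deliberately handles only the simple, unwrapped four-line form that sequencers emit; an established FASTQ reader is a better choice for real data.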
7.1.5 Application of the massive parallel sequencing for RNA structure determination
Several protocols have been established that harness massive parallel sequencing for the detection of RNA
secondary structure probing signals. All of them utilized traditional probing reagents (structure
sensitive nucleases or chemicals leading to RNA strand cleavage or modification) but removed the
need for electrophoretic separation of nucleic acids by detecting the sites of modification with
sequencing. In 2010 three competing approaches were published. Two of them, parallel
analysis of RNA structure (PARS) (Kertesz et al., 2010) and FragSeq (Underwood et al., 2010) are based
on limited (as in traditional RNA structure probing) nuclease digestion and detection of cleavage sites as
sites to which the sequencing adapter has been ligated. They used different enzymes and data analysis
schemes. In the PARS method, the ratio between cleavage extent of double‐strand specific nuclease V1
and single‐strand specific nuclease S1 has been used to determine the state of a given nucleotide. On
the other hand the FragSeq method used only one enzyme, a single‐strand specific nuclease P1, to
determine which bases are single stranded and compared the cleavage extent to the cleavages observed
in the untreated control. The third method, dsRNA‐seq (Zheng et al., 2010), focused on finding double
stranded RNA regions by extensive degradation of single stranded RNA with RNase I and sequencing the
remaining RNA. The three described methods were based on ligating the sequencing adapters to the
RNA at the site of probing. This approach cannot be applied when a non‐cleaving probe is used,
as in SHAPE probing. Resolving that issue was the objective behind the development of the next NGS based
method of RNA structure investigation, SHAPE‐Seq (Lucks et al., 2011). In SHAPE‐Seq the reverse
transcription terminates upon reaching the modification and the adapter is ligated to the cDNA
terminus. This method cannot be used for transcriptome‐wide studies, because it requires a specific
reverse transcription primer, which can anneal only to artificially introduced 3’ end cassettes. This
limitation has been resolved in two recently published papers which used DMS for in vivo RNA
secondary structure probing (Ding et al., 2013; Rouskin et al., 2013), with one comparing the extent of
terminations between treated and control sample (Ding et al., 2013) and the other taking advantage of
the novel selection protocol (Rouskin et al., 2013).
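The ratio between double‐strand and single‐strand specific cleavage mentioned above can be sketched as a per-nucleotide log ratio. The function below is only an illustration of that idea: the pseudocount, the absence of any sequencing-depth normalization and the example counts are assumptions of this sketch, and the original PARS publication should be consulted for the exact scoring.

```python
import math


def log_ratio_score(v1_counts, s1_counts, pseudocount=1.0):
    """Per-nucleotide log2 ratio of double-strand specific (V1) to
    single-strand specific (S1) cleavage counts.

    Positive values suggest a paired nucleotide, negative values an
    unpaired one. The pseudocount avoids division by zero.
    """
    return [
        math.log2((v1 + pseudocount) / (s1 + pseudocount))
        for v1, s1 in zip(v1_counts, s1_counts)
    ]


# Hypothetical cleavage counts for three nucleotides.
v1 = [7, 0, 3]
s1 = [0, 7, 3]
print(log_ratio_score(v1, s1))
```

A single-enzyme design such as FragSeq would instead compare the treated sample with the untreated control, so the same function could be reused with control counts in place of the second enzyme.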
7.2 Project motivations
As exemplified above, knowledge of RNA structure is key to understanding many biological
phenomena and constitutes an important parameter in rational ASO design (Vickers et al.,
2000). Computational methods for RNA secondary structure prediction are often useful for superficial
assessments of hypotheses but they suffer from many limitations. What’s more, methods for a tertiary
structure prediction from the sequence only are even less reliable. The accuracy of predictions can be
increased when providing the structure building algorithms with experimentally obtained constraints for
both secondary and tertiary structure predictions (Ding et al., 2012; Reuter and Mathews, 2010).
Unfortunately, performing traditional structure probing experiments is a time consuming task,
requiring at least standard molecular biology laboratory equipment and a separate analysis of each
molecule (in the case of long molecules the analysis must be split into smaller parts). The development of
massive parallel sequencing allowed simultaneous structural probing of complex mixtures of RNA
molecules remarkably increasing the throughput, covering in one experiment millions of bases “which is
approximately 100‐fold more than all published RNA footprints to date” (Kertesz et al., 2010).
Inspired by the early applications of NGS for RNA structure probing (Kertesz et al., 2010; Lucks et al.,
2011; Underwood et al., 2010), we aimed at strengthening the field with the development of both
experimental and data analysis methods. First, we needed a system for sequencing library preparation
that is flexible, easy to adapt for other applications and compatible with the standard, multiplexed
Illumina sequencing. We describe its design in the Paper 1.
Establishment of this method has opened multiple research opportunities for us. The recently published
computational methods guiding tertiary RNA structure prediction (Ding et al.,
2012) suggested that investigating RNA three dimensional structures by detecting the hydroxyl radical
footprinting (HRF) signal with NGS would open an avenue for structure prediction of multiple long
molecules simultaneously. We show a method of coupling HRF with massive parallel sequencing in
Paper 2.
Our collaboration with the pharmaceutical company Santaris Pharma A/S which specializes in the
development of Locked Nucleic Acid (LNA) based ASOs led us to investigate how the oligonucleotides
interact with transcripts. For that purpose we have again used our established sequencing protocol. The
realization that the signal would contain a very high level of noise led us to develop methods of
enriching for the desired signal, which we describe in the Paper 3.
RNA structures are especially prominent within 3’ untranslated regions (3’ UTRs). 3’ UTRs are mRNA
segments known to be involved in a gene expression regulation and their functioning partly depends on
their specific fold (Bartel, 2009; Kuersten and Goodwin, 2003; Szostak and Gebauer, 2013; Wan et al.,
2014). Together with our collaborators from the University of California, Santa Cruz (who established the FragSeq
method) and Aarhus University (experienced in comparative analysis of RNA structure) we aimed at
developing a method for profiling the structures of 3’ UTRs. To utilize our combined experience we have
planned an experiment that uses library generation method similar to the one described in the Paper 1
to map the nuclease cleavage sites (as in FragSeq) and to perform this experiment with RNA samples
from different species allowing the use of the structure conservation information (Paper 4).
Studying RNA structure with NGS raises many questions about how to properly interpret the data, including
the need to resolve method specific biases, such as PCR bias (Weeks, 2011). What’s more, the custom
methods of library preparation (as applied throughout this thesis) require custom data analysis, since the
questions and assumptions of the available programs do not fit the experimental design. To address
those issues we aimed at developing computational methods of correcting the PCR bias and of the signal
normalization. The novel PCR bias correction method is described in Paper 2 and is also applied in the
Paper 3. Regarding the data normalization, we found ourselves in two different situations – obtaining
paired‐end (Paper 2) or single‐end (Paper 4) reads, for which we have proposed two different, albeit
related, normalization workflows.
8 Summary of the results in the papers and their relation to the international state‐of‐the‐art
8.1 Paper 1
In the paper Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive
Parallel Sequencing we give a detailed protocol of sequencing library preparation intended for detecting
reverse transcription termination sites (RTTS), called here RTTS‐Seq. Traditionally, RTTS were detected
with the slab‐gel or capillary electrophoresis in a wide range of applications such as finding transcripts 5’
termini (Simpson and Brown, 1995), RNA secondary structure probing with the SHAPE or other reagents
(Weeks, 2010), tertiary RNA structure probing or RNA‐protein interactions footprinting with hydroxyl
radicals (Adilakshmi, 2006) or identification of nucleotide modifications (Motorin et al., 2007). Our NGS
based protocol can in principle be used with all of the abovementioned procedures, but allows for much
higher throughput and easier data analysis.
To perform the experiment, we carry out reverse transcription using a primer with
an Illumina adapter overhang. Having synthesized cDNA that terminates upon reaching the feature of
interest, we ligate the adapter to its 3’ end (RTTS) with a single‐strand DNA ligase. After the ligation, we
finish the library construction with a PCR that adds the sample specific index, which allows mixing
multiple samples together and performing multiplexed sequencing. The structure of our sequencing libraries is
shown in Figure 1.
Apart from the experimental workflow, we describe the data analysis procedure and publish the necessary
scripts, aimed at users without extensive bioinformatics experience. We explain how to perform the
initial processing, mapping and trimming, and how to visualize the data in the popular UCSC Genome
Browser. We introduce the concept of trimming from the reads the nucleotides added by the reverse
transcriptase via its terminal transferase activity, which would otherwise shift the mapped signal
upstream in the RNA molecule.
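The trimming idea can be illustrated with a small sketch. It assumes that a read has already been aligned end to end, that it is given in RNA-sense orientation with its first base corresponding to the cDNA 3’ end, and that untemplated bases show up as leading mismatches; the function name, the `max_trim` limit and the toy sequences are hypothetical, and the scripts published with Paper 1 may proceed differently.

```python
def trim_untemplated(read, reference, start, max_trim=3):
    """Trim leading read bases that mismatch the reference.

    read      -- read in RNA-sense orientation; its first base corresponds
                 to the cDNA 3' end (assumption of this sketch)
    reference -- sequence the read was aligned to
    start     -- 0-based reference position of the first read base
    Returns (trimmed_read, corrected_start).
    """
    trimmed = 0
    while (
        trimmed < max_trim
        and trimmed < len(read)
        and start + trimmed < len(reference)
        and read[trimmed] != reference[start + trimmed]
    ):
        trimmed += 1
    return read[trimmed:], start + trimmed


# Toy example: the first two read bases do not match the reference.
reference = "AAACCCGGGTTT"
read = "TGCGGGTT"
print(trim_untemplated(read, reference, 3))
```

After trimming, the reported start moves downstream by the number of removed bases, which is the correction described in the text. An added base that happens to match the reference cannot be recognized this way.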
The RTTS‐Seq is similar to the procedure described in the SHAPE‐Seq paper (Lucks et al., 2011), but is
compatible with the Illumina multiplexed paired‐end DNA sequencing and with the random priming,
removing the need to introduce a structure cassette into the probed RNA. MAP‐Seq (Seetin et al., 2014), a recently published
method for finding RTTS with massive parallel sequencing, has a
design very similar to the one proposed in RTTS‐Seq, but it works with a fixed primer, which does not
allow for transcriptome‐wide searches. On the other hand, the MAP‐Seq protocol allows for skipping the
PCR step, avoiding some of the biases.
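Figure 1. Schematic view of the Illumina sequencing library. Labeled elements: flow cell binding, first and second read sequencing primer binding, DNA insert, sample specific index.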
8.2 Paper 2
The paper Massive parallel sequencing based hydroxyl radical probing of RNA accessibility concerns
applying the method described in the Paper 1 for the tertiary RNA structure probing with hydroxyl
radical footprinting (HRF). The HRF is a well established method for measuring the nucleic acid backbone
solvent accessibility (Tullius and Greenbaum, 2005). Traditionally, the signal has been detected with the
electrophoresis of either end‐labeled, cleaved RNA molecule or of the primer extension product. Here,
by applying the modified RTTS‐Seq, we substitute the electrophoretic separation with the sequencing
allowing HRF‐Seq to probe multiple, long RNA molecules simultaneously.
The paper describes the analysis of two RNA molecules with the crystallographically solved three
dimensional structures – Bacillus subtilis RNase P specificity domain and Escherichia coli 16S ribosomal
RNA. The RNA molecules were probed with hydroxyl radicals and were used as templates for the
sequencing library preparation. Reverse transcription was performed with either a single primer
(RNase P) or random primers (16S ribosomal RNA). As in RTTS‐Seq, we have ligated the
adapter to the cDNA 3’ end, amplified the libraries with PCR and sequenced with the paired‐end
protocol.
The major novelty of the paper comes from the proposed data analysis workflow. First, we have
mapped the pairs of reads to the analyzed molecules, defining the start and the end of the insert,
corresponding to the RTTS and the priming site, respectively. At this step, many inserts had the same
start and end positions, raising the question of which copies are derived from independent molecules and
which are simply PCR duplicates. To resolve that issue we have used a 7 nt random barcode introduced
during adapter ligation and developed a framework for calculating estimated unique counts (EUC) of
each repeated insert based on the random sampling of unequally probable barcodes. Working with EUC
instead of raw counts gave us the advantage of alleviating PCR bias and allowing for proper use of count
statistics. The outcome is similar to that offered by the amplification free MAP‐Seq, but in our
case we avoid working with very small amounts of material, which can be troublesome in certain
applications.
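One way to think about estimating unique counts from barcodes is moment matching: find the number of independent molecules for which the expected number of distinct barcodes equals the number actually observed. The sketch below implements that idea under the assumption that barcode probabilities are known; it is an illustration only and not necessarily the framework developed in Paper 2, and all names and numbers are hypothetical.

```python
def expected_distinct(n_molecules, barcode_probs):
    """Expected number of distinct barcodes seen among n_molecules draws."""
    return sum(1.0 - (1.0 - p) ** n_molecules for p in barcode_probs)


def estimate_unique_count(n_observed, barcode_probs, tol=1e-9):
    """Smallest molecule count whose expected number of distinct barcodes
    reaches the observed number of distinct barcodes (moment matching)."""
    if n_observed == 0:
        return 0
    if n_observed >= sum(1 for p in barcode_probs if p > 0):
        raise ValueError("all barcodes observed: count cannot be estimated")
    low, high = 0, 1
    # Double the upper bound until the expectation reaches the observation.
    while expected_distinct(high, barcode_probs) < n_observed - tol:
        low, high = high, high * 2
    # Bisect for the smallest count that satisfies the condition.
    while high - low > 1:
        mid = (low + high) // 2
        if expected_distinct(mid, barcode_probs) < n_observed - tol:
            low = mid
        else:
            high = mid
    return high


# Toy barcode spaces with four barcodes instead of the 4**7 of a 7 nt barcode.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(estimate_unique_count(2, uniform), estimate_unique_count(2, skewed))
```

In this toy case, observing two distinct barcodes implies more underlying molecules when the barcodes are unequally probable (4) than when they are uniform (3), which shows why the barcode frequencies have to be taken into account.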
We have defined the coverage at a given location as the sum of EUC of inserts spanning it, and
calculated the termination coverage ratio (TCR) by dividing the EUC of inserts terminating at a given
location by the coverage. To estimate the extent of cleavages induced by HRF at a given location, we
needed to consider spontaneous reverse transcription terminations. We have calculated the ΔTCR,
which is a difference between TCRs of a hydroxyl radical treated and control samples. As expected, ΔTCR
correlates with the ribose solvent accessibility as measured from the crystal structures. The concept of
ΔTCR is analogous to the concept of signal intensity presented in the QuSHAPE method (Karabiber et al.,
2013), but brings it to a realm of information‐rich NGS data.
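The definitions of coverage, TCR and ΔTCR given above translate directly into a few lines of code. The sketch below assumes that each insert is described by a tuple (start, end, EUC) in 0-based inclusive RNA coordinates, with the start being the RTTS and the end the priming site; this data layout and the toy numbers are assumptions made for illustration.

```python
from collections import defaultdict


def termination_coverage_ratio(inserts):
    """Return {position: TCR} for inserts given as (start, end, euc).

    Coverage at a position is the summed EUC of inserts spanning it;
    TCR is the EUC of inserts terminating there divided by the coverage.
    """
    coverage = defaultdict(float)
    terminations = defaultdict(float)
    for start, end, euc in inserts:
        terminations[start] += euc
        for position in range(start, end + 1):
            coverage[position] += euc
    return {
        position: terminations.get(position, 0.0) / cov
        for position, cov in coverage.items()
    }


def delta_tcr(treated, control):
    """TCR of the treated sample minus TCR of the control, per position."""
    tcr_treated = termination_coverage_ratio(treated)
    tcr_control = termination_coverage_ratio(control)
    return {
        position: tcr_treated[position] - tcr_control[position]
        for position in tcr_treated
        if position in tcr_control
    }


# Toy data: terminations at position 2 are more frequent after treatment.
treated = [(2, 5, 3.0), (0, 5, 1.0)]
control = [(2, 5, 1.0), (0, 5, 3.0)]
print(delta_tcr(treated, control))
```

Positions covered in only one of the two samples are skipped here; how such positions should be handled in practice is a design decision that this sketch leaves open.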
8.3 Paper 3
Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides
(LNA‐Stop‐Seq) describes development of a method for mapping hybridization sites of an
oligonucleotide with a complex mix of transcripts. The antisense oligonucleotides (ASOs) form a new
class of pharmaceuticals, with two drugs being approved for the medical use by U.S. Food and Drug
Administration – fomivirsen intended to treat cytomegalovirus retinitis and mipomersen targeting ApoB
transcript in patients with familial hypercholesterolaemia (Jones, 2011) and many more in clinical trials
(Rayburn and Zhang, 2008). Action of ASOs starts with the hybridization to their intended target RNA
and several mechanisms of action have been utilized, including target degradation and splicing or
function alteration. Efficient ASOs need to be chemically modified to prevent their degradation and to
increase potency. One of the proposed modifications is the use of LNA nucleotide analogues, which
protect from nucleases and increase the affinity (Koch et al., 2008). High affinity of the molecules leads
to the risk of causing hybridization‐dependent toxicity if the non‐targeted sequences are similar enough
to interact with the drug, creating off‐target effects (Lindow et al., 2012). Here we describe the process
of finding the off‐target binding sites of the potential LNA‐containing therapeutic molecule (Straarup et
al., 2010) in the mouse transcriptome. Proposed detection of the ASO‐RNA interaction sites is based on
the crosslinking of the hybridized oligonucleotides via 4‐thiothymidine (4‐thio‐T), biotin‐based selection
and detecting the locations with sequencing.
First, we describe various optimization steps, such as choice of the reverse transcriptase, way of
separating the non‐crosslinked oligonucleotides from the target (LNA modified oligonucleotides bind the
RNA with the affinity high enough to stop the reverse transcription even without the crosslinking),
deciding where in the oligonucleotide the 4‐thio‐T modification should be incorporated and for how
long the crosslinking should be performed.
Expecting the number of hybridization sites to be very limited as compared to the number of RNA 5’
ends, we needed to develop a method to enrich RTTS pool for the molecules that actually are derived
from the termination at the oligonucleotide rather than being derived from the mRNA 5’ ends or from
the spontaneous cDNA synthesis termination. We present two strategies of enrichment, both supported
by experimental evidence. One of the approaches is based on modifying RNA to bear 5’ phosphates and
degrading it with a 5’ phosphate dependent exonuclease, which terminates upon reaching the crosslinked
oligonucleotide. This method is related to the one proposed in the RNase R exonuclease based SHAPE
modification detection procedure (Steen et al., 2010), where the covalent adduct terminates the
exonucleolytic degradation. In our setting we observe the remaining RNA after that treatment to be
composed of the RNA part downstream from the crosslinked ASO.
Another enrichment approach is based on utilizing the CAGE selection system (Takahashi et al., 2012),
but instead of selecting for the biotinylated 5’ cap structures we select for the biotin‐modified RNA‐
crosslinked oligonucleotides. One of the crucial steps of the CAGE selection is the RNase I degradation of
RNA that is not protected by the cDNA. In our scenario, it was vital that the RNA fragment between the
cDNA 3’ end and the crosslinked, biotinylated ASO is protected from the cleavage, which was shown to
be the case. After the RNase I cleavage, the RNA‐cDNA hybrids are bound to the streptavidin beads via
the biotinylated oligonucleotide and only the cDNA molecules that extended up to the ASO are kept and
their RTTS are sequenced.
Finally, we have prepared the CAGE‐like selected sequencing library and the non‐selected control. The
non‐selected sample is comparable to the HRF‐Seq dataset, but instead of probing with the hydroxyl
radicals, probing with the oligonucleotide was performed, and the expected target site gives a clear
signal of reverse transcription terminations. Comparison of the non‐selected with the selected samples
shows that the selection removes a large portion of the background signal (as expected) but also
introduces peaks along the transcript that are difficult to interpret. The sequence of the oligonucleotide used can
be recapitulated from the enrichment profile, indicating that the selection enriches for hybridization
partners.
Interestingly, we were able to find certain clear spots of interaction that would have been difficult to
define using traditionally performed in silico screening (Lindow et al., 2012), which raises hopes that the
further analysis of the dataset would allow defining new rules of hybridization. It is worth noting that
the off‐targets as defined by the LNA‐Stop‐Seq would not necessarily affect the transcript level, as we
don’t check for the ability of the duplex to trigger the action. What’s more, we have only tested the
transcripts present in liver, possibly missing physiologically relevant interactions with transcripts from
other tissues, which was an issue raised when discussing the use of microarrays for finding off‐targets
(Lindow et al., 2012).
8.4 Paper 4
The last included manuscript, The search for functional RNA secondary structures within 3’
untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2), is
focused on parallel probing of the secondary structure of the mRNA 3’ UTRs. 3’ UTRs are platforms for
translational regulation of gene expression, with their structure playing an important role via e.g.
microRNA or protein binding modulation. This work borrows on one side from the established
experimental protocols of FragSeq (Underwood et al., 2010) and PARS (Kertesz et al., 2010) which
combined the enzymatic probing with the high‐throughput sequencing, and on the other side from the
EvoFold (Pedersen et al., 2006), the method that uses the evolutionary information for the functional
structures determination.
The presented method relies on an in vitro RNA folding and probing with a single‐strand specific
nuclease P1 in two different concentrations, with a double‐strand specific RNase V1 and performing
random shattering with the magnesium ions at elevated temperature. To the cleavage‐generated
5’ phosphates (magnesium shattering required the phosphorylation reaction) an adaptor is ligated and
the RNA is reverse transcribed using the oligo‐dT primer bearing the 5’ adaptor, which focuses our assay
on the 3’ regions of the mRNA molecules. Synthesized cDNA is used for a PCR and sequenced with the
Illumina single‐read protocol, reading out the nuclease cleavage positions. In total, we have probed four
liver RNA samples: human, dog and mouse poly(A), and the ribosome depleted mouse sample. The
multiple species experimental design allows harnessing not only nuclease probing information, but also
the evolutionary conservation.
After mapping, we observed that some of the reads contained the information about the priming site
location, and we used that for the data normalization. Upon initial data analysis we have assumed that
the signal distribution from the priming sites is a function of (1) exponential decay expected from the
fact that if a reverse transcription stops at a cleavage site it will not be able to detect the cleavage sites
upstream in the RNA molecule, and (2) the size selection, that lowered the chance of observing the
short products. This required applying a novel normalization scheme that would be able to translate the
observed read count to the cleavage efficiency. Inspired by the QuSHAPE method (Karabiber et al.,
2013) we have modeled the extension of cDNA molecules from the priming sites and for each position
have estimated the count of cDNA molecules reaching that position, which can be compared with the
observed number of reads ending at a given site.
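The core of this normalization, relating the reads that end at a position to the cDNA molecules that reached it, can be sketched for the simplest case of a single priming site. The sketch ignores the size selection and the multiple priming sites that the model in the manuscript accounts for; the function name and the counts are hypothetical.

```python
def stop_rates(stop_counts):
    """Convert raw stop counts into per-position stop rates.

    stop_counts[i] is the number of cDNA molecules ending i positions
    upstream of the priming site. A molecule that ends at position i or
    further upstream must have reached position i, so the rate is the
    count at i divided by the sum of counts from i onwards.
    """
    rates = []
    reaching = float(sum(stop_counts))
    for count in stop_counts:
        rates.append(count / reaching if reaching > 0 else float("nan"))
        reaching -= count
    return rates


# 1000 molecules with a constant stop probability of 0.5 per position;
# the last entry collects the molecules that reached the final position.
raw_counts = [500, 250, 125, 125]
print(stop_rates(raw_counts))
```

The raw counts decay with distance from the priming site even though the underlying stop probability is constant, while the normalized rates are flat (apart from the final run-off position), which is the effect the normalization is meant to remove.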
We show the structural signal from three classes of RNA molecules: a structured spiked‐in RNA (E. coli
transfer‐messenger RNA), a known 3’ UTR structure (selenocysteine insertion element, SECIS) and a known,
structured non‐coding RNA (U1 spliceosomal RNA). All three RNA molecules show a clear correlation
between the known structure and the P1 and V1 signal. Interestingly, the signal for two SECIS
elements in selenoprotein P mRNA is consistent over all three species tested underscoring the
evolutionary perspective of the method, with very clear, high P1 cleavage rate for the apical loops. The
signal for U1 spliceosomal RNA, available in the enzymatically polyadenylated ribosome depleted mouse
RNA, has been compared with the previously compiled structure (Underwood et al., 2010) and shows
almost perfect agreement.
In recent years we have witnessed the development of multiple methods of RNA structure probing
detected with massive parallel sequencing (Kertesz et al., 2010; Lucks et al., 2011; Underwood et al.,
2010; Wan et al., 2014). The proposed methods differ in the probing reagents used, the library
preparation protocols and the data analysis workflows. The most recently published methods describe an in vivo
probing approach, which is especially relevant, as in vitro folded structures may not necessarily be
the biologically relevant ones (Ding et al., 2013; Rouskin et al., 2013). Our way of improving the detection of
the biologically relevant structures is to combine the in vitro probing signal with the conservation signal.
This, in certain situations, may be superior to the in vivo probing approach, as the RNA in vivo may be
present in the functional state for only limited fraction of time making it difficult to detect.
9 Conclusions and perspectives
We have presented four intertwined projects broadly related to investigating RNA structural properties
on the massive scale with the next generation sequencing. We have started with the presentation of the
easy to follow, generic method for sequencing libraries generation that was later applied towards
obtaining a global perspective of the RNA structure: its secondary and tertiary organization, as well as
intermolecular interactions between RNA and antisense oligonucleotides.
We provide insights into RNA structure probing with the NGS, describing biases, ways of tackling them
and the data normalization schemes. We have confirmed that the NGS approach is suitable for the RNA
structure determination, and given the proper data analysis it performs comparably well to the low
throughput, traditional counterparts. The vast amount of gathered data should make it possible to
refine the folding parameters used in the computational prediction programs as well as lead to a
better understanding of the reagents used.
Given the rising popularity of using the NGS methods we expect the HRF‐Seq to find an immediate
application with the combination of the HRF‐driven tertiary structure prediction algorithms for the
large‐scale 3D modeling projects (Ding et al., 2012). Such a marriage would make the data analysis much
easier and more reliable by feeding the structural algorithm with the digital data of known uncertainty
(count statistics). The simultaneous analysis of many long molecules could possibly allow the discovery
of new folding rules. Another, not yet explored avenue for HRF‐Seq could be performing an
experiment that compares the same set of RNA molecules between different conditions, in which case
we expect the data to be of even higher quality, since sequence‐dependent biases should cancel out.
Moreover, the use of X‐ray radiation would allow us to apply the method to in vivo studies,
answering how well the in vitro probing experiments recapitulate the physiological state.
In the FragSeq2 paper we describe the probing of the RNA secondary organization in vitro, creating a
dataset comprising the wealth of information of probing with different nucleases combined with the
conservation signal thanks to probing of three different species. So far we have performed the
experiments and designed the data normalization procedure. The results are in agreement with the
known structures, hinting that the dataset may contain information on novel structural
elements. The next goal of the project is to perform holistic data mining with the use of the nuclease
cleavage data and the evolutionary information. Insights gathered during this analysis can lead us to
develop the subsequent version of the combined nuclease‐conservation structure determination
approach where we would extend the probed sequence space to cover whole transcripts.
We have shown that the LNA‐Stop‐Seq can be successfully applied for finding the sites of interactions
between an ASO and RNA in vitro. We have performed only an initial data analysis, which suggests that
we may be able to improve our understanding of this kind of interaction. On the other hand, we did not
characterize whether the biotin and 4‐thio‐T influence the hybridization. The procedure can very easily
be performed with oligonucleotides of different sequences or chemistries. It was developed with a
future in vivo application in mind, where the oligonucleotides would be delivered to cultured cells via
transfection. LNA‐Stop‐Seq represents the first application of the very specific CAGE‐like selection
outside of conventional cap‐trapping, suggesting that this protocol can be adapted to enrich for
other interesting RNA properties.
It is worth noting that the FragSeq2 and the LNA‐Stop‐Seq methods are parts of bigger collaborations.
We have established experimental protocols and the initial data processing schemes, and we expect
the future analysis to define new 3’ UTR structures and correlate them with cellular regulation
mechanisms (e.g. microRNAs), as well as to define the rules governing the hybridization of
LNA‐containing oligonucleotides.
10 References
Adilakshmi, T. (2006). Hydroxyl radical footprinting in vivo: mapping macromolecular structures with synchrotron radiation. Nucleic Acids Research 34, e64‐e64.
Alberts, B. (2002). Molecular biology of the cell, 4th edn (New York, Garland Science).
Allawi, H.T., Dong, F., Ip, H.S., Neri, B.P., and Lyamichev, V.I. (2001). Mapping of RNA accessible sites by extension of random oligonucleotide libraries with reverse transcriptase. RNA (New York, NY) 7, 314‐327.
Andronescu, M., Bereg, V., Hoos, H.H., and Condon, A. (2008). RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics 9, 340.
Bakowska‐Zywicka, K., and Tyczewska, A. (2009). The structure of the ribosome – short history. Biotechnologia 1, 14‐23.
Ban, N., Nissen, P., Hansen, J., Moore, P.B., and Steitz, T.A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 A resolution. Science 289, 905‐920.
Bartel, D.P. (2009). MicroRNAs: Target Recognition and Regulatory Functions. Cell 136, 215‐233.
Butcher, S.E., and Pyle, A.M. (2011). The molecular interactions that stabilize RNA tertiary structure: RNA motifs, patterns, and networks. Acc Chem Res 44, 1302‐1311.
Cantara, W.A., Crain, P.F., Rozenski, J., McCloskey, J.A., Harris, K.A., Zhang, X., Vendeix, F.A., Fabris, D., and Agris, P.F. (2011). The RNA Modification Database, RNAMDB: 2011 update. Nucleic Acids Res 39, D195‐201.
Carthew, R.W., and Sontheimer, E.J. (2009). Origins and Mechanisms of miRNAs and siRNAs. Cell 136, 642‐655.
Crick, F. (1970). Central dogma of molecular biology. Nature 227, 561‐563.
Das, R., and Baker, D. (2007). Automated de novo prediction of native‐like RNA tertiary structures. Proc Natl Acad Sci U S A 104, 14664‐14669.
Ding, F., Lavender, C.A., Weeks, K.M., and Dokholyan, N.V. (2012). Three‐dimensional RNA structure refinement by hydroxyl radical probing. Nat Methods.
Ding, Y., Chan, C.Y., and Lawrence, C.E. (2004). Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32, W135‐141.
Ding, Y., Tang, Y., Kwok, C.K., Zhang, Y., Bevilacqua, P.C., and Assmann, S.M. (2013). In vivo genome‐wide profiling of RNA secondary structure reveals novel regulatory features. Nature.
Do, C.B., Woods, D.A., and Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics‐based models. Bioinformatics 22, e90‐98.
Doudna, J.A., and Cech, T.R. (2002). The chemical repertoire of natural ribozymes. Nature 418, 222‐228.
Frellsen, J., Moltke, I., Thiim, M., Mardia, K.V., Ferkinghoff‐Borg, J., and Hamelryck, T. (2009). A probabilistic model of RNA conformational space. PLoS Comput Biol 5, e1000406.
Froberg, J.E., Yang, L., and Lee, J.T. (2013). Guided by RNAs: X‐inactivation as a model for lncRNA function. J Mol Biol 425, 3698‐3706.
Furey, T.S. (2012). ChIP‐seq and beyond: new and improved methodologies to detect and characterize protein‐DNA interactions. Nat Rev Genet 13, 840‐852.
Furtig, B., Richter, C., Wohnert, J., and Schwalbe, H. (2003). NMR spectroscopy of RNA. Chembiochem 4, 936‐962.
Gan, H.H., and Gunsalus, K.C. (2013). Tertiary structure‐based analysis of microRNA‐target interactions. RNA 19, 539‐551.
Gesteland, R.F., Cech, T., and Atkins, J.F. (2006). The RNA world : the nature of modern RNA suggests a prebiotic RNA world, 3rd edn (Cold Spring Harbor, N.Y., Cold Spring Harbor Laboratory Press).
Gite, S.U., and Shankar, V. (1995). Single‐strand‐specific nucleases. Crit Rev Microbiol 21, 101‐122.
Gutell, R.R., Lee, J.C., and Cannone, J.J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr Opin Struct Biol 12, 301‐310.
Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jungkamp, A.‐C., Munschauer, M., et al. (2010). Transcriptome‐wide Identification of RNA‐Binding Protein and MicroRNA Target Sites by PAR‐CLIP. Cell 141, 129‐141.
Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S., and Weissman, J.S. (2009). Genome‐Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 324, 218‐223.
Jones, D. (2011). The long march of antisense. Nature reviews Drug discovery 10, 401‐402.
Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid, accurate, and best‐practices quantification of nucleic acid probing information, resolved by capillary electrophoresis. RNA 19, 63‐73.
Ke, A., and Doudna, J.A. (2004). Crystallization of RNA and RNA‐protein complexes. Methods 34, 408‐414.
Kedde, M., Strasser, M.J., Boldajipour, B., Vrielink, J.A.F.O., Slanchev, K., le Sage, C., Nagel, R., Voorhoeve, P.M., van Duijse, J., Ørom, U.A., et al. (2007). RNA‐Binding Protein Dnd1 Inhibits MicroRNA Access to Target mRNA. Cell 131, 1273‐1286.
Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278‐1284.
Kertesz, M., Wan, Y., Mazor, E., Rinn, J.L., Nutter, R.C., Chang, H.Y., and Segal, E. (2010). Genome‐wide measurement of RNA secondary structure in yeast. Nature 467, 103‐107.
Kim, S.H., Sussman, J.L., Suddath, F.L., Quigley, G.J., McPherson, A., Wang, A.H., Seeman, N.C., and Rich, A. (1974). The general structure of transfer RNA molecules. Proc Natl Acad Sci U S A 71, 4970‐4974.
Kim, V.N. (2005). MicroRNA biogenesis: coordinated cropping and dicing. Nat Rev Mol Cell Biol 6, 376‐385.
Kirsebom, L.A., and Ciesiolka, J. (2008). Pb2+‐induced Cleavage of RNA. In Handbook of RNA Biochemistry (Wiley‐VCH Verlag GmbH), pp. 214‐228.
Koch, T., Rosenbohm, C., Hansen, H.F., Hansen, B., Marie Straarup, E., and Kauppinen, S. (2008). Chapter 5 Locked Nucleic Acid: Properties and Therapeutic Aspects. In Therapeutic Oligonucleotides (The Royal Society of Chemistry), pp. 103‐141.
Konig, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J., Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17, 909‐915.
Kortmann, J., and Narberhaus, F. (2012). Bacterial RNA thermometers: molecular zippers and switches. Nat Rev Microbiol 10, 255‐265.
Kuersten, S., and Goodwin, E.B. (2003). The power of the 3' UTR: translational control and development. Nat Rev Genet 4, 626‐637.
Laing, C., and Schlick, T. (2010). Computational approaches to 3D modeling of RNA. J Phys Condens Matter 22, 283101.
Lange, S.J., Maticzka, D., Mohl, M., Gagnon, J.N., Brown, C.M., and Backofen, R. (2012). Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Research.
Leontis, N.B., and Westhof, E. (2001). Geometric nomenclature and classification of RNA base pairs. RNA 7, 499‐512.
Levitt, M. (1969). Detailed molecular model for transfer ribonucleic acid. Nature 224, 759‐763.
Licatalosi, D.D., Mele, A., Fak, J.J., Ule, J., Kayikci, M., Chi, S.W., Clark, T.A., Schweitzer, A.C., Blume, J.E., Wang, X., et al. (2008). HITS‐CLIP yields genome‐wide insights into brain alternative RNA processing. Nature 456, 464‐469.
Limbach, P.A., Crain, P.F., and McCloskey, J.A. (1994). Summary: the modified nucleosides of RNA. Nucleic Acids Res 22, 2183‐2196.
Lindell, M., Romby, P., and Wagner, E.G. (2002). Lead(II) as a probe for investigating RNA structure in vivo. RNA 8, 534‐541.
Lindow, M., Vornlocher, H.‐P., Riley, D., Kornbrust, D.J., Burchard, J., Whiteley, L.O., Kamens, J., Thompson, J.D., Nochur, S., Younis, H., et al. (2012). Assessing unintended hybridization‐induced biological effects of oligonucleotides. Nature Biotechnology 30, 920‐923.
Lipfert, J., and Doniach, S. (2007). Small‐angle X‐ray scattering from RNA, proteins, and protein complexes. Annu Rev Biophys Biomol Struct 36, 307‐327.
Lu, Z.J., and Mathews, D.H. (2008). OligoWalk: an online siRNA design tool utilizing hybridization thermodynamics. Nucleic Acids Res 36, W104‐108.
Lucks, J.B., Mortimer, S.A., Trapnell, C., Luo, S., Aviran, S., Schroth, G.P., Pachter, L., Doudna, J.A., and Arkin, A.P. (2011). Multiplexed RNA structure characterization with selective 2'‐hydroxyl acylation analyzed by primer extension sequencing (SHAPE‐Seq). Proceedings of the National Academy of Sciences of the United States of America 108, 11063‐11068.
Luebke, K.J., Balog, R.P., and Garner, H.R. (2003). Prioritized selection of oligodeoxyribonucleotide probes for efficient hybridization to RNA transcripts. Nucleic Acids Res 31, 750‐758.
Lunde, B.M., Moore, C., and Varani, G. (2007). RNA‐binding proteins: modular design for efficient function. Nature Reviews Molecular Cell Biology 8, 479‐490.
Madison, J.T., Everett, G.A., and Kung, H. (1966). Nucleotide sequence of a yeast tyrosine transfer RNA. Science 153, 531‐534.
Matera, A.G., Terns, R.M., and Terns, M.P. (2007). Non‐coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat Rev Mol Cell Biol 8, 209‐220.
Mathews, D.H. (2004). Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 10, 1178‐1190.
McManus, C.J., and Graveley, B.R. (2011). RNA structure and the mechanisms of alternative splicing. Curr Opin Genet Dev 21, 373‐379.
Michel, F., and Westhof, E. (1990). Modelling of the three‐dimensional architecture of group I catalytic introns based on comparative sequence analysis. J Mol Biol 216, 585‐610.
Motorin, Y., Muller, S., Behm‐Ansmant, I., and Branlant, C. (2007). Identification of Modified Residues in RNAs by Reverse Transcription‐Based Methods. 425, 21‐53.
Ozsolak, F., and Milos, P.M. (2011). RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12, 87‐98.
Parisien, M., and Major, F. (2008). The MC‐Fold and MC‐Sym pipeline infers RNA structure from sequence data. Nature 452, 51‐55.
Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad‐Toh, K., Lander, E.S., Kent, J., Miller, W., and Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2, e33.
Perona, J.J., and Hadd, A. (2012). Structural diversity and protein engineering of the aminoacyl‐tRNA synthetases. Biochemistry 51, 8705‐8729.
Rayburn, E.R., and Zhang, R. (2008). Antisense, RNAi, and gene silencing strategies for therapy: mission possible or impossible? Drug Discov Today 13, 513‐521.
Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129.
Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., and Weissman, J.S. (2013). Genome‐wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature.
Seetin, M.G., Kladwang, W., Bida, J.P., and Das, R. (2014). Massively parallel RNA chemical mapping with a reduced bias MAP‐seq protocol. Methods Mol Biol 1086, 95‐117.
Seetin, M.G., and Mathews, D.H. (2012). RNA structure prediction: an overview of methods. Methods Mol Biol 905, 99‐122.
Serganov, A., and Nudler, E. (2013). A decade of riboswitches. Cell 152, 17‐24.
Simpson, C.G., and Brown, J.W. (1995). Primer extension assay. Methods in molecular biology (Clifton, N J ) 49, 249‐256.
Spitale, R.C., Crisalli, P., Flynn, R.A., Torre, E.A., Kool, E.T., and Chang, H.Y. (2013). RNA SHAPE analysis in living cells. Nat Chem Biol 9, 18‐20.
Steen, K.A., Malhotra, A., and Weeks, K.M. (2010). Selective 2'‐hydroxyl acylation analyzed by protection from exoribonuclease. J Am Chem Soc 132, 9940‐9943.
Steitz, T.A. (2008). A structural understanding of the dynamic ribosome machine. Nat Rev Mol Cell Biol 9, 242‐253.
Straarup, E.M., Fisker, N., Hedtjarn, M., Lindholm, M.W., Rosenbohm, C., Aarup, V., Hansen, H.F., Orum, H., Hansen, J.B.R., and Koch, T. (2010). Short locked nucleic acid antisense oligonucleotides potently reduce apolipoprotein B mRNA and serum cholesterol in mice and non‐human primates. Nucleic Acids Research 38, 7100‐7111.
Szostak, E., and Gebauer, F. (2013). Translational control by 3'‐UTR‐binding proteins. Brief Funct Genomics 12, 58‐65.
Tafer, H., Ameres, S.L., Obernosterer, G., Gebeshuber, C.A., Schroeder, R., Martinez, J., and Hofacker, I.L. (2008). The impact of target site accessibility on the design of effective siRNAs. Nature Biotechnology 26, 578‐583.
Takahashi, H., Kato, S., Murata, M., and Carninci, P. (2012). CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods in Molecular Biology (Clifton, NJ) 786, 181‐200.
Tinoco, I., Jr., and Bustamante, C. (1999). How RNA folds. J Mol Biol 293, 271‐281.
Tullius, T.D., and Greenbaum, J.A. (2005). Mapping nucleic acid structure by hydroxyl radical cleavage. Curr Opin Chem Biol 9, 127‐134.
Ulitsky, I., and Bartel, D.P. (2013). lincRNAs: genomics, evolution, and mechanisms. Cell 154, 26‐46.
Underwood, J.G., Uzilov, A.V., Katzman, S., Onodera, C.S., Mainzer, J.E., Mathews, D.H., Lowe, T.M., Salama, S.R., and Haussler, D. (2010). FragSeq: transcriptome‐wide RNA structure probing using high‐throughput sequencing. Nature Methods 7, 995‐1001.
Vickers, T.A., Wyatt, J.R., and Freier, S.M. (2000). Effects of RNA secondary structure on cellular antisense activity. Nucleic Acids Res 28, 1340‐1347.
Wan, Y., Kertesz, M., Spitale, R.C., Segal, E., and Chang, H.Y. (2011). Understanding the transcriptome through RNA structure. Nat Rev Genet 12, 641‐655.
Wan, Y., Qu, K., Zhang, Q.C., Flynn, R.A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R.C., Snyder, M.P., Segal, E., et al. (2014). Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706‐709.
Weeks, K.M. (2010). Advances in RNA structure analysis by chemical probing. Current Opinion in Structural Biology 20, 295‐304.
Weeks, K.M. (2011). RNA structure probing dash seq. Proc Natl Acad Sci U S A 108, 10933‐10934.
Wells, S.E., Hughes, J.M., Igel, A.H., and Ares, M., Jr. (2000). Use of dimethyl sulfate to probe RNA structure in vivo. Methods Enzymol 318, 479‐493.
Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Jr., Morgan‐Warren, R.J., Carter, A.P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. (2000). Structure of the 30S ribosomal subunit. Nature 407, 327‐339.
Zheng, Q., Ryvkin, P., Li, F., Dragomir, I., Valladares, O., Yang, J., Cao, K., Wang, L.S., and Gregory, B.D. (2010). Genome‐wide double‐stranded RNA sequencing reveals the functional significance of base‐paired RNAs in Arabidopsis. PLoS Genet 6, e1001141.
Ziehler, W.A., and Engelke, D.R. (2001). Probing RNA structure with chemical reagents and enzymes. Curr Protoc Nucleic Acid Chem Chapter 6, Unit 6 1.
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31, 3406‐3415.
11 Papers
11.1 Paper 1: Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing
The book chapter reprinted with kind permission from Springer Science and Business Media.
Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase
termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol (Springer and
Humana Press) vol. 1038, pp 213‐231. Edited by Noam Shomron
© Springer Science+Business Media New York 2013
Chapter 13
Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing
Lukasz J. Kielpinski, Mette Boyd, Albin Sandelin, and Jeppe Vinther
Abstract
Detection of reverse transcriptase termination sites is important in many different applications, such as structural probing of RNAs, rapid amplification of cDNA 5′ ends (5′ RACE), cap analysis of gene expression, and detection of RNA modifications and protein–RNA cross-links. The throughput of these methods can be increased by applying massive parallel sequencing technologies. Here, we describe a versatile method for detection of reverse transcriptase termination sites based on ligation of an adapter to the 3′ end of cDNA with bacteriophage TS2126 RNA ligase (CircLigase™). In the following PCR amplification, Illumina adapters and index sequences are introduced, thereby allowing amplicons to be pooled and sequenced on the standard Illumina platform for genomic DNA sequencing. Moreover, we demonstrate how to map sequencing reads and perform analysis of the sequencing data with freely available tools that do not require formal bioinformatics training. As an example, we apply the method to detection of transcription start sites in mouse liver cells.
Key words Reverse transcription, Termination, Sequencing, TS2126 RNA ligase, CAGE, Galaxy
1 Introduction
Detection of reverse transcriptase termination sites (RTTS) is a general strategy that can be used to detect different features of RNA, such as their ends [1], modifications [2], structure [3], and binding of proteins [4]. Historically, RTTS have been monitored by fragment analysis using radioactive or fluorescent labelling of the primer used for the reverse transcription and detection with denaturing gel or capillary electrophoresis, respectively. Alternatively, RTTS can be detected by ligating an adapter to the 3′ end of the terminated cDNA, cloning, and sequencing. While fragment analysis has been very successfully used to investigate many different RNA features, the decreasing cost of sequencing makes it increasingly more advantageous to use sequencing for detection of RTTS. It is therefore likely that existing RTTS-based methods will be adapted for sequencing and that new methods will be developed.
Noam Shomron (ed.), Deep Sequencing Data Analysis, Methods in Molecular Biology, vol. 1038, DOI 10.1007/978-1-62703-514-9_13, © Springer Science+Business Media New York 2013
The key step in the detection of RTTS by sequencing is to attach sequencing adapter sequences to the ends of the cDNA. Typically the 5′ adapter sequence is included as an overhang in the gene-specific or random primer used for the first-strand reaction. The next step is the ligation of an adapter to the 3′ end of the terminated cDNA, and several methods for doing this have been developed. In single-strand linker ligation a double-stranded adapter with a 3′ overhang is ligated to the free 3′ end of the RTTS cDNA using T4 DNA ligase [5]. Alternatively, a single-stranded adapter can be used for ligation with the thermostable TS2126 RNA ligase (CircLigase) [6]. The efficiency of both of these enzymes is somewhat biased by the sequence at the very 3′ end of the cDNA that has to be ligated (results not shown), but these biases are reproducible and are therefore not an issue if an appropriate control is used for normalization. Another issue is the ability of reverse transcriptase to add 1–3 untemplated nucleotides to the 3′ end of cDNAs. This occurs more efficiently at capped 5′ ends compared to 5′ ends ending in OH (typical for degraded RNA) [7] and has to be taken into account when sequences are mapped to the RNA being investigated. The added nucleotides allow the reverse transcriptase to perform template switching, which can be exploited to add an adaptor sequence to the 3′ end of cDNAs [8].
Some RTTS methods have successfully been adapted to massive parallel sequencing. Cap analysis of gene expression (CAGE) has been successfully used to identify transcription start sites (TSS) [9]. Originally the CAGE method was based on concatenation of CAGE tags and Sanger sequencing [10], but it has recently been adapted to massive parallel sequencing [1]. Another example is SHAPE-based probing of RNA structure, which has been widely and successfully used for investigating the structure of single RNAs using capillary electrophoresis [11]. Nevertheless, recent results demonstrating that populations of RNA molecules can be SHAPE probed in parallel using sequencing fuel hope that the throughput of structure probing can be increased [12]. These successful implementations of sequencing for RTTS detection suggest that RTTS methods generally can be adapted to the new sequencing technologies.
Here, we describe a general method for detecting RTTS based on the Illumina paired-end genomic DNA adapters, sequencing primer, and indexing reads. Samples can therefore be multiplexed with other samples containing the standard Illumina adaptors and used for both single- and paired-end sequencing. The method can easily be adapted to detect RTTS produced by any experimental protocol. In addition, we demonstrate in detail how to go from the raw sequencing reads to counts of RTTS mapped to the RNA being investigated and how to compare with the existing annotation and visualize the results in the UCSC genome browser. An overview of the entire protocol is shown in Fig. 1.
214 Lukasz J. Kielpinski et al.
Fig. 1 Schematic outline of the analysis. The starting material consists of RNA molecules containing a feature of interest, which can cause reverse transcriptase termination. The RNA is reverse transcribed with a primer containing a 5′ adapter overhang. After cDNA purification, a second adapter is ligated to the 3′ ends of the obtained cDNA. Molecules containing both adapters serve as templates for a PCR, which adds all necessary elements for Illumina sequencing. After library sequencing, the resulting sequencing reads are mapped to sequences of interest (this could be the full genome or selected RNA sequences) and the locations of the reads' 5′ ends (corresponding to the feature of interest) are counted. The resulting RTTS count file can be used for further analysis, such as visualization in the UCSC genome browser, producing RTTS plots for specific RNA molecules, and comparing with the existing annotation
Reverse Transcriptase Termination Site (RTTS) Mapping 215
2 Materials
2.1 RNA Sample
1. Material to be analyzed: The RNA should be treated in a way that reverse transcription will terminate on sites of interest. This could be RNA strand breaks, RNA modifications, RNA 5′ ends, protein–RNA cross-links among others.
2.2 Oligonucleotides
1. Oligonucleotide sequences are listed in Table 1. RT_random_primer and LIGATION_ADAPTER were HPLC purified, and the remaining oligonucleotides were PAGE purified.
2.3 Reverse Transcription and Purifications
1. PrimeScript™ Reverse Transcriptase including PrimeScript™ 5× buffer (Takara).
2. 10 mM dNTPs.
3. Sorbitol–trehalose mix (1.67 M sorbitol, 0.33 M trehalose).
4. Agencourt® AMPure® XP–PCR Purification (Beckman Coulter).
5. Agencourt® RNAClean® XP (Beckman Coulter).
6. 70 % EtOH.
7. 5 mM Na-citrate pH 6.
8. 10 mM Tris–HCl pH 8.3.
9. RNAseH (New England Biolabs).
2.4 Linker Ligation
1. CircLigase (Epicentre).
2. 1 mM ATP (Epicentre).
3. CircLigase buffer (Epicentre).
4. 50 mM MnCl2 (Epicentre).
5. 50 % PEG 6000 (filter sterilized).
6. 5 M glycine betaine (filter sterilized).
2.5 PCR
1. Phusion® High-Fidelity DNA Polymerase (NEB).
2. 5× HF Phusion buffer (NEB).
3. 10 mM dNTPs.
4. H2O (PCR grade).
2.6 Quality Control
1. Agarose electrophoresis.
2. 1× TBE buffer.
3. Agarose.
4. 6× DNA loading buffer (Fermentas).
5. DNA size standard with 150 bp band (e.g., Ultra Low Range DNA ladder, Fermentas).
Table 1
Oligonucleotides used in this study

Name                          Primer sequence
RT_random_primer              AGACGTGTGCTCTTCCGATCTNNNNNNNNS
LIGATION_ADAPTER              5′ phosphate-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′ 3NHC3
PCR_forward                   AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.1_ATCACG    CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.2_CGATGT    CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.3_TTAGGC    CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.4_TGACCA    CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.5_ACAGTG    CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.6_GCCAAT    CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.7_CAGATC    CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.8_ACTTGA    CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.9_GATCAG    CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.10_TAGCTT   CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.11_GGCTAC   CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.12_CTTGTA   CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.13_AGTCAA   CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.14_AGTTCC   CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.15_ATGTCA   CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC   CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

All sequences are 5′–3′
The oligonucleotide sequences of the Illumina genomic DNA adapters are copyrighted by Illumina, Inc. 2006. All rights reserved
LIGATION_ADAPTER is a linker with clonable 5′ end and with an amino-blocked 3′ end
Index sequences are shown in bold
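The relationship between the six bases in each PCR_REVERSE_INDEX name and the primer sequence can be checked with a short script. The sketch below is only an illustration: it relies on the observation, made from the Table 1 entries themselves, that each primer carries the reverse complement of the index in its name between two constant flanks. The function and variable names are hypothetical and not part of the protocol.

```python
# Constant flanks shared by all PCR_REVERSE_INDEX primers in Table 1.
PREFIX = "CAAGCAGAAGACGGCATACGAGAT"
SUFFIX = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def index_primer(named_index: str) -> str:
    """Rebuild a reverse index primer from the index given in its name."""
    return PREFIX + reverse_complement(named_index) + SUFFIX


# Three entries copied from Table 1 (index in the name -> primer sequence).
TABLE_1_SUBSET = {
    "ATCACG": "CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "CGATGT": "CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "CCGTCC": "CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
}

for named_index, primer in TABLE_1_SUBSET.items():
    # Fails loudly if a primer was mistyped when ordering or transcribing.
    assert index_primer(named_index) == primer, named_index
```

A check of this kind is useful when transcribing primer sequences into an order form, since a single mistyped base in the index would misassign reads during demultiplexing.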
6. Stain G (Serva).
7. Agilent DNA 1000 Kit (Agilent Technologies).
2.7 Equipment
1. Tubes.
(a) RT, purifications, ligation: 0.5 ml PCR tube (BRAND, 781310).
(b) PCR: 0.2 ml 8-Strip tubes (alpha laboratories, LW2500).
2. Thermocyclers.
(a) RT, ligation: MJ Research, PTC-200 for 0.5 ml tubes.
(b) PCR: BIORAD S1000.
3. Magnetic stand.
4. NanoDrop 1000.
5. Agilent 2100 Bioanalyzer.
3 Methods
3.1 Reverse Transcription (Modified from ref. 1)
1. Mix 1 μg of RNA starting material (could be in vitro-transcribed RNA or purified RNA) with 100 pmol of RT_random_primer in a 7.5 μl volume (optimal amount of primer can vary with the specific application). Heat denature for 5 min at 65 °C, and put on ice (see Note 1).
2. Prepare master mix. For one reaction take 7.5 μl 5× PrimeScript Buffer, 1.87 μl 10 mM dNTP, 7.5 μl sorbitol–trehalose mix (which is half of the concentration used in [1]), 9.38 μl H2O, and 3.75 μl PrimeScript enzyme. Add 30 μl master mix to RNA–primer, and mix by pipetting (see Note 2).
3. Incubate as follows: 25 °C, 10 min (skip this incubation if a gene-specific primer is used); 42 °C, 30 min; 50 °C, 10 min; 56 °C, 10 min; 60 °C, 10 min; and hold on 4 °C. The result of the reverse transcription is a cDNA carrying a 5′ adapter and terminating at the feature of interest (Fig. 2a).
4. Inactivate reverse transcriptase enzyme by incubating sample for 15 min at 70 °C, then place on ice, and add 1 μl RNase H enzyme (New England Biolabs, 5,000 U/ml). Incubate for 20 min at 37 °C to degrade the RNA (see Note 3).
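When several samples are processed in parallel, the per-reaction volumes of step 2 have to be scaled. The following sketch encodes the reverse transcription master mix from step 2 and scales it to a given number of reactions. The 10 % pipetting overage and all names in the code are assumptions made for illustration; the protocol itself does not specify an overage.

```python
# Per-reaction volumes (µl) of the reverse transcription master mix (step 2).
RT_MASTER_MIX_UL = {
    "5x PrimeScript buffer": 7.5,
    "10 mM dNTP": 1.87,
    "sorbitol-trehalose mix": 7.5,
    "H2O": 9.38,
    "PrimeScript enzyme": 3.75,
}


def scale_master_mix(recipe_ul, n_reactions, overage=0.10):
    """Return total volumes (µl) for n_reactions plus a pipetting overage.

    The default 10 % overage is an assumption, not part of the protocol.
    """
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    factor = n_reactions * (1.0 + overage)
    return {name: round(volume * factor, 2) for name, volume in recipe_ul.items()}


# The components should add up to the 30 µl added to each RNA-primer mix.
assert abs(sum(RT_MASTER_MIX_UL.values()) - 30.0) < 1e-6

for name, volume in scale_master_mix(RT_MASTER_MIX_UL, n_reactions=8).items():
    print(f"{name:>24}: {volume:7.2f} µl")
```

The same function can be reused for the ligation and PCR master mixes by passing a different recipe dictionary.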
3.2 cDNA Purification (Modified from ref. 1)
1. Add 67.5 μl RNAClean XP beads (room temperature, well mixed) to reactions, and pipette mix. Incubate at room temperature for 30 min, vortexing every 10 min.
2. Put on magnetic stand for 5 min, and aspirate cleared solution.
3. 2× wash with 70 % ethanol (used volume depends on the tubes used; for 500 μl tubes use 400 μl ethanol).
4. Add 40 μl 5 mM Na-citrate (pH 6) preheated to 37 °C, and mix extensively by pipetting. Incubate for 10 min at 37 °C.
5. Place on magnetic stand, and transfer eluant to the new tube (see Note 4).
Fig. 2 Outline of library generation. (a) The first steps in library generation are reverse transcription and ligation of an adapter to the 3′ end of the cDNA, which corresponds to the location of the feature of interest. (b) In the subsequent PCR Illumina adapter sequences are added to produce a double-stranded DNA library that is ready for sequencing on the Illumina genomic DNA platform
3.3 cDNA Ligation
1. Prepare master mix. For one reaction take 1 μl of CircLigase buffer (Epicentre), 0.5 μl of 1 mM ATP, 0.5 μl 50 mM MnCl2, 2 μl of 50 % PEG 6000, 2 μl of 5 M betaine, 0.5 μl 100 μM LIGATION_ADAPTER, and 0.5 μl CircLigase enzyme. Mix well.
2. Split master mix into 7 μl aliquots, and add 3 μl of cDNA.
3. Incubate as follows: 60 °C, 2 h; 68 °C, 1 h; 80 °C, 10 min; and hold on 4 °C.
4. Add 10 μl H2O to increase volume.
5. Purify as in Subheading 3.2, but using Ampure XP beads (20 μl ligation reaction + 36 μl Ampure beads). Elute in 16 μl H2O. The result of the cDNA ligation step is a single-stranded cDNA containing adapters at both the 5′ and 3′ ends, which can be used for the subsequent PCR reaction (Fig. 2b).
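The bead volumes quoted in the purification steps are consistent with a fixed ratio of 1.8 volumes of beads per volume of sample: 36 μl for the 20 μl ligation reaction, 72 μl for 40 μl of PCR reaction in Subheading 3.5, and 67.5 μl for the 37.5 μl reverse transcription reaction before RNase H addition. The helper below is a minimal sketch for working out the bead volume when a reaction volume deviates from the protocol. The 1.8 ratio is inferred from those numbers rather than stated in the text, and the function name is hypothetical.

```python
def bead_volume_ul(sample_volume_ul, ratio=1.8):
    """Bead volume (µl) for a given sample volume.

    The default ratio of 1.8 is inferred from the volumes in the protocol.
    """
    if sample_volume_ul <= 0:
        raise ValueError("sample volume must be positive")
    return round(sample_volume_ul * ratio, 1)


# Volumes quoted in the protocol are reproduced by the inferred ratio.
assert bead_volume_ul(20) == 36.0    # ligation reaction, Subheading 3.3
assert bead_volume_ul(40) == 72.0    # PCR reaction, Subheading 3.5
assert bead_volume_ul(37.5) == 67.5  # RT reaction, Subheading 3.2
```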
3.4 PCR Amplification of Library
1. Prepare master mix. For one reaction take 3 μl PCR_forward 10 μM primer, 10 μl Phusion 5× HF buffer, 1 μl 10 mM dNTPs, 27.5 μl H2O, and 1 μl Phusion DNA polymerase. Mix well.
2. Split master mix into 42.5 μl aliquots, and add 2.5 μl of indexing primer (PCR_REVERSE_INDEX.##_NNNNNN) (see Note 5) and 5 μl purified linker-ligated cDNA. Start the PCR reaction program as follows: 98 °C, 3 min; (98 °C, 80 s; 64 °C, 15 s; 72 °C, 30 s) × 4; (98 °C, 80 s; 72 °C, 45 s) × 15; 72 °C, 5 min; and hold on 4 °C (see Note 6).
3. Agarose electrophoresis (see Note 7). Prepare 2 % agarose gel with a DNA stain (e.g., Stain G). Apply 5 μl of samples (add loading dye) and size standard, and run at 4 V/cm until bromophenol blue from loading dye has travelled approximately 2.5 cm. Visualize under UV light. You should see smears of products longer than 200 bp. Presence of amplified PCR product shorter than 150 bp is typically caused by low amounts of starting material, combined with small amounts of leftover reverse transcription primer in the ligation reaction, and is the result of amplification of directly ligated RT primer–LIGATION_ADAPTER molecules, which can be Illumina sequenced, but is uninformative (see Fig. 3a). To get rid of the short PCR product, try to redo the library with more starting material or alternatively perform agarose gel purification to remove the short PCR product. In case that no amplified library (smear) is detected at this step, perform small-scale PCR with different number of cycles and analyze the PCR reaction by agarose electrophoresis. Then repeat the PCR reaction with the lowest number of PCR cycles that allows for detection of the library on the gel. Optimal number of cycles depends on the amount of starting material.
3.5 Purification and
Quantification of
Library (See Note 7)
1. Ampure XP purification: proceed as in Subheading 3.2, but use Ampure XP beads and add 72 μl beads to 40 μl PCR reaction. Elute in 20 μl preheated 10 mM Tris–HCl pH 8.3.
2. Measure concentration on NanoDrop (as dsDNA) and run Bioanalyzer DNA 1000. Perform smear analysis (side panel -> Global -> Advanced -> Smear analysis -> regions) with range 140–600 bp and use molarity as a guideline for your sequencing order. The library should contain dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp (Fig. 3b) (see Note 8).
3. Samples can now be sequenced using standard Illumina DNA genomic sequencing and can be multiplexed with other samples made with the same adapters (genomic DNA) as long as they utilize different indexes (see Note 5).
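The relation between mass concentration and molarity used for pooling can be sketched as follows. This is an illustration only: it assumes the common approximation of 660 g/mol per base pair of dsDNA, and the function names and library values are hypothetical. The Bioanalyzer smear analysis described above reports molarity directly, so the calculation is mainly useful as a cross-check.

```python
AVG_BP_MASS = 660.0  # approximate molar mass of one dsDNA base pair in g/mol (assumption)


def molarity_nM(conc_ng_per_ul, mean_length_bp):
    """Convert a dsDNA mass concentration (ng/µl) to nanomolar."""
    if mean_length_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    # 1 ng/µl = 1e-3 g/l; dividing by g/mol gives mol/l; 1e9 converts to nM
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_length_bp)


def pooling_volumes(libraries, fmol_each):
    """Volume (µl) of each library that contains fmol_each femtomoles."""
    # 1 nM equals 1 fmol/µl, so volume = amount / molarity
    return {
        name: fmol_each / molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }


# Hypothetical libraries: name -> (ng/µl, mean fragment length in bp)
libs = {"index_01": (6.6, 250), "index_02": (13.2, 250)}
print(pooling_volumes(libs, fmol_each=20))
```

In this example the first library is 40 nM and the second 80 nM, so equimolar pooling takes twice the volume of the first library.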
3.6 Data Analysis Using Linux Command Line
1. Data analysis of massive parallel sequencing experiments can be a challenge for scientists without formal training in bioinformatics. Below we demonstrate in detail how to go from the sequencing output (FASTQ file) to an RTTS count file without assuming prior knowledge of bioinformatics using tools available in GALAXY [13–16], including the Bowtie mapper for sequencing reads [17] and the FASTX toolkit [18]. However, using a Unix or an OS X machine with a command line interface is recommended for large projects. For those users, the analysis implemented in Subheadings 3.7–3.9 can be carried out using Bowtie and an awk script available at this URL http://people.binf.ku.dk/~lukasz/SAM2counts.awk.
Fig. 3 Expected result from PCR amplification. (a) PCR products are first checked by agarose electrophoresis. A successfully prepared library should form a smear of molecules longer than 150 bp (lane 2). Presence of band shorter than 150 bp (lane 1) indicates problems with library preparation (see step 3 of Subheading 3.4). (b) Library is purified and checked for size distribution on Agilent Bioanalyzer DNA 1000 chip. A successfully prepared library should have dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp
3.7 Quality Check of Sequencing Reads
1. Log in to Galaxy (http://usegalaxy.org/) and create a new Galaxy history. Upload the relevant FASTQ files to Galaxy with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTQ file on your computer and click execute (see Note 9).
2. Check the integrity of the FASTQ files with the “FASTQ Groomer” tool found in the “NGS: QC and manipulation” tool category. For newer FASTQ files (Illumina 1.8 and later) the quality is encoded in Sanger format. Choose the Galaxy history item containing the FASTQ file, set “Input FASTQ quality scores type:” to Sanger, and click execute (see Note 10).
3. Compute FASTQ quality statistics with the “Compute quality statistics” tool found in the “NGS: QC and manipulation” tool category. Choose the groomed FASTQ file and click execute.
4. Plot the distributions of quality scores for the different sequencing cycles using the “Draw quality score boxplot” tool found in the “NGS: QC and manipulation” tool category. Choose the Galaxy history item containing the output of the “Compute quality statistics” tool and click execute. Look at the resulting boxplot by clicking on the eye icon next to the “Draw quality score boxplot” history item (see Fig. 4a). For most experiments, where the median quality is not very low (falling below 25), it is unnecessary to filter the reads on quality. If quality is very low, it may be an advantage to remove low-quality reads using the “Filter by quality” tool found in the “NGS: QC and manipulation” tool category. Set the “Quality cut-off value” option to 20 and the “Percent of bases in sequence that must have quality equal to/higher than cut-off value” option to 90 and click execute.
5. Plot nucleotide distributions of the different sequencing cycles using the “Draw nucleotides distribution chart” tool found in the “NGS: QC and manipulation” tool category. Look at the resulting plot by clicking on the eye icon next to the “Draw nucleotides distribution chart” history item (see Fig. 4b). The nucleotide distributions are typically similar across the sequencing cycles, but if this is not the case, the library may not have sufficient complexity or be contaminated with adapter–adapter ligation products.
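For readers working outside Galaxy, the quality filter described in step 4 (cut-off 20, at least 90 % of bases at or above the cut-off) can be sketched in a few lines. This is an illustration of the criterion, not the FASTX implementation; it assumes Sanger (Phred+33) encoded, four-line FASTQ records, and the function names are hypothetical.

```python
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+ quality encoding


def passes_filter(quality_string, cutoff=20, min_percent=90):
    """True if at least min_percent of the bases have quality >= cutoff."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality_string]
    if not scores:
        return False
    good = sum(1 for s in scores if s >= cutoff)
    return 100.0 * good / len(scores) >= min_percent


def filter_fastq(lines, cutoff=20, min_percent=90):
    """Yield 4-line FASTQ records (as tuples) whose qualities pass the filter."""
    # Group the input four lines at a time; an incomplete trailing record is dropped
    for header, seq, plus, qual in zip(*[iter(lines)] * 4):
        if passes_filter(qual.rstrip("\n"), cutoff, min_percent):
            yield header, seq, plus, qual


# Two made-up reads: 'I' encodes quality 40, '#' encodes quality 2
example = [
    "@read1", "ACGTACGTAC", "+", "IIIIIIIII#",  # 90 % of bases >= 20, kept
    "@read2", "ACGTACGTAC", "+", "IIIIIIII##",  # 80 % of bases >= 20, removed
]
for record in filter_fastq(example):
    print(record[0])  # -> @read1
```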
Fig. 4 Expected quality plots of sequencing reads. (a) Example of quality boxplot produced by Galaxy. The plot shows the median read quality in the different sequencing cycles. (b) Example of nucleotide distribution plot produced by Galaxy. The plot shows the percentage of the nucleotides in the different sequencing cycles. Deviation from uniform distribution in the first cycle reflects a combination of bias for specific nucleotides in terminal transferase activity of Reverse Transcriptase, bias in the TS2126 RNA ligase reaction and in some cases biased sequences of the genomic locations being mapped by the RTTS
3.8 Mapping Reads with Bowtie
1. Depending on the nature of your experiment you can map your reads either to the entire genome relevant for the experiment or to one or more RNA sequences. The genomes of the most commonly investigated species are pre-installed in Galaxy, whereas mapping to one or more specific RNAs requires that the sequence(s) is uploaded to Galaxy as a FASTA file. If necessary upload a FASTA file with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTA file on your computer and click execute.
2. To map the reads, use the “Map with Bowtie for Illumina” tool found in the “NGS: Mapping” tool category. If mapping to a genome that is pre-indexed in Galaxy, choose “Use a built-in index” and the relevant genome. Otherwise choose “Use one from history” and select the history item containing the uploaded FASTA file. Next, select the history item containing the groomed (and filtered) FASTQ file under the “FASTQ file” option and choose “Full parameter list” in the “Bowtie settings to use” drop-down menu. Then change “Maximum number of mismatches permitted in the seed (-n)” to 3 and “Maximum permitted total of quality values at mismatched read positions (-e)” to 300 and choose “Use best” in the “Whether or not to make Bowtie guarantee that reported singleton alignments are ‘best’ in terms of stratum and in terms of the quality values at the mismatched positions (--best)” drop-down menu. Finally map the reads by clicking Execute. The mapping may take a while depending on the size of the FASTQ file and the sequence to be mapped against (see Note 11).
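For command-line users, the Galaxy settings above correspond to Bowtie options that can be assembled as in the sketch below. It only builds and prints the command and does not run it. The file names and index prefix are hypothetical, the index is assumed to have been built beforehand with bowtie-build, and the positional argument order follows the classic Bowtie 1 usage, which should be checked against `bowtie --help` for the installed version.

```python
import shlex


def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build a Bowtie 1 argument list mirroring the Galaxy settings in step 2."""
    return [
        "bowtie",
        "-n", "3",    # maximum number of mismatches permitted in the seed
        "-e", "300",  # maximum total of quality values at mismatched positions
        "--best",     # report the best alignment in terms of stratum and quality
        "-S",         # write alignments in SAM format
        index_prefix,
        fastq_path,
        sam_path,
    ]


# Hypothetical file names, for illustration only
cmd = bowtie_command("rna_index", "reads_groomed.fastq", "mapped.sam")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point to real files.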
3.9 Preparing an RTTS Count File from SAM File
1. It is necessary to trim mapped reads that contain untemplated nucleotides added by reverse transcriptase (see Note 12). This trimming requires many Galaxy operations and we have therefore created a Galaxy workflow to perform this operation and subsequently count and sum RTTS. In this procedure reads are trimmed if they contain mismatches in the first three positions (Fig. 5). To download the workflow go to https://main.g2.bx.psu.edu/workflow/list_published and search for RTTS Mapper. Click on the workflow and import it into your own Galaxy account by clicking on “Import workflow” in the upper right corner. Alternatively, if a local instance of Galaxy is used, the RTTS Mapper workflow can be imported into Galaxy by clicking on “Workflow” on the top Galaxy bar and then on the “Upload and import workflow” button in the upper right corner. At this URL https://main.g2.bx.psu.edu/workflow/import_workflow, the workflow can be imported by providing the URL http://people.binf.ku.dk/~lukasz/Galaxy_RTTS_mapper.ga as the “Galaxy workflow URL” and clicking import. Also import a control file to your history by pasting this URL http://people.binf.ku.dk/~lukasz/RTTS_control.interval in the URL/Text window in the “Upload File from your computer” tool found in the “Get Data” tool category.
2. To prepare RTTS count files, use the RTTS Mapper workflow imported above. Click on “Workflow” on the top Galaxy bar and then on the “RTTS mapper” workflow and choose “Run.” Select the history item containing the SAM file from the Bowtie mapping for the “Select dataset to convert” option and the RTTS_control.interval file for the “Select control file” option and click “Run workflow” at the bottom of the page.
3. The resulting RTTS count files (for counts on the plus and minus strand, respectively) can be used for further analysis in R, Excel, or other data analysis program. The exact analysis will depend on the nature of the experiment performed. Below we provide tools for some common types of analysis using the freely available tool R, which can easily be installed on any computer platform [19] (see Note 13).
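The trimming rule summarized in Fig. 5 can be expressed compactly. The sketch below is not the code of the RTTS Mapper workflow; it assumes an ungapped alignment on the plus strand, with the read and the matching stretch of reference given as equal-orientation strings, and it treats reads with several mismatches in the window by trimming through the innermost one (an assumption). Function names are hypothetical.

```python
def trim_length(read, reference, window=3):
    """Number of 5' terminal read positions to trim (0 to window)."""
    trim = 0
    for i in range(min(window, len(read), len(reference))):
        if read[i].upper() != reference[i].upper():
            trim = i + 1  # trim up to and including this mismatched position
    return trim


def rtts_position(start, read, reference):
    """1-based reference position of the read 5' end after trimming (plus strand)."""
    return start + trim_length(read, reference)


ref = "ACGTA"
print(trim_length("ACGTA", ref))  # full match -> 0
print(trim_length("TCGTA", ref))  # mismatch at the terminal position -> 1
print(trim_length("ATGTA", ref))  # mismatch one position in -> 2
print(trim_length("ACTTA", ref))  # mismatch two positions in -> 3
```

Mismatches beyond the third position are left to the mapper's mismatch handling and do not affect the trimmed 5' end.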
3.10 Preparing Wig File and Visualizing in the UCSC Genome Browser
If the RTTS experiments have been mapped to a genome assembly, it will often be advantageous to visualize the results on the UCSC Genome Browser and compare with the many kinds of data available as tracks. To do this it is necessary to convert the RTTS file to the UCSC wig format.
Fig. 5 Schematic representation of the trimming performed by the RTTS mapper. After mapping the reads to genome the three 5’ terminal nucleotides of mapped reads (corresponding to 3’ ends of cDNA molecules) are evaluated for mismatches to the reference sequence and trimmed if necessary. The four possible scenarios are the following: full match (a), mismatch at the terminal position (b), position one (c), or position two (d) before the terminal position, in which cases we trim 0, 1, 2, or 3 positions, respectively (returned positions are indicated by the triangles). Red boxes: Mismatched positions; white boxes: matched positions
1. Download RTTS count files to your local computer from the Galaxy server by clicking on the floppy disc icon for the relevant history items.
2. The RTTS count file can be converted to wig format by copy/pasting a small program (script) into R. Download the provided script from http://people.binf.ku.dk/~lukasz/wig_generator.R. Open the file in a text editor and modify it by changing the assignment of variables “input_filename_plus” and “input_filename_min” to names of files produced by the Galaxy workflow.
3. Start up R, and change working directory to the one containing RTTS count files by writing “setwd(‘path of file directory’)” in the console window and pressing enter or using the “Change dir” command found in the File menu. Then paste the edited script into the R console and hit enter. This will produce two new files named OUTPUTp.wig and OUTPUTm.wig in the same folder.
4. Go to the UCSC genome browser (http://www.genome.ucsc.edu/cgi-bin/hgGateway) and choose the species and assembly that were used for the mapping of the RTTS experiment in the drop-down menu. Then click “manage custom tracks,” browse the local drive for the wig files, and submit them one by one (after adding the first one press “add custom tracks”). Finally press “go to the genome browser” to see the RTTS counts visualized as a histogram at each genomic position (Fig. 6a).
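The conversion performed by the provided R script can be illustrated with a short sketch that writes counts in the UCSC variableStep wig format. This is not a translation of wig_generator.R: the layout of the RTTS count file is not specified here, so the counts are assumed to be already loaded into a dictionary, and the function and track names are hypothetical. Negative values for the minus strand follow the display convention described for Fig. 6a.

```python
def counts_to_wig(counts, track_name, negate=False):
    """Render {chrom: {1-based position: count}} as variableStep wig text."""
    lines = ['track type=wiggle_0 name="%s"' % track_name]
    for chrom in sorted(counts):
        lines.append("variableStep chrom=%s" % chrom)
        # wig positions are 1-based and must be listed in ascending order
        for pos in sorted(counts[chrom]):
            value = counts[chrom][pos]
            lines.append("%d %d" % (pos, -value if negate else value))
    return "\n".join(lines) + "\n"


# Made-up counts for illustration
plus_counts = {"chr1": {100: 5, 90: 2}}
print(counts_to_wig(plus_counts, "RTTS_plus"), end="")
print(counts_to_wig(plus_counts, "RTTS_minus", negate=True), end="")
```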
3.11 Making Plots for Single RNAs
If the RTTS data was mapped to single RNAs (using a provided FASTA file) rather than the full genome, it will often be relevant to visualize the RTTS counts across each of the different RNAs.
1. Create a new folder and download to it the FASTA file that was used for mapping and the two RTTS count files from the Galaxy history by clicking on the floppy disc icon for the relevant history items. Change the file names of the RTTS count files to counts_plus.txt and counts_minus.txt.
2. Then download this R script http://people.binf.ku.dk/~lukasz/few_genes_histogram.r and open it in a text editor. Start R and set the working directory (as described in step 3 of Subheading 3.10) to the folder containing the FASTA file and the RTTS count files. To generate RNA-specific RTTS plots for each RNA present in the FASTA file that has at least one read mapped, copy/paste the script to the R console window and hit enter (Fig. 6b).
3.12 Comparing to Annotation Data
In some cases, it will be relevant to compare RTTS data to some kind of annotation to identify global trends. This can be done by summarizing the read counts around a set of locations. We have prepared an R script utilizing Bioconductor [20] for generating such a plot from the RTTS count wigs and an additional file containing either user-supplied genomic locations or RefSeq TSS.
Fig. 6 Example of output produced with the described protocol. Mouse liver RNA was analyzed with the described protocol, including an optional CAGE selection to enrich for RTTS corresponding to transcription start sites. (a) Output of Subheading 3.10. The sequencing data was mapped to genome, RTTS counted, converted to wig file, and uploaded to UCSC genome browser. Height of the bar at each genomic location corresponds to the number of read 5’ ends mapping to this location. Minus strand is shown with negative values using different scale. (b) Output of Subheading 3.11. Reads were mapped to a single sequence (Hmgcs2 mRNA) and count of 5’ ends at each location was plotted. Reads mapping to positive strand are shown as above 0, while those mapping to negative strand as below zero. (c) Output of Subheading 3.12. Upper plot shows sum of reads at each distance from annotated TSS. High peak at position 1 results from many reads mapped to known TSS, while high peak at position 13 results from an alternative TSS for the highly expressed albumin transcript
1. Prepare a file with a set of locations that will serve as reference points for counting read locations. The file must contain three tab-delimited columns with headers: “seqnames” (name of the chromosome), “position,” and “strand.” An example is given in the script. Positions must be 1-based (see Note 14).
2. Download the script from this URL http://people.binf.ku.dk/~lukasz/plot_around_locations_from_wig.r and open it in a text editor. In the text editor, edit the input file names to match two wig files prepared as described in Subheading 3.9 and the position file prepared above. Also edit genome assembly name and the size of the window surrounding the given positions and used for summarizing read counts.
3. Start R, set proper working directory, and copy/paste the script to R console. This will produce a barplot of the RTTS counts relative to the positions given as reference (Fig. 6c).
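The summarization carried out in this section amounts to adding up RTTS counts at each distance from a set of strand-aware reference positions. The sketch below shows that idea only and is not a port of the provided R script: the data structures and names are hypothetical, and offset 0 here denotes the reference position itself, which may differ from the numbering used in Fig. 6c.

```python
from collections import defaultdict


def sum_around(locations, plus_counts, minus_counts, window):
    """Sum RTTS counts at each offset (-window..window) from reference positions.

    locations: iterable of (chrom, 1-based position, strand) tuples
    plus_counts / minus_counts: {(chrom, 1-based position): count}
    """
    totals = defaultdict(int)
    for chrom, pos, strand in locations:
        counts = plus_counts if strand == "+" else minus_counts
        step = 1 if strand == "+" else -1  # downstream runs leftwards on the minus strand
        for offset in range(-window, window + 1):
            totals[offset] += counts.get((chrom, pos + step * offset), 0)
    return dict(totals)


# Made-up reference positions and counts
locs = [("chr1", 100, "+"), ("chr1", 200, "-")]
plus = {("chr1", 100): 4, ("chr1", 101): 1}
minus = {("chr1", 200): 3, ("chr1", 199): 2}
print(sum_around(locs, plus, minus, window=1))  # -> {-1: 0, 0: 7, 1: 3}
```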
4 Notes
1. The amount of starting material can be reduced if necessary. On the other hand, for samples that are to be used for CAGE selection a minimum of 5 μg of RNA is needed. The amount of reverse transcription primer should be scaled with the amount of RNA. The quality of the RNA starting material is very important as degraded RNA will produce background in any type of experiment based on detection of RTTS. Moreover, random priming typically produces more background than gene-specific priming. In CAGE experiments the non-full-length cDNAs are removed in a selection step, thereby effectively reducing the background, but in other applications a negative control sample is required and can be used to normalize for reverse transcriptase pretermination.
2. The priming sequence used in RT_random_primer (..NNNNNNNNS-3′) can be modified according to specific needs. In many cases, such as RNA structure probing, a gene-specific primer with the 5′ overhang sequence can be used (5′-AGACGTGTGCTCTTCCGATCT-“gene specific sequence”).
3. If CAGE selection is to be performed this step should be skipped.
4. Optional selection of full-length cDNA for CAGE analysis of TSS can be performed according to Subheadings 3.3–3.7 [without concentration] as described in [1] and results in a total volume of 34 μl cap-selected RNA.
5. Be careful with low-level pooling of indexes, since proper sequencing requires that at each cycle there is at least one nucleotide read with the green laser (G or T) and one read with the red laser (A or C). See more at http://www.epibio.com/pdftechlit/312pl1211.pdf.
6. Using a long denaturation time in the PCR reaction helps alleviate GC bias and fosters reproducibility between different thermal cyclers [21].
7. To simplify the procedure and reduce the risk of contaminating laboratory space with generated libraries, one can, instead of running an agarose gel, analyze and quantify the PCR products on a Bioanalyzer DNA 1000 chip without prior purification. This allows pooling the crude reactions in the right proportions (it is advisable to add EDTA to the reactions before pooling to avoid index switching) and performing only a single Ampure XP purification.
8. When the prepared libraries have the same size distribution, they can be pooled based on the concentration measured on the NanoDrop.
9. The output from sequencing is one or more FASTQ files containing the sequence reads and the corresponding quality scores. If several indexes were used for different experimental conditions, FASTQ files from each index should be analyzed individually. If using the main Galaxy server and dealing with large datasets (>2 GB), it is an advantage to use the Galaxy ftp upload. A tutorial can be found here: http://screencast.g2.bx.psu.edu/quickie_17_ftp_upload/flow.html. The analysis described below can be carried out on a local instance of Galaxy or on the main Galaxy server (http://usegalaxy.org/). When using the main server be sure to make a login so that your analysis is saved. Alternatively the analysis can be performed on a Unix/OS X machine in-house (see Subheading 3.6).
10. If the dataset consists of several FASTQ files, they can be merged into one file at this point with the “Concatenate datasets” tool found in the “Text Manipulation” tool category to facilitate the further analysis of the full dataset.
11. Other sequencing read mappers can be used instead of the Bowtie mapper. However, it is important not to use too stringent a cutoff for mapping, because a considerable fraction of reads contain untemplated sequence added by reverse transcriptase at the 5′ end. The stringency of mapping conditions should be considered individually for each experiment while taking the quality of the sequencing reads and the complexity of the sequences that are being mapped against into account. When mapping against short sequences the coverage towards the 3′ end can be improved by trimming sequencing reads from the 3′ end.
12. Reverse transcriptase will in some cases add extra untemplated nucleotides after terminating at the 5′ end of the RNA. This is especially pronounced when the 5′ end of the RNA is capped, which is the case for mRNAs. For the conditions described here, we find that the Primescript RT enzyme will add untemplated nucleotides in 81 % of cases for RTTS located closer than 50 nt to an annotated TSS (most of these presumably being capped), while the same is the case for 12 % of the RTTS located elsewhere. It is therefore necessary to trim the reads that have one or more mismatches in the first three mapped positions, which is implemented in the published workflow. In the cases where the untemplated nucleotide matches the genomic sequence, it is not possible to do trimming.
13. R can be freely downloaded for any platform at http://cran.r-project.org/. Scripts are written for version 2.15.
14. At this step the user must ensure that the numbers provided as locations of interest are in a 1-based coordinate system. This system is used, e.g., in the UCSC genome browser display window. Be aware that tables downloaded from the UCSC table browser are provided in a 0-based system. To use TSS information from such a table in the provided script one must add 1 to the starting positions. Read more on coordinate systems at http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms.
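The coordinate conversion described in Note 14 can be sketched as follows. The example assumes a table with 0-based, half-open transcript coordinates in fields named txStart and txEnd (as in UCSC gene prediction tables); the function name is hypothetical. For plus-strand transcripts the TSS is the start plus 1, whereas for minus-strand transcripts the TSS is the transcript end, which in a half-open interval already equals the 1-based coordinate of the last base.

```python
def tss_one_based(tx_start, tx_end, strand):
    """Return the 1-based TSS from 0-based, half-open transcript coordinates."""
    if strand == "+":
        return tx_start + 1  # 0-based start -> 1-based position of the first base
    if strand == "-":
        return tx_end        # exclusive 0-based end == 1-based position of the last base
    raise ValueError("strand must be '+' or '-'")


# Made-up transcript coordinates
print(tss_one_based(99, 200, "+"))  # -> 100
print(tss_one_based(99, 200, "-"))  # -> 200
```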
Acknowledgments
The research was funded by the Danish Council for Strategic Research, the Lundbeck Foundation and the Novo Nordisk Foundation. Morten Lindow and Susanna Obad, Santaris Pharma, provided mouse liver samples and RIKEN/Piero Carninci provided the updated CAGE protocol as well as advice ahead of publication.
References
1. Takahashi H, Kato S, Murata M et al (2012) CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. In: Deplancke B, Gheldof N (eds) Gene regulatory networks, vol 786. Humana, Totowa, NJ, pp 181–200
2. Motorin Y, Muller S, Behm-Ansmant I et al (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol 425:21–53. doi:10.1016/s0076-6879(07)25002-5
3. Mortimer SA, Weeks KM (2009) Time-resolved RNA SHAPE chemistry: quantitative RNA structure analysis in one-second snapshots and at single-nucleotide resolution. Nat Protoc 4(10):1413–1421. doi:10.1038/nprot.2009.126
4. Konig J, Zarnack K, Rot G et al (2010) iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17(7):909–915. doi:10.1038/nsmb.1838
5. Shibata Y, Carninci P, Watahiki A et al (2001) Cloning full-length, cap-trapper-selected cDNAs by using the single-strand linker ligation method. Biotechniques 30(6):1250–1254
6. Li TW, Weeks KM (2006) Structure-independent and quantitative ligation of single-stranded DNA. Anal Biochem 349(2):242–246. doi:10.1016/j.ab.2005.11.002
7. Hirzmann J, Luo D, Hahnen J et al (1993) Determination of messenger RNA 5’-ends by reverse transcription of the cap structure. Nucleic Acids Res 21(15):3597–3598
8. Zhu YY, Machleder EM, Chenchik A et al (2001) Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques 30(4):892–897
9. Carninci P, Kasukawa T, Katayama S et al (2005) The transcriptional landscape of the mammalian genome. Science 309(5740):1559–1563. doi:10.1126/science.1112014
10. Shiraki T, Kondo S, Katayama S et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781. doi:10.1073/pnas.2136655100
11. Weeks KM, Mauger DM (2011) Exploring RNA structural codes with SHAPE chemistry. Acc Chem Res 44(12):1280–1291. doi:10.1021/ar200051h
12. Lucks JB, Mortimer SA, Trapnell C et al (2011) Multiplexed RNA structure characterization with selective 2’-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci U S A 108(27):11063–11068. doi:10.1073/pnas.1106501108
13. Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455. doi:10.1101/Gr.4086505
14. Goecks J, Nekrutenko A, Taylor J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86. doi:10.1186/Gb-2010-11-8-R86
15. Blankenberg D, Gordon A, Von Kuster G et al (2010) Manipulation of FASTQ data with Galaxy. Bioinformatics 26(14):1783–1785. doi:10.1093/bioinformatics/btq281
16. Blankenberg D, Von Kuster G, Coraor N et al (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10.11–21
17. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. doi:10.1186/Gb-2009-10-3-R25
18. Hannon-Lab, Gordon A (2010) FASTX-toolkit: FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/
19. R Foundation for Statistical Computing (2012) R: A language and environment for statistical computing, 2151st edn. R Foundation for Statistical Computing, Vienna, Austria
20. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
21. Aird D, Ross MG, Chen WS et al (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12(2):R18. doi:10.1186/gb-2011-12-2-r18
11.2 Paper 2: Massive parallel sequencing based hydroxyl radical probing of RNA accessibility
This is a pre‐copy‐editing, author‐produced print of an article accepted for publication in Nucleic Acids
Research following peer review. The definitive publisher‐authenticated version will be available online
Massive parallel sequencing based hydroxyl radical probing of RNA accessibility
Lukasz Jan Kielpinski1, Jeppe Vinther1,*,
1Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark * To whom correspondence should be addressed. Tel: +4535321264; Fax: +4535322128; Email: [email protected]
ABSTRACT
Hydroxyl Radical Footprinting (HRF) is a tried-and-tested method for analysis of the tertiary structure
of RNA and for identification of protein footprints on RNA. The hydroxyl radical reaction breaks
accessible parts of the RNA backbone, thereby allowing ribose accessibility to be determined by
detection of reverse transcriptase termination sites. Current methods for HRF rely on reverse
transcription from a single primer and detection of fluorescent fragments by capillary electrophoresis.
Here, we describe an accurate and efficient massive parallel sequencing based method for probing
RNA accessibility with hydroxyl radicals, called HRF-Seq. Using random priming and a novel
barcoding scheme, we show that HRF-Seq dramatically increases the throughput of HRF experiments
and facilitates the parallel analysis of multiple RNAs or experimental conditions. Moreover, we
demonstrate that HRF-Seq data for the Escherichia coli 16S rRNA correlate well with the ribose
accessible surface area as determined by X-ray crystallography and have a resolution that readily
allows the difference in accessibility caused by exposure of one side of RNA helices to be observed.
INTRODUCTION
It is becoming clear that many RNA molecules from living cells and viruses have functions that do not
depend on being translated, but rather on adopting intricate structures and binding to proteins (1,2).
This is true for well characterized non-coding RNAs such as ribosomal, transfer, small nucleolar RNAs
and viral RNA genomes, but also for more recently discovered non-coding RNA families, such as long
non-coding RNAs and microRNAs. For many of the novel non-coding RNAs that have been
discovered during the past decade, the function remains unknown and even for some of those that
have been functionally characterized, details of the mechanism of action are lacking. In many cases,
knowledge of the tertiary structure of these RNA molecules will be necessary to identify and
understand their functions. Thus, there is a clear need for structure-probing methods that can deal
with the increasing number of known RNA molecules in cells. Computational methods for prediction of
tertiary RNA structure are improving (3), but they still demand large computational resources, cannot
be used with long RNAs and have large root mean square deviations from the experimental structures
(4). Moreover, experimental methods, such as X-ray crystallography and NMR, are especially
challenging for long or flexible RNA molecules (4).
As an attractive alternative, the RNA backbone solvent accessibility can be mapped by hydroxyl
radical footprinting (HRF) (5-7). The hydroxyl radical reacts with hydrogen atoms on the ribose C4’
and C5’ positions in parts of an RNA molecule exposed to the solvent, leading to RNA cleavage (8).
The cleavage pattern can be visualized by electrophoresis of cDNA fragments produced by reverse
transcription (6). Hydroxyl radicals can be conveniently produced in solution through the Fenton
reaction between Fe(II)–EDTA and hydrogen peroxide (5) or inside cells using a synchrotron X-ray
beam (9). HRF can therefore be applied to many different experimental conditions and allows
changes in the tertiary structure or accessibility of the RNA to be determined by comparison of the
abundance of fragments produced during reverse transcription. This type of comparison is relatively
insensitive to the background produced by non-specific termination of reverse transcriptase and has
successfully been used to identify the changes occurring during the folding of the RNA (10) and the
binding of ligands to riboswitches (11) or to map protein binding sites on RNA (also called footprinting)
(9,12). Alternatively, HRF data for RNA molecules can be compared to a non-hydroxyl radical treated
control to normalize for background termination of reverse transcription and in this way produce a
direct measure of the accessibility of the analyzed RNA molecule (6). Recently, it was demonstrated
that such normalized HRF data anti-correlates with the number of through-space ribose neighbors,
which is a measure that can be used to bias discrete molecular dynamics simulations of RNA tertiary
structure prediction. Importantly, addition of the experimental data led to significant improvements in
the accuracy of the predicted structures (13).
Historically, HRF data have been obtained with radioactive labelling of the reverse transcription primer,
gel electrophoresis and phosphor imaging, but the current use of fluorescently labelled primers,
capillary electrophoresis and automated data analysis have significantly improved the throughput of
HRF experiments (14,15). Nevertheless, the capillary methods still deal with a single RNA at a time
and typically provide data for only 300–400 nucleotides in a single experiment. Thus, the throughput of
HRF could be dramatically improved if its readout could be adapted to using modern massive parallel
sequencing technology. This has recently been shown to be possible for SHAPE probing of RNA
secondary structure allowing hundreds of in vitro transcribed RNA molecules to be analyzed in
parallel using a single primer (16). Here, we use massive parallel sequencing together with random
priming of reverse transcription and a novel barcoding and normalization scheme to dramatically
improve the throughput of HRF experiments. The method allows the probing of purified RNAs and
facilitates the parallel analysis of multiple RNAs or experimental conditions. Importantly, we
demonstrate that HRF-Seq data correlate well with the ribose accessible surface area as determined
by X-ray crystallography. The data have a resolution that readily allows the difference in accessibility
caused by exposure of one side of RNA helices to be observed, suggesting that HRF-Seq can be
applied in many different settings to gain insight into the functional relevance of tertiary RNA
structures.
MATERIAL AND METHODS
Ribosome preparation
Ribosomes were purified from the E. coli MRE600 strain (gift of Birte Vester, University of Southern
Denmark) as previously described (17). Briefly, bacteria were grown in LB medium until OD600 was
approximately 0.7, transferred to +4°C for 15 min to slowly cool down, pelleted and stored frozen.
1.25 g of the pellet was resuspended in 3.125 ml buffer A (20 mM Tris-HCl pH 7 at 22°C, 10.5 mM
MgOAc, 100 mM NH4Cl, 0.5 mM EDTA and 3 mM 2-mercaptoethanol) and lysed twice with a French
press at 1000 psi. 125 µl DNase I (Fermentas) was added to 2.5 ml of lysate followed by 20 min
incubation on ice. The DNase treated lysate was centrifuged at 30000 g for 45 min and 1 ml of
supernatant was transferred onto 1 ml of 1.1 M sucrose made in buffer B (as buffer A, but with 0.5 M
NH4Cl) and centrifuged for 15 hours at 100000 g at 4°C. The pellet was washed with buffer A and
resuspended in 5 ml of buffer C (10 mM Tris-HCl pH 7, 10.5 mM MgOAc, 500 mM NH4Cl, 0.5 mM
EDTA and 7 mM 2-mercaptoethanol) followed by 16 hours centrifugation at 100000 g at 4°C. The
pellet was washed and dissolved in buffer EH (10 mM HEPES-Na pH 7.2, 10 mM MgOAc, 60 mM
NH4Cl, 3 mM 2-mercaptoethanol). Ribosomes were precipitated by addition of 81.25 µl ethanol to 125
µl ribosomes followed by incubation for 30 minutes at -80°C and centrifugation at 16000 g for 15 min.
The supernatant was removed and the pellet was dissolved in buffer EH lacking 2-mercaptoethanol.
Just before probing, ribosomes were diluted to 10 ng/µl (NanoDrop) and incubated 5 minutes at 37°C.
RNase P specificity domain preparation
A plasmid containing the sequence of the RNase P specificity domain with a structure cassette as
previously described (16) was ordered as a gene synthesis from Eurofins MWG Operon. The plasmid
was linearized with BsaI-HF™ restriction enzyme (New England Biolabs) and used as a template for
an in vitro transcription reaction with T7 RNA polymerase, 0.7 mM rNTP, 6 mM MgCl2, 1 mM
spermidine, 5 mM DTT and 40 mM Tris-HCl pH 8. The reaction was incubated for 90 minutes at 37°C,
ethanol precipitated, centrifuged and resolved on a 5% polyacrylamide, 7M Urea, 1x TBE gel. The
RNA product was located with UV shadowing and the band was cut out and eluted from the gel
overnight in a buffer containing 250 mM NaAc and 1 mM EDTA in the presence of half of the volume
of phenol. The water phase was chloroform extracted and ethanol precipitated, followed by
centrifugation and resuspension in water. RNA was folded before probing as previously described (18)
with modifications. Briefly, 5.5 ng/µl RNA in 140 mM KCl and 20 mM Tris-HCl was incubated for 1
minute at 90°C and transferred to 37°C. After 15 minutes MgCl2 was added to the final concentration
of 2.5 mM (KCl and Tris-HCl concentrations kept constant) and the mixture was incubated for 5
minutes at 37°C.
Hydroxyl radical probing
Probing was performed according to the peroxidative Fenton chemistry protocol as previously
described (19). Briefly, three droplets, 2 µl each, with 5 mM ferrous ammonium sulfate-EDTA, 50 mM
sodium ascorbate and 1.5 % H2O2 were placed on the inside walls of a tube containing 100 µl of
prepared substrates (ribosomes or RNase P). The tubes were vigorously vortexed to mix the reagents
and after 60 seconds reactions were stopped by adding 318 µl ice-cold ethanol and 10 µg of glycogen.
The samples were incubated at -80°C for 30 min, centrifuged and resuspended in 12.5 µl H2O. Control
reactions were performed in parallel, but with addition of 6 µl H2O instead of the three aforementioned
droplets.
Sequencing library preparation
Sequencing libraries were prepared as previously described (20) with modifications. The sequences
of the primers used in this study are listed in Supplementary Table 1. Briefly, 1 µl of primer (10 µM of
RT_random_primer for ribosomes, 1.7 µM RT_structure_cassette for RNase P probing) was added to
5 µl of probed RNA, followed by incubation for 5 minutes at 65°C and transfer to ice. 14 µl of a master
mix was added to each reaction to obtain final concentrations of 50 mM HEPES pH 8.3, 75 mM KCl, 3
mM MgCl2, 0.5 mM dNTP, 0.67 M sorbitol, 0.13 M trehalose and 10 U/µl of PrimeScript Reverse
Transcriptase. The ribosome probing reactions were incubated for 30 sec. at 25°C, 30 min at 42°C,
10 min at 50°C, 10 min at 56°C, 10 min at 60°C and placed on ice. The RNase P probing reactions
were reverse transcribed using the same thermal conditions as used for the ribosome reaction, but
without the incubation at 25°C. The cDNA was recovered with RNAClean XP as described (20)
(ribosomes) or ethanol precipitation (RNase P) and resuspended in 25 µl 5 mM Na-citrate pH 6. The
cDNAs were diluted 200 times in H2O and 3 µl were mixed with 7 µl of a ligation master mix (prepared
by mixing 1 volume of CircLigase™ 10x buffer, 0.5 volume of 1 mM ATP, 50 mM MnCl2, CircLigase™
enzyme, 100 µM LIGATION_ADAPTER_RB oligonucleotide and 2 volumes of 50% PEG 6000 and 5
M betaine). The ligation reaction was incubated for 2 hours at 60°C, 1 hour at 68°C and 10 minutes at
80°C and purified with Ampure XP beads as described (20) and eluted in 16 µl H2O. 1 µl of 10 µM
PCR_REVERSE_INDEX primer and 14 µl of PCR master mix (1.2 volume of 10 µM PCR_forward
primer, 4 volumes of Phusion 5x HF buffer, 1.6 volume of 2.5 mM dNTPs, 6.8 volume of H2O, 0.4
volume of Phusion polymerase) were added to 5 µl of the ligated cDNA. The reactions were incubated
using the following temperature profile: (3 min, 98°C)x1, (80 sec, 98°C; 15 sec, 64°C; 30 sec, 72°C)x4,
(80 sec, 98°C; 45 sec, 72°C)x20, (5 min, 72°C)x1, purified with Ampure XP beads as described (20).
The PCR reactions were pooled and size selected on an E-gel 2% SizeSelect gel to retain the
products in the size range 200-600 bp, which were further concentrated on a PCR purification column
(Qiagen) and finally purified on Ampure XP beads before being sequenced on an Illumina HiSeq
system with the 2X100 paired-end protocol. The raw sequencing data is available at
http://people.binf.ku.dk/jvinther/data/HRF-Seq/
Gel electrophoresis detection of RNase P hydroxyl radical probing
The RNase P RNA was prepared and probed as described above for the sequencing-based detection.
After probing, the RNA was mixed with radioactively labelled (T4 polynucleotide kinase and ATP γ-32P)
RT_structure_cassette oligonucleotide, incubated at 65°C for 5 minutes and placed on ice. 4.5 µl of
the reverse transcription master mix (2 volumes of PrimeScript 5x buffer and of H2O, 0.5 volumes of
10 mM dNTP) was added to 5 µl of the RNA-primer mix. The sample was transferred to 42°C and
after 5 minutes of incubation, 0.5 µl PrimeScript enzyme was added and incubation was continued for
30 minutes, followed by ethanol precipitation with glycogen as carrier. A sequencing ladder sample
was prepared in parallel with untreated RNase P by adding 1 µl 5 mM ddATP to the reaction. The
samples were dissolved in formamide loading dye (92.5% formamide, 5 mM EDTA, 0.025%
bromophenol blue, 0.025% xylene cyanol), denatured (2 min, 90°C) and resolved on a 40 cm long, 8%
polyacrylamide, 7M Urea, 1x TBE gel at 45 W. After electrophoresis the gel was transferred onto
Whatman paper, dried, exposed to an imaging plate and scanned (Cyclone Storage Phosphor, Packard).
Pre-processing of sequencing reads
The Cutadapt utility (21) was used to remove contaminating adapter sequences (“-a
AGATCGGAAGAGCACACGTCT” for the first and “-a
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT” for the second read in pair) and to filter out low
quality ends (“-q 17”). Using an awk script, the 7 nucleotide barcode was removed from the beginning
of the first read and saved in a separate file, and the last 7 nucleotides from the end of the second read
were removed. Finally, pairs containing a read shorter than 15 nucleotides after trimming were filtered
out.
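The barcode-splitting and length-filtering step can be sketched as follows. This is a minimal illustration in Python rather than the awk script used in the study; the function name `split_barcode` and the in-memory read representation are hypothetical, and adapter and quality trimming with Cutadapt are assumed to have been done beforehand.

```python
from typing import Optional, Tuple

BARCODE_LEN = 7   # random barcode at the start of read 1 (from the text)
MIN_LEN = 15      # pairs with a shorter read after trimming are discarded

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def split_barcode(read1: Read, read2: Read) -> Optional[Tuple[str, Read, Read]]:
    """Return (barcode, trimmed read 1, trimmed read 2), or None if the pair is too short."""
    name1, seq1, qual1 = read1
    name2, seq2, qual2 = read2
    # The first 7 nt of read 1 are the barcode; keep them separately.
    barcode = seq1[:BARCODE_LEN]
    seq1, qual1 = seq1[BARCODE_LEN:], qual1[BARCODE_LEN:]
    # The last 7 nt of read 2 are removed, as described in the text.
    seq2, qual2 = seq2[:-BARCODE_LEN], qual2[:-BARCODE_LEN]
    if len(seq1) < MIN_LEN or len(seq2) < MIN_LEN:
        return None
    return barcode, (name1, seq1, qual1), (name2, seq2, qual2)
```

In a real pipeline the two FASTQ files would be read in lockstep and the barcodes written to a separate file keyed by read name.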
Assembly of E. coli MRE600 16S rRNA sequence
The pre-processed sequence pairs were used as input for Trinity (22) to assemble the strain specific
16S rRNA sequence. Comparison of the assembly to the sequence of chain A in the 3OFA PDB structure
identified 5 mutations (r.80a>c, r.89u>g, r.93u>c, r.183c>u and r.1498u>g).
Mapping read pairs to the strain-specific 16S rRNA sequence or the RNase P specificity domain sequence
The sequence pairs were mapped to the assembly-corrected 16S rRNA sequence or to the RNase P
specificity domain sequence using the Bowtie 2 program (23) with options “-N 1 -L 15 --norc -X 700”.
Untemplated nucleotides, putatively added via terminal transferase activity of reverse transcriptase,
were trimmed as described previously (20). For the analysis of 16S rRNA, pairs that spanned less
than 100 nt were discarded to reduce effects of size selection.
Estimated Unique Counts (EUC)
We defined a fragment as a pair of sites, 1) the termination site, which is the last reverse transcribed
RNA nucleotide and 2) the priming site, which is the first sequenced nucleotide of the second read.
The relationship between the EUC (‘n’) and the number of observed unique barcodes (‘k’) was calculated
using formula 1, which is an extension of a previously used method (24), but allowing different
barcodes to be ligated with different probabilities (‘Pi’). We calculated the frequency of the different
nucleotides at each position of the barcode using the observed set of barcodes from mapped
fragments having a read count within the three lowest quartiles of all fragments in the given dataset
(Supplementary Table 2). To estimate the Pi for each barcode in each performed ligation reaction, we
assumed that positions in the barcode are independent and multiplied the per-position nucleotide
frequencies for every possible barcode sequence. Finally, for each experiment we summed over all
possible barcodes (‘m’) and calculated the table of k(n) relationships, which was inverted to an n(k)
table, rounded to the nearest integer, and used to read out the EUC (‘n’) for the observed ‘k’ of each fragment.
Formula 1:
$$k = \sum_{i=1}^{m} \left(1 - (1 - P_i)^{n}\right)$$
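The EUC look-up can be sketched in Python as below. The sketch assumes that formula 1 has the occupancy form k(n) = sum over i of (1 - (1 - P_i)^n); the function names and the tie-breaking rule used to invert the table (the n whose expected k lies closest to the observed integer k) are illustrative assumptions, not the published implementation.

```python
from itertools import product
from math import prod


def barcode_probabilities(pos_freqs):
    """pos_freqs: one dict per barcode position mapping nucleotide -> frequency.
    Returns P_i for every possible barcode, assuming independent positions."""
    return [prod(pos_freqs[j][nt] for j, nt in enumerate(combo))
            for combo in product("ACGT", repeat=len(pos_freqs))]


def expected_unique(n, probs):
    """Expected number of distinct barcodes k after n ligations (assumed form of formula 1)."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def euc_table(probs, max_n):
    """Invert the k(n) table into an n(k) look-up for integer k.
    Assumption: for each k, take the n whose expected k(n) is closest to k."""
    k_of_n = [expected_unique(n, probs) for n in range(max_n + 1)]  # index = n
    table = {}
    for k in range(1, int(round(k_of_n[-1])) + 1):
        table[k] = min(range(1, max_n + 1), key=lambda n: abs(k_of_n[n] - k))
    return table
```

For a 7 nt barcode there are 4^7 = 16384 values of P_i, so in practice the k(n) table would be computed once per experiment and reused for all fragments.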
RNase P hydroxyl radical probing gel quantification and correlation with sequencing.
The scanned gel image was quantified with ImageJ (25). The signals corresponding to nucleotides
117-221 in the RNase P RNA were manually assigned to the sequence by comparison to a ddATP
sequencing reaction run in parallel. For each band the maximal value was extracted, followed by
subtraction of the average signal intensity in the whole +/- 6 nt region to correct for unequal
background intensity over the gel length. To allow optimal comparison between sequencing EUC and
gel intensities, the sequencing data was not trimmed for untemplated additions to the 3’ end of the
cDNA by reverse transcriptase, because we expect these shifts in signal to be present in the gel
resolved fragments. For the plot in supplementary figure 2, we have used positions 117 to 186, which
were chosen due to band compression in the region before and the effect of size selection of the
sequencing library in the region after.
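The local background correction of the gel signal can be illustrated with a short Python sketch. It assumes that the background is the mean of the band maxima within the ±6 nt window (the text does not specify whether band maxima or the raw lane profile were averaged), and the function name is hypothetical.

```python
def background_corrected(peaks, flank=6):
    """peaks: per-band maximal intensities in sequence order.
    Subtracts from each band the mean intensity of the bands within +/- flank positions."""
    corrected = []
    for i, value in enumerate(peaks):
        lo = max(0, i - flank)
        hi = min(len(peaks), i + flank + 1)
        window = peaks[lo:hi]  # truncated at the ends of the quantified region
        corrected.append(value - sum(window) / len(window))
    return corrected
```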
Number of through-space contacts in RNase P specificity domain calculation
To calculate the number of through-space ribose contacts, we have used chain B of the 1NBS PDB
structure (26) with the positions 121-124 structurally aligned from chain A of the same structure. Atom
locations were obtained from the PDB file and used to calculate ribose positions, defined as the mean
of the C1’, C2’, C3’, C4’ and O4’ positions. Next, we used the ribose bead locations to calculate the
number of ribose positions (excluding the neighbouring riboses) within distance of 14 Å from a given
ribose position.
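The contact count can be sketched in Python directly from PDB-format coordinate lines. This is an illustrative sketch under stated assumptions: "neighbouring riboses" is taken to mean the residues directly adjacent in sequence, the 14 Å cut-off is treated as inclusive, older files that write ribose atoms with `*` instead of `'` are normalized, and the function names are hypothetical. Grafting positions 121-124 from chain A is not shown.

```python
from math import dist

RIBOSE_ATOMS = {"C1'", "C2'", "C3'", "C4'", "O4'"}


def ribose_centers(pdb_lines, chain):
    """Mean position of the five ribose ring atoms for each residue of one chain."""
    atoms = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        name = line[12:16].strip().replace("*", "'")  # fixed-column PDB atom name
        if line[21] != chain or name not in RIBOSE_ATOMS:
            continue
        resseq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.setdefault(resseq, []).append(xyz)
    # Residues with missing ring atoms are averaged over the atoms present.
    return {r: tuple(sum(c) / len(c) for c in zip(*pts)) for r, pts in atoms.items()}


def contact_counts(centers, cutoff=14.0):
    """Number of non-adjacent ribose centers within `cutoff` angstroms of each ribose."""
    return {
        r: sum(1 for s, d in centers.items()
               if abs(s - r) > 1 and dist(c, d) <= cutoff)
        for r, c in centers.items()
    }
```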
Solvent accessible surface area calculation
Solvent accessible surface area was calculated using the PyMOL get_area function with settings
dot_solvent=1, dot_density=3. For the RNase P specificity domain, chain B of 1NBS structure (chain
A for positions 120-125) (26) and a solvent radius of 1.4 Å was used, whereas chain A of 3OFA
structure in complex with 3OFC (27) and a solvent radius of 3 Å was used for 16S rRNA
(supplementary figure 1).
Running average of ∆TCR calculation
Termination count at a given position was calculated as the sum of the EUCs of fragments terminating
at the position. Effective coverage at a given position was calculated as the sum of the EUCs of the
fragments terminating at or spanning the position. In addition, for the ribosome analysis, fragments
were only used for calculation of effective coverage for a given position if the distance between the
position and the priming position was at least 100 nt. For RNase P the coverage was calculated using
all fragments, but only positions 87-186 were used for the subsequent analysis. A coverage cut-off
was set to the coverage that would provide a 90 % probability that a termination count was observed
given the average cleavage probability (median ∆TCR). The Termination-Coverage ratio (TCR) of a
given position was calculated by dividing termination EUC by the effective coverage EUC. ∆TCR was
calculated according to formula 2. As a last step ∆TCR was smoothed with a moving average over a
window of 3 nucleotides and offset by 1 position upstream to reflect the fact that reverse transcription
terminates before the cleaved position.
Formula 2:
$$\Delta TCR = \max\left(\frac{TCR_{treated} - TCR_{control}}{1 - TCR_{control}},\ 0\right)$$
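The normalization steps can be sketched in Python as follows. Several points are assumptions rather than confirmed details of the original analysis: formula 2 is read as max((TCR_treated - TCR_control) / (1 - TCR_control), 0), "upstream" is read as towards the 5' end (lower position numbers), the coverage cut-off is read as the smallest coverage c with 1 - (1 - p)^c >= 0.9, and all function names are hypothetical. Positions are 1-based, so index 0 of each list is unused.

```python
from math import ceil, log


def termination_and_coverage(fragments, length, min_dist=0):
    """fragments: iterable of (termination_pos, priming_pos, euc) with termination <= priming.
    A fragment counts towards coverage at a position only if the priming site is at
    least min_dist nt away (100 for the ribosome data, 0 for RNase P)."""
    term = [0.0] * (length + 1)
    cov = [0.0] * (length + 1)
    for t, p, euc in fragments:
        term[t] += euc
        for pos in range(t, p - min_dist + 1):
            cov[pos] += euc
    return term, cov


def tcr(term, cov):
    """Termination-Coverage ratio; None where there is no coverage."""
    return [t / c if c > 0 else None for t, c in zip(term, cov)]


def delta_tcr(tcr_treated, tcr_control):
    """Assumed reading of formula 2, clipped at zero."""
    out = []
    for a, b in zip(tcr_treated, tcr_control):
        if a is None or b is None or b >= 1.0:
            out.append(None)
        else:
            out.append(max((a - b) / (1.0 - b), 0.0))
    return out


def smooth_and_offset(values, window=3, offset=1):
    """out[p] is the centred moving average around p + offset, i.e. the smoothed
    signal shifted by `offset` positions towards the 5' end."""
    half = window // 2
    out = [None] * len(values)
    for p in range(len(values)):
        start = p + offset - half
        win = values[start:start + window] if start >= 0 else []
        if len(win) == window and all(v is not None for v in win):
            out[p] = sum(win) / window
    return out


def coverage_cutoff(p_cleave, prob=0.9):
    """Smallest coverage giving at least `prob` chance of one or more terminations."""
    return ceil(log(1.0 - prob) / log(1.0 - p_cleave))
```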
RESULTS
Reducing the biases in massive parallel sequencing based readout of HRF
As in classic HRF, our massive parallel sequencing strategy (HRF-Seq) is based on the detection of
reverse transcription termination sites, but instead of analyzing the sample on a gel or a capillary, we
ligate an adaptor to the 3’ end of the cDNA and PCR amplify using primers containing adaptor and
index sequences allowing massive parallel sequencing of many different conditions in a single lane on
the Illumina platform (16,20) (Figure 1). After paired end sequencing, the resulting reads can be
mapped to the investigated RNA to give the precise coordinates of the priming and probing event.
Compared with capillary analysis, the great advantage of using sequencing is increased throughput,
but sequencing methods also introduce additional experimental biases during ligation, PCR
amplification and sequencing steps (28). To reduce these biases, we introduced a 7 nucleotide
random barcode sequence in the 5’ end of the adaptor used for ligation. The barcode serves two
purposes. First, it has been shown that using an adaptor pool significantly reduces ligation bias in
small RNA cloning experiments using T4 RNA ligases (29) and we expect that the same is true for the
TS2126 RNA ligase (CircLigase™) used in this study. Second, the barcode serves as a label that is
added to each fragment before introduction of PCR and sequencing biases. At low coverage the
number of unique barcodes can be used directly to give the count for the specific fragment before the
PCR. At high coverage, it becomes more likely that the barcodes of the same sequence are ligated to
the same fragment multiple times (become saturated). Saturation occurs when the fragment count
exceeds the square root of the number of barcodes and will affect the accuracy of quantification (30).
By assuming that all the barcodes have equal probability of being attached to a given fragment, it is
possible to correct for saturation and calculate an Estimated Unique Count (EUC) (24). In our
experiments, the ligation adaptor is prepared by standard oligonucleotide synthesis as a pool of
oligonucleotides having 7 degenerate positions at the 5’ end. During our analysis, we realized that the
individual barcodes are present at very different frequencies in the barcode pool (Figure 2A), meaning
that the observed distribution of barcodes is modelled very poorly when equal barcode frequencies in
the barcode pool are assumed (Figure 2B). We therefore devised a novel strategy for estimating
individual fragment counts based on the method previously implemented by Fu et al. (24), but taking
into account that barcodes are present at different frequencies in the adaptor pool. In our strategy, the
underlying barcode frequencies in the adaptor pool are estimated by determining the nucleotide
frequencies observed at the seven different positions in the barcode after excluding fragments with
counts in the top quartile to avoid bias from clonal amplification of specific fragments. These
nucleotide frequencies are stable across our different experiments (Supplementary Table 2),
suggesting that they are accurate. Assuming independence among the positions in the barcode, we
then estimate the barcode frequencies by multiplication of the nucleotide frequencies. In simulation,
the estimated underlying barcode frequencies produce an observed distribution of barcodes that is
similar to the actual observed distribution, although the observed data still have a more extreme
distribution, probably because of the presence of PCR duplicates (Figure 2B). We applied this
normalization strategy to calculate EUC for HRF of a short in vitro transcribed RNA (specificity domain
from the Bacillus subtilis RNase P RNA) and for HRF of a long RNA purified from cells (Escherichia
coli 16S ribosomal RNA), both probed with hydroxyl radicals. For the RNase P specificity domain RNA,
we obtained high coverage resulting in saturation of barcodes. This is corrected using our strategy,
but not using simple barcode counting or by assuming equal barcode frequencies (Figure 2C). The
saturation of barcodes was not observed with the 16S rRNA, because of much lower coverage
(Figure 2D). By comparing the observed fragment counts with the EUC and stratifying by fragment
length, it is clear that for the RNase P RNA, most positions have no length dependent bias (counts
equal EUC) (Figure 2E). This is most likely because there is relatively little length difference between
the different fragments in the PCR. For some of the RNase P positions (the longest fragments), we
observe a bias, which is related to some of the barcodes containing deletions, leading to assignment
of RNase P sequence as part of the barcode and subsequent reduction in the barcode complexity
and underestimation of the EUC. This phenomenon will have a small, but significant effect on the
quality of our data and can be avoided in the future by extending the barcode and giving it a specific
signature that will allow true barcodes to be distinguished (30). For the 16S rRNA dataset, we
observe a striking overrepresentation of short fragments, which is most likely caused by PCR
amplification and sequencing biases (Figure 2F) and our barcode normalization strategy efficiently
corrects for this bias. For both the 16S rRNA and the RNase P RNA, the EUC calculated using
unequal barcode frequencies performs at least as well as the other normalization strategies when
comparing with accessibility data obtained from the crystal structures (Supplementary Table 3). The
superior performance of our method in determining the RNase P accessibility stems mainly from
saturation of barcodes for the fragments that reach the RNA fragment terminus, leading to
underestimation of signal in the other type of barcode normalization. In contrast, the 16S rRNA
coverage is lower, so that a simple count of unique barcodes allows the data to be normalized for
fragment length bias of PCR. Thus, our barcoding strategy corrects for fragment length bias and for
the barcode saturation that can occur at high coverage, allowing the strategy to be used regardless of
the level of coverage.
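To make the saturation argument concrete, the following Python sketch works through the numbers for a 7 nt barcode under the equal-frequency assumption. The expectation used here is the standard occupancy result for n draws from m equally likely labels; that reference (24) uses exactly this expression is an assumption, and the function names are hypothetical.

```python
from math import isqrt, log

BARCODE_LEN = 7                        # degenerate positions in the ligation adaptor
N_BARCODES = 4 ** BARCODE_LEN          # 16384 possible barcode sequences
SATURATION_ONSET = isqrt(N_BARCODES)   # 128, the square-root rule of thumb from the text


def expected_unique_equal(n, m=N_BARCODES):
    """Expected number of distinct barcodes after n ligations, all barcodes equally likely."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def euc_equal(k, m=N_BARCODES):
    """Invert the relation above: fragment count n that yields k unique barcodes on average."""
    if k >= m:
        raise ValueError("all barcodes observed; the count cannot be estimated")
    return log(1.0 - k / m) / log(1.0 - 1.0 / m)
```

Under these assumptions, a fragment present in 128 copies is expected to show slightly fewer than 128 distinct barcodes, and the gap widens quickly at higher counts. Unequal barcode frequencies make collisions more likely still, which is the effect the strategy described above corrects for.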
Figure 1. Major experimental steps of the HRF‐Seq method. Following hydroxyl radical probing, primers containing a 5’ Illumina adaptor overhang are extended by reverse transcriptase to positions of radical induced breaks. Adapters containing a 7 nt barcode are ligated to the 3’ ends of cDNAs, followed by PCR amplification with primers containing Illumina compatible adaptor and index sequences. After size selection, the library is sequenced with the Illumina paired‐end protocol to provide information on the positions of probing and priming.
Figure 2. Using barcodes to estimate unique counts. A) Observed barcode frequencies. Histogram showing the distribution of observed barcode frequencies in the hydroxyl radical treated RNase P experiment. The broken vertical line indicates the barcode frequency if all barcodes were present at equal frequencies. B) Estimation of barcode counts. The plot compares the observed barcode counts with simulated barcode counts as estimated by assuming equal barcode frequencies or the unequal barcode frequencies as estimated by our strategy. Data is from the hydroxyl radical treated RNase P experiment. C) Relationship between the number of observed unique barcodes and EUC for different types of barcode normalization strategies for the hydroxyl radical treated RNase P experiment. The vertical line shows the highest count observed in the experiment. D) Relationship between the number of observed unique barcodes and EUC for different types of barcode normalization strategies for the hydroxyl radical treated 16S rRNA. The vertical line shows the highest count observed in the experiment. E) Length dependent bias of fragments in the probing of the RNase P specificity domain RNA. F) Length dependent bias of fragments in the probing of the 16S rRNA.
HRF-Seq analysis of in vitro transcribed RNase P RNA
To validate our sequencing based output of HRF, we first compared the EUCs obtained for the
specificity domain of B. subtilis RNase P RNA with the output obtained with classical gel based HRF
using identical conditions and the same primer for reverse transcription. The footprinting signals from
the two methods are strongly correlated (R = 0.80), showing that the HRF-Seq EUC captures the
same signal as classical hydroxyl radical footprinting (Supplementary Figure 2). The HRF signal
(Figure 3A) contains both background signal caused by spontaneous termination of the reverse
transcriptase and a signal decay resulting from termination of reverse transcriptase before the probed
position. To normalize for the background, we implemented a slightly modified version of the
QuShape normalization method recently described by Weeks and colleagues for analysis of SHAPE
data (15). In line with the QuShape method, we estimate the coverage across the RNA by summing
the EUC for the fragments that reach or pass a given position (Figure 3B). The observed coverage is
a measure of number of reverse transcriptases reaching a given position. This can be used to
normalize the termination EUC to give a Termination-Coverage ratio (TCR), which is the fraction of
reverse transcriptases that will terminate at a given position. The TCR of the treated sample is
composed of probing signal and background signal, whereas the control samples’ TCR is composed
of background signal only. Comparing the sum of TCR for the control and treated experiments after
excluding the 5’ run off indicates that the treated RNase P sample contains 47 % background signal.
Assuming that background causes the same fraction of reverse transcriptases to terminate at a given
position in the control and treated sample, the probing signal can be normalized for spontaneous
termination of the reverse transcriptase by subtraction of the control sample TCR from the treated
sample TCR to give a normalized accessibility measure ∆TCR (see methods section for full
description). This is slightly different from the QuShape procedure, which assumes that the
background signal in the probed sample is a scaling of the signal observed in the control sample. The
median ∆TCR is a measure of the average hydroxyl radical induced cleavage probability and for
RNase P probing it is 0.0033 (Supplementary Figure 3A and 3B) corresponding to 1 hydroxyl radical
induced cleavage per 300 nt and approximately 34 % probability of observing a single hit on the RNA.
HRF data is known to have high background signal and in some cases, barcode assignment and
terminal transferase activity of reverse transcriptase can cause the signal to shift by one or two
nucleotides. In order to reduce the overall experimental noise, we therefore take advantage of the
accessibility of neighboring positions being highly correlated and calculate the moving average of
∆TCR in a 3 nucleotide window (Figure 3C). Comparing the moving average of ∆TCR with the moving
average of ribose accessibility calculated from the solved crystal structure for the RNase P specificity
domain RNA, we find a significant correlation (R = 0.55) (Figure 3D). This correlation is slightly higher
than previously observed for this RNA using traditional HRF based on capillary analysis (13).
Moreover, we also find that the moving average of ∆TCR anti-correlates with through-space ribose
neighbors (R = -0.57) as calculated from the RNase P crystal structure (Figure 3E), suggesting that
HRF-Seq data can be used to inform discrete molecular dynamics simulations of RNA tertiary
structure prediction (13). In the comparison with the crystal structure accessibility, we observe 4
positions (positions 99-102) that are clear outliers in our probing data, giving too high ∆TCR signal.
This region is a loop (Figure 3F) and the discrepancy between our data and the data from the crystal
structure probably reflects that this loop is more flexible and has a higher accessibility in solution.
Figure 3. HRF‐Seq analysis of RNase P RNA specificity domain. A) Termination signal for HRF treated sample calculated as the sum of EUC for fragments terminating at a given position. B) Coverage for HRF treated sample. C) Normalized HRF‐Seq signal calculated as the 3 nucleotide moving average of the termination coverage ratio for the HRF treated sample with the termination coverage ratio for the control sample subtracted. D) Correlation between the normalized HRF‐Seq signal and a three nucleotide moving average of ribose accessibility from the published crystal structure (26) using a 1.4 Å probe. E) Correlation between the normalized HRF‐Seq signal and the number of ribose through‐space contacts from the published crystal structure (26). R values are calculated using the Pearson correlation. F) Normalized HRF‐Seq signal displayed on the crystal structure of the RNase P RNA specificity domain (26), gray indicates no data.
Random primed HRF-Seq analysis of purified 16S rRNA
Next, we wanted to extend HRF-Seq to the analysis of long RNA molecules isolated from the cellular
environment. To make our strategy general and applicable to the entire transcriptome, we used
random primers for reverse transcription, rather than the single primer strategy that we used for the
RNase P experiments and that was previously used for SHAPE-Seq (16). We chose the E. coli 16S
ribosomal RNA for validation of our strategy, because of the high abundance of the ribosome and the
solved crystal structure (27). Native ribosomes including ribosomal proteins were purified and used for
HRF-Seq using random priming during reverse transcription to obtain signals for the entire 16S RNA
molecule in a single experiment. We also obtained data for the 23S rRNA, but because of low stability
during purification and high prevalence of posttranscriptional modifications that terminate reverse
transcription, only parts of the 23S rRNA were covered. After mapping the reads to the 16S rRNA, we
again used the barcodes present in the ligation adaptors to calculate the EUC for each observed
fragment (Figure 4A). The fragments can be collapsed to give EUC for each termination position
(Figure 4B). Knowing the EUC and the exact probing and priming position for each fragment, we can
calculate the effective coverage at each position by taking the size selection that occurs during
preparation of the sequencing library into account. In our set-up a fragment size cut-off of 100
nucleotides ensures that the effective coverage of a position is affected only by the molecules that
potentially could have been observed at the specific position given their priming site. The data for the
hydroxyl radical treated sample and the control were obtained using 5.7 % of an Illumina HiSeq lane.
For the treated sample, 12% of 5.2 million reads mapped to 16S and provided good coverage across
the large majority of the 16S rRNA (Figure 4C). Using the termination EUC and the effective coverage,
we then calculated TCR for the hydroxyl radical treated sample and the control experiment (Figure 4D).
Comparing the sum of TCR for the control and treated experiments after excluding the 5’ run off
indicates that the treated sample in this case contains 86 % background signal. Surprisingly, we
observe a couple of positions that have very high signal in the control compared to the treated sample
(most notably positions 330, 551, 552 and 1378). As the only difference between the treated and
control sample is the radical treatment, we speculate that these signals are the result of a nuclease
activity that co-purifies with the ribosome and becomes inactivated by the radical treatment. We
subtracted the control TCR from the treated TCR to give a ∆TCR value for each position. The median
∆TCR is 0.0018, which corresponds to 1 hydroxyl radical induced cleavage per 560 nt on average
(Supplementary Figure 3C and 3D). Finally, we applied the 3 nucleotides window moving average to
∆TCR to give accessibility values for the 16S E. coli rRNA. We find that the RNA accessibility
calculated from the ribosomal crystal structure (27) as a 3 nucleotides moving average of ribose
solvent accessibility using a solvent radius of 3 Å correlates with the HRF-Seq determined ∆TCR (R =
0.56) (Figure 4E). While the agreement between the crystal structure accessibility and the HRF-Seq
data in general is quite striking, 16S rRNA positions 723 and 729 show high signal in the HRF-Seq
data, but are inaccessible in the crystal structure. In the ribosome crystal, position 723 of the 16S
rRNA is bound and hidden from solvent by ribosomal protein S21 (RPS21) and RPS21 has previously
been shown to crosslink to position 723 (31). Interestingly, RPS21 is known to have a fast off rate and
exchange rapidly in reconstitution experiments (32) and is therefore likely to have been lost during
purification, which would explain the discrepancy between our data and the crystal structure at this
position. Positions 723 and 729 are located in a loop and the high HRF-Seq signal at position 729
compared to the crystal accessibility indicates that the loop changes its conformation when RPS21 is
absent, thereby exposing position 729 to the solvent. In general, however, the footprints of ribosomal
proteins and the large ribosomal subunit on the 16S surface are readily observed in HRF-Seq data
(Figure 5). As exemplified by position 723, the resolution of the HRF-Seq accessibility signal is high.
Zooming in on H16/H17, which run parallel to the long axis of the subunit and are located on a rather
flat surface, it is clear that HRF-Seq allows the difference in accessibility caused by exposure of one
side of RNA helices to be observed (Figure 6A). In fact, even for the entire 16S molecule, we observe a
strong correlation in accessibility signal for positions separated by one or two helical turns (Figure 6B),
probably because a significant fraction of 16S rRNA is helical and exposed on the surface. As
expected for accessibility footprinting there is no significant difference in the HRF-Seq signal for base-
paired positions compared to non-base-paired positions, but interestingly the probing signal of
positions that are Watson-Crick base-paired correlates with the probing signal of positions on the
opposite strand located downstream (offset by 2 and 3 bases) from the paired position (R=0.41 and
0.43, respectively). This is in perfect agreement with what one would expect from the accessible
surface area of riboses in helical structure with one side facing the solvent.
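The offset correlations discussed above can be computed with a short Python sketch. It is a generic illustration, not the analysis code used for Figure 6B: the function names are hypothetical, positions without data are assumed to be stored as `None`, and the same function serves both for comparing a signal with itself and for comparing two different signals at a given offset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def offset_correlation(x, y, offset):
    """Correlate x[i] with y[i + offset], skipping positions without data (None)."""
    pairs = [(a, b) for a, b in zip(x, y[offset:])
             if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

Scanning `offset` over a range of values and plotting the result would reveal a helical periodicity as recurring maxima.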
Figure 4. HRF‐Seq analysis of E. coli 16S rRNA. A) Sequenced fragments (EUC) from the treated (left) and control (right) samples mapped to the 16S rRNA sequence. The left terminus of each fragment corresponds to the reverse transcription termination site and the right terminus to the priming site. B) Sum of EUC termination signal at each position for HRF treated and control sample. C) EUC based coverage for HRF treated and control sample. D) Termination‐Coverage ratio (TCR) calculated by dividing the termination signal by the coverage for the treated and control samples. E) Top graph (red) shows normalized HRF‐Seq signal calculated by subtracting TCR for the control sample from the TCR obtained from the treated sample and taking the 3 nucleotide moving average. Bottom graph (blue) shows the area of ribose accessibility calculated from the crystal structure (27) as the 3 nucleotide moving average of the accessibility to a probe with 3 Å radius. R is calculated using the Pearson correlation.
Figure 5. Surface representation of 16S rRNA accessibility and HRF‐Seq data. A) Three views of the crystal structure of the RNA part of the 16S small ribosomal subunit colored with moving average of ribose accessibility as measured from the crystal structure (27) using a 3 Å probe. P, H and S indicate the platform, head and shoulder of the ribosomal subunit as named in (34). B) Crystal structure of 16S small ribosomal subunit colored with the normalized HRF‐Seq signal, gray indicates no data.
Figure 6. Periodicity of RNA accessibility. A) Close‐up of the positions 400‐500 of the 16S rRNA colored with the normalized HRF‐Seq signal. B) Pearson correlation between HRF‐Seq signal and ribose accessibility from the crystal structure for nucleotides separated by the indicated offset.
DISCUSSION
We present a new method for HRF of RNA backbone accessibility using massive parallel sequencing
as the readout. Our study demonstrates that this method has dramatically improved throughput
compared to classical capillary based methods and produces data that agree well with RNA ribose
accessible surface areas and through-space contacts determined by X-ray crystallography.
Importantly, we show that HRF-Seq makes it possible to analyze long RNA molecules and mixtures of
RNA molecules in parallel in a single tube by using random primers. To this end, we devised new
strategies for reducing PCR and sequencing biases based on barcodes in the ligation adaptor and on
data normalization using the probing and priming position information obtained during sequencing.
Both of these strategies could be implemented for other types of sequencing based probing methods.
During the final preparation of this manuscript, Das and colleagues published a method to reduce the
bias in probing experiments based on the detection of termination of reverse transcription also by
introducing barcodes, but only for in vitro transcribed RNAs with a single primer (33). An important
advantage of using massive parallel sequencing as readout for HRF experiments is the digital nature
of the data, which makes data processing relatively easy compared to the analysis of data obtained
by gel or capillary electrophoresis. Moreover, after mapping we find that a substantial fraction of the
reads (~20 % on average) have mismatches in the 3 positions corresponding to the very 3’ end of the
cDNA produced, which is indicative of untemplated nucleotides being added to the cDNA by the
terminal transferase activity of the reverse transcriptase. This causes a shift of signal in the 5’
direction of the RNA, which cannot be corrected when using gel and capillary based methods for data
readout. In contrast, using massive parallel sequencing readout, we can perform a simple trimming of
reads with terminal mismatches to correct the probing position for approximately 75 % of cases with
untemplated nucleotides added (20).
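A possible form of such a trimming step is sketched below. The exact rule used in (20) is not given here, so the rule in the sketch (remove everything up to and including the last mismatch found within the first three positions) is an assumption, as are the function name and the convention that reads are represented in the orientation of the RNA with the termination-proximal base first.

```python
from typing import Tuple

def trim_untemplated(read: str, ref: str, start: int, window: int = 3) -> Tuple[str, int]:
    """Trim terminal mismatches and return (trimmed_read, corrected_start).

    read   -- read in RNA orientation; read[0] is the base next to the
              reverse transcription termination site (assumed convention)
    ref    -- reference sequence in the same orientation
    start  -- 0-based reference index to which read[0] was aligned
    window -- number of terminal positions inspected for mismatches
    """
    last_mismatch = -1
    for i in range(min(window, len(read))):
        if start + i < len(ref) and read[i] != ref[start + i]:
            last_mismatch = i
    n_trim = last_mismatch + 1
    # Trimming moves the inferred termination site towards the 3' end of the RNA
    return read[n_trim:], start + n_trim

# Toy example: one untemplated T precedes a fragment that truly starts at index 2
reference = "ACGTACGTAC"
trimmed, corrected = trim_untemplated("TGTACGT", reference, start=1)
```

In this toy example the read is shortened by one nucleotide and the termination position is moved one position in the 3’ direction. A read without terminal mismatches is returned unchanged, which also covers untemplated nucleotides that happen to match the reference and therefore cannot be detected.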
Hydroxyl radical footprinting is a versatile method that can be used to investigate changes in tertiary
RNA structure, identify protein footprints on RNA and guide the computational prediction of tertiary
RNA structure. Here, we compare a radical treated sample with a control sample to obtain an
accessibility signal that could be used for computational prediction of tertiary RNA structure by
calculating ∆TCR and averaging it over 3 positions. The averaging improves overall correlation
because of the high accessibility correlation with neighboring position observed in the dataset (Figure
6B), but also blurs the fine details. In other types of experiments, such as typical footprinting
experiments, where two probed conditions are compared, the objective will be to determine specific
positions that have differential accessibility in the two conditions. In such cases, it would make sense to
analyze the data by comparing the coverage and termination EUCs of the two samples with the
Fisher exact test or a test based on the negative binomial distribution. In this way the coverage and
termination count will be taken into account in the calculation of the significant differences between
the two samples. Importantly, the use of X-rays allows hydroxyl radical footprinting to be performed
inside intact cells (9) and enables kinetic studies of RNA folding (10). HRF-Seq should be
readily applicable to such types of analysis and we therefore expect that the throughput provided by
HRF-Seq will help pave the way for an increased understanding of the functional consequences of
RNA tertiary structure inside cells and the dynamics of RNA folding. In particular, HRF-Seq should
facilitate the probing of long RNA molecules, such as mRNAs, long ncRNAs and viral RNAs, for which
tertiary structure information currently is very limited.
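The per-position comparison with the Fisher exact test suggested above could look like the following sketch. It is an illustration under stated assumptions rather than an implemented part of HRF-Seq: the function names are hypothetical, the coverage EUC at a position is assumed to include the fragments terminating there (so that coverage minus termination gives the read-through count), and no correction for multiple testing is applied.

```python
from math import comb
from typing import List

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def differential_termination(
    term_a: List[int], cov_a: List[int], term_b: List[int], cov_b: List[int]
) -> List[float]:
    """Per-position p-values comparing termination versus read-through counts."""
    return [
        fisher_exact_two_sided(ta, ca - ta, tb, cb - tb)
        for ta, ca, tb, cb in zip(term_a, cov_a, term_b, cov_b)
    ]

# Toy counts for two positions in conditions A and B (not data from the study)
p_values = differential_termination([1, 5], [10, 10], [11, 5], [14, 10])
```

In the toy example the first position has clearly different termination frequencies in the two conditions, whereas the second position has identical counts and therefore a p-value of 1. For a transcriptome-scale analysis the resulting p-values would need to be adjusted for the number of positions tested.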
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
ACKNOWLEDGEMENT
We are grateful to Jan Christiansen, who helped purify E. coli ribosomes and to Anders Krogh for
advice on the calculation of estimated unique counts. We thank the Danish National DNA Sequencing
Center for performing sequencing and the system administration at Section for Computational and
RNA Biology for providing computational infrastructure.
FUNDING
This work was supported by the Danish Council for Strategic Research [Center for Computational and
Applied Transcriptomics, DSF-10-092320]. LJK is funded by a PhD stipend from the Department of
Biology, University of Copenhagen. Funding for open access charge: the Danish Council for Strategic
Research.
REFERENCES
1. Wan, Y., Kertesz, M., Spitale, R.C., Segal, E. and Chang, H.Y. (2011) Understanding the
transcriptome through RNA structure. Nat Rev Genet, 12, 641-655.
2. Sharp, P.A. (2009) The centrality of RNA. Cell, 136, 577-580.
3. Cruz, J.A., Blanchet, M.F., Boniecki, M., Bujnicki, J.M., Chen, S.J., Cao, S., Das, R., Ding, F.,
Dokholyan, N.V., Flores, S.C. et al. (2012) RNA-Puzzles: a CASP-like evaluation of RNA
three-dimensional structure prediction. RNA, 18, 610-625.
4. Laing, C. and Schlick, T. (2010) Computational approaches to 3D modeling of RNA. J Phys
Condens Matter, 22, 283101.
5. Latham, J.A. and Cech, T.R. (1989) Defining the inside and outside of a catalytic RNA
molecule. Science, 245, 276-282.
6. Tullius, T.D. and Greenbaum, J.A. (2005) Mapping nucleic acid structure by hydroxyl radical
cleavage. Current opinion in chemical biology, 9, 127-134.
7. Brenowitz, M., Chance, M.R., Dhavan, G. and Takamoto, K. (2002) Probing the structural
dynamics of nucleic acids by quantitative time-resolved and equilibrium hydroxyl radical
‘footprinting’. Current Opinion in Structural Biology, 12, 648-653.
8. Balasubramanian, B., Pogozelski, W.K. and Tullius, T.D. (1998) DNA strand breaking by the
hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the
DNA backbone. Proceedings of the National Academy of Sciences, 95, 9738-9743.
9. Adilakshmi, T., Lease, R.A. and Woodson, S.A. (2006) Hydroxyl radical footprinting in vivo:
mapping macromolecular structures with synchrotron radiation. Nucleic Acids Res, 34, e64.
10. Sclavi, B., Sullivan, M., Chance, M.R., Brenowitz, M. and Woodson, S.A. (1998) RNA Folding
at Millisecond Intervals by Synchrotron Hydroxyl Radical Footprinting. Science, 279, 1940-1943.
11. Lipfert, J., Das, R., Chu, V.B., Kudaravalli, M., Boyd, N., Herschlag, D. and Doniach, S. (2007)
Structural Transitions and Thermodynamics of a Glycine-Dependent Riboswitch from Vibrio
cholerae. Journal of Molecular Biology, 365, 1393-1406.
12. Powers, T. and Noller, H.F. (1995) Hydroxyl radical footprinting of ribosomal proteins on
16S ribosomal RNA. RNA, 1, 194-209.
13. Ding, F., Lavender, C.A., Weeks, K.M. and Dokholyan, N.V. (2012) Three-dimensional RNA
structure refinement by hydroxyl radical probing. Nature methods, 9, 603-608.
14. Yoon, S., Kim, J., Hum, J., Kim, H., Park, S., Kladwang, W. and Das, R. (2011) HiTRACE:
high-throughput robust analysis for capillary electrophoresis. Bioinformatics, 27, 1798-1805.
15. Karabiber, F., McGinnis, J.L., Favorov, O.V. and Weeks, K.M. (2013) QuShape: rapid,
accurate, and best-practices quantification of nucleic acid probing information, resolved by
capillary electrophoresis. RNA, 19, 63-73.
16. Lucks, J.B., Mortimer, S.A., Trapnell, C., Luo, S.J., Aviran, S., Schroth, G.P., Pachter, L.,
Doudna, J.A. and Arkin, A.P. (2011) Multiplexed RNA structure characterization with selective
2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proceedings
of the National Academy of Sciences of the United States of America, 108, 11063-11068.
17. Spedding, G. (1990) Ribosomes and protein synthesis : a practical approach. IRL Press at
Oxford University Press, Oxford England ; New York.
18. Kjems, J., Egebjerg, J. and Christiansen, J. (1998) Analysis of RNA-protein complexes in vitro.
Elsevier, Amsterdam ; New York.
19. Shcherbakova, I. and Mitra, S. (2009) Hydroxyl-radical footprinting to probe equilibrium
changes in RNA tertiary structure. Methods in Enzymology, 468, 31-46.
20. Kielpinski, L.J., Boyd, M., Sandelin, A. and Vinther, J. (2013) Detection of reverse
transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods
Mol Biol, 1038, 213-231.
21. Martin, M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing
reads. EMBnet J, 17, 10-12.
22. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X.,
Fan, L., Raychowdhury, R., Zeng, Q. et al. (2011) Full-length transcriptome assembly from
RNA-Seq data without a reference genome. Nat Biotechnol, 29, 644-652.
23. Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat
Methods, 9, 357-359.
24. Fu, G.K., Hu, J., Wang, P.-H. and Fodor, S.P.A. (2011) Counting individual DNA molecules by
the stochastic attachment of diverse labels. Proceedings of the National Academy of
Sciences, 108, 9026-9031.
25. Schneider, C.A., Rasband, W.S. and Eliceiri, K.W. (2012) NIH Image to ImageJ: 25 years of
image analysis. Nat Methods, 9, 671-675.
26. Krasilnikov, A.S., Yang, X., Pan, T. and Mondragon, A. (2003) Crystal structure of the
specificity domain of ribonuclease P. Nature, 421, 760-764.
27. Dunkle, J.A., Xiong, L., Mankin, A.S. and Cate, J.H. (2010) Structures of the Escherichia coli
ribosome with antibiotics bound near the peptidyl transferase center explain spectra of drug
action. Proc Natl Acad Sci U S A, 107, 17152-17157.
28. Weeks, K.M. (2011) RNA structure probing dash seq. Proceedings of the National Academy
of Sciences of the United States of America, 108, 10933-10934.
29. Jayaprakash, A.D., Jabado, O., Brown, B.D. and Sachidanandam, R. (2011) Identification and
remediation of biases in the activity of RNA ligases in small-RNA deep sequencing. Nucleic
Acids Res, 39, e141.
30. Casbon, J.A., Osborne, R.J., Brenner, S. and Lichtenstein, C.P. (2011) A method for counting
PCR template molecules with application to next-generation sequencing. Nucleic Acids Res,
39, e81.
31. Brimacombe, R., Atmadja, J., Stiege, W. and Schüler, D. (1988) A detailed model of the
three-dimensional structure of Escherichia coli 16 S ribosomal RNA in situ in the 30 S subunit.
Journal of Molecular Biology, 199, 115-136.
32. Bunner, A.E., Trauger, S.A., Siuzdak, G. and Williamson, J.R. (2008) Quantitative ESI-TOF
analysis of macromolecular assembly kinetics. Anal Chem, 80, 9379-9386.
33. Seetin, M.G., Kladwang, W., Bida, J.P. and Das, R. (2014) Massively Parallel RNA Chemical
Mapping with a Reduced Bias MAP-Seq Protocol. Methods Mol Biol, 1086, 95-117.
34. Schluenzen, F., Tocilj, A., Zarivach, R., Harms, J., Gluehmann, M., Janell, D., Bashan, A.,
Bartels, H., Agmon, I., Franceschi, F. et al. (2000) Structure of functionally activated small
ribosomal subunit at 3.3 angstroms resolution. Cell, 102, 615-623.
Supplementary information accompanying the paper:
Massive parallel sequencing based hydroxyl radical footprinting of RNA accessibility
Lukasz Jan Kielpinski1, Jeppe Vinther1,*
1Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark
CONTENT Supplementary figures 1-3
Supplementary tables 1-3
Supplementary Figure 1. Impact of the solvent radius used for calculation of ribose accessible surface area on the correlation with HRF-Seq data. A) Correlation between the moving average of ΔTCR and the moving average of ribose accessible surface area calculated with different solvent radii for RNase P. The highest correlation was observed for a probe radius of 1.4 Å. B) Correlation between the moving average of ΔTCR and the moving average of ribose accessible surface area calculated with different solvent radii for 16S rRNA. The highest correlation was observed for a probe radius of 3 Å.
Supplementary Figure 2. Comparison of classical hydroxyl radical probing with HRF-Seq.
A) Autoradiogram of gel electrophoresis of RNase P hydroxyl radical probing (lower lane) with the
ddATP sequencing as size marker (upper lane). B) Quantification of the gel shown in panel A. C)
Termination EUC obtained in the sequencing experiment for the HRF treated RNase P sample. D) Correlation plot
between signal intensity and termination EUC. The R value shown is the Pearson correlation.
Supplementary Figure 3. ΔTCR values before averaging and zeroing.
A) Barplot of non-zeroed ΔTCR for the footprinting of the RNase P RNA. B) Distribution (excluding 5%
top and 5% bottom values) of non-zeroed ΔTCR for the footprinting of the RNase P RNA. The dashed vertical
red line represents the median ΔTCR. C) Barplot of non-zeroed ΔTCR for the footprinting of the 16S
rRNA. D) Distribution (excluding 5% top and 5% bottom values) of non-zeroed ΔTCR for the
footprinting of the 16S rRNA. The dashed vertical red line represents the median ΔTCR.
Oligonucleotide name           Oligonucleotide sequence (5’ to 3’)
RT_random_primer               AGACGTGTGCTCTTCCGATCTNNNNNNNNS
RT_structure_cassette          AGACGTGTGCTCTTCCGATCTGAACCGGACCGAAGCCCG
LIGATION_ADAPTER_RB            PHO-NNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3
PCR_forward                    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.14_AGTTCC    CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC    CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.22_CGTACG    CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.24_GGTAGC    CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
Oligonucleotide sequences © 2007-2009 Illumina, Inc. All rights reserved.
Supplementary Table 1. Oligonucleotides used in the study
Sample              Sequenced    Sequenced position
                    nucleotide   1     2     3     4     5     6     7
16S rRNA, Treated   A            0.26  0.26  0.24  0.27  0.3   0.42  0.08
                    C            0.3   0.33  0.33  0.31  0.28  0.28  0.55
                    G            0.19  0.17  0.19  0.19  0.17  0.1   0.09
                    T            0.25  0.24  0.24  0.23  0.24  0.19  0.28
16S rRNA, Control   A            0.25  0.25  0.24  0.28  0.31  0.43  0.08
                    C            0.3   0.33  0.32  0.31  0.28  0.27  0.58
                    G            0.2   0.18  0.19  0.19  0.17  0.1   0.09
                    T            0.25  0.24  0.25  0.22  0.24  0.2   0.26
RNase P, Treated    A            0.27  0.26  0.25  0.27  0.3   0.41  0.09
                    C            0.29  0.33  0.33  0.31  0.29  0.28  0.52
                    G            0.2   0.17  0.18  0.19  0.17  0.1   0.09
                    T            0.24  0.24  0.24  0.23  0.24  0.2   0.3
RNase P, Control    A            0.26  0.26  0.25  0.27  0.3   0.42  0.09
                    C            0.3   0.33  0.32  0.3   0.29  0.28  0.54
                    G            0.2   0.17  0.19  0.2   0.17  0.1   0.09
                    T            0.24  0.24  0.24  0.23  0.25  0.2   0.28
Supplementary Table 2. Nucleotide frequencies at each barcode position for each sample used to
calculate the barcode ligation probabilities.
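One simple way to turn such per-position frequencies into a ligation probability for a given 7-nucleotide barcode is to multiply the frequencies of its nucleotides, assuming that the positions are independent. The sketch below illustrates this with the values for the treated 16S rRNA sample; the independence assumption and the function name are illustrative and are not taken from the EUC calculation itself.

```python
from typing import Dict, List

# Nucleotide frequencies per barcode position, 16S rRNA, Treated (Supplementary Table 2)
FREQ_16S_TREATED: Dict[str, List[float]] = {
    "A": [0.26, 0.26, 0.24, 0.27, 0.30, 0.42, 0.08],
    "C": [0.30, 0.33, 0.33, 0.31, 0.28, 0.28, 0.55],
    "G": [0.19, 0.17, 0.19, 0.19, 0.17, 0.10, 0.09],
    "T": [0.25, 0.24, 0.24, 0.23, 0.24, 0.19, 0.28],
}

def barcode_probability(barcode: str, freq: Dict[str, List[float]]) -> float:
    """Probability of a barcode as the product of per-position nucleotide
    frequencies (assumes independence between barcode positions)."""
    p = 1.0
    for position, nucleotide in enumerate(barcode):
        p *= freq[nucleotide][position]
    return p

p_c_rich = barcode_probability("CCCCCCC", FREQ_16S_TREATED)
p_g_rich = barcode_probability("GGGGGGG", FREQ_16S_TREATED)
uniform = 0.25 ** 7  # probability of any barcode if all nucleotides were equally likely
```

Under this model the barcodes are far from equally likely: `p_c_rich` is larger than `uniform` and `p_g_rich` is smaller, mainly because of the strong nucleotide bias at positions 6 and 7.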
Sample      Counting   Counting unique   EUC equal             EUC estimated
            reads      barcodes          barcode frequencies   barcode frequencies
RNase P     0.45       0.50              0.53                  0.55
16S rRNA    0.49       0.56              0.56                  0.56
Supplementary Table 3. Pearson correlation between HRF-Seq signal and ribose accessibility for
different methods of processing the sequencing data
11.3 Paper 3: Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA‐Stop‐Seq)
Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA‐Stop‐Seq)
Abstract
Antisense oligonucleotides (ASOs) form a new class of promising drug candidates that act by hybridizing to
RNA molecules and exploit various cellular mechanisms for their function. Here, we describe the
development of a method for transcriptome‐wide characterization of ASO binding sites by finding the ASO
induced reverse transcription termination sites. First, we characterized several reverse transcriptase
enzymes and chose PrimeScript for the remaining experiments. Next, we optimized the
separation of hybridized oligonucleotides from RNA by gel filtration in formamide. Then, we
characterized the crosslinking of a 4‐thiothymidine (4‐thio‐T) modified oligonucleotide to the RNA. We
investigated two possibilities for enriching the ASO‐terminated cDNA molecules. The first was based on
degradation of RNA molecules (or their parts) not protected by the crosslinked oligonucleotide. The second was
based on CAGE‐like selection of cDNA molecules terminated upon reaching the crosslinked, biotinylated
ASO. The second strategy was used to build the libraries for massive parallel sequencing. The motif generated
from the sequencing results recapitulates the sequence of the ASO used, and the overall signal shows
enrichment in the vicinity of the possible binding sites. On the other hand, a portion of the signal is of no
obvious origin and the analysis is ongoing.
Introduction
Antisense oligonucleotides have long been considered to have therapeutic potential and have lured researchers with
the promised ease of designing drugs by simply synthesizing a molecule with a sequence matching the
troublesome gene. Many strategies of action have been proposed, including hybridization with microRNAs to
inhibit their function, blocking the splicing machinery to modulate mRNA maturation or, most commonly,
degrading disease‐causing transcripts with siRNAs or gapmers (Kole et al., 2012; Stenvang et al., 2012). It
was recognized that various modifications are required to improve drug properties such as delivery to the
tissue of interest, hybridization and stability. One of the promising modifications is the substitution of
some or all of the nucleotides with the nucleotide analog locked nucleic acid (LNA) (Koch et al., 2008),
which protects the oligonucleotide (ASO) from degradation by nucleases and significantly increases its affinity
for the target. LNA is incorporated in many drug candidates, deploying strategies such as microRNA
inhibition by antisense hybridization (Lanford et al., 2009; Obad et al., 2011) or mRNA degradation with
a gapmer (Straarup et al., 2010), that is, a molecule with a DNA core (which recruits RNase H) and flanks
composed of LNA. In the case of siRNAs it was shown that they act not only on the intended targets but
also exhibit sequence‐specific (Jackson et al., 2003; Lindow et al., 2012) or sequence‐non‐specific (Olejniczak et al., 2010)
effects. In contrast, little is known about off‐target effects of LNA containing oligonucleotides,
which is of significant interest considering the current therapeutic development of drugs based on the LNA
chemistry.
Several approaches to profiling the accessibility of RNA for interactions with
oligonucleotides have been published. Those methods were based on hybridizing a target RNA with random
oligo(deoxy)nucleotides and detecting sites of efficient binding by dialysis, RNase H treatment or reverse
transcription priming. Alternatively, the RNA was hybridized to oligonucleotide‐coated arrays to detect
which oligonucleotides it can stably bind (summarized in (Allawi et al., 2001)).
Here we describe the development of a method, named LNA‐Stop‐Seq, which allows for the identification
of binding sites of an oligonucleotide (here LNA modified) across the entire transcriptome. As a
proof of concept, we apply LNA‐Stop‐Seq to find hybridization sites of a previously described gapmer,
which targets apolipoprotein B and reduces the plasma level of non‐high‐density lipoprotein cholesterol
(Straarup et al., 2010). The method relies on crosslinking of the hybridized ASO bearing 4‐thiothymidine
(4‐thio‐T, Figure 1) to the transcripts and finding the specific sites of interaction by massive parallel
sequencing of reverse transcription terminations. 4‐thio‐T is an analog of a well characterized
crosslinking group, 4‐thiouridine (4‐thio‐U), which is a naturally occurring nucleotide that crosslinks at
close range with both amino acids and nucleotides upon long‐wavelength UV (>320 nm) excitation. Crosslinking
sites on RNA can be detected by finding terminations of reverse transcription (Sontheimer, 1994). The
advantages of using a photocrosslinkable nucleotide for covalent binding of the ASO to its hybridization
sites include preservation of the ASO structure, stability (until irradiation) and, thanks to the long UV
wavelength used, minimal crosslinking between other groups present in the probed mixture (Meisenheimer and Koch,
1997).
Figure 1. 4‐thiothymidine structure
Materials and methods
Buffers used
2x RNA folding buffer (40 mM Tris‐HCl pH 7.8, 280 mM KCl) (Kjems et al., 1998)
2x RNA folding buffer – EDTA (40 mM Tris‐HCl pH 7.8, 280 mM KCl, 0.01 mM EDTA)
10x Mg for RNA folding (20 mM Tris‐HCl pH 7.8, 140 mM KCl, 25 mM MgCl2)
10x Mg for RNA folding – EDTA (20 mM Tris‐HCl pH 7.8, 140 mM KCl, 25 mM MgCl2, 0.005 mM EDTA)
Preparation of in vitro transcribed RNA
1. ApoB RNA fragment
The PCR product amplified from human genomic DNA with primers ApoBrev and ApoBfor+T7 using Pfu
DNA polymerase was used as a template for transcription with T7 RNA polymerase followed by
polyacrylamide gel purification with UV shadowing for product visualization. The expected RNA sequence is
GGGAGAUUCUCCUUUAAAUCAAGUGUCAUCACACUGAAUACCAAUGCUGAACUUUUUAACCAGUCAGAUAUUG
UUGCUCAUCUCCUUUCUUCAUCUUCAUCUGUCAUUGAUGCACUGCAGUACAAAUUAGAGGGCACCACAAGAUU
GACAAGAAAAAGGGGAUUGAAGUUAGCCACAGCUCUGUCUCUGAGCA.
2. ApoB mutated fragments
Mutated fragments of ApoB were obtained in the same way as ApoB fragment, but PCR products were
synthesized with either ApoB‐rev‐A, ApoB‐rev‐C, ApoB‐rev‐G or ApoB‐rev‐T primer in pair with ApoBfor+T7.
Expected RNA sequence is
GGGAGAUUCUCCUUUAAAUCAAGUGUCAUCACACUGAAUACCAAUGCXGAACUUUUUAACCAGUCAGAUAUUG
UUGCUCAUC, where X indicates the mutated base and can be either A, C, G or U.
3. IGF‐II RNA fragment
Human IGF‐II fragment RNA was a gift from Jan Christiansen (its predicted structure is shown in Figure 8
of (Christiansen et al., 1994), but the 3’ end of the RNA molecule used is located downstream of the 3’ end
shown in that figure).
High resolution polyacrylamide electrophoresis
Thermally denatured samples were resolved on a preheated 1x TBE, 7 M urea polyacrylamide gel (concentration
given in the methods of the specific experiments) at 45‐50 W. Gels were transferred onto Whatman paper,
dried, and exposed to a phosphorimaging screen, which was subsequently scanned with the Cyclone Storage
Phosphor System (Packard), usually after 16 hours of exposure.
Choice of reverse transcriptase (Figure 2)
Reverse transcription reactions were performed according to the manufacturers’ recommendations with 5’ end
labeled (T4 PNK with ATP γ‐32P) ApoB_PE (for the ApoB fragment) or IGF2_PE.h‐p (for the IGF‐II fragment) primers,
except that: (1) the final volume of the reactions was 18.75 µl and contained 1 µl of the respective enzyme; (2) all reactions
were supplemented with 667 mM sorbitol and 133 mM trehalose; (3) the initial mixture was prepared by
mixing 100 fmol labeled primer with 100 fmol RNA and 100 ng tRNA and, for the reactions marked “H”,
1 pmol ApoB‐str.dis5' ASO (which is complementary to the 5’ part of the RNA); (4) thermal conditions: the initial
mixture was heated to 65°C (70°C for IGF‐II) for 5 min, transferred to ice, supplemented with master
mix and incubated as follows: 42°C for 10 min, ramp of 0.1°C/s until 50°C, 50°C for 30 min, then 10 min at
56°C, 10 min at 60°C and cooling to 4°C. ThermoScript enzyme reactions were incubated at 50°C for 20 min,
65°C for 40 min and 85°C for 5 min, and cooled to 4°C. The volume of reactions with the IGF‐II fragment was scaled
down by a factor of 2. Samples were mixed with an equal volume of either formamide (ApoB fragment) or
urea gel loading solution and resolved by high resolution polyacrylamide electrophoresis.
LNA‐RNA hybrids separation assay (Figure 3A)
5 pmol of ApoB RNA fragment was mixed with 5 µg of yeast tRNA and 10 pmol of either 6434 or 6435
ASO, incubated for 2 min at 65°C and kept at room temperature for 5 min, followed by ethanol precipitation
and resuspension of the pellets in deionized formamide (84 µl). Columns (NucAway – Ambion, AutoSeq G‐50 – GE
Healthcare, illustra MicroSpin S‐300 – GE Healthcare) were either prepared with formamide (NucAway) or
buffer exchanged with formamide by 4x repeated application of 500 µl (S‐300) or 350 µl (AutoSeq) and
centrifugation at 730 g for 1 min. The RNA‐oligo mix in formamide (20 µl) was heat denatured (95°C for
2 min), applied on the columns and spun for 1 min at 730 g. The flow‐through was ethanol precipitated
(sample M1 and A with 4 volumes of ethanol, M2 and S with ethanol and sodium acetate – each sample
was precipitated with both protocols and chosen by the highest yield), dissolved in H2O, quantified with
NanoDrop, thermally denatured together with primer (100 fmol 5’ end labeled ApoB_PE; 70°C, 5 min) and
used for primer extension reaction (8.88 µl reactions with 1x PrimeScript buffer, 0.2 µl PrimeScript,
667 mM sorbitol and 133 mM trehalose, 0.5 mM dNTP, thermal conditions 42°C – 10 min, 50°C – 30 min,
56°C – 10 min, 60°C – 10 min and cooled to 4°C). After primer extension, the samples were ethanol
precipitated and the pellets dissolved in 5 µl formamide loading dye, of which 2 µl were run on a 10% high
resolution polyacrylamide gel. Signal quantification was performed with the ImageJ program (Schneider et
al., 2012) for each lane for the regions marked with the yellow bar by multiplying the mean signal from the
given region by its area and subtracting the mean background signal (measured over a relatively large area
outside the sample region) multiplied by the area of the measured region.
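The background correction described above amounts to a one-line calculation, sketched here for clarity. The function name and the example numbers are hypothetical; the mean signal, the region area and the mean background are assumed to be read from ImageJ measurements.

```python
def background_corrected_signal(mean_signal: float, area: float, mean_background: float) -> float:
    """Integrated signal of a gel region after background subtraction.

    mean_signal     -- mean pixel intensity of the measured region
    area            -- area of the measured region (in pixels)
    mean_background -- mean pixel intensity of a region outside the sample lanes
    """
    return mean_signal * area - mean_background * area

# Hypothetical measurements for one lane region
corrected = background_corrected_signal(mean_signal=120.0, area=50.0, mean_background=20.0)
```

With these example values the integrated signal is 6000 and the background contribution is 1000, so `corrected` is 5000. Because the background is scaled by the area of the measured region, regions of different sizes remain comparable.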
ASO‐RNA crosslinking (Figure 3B)
2.3 pmol of ApoB RNA fragment was mixed with 4 pmol of the respective ASO and 1056 ng yeast tRNA in 1.03x
PrimeScript buffer in 21 µl, heated to 65°C for 1 min and transferred to room temperature. 20 µl droplets
were spotted on Parafilm in a Stratalinker 1800 equipped with EIKO F8T5 (Blacklight) lamps and irradiated for
approximately 23 minutes. The remaining liquid (approximately 11 µl were left due to evaporation) was ethanol
precipitated, resuspended in 22 µl deionized formamide and incubated for 2 min at 95°C, and 20 µl was applied
to formamide‐washed MicroSpin S‐300 columns as described for Figure 3A. The flow‐through
was ethanol precipitated (with sodium acetate), and the pellet was washed and dissolved in 10 µl H2O. 2.58 µl
was used for the primer extension reaction and gel electrophoresis as described in the method for Figure 3A.
Base pairing with different nucleotides (Figure 3C)
1 pmol of an ApoB RNA fragment or of a mutated ApoB RNA fragment was mixed with 1 µg of yeast
tRNA and 2 pmol of the respective ASO in 1x RNA folding buffer in a volume of 18 µl. Samples were denatured
(1 min, 90°C) and transferred to room temperature. After cooling down, the magnesium concentration was
adjusted to 2.5 mM MgCl2 with 10x Mg for RNA folding buffer and the samples were crosslinked for 10 min
with NIS F8T5 Black Light bulbs in a Stratalinker 1800, ethanol precipitated, resuspended in 20 µl deionized
formamide, purified with S‐300 columns as described in the method for Figure 3B and analyzed by primer
extension and electrophoresis as described in the method for Figure 3A.
Time‐gradient of crosslinking (Figure 4)
The experiment was performed as for Figure 3B with modifications: a 20 µl mix that contained 1 pmol
of ApoB RNA fragment, 5 pmol of ASO (6434 or 4‐thio‐1) and 1 µg of yeast tRNA in 1x PrimeScript buffer
was irradiated for varying times.
Enrichment with Terminator enzyme (Figure 5)
155 pmol cytidine‐3'‐phosphate was 5’ end labeled with 3.3 pmol ATP γ‐32P using T4 PNK enzyme in 1x T4
PNK buffer (Fermentas) by incubating in a volume of 5 µl for 40 min at 37°C followed by enzyme
inactivation at 70°C for 5 min. The obtained [5'‐32P]pCp (not purified) was ligated to the 3’ end of the ApoB
fragment (~6 pmol used) with T4 RNA ligase in 1x T4 RNA ligase buffer (Fermentas) supplemented with ATP
at 4°C overnight to yield a 3’ end labeled RNA molecule. Labeling was followed by NucAway purification
(Ambion), and 21.5 out of 38 µl of the eluate was subjected to RNA 5’ polyphosphatase (Epicentre) treatment to
convert the 5’ triphosphate to a 5’ monophosphate (1x polyphosphatase buffer, 1 µl enzyme per 25 µl reaction,
37°C for 30 min), followed by one more NucAway purification. The RNA was split into 5 equal parts and
mixed with the respective ASO in 1x Terminator buffer A (Epicentre), heated to 65°C for 1 min, brought to
room temperature for 5 min and crosslinked in open tubes with black light bulbs for 15 min. The volume was
adjusted with H2O to 10 µl and 5 µl of the reaction was transferred to a new tube for Terminator (5’
phosphate dependent exonuclease) digestion with 0.22 µl of the enzyme in a 7.2 µl reaction (in 1x Terminator
buffer A) for 30 min at 30°C, followed by addition of 1 µg tRNA, ethanol precipitation and high resolution
10% polyacrylamide denaturing electrophoresis.
RNase I protection assay (Figure 6)
A body‐labeled ApoB RNA fragment was synthesized with T7 RNA polymerase from the same PCR template
as the non‐labeled ApoB RNA fragment with addition of UTP α‐32P. The RNA was DNase I treated (Ambion)
and purified on a NucAway column. The labeled RNA was mixed with 4‐thio‐7 ASO in 1x RNA folding buffer,
heated to 90°C for 1 min, incubated at 37°C for 15 min, supplemented with 10x Mg for RNA folding to
obtain 1x concentration, incubated for 5 more minutes at 37°C and crosslinked for 20 min with black light
bulbs, followed by adjusting the volume with H2O. The irradiated RNA duplexed with ASO was digested
with RNase H (NEB) with 10 mM DTT for 1 hour at 37°C, followed by ethanol precipitation enhanced by
the addition of tRNA. Products were resolved on a 6% polyacrylamide denaturing gel and visualized by
autoradiography; the band expected to be derived from RNase H cleavage of RNA with crosslinked ASO
(identified by comparison with a clearly visible band on the cold gel) was cut out, and the RNA was eluted and
precipitated. The resulting sample of RNA‐ASO crosslinked complex was split into two parts and used as a template for a
primer extension reaction with cold ApoBrev primer, in which the RNA samples were mixed with 10 pmol
primer in a volume of 22.5 µl, incubated for 5 min at 65°C and placed on ice. Reverse transcription master
mix was prepared by mixing 22.5 µl 5x PrimeScript buffer, 5.63 µl 10 mM dNTP, 45 µl sorbitol‐trehalose mix
(1.67 M and 0.33 M) and 15.38 µl H2O. The master mix was split into 2 times 88.5 µl, one supplemented
with 1.5 µl PrimeScript enzyme, one with 1.5 µl H2O and added to the RNA‐primer mix and incubated under the
same thermal conditions as the reactions for Figure 3A. The reverse transcription was transferred to ice and stopped
by addition of 15 µl 50 mM EDTA and 3 µg tRNA. Each reaction was split into six times 20 µl and
supplemented with different amounts (3, 1.5, 0.75 or 0.375 µl) of RNase I (Fermentas), phenol‐chloroform
extracted, ethanol precipitated and resolved on a 10% polyacrylamide denaturing gel.
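The final concentrations in the reverse transcription reaction can be checked with a short dilution calculation. The sketch assumes that each reaction combines 88.5 µl of master mix (the listed components sum to about 88.5 µl), 1.5 µl of enzyme or H2O and the 22.5 µl RNA‐primer mix, giving 112.5 µl in total; the function name is hypothetical.

```python
def final_concentration(stock_concentration: float, stock_volume: float, total_volume: float) -> float:
    """Concentration after dilution: C_final = C_stock * V_stock / V_total."""
    return stock_concentration * stock_volume / total_volume

# Assumed composition of one reaction (volumes in µl)
total_volume = 88.5 + 1.5 + 22.5  # master mix + enzyme or H2O + RNA-primer mix

buffer_x = final_concentration(5.0, 22.5, total_volume)         # PrimeScript buffer, in "x"
dntp_mM = final_concentration(10.0, 5.63, total_volume)         # dNTP, in mM
sorbitol_mM = final_concentration(1670.0, 45.0, total_volume)   # sorbitol, in mM
trehalose_mM = final_concentration(330.0, 45.0, total_volume)   # trehalose, in mM
```

Under this assumption the reaction contains 1x buffer, about 0.5 mM dNTP, 668 mM sorbitol and 132 mM trehalose, which agrees within rounding with the 667 mM sorbitol, 133 mM trehalose and 0.5 mM dNTP stated for the other primer extension reactions.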
Demonstration of selection (Figure 7)
100 fmol of ApoB RNA fragment was mixed with 0, 100 or 1000 fmol of the respective ASO in a volume of
36 µl in 1x RNA folding buffer – EDTA and folded in a thermocycler with the following program: 90°C for 1 min, ramp
0.1°C/s until 79°C, 79°C for 5 min, ramp 0.1°C/s until 74°C, 74°C for 10 min, ramp 0.1°C/s until 69°C, 69°C
for 5 min, ramp 0.1°C/s until 37°C, 37°C for 10 min, and then supplemented with 4 µl 10x Mg for RNA folding –
EDTA. After folding, the samples were placed on Parafilm in the Stratalinker and irradiated for 10 min with EIKO
black light bulbs, collected, ethanol precipitated, dissolved in formamide, spun through S‐300 columns (as
in the description for Figure 3A) and reverse transcribed (as in the description for Figure 6, but scaled down to
25 µl per reaction and using only 3/10 of the correspondingly calculated amount of enzyme). After the reverse
transcription, 4 µl of 0.17 µg/µl tRNA and 41.7 µl EDTA were added to each of the reactions, followed by
addition of 0.54 µl RNase I and incubation at 37°C for 30 min and purification with RNAClean XP beads (as
in (Kielpinski et al., 2013) but with elution in 25 µl 10 mM Tris‐HCl pH 8.3). Fraction of the sample (5 µl)
underwent selection as described in the CAGE protocol (Takahashi et al., 2012)[sections 3.6 and 3.7] but
scaled down by the factor of 7 (volume). An adapter (LIG_DNA) was ligated to the 3’ end of cDNA from
selected (1 µl out of 16.25 µl eluted from the beads) and non‐selected (3 µl taken from purification after
RNase I treatment) samples using CircLigase as described in (Kielpinski et al., 2013) but using 10 times less
enzyme. Samples were purified (Ampure XP) and PCR amplified with LIG_PCR and ApoB_qPCR_R primers
for 35 cycles, resolved on 3% agarose, 1x TBE gel, stained with ethidium bromide and UV visualized.
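The folding program described above is fully specified by its hold temperatures, hold times and ramp rate, so its run time can be checked with a few lines of Python. The following is only an illustrative sketch: the steps are transcribed from the text, every ramp is assumed to run at exactly 0.1°C/s, and the names `HOLDS`, `RAMP_RATE` and `program_duration_s` are hypothetical.

```python
# Hold steps of the folding program as (temperature in °C, hold time in s),
# transcribed from the description in the text.
HOLDS = [(90, 60), (79, 300), (74, 600), (69, 300), (37, 600)]
RAMP_RATE = 0.1  # °C per second, assumed constant between consecutive holds


def program_duration_s(holds, ramp_rate):
    """Return the total time in seconds: all holds plus all ramps between them."""
    hold_time = sum(seconds for _, seconds in holds)
    ramp_time = sum(
        abs(t_next - t_prev) / ramp_rate
        for (t_prev, _), (t_next, _) in zip(holds, holds[1:])
    )
    return hold_time + ramp_time


total_s = program_duration_s(HOLDS, RAMP_RATE)
print(f"{total_s / 60:.1f} min")
```

Under these assumptions the program takes about 40 min (1860 s of holds plus 530 s of ramps), not counting the subsequent Mg addition.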
Preparation of sequencing library
Mouse liver total RNA (Zyagen MR‐314) was poly(A) enriched using Ambion Poly(A) purist MAG kit with
1.6% yield. 200 ng of poly(A) RNA was folded with 0, 0.02, 0.2 or 2 pmol of ASO (4‐thio‐1 batch 2 or 4‐thio‐
1‐biotin – synthesized in parallel) in 1x RNA folding buffer – EDTA in 36 µl in a thermocycler following the
program: 90°C for 1 min, ramp 0.1°C/s until 79°C, 79°C for 5 min, ramp 0.1°C/s until 74°C, 74°C for 10 min,
ramp 0.1°C/s until 69°C, 69°C for 5 min, ramp 0.1°C/s until 37°C, 37°C for 10 min, add 4 µl 10x Mg for RNA
folding – EDTA, 37°C for 5 min, add oligo (in the volume of 2 µl, note that for this sample folding occurred in
34 µl in 1.06x RNA folding buffer – EDTA) to the “cofolded” sample, 37°C for 5 min, crosslink all samples for
10 min (drops on parafilm; no‐UV sample removed before irradiation) in Stratalinker 1800 with EIKO bulbs
and ethanol precipitate. The pellets were dissolved in 30 µl formamide (samples 11 and 12 in 8 µl H2O – no
column elution), RNA purified on S‐300 HR spin column as described in method for Figure 3A (the columns
were prepared by 2x spinning with 700 µl and 1x with 350 µl formamide) and subsequently ethanol
precipitated with glycogen as carrier and resuspended in 8 µl H2O. For the reverse transcription, 4 µl of RNA
was mixed with 1 µl 10 µM RT_15xN primer, incubated at 65°C for 5 min and cooled on ice, followed by
addition of 15 µl master mix prepared by mixing 4 volumes of 5x PrimeScript buffer, 1 volume of 10 mM
dNTP, 8 volumes of sorbitol‐trehalose mix (1.67 M and 0.33 M) and 2 volumes of PrimeScript enzyme. The
reactions were incubated in thermocycler following the program: 25°C for 30 sec, 42°C for 30 min, 50°C for
10 min, 56°C for 10 min, 60°C for 10 min and placed on ice. To each reaction 4 µl with 666 ng of tRNA and
167 nmol of EDTA and 0.5 µl RNase I was added followed by 30 min incubation at 37°C and RNAClean XP
purification as in RTTS‐Seq (Kielpinski et al., 2013) with elution in 25 µl of 10 mM Tris‐HCl pH 8.3. 20 µl from
each reaction was used for selection which was scaled down version of the reaction described in the CAGE
protocol (Takahashi et al., 2012). Briefly: 440 µg MPG streptavidin mix was incubated for 30 min with
132 µg tRNA, washed twice with wash buffer 1, resuspended in 352 µl wash buffer 1 and split into 40 µl
batches in low‐binding tubes to which 20 µl purified cDNA was added and allowed to bind for 30 min at
room temperature, followed by washings with buffers 1 (1x) , 2 (1x), 3 (2x), 4 (2x) and released with 30 µl
50 mM NaOH by incubating for 10 min. The supernatant was neutralized with 6 µl 1 M Tris‐HCl (pH 7) and
purified with 65 µl Ampure XP beads (mod. prot.) with elution in 8 µl H2O. Selected and non‐selected cDNA
was ligated with LIG_DNArandBARC oligonucleotide as described in (Kielpinski et al., 2013), purified with
Ampure XP beads (mod. prot.), eluted in 16 µl H2O. The ligated samples (5 µl) were used for PCR
amplification as described in (Kielpinski et al., 2013) in the total reaction volume of 20 µl using 22 cycles for
selected and 12 cycles for non‐selected samples (plus 4 initial three‐step cycles). After PCR, the 10x diluted amplified
libraries were quantified and quality checked on Bioanalyzer High Sensitivity chips and mixed by adding
5.38 µl of the selected samples, 10 µl samples 9 and 10 and 4.27 µl samples 11 and 12 onto 250 nmol of
EDTA followed by purification with 137 µl Ampure XP according to the manufacturer’s protocol with elution
in 10 µl 10 mM Tris‐HCl, 2% E‐Gel SizeSelect (Invitrogen) size selection keeping fragments between 200 and 600
bp (buffer was collected from a lower chamber every 20 seconds of the electrophoresis run), volume
reduction with Qiagen PCR purification kit with elution in 30 µl Tris‐HCl obtaining concentration 5.5 ng/µl
(NanoDrop) and subsequent 94 nt long single‐read Illumina HiSeq sequencing multiplexed with another
sample from the laboratory (Jakob Rukov).
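The composition of the reverse transcription reaction described above can be verified with simple dilution arithmetic. The sketch below is not part of the original protocol; it assumes that the 15 µl of master mix is combined with 5 µl of RNA plus primer (20 µl total) and that the concentrations given for the sorbitol‐trehalose mix refer to sorbitol (1.67 M) and trehalose (0.33 M) respectively. All function and variable names are hypothetical.

```python
# Parts of each component in the master mix, as listed in the text.
MASTER_MIX_PARTS = {
    "5x PrimeScript buffer": 4,
    "10 mM dNTP": 1,
    "sorbitol-trehalose mix": 8,
    "PrimeScript enzyme": 2,
}
MASTER_MIX_UL = 15.0     # master mix added to each reaction
RNA_AND_PRIMER_UL = 5.0  # 4 µl RNA + 1 µl 10 µM RT_15xN primer


def component_volumes(parts, master_mix_ul):
    """Volume (µl) of each component contained in `master_mix_ul` of mix."""
    total_parts = sum(parts.values())
    return {name: master_mix_ul * n / total_parts for name, n in parts.items()}


def diluted(stock_concentration, volume_ul, total_ul):
    """Final concentration after diluting `volume_ul` of stock into `total_ul`."""
    return stock_concentration * volume_ul / total_ul


volumes = component_volumes(MASTER_MIX_PARTS, MASTER_MIX_UL)
total_ul = MASTER_MIX_UL + RNA_AND_PRIMER_UL

final = {
    "buffer (x)": diluted(5.0, volumes["5x PrimeScript buffer"], total_ul),
    "dNTP (mM)": diluted(10.0, volumes["10 mM dNTP"], total_ul),
    # Assumption: 1.67 M refers to sorbitol and 0.33 M to trehalose.
    "sorbitol (M)": diluted(1.67, volumes["sorbitol-trehalose mix"], total_ul),
    "trehalose (M)": diluted(0.33, volumes["sorbitol-trehalose mix"], total_ul),
    "primer (µM)": diluted(10.0, 1.0, total_ul),
}

for name, value in final.items():
    print(f"{name}: {value:.3f}")
```

With these assumptions the buffer ends up at 1x in the final 20 µl reaction, which is a convenient sanity check on the stated mixing ratios.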
Massive parallel sequencing data analysis
Reads were pre‐processed with the Cutadapt utility (Martin, 2011) with options “-m 27 -a
AGATCGGAAGAGCACACGTCT -q 17”, followed by trimming and keeping the barcode (first 7 nt), TopHat2
(Kim et al., 2013) mapping to a mouse mm9 genome assembly and trimming untemplated nucleotides from
the beginning of the remaining read (Kielpinski et al., 2013). Estimated unique counts (EUC) of reads
sharing reverse transcription termination site (RTTS) were calculated based on the number of unique
barcodes of all the reads mapping to a given location as described in Paper 2. The EUC per position was
displayed as a BedGraph track in the UCSC Genome Browser (Kent et al., 2002) as shown in Figure 9. Input
for the MEME motif discovery (Bailey and Elkan, 1994) was generated with a custom script based on
Bioconductor (Gentleman et al., 2004), which extracted the 20 nt sequence located upstream
of the position with the highest EUC in each RefSeq transcript (when several positions in the same
transcript had equal counts, one was chosen randomly). cWords analysis (Rasmussen et al., 2013) was
performed on the web server (http://servers.binf.ku.dk/cwords/) on December 20th 2012 with the options
“Species:Mouse” and “Sequences:mRNA”. The input consisted of Ensembl gene IDs sorted (decreasing) by
the ratio of number of reads mapping to a given transcript (longest isoform of each gene) in the tested
sample (indices 3,5,7) to the sum of reads mapping to this transcript in the control samples (indices 11 and
12). Motifs reported were the top motifs enriched in the up‐regulated genes. Plots for the Figure 10B were
prepared according to the section 3.12 in (Kielpinski et al., 2013) using starting positions of Bowtie
(Langmead et al., 2009) mapped sequence of the used oligonucleotide as the annotation (options: “-y -S -a
-n 2 mm9 -c GCATTGGTATTCA”).
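The preparation of the MEME input described above (taking the position with the highest EUC in each transcript, breaking ties at random, and extracting the 20 nt upstream) can be illustrated as follows. The original script was based on Bioconductor; the plain Python version below is a stand‐in written for illustration only. The function names, the input layout and the decision to skip transcripts with fewer than 20 nt upstream of the chosen position are assumptions.

```python
import random


def upstream_of_top_position(sequence, euc_by_position, length=20, rng=random):
    """Return `length` nt upstream of the position with the highest EUC.

    `euc_by_position` maps 0-based transcript positions to EUC values.
    Ties are broken at random. Returns None when fewer than `length`
    nucleotides lie upstream of the chosen position (an assumption of this
    sketch; the text does not say how such cases were handled).
    """
    if not euc_by_position:
        return None
    top = max(euc_by_position.values())
    candidates = sorted(p for p, v in euc_by_position.items() if v == top)
    position = rng.choice(candidates)
    if position < length:
        return None
    return sequence[position - length:position]


def meme_input_fasta(transcripts, length=20, rng=random):
    """Build FASTA text from {name: (sequence, euc_by_position)}."""
    records = []
    for name, (sequence, euc) in sorted(transcripts.items()):
        window = upstream_of_top_position(sequence, euc, length, rng)
        if window is not None:
            records.append(f">{name}\n{window}")
    return "\n".join(records) + "\n" if records else ""


# Toy transcript: the highest EUC is at position 30, so the 20 nt before it
# (positions 10 to 29) are written to the FASTA output.
toy = {"tx1": ("A" * 10 + "C" * 20 + "G" * 5, {30: 7.0, 5: 2.0})}
fasta = meme_input_fasta(toy, rng=random.Random(0))
print(fasta, end="")
```

Passing an explicit `random.Random` instance makes the random tie‐breaking reproducible, which is useful when the motif search is rerun.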
RNA‐RNA interactions prediction (Figure 11)
RNA‐RNA interactions were predicted using RNAStructure v5.3 (Reuter and Mathews, 2010) and figures
were generated using VARNA v3.9 (Darty et al., 2009).
Results
Choice of a reverse transcriptase
In this study we aimed at finding the RNA‐ASO hybridization sites on the transcriptome‐wide scale by
finding the ASO‐induced reverse transcription termination sites (RTTS). To reduce the biases and obtain a
signal of the highest possible quality, we first set out to choose which reverse transcriptase to use. First of
all, we found it important to select an enzyme that can efficiently pass through stable RNA structures
(Harrison et al., 1998). We have performed the primer extension reactions on the stable hairpins from
human IGF‐II mRNA (Christiansen et al., 1994) (Figure 2B). Comparison of seven commercially available
enzymes left us with SuperScript II, SuperScript III, PrimeScript and ThermoScript as being able to efficiently
pass through the structured RNA (Figure 2A, B). Another important consideration regarding choice of the
enzyme is its terminal transferase activity (Kulpa et al., 1997) that should be minimized to improve mapping
efficiency and precision. To test for that property, we ran a high resolution electrophoresis of primer
extension reactions performed with an in vitro transcribed RNA fragment as template. Use of AccuScript
and AffinityScript enzymes led us to a very well defined full‐length product (Figure 2C), use of a PrimeScript
enzyme resulted in two main bands, use of SuperScript enzymes gave rise to one main band and several
weaker ones, while ThermoScript‐derived cDNA molecules had the widest length distribution.
An additional concern with the strategy was that performing a randomly primed reverse
transcription reaction in a complex mixture may cause interference between synthesized cDNA
molecules. This can happen if a newly synthesized strand terminates on a cDNA strand
synthesized upstream on the same RNA molecule. Such interference would be minimized if the reverse
transcriptase had efficient strand displacement activity. To check for that property we have carried
out the primer extension reaction in the presence of DNA ASO hybridized to the RNA (upstream from the
site complementary to the labeled primer) which resulted in the detection of undesired shorter cDNA
molecules for the AccuScript, AffinityScript and ThermoScript. Low resolution of the gel doesn’t allow
distinguishing if the early termination is related to the presence of RNase H activity or inefficient strand
displacement (Figure 2D) but both would be undesired. Combination of the tests led us to choose the
PrimeScript as an optimal enzyme for our assay.
Figure 2. Characterization of reverse transcriptases. (A) Gel electrophoresis of primer extension reaction on highly structured IGF‐II RNA fragment (shown on panel B) with different reverse transcriptases. (B) Structure of the IGF‐II fragment after (Christiansen et al., 1994). The red arrow indicates the primer binding site (only reverse transcribed part of RNA shown). (C) Heterogeneity of full length products of primer extension on ApoB fragment. (D) Primer extension of ApoB fragment with (H) or without (C) complementary DNA oligonucleotide. Enzyme names abbreviations: SII – SuperScript II (Invitrogen), SIII – SuperScript III (Invitrogen), P – PrimeScript (TaKaRa), Ac – AccuScript (Agilent), Af – AffinityScript (Agilent), G – GoScript (Promega), T – ThermoScript (Invitrogen), N – no enzyme. Red rectangle indicates image resizing – given original proportions it is a square (which also applies to Figure 3 and Figure 4).
[Figure 2 image not reproduced: gel panels A, C and D with lanes labeled by enzyme abbreviation (see caption) and, where applicable, C or H; panel B shows the secondary structure of the IGF‐II fragment with nucleotides numbered 1 to 141.]
Crosslinking characterization
Although our assay for finding the hybridization sites depends on reverse transcription termination upon
reaching the crosslinked ASO we hypothesized that the termination can be also induced by hybridized but
not crosslinked ASO, especially highly affine ASOs containing LNA, leading to the risk of observing reverse
transcription terminations on the hybridization sites that were not occupied during crosslinking but were
taken by the ASO in the later steps of the protocol. Therefore, it was crucial to develop a method of
removing non‐covalently bound LNA ASOs before reverse transcription reaction. Such a separation of ASO
from the bound RNA requires (1) dissociation and (2) subsequent physical separation preferably based on
the different molecular sizes. According to a previous report (Pinder et al., 1974), formamide aids RNA duplex
melting and inhibits its reassociation upon cooling, properties that made it a well suited solvent for the
discussed process. We have tested several commercially available gel exclusion columns for size
separation of a formamide dissolved, heat‐denatured RNA‐ASO mix and compared their performance in a
primer extension assay by measuring the ratio of the signal in the region surrounding the ASO binding site to
the signal of the full length product (Figure 3A). For the 8‐mer oligonucleotide (6435), all of the tested columns
gave similar results, reducing the signal observed in the non‐purified sample to approximately
background level, which indicates efficient separation. On the other hand, only the formamide soaked illustra
MicroSpin S‐300 HR column gave comparably good results when separating the 13‐mer oligonucleotide from the
RNA, and we have decided to use it for subsequent experiments.
Next, we wanted to determine how the crosslinking to RNA depends on the position of 4‐thio‐T incorporation
within the ASO. We have performed crosslinking of 8 different ASOs, based on a 13‐mer and an 8‐mer that are
fully complementary to the in vitro transcribed RNA fragment. They were synthesized with 4‐thio‐T located
either internally, at their 5’ or 3’ end or without any crosslinkable group. We then annealed ASOs to the
RNA, crosslinked with long‐range UV and looked for primer extension terminations (Figure 3B). As a
confirmation of previous findings (Dubreuil et al., 1991), the internally incorporated 4‐thio‐T didn’t lead to
efficient crosslinking. Reactive groups incorporated on one of the termini crosslinked well to the RNA,
underscoring the requirement of structural flexibility. Based on the experiment we have chosen 5’
incorporation site of 4‐thio‐T for our future experiments due to better defined reverse transcription
termination sites upstream of the ASO binding site, although we cannot exclude possibility of distal
crosslinks outside of the surveyed region.
Since flexibility is required for the crosslinking, we decided to test if the type of nucleotide that could
potentially base‐pair with 4‐thio‐T impacts the crosslinking efficiency. We have performed an experiment
analogous to the one described above, but checking (1) crosslinking of the same ASO to four RNA targets
that differ by the single nucleotide that can possibly base‐pair with 4‐thio‐T and (2) panel of different ASOs
positioned on the same target in a way that the 4‐thio‐T can base‐pair with different nucleotides (Figure
3C). Gel autoradiography revealed that base pairing of the terminal nucleotide does not exclude the
possibility of crosslinking but that the slight changes in positioning of the ASO on the target can have a big
impact on the crosslinking induced RTTS pattern.
We have also checked the dependence of crosslinking efficiency on irradiation time (Figure 4). Comparison
of signal strength upstream of the ASO binding to the full length product shows dose‐response (with
strongest response at the beginning of irradiation) for the samples with 4‐thio‐T and no effect in the
samples without 4‐thio‐T.
Figure 3. Characterization of crosslinking of 4‐thio‐T containing ASO to RNA by primer extension reaction on ApoB fragment. (A) A comparison of different ways of washing out non‐crosslinked hybridized oligonucleotides (N – no treatment, M1 and M2 – NucAway column (Ambion), S – illustra MicroSpin S‐300 (GE Healthcare), A – AutoSeq G‐50 (GE Healthcare)). Numbers above the lanes indicate the ratio between the signal in the oligonucleotide‐related termination (lower yellow bar) to the signal of the full length product (upper yellow bar). (B) Assessing the impact of the position of 4‐thio‐T incorporation in the oligonucleotide, N‐ no 4‐thio‐T, 5’‐ 4‐thio‐T at the 5’ end, 3’ – at the 3’ end, Int – 4‐thio‐T incorporated internally, 13‐mer – ASO design based on 6434, 8‐mer ‐ ASO design based on 6435 (see Table 1). (C) Impact of the identity of the nucleotide to which 4‐thio‐T can potentially basepair (indicated below the lanes) on crosslinking. In the figures (B) and (C), right panel is the copy of the left panel with the overlaid predicted region of hybridization (blue bar) and 4‐thio‐T location (red circle).
Figure 4. Impact of irradiation time on the crosslinking efficiency. Gel electrophoresis of a primer extension reaction on ApoB RNA fragment irradiated with UV for different amount of time with oligonucleotide with (4‐thio‐1) or without (6434) incorporated 4‐thiothymidine.
Enrichment strategies
Planning to apply our method in a transcriptome‐wide study, we realized that it would be highly
advantageous and economical to enrich for cDNA molecules that terminated at the ASO crosslinking site,
while removing the background signal, including cDNA terminated at the RNA 5’ ends. We have explored
two enrichment strategies: (1) Terminator exonuclease digestion of RNA and (2) use of biotin labeled ASO
with streptavidin binding of full length RNA‐cDNA complexes.
5’ phosphate dependent exonuclease enrichment
Based on the previous report indicating successful use of an exonuclease to detect RNA modifications
(Steen et al., 2010) we hypothesized that the use of 5’ phosphate dependent Terminator exonuclease will
enable us to remove parts of RNA upstream from the crosslinked ASOs (Figure 5A). First, we needed to
show that the exonuclease would actually terminate upon reaching the crosslinked ASO. To test for that,
we have 3’ labeled an in vitro transcribed RNA molecule, modified its 5’ triphosphate to monophosphate to
make it a suitable substrate for the exonuclease, crosslinked with different ASOs, Terminator treated and
electrophoretically resolved on a high resolution polyacrylamide gel (Figure 5B). Results indicate that (1)
not crosslinked ASO doesn’t protect the RNA (lanes labeled “6434”), (2) the ASO crosslinking via group
attached to either 5’ or 3’ end of ASO can terminate the exonucleolytic degradation of RNA (lanes “4‐thio‐
1”, “4‐thio‐2” and “4‐thio‐5”), largely increasing the ratio between crosslinked and the remaining full length
RNA (α/β and δ/γ) while preserving large fraction of the intended target (δ/α) (Figure 5C).
Figure 5. Enrichment of RNA with crosslinked ASOs using 5’ phosphate dependent exonuclease (Terminator). (A) Strategy of enrichment. Mixture of RNA molecules with (1) or without (2) crosslinked ASO is treated with the Terminator exonuclease that fully digests species (2) but terminates on the crosslinked ASO from the species (1) yielding 3’ RNA fragments (3). (B) 3’ end labeled ApoB RNA fragment was crosslinked with various oligonucleotides and digested with the Terminator exonuclease. 5’ppp – sample not treated with 5’polyphosphatase hence bearing 5’ triphosphate which is not a substrate for used exonuclease, 5’p – not crosslinked RNA, 6434, 4‐thio‐1, 4‐thio‐2, 4‐thio‐5 – RNA crosslinked with one of the ASOs (see Table 1 for the sequences), C – no‐Terminator control, T – samples treated with Terminator. (C) Zoom‐in into 4‐thio‐1 crosslinked sample electrophoresis. Markings (1),(2) and (3) indicate suspected molecular species represented by a given band according to the model shown on the panel (A). (D) Ratios between quantified signals from boxes marked by Greek letters on the panel (C).
CAGE‐like selection
We have shown that the use of Terminator digestion strategy can degrade parts of RNA upstream from the
crosslinked ASO, but it doesn’t solve the problem of possible background arising from the spontaneous
cDNA synthesis terminations downstream from the crosslinked ASO. To resolve that issue we have devised
another strategy for enrichment, based on the idea for enrichment of capped molecules as described in the
CAGE protocol (Kodzius et al., 2006), but instead of biotinylating the cap we have planned to use a 3’
biotinylated ASO (Figure 6A). Relying on the very high specificity of CAGE method (Takahashi et al., 2012),
we expected that the large majority of cDNA molecules after the stringent washing will be derived from the
cDNA molecules whose synthesis terminated on the crosslinked, biotinylated ASO. First, we needed to
check if RNase I used in the CAGE study would or wouldn’t cleave RNA between cDNA 3’ terminus and the
crosslinked ASO. In order to check that, we have prepared a body‐labeled RNA molecule with the
crosslinked ASO at its 5’ terminus, which was used as a template for a primer extension reaction with the
primer complementary to its 3’ terminus. We digested such prepared cDNA‐RNA hybrid with different
concentrations of RNase I and resolved the products on the denaturing high‐resolution polyacrylamide gel
(Figure 6B). As expected, RNase I degraded the RNA in the samples without the protective cDNA (“RT ‐“
lanes). Moreover, the bands in the digested samples with the protective cDNA clustered in four groups.
Analysis of the gel led us to conclude that (numbers in the brackets relate to the structures drawn on the
right side of Figure 6B) (1) the longest species is the full length RNA with the crosslinked ASO, which
suggests that the RNA on the cDNA‐ASO border was protected from RNase I cleavage, and (2) the second
longest is the species for which the RNase I cleaved between crosslinked ASO and cDNA, suggesting that
the protection is not fully efficient. Sequence analysis of the RNA molecule revealed that a partially complementary
site lies 30 nt downstream from the fully matched complementary site (Figure 6C); it apparently
bound the ASO in our assay, giving rise to bands (3) and (4), which are analogous to bands (1) and (2). This
suspicion is strengthened by counting nucleotide bands in the degradation ladder in the [“RT ‐“, “RNase I ‐“]
lane which shows that clusters (1) and (3) are separated by 30 nt. We have shown that the RNA on which
cDNA and crosslinked ASO are hybridized is partially protected from RNase I cleavage at the border of cDNA
and ASO and that we can use this property for the CAGE‐like enrichment strategy.
Figure 6. CAGE‐like enrichment with biotinylated oligonucleotide. (A) Strategy of enrichment. The RNA with crosslinked ASO bearing biotin (blue circle) on the 3’ end is used in reverse transcription reaction yielding different products in the mix (red line – cDNA, black line – RNA, purple ‐ ASO) that are used as substrates for RNase I that cleaves RNA not protected by cDNA and finally for the capture of biotin‐conjugated oligonucleotides crosslinked to the RNA hybridized to the full length cDNA (not‐full‐length cDNA is washed away because of the RNase I cleavage between ASO and cDNA). (B) Gel electrophoresis of body‐labeled RNA with the oligonucleotide attached to its 5’ end that was (RT +) or was not (RT ‐) covered with cDNA and was digested with different concentrations of RNase I. Drawings on the right indicate suspected structure of the bands located at the same height. Note that only species (1) and (3) would be selected according to the model shown in (A). (C) Intended (left) and possible secondary (right) binding site of the oligonucleotide used in this experiment. Secondary binding site is responsible for emergence of species (3) and (4) shown on the panel (B).
To show that the biotin‐enrichment strategy can be used to enrich for cDNA molecules terminated before
crosslinked ASO over other species of cDNA molecules, we have performed an experiment with crosslinking
of a biotinylated ASO to an in vitro transcribed RNA molecule. In this experiment we reverse transcribed the
RNA crosslinked with different concentrations of ASOs with or without biotin on their 3’ end, followed by
RNase I treatment and selection on streptavidin beads. To the 3’ end of such selected cDNA an adapter was
ligated, the construct was PCR amplified with one primer matching the cDNA and the other complementary
to the ligated adapter (Figure 7). We have expected to observe two length species – (1) longer products
(expected length 183 bp) derived from cDNA molecules that reached the RNA terminus and (2) shorter
(expected length ~136 bp) that terminated before the crosslinked ASO. We have observed the biotin‐
dependent enrichment of the products of the second species, confirming that the CAGE‐like selection of
cDNA molecules terminated on the crosslinked ASO is feasible.
Figure 7. Demonstration of selection. ApoB RNA fragment was crosslinked with different concentrations of ASO (Oligo/RNA ratio) with (4‐thio‐1‐biotin) or without (4‐thio‐1) biotin, cDNA was synthesized with specific primer and underwent (S) or not (N) the selection on streptavidin beads, followed by linker ligation and PCR. The figure shows the agarose electrophoresis of PCR products. Arrows indicate products derived from the full‐length (1) or stopped at the ASO (2) cDNA.
Massive parallel sequencing based transcriptome‐wide search for ASO binding sites
Encouraged by the demonstration of working selection we decided to construct the transcriptome‐wide
map of ASO binding with the high‐throughput sequencing of selected cDNA molecules. We started by
crosslinking the poly(A) fraction of mouse liver RNA with different concentrations of biotinylated or non‐
biotinylated ApoB‐targeting ASO. Afterwards, we have performed the randomly primed reverse
transcription and the streptavidin based CAGE‐like selection (plus non‐selected controls). We have
expected to keep only cDNA molecules that reached the biotinylated ASO. Those cDNA molecules were
transformed into sequencing libraries and sequenced on the HiSeq 2000 sequencer with the protocol
described in the Paper 1. The obtained reads were mapped and the estimated unique counts (EUC) were
calculated as described in the Paper 2 (Figure 8, Table 2).
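The idea behind barcode‐based counting can be illustrated with a short Python sketch. It is deliberately simplified: it counts the distinct 7 nt barcodes observed at each termination site, whereas the EUC estimator described in Paper 2 may include further corrections that are not reproduced here. The input layout and all function names are hypothetical. The last function writes the counts in BedGraph format (0‐based, half‐open intervals), which is the track type the Methods name for displaying the EUC per position.

```python
from collections import defaultdict

BARCODE_LENGTH = 7  # random barcode length stated in the text


def split_barcode(read_sequence, barcode_length=BARCODE_LENGTH):
    """Split a read into its leading random barcode and the remaining insert."""
    return read_sequence[:barcode_length], read_sequence[barcode_length:]


def unique_barcode_counts(mapped_reads):
    """Count distinct barcodes per termination site.

    `mapped_reads` is an iterable of (chrom, position, strand, barcode)
    tuples, with `position` the 0-based coordinate of the termination site.
    """
    barcodes = defaultdict(set)
    for chrom, position, strand, barcode in mapped_reads:
        barcodes[(chrom, position, strand)].add(barcode)
    return {site: len(seen) for site, seen in barcodes.items()}


def bedgraph_lines(counts, strand="+"):
    """Format the counts of one strand as BedGraph lines."""
    sites = sorted((c, p) for (c, p, s) in counts if s == strand)
    return [f"{c}\t{p}\t{p + 1}\t{counts[(c, p, strand)]}" for c, p in sites]


reads = [
    ("chr12", 100, "+", "ACGTACG"),
    ("chr12", 100, "+", "ACGTACG"),  # PCR duplicate: same site, same barcode
    ("chr12", 100, "+", "TTGCATG"),
    ("chr12", 105, "+", "GGATCCA"),
]
counts = unique_barcode_counts(reads)
print("\n".join(bedgraph_lines(counts)))
```

In the toy input, three reads map to position 100 but only two distinct barcodes are present, so the duplicate is collapsed and the site receives a count of 2.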
Figure 8. Workflow of sequencing library preparation. RNA (purple) is hybridized with the biotin (green) containing ASO (red), crosslinked and used for cDNA synthesis (blue). Following RNase I treatment, the mixture undergoes selection on the streptavidin beads, cDNA is released and adapter bearing 7 nt random barcode is ligated to the cDNA 3’ end. Subsequent PCR introduces sample‐specific index which allows multiplexed Illumina sequencing.
Sequencing and mapping statistics (Table 2) indicate that the number of sequencing reads in the selected
samples increases with more biotinylated ASO used (compare indices 3,5,7), as well as when allowing for
more favorable hybridization conditions (compare indices 5 and 8), a phenomenon that was also observed
with Bioanalyzer quantification of prepared libraries (not shown). Moreover, we have observed that the
selection reduced the number of contaminating reads derived from the adapter ligation to the non‐
extended reverse transcription primer (see “Cutadapt trimmed” column), but significantly increased the
number of observed PCR duplicates (Barcode collapsed/Mappings column), which most likely stems from
the limiting amount of material left after the selection reaction. Since reverse transcription terminates
upon reaching the 5’ end of RNA templates, one would expect to see enrichment of read ends on the 5’ side of
the mRNA molecules, which is indeed observed for non‐selected samples (compare columns ‘5’ UTR’ and
‘3’ UTR’ in Table 2). Selection of the samples with biotinylated ASO equalized the coverage over transcripts,
but for unknown reason the streptavidin selection of non‐biotinylated molecules (representing selection
noise) seems to bias end mappings towards 3’ UTRs. As a first confirmation that the mapped reads are
indeed associated with the studied ASO we have looked at their distribution around the intended target
site located on ApoB transcript (Figure 9A). Visual inspection revealed ASO‐dependent signal just
downstream from the binding site in both the selected and non‐selected samples, but not in the selected
samples crosslinked to the non‐biotinylated ASO. Observed signal confirmed that the sequencing strategy
worked as expected. Inspection of regions located on ApoB transcript further away from the intended
target site revealed many peaks of the height comparable to the height around intended target site (Figure
9B), suggesting that the dataset contains some difficult to interpret additional information. To find if we can
observe the ASO binding signal on the transcriptome‐wide scale, we have extracted sequences (20 nt)
upstream of the position with the highest EUC in each transcript and used them as an input for a motif
discovery software MEME (Bailey and Elkan, 1994) yielding motifs highly similar to the ASO used in the
study (Figure 10A). Additionally, the analysis of changes in the total number of reads mapped to a given
transcript between treated (indices 3, 5, 7) and the control samples (indices 11, 12) with cWords
(Rasmussen et al., 2013) revealed that for the two higher concentrations of ASO used, the most enriched
motif in the genes upregulated by the selection is the motif derived from the used ASO (Figure 10A).
Interestingly, both motif discovery methods, MEME and cWords, identified the 5’ part (proximal to the
crosslinking group) of the ASO as the most significant. In the reversed approach, we have used our
knowledge of the ASO sequence to find all matching sites in the genome (allowing up to 2 mismatches) and
calculated the sum of EUC as a function of distance from the ASO matched locations (Figure 10B). This
analysis confirmed that the signal density increases in the close proximity to the ASO 5’ end both in
selected and non‐selected samples (with reduced background in selected samples) but not in the selected
sample with non‐biotinylated ASO.
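The reversed approach (locating all sites that match the ASO with up to 2 mismatches and summing the signal as a function of distance from them) can be sketched in a few lines. In the study the sites were found by mapping the oligonucleotide to the genome with Bowtie; the naive single‐strand scan below is only a small‐scale stand‐in for illustration. The function names, the toy target sequence and signal values, and the use of the match start as the anchor coordinate are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(target, query, max_mismatches=2):
    """Return 0-based start positions where `query` matches `target`
    with at most `max_mismatches` substitutions (no insertions or deletions)."""
    hits = []
    for start in range(len(target) - len(query) + 1):
        mismatches = 0
        for a, b in zip(target[start:start + len(query)], query):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # loop finished without exceeding the mismatch limit
            hits.append(start)
    return hits


def summed_profile(signal, anchors, flank=50):
    """Sum `signal` (dict position -> value) at offsets -flank..+flank
    around every anchor position."""
    profile = {offset: 0.0 for offset in range(-flank, flank + 1)}
    for anchor in anchors:
        for offset in profile:
            profile[offset] += signal.get(anchor + offset, 0.0)
    return profile


aso = "GCATTGGTATTCA"           # oligonucleotide sequence given in the Methods
site = reverse_complement(aso)  # sequence expected on the target RNA
# Toy target: one perfect site and one site with a single mismatch.
target = "AAAA" + site + "CCCC" + "TGAATACCTATGC" + "GG"
anchors = find_sites(target, site)

toy_signal = {17: 5.0, 34: 2.0}  # made-up counts just past each site
profile = summed_profile(toy_signal, anchors, flank=20)
print(anchors, profile[13])
```

A scan of this kind finds substitutions only. Sites involving a bulged nucleotide would not be detected by it.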
Figure 9. Mapped signal around the intended target site. (A) Genome browser view of the EUC per nucleotide in the vicinity of the ASO intended target site in the mouse ApoB transcript. Region complementary to the used ASO is highlighted with orange bar and the 4‐thio‐T position marked with a green circle. (B) Zoomed‐out view centered on the region shown on the panel (A) (highlighted in green). Samples description: S/NS – selected, non‐selected; T/TB – oligonucleotide with 4‐thio‐T, with 4‐thio‐T and biotin, number of “+” indicates amount of oligonucleotide used. Full samples description in Table 2.
[Figure 9 image not reproduced. Panel A: genome browser view of chr12, coordinates 8,018,350 to 8,018,390; panel B: zoomed‐out view with axis ticks at 8,017,950, 8,018,400 and 8,018,800. Track labels as listed in the figure: 3 (S, TB, +), 5 (S, TB, ++), 7 (S, TB, +++), 8 (S, TB, ++), 9 (NS, T, ++), 10 (NS, TB, ++), 11 (NS, -), 12 (NS, -), 2 (S, T, +), 4 (S, T, ++), 6 (S, T, +++).]
Figure 10. Analysis of sequencing data. (A) Motif recognized by MEME (logo) based on the location of the nucleotide with highest EUC in each transcript and by cWords (red bar under the logo) based on the set of transcripts up‐regulated after selection. Below logos is the sequence of used oligonucleotide (upper case – LNA, lower case – DNA, b – biotin, S – 4‐thio‐T). (B) Genome‐wide sum of EUC for nucleotides separated by +/‐ 50 nt from the 5’ end of ASO matched site (2 mismatches allowed). Samples description in the caption of the Figure 9.
General trends of the data, such as the ability to recover the ASO sequence by motif discovery and to
observe the genome wide signal enrichment around matching sites are encouraging but the aim of this
study is to find the precise locations of alternative binding site of the ASO. First clue of such a site came
from the observation that the fraction of EUC mapping to a mitochondrial genome was much higher in
selected biotin containing samples than in the remaining samples (Table 2). Distribution of EUCs over the
mitochondrial chromosome showed a high peak at position 5607 in the cytochrome c oxidase I (mt‐Co1)
transcript. In silico folding of the region preceding the high peak showed a possible ASO binding site located
approximately 65 nucleotides upstream of it and separated from it by a hairpin structure (Figure 11A). This
observation hints that the detected signal comes from a nucleotide that was close to the
crosslinking group in three dimensional space but not necessarily in the linear sequence. Another
interesting observation regarding novel binding sites is the site in the highly expressed Fabp1 transcript
(Figure 11B). In this case our assay recovered binding that, apart from mismatches, contained a bulged mRNA
nucleotide, which makes computational finding of such sites much more challenging than of sites differing
from the perfect match just by simple mismatches.
Figure 11. Examples of detected ASO binding sites. (A) Predicted ASO binding to mt‐Co1 transcript (nucleotide count in chromosome M coordinates) and (B) to the mouse Fabp1 gene (nucleotide counting in RefSeq transcript coordinates). Red‐filled circles – ASO, yellow circles – 4‐thio‐T. Arrows indicate sites with very high EUC in the selected samples.
Discussion
The current state-of-the-art design of specific antisense sequences for drug discovery relies on a computational
search of a genome or transcript database for similarities, which, to be accomplished in reasonable time,
needs to use simplified folding rules (Tafer and Hofacker, 2008). This approach is practical in the initial
screen of candidate molecules, but it leaves the risk that true off-target binding sites do not fulfill the
algorithm criteria, as exemplified by the site found in Fabp1 (Figure 11B). The most popular strategy for
determining ASO off-target sites is transcriptome profiling with identification of down-regulated
genes that contain a sequence match to the ASO used (Jackson et al., 2003). Such an approach is helpful
in finding safe drug candidates, but it cannot discern true off-target interactions from
secondary effects. Our strategy, on the other hand, is focused solely on the existence of direct interactions
between a transcript and the ASO. It is worth noting that pinpointing an interaction does not imply
function: hybridization may occur without leading to gene regulation.
This suggests that the method is a supplement to, but not necessarily a substitute for, transcriptome
profiling. The analysis of the obtained transcriptome-wide dataset is currently ongoing. We hope that
inclusion of samples folded with different oligonucleotide concentrations will allow for a better
understanding of the thermodynamics of binding to different locations, with strong binders being highly occupied
even at low concentration, while weak targets become bound only after a certain
concentration is reached (as exemplified in Figure 10A). Moreover, comparison of samples differing only
in the folding protocol used should let us better understand the impact of preexisting RNA structures on
interactions with introduced ASOs. Furthermore, the results will show the possible RNA‐ASO hybridization
modes which can be used for improving the existing off‐target finding algorithms. We believe that the
presented method is suitable for wide adoption in the antisense drug discovery community and will further
our understanding of interactions between RNA molecules and oligonucleotides.
Acknowledgments
We thank our collaboration partners Morten Lindow and Peter Hagedorn from Santaris Pharma A/S for
insightful discussions and for supplying the modified oligonucleotides used in this study.
References
Allawi, H.T., Dong, F., Ip, H.S., Neri, B.P., and Lyamichev, V.I. (2001). Mapping of RNA accessible sites by extension of random oligonucleotide libraries with reverse transcriptase. RNA 7, 314-327.
Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28‐36.
Christiansen, J., Kofod, M., and Nielsen, F.C. (1994). A guanosine quadruplex and two stable hairpins flank a major cleavage site in insulin‐like growth factor II mRNA. Nucleic Acids Res 22, 5709‐5716.
Darty, K., Denise, A., and Ponty, Y. (2009). VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25, 1974‐1975.
Dubreuil, Y.L., Expert‐Bezancon, A., and Favre, A. (1991). Conformation and structural fluctuations of a 218 nucleotides long rRNA fragment: 4‐thiouridine as an intrinsic photolabelling probe. Nucleic Acids Res 19, 3653‐3660.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.
Harrison, G.P., Mayo, M.S., Hunter, E., and Lever, A.M. (1998). Pausing of reverse transcriptase on retroviral RNA templates is influenced by secondary structures both 5' and 3' of the catalytic site. Nucleic Acids Res 26, 3433‐3442.
Jackson, A.L., Bartz, S.R., Schelter, J., Kobayashi, S.V., Burchard, J., Mao, M., Li, B., Cavet, G., and Linsley, P.S. (2003). Expression profiling reveals off‐target gene regulation by RNAi. Nat Biotechnol 21, 635‐637.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996‐1006.
Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol 1038, 213‐231.
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36.
Kjems, J., Egebjerg, J., and Christiansen, J. (1998). Analysis of RNA‐protein complexes in vitro (Amsterdam ; New York, Elsevier).
Koch, T., Rosenbohm, C., Hansen, H.F., Hansen, B., Marie Straarup, E., and Kauppinen, S. (2008). Chapter 5 Locked Nucleic Acid: Properties and Therapeutic Aspects. In Therapeutic Oligonucleotides (The Royal Society of Chemistry), pp. 103‐141.
Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M., Fukuda, S., Tagami, M., Sasaki, D., Imamura, K., Kai, C., Harbers, M., et al. (2006). CAGE: cap analysis of gene expression. Nat Methods 3, 211‐222.
Kole, R., Krainer, A.R., and Altman, S. (2012). RNA therapeutics: beyond RNA interference and antisense oligonucleotides. Nat Rev Drug Discov 11, 125‐140.
Kulpa, D., Topping, R., and Telesnitsky, A. (1997). Determination of the site of first strand transfer during Moloney murine leukemia virus reverse transcription and identification of strand transfer‐associated reverse transcriptase errors. EMBO J 16, 856‐865.
Lanford, R.E., Hildebrandt‐Eriksen, E.S., Petri, A., Persson, R., Lindow, M., Munk, M.E., Kauppinen, S., and Orum, H. (2009). Therapeutic Silencing of MicroRNA‐122 in Primates with Chronic Hepatitis C Virus Infection. Science 327, 198‐201.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25.
Lindow, M., Vornlocher, H.-P., Riley, D., Kornbrust, D.J., Burchard, J., Whiteley, L.O., Kamens, J., Thompson, J.D., Nochur, S., Younis, H., et al. (2012). Assessing unintended hybridization-induced biological effects of oligonucleotides. Nat Biotechnol 30, 920-923.
Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17, 10-12.
Meisenheimer, K.M., and Koch, T.H. (1997). Photocross‐linking of nucleic acids to associated proteins. Crit Rev Biochem Mol Biol 32, 101‐140.
Obad, S., dos Santos, C.O., Petri, A., Heidenblad, M., Broom, O., Ruse, C., Fu, C., Lindow, M., Stenvang, J., Straarup, E.M., et al. (2011). Silencing of microRNA families by seed-targeting tiny LNAs. Nat Genet 43, 371-378.
Olejniczak, M., Galka, P., and Krzyzosiak, W.J. (2010). Sequence‐non‐specific effects of RNA interference triggers and microRNA regulators. Nucleic Acids Res 38, 1‐16.
Pinder, J.C., Staynov, D.Z., and Gratzer, W.B. (1974). Properties of RNA in formamide. Biochemistry 13, 5367‐5373.
Rasmussen, S.H., Jacobsen, A., and Krogh, A. (2013). cWords ‐ systematic microRNA regulatory motif discovery from mRNA expression data. Silence 4, 2.
Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129.
Schneider, C.A., Rasband, W.S., and Eliceiri, K.W. (2012). NIH Image to ImageJ: 25 years of image analysis. Nat Methods 9, 671‐675.
Sontheimer, E.J. (1994). Site‐specific RNA crosslinking with 4‐thiouridine. Mol Biol Rep 20, 35‐44.
Steen, K.A., Malhotra, A., and Weeks, K.M. (2010). Selective 2'‐hydroxyl acylation analyzed by protection from exoribonuclease. J Am Chem Soc 132, 9940‐9943.
Stenvang, J., Petri, A., Lindow, M., Obad, S., and Kauppinen, S. (2012). Inhibition of microRNA function by antimiR oligonucleotides. Silence 3, 1.
Straarup, E.M., Fisker, N., Hedtjarn, M., Lindholm, M.W., Rosenbohm, C., Aarup, V., Hansen, H.F., Orum, H., Hansen, J.B.R., and Koch, T. (2010). Short locked nucleic acid antisense oligonucleotides potently reduce apolipoprotein B mRNA and serum cholesterol in mice and non-human primates. Nucleic Acids Res 38, 7100-7111.
Tafer, H., and Hofacker, I.L. (2008). RNAplex: a fast tool for RNA‐RNA interaction search. Bioinformatics 24, 2657‐2663.
Takahashi, H., Kato, S., Murata, M., and Carninci, P. (2012). CAGE (Cap Analysis of Gene Expression): A Protocol for the Detection of Promoter and Transcriptional Networks. In Gene Regulatory Networks, B. Deplancke, and N. Gheldof, eds. (Totowa, NJ, Humana Press), pp. 181‐200.
Tables
Table 1. Oligonucleotides used in the study
Name | Sequence | Remarks
ApoBrev | TGCTCAGAGACAGAGCTGTG | DNA
ApoBfor+T7 | CAGAGATGCATAATACGACTCACTATAGGGAGATTCTCCTTTAAATCAAGTGTCATCA | DNA
ApoB-PE | GATGAGCAACAATATCTGACTGG | DNA
IGF2_PE.h-p | TCCAACCGCCAGACTTCCCAC | DNA
ApoB-str.dis5' | GTGTGATGACACTTGATTTAAAGGAGAATCTCCC | DNA
6434 | GCatTgGtatTCA | see note
6435 | ATTGGTAT | see note
4-thio-1 | SGCatTgGtatTCA | see note
4-thio-1-biotin | SGCatTgGtatTCA-biotin | see note
4-thio-2 | GCatTgGtatTCAS | see note
4-thio-3 | GCatTgGSatTCA | see note
4-thio-4 | SATTGGTAT | see note
4-thio-5 | ATTGGTATS | see note
4-thio-6 | ATTGGSAT | see note
4-thio-7 | SGCattggtatTCA | see note
ApoB-rev-A | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCTGCATTGG | DNA
ApoB-rev-C | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCGGCATTGG | DNA
ApoB-rev-G | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCCGCATTGG | DNA
ApoB-rev-T | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCAGCATTGG | DNA
LIG_DNA | PHO-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3 | DNA, modifications
LIG_DNArandBARC | PHO-NNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3 | DNA, modifications
LIG_PCR | ACACTCTTTCCCTACACGACGCT | DNA
RT_15xN | AGACGTGTGCTCTTCCGATCTNNNNNNNNNNNNNNN | DNA
Note (oligonucleotides 6434 to 4-thio-7): Upper case – LNA, lower case – DNA; “S” indicates 4-thio-T. Oligonucleotides synthesized by Santaris Pharma A/S. ASOs 4-thio-1 to 4-thio-3 are analogs of the gapmer described in (Straarup et al., 2010) lacking the long stretch of consecutive DNA nucleotides in the middle. This will enable us to apply our assay in an in vivo setting, since RNase H-directed degradation of the targets would otherwise occur.
Table 2. Sequenced samples and mapping statistics
Index | Selection | Oligo | Amount of oligo | No of reads | Cutadapt discarded (too short) | Mappings/Reads | (Barcode collapsed)/Mappings | chrM | 5' UTR | 3' UTR
1 | T | - | 0 | 7,558,169 | 1% | 41% | 2% | 8% | 1.05 | 2.96
2 | T | 4-thio-1 | 0.02 | 7,371,391 | 2% | 38% | 1% | 8% | 0.61 | 3.59
3 | T | 4-thio-1-biotin | 0.02 | 3,823,000 | 5% | 43% | 2% | 35% | 1.06 | 1.41
4 | T | 4-thio-1 | 0.2 | 5,704,255 | 2% | 41% | 2% | 8% | 1.32 | 2.99
5 | T | 4-thio-1-biotin | 0.2 | 20,405,141 | 1% | 51% | 2% | 41% | 0.97 | 1.58
6 | T | 4-thio-1 | 2 | 8,347,640 | 2% | 40% | 2% | 8% | 0.71 | 3.52
7 | T | 4-thio-1-biotin | 2 | 35,482,366 | 1% | 52% | 2% | 49% | 0.46 | 0.79
8 | T | 4-thio-1-biotin (added after folding) | 0.2 | 12,938,259 | 1% | 49% | 1% | 39% | 0.69 | 1.83
9 | F | 4-thio-1 | 0.2 | 13,126,453 | 31% | 40% | 58% | 10% | 3.71 | 0.38
10 | F | 4-thio-1-biotin | 0.2 | 15,299,916 | 26% | 43% | 57% | 11% | 3.86 | 0.39
11 | F | - (+UV) | 0 | 9,596,743 | 18% | 46% | 73% | 11% | 4.39 | 0.41
12 | F | - (-UV) | 0 | 10,898,260 | 13% | 51% | 70% | 10% | 4.61 | 0.39
Table 3. Abbreviations used in this chapter
4-thio-T | 4-thiothymidine
4-thio-U | 4-thiouridine
LNA | Locked Nucleic Acid
ASO | Antisense Oligonucleotide
RTTS | Reverse Transcription Termination Site
11.4 Paper 4: The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)
Abstract
3’ untranslated regions of mRNA molecules are multifunctional platforms for posttranscriptional gene
expression regulation. Some of the regulatory mechanisms depend on the RNA folding into a specific
secondary structure. Moreover, functional structural elements are often evolutionarily conserved. In
order to investigate those structures we have set up a massive parallel sequencing method based on
the detection of cleavage sites of structure-specific nucleases (P1 and V1), combined with a novel
normalization scheme, which was applied to human, dog and mouse liver transcripts. Our results are
highly reproducible and largely agree with known, conserved structures of selenocysteine insertion
sequences. Additionally, applying an extra step of enzymatic polyadenylation before probing allows
data to be obtained for 3’ regions of other RNA classes, exemplified by the clear structural signal seen for
the U1 spliceosomal RNA. The presented results validate the applicability of the method for
transcriptome-wide structure probing of 3’ regions and are a starting point for the ongoing search for functional
structures.
Introduction
Messenger RNA (mRNA) molecules can be functionally divided into the 5’ cap, 5’ untranslated region (5’
UTR), coding region, 3’ untranslated region (3’ UTR) and poly(A) tail. The 3’ UTR is bordered on its 5’ end by
the stop codon and on its 3’ end by the first adenosine of the poly(A) tail. Analysis of the RefSeq annotation
(Pruitt et al., 2009) shows that the median length of 3’ UTRs for mouse and human is 787 and 866 nt,
respectively. Interestingly, 3’ UTRs have lower GC content than 5’ UTRs or coding regions in vertebrates
(Zhang et al., 2004) but are nevertheless highly structured (Wan et al., 2014) and have the strongest
enrichment for putative functional structures (Washietl et al., 2007). 3’ UTRs are rarely spliced,
which is an adaptation to regulation by nonsense-mediated decay (Mignone et al., 2002), but their length
may vary due to the use of alternative polyadenylation signals (Mayr and Bartel, 2009).
Since 3’ UTRs are not required to code for functional proteins (except for selenocysteine (Seeher et al.,
2012)), they form a flexible platform for the emergence of regulatory features in the course of evolution.
The encoded regulatory elements can affect protein expression via changes in mRNA stability or
translatability, or can affect the cellular localization of the RNA.
One of the modes of posttranscriptional gene expression regulation is microRNA (miRNA) mediated
gene silencing (Bartel, 2009), which is especially efficient if the target site is localized within 3’ UTR, thus
avoiding interactions with ribosomes (Grimson et al., 2007). To understand the mRNA‐miRNA
interactions it is crucial to investigate the secondary structure of 3’UTRs, as the target accessibility for
base‐pairing is an important determinant of silencing efficiency (Kertesz et al., 2007; Wan et al., 2014).
In a study analyzing the proteome occupancy on mRNA molecules it was shown that vast portions of
3’UTRs are protein interactors (Baltz et al., 2012). The binding proteins modulate mRNA stability (Ray et
al., 2013), translational efficiency (Morita et al., 2012) or mRNA localization (Jambhekar and Derisi,
2007). Although some of the protein‐RNA interactions are based on the primary sequence, for others it
is the RNA structure that is the determinant (Lunde et al., 2007). An especially interesting regulatory switch
has been observed in the 3’UTR of the p27 gene, where binding of the RNA-binding protein PUM1 modulates
the RNA structure, allowing specific miRNAs to access the target site and downregulate p27
expression (Kedde et al., 2010), effectively creating an AND logic gate.
In recent years several attempts at RNA secondary structure prediction on the transcriptome-wide
scale, employing different approaches, have been published. Computational prediction strategies included
finding minimum free energy structures with local folding (Hofacker et al., 2004), utilizing evolutionary
conservation of the structure (Pedersen et al., 2006) or conservation coupled with thermodynamic
stability (Washietl et al., 2005). The advent of high-throughput sequencing allowed probing of complex mixtures
of RNA molecules, first in vitro (Kertesz et al., 2010; Underwood et al., 2010) and recently in vivo (Ding et
al., 2013; Rouskin et al., 2013).
Here we present an approach for probing the secondary structure of 3’ regions of in vitro folded liver
mRNA molecules in three species: mouse, dog and human. We believe that the combination of
enzymatic probing of RNA with an analysis of conservation will allow functional
structures located in the 3’ UTRs to be found.
Materials
Input RNA
mouse liver total RNA (Zyagen),
dog liver total RNA (Zyagen),
human liver total RNA (Ambion, AM7960),
ERCC RNA Spike‐In Mix (Ambion, 4456740),
in vitro transcribed spike‐in structured RNA mix (using equal weights of different RNA molecules
as determined by UV spectrophotometry; molecules synthesized by Line Dahl Poulsen;
sequences in Table 1)
Kits and reagents
Ribo-Zero™ Magnetic Kit (Human/Mouse/Rat) (Epicentre)
Poly(A)Purist™ MAG Kit (Ambion)
Agencourt RNAClean XP (Beckman Coulter)
Ampure XP (Beckman Coulter)
Poly(A) Polymerase, Yeast 600 U/µl and 5x buffer (USB)
Calf Intestinal Alkaline Phosphatase (CIAP) 20 U/µl (USB)
NEBuffer 3 (New England Biolabs)
Nuclease Stop Buffer (NSB) (380 mM NaOAc pH 5.2, 10 mM EDTA)
5x fragmentation buffer (250 mM Tris‐HCl pH 8, 25 mM MgCl2)
T4 Polynucleotide Kinase (T4 PNK) and buffer (New England Biolabs)
P1 nuclease dilution buffer (50% glycerol, 50 mM Tris‐HCl pH 7.5, 100 µM Zn(OAc)2)
V1 nuclease dilution buffer (50% glycerol, 10 mM Tris‐HCl pH 7.5, 200 mM KCl)
5x P1 buffer (250 mM Tris‐HCl pH 7.5, 750 mM NaCl, 25 mM MgCl2, 50 µM Zn(OAc)2)
5x V1 buffer (250 mM Tris‐HCl pH 7.5, 750 mM NaCl, 25 mM MgCl2)
T4 RNA Ligase 10 U/µl (Fermentas)
5x T4 RNA Ligase Buffer (250 mM Tris‐HCl pH 7.6, 50 mM MgCl2, 50 mM DTT, 5 mM ATP) – as
Fermentas T4 RNA Ligase buffer.
BSA 10 mg/ml (New England Biolabs)
PrimeScript enzyme and 5x buffer (Takara)
Phusion® High‐Fidelity DNA Polymerase and 5x HF Buffer (New England Biolabs)
100 bp DNA Ladder (New England Biolabs)
E‐Gel® SizeSelect™ 2% Gel (Invitrogen)
DNA purification spin columns (Zymo research)
QIAquick PCR Purification Kit (Qiagen)
High Sensitivity DNA Kit (Agilent)
DNA 1000 Kit (Agilent)
Oligonucleotides listed in Table 2
Methods
Probing reagents calibration
50 µl of solutions containing 0.5 ng/µl of fhlA220 and 0.5 ng/µl of Spot42 RNA fragments in 1x P1 or 1x
V1 buffers were allowed to fold (55°C, 5 min; 37°C, 10 min) and were supplemented with 0.5 µl of the
appropriate enzyme dilution, incubated at 37°C for 30 min followed by addition of 150 µl of the
nuclease stop buffer, phenol‐chloroform extraction (with double volume of phenol) and ethanol
precipitation. For comparison, the RNA was fragmented with 1x fragmentation buffer at 95°C for 90 sec
or 10 min or incubated in 10 mM Tris‐HCl on ice, followed by addition of 150 µl of the nuclease stop
buffer and ethanol precipitation. The pellets were dissolved in 7 µl 50 mM Tris‐HCl pH 7 and analyzed on
a Bioanalyzer RNA Pico chip.
Sequencing library preparation
Sequencing libraries were prepared in two rounds with slight differences. In the first round (prep. 1)
mouse liver poly(A) and dog liver poly(A) fractions were probed, in the second round (prep. 2) mouse
liver ribosome depleted and human liver poly(A) fractions were probed.
Poly(A) enrichment of mouse, dog and human liver total RNA was performed with Poly(A) Purist MAG
kit according to the manufacturer’s recommendations.
Depletion of ribosomal RNA from mouse liver total RNA was performed using RiboZero MAG kit
according to the manufacturer’s recommendations.
RNAClean XP and Ampure XP purifications were performed as described in (Kielpinski et al., 2013)
unless stated otherwise.
Phenol‐chloroform extractions were performed by addition of 1 volume of phenol pH 8 to 1 volume of
extracted liquid, vigorous shaking, transfer of aqueous phase to a new tube, addition of 1 volume of
chloroform, shaking and transfer of aqueous phase to a new tube.
Ethanol precipitations were performed by addition of 2.5 volume of absolute ethanol (ice cold) to
1 volume of salt‐containing nucleic acid solution, incubation at ‐20°C overnight or at ‐80°C for 30 min,
centrifugation at 14000g for 30 min, removing the supernatant, washing the pellet in 1 ml of 70%
ethanol, short centrifugation at 14000g, removing the supernatant, air drying until no visible liquid is left
and dissolving in H2O.
Polyadenylation of 600 ng of structured spike‐in RNA molecules and of two batches of 432 ng of
ribosome depleted mouse liver RNA was performed in the presence of 0.5 mM ATP, 1x polyadenylation
buffer and 24 U/µl enzyme in a volume of 23.24 µl for 20 min at 37°C, followed by RNAClean XP
purification with 15 minutes incubation and elution in 15 µl H2O, which resulted in concentrations of
35.1 ng/µl and 36.2 ng/µl (mouse liver ribozero fraction) and 41 ng/µl (structured spike-ins).
RNA dephosphorylation was performed with 1 U/µl of CIAP enzyme in 1x NEBuffer 3. Reactions
contained 660 ng RNA in total volume of 20 µl (prep. 1) or 770 ng RNA supplemented with 7.7 ng
polyadenylated structured spike‐ins and 7.7 µl of 10x diluted ERCC spike‐ins in total volume of 44 µl
(prep. 2). The reactions were incubated 50 min at 37°C followed by 10 min at 50°C followed by addition
of the NSB buffer (265 µl for 20 µl reactions, 283 µl for 44 µl reactions) and 5 mg/ml glycogen (12 µl and
14 µl), phenol‐chloroform extraction, collection of supernatant (250 µl and 292 µl), ethanol precipitation
and dissolving pellet in H2O (55 µl and 65.9 µl).
Structure probing was performed in PCR tubes in 6 different conditions using 10 µl of dephosphorylated
RNA for each. Different conditions are named throughout the chapter as: no treatment (NONE),
magnesium fragmentation, P1, P1/5, V1 and V1/5 (the V1/5 was employed only in prep. 2) and were
applied according to:
NONE – addition of 90 µl H2O and incubation on ice
Magnesium fragmentation – addition of 2 µl H2O and 3 µl 5x fragmentation buffer, 90 sec incubation
at 95°C, transfer on ice, addition of 2 µl 10x T4 PNK buffer, 2 µl of 10 mM ATP, 1 µl T4 PNK enzyme
and incubation 30 min at 37°C followed by adding 80 µl NSB. (Degradation of total RNA with the
ion concentrations given above in 50% formamide at 95°C resulted in a cleavage probability of
0.001/bond/minute.)
P1, V1, P1/5 and V1/5 probings: addition of 70 µl H2O and 20 µl of 5x P1 or 5x V1 buffer followed by
5 min incubation at 55°C and 10 min at 37°C. While holding the tubes in the thermocycler, 1 µl of a
respective enzyme diluted with its respective dilution buffer was added (5 ng/µl of P1 for P1
probing, 1 ng/µl of P1 for P1/5 probing, 0.01 U/µl of V1 for V1 probing and 0.002 U/µl for V1/5
probing) followed by 30 min incubation at 37°C, transfer of the reaction to 300 µl of ice cold NSB
and immediate phenol‐chloroform extraction with the extra volume of organic solvent (800 µl),
ethanol precipitation and solubilizing pellet in 4 µl H2O.
Linker ligation was performed by adding 1 µl of 100 µM phosphoseqADAPT oligonucleotide to 4 µl of
probed RNA, heat denaturation (65°C for 5 min followed by transfer on ice) (Addo‐Quaye et al., 2008)
and adding master mix to final volume of 20 µl and concentrations of 1x T4 RNA Ligase buffer, 0.1 mg/ml
BSA, 10% DMSO, 0.5 U/µl T4 RNA Ligase, incubation at 37°C for 2 hours, purification with RNAClean XP
with a final elution in 20 µl H2O.
Reverse transcription was performed by mixing 9.5 µl of linker‐ligated RNA with 0.5 µl 10 µM
Adapter_oligo_dT oligonucleotide, incubation at 65°C for 5 minutes followed by transfer on ice, adding
9 µl of master mix composed of 4 µl 5x PrimeScript buffer, 4 µl of 2.5 mM dNTP and 1 µl H2O, incubation
at 42°C for 5 minutes, addition of 1 µl of PrimeScript enzyme and incubation at 42°C for 60 min followed
by 15 min at 72°C and purification with RNAClean XP beads with final elution in 20 µl H2O.
PCR amplification of cDNA was performed on 9.5 µl of purified cDNA in the total volume of 20 µl with
the concentrations of 0.6 µM multi1_short oligonucleotide, 0.5 µM respective INDEX#_long
oligonucleotide (see Table 3 for sample‐primer pairs), 1x Phusion HF buffer, 0.2 mM dNTP and 0.04 U/µl
Phusion polymerase. Thermocycling conditions were as follows: (3 min, 98°C)x1, (80 sec, 98°C; 15 sec,
64°C; 30 sec, 72°C)x4, (80 sec, 98°C; 45 sec, 72°C)x[15 for prep. 1; 18 for prep. 2], (5 min, 72°C)x1. PCR
reactions were assessed with 2% agarose electrophoresis.
Post‐PCR treatment of the prep. 1 started with adding 10 µl from each PCR reaction to one tube
containing 20 µl 50 mM EDTA followed by Ampure XP purification with 5 minutes incubation with beads
and elution in 50 µl 10 mM Tris‐HCl pH 8 which contained 53.8 ng/µl DNA (Nanodrop). Post‐PCR
treatment of the prep. 2 started with quantification of the amount of DNA in the 200‐600 bp range
(Bioanalyzer DNA 1000 Kit) and mixing the equimolar amount from each reaction (1/5 for NONE
treatment; total volume 65.7 µl) to one tube containing 13.14 µl 50 mM EDTA and subsequent
purification with Ampure XP beads.
Size selection of the combined sequencing libraries was performed on a 2% SizeSelect gel using 100 bp
DNA Ladder and collecting the molecules in the size range of 200‐500 bp (prep. 1) or 200‐600 bp (prep.
2) with collection and buffer replacement in the lower well every 20 sec. Collected DNA in the running
buffer was bound to DNA purification columns (Zymo Research for prep. 1, QIAGEN for prep. 2) and
eluted in 30 µl of 10 mM Tris‐HCl pH 8 (prep. 1) or pH 8.5 (prep. 2).
The size distribution of libraries was checked with a Bioanalyzer High Sensitivity (prep. 1) or DNA 1000
(prep. 2) kits and the libraries were sent for single read sequencing on Illumina HiSeq platform
multiplexed with another sample from the laboratory (Jakob Rukov’s sample containing a low
complexity library).
The library size distribution for the V1 treatment differed markedly between the two rounds (prep. 1 vs. prep. 2), with the
second-round sample being much more digested. The probable cause is storage of the diluted enzyme before
probing in the first round, which is known to decrease the enzyme activity (Lowman and Draper, 1986),
but not before the second probing (speculation – the time of dilution preparation was not noted). For the
analysis, only the V1 treatment from the first round and the V1/5 treatment from the second round were used, and
both are referred to as V1 treatments.
Data analysis
Reads processing
Initial processing
Sequencing results were obtained as demultiplexed FASTQ files (Cock et al., 2010). The first two
nucleotides were trimmed from the sequences and quality scores using an awk script in order to remove the
random nucleotides introduced during ligation. Reads matching the multiplexed low complexity library
were filtered out.
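The trimming step can be sketched as follows. The awk script used in this work is not reproduced here; the Python function below is an illustrative equivalent, and its name `trim_fastq_records` and the in-memory handling of records are assumptions made for the example.

```python
def trim_fastq_records(lines, n_trim=2):
    """Remove the first n_trim bases and quality scores from each FASTQ record.

    `lines` is a list of strings holding complete 4-line FASTQ records
    (header, sequence, '+' line, quality string).
    """
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    trimmed = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Sequence and quality strings must stay the same length.
        trimmed.extend([header, seq[n_trim:], plus, qual[n_trim:]])
    return trimmed


example = ["@read1", "NNACGTACGT", "+", "##IIIIIIII"]
print("\n".join(trim_fastq_records(example)))
```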
Reference sequences for mapping
Reads were mapped to the ENSEMBL transcript sequences associated with genome assemblies
canFam3 (dog), hg19 (human) and mm9 (mouse), combined with the spike-in sequences and, only for
dog mapping, with the NM_001115118.1 transcript (the Sepp1 transcript, for which the ENSEMBL annotation
does not agree with our observations).
Mapping the reverse transcription priming sites
For each RNA sample we used the magnesium fragmentation treatment to find the reverse
transcription priming sites. Using the Cutadapt utility (Martin, 2011), reads having at least a 12-nucleotide
match (allowing a 10% error rate) at their right end to the 5’ end of the Adapter_oligo_dT were retained,
trimmed with an awk script of any remaining “A” nucleotides at the right end, and mapped to the
transcripts using Bowtie (Langmead et al., 2009) with options “-a --norc --best --strata -S”. An awk script then
counted and reported the number of primary alignment ends (understood as the first non-A nucleotide from the left
side of the read) mapping to each transcript location. For each gene,
the one transcript isoform with the highest count of mapped priming sites was retained in the index for
subsequent mappings.
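The poly(A) trimming and the counting of priming sites can be sketched as below. This is a minimal illustration and not the awk scripts used in this work: the function names, the simplified alignment tuples (transcript identifier, 1-based start, aligned length) and the choice of the alignment's last aligned base as the priming position are assumptions.

```python
from collections import Counter


def strip_trailing_a(seq, qual):
    """Remove any run of 'A' at the right end of a read, keeping qualities in step."""
    trimmed = seq.rstrip("A")
    return trimmed, qual[:len(trimmed)]


def count_priming_sites(alignments):
    """Count alignment ends per transcript position.

    `alignments` is an iterable of (transcript_id, start_1based, aligned_length)
    for forward-strand alignments; the end of the alignment is taken as the
    priming position (an assumption of this sketch).
    """
    counts = Counter()
    for transcript_id, start, length in alignments:
        end = start + length - 1
        counts[(transcript_id, end)] += 1
    return counts


print(strip_trailing_a("ACGTAAAA", "IIIIJJJJ"))
print(count_priming_sites([("t1", 1, 4), ("t1", 2, 3), ("t2", 5, 1)]))
```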
Mapping cleavage sites
For each RNA sample and each treatment, the RT adapter sequence was trimmed from the reads using
Cutadapt with options “-a AAAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCT -m 25”. After that, the
magnesium fragmentation sample reads were mapped with Bowtie (options “--norc -S -a --best --strata
--chunkmbs 512”) and the count of mapped read ends at each transcript position was summed and
reported. If this mapping yielded multiple transcripts per gene, only the transcript with
the highest count was retained (in the event of equal scores the longest isoform of the gene was kept).
The retained transcript sequences were used to construct a new Bowtie index, and all the treatments from a
given RNA sample were mapped to this index with options “--norc -S -a”; the count of
mapped read ends at each transcript position was summed and reported.
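The isoform selection rule described above (highest count, ties resolved in favor of the longest isoform) can be sketched as follows. The function name `select_isoforms` and the tuple layout of the input are hypothetical and chosen only for this illustration.

```python
def select_isoforms(transcripts):
    """Pick one transcript per gene.

    `transcripts` is an iterable of
    (gene_id, transcript_id, mapped_read_end_count, transcript_length).
    The transcript with the highest count is kept; if counts are equal,
    the longest transcript wins.
    """
    best = {}
    for gene_id, transcript_id, count, length in transcripts:
        key = (count, length)  # tuples compare by count first, then by length
        if gene_id not in best or key > best[gene_id][0]:
            best[gene_id] = (key, transcript_id)
    return {gene_id: transcript_id for gene_id, (_, transcript_id) in best.items()}


records = [
    ("geneA", "tA1", 120, 1500),
    ("geneA", "tA2", 120, 2100),   # same count as tA1, but longer
    ("geneA", "tA3", 40, 3000),
    ("geneB", "tB1", 7, 900),
]
print(select_isoforms(records))
```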
Normalization
Size-selection correctors estimation
To start the normalization procedure, the average distribution of structure-related read end counts relative to
the priming positions is defined based on a selected, “clean” set of priming sites. To define the “clean”
set, read in the priming positions from the magnesium-treated sample and then, for each priming position,
define the region RU spanning from 600 to 25 nucleotides upstream of the priming site and the region RD, from
25 to 600 nucleotides downstream. To avoid the impact of interfering priming sites, keep only those
priming sites for which the sum of priming counts in the RD and RU regions is lower than 1/100 of the
count at the priming site under consideration. Next, discard the priming sites that lie closer than 600 nucleotides
to the transcript 5’ end. Define clusters of the remaining priming sites that are located between each
other’s RD and RU regions and calculate the number of mapped magnesium fragmentation cleavage counts
in the merged RU region per priming count. Calculate the median of the ratios (excluding zeros) and discard all
the clusters that give rise to fewer cleavage sites than expected from the median ratio.
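The first two filtering steps can be sketched as below for a single transcript. This is a simplified illustration under stated assumptions: the function name `clean_priming_sites` is hypothetical, positions are 1-based, the priming counts in RU and RD are summed together before comparison with the 1/100 threshold, and the clustering and median-ratio filtering are not included.

```python
def clean_priming_sites(priming_counts, near=25, far=600, max_fraction=0.01):
    """Return priming positions that pass the interference and 5'-end filters.

    `priming_counts` maps a 1-based transcript position to its priming read
    count. A site is kept when the summed priming counts located 25 to 600 nt
    upstream (RU) or downstream (RD) are below 1/100 of its own count, and
    when it lies at least `far` nucleotides from the transcript 5' end.
    """
    clean = []
    for pos, count in sorted(priming_counts.items()):
        if pos - 1 < far:
            continue  # too close to the transcript 5' end
        interfering = sum(
            other_count
            for other, other_count in priming_counts.items()
            if near <= abs(other - pos) <= far
        )
        if interfering < max_fraction * count:
            clean.append(pos)
    return clean


counts = {300: 1000, 1000: 500, 1010: 300, 1200: 2, 2000: 50}
print(clean_priming_sites(counts))
```

In this toy input the sites at 1000 and 1010 lie within 25 nt of each other, so they do not interfere with one another and would later form one cluster.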
To create the average distribution of cleavage sites relative to the priming sites for each probing condition,
proceed as follows. For each priming cluster (taken from the set of clean priming sites as defined above), split
the cleavage mappings among the cluster members in proportion to the priming site counts. At this
point the count of cleavage ends at a given distance is known for each priming site. Then, for each
priming site, divide all the counts by the square root of the sum of the counts (to prevent highly expressed
sites from dominating the final distribution) and add the divided cleavage mapping scores to the table
holding the sums of counts at a given distance from the priming site over all clean priming sites. Additionally,
an average distribution was calculated from the frequency of read lengths after trimming the poly(A)
end or the reverse transcription primer adapter. The final average distribution was joined from the
distribution estimated from the trimmed read length distribution (up to nucleotide 65) and from the
cluster-based distribution for positions further than 65 nt from the priming site (Figure 4A).
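The square-root weighting can be sketched as follows. The function name `average_distribution` and the dictionary-based input are assumptions made for the illustration; the splitting of counts within clusters and the joining with the read-length-based distribution are not shown.

```python
import math


def average_distribution(per_site_counts):
    """Accumulate square-root-scaled cleavage counts over priming sites.

    `per_site_counts` is a list with one dictionary per clean priming site,
    mapping distance from the priming site to the cleavage count at that
    distance. Each site's counts are divided by the square root of the
    site's total, so a site with 16 counts contributes a total weight of 4
    rather than 16.
    """
    total = {}
    for counts in per_site_counts:
        site_sum = sum(counts.values())
        if site_sum == 0:
            continue  # nothing to contribute
        scale = math.sqrt(site_sum)
        for distance, value in counts.items():
            total[distance] = total.get(distance, 0.0) + value / scale
    return total


print(average_distribution([{30: 4, 40: 12}, {30: 1}]))
```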
Next, fit an exponential decay to each sample’s average distribution (for all treatments) in the region
from 120 to 300 nt from the priming site. Then, for the magnesium fragmentation sample,
extrapolate the exponential decay curve to all points in the average distribution (positions 25 to 600,
Figure 4A), divide the extrapolated values by the observed values of the average distribution and
smooth the quotients with the R loess function (Chambers and Hastie, 1992) to obtain the size selection
correctors for positions from 25 to “peak” (Figure 4B). The “peak” was manually assigned at a distance of 100 nt from the priming site for the
samples from the first preparation (mouse and dog poly(A)) and at 80 nt for the
second preparation (mouse ribozero, human poly(A)). In total, two sets of
final correctors were calculated, one for each preparation; each set was the
average of the correctors from the two magnesium fragmentation samples in that preparation.
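The fit and the corrector calculation can be illustrated with the sketch below. It rests on several assumptions that are not taken from the analysis itself: the exponential decay is fitted by least squares on the logarithm of the distribution, a centered moving average stands in for the R loess smoother, and the function names and the `window` parameter are hypothetical.

```python
from math import exp, log


def fit_exponential(dist, lo=120, hi=300):
    """Fit y = a * exp(b * x) on lo <= x <= hi by least squares on log(y).

    dist: {distance from priming site: average distribution value}
    Returns (a, b). The log-linear fit is an assumed simplification.
    """
    pts = [(x, log(y)) for x, y in dist.items() if lo <= x <= hi and y > 0]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    b = sxy / sxx
    a = exp(my - b * mx)
    return a, b


def size_selection_correctors(dist, a, b, start=25, peak=100, window=5):
    """Extrapolated fit divided by observed values, smoothed, for start..peak.

    A moving average of width `window` replaces loess in this sketch.
    """
    raw = {x: a * exp(b * x) / dist[x]
           for x in range(start, peak + 1) if dist.get(x, 0) > 0}
    xs = sorted(raw)
    half = window // 2
    smooth = {}
    for i, x in enumerate(xs):
        neighbours = xs[max(0, i - half): i + half + 1]
        smooth[x] = sum(raw[k] for k in neighbours) / len(neighbours)
    return smooth
```

A corrector above 1 at a given distance means that fewer read ends were observed there than the extrapolated decay predicts, as expected for short amplicons depleted by size selection.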
Modeling the number of cDNA molecules reaching a given nucleotide
i. Consider only those parts of transcripts that are within the RU of any priming site present in
the magnesium fragmentation treatment for a given sample.
ii. Decompose RNA cleavage sites between priming sites.
From each priming site, project the decomposition factors onto each position of the RU region. To calculate
the decomposition factor, first divide the values of the exponential fit to the average distribution of a
given treatment by the size selection correctors at positions from 25 to “peak”, then multiply the
quotients by the number of reads mapped to the given priming site. Then, split each mapped cleavage
site count between the different priming sites, using the decomposition factors as weights. Finally, multiply the
values assigned to a given priming site in the region 25‐“peak” by the appropriate size selection
corrector.
iii. Estimate the number of cDNA molecules reaching each position.
At this step, the number of mapped cleavage sites assigned to each priming site is known. Here we
incorporate normalization based on the QuShape procedure (Karabiber et al., 2013). The cumulative
sum of the cleavage site counts was calculated starting from the site most distal to the priming
site, and the calculated numbers of reaching cDNA molecules in the region 25‐“peak” were then divided
by the size selection correctors. Scores for each transcript location from different priming sites were
summed. The resulting values are the modeled numbers of cDNA molecules reaching each location.
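Step ii, the decomposition of cleavage counts between priming sites, can be sketched as follows. This is an illustrative reading of the description, not the original implementation. The function name, the dictionary layout and the bounds `min_d` and `max_d` (used here as a stand‐in for the RU region) are assumptions; `a` and `b` denote the parameters of an exponential fit of the form a·exp(b·d).

```python
from math import exp


def decompose(cleavage_counts, priming_reads, a, b, correctors,
              min_d=25, max_d=600):
    """Split each cleavage count between the priming sites of one transcript.

    cleavage_counts: {position: mapped cleavage ends}
    priming_reads:   {priming site position: reads mapped to the site}
    correctors:      {distance: size selection corrector}, defined only
                     for distances from 25 to "peak"; 1.0 is used elsewhere
    Returns {site: {position: assigned (corrected) count}}.
    """
    assigned = {site: {} for site in priming_reads}
    for pos, count in cleavage_counts.items():
        weights = {}
        for site, reads in priming_reads.items():
            d = site - pos
            if min_d <= d <= max_d:
                # Decomposition factor: fit / corrector * priming strength.
                weights[site] = a * exp(b * d) / correctors.get(d, 1.0) * reads
        total = sum(weights.values())
        if total == 0:
            continue
        for site, w in weights.items():
            d = site - pos
            # Multiply the share by the corrector in the 25-"peak" region.
            assigned[site][pos] = count * w / total * correctors.get(d, 1.0)
    return assigned
```

Because the corrector first lowers the weight of a nearby priming site and is then multiplied back onto the assigned value, the assigned counts of one position do not necessarily sum to the observed count once correctors differ from 1.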
The FragSeq 2.0 values were reported in a 4‐column format, where the 1st column was the transcript
ENSEMBL identifier, the 2nd: comma‐separated transcript positions, the 3rd: comma‐separated cleavage counts at
the positions in the 2nd column, and the 4th: comma‐separated modeled numbers of reaching cDNA molecules at
the positions in the 2nd column. The reported positions were the positions one nucleotide before the
first sequenced nucleotide, which is consistent with both nucleases P1 and V1 cutting 3’ of a single‐
stranded or stacked nucleotide, respectively (Kertesz et al., 2010; Underwood et al., 2010).
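As an illustration of the 4‐column format, the sketch below writes and reads one record. The tab as column separator, the two‐decimal rendering of the modeled values, the function names and the example identifier are assumptions made for this example only; the text above specifies only the content of the columns.

```python
def reported_position(first_sequenced):
    """Position one nucleotide before the first sequenced nucleotide."""
    return first_sequenced - 1


def format_record(transcript_id, counts, reaching):
    """Build one 4-column line (tab-separated here, by assumption).

    counts:   {position: cleavage count}
    reaching: {position: modeled number of reaching cDNA molecules}
    """
    positions = sorted(counts)
    return "\t".join([
        transcript_id,
        ",".join(str(p) for p in positions),
        ",".join(str(counts[p]) for p in positions),
        ",".join(f"{reaching[p]:.2f}" for p in positions),
    ])


def parse_record(line):
    """Inverse of format_record: returns (id, positions, counts, reaching)."""
    tid, pos, counts, reaching = line.rstrip("\n").split("\t")
    return (
        tid,
        [int(p) for p in pos.split(",")],
        [int(c) for c in counts.split(",")],
        [float(r) for r in reaching.split(",")],
    )
```

Keeping positions, counts and modeled values in parallel comma‐separated lists means that one line describes a whole transcript, so the end count/coverage ratio can be computed per position after a single split.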
RNA secondary structure modeling
Random sequences were generated with a Python script; computational prediction
of RNA secondary structure and calculation of sensitivity and positive predictive value were performed
with RNAstructure version 5.4 (Reuter and Mathews, 2010). Secondary structure visualization was
accomplished with VARNA version 3.9 (Darty et al., 2009).
Results
Library preparation and sequencing
Libraries in the FragSeq2 experiments were prepared in two runs, each consisting of probing two RNA
samples under 5 different conditions (Figure 1A), following the protocol depicted in Figure 1B. The
first sequencing round included mouse and dog liver poly(A) RNA, the second – mouse liver ribosome
depleted RNA (enzymatically polyadenylated to create priming site) and human liver poly(A) fraction,
both probed in the presence of spike‐in molecules. Our chosen probing reagents – a single strand
specific nuclease P1 (Romier et al., 1998) and stacked bases specific nuclease V1 (Ziehler and Engelke,
2001) leave 5’ phosphates on the cleaved RNA. These 5’ phosphates served as handles to enrich
for the ends created by the enzymes: the first adapter was ligated to them after probing. In addition, the
RNA was dephosphorylated before probing in order to remove endogenous 5’ phosphate
residues that would otherwise have been ligated to the adapter.
Figure 1. Probed samples and experimental workflow. (A) The experiment was performed in two rounds, each time with RNA from two species and with 5 different probing conditions (P1d5 means P1 with 5x lower concentration). (B) Dephosphorylated RNA was probed with a structure specific endonuclease leaving 5’ phosphate to which the first adapter was ligated. The RNA was subsequently used as a template for reverse transcription with oligo‐dT‐adapter primer, cDNA was amplified with PCR introducing sample specific index, samples were pooled, size selected and sequenced on Illumina HiSeq2000 with 100 nt single‐read protocol.
Apart from being probed with two different concentrations of P1 nuclease and one of V1 nuclease,
samples were fragmented at elevated temperature with magnesium ions to a cleavage extent
comparable to that in the probed samples and were then 5’ phosphorylated to allow
adapter ligation (hydrolysis with metal ions leaves a 5’ hydroxyl group (Forconi and Herschlag, 2009);
note that this procedure also phosphorylated the endogenous, preexisting breaks). The resulting
random fragmentation pattern was later used for data normalization and could possibly be applied for
ligation bias correction. Lastly, an untreated sample (“None”) was prepared, for which no treatment
was performed between the dephosphorylation and linker ligation steps; hence, the obtained results should
represent only the experimental noise.
To focus our assay on the 3’ regions of mRNA molecules, we designed an oligo‐dT reverse transcription
primer harboring the second adapter necessary for sequencing. This primer hybridizes at the beginning
of the poly(A) tail and primes reverse transcription, which continues until it reaches the 5’ end of the first
ligated Illumina adapter. The obtained cDNA was used as a template for PCR amplification with primers
recognizing both adapters, creating a setup that highly enriches for nucleic acids primed by our
introduced primer (as opposed to unspecific priming) and terminated at the adapter bound to the
nuclease cleavage site (as opposed to background terminations). Amplified libraries (Figure 2)
contained information on (1) the position of the first Illumina adapter ligation site, which corresponds to the
nucleolytic cleavage of the RNA, and (2) the position of the priming site. We sequenced only one end of
the construct, reading out information (1). Sequencing statistics are presented in Table 3.
Figure 2. Agarose electrophoresis of PCR amplified sequencing libraries. (A) Samples prepared for the 1st sequencing, (B) samples prepared for the 2nd sequencing.
Normalization
The oligo‐dT priming strategy used in the experiments implies that the signal density over a transcript
decays with increasing distance from the poly(A) tail. This made it necessary to
normalize the data in order to compare nucleotides located at different positions within a
transcript. To interpret the detected read end count at a given location in terms of
probing efficiency, we needed to estimate the number of cDNA molecules that reached that location, in
other words, what the observed count would be given 100% probing efficiency at that site. Our
estimation was based on the QuShape procedure (Karabiber et al., 2013), in which the observed signal at
a given position is divided by the summed signal of the cDNA molecules that passed this position. In our
experiment it was not straightforward to apply this method, for two reasons: (1) multiple priming sites exist
on a given transcript, and there are no data showing at which priming site a given read
originated (as opposed to the situation described in Paper 2, where paired‐end sequencing was
used); (2) size selection of the libraries against short amplicons means that the maximal
possible count for the short amplicons is lower than expected from the sum of cDNA molecules that
passed this location. When searching for ways to normalize the data, we found that some of the
reads bear a poly(A) stretch or part of the reverse transcription adapter sequence at their end,
enabling us to define priming sites. The priming sites found are, as expected, predominantly located at the
3’UTR – poly(A) tail borders (Figure 3). Moreover, thanks to the random fragmentation
sample, we assume that different priming sites can be compared quantitatively by counting the number of
reads with which a given site was detected. After defining the priming sites, we estimated the
average read end density relative to the priming sites (Figure 4A). Those two values – the strength of priming at a
given site and the value of the average distribution at a given distance from the priming site – were used to
decompose the structure‐derived read end counts between priming sites located on the same
transcript and to perform the QuShape‐like normalization for each priming site separately, using
size‐selection correctors for sites close to the priming site. Finally, the modeled numbers of cDNA
molecules reaching a given nucleotide from different priming sites were summed and reported.
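The QuShape‐like step for decomposed counts can be sketched as below: for each priming site a running sum is taken from the most distal cleavage position towards the priming site, the values in the corrected region are divided by the size‐selection corrector, and the results are summed over priming sites. The function names and the dictionary layout are assumptions made for this illustration, and positions are assumed to increase towards the priming site.

```python
def reaching_cdna(assigned, correctors):
    """Model the number of cDNA molecules reaching each position.

    assigned:   {priming site: {position: cleavage count assigned to it}}
    correctors: {distance from priming site: size-selection corrector};
                distances without a corrector are left unchanged
    Returns {position: modeled number of reaching cDNA molecules}.
    """
    positions = sorted({p for counts in assigned.values() for p in counts})
    coverage = {p: 0.0 for p in positions}
    for site, counts in assigned.items():
        running = 0.0
        for pos in positions:          # most distal position first
            if pos >= site:
                break
            # cDNAs ending here or further upstream all passed this position.
            running += counts.get(pos, 0.0)
            coverage[pos] += running / correctors.get(site - pos, 1.0)
    return coverage


def end_count_coverage_ratio(end_counts, coverage):
    """Observed end count divided by modeled coverage, where coverage > 0."""
    return {p: end_counts[p] / coverage[p]
            for p in end_counts if coverage.get(p, 0) > 0}
```

Note that the most distal position always receives a ratio of 1 when no corrector applies, because every cDNA counted there also terminates there; such edge positions are a property of the cumulative approach rather than a structural signal.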
Figure 3. FragSeq2 priming occurs predominantly on polyadenylation sites. Detected number of priming events for mouse liver poly(A) sample displayed in the UCSC Genome Browser for (A) ApoB 3’ terminus (two overlapping poly(A) signals AAUAAA are underlined with red, dashed lines) and (B) serum albumin precursor (Alb) transcript.
Figure 4. Data normalization. (A) The average distribution of read ends from priming sites (black, solid line) with an exponential fit curve (red, dashed line). (B) The size selection correction values indicating a magnitude of difference between observed average distribution and extrapolated exponential fit.
Spiked‐in RNA molecules
After performing the data normalization, we wanted to determine the minimal coverage that ensures
robust reproducibility. We took advantage of the ERCC spike‐in molecules being present in
both the mouse and human preparations (2nd sequencing) and calculated the Pearson correlation
coefficient for P1 treatments while taking only the positions with coverage above a certain threshold
(Figure 5A). This analysis showed that, in order to observe high correlation between technical
replicates, one needs to apply a coverage cut‐off of at least approximately 50 reaching cDNA molecules.
The high correlation between the P1 and P1d5 treatments (Figure 5B) shows that the higher concentration of
enzyme did not lead to a major appearance of secondary cuts. Given that the P1 and V1 nucleases have
opposite substrate specificities, we hypothesized that the signals obtained from the two should
anti‐correlate. We used the same test conditions as for Figure 5A and compared the P1
and V1 treatments (Figure 5C). To our surprise, the anti‐correlation is negligible (the minimum in the tested
range is approximately ‐0.04), which can stem from (1) ERCC molecules not forming stable structures
and existing as an ensemble of many different structures, a situation that would not favor
strong anti‐correlation, and/or (2) the general properties of the V1 nuclease, which is known
not to be very helpful in finding double‐stranded regions, sometimes cleaving close to, not within,
the double‐stranded region (Ziehler and Engelke, 2001). Interestingly, the magnesium fragmentation
derived end count/coverage ratios correlate weakly with both P1 and V1 derived ratios (Figure 5D, E).
Since the fragmentation should not depend on the RNA structure (it was performed at high temperature in a low
ionic strength environment), the observed correlation likely stems from library‐construction biases,
such as ligation or PCR biases, which are shared between different libraries.
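The cut‐off analysis can be sketched in a few lines of Python. The sketch assumes that a position is kept only when its coverage exceeds the cut‐off in both compared samples, which is one plausible reading of the procedure; the function names and the data layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Requires at least two points and non-zero variance in both inputs.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_at_cutoff(sample_a, sample_b, cutoff):
    """Correlate end count/coverage ratios over well-covered positions.

    sample_a, sample_b: {position: (end_count, coverage)}
    Returns (Pearson r, number of positions used).
    """
    shared = [p for p in sample_a
              if p in sample_b
              and sample_a[p][1] > cutoff
              and sample_b[p][1] > cutoff]
    ratios_a = [sample_a[p][0] / sample_a[p][1] for p in shared]
    ratios_b = [sample_b[p][0] / sample_b[p][1] for p in shared]
    return pearson(ratios_a, ratios_b), len(shared)
```

Calling `correlation_at_cutoff` over a series of increasing cut‐offs yields both curves of the kind shown in Figure 5: the correlation coefficient and the number of positions that remain.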
Figure 5. Correlation between signals from different treatments of spiked‐in ERCC libraries. Pearson correlation coefficient (black continuous line) of the end count/coverage ratio of nucleotides of ERCC RNA molecules with coverage higher than coverage cut‐off (x‐axis, log‐scale) between the compared samples (number of positions used in the calculation indicated by red, dashed line and right y‐axis). Correlation between signal from ERCC spike‐in from (A) P1 mouse and P1 human RNA, (B) P1 with P1d5, (C) P1 with V1, (D) P1 with magnesium fragmentation, (E) V1 with magnesium fragmentation. (B, C, D, E – mouse samples)
Apart from the ERCC RNA, the spiked‐in RNA contained 8 in vitro transcribed RNA molecules with
known, functional structures (Table 1). Most of the structural spike‐in molecules were too short to give a
signal of satisfactory quality (an effect of size selection), but the longest one included – Escherichia coli
transfer‐messenger RNA (tmRNA) – showed a promising probing signal. The tmRNA signal
was analyzed analogously to that of all the other molecules and is visualized in Figure 6.
First, mapping of the reads bearing part of the reverse transcription adapter from magnesium
fragmentation sample informed us about utilization of the priming sites (Figure 6A) and positions of the
ligation‐proximal end sites informed us about the structure related cleavages (Figure 6B for P1 probing).
Using those two sets of mappings, the exponential fit to the average distribution of P1 probing and the
size‐selection corrector values (Figure 4) we have modeled the coverage at each position of the
molecule (Figure 6C). Finally, dividing the end counts by the coverage yielded end counts/coverage ratio
(Figure 6D). The analogous procedure to calculate the end counts/coverage ratio has been repeated for
P1d5, V1 and magnesium fragmentation treatments (Figure 6E, F, G). As expected, the P1 and the P1d5
treatments reveal distinct, high peaks in the single‐stranded regions. The V1 treatment has also
produced distinct peaks, but they are not exclusively located in the double stranded regions,
underscoring poorly characterized enzyme specificity (Ziehler and Engelke, 2001). The magnesium
fragmentation pattern produces the most even ratios over the investigated fragment, albeit the
distribution is not as flat as would be expected given no bias during fragmentation and library
preparation.
Figure 6. Signal distribution over tmRNA spike‐in molecule. (A) Count of read ends containing priming information mapped at a given location, (B) read ends with structure‐related information for P1 probing, (C) modeled coverage, (D) ratio of structure‐related read ends count to coverage for P1, (E) P1d5, (F) V1 and (G) magnesium fragmentation sample. Cyan background indicates unpaired nucleotides in Escherichia coli tmRNA secondary structure (Zwieb et al., 2003).
Selenoprotein P 3’UTR
Selenoprotein P (Sepp1) is a liver‐secreted protein engaged in regulating the whole‐body selenium
homeostasis. It is unusual in carrying multiple selenocysteine residues, incorporation of which is
mediated by two conserved stem‐loop structures located in the 3’UTR called selenocysteine insertion
sequences (SECIS) (Burk and Hill, 2009). In the FragSeq2 experiment we have obtained the structural
data for both SECIS 1 (closer to 5’ end) and SECIS 2 (closer to 3’ end) from the Sepp1 3’UTR (Figure 7A,
C). A previously published analysis of SECIS elements defined the apical loop, helix II, non‐Watson‐Crick
base‐paired quartet, internal loop and helix I as conserved constituents of the structure (Figure 7B, D)
(Walczak et al., 1996). Nuclease P1 consistently cleaved the apical loops of both SECIS elements in each
of the analyzed species. The internal loop was efficiently recognized by nuclease P1 in the probing of
dog SECIS 1, in which this loop is one nucleotide larger than in human or mouse SECIS 1, and
showed some signal in the P1 probing of mouse SECIS 1. The V1 nuclease signal supports the formation
of helix II in dog and mouse SECIS 1 as well as in human and mouse SECIS 2. Helix I is supported by each
V1 treatment except for human SECIS 1, where the overall signal is very low. Interestingly, the large
apical loop in SECIS 1 was described to form a short internal helix (Fletcher et al., 2001) (Figure 7B, dashed
lines), which is supported by cleavages of nuclease V1 (dog and mouse) and decreased P1 signal (human,
dog, mouse) around the probable interacting nucleotides.
Figure 7. Signal distribution over Sepp1 SECIS elements backs conservation of the secondary structure. (A) and (C) An alignment of (A) SECIS 1 and (C) SECIS 2 from human (H), dog (D) and mouse (M) with highlighted in yellow predicted conserved single stranded regions (Walczak et al., 1996) and the end count/coverage ratio from FragSeq2 experiment for P1d5 and V1 treatments for 3 different species. (B) and (D) 2D models of conserved SECIS structure of (B) dog SECIS 1 and (D) human SECIS 2 with arrows indicating nuclease cleavages scaled according to the end count/coverage ratio. Dashed lines between nucleotides 34, 35, 36 and 44, 43, 42 on panel (B) indicate previously described internal helix.
Figure 7A alignment (SECIS 1):
H: UUCUAUUUGCUUUAAUGAGAAUAGAAACGUAAACUAUGACCUAGGGGUUUCUGUUGGAUAAUUAGCAGUUUAGAA
D: UUCUACUUGCAUUAAUGAAAACAGAGACAUAAACUAUGACCUAGGGGUUUCUGUUGGAUAGUUAGCAAUUUAGAA
M: UUCUAGUUACAUUAAUGAGAACAGAAACAUAAACUAUGACCUAGGGGUUUCUGUUGGAUAGCUUGUAAUUAAGAA
Figure 7C alignment (SECIS 2):
H: GUAUUUCCAUAGUCAAUGAUGGUU-UAAUAGGUAAACCAAACCCUAUAAACCUGACCUCCUUUAUGGUUAAUAC
D: GUAUUUCCAUAGUCAAUGAUGGUU-CAAUAGGUAAACUAAGUCCUAUAAACCUGAACUCCUAUAUGGUUAAUAC
M: GUAUUUCCAUAAUCAAUGAUGGUUUCA-UAGAGAAACUAAGUCCUAUGAACCUGACCUCUUUUAUGGCUAAUAC
Non‐coding RNA molecules
The repertoire of RNA molecules with functional structures is arguably richer among non‐coding RNAs
than among polyadenylated mRNA molecules. In order to enrich our dataset for structured RNA
molecules, in the second round of mouse sample sequencing we removed the ribosomal RNA (RiboZero)
instead of selecting the RNA with oligo‐dT coated beads. In this case the remaining RNA consisted
not only of mRNA but also of many other classes of RNA molecules. To apply the
same method of signal detection as was used for probing the 3’UTRs of mRNA molecules, we needed
all of our molecules of interest to bear a poly(A) tail, which we added via in vitro polyadenylation.
We hypothesized that the addition of the poly(A) tail would not influence the structure of the
molecule, especially if the formed structure is stable. To test this assumption, we performed
computational folding of 9999 randomly generated 100 nt long sequences with and without an added
100 nt long poly(A) tail and found that 75% of the predicted structures are identical after adding the
poly(A) tail and that the mean sensitivity and mean positive predictive value are 90% and 91%,
respectively. This simulation convinced us that polyadenylation before probing can be safely applied.
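The comparison of predicted structures with and without the tail can be illustrated as follows. The sketch does not call RNAstructure; it only shows how random sequences can be generated and how sensitivity and positive predictive value follow from two sets of base pairs. The dot‐bracket input, the function names, the choice of the tail‐free prediction as the reference and the convention of returning 1.0 for an empty set are assumptions of this example.

```python
import random


def random_rna(length, rng):
    """Uniformly random RNA sequence of the given length."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def base_pairs(dot_bracket):
    """Set of (i, j) base pairs (0-based) from a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def sensitivity_ppv(reference, predicted):
    """Sensitivity = shared pairs / reference pairs; PPV = shared / predicted.

    Here the structure predicted without the tail is taken as the reference
    and the structure predicted with the tail as the prediction (assumed).
    """
    shared = len(reference & predicted)
    sensitivity = shared / len(reference) if reference else 1.0
    ppv = shared / len(predicted) if predicted else 1.0
    return sensitivity, ppv
```

A tailed sequence for folding would be built as `random_rna(100, rng) + "A" * 100`; because the tail is appended at the 3’ end, base pair indices within the first 100 nt are directly comparable between the two predictions.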
One of the well‐characterized non‐coding molecules for which we obtained a signal of high quality is
the mouse U1 spliceosomal RNA, whose structure (Figure 8B) we compared with our probing
data (Figure 8A). The probing signal from treatments with both P1 concentrations was highly
concentrated in the predicted loops, and the signal from V1 probing was located mainly in the helical
regions, validating our method. Due to the size selection step, there are no data for
approximately the last 30 nt of the molecule. It is worth noting that our normalization scheme likely
underestimates the coverage over short RNA molecules, like U1, because cDNA molecules reaching the
RNA 5’ termini are not counted (endogenously present 5’ termini are not substrates for the ligation in
our setup).
Figure 8. U1 spliceosomal RNA. (A) The end count/coverage ratio for nucleotides of mouse U1 spliceosomal RNA for different probing conditions. Cyan background indicates unpaired nucleotides according to the model shown on the panel (B). (B) U1 secondary structure model proposed in (Underwood et al., 2010) with the FragSeq2 nuclease cleavage data indicated by arrows.
Discussion
We have presented a strategy for probing complex mixtures of RNA that focuses on the 3’ UTR regions
of mRNA molecules and can be expanded to probe the 3’ regions of other RNA molecules if preceded
with enzymatic polyadenylation. We have devised and implemented the normalization strategy that
decomposes the signal between observed priming sites and models the behavior of cDNA pool
extension and terminations. The probing data is affected by library preparation biases but is highly
reproducible between technical replicates.
The comparison of the obtained signal with the examples of three classes of molecules – spiked‐in
structured RNA (tmRNA), conserved 3’ UTR regulatory element (SECIS) and small nuclear RNA (U1)
reveals that the signal is of high quality, with the P1 nuclease cleavages concentrated in single‐stranded
regions and V1 in or close to the helical regions. Recently published results of an in vivo structure
probing (Rouskin et al., 2013) showed that the in vitro folding results in mRNA being more structured
than when present in the cells. Although performing in vivo RNA probing is very tempting, it may suffer
from RNA being present in multiple conformations within cells, which would blur the signal derived from
the functional structures. Moreover, it is compatible with only a few probing reagents, limiting the
probing toolset. The novelty of our strategy for finding functional RNA structures comes from combining
enzymatic probing detected with massive parallel sequencing (modified from the predecessor of our
method (Underwood et al., 2010)) with evolutionary conservation analysis (Pedersen et al., 2006). We
investigated the structures of RNA molecules from mammals of three different orders, which radiated
within a short timespan roughly 100 million years ago (Cannarozzi et al., 2007; Murphy et al., 2001).
The probed transcripts were derived from the same organ (liver), and given that the three species are
omnivores, we expect that they share some of the regulatory mechanisms mediated by mRNA 3’UTR
structures, as shown with the Sepp1 example. The choice of the studied organisms was supported by both dog
(Karlsson and Lindblad‐Toh, 2008) and mouse (Anderson and Ingham, 2003) being valuable model
organisms. Liver was chosen because its transcripts are the most promising targets of Locked Nucleic
Acid based antisense oligonucleotides (Janssen et al., 2013; Straarup et al., 2010), the design of which can
be facilitated by knowledge of RNA structure.
Apart from the RNA structural data, the sequencing data obtained with the FragSeq2 procedure carries
information defining priming sites that can be utilized to find cleavage and polyadenylation sites with a
single nucleotide resolution (Figure 3). Moreover, inclusion of the magnesium fragmentation treatment
allows gene expression measurement. We did not include the untreated sample in the data analysis,
but it could possibly be used for experimental noise correction, similarly to the use of the control
sample for ΔTCR calculation in the attached Paper 2.
Currently, the gathered data are being analyzed by our collaborators with the aim of finding new structural
elements present in the 3’ UTR regions. We expect that the nuclease probing data by themselves will be a
useful constraint for RNA structure predictions for each species separately, as in
previous transcriptome‐wide structure determination projects. However, the real strength comes from
the multi‐species design, with which we may detect both conserved and novel structural elements,
allowing regulatory mechanisms to be uncovered. Another interesting way of looking at the data will be
to correlate the structural signal with microRNA efficiency (as described in (Wan et al., 2014)) or with
the occurrences of RNA modifications or editing (Dominissini et al., 2012; Peng et al., 2012).
Contributions
FragSeq2 is a collaborative project between University of Copenhagen, University of California, Santa
Cruz and Aarhus University. Line Dahl Poulsen was involved in experiment planning and initial
experiments, Andrew V. Uzilov was involved in experiment planning and data analysis, Jakob Skou
Pedersen, Sudhakar Sahoo and Zsuzsanna Sükösd Etches are involved in data analysis, Sofie Salama,
Jeppe Vinther and Jakob Skou Pedersen supervised the project.
References
Addo‐Quaye, C., Eshoo, T.W., Bartel, D.P., and Axtell, M.J. (2008). Endogenous siRNA and miRNA targets identified by sequencing of the Arabidopsis degradome. Current biology : CB 18, 758‐762.
Anderson, K.V., and Ingham, P.W. (2003). The transformation of the model organism: a decade of developmental genetics. Nat Genet 33 Suppl, 285‐293.
Baltz, A.G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold‐Brown, D., Drew, K., Milek, M., et al. (2012). The mRNA‐bound proteome and its global occupancy profile on protein‐coding transcripts. Molecular cell 46, 674‐690.
Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215‐233.
Burk, R.F., and Hill, K.E. (2009). Selenoprotein P‐expression, functions, and roles in mammals. Biochim Biophys Acta 1790, 1441‐1447.
Cannarozzi, G., Schneider, A., and Gonnet, G. (2007). A phylogenomic study of human, dog, and mouse. PLoS Comput Biol 3, e2.
Chambers, J.M., and Hastie, T. (1992). Statistical models in S (Pacific Grove, Calif., Wadsworth & Brooks/Cole Advanced Books & Software).
Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research 38, 1767‐1771.
Darty, K., Denise, A., and Ponty, Y. (2009). VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25, 1974‐1975.
Ding, Y., Tang, Y., Kwok, C.K., Zhang, Y., Bevilacqua, P.C., and Assmann, S.M. (2013). In vivo genome‐wide profiling of RNA secondary structure reveals novel regulatory features. Nature.
Dominissini, D., Moshitch‐Moshkovitz, S., Schwartz, S., Salmon‐Divon, M., Ungar, L., Osenberg, S., Cesarkas, K., Jacob‐Hirsch, J., Amariglio, N., Kupiec, M., et al. (2012). Topology of the human and mouse m6A RNA methylomes revealed by m6A‐seq. Nature 485, 201‐206.
Fletcher, J.E., Copeland, P.R., Driscoll, D.M., and Krol, A. (2001). The selenocysteine incorporation machinery: interactions between the SECIS RNA and the SECIS‐binding protein SBP2. RNA 7, 1442‐1453.
Forconi, M., and Herschlag, D. (2009). Metal ion‐based RNA cleavage as a structural probe. Methods in Enzymology 468, 91‐106.
Grimson, A., Farh, K.K., Johnston, W.K., Garrett‐Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular cell 27, 91‐105.
Hofacker, I.L., Priwitzer, B., and Stadler, P.F. (2004). Prediction of locally stable RNA secondary structures for genome‐wide surveys. Bioinformatics 20, 186‐190.
Jambhekar, A., and Derisi, J.L. (2007). Cis‐acting determinants of asymmetric, cytoplasmic RNA transport. RNA 13, 625‐642.
Janssen, H.L., Reesink, H.W., Lawitz, E.J., Zeuzem, S., Rodriguez‐Torres, M., Patel, K., van der Meer, A.J., Patick, A.K., Chen, A., Zhou, Y., et al. (2013). Treatment of HCV infection by targeting microRNA. N Engl J Med 368, 1685‐1694.
Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid, accurate, and best‐practices quantification of nucleic acid probing information, resolved by capillary electrophoresis. RNA 19, 63‐73.
Karlsson, E.K., and Lindblad‐Toh, K. (2008). Leader of the pack: gene mapping in dogs and other model organisms. Nat Rev Genet 9, 713‐725.
Kedde, M., van Kouwenhove, M., Zwart, W., Oude Vrielink, J.A., Elkon, R., and Agami, R. (2010). A Pumilio‐induced RNA structure switch in p27‐3' UTR controls miR‐221 and miR‐222 accessibility. Nature cell biology 12, 1014‐1020.
Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278‐1284.
Kertesz, M., Wan, Y., Mazor, E., Rinn, J.L., Nutter, R.C., Chang, H.Y., and Segal, E. (2010). Genome‐wide measurement of RNA secondary structure in yeast. Nature 467, 103‐107.
Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol 1038, 213‐231.
Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory‐efficient alignment of short DNA sequences to the human genome. Genome biology 10, R25.
Lowman, H.B., and Draper, D.E. (1986). On the recognition of helical RNA by cobra venom V1 nuclease. The Journal of Biological Chemistry 261, 5396‐5403.
Lunde, B.M., Moore, C., and Varani, G. (2007). RNA‐binding proteins: modular design for efficient function. Nature reviews Molecular cell biology 8, 479‐490.
Martin, M. (2011). Cutadapt removes adapter sequences from high‐throughput sequencing reads. EMBnet.journal 17, 10‐12.
Mayr, C., and Bartel, D.P. (2009). Widespread shortening of 3'UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673‐684.
Mignone, F., Gissi, C., Liuni, S., and Pesole, G. (2002). Untranslated regions of mRNAs. Genome biology 3, REVIEWS0004.
Morita, M., Ler, L.W., Fabian, M.R., Siddiqui, N., Mullin, M., Henderson, V.C., Alain, T., Fonseca, B.D., Karashchuk, G., Bennett, C.F., et al. (2012). A novel 4EHP‐GIGYF2 translational repressor complex is essential for mammalian development. Molecular and cellular biology 32, 3585‐3593.
Murphy, W.J., Eizirik, E., Johnson, W.E., Zhang, Y.P., Ryder, O.A., and O'Brien, S.J. (2001). Molecular phylogenetics and the origins of placental mammals. Nature 409, 614‐618.
Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad‐Toh, K., Lander, E.S., Kent, J., Miller, W., and Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2, e33.
Peng, Z., Cheng, Y., Tan, B.C., Kang, L., Tian, Z., Zhu, Y., Zhang, W., Liang, Y., Hu, X., Tan, X., et al. (2012). Comprehensive analysis of RNA‐Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol 30, 253‐260.
Pruitt, K.D., Tatusova, T., Klimke, W., and Maglott, D.R. (2009). NCBI Reference Sequences: current status, policy and new initiatives. Nucleic acids research 37, D32‐36.
Ray, D., Kazan, H., Cook, K.B., Weirauch, M.T., Najafabadi, H.S., Li, X., Gueroussov, S., Albu, M., Zheng, H., Yang, A., et al. (2013). A compendium of RNA‐binding motifs for decoding gene regulation. Nature 499, 172‐177.
Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129.
Romier, C., Dominguez, R., Lahm, A., Dahl, O., and Suck, D. (1998). Recognition of single‐stranded DNA by nuclease P1: high resolution crystal structures of complexes with substrate analogs. Proteins 32, 414‐424.
Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., and Weissman, J.S. (2013). Genome‐wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature.
Seeher, S., Mahdi, Y., and Schweizer, U. (2012). Post‐transcriptional control of selenoprotein biosynthesis. Curr Protein Pept Sci 13, 337‐346.
Straarup, E.M., Fisker, N., Hedtjarn, M., Lindholm, M.W., Rosenbohm, C., Aarup, V., Hansen, H.F., Orum, H., Hansen, J.B.R., and Koch, T. (2010). Short locked nucleic acid antisense oligonucleotides potently reduce apolipoprotein B mRNA and serum cholesterol in mice and non‐human primates. Nucleic Acids Research 38, 7100‐7111.
Underwood, J.G., Uzilov, A.V., Katzman, S., Onodera, C.S., Mainzer, J.E., Mathews, D.H., Lowe, T.M., Salama, S.R., and Haussler, D. (2010). FragSeq: transcriptome‐wide RNA structure probing using high‐throughput sequencing. Nature Methods 7, 995‐1001.
Walczak, R., Westhof, E., Carbon, P., and Krol, A. (1996). A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA 2, 367‐379.
Wan, Y., Qu, K., Zhang, Q.C., Flynn, R.A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R.C., Snyder, M.P., Segal, E., et al. (2014). Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706‐709.
Washietl, S., Hofacker, I.L., and Stadler, P.F. (2005). Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 102, 2454‐2459.
Washietl, S., Pedersen, J.S., Korbel, J.O., Stocsits, C., Gruber, A.R., Hackermuller, J., Hertel, J., Lindemeyer, M., Reiche, K., Tanzer, A., et al. (2007). Structured RNAs in the ENCODE selected regions of the human genome. Genome research 17, 852‐864.
Zhang, L., Kasif, S., Cantor, C.R., and Broude, N.E. (2004). GC/AT‐content spikes as genomic punctuation marks. Proceedings of the National Academy of Sciences of the United States of America 101, 16855‐16860.
Ziehler, W.A., and Engelke, D.R. (2001). Probing RNA structure with chemical reagents and enzymes. Curr Protoc Nucleic Acid Chem Chapter 6, Unit 6.1.
Zwieb, C., Gorodkin, J., Knudsen, B., Burks, J., and Wower, J. (2003). tmRDB (tmRNA database). Nucleic Acids Res 31, 446‐447.
Tables
Table 1. Structured RNA spike‐in molecules
Name Sequence
ryhB GGCGAUCAGGAAGACCCUCGCGGAGAACCUGAAAGCACGACAUUGCUCACAUUGCUUCCAGUAUUACUUAGCCAGCCGGGUGCUGGCUUUUACCUA
6S GUUUCUCUGAGAUGUUCGCAAGCGGGCCAGUCCCCUGAGCCGAUAUUUCAUACCACAAGAAUGUGGCGCUCCGCGGUUGGUGAGCAUGCUCGGUCCGUCCGAGAAGCCUUAAAACUGCGACGACACAUUCACCUUGAACCAAGGGUUCAAGGGUUACAGCCUGCGGCGGCAUCUCGGAGAUUCCACCUA
tmRNA GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGAAACCCAAGGUGCAUGCCGAGGGGCGGUUGGCCUCGUAAAAAGCCGCAAAAAAUAGUCGCAAACGACGAAAACUACGCUUUAGCAGCUUAAUAACCUGCUUAGAGCCCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGAUCGCGUGGAAGCCCUGCCUGGGGUUGAAGCGUUAAAACUUAAUCAGGCUAGUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACUAAGCAUGUAGUACCGAGGAUGUAGGAAUUUCGGACGCGGGUUCAACUCCCGCCAGCUCCAACCUA
DsrA GAACACAUCAGAUUUCCUGGUGUAACGAAUUUUUUAAGUGCUUCUUGCUUAAGCAAGUUUCAUCCCGACCCCCUCAGGGUCGGGAUUUACCUA
TPPapt GGACUCGGGGUGCCCUUCUGCGUGAAGGCUGAGAAAUACCCGUAUCACCUGAUCUGGAUAAUGCCAGCGUAGGGAAGUCACGGACCACCAGGUCAUUGCUUCUUCACGUUAUGGCAGGAGCAAACUAUGCAAGUCGACCUGCUGGGUUCAGCGCAAUCUGCGCACGACCUA
fhlA220 GGCAGCGUUACAUUCCCAUCCACUGGGGAAAGACGCGGCGCUGAUUGGUGAAGUGGUGGAACGUAAAGGUGUUCGUCUUGCCGGUCUGUAUGGCGUGAAACGAACCCUCGAUUUACCACACGCCGAACCGCUUCCGCGUAUAUGCUAAUAAAAUUCUAAAUCUCCUAUAGUUAGUCAAUGACCUUUUGCACCGCUUUGCGGUGCUUUCCUGGAAGAACAAAAUGUCAUAUACACCGAUGAGUGAUCUCGGACAACAAGGGUUGUUCGACAUCACUCGGACAACCUA
Spot42 GGUAGGGUACAGAGGUAAGAUGUUCUAUCUUUCAGACCUUUUACUUCACGUAAUCGGAUUUGGCUGAAUAUUUUAGCCGCCCCAGUCAGUAAUGACUGGGGCGUUUUUUAACCUA
Table 2. Oligonucleotides used in the study
Name Sequence Remarks
phosphoseqADAPT ACACUCUUUCCCUACACGACGCUCUUCCGAUCUNN RNA
Adapter_oligo_dT AGACGTGTGCTCTTCCGATCTTTTTTTTTTTTTTTTTTTVN DNA
multi1_short AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT DNA
INDEX#_long CAAGCAGAAGACGGCATACGAGATxxxxxxGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT DNA; xxxxxx indicates the specific index sequence. See (Kielpinski et al., 2013) for more information.
Table 3. Sequencing and mapping statistics
Sequencing name: 121019 121019 121019 121019 121019 121019 121019 121019 121019 121019
Sample: Mmus_PA Mmus_PA Mmus_PA Mmus_PA Mmus_PA Cfam_PA Cfam_PA Cfam_PA Cfam_PA Cfam_PA
Spike-ins: F F F F F F F F F F
Treatment: P1 V1 Mg NONE P1/5 P1 V1 Mg NONE P1/5
Index: 1 2 3 4 13 5 6 7 8 14
Reads: 26,184,914 31,436,703 13,620,819 1,859,822 9,418,061 24,230,982 22,656,312 22,802,173 2,508,975 18,657,694
Reads mapped to priming sites: - - 1,951,595 - - - - 1,375,814 - -
Reads mapped as cleavage sites: 17,538,091 19,790,932 9,588,606 1,116,842 7,243,654 9,418,105 7,864,929 8,748,960 961,805 7,950,991
% of reads mapped as cleavage sites: 66.98% 62.95% 70.40% 60.05% 76.91% 38.87% 34.71% 38.37% 38.33% 42.62%

Sequencing name: 130220 130220 130220 130220 130220 130220 130220 130220 130220 130220
Sample: Mmus_RZ Mmus_RZ Mmus_RZ Mmus_RZ Mmus_RZ Hsap_PA Hsap_PA Hsap_PA Hsap_PA Hsap_PA
Spike-ins: T T T T T T T T T T
Treatment: P1 P1/5 V1/5 Mg NONE P1 P1/5 V1/5 Mg NONE
Index: 1 3 4 5 6 7 9 10 11 12
Reads: 18,918,149 18,129,511 21,480,088 18,212,256 3,699,234 17,466,036 12,331,841 28,211,757 18,876,283 2,575,061
Reads mapped to priming sites: - - - 2,296,595 - - - - 3,191,888 -
Reads mapped as cleavage sites: 15,339,647 15,378,657 15,297,433 15,130,272 2,373,297 11,578,819 9,652,216 16,341,368 13,455,068 1,532,410
% of reads mapped as cleavage sites: 81.08% 84.83% 71.22% 83.08% 64.16% 66.29% 78.27% 57.92% 71.28% 59.51%
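
The percentage row of Table 3 is not defined in the table itself. The values are consistent with the number of reads mapped as cleavage sites divided by the total number of reads in the same column. The short Python sketch below checks this for three columns of the first sequencing run (121019, sample Mmus_PA). The function name `percent_mapped` is hypothetical and is used here only for illustration, and the formula is an assumption inferred from the tabulated numbers.

```python
def percent_mapped(cleavage_reads: int, total_reads: int) -> float:
    """Percentage of all reads that mapped as cleavage sites.

    Assumption: the "%" row of Table 3 is cleavage-site reads / total reads.
    """
    return round(100.0 * cleavage_reads / total_reads, 2)


# (reads mapped as cleavage sites, total reads), copied from Table 3,
# sequencing 121019, sample Mmus_PA.
columns = {
    "P1": (17_538_091, 26_184_914),
    "V1": (19_790_932, 31_436_703),
    "NONE": (1_116_842, 1_859_822),
}

for treatment, (cleavage, total) in columns.items():
    print(f"{treatment}: {percent_mapped(cleavage, total):.2f}%")
```

For these three columns the computed values agree with the tabulated 66.98%, 62.95% and 60.05%.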