
FACULTY OF SCIENCE

UNIVERSITY OF COPENHAGEN

PhD thesis

Łukasz Jan Kiełpiński

High-throughput sequencing based methods of RNA structure investigation

Academic advisors:

Associate Professor Jeppe Vinther

Associate Professor Jan Christiansen

Submitted: 14/02/2014

HIGH-THROUGHPUT SEQUENCING BASED METHODS OF RNA STRUCTURE INVESTIGATION

Łukasz Jan Kiełpiński

PhD Thesis

February 2014

This thesis has been submitted to

the PhD School of The Faculty of Science,

University of Copenhagen

Contents

1 Summary
2 Dansk resumé (Danish summary)
3 Streszczenie po polsku (Polish summary)
4 Acknowledgments
5 Abstract
6 Objectives
7 Description of the research project
7.1 Background information
7.1.1 Ribonucleic acid
7.1.2 RNA structure
7.1.3 Interactions between RNA and antisense oligonucleotides
7.1.4 Massive parallel sequencing
7.1.5 Application of the massive parallel sequencing for RNA structure determination
7.2 Project motivations
8 Summary of the results in the papers and their relation to the international state-of-the-art
8.1 Paper 1
8.2 Paper 2
8.3 Paper 3
8.4 Paper 4
9 Conclusions and perspectives
10 References
11 Papers
11.1 Paper 1: Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing
11.2 Paper 2: Massive parallel sequencing based hydroxyl radical probing of RNA accessibility
11.3 Paper 3: Transcriptome-wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)
11.4 Paper 4: The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)


1 Summary

RNA exists in cells in the form of dynamic, three-dimensional entities, but to assist its description researchers resort to studying its primary (sequence), secondary (base pairing) and finally tertiary (three-dimensional) structure. Traditional methods of studying the secondary and tertiary structures are labor intensive and require analyzing every molecule of interest separately. Since the emergence of massive parallel sequencing, the RNA structure determination field has been undergoing rapid changes, immensely increasing the throughput of experiments and proposing new ways of data analysis. This thesis consists of four manuscripts that describe developments within this methodological shift by presenting and validating novel experimental and computational approaches to harnessing next-generation sequencing for RNA structural studies.

The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and

Massive Parallel Sequencing”) presents a flexible, easy to follow method of preparing Illumina

sequencing libraries that allows for massive identification of reverse transcription termination sites

(RTTS), termed RTTS-Seq. Depending on the experimental design, the detection of RTTS can be used to investigate various RNA properties, such as the positions of 5’ ends, susceptibility towards certain treatments (e.g. structure probing) or the presence of base modifications. Apart from

describing the detailed experimental protocol we provide the data analysis workflow suitable for

researchers without bioinformatics expertise. The experience from the RTTS‐Seq method has been

utilized in the second paper (“Massive parallel sequencing based hydroxyl radical probing of RNA

accessibility”) for probing the tertiary structure of RNA. The protocol has been extended with a technique for tackling PCR bias and combined with a normalization scheme that takes into consideration local coverage and background reverse transcription terminations, as assessed by the control reaction. The method allows

for probing multiple, long molecules simultaneously and the obtained signal correlates well with a

backbone solvent accessibility for both assayed molecules (RNase P specificity domain and the 16S

ribosomal RNA). Another included paper (“The search for functional RNA secondary structures within 3’

untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)”)

presents a method of RNA secondary structure probing that is again a modification of RTTS-Seq, but one compatible with nuclease-based (P1 and V1) probing. In this protocol we ligated the adapter at the RNA level, as opposed to the cDNA-level ligation in the RTTS-Seq approach. Moreover, we performed reverse transcription anchored at the poly(A) tail border, focusing the assay on the 3’ untranslated regions. This set-up required establishing a new data normalization workflow that

incorporates the signal decay from the 3’ ends of molecules. We have performed the experiments with

liver RNA from three species, which allows us to combine the nuclease probing data with a structure

conservation analysis creating an information rich dataset. We validate the method by comparing the

nuclease signal with the known structures for three classes of RNA molecules. The search for the novel

functional structures is ongoing.

In parallel to studying the RNA structure we have investigated the interactions between RNA and an

oligonucleotide with therapeutic potential (“Transcriptome‐wide detection of binding sites of Locked

Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method, LNA-Stop-Seq, that can detect hybridization sites on a transcriptome-wide scale. We characterize and optimize various steps in the procedure and propose strategies for enriching for cDNA molecules terminated upon reaching the crosslinked oligonucleotide. Finally, the sequencing results confirm that the enrichment works, but the unexpected signal distribution requires additional data analysis efforts.

The methods presented in this thesis are capable of providing a holistic view of RNA: its primary, secondary and tertiary structure, as well as its interactions with oligonucleotides. We expect that the advances made in the experimental and computational methods, as well as the gathered results, will allow for a better understanding of the RNA structure-function relationship and for better and simpler design of antisense drugs.


2 Dansk resumé (Danish summary)

RNA exists in cells in the form of dynamic, three-dimensional entities, but to ease the description of these forms, researchers resort to studying the primary (sequence), secondary (base pairing) and finally tertiary (three-dimensional) structure. Traditional methods for studying the secondary and tertiary structure are time consuming and are limited to analyzing each molecule individually. Since massive parallel sequencing was introduced, the research field concerned with RNA structure determination has changed rapidly; the efficiency of experiments has increased immensely and new ways of analyzing data have been developed. This thesis consists of four manuscripts that describe improvements within this methodological shift by introducing and validating new experimental and computational approaches to using next-generation sequencing for RNA structures.

The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing”) demonstrates a flexible, easily understood method for building Illumina sequencing libraries that allows massive identification of reverse transcription termination sites (RTTS), termed RTTS-Seq. The identification of RTTS can be used to examine different RNA properties, from the mapping of 5’ ends and sensitivity to particular treatments (e.g. structure probing) to the detection of base modifications, all depending on the experimental design. In addition to describing the detailed experimental protocol, we explain the data analysis workflow, which can be used by researchers without expertise in bioinformatics. The experience from the RTTS-Seq method was used in the second paper (“Massive parallel sequencing based hydroxyl radical probing of RNA accessibility”) for tertiary structure probing. It was extended with a technique for handling systematic errors in PCR and combined with a normalization that takes into account local coverage and the background originating from termination during reverse transcription, estimated with the help of the control reaction. The method allows probing of several long molecules simultaneously, and the signal correlates well with the solvent accessibility of the ribose-phosphate chain (backbone) of both molecules studied (the RNase P specificity domain and 16S ribosomal RNA). Another included paper (“The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)”) presents the method for probing RNA secondary structure, which is again a modification of RTTS-Seq but is compatible with nuclease-based (P1 and V1) probing. In this protocol we ligated the adapter at the RNA level, in contrast to ligation at the cDNA level in the RTTS-Seq approach. In addition, we performed reverse transcription anchored in the poly(A) tail in order to focus the analysis on 3’ untranslated regions. This set-up necessitated the development of a new data normalization procedure that incorporates the decay of the signal from the 3’ ends of molecules. We performed the experiments with liver RNA from three species, which enables us to combine nuclease resistance from the probing data with structure conservation analysis and create an information-rich dataset. We validate the method by comparing the nuclease signal with the known structures for three classes of RNA molecules. The hunt for new functional structures is ongoing.

In parallel with the RNA structure studies, we investigated the interplay between RNA and an oligonucleotide with therapeutic potential (“Transcriptome-wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method, LNA-Stop-Seq, that can detect hybridization sites on a transcriptome scale. We characterize and optimize different steps of the procedure and propose strategies to enrich for cDNA molecules that are terminated after reaching the crosslinked oligonucleotide. The sequencing results confirm that the enrichment works, but the unexpected distribution of the signal requires further data analysis.

The methods presented in this thesis can contribute to a holistic view of RNA, with its primary, secondary and tertiary structure as well as its interactions with oligonucleotides. We expect that the improvements in the experimental and computational methods, together with the collected results, should enable a better understanding of the RNA structure-function relationship, in addition to better and simpler design of antisense drugs.


3 Streszczenie po polsku (Polish summary)

RNA occurs in cells in the form of dynamic, three-dimensional entities, but to describe it scientists resort to studying its primary (nucleotide sequence), secondary (base pairing) and finally tertiary (three-dimensional) structure. Traditional methods of studying secondary and tertiary structures are laborious and require a separate analysis of each molecule. Since the advent of massively parallel sequencing, the field of RNA structure determination has been undergoing rapid changes, significantly increasing the throughput of experiments and proposing new ways of analyzing data. This doctoral thesis consists of four papers describing advances within this methodological leap, presenting and validating new experimental and computational methods that use next-generation sequencing for structural studies of RNA.

The first paper (“Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing”) contains a flexible, easy to follow way of preparing a library for sequencing with Illumina technology that allows massive identification of reverse transcription termination sites (RTTS), hereafter called RTTS-Seq. The detection of RTTS can be used to study various RNA properties, such as mapping the 5’ end, mapping the sensitivity of RNA to particular treatments (e.g. structure probing), detecting modified nucleotides or others, depending on the design of the experiment. Besides giving a detailed experimental protocol, we also present a data analysis process suited to scientists without bioinformatics skills. The experience gathered from the RTTS-Seq method was used in the second paper (“Massive parallel sequencing based hydroxyl radical probing of RNA accessibility”) for probing the tertiary structure of RNA. The method was extended with a technique that resolves the bias arising from the PCR reaction and combined with a normalization system that takes into account the local coverage level and the background of reverse transcription termination, assessed from a control reaction. The method enables probing of many long molecules simultaneously and yielded a signal that correlates well with the solvent accessibility of the RNA backbone for both tested molecules (the RNase P specificity domain and 16S ribosomal RNA). The next included paper (“The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)”) presents a method of probing RNA secondary structure, which is a modification of the RTTS-Seq method compatible with nuclease-based (P1 and V1) probing. In this protocol adapter ligation is carried out at the RNA level, in contrast to ligation at the cDNA level in RTTS-Seq. Moreover, the reverse transcription was anchored at the border of the poly(A) tail, focusing our analysis on 3’ untranslated regions. This configuration required developing a new data normalization method that accounts for the decay of the signal from the 3’ end. We carried out experiments with liver RNA from three species, which allows us to merge the nuclease probing data with an analysis of the evolutionary conservation of structures, creating an information-rich dataset. To validate the presented method, the nuclease cleavage signal was compared with known structures for RNA molecules from three different classes. The search for new functional structures is in progress.

In parallel with studying RNA structures, we studied interactions between RNA and an oligonucleotide with therapeutic potential (“Transcriptome-wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA-Stop-Seq)”). We describe the development of a method, LNA-Stop-Seq, that allows hybridization sites to be detected at the scale of the whole transcriptome. We characterize and optimize various steps in the procedure and propose strategies for enriching cDNA molecules stopped at the bound oligonucleotides. Finally, the sequencing results confirm that the enrichment method works, but the unexpected signal distribution requires additional data analysis.

The methods presented in this work can provide a comprehensive view of RNA, its primary, secondary and tertiary structure, as well as its interactions with oligonucleotides. We expect that the advances in experimental and computational methods, as well as the collected results, should allow a better understanding of the relationship between RNA structure and function and, moreover, better and simpler design of drugs based on antisense therapy.


4 Acknowledgments

The results presented in this thesis could only be obtained thanks to the wide support that I have received during and before my PhD studies. First of all, I would like to thank my supervisor, Prof. Jeppe Vinther, for guiding my scientific growth over the last 3.5 years, for the opportunities to openly discuss and

try new ideas, for keeping me healthily motivated to work on them and for the constructive feedback

regarding this thesis. I am also very grateful to Prof. Jan Christiansen, my co‐supervisor, who always had

time to talk about science, sports and life, and who was very helpful with coping with the administrative

processes.

I owe many thanks to our lab technicians, Amal Al-Chaer and Lena Bjørn Johansson, for ensuring an efficiently functioning laboratory with a great atmosphere, and to my coworkers Jakob Lewin Rukov,

Signe Olivarius, Line Dahl Poulsen (thanks for translating the summary!), Christel Hougård Petersen,

Yanping Feng, Sidsel Kramshøj Adolph and Heidi Theil Hansen for fruitful discussions, being helpful and

keeping the University a place that one wants to come back to. Many thanks to the section leader –

Prof. Anders Krogh for scientific and social engagement and to Henriette Husum Bak‐Jensen for a

passionate organizational support.

I would especially like to thank Sofie Salama and the whole Haussler Lab for the great and productive

time during my academic stay in Santa Cruz, as well as Jakob Skou Pedersen and his lab for the valuable

collaboration. Special thanks go to the representatives of Santaris Pharma – Morten Lindow and Peter

Hagedorn, whose enthusiasm and expert insight gave the momentum to our joint projects.

I am greatly indebted for the high quality education I have received prior to my doctoral studies at the

Poznań University of Life Sciences and at the Saint Mary Magdalene High School in Poznań. I owe

particular thanks to dr Tomasz Pniewski for guiding me through my first research venture and to Prof.

Włodzimierz Krzyżosiak and his lab for the very important, scientifically forming experience during work

for my master project.

I would like to thank the Department of Biology for funding my scholarship and The Danish Council for

Strategic Research for funding most of the remaining expenses and the stay abroad.

Finally, I would especially like to thank all my friends living here in Denmark and my friends in Poland, my girlfriend Gillian and my whole family.

Thank you, my Parents, for your love, your support, and an example of a life worth following.


5 Abstract

In this thesis we describe the development of four related methods for RNA structure probing that

utilize massive parallel sequencing. Using them, we were able to gather structural data for multiple, long

molecules simultaneously. First, we have established an easy to follow experimental and computational

protocol for detecting the reverse transcription termination sites (RTTS‐Seq). This protocol was

subsequently applied to hydroxyl radical footprinting of three dimensional RNA structures to give a

probing signal that correlates well with the RNA backbone solvent accessibility. Moreover, we applied

RTTS‐Seq to detect antisense oligonucleotide binding sites within a transcriptome. In this case, we

applied an enrichment strategy to greatly reduce the background. Finally, we have modified RTTS-Seq to study the secondary structure of 3’ untranslated regions by nuclease probing in combination with an analysis of the evolutionary conservation of structure. In the course of this thesis we also describe several computational methods: one that alleviates PCR bias by estimating the number of unique molecules existing before amplification, and two methods for data normalization, one applicable when paired-end sequencing is performed and the other suited to single-read sequencing with known priming sites.
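
The idea of estimating the number of unique pre-amplification molecules can be illustrated with a minimal sketch. It rests on an assumption that the abstract does not state: that each molecule is tagged with one of n equally likely random barcodes before PCR, and that k distinct barcodes are later observed at a given site. Under that assumption the classical occupancy argument gives the estimate m = ln(1 - k/n) / ln(1 - 1/n). The function name and its parameters are hypothetical and are not taken from the thesis.

```python
import math


def estimate_unique_molecules(observed_barcodes: int, possible_barcodes: int) -> float:
    """Estimate how many molecules were present before PCR.

    Assumption (hypothetical, for illustration only): every molecule received
    one of `possible_barcodes` equally likely random barcodes, and
    `observed_barcodes` distinct barcodes were seen at one site.
    """
    n, k = possible_barcodes, observed_barcodes
    if n < 2:
        raise ValueError("possible_barcodes must be at least 2")
    if not 0 <= k <= n:
        raise ValueError("observed_barcodes must lie between 0 and possible_barcodes")
    if k == 0:
        return 0.0
    if k == n:
        # Every barcode was seen, so the estimate diverges.
        return math.inf
    # Expected distinct barcodes after m draws is n * (1 - (1 - 1/n) ** m);
    # setting this equal to k and solving for m gives the expression below.
    return math.log(1 - k / n) / math.log(1 - 1 / n)
```

With few observed barcodes relative to the barcode space the estimate stays close to the observed count; the correction grows as the barcode space saturates.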


6 Objectives

The overall objective of my thesis is to further the understanding of RNA biology and to facilitate the design of antisense oligonucleotide drugs by developing methods for studying RNA properties in a transcriptome-wide manner with the use of massive parallel sequencing. This overarching objective was split into working goals:

- establishing a generic method of detecting reverse transcription termination sites (which can originate from RNA structure probing or from other signals reflecting RNA properties) with the Illumina sequencing technology (Paper 1),
- devising an experimental and computational workflow for studying the RNA tertiary structure (Paper 2),
- characterizing interactions between RNA and a specific oligonucleotide with therapeutic potential (Paper 3),
- describing the secondary structure of 3’ untranslated regions of mRNA molecules in an evolutionary context (Paper 4).


7 Description of the research project

7.1 Background information

7.1.1 Ribonucleic acid

Ribonucleic acids (RNA) carry out multiple functions and contribute about 20% of the dry weight (DW) of Escherichia coli and almost 4% DW of a typical mammalian cell (Alberts, 2002). The RNA molecules are

polymers composed of ordered adenosine (A), cytidine (C), guanosine (G) and uridine (U) monophosphates, with the chemical repertoire extended by nucleotide modifications (Cantara et al.,

2011; Limbach et al., 1994). Traditionally, RNAs have been categorized as coding and non‐coding, with

the main function of the coding molecules being an intermediate in the flow of genetic information from

DNA to proteins (Crick, 1970). Among non‐coding RNA molecules (ncRNA) we observe the astounding

variety of functions ranging from catalysis, delivering amino acids, detection of small molecules

(Serganov and Nudler, 2013) or temperature (Kortmann and Narberhaus, 2012), involvement in

reactions acting on other RNA molecules (Matera et al., 2007), genome management (Froberg et al.,

2013), telomere synthesis (Gesteland et al., 2006) and increasingly appreciated involvement in the post‐

transcriptional gene expression regulation (Carthew and Sontheimer, 2009; Ulitsky and Bartel, 2013)

among others.

7.1.2 RNA structure

Although RNA molecules are linear polymers they exist as three-dimensional entities, whose structure is

dictated by their sequence, history of the molecule, solvent properties and molecular interactors. Under

physiological conditions the main interactions dictating the structure are base stacking and base pairing,

with many other forces shaping the final molecular organization. It is easiest to appreciate the

importance of RNA folding into its specific three dimensional structure when considering catalytic RNA

molecules – ribozymes – with known involvement in RNA processing and in the protein synthesis

performed by an especially interesting catalytic RNA, ribosomal RNA (Doudna and Cech, 2002). Ribosomal RNA is the most abundant class of RNA present in living cells, accounting for approximately 80% of RNA mass, and its structure has been heavily studied since the mid-20th century (Bakowska-Zywicka and Tyczewska, 2009). Solving its three-dimensional structure at the beginning of the 21st century with the

help of the X‐ray crystallography allowed the full appreciation of the importance of folded RNA in its

functioning (Steitz, 2008). We have used a small subunit of this complex as a benchmark for the method

of probing three dimensional RNA structures described in this thesis (Paper 2). Representatives of the

second most abundant class of RNA molecules, tRNAs, also require folding into specific three

dimensional structures to be charged with an amino acid by their cognate aminoacyl-tRNA synthetases and to deliver it to the ribosome (Perona and Hadd, 2012). Apart from those well-studied models, there are numerous known examples of the RNA fold being important for function. For instance, RNA folded into a hairpin is a substrate in the microRNA biogenesis pathway (Kim, 2005), the structure (secondary and tertiary) of microRNA target sites can modulate silencing efficiency (Gan and Gunsalus, 2013; Kertesz et

al., 2007; Wan et al., 2014), structures within pre‐mRNA are involved in alternative splicing regulation

(McManus and Graveley, 2011) and some RNA‐protein interactions require specific RNA fold (Lunde et

al., 2007). More examples of RNA structure roles have been summarized in (Wan et al., 2011).


7.1.2.1 Hierarchy of RNA structure

To better understand RNA structure, researchers describe it in terms of secondary and tertiary

structure models. Secondary structure describes the pattern of base‐pairing forming structural features

such as stems, internal loops, hairpin loops, multi‐loops, bulges and pseudoknots (see (Andronescu et

al., 2008) for an explanation). Thanks to base stacking and hydrogen bonding, secondary structure contributes most of the negative free energy of structure formation and, assuming a hierarchical folding model, it forms the basis for the tertiary organization of RNA molecules (Tinoco and Bustamante, 1999). The tertiary structure of RNA describes the three-dimensional coordinates of its constituent atoms. Observed patterns form a very rich repertoire including the A-form helix, coaxial stacking, helix junctions, interactions between a nucleotide and the helix minor groove (A-minor), kink-turns, hook turns, S-turns, tetraloops and tetraloop receptors, intercalations, triple-stranded RNA, G-quadruplexes, ribose zippers and interactions involving base pairing (hence sometimes considered to be secondary structure features) such as kissing loops and pseudoknots; see (Butcher and Pyle, 2011) for a more detailed description. Moreover, apart from Watson-Crick base pairs, the spectrum of possible hydrogen bonding between bases is enriched by non-canonical pairs (Leontis and Westhof, 2001). Overall, the relative simplicity of its determination and the importance of RNA secondary structure have directed more effort towards solving it than towards building three-dimensional models.

7.1.2.2 Secondary structure determination

There are various approaches towards investigating the secondary structure of a given RNA molecule. One is to use energy minimization programs such as Mfold (Zuker, 2003), RNAstructure (Reuter and Mathews, 2010) or many others, which take the primary RNA sequence as input and output folding patterns with calculated energies. Their predictions rely on thermodynamic parameters to find the secondary structure with the lowest free energy. The structures obtained are not guaranteed to be actually present in solution, nor do they tell us whether the structure is biologically relevant. On average their accuracy is 73% and their high probability predictions are generally correct (Mathews, 2004).

Alternatives to free energy minimization include statistical learning algorithms (Do et al., 2006) or

statistical sampling from ensemble (Ding et al., 2004) among others.
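
To make the dynamic programming idea behind such predictors concrete, the following is a deliberately simplified sketch. It is not the algorithm used by Mfold or RNAstructure, which minimize free energy with nearest-neighbour thermodynamic parameters; instead it uses the Nussinov-style simplification of maximizing the number of nested Watson-Crick and G-U pairs. The function name and the `min_loop` parameter are illustrative assumptions.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Nussinov-style dynamic programming: maximum number of nested base pairs.

    Toy stand-in for energy minimization: every allowed pair scores 1 and
    hairpin loops must contain at least `min_loop` unpaired nucleotides.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: nucleotide i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:
                    # case 2: i pairs with k, splitting the problem in two
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score
    return best[0][n - 1]
```

For the short hairpin-like sequence `GGGAAAUCC`, `max_base_pairs` returns 3 (two G-C pairs and one G-U pair closing an AAA loop). Real predictors replace the unit score with loop-dependent free energies, which is why their output also carries calculated energies.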

Accuracy can be further increased by constraining the predictions with the results of a structure probing experiment. In outline, such an experiment consists of 1) folding the RNA molecule in the appropriate folding buffer and thermal conditions (preferably establishing whether the molecule is functional), 2) treating it with the probing reagent, and 3) detecting reactive sites either by direct electrophoresis of the previously labeled RNA molecule or by reverse transcription with a labeled primer followed by cDNA electrophoresis (slab-gel or capillary).

Commonly used probing reagents include structure‐sensitive endonucleases such as single‐strand

specific nucleases A, I, P1, S1, T2 or mung bean among others (Gite and Shankar, 1995; Ziehler and

Engelke, 2001), double‐strand‐specific nuclease V1 (Ziehler and Engelke, 2001), metal ions ‐ especially

Pb2+ (Kirsebom and Ciesiolka, 2008) or other chemical reagents such as DMS, SHAPE reagents, kethoxal,

CMCT or hydroxyl radicals (Weeks, 2010). Chemical probing is often preferred over enzymatic cleavage

due to better defined behavior and avoiding steric clashes between RNA and the small probing reagent.

Moreover, the lead(II) ions, some SHAPE reagents, X‐ray generated hydroxyl radicals and DMS are

applicable to in vivo probing (Adilakshmi, 2006; Ding et al., 2013; Lindell et al., 2002; Rouskin et al.,


2013; Spitale et al., 2013; Wells et al., 2000), which is considered superior to in vitro probing as it

provides information about RNA molecules in their natural setting.

Researchers studying RNA molecules with conserved structures are in a privileged position since they

can apply the gold‐standard secondary structure prediction method – building comparative structure

models. It relies on supporting the structure hypothesis by observation of compensatory mutations

which change the primary sequence but preserve the secondary structure. It gives results of very high

quality even for large molecules (Gutell et al., 2002) and since it implies that the structure has been

preserved in evolution it strongly suggests that it is functional. When given only a few aligned sequences

it is often beneficial to use the combination of thermodynamic optimization and comparative models, as

described in (Seetin and Mathews, 2012). As more and more genomes are being sequenced, the

comparative methods bring a possibility of genome‐wide searches of conserved structures as applied in

EvoFold (Pedersen et al., 2006). In Paper 4 we describe the development of a method that aims at

combining massive parallel sequencing based nuclease structure probing with the evolutionary

approach.
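
The idea of compensatory mutations can be illustrated with a minimal sketch. It assumes a toy alignment of equal-length RNA sequences and a hypothetical helper, `covariation_support`, that counts how many different canonical (or G-U wobble) pair types occur at two alignment columns. It is not the procedure used by any of the cited tools.

```python
# Illustrative sketch only: the helper name and the toy alignment are assumptions.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covariation_support(alignment, i, j):
    """Return (number of distinct pair types, number of non-pairing sequences)
    observed at 0-based alignment columns i and j."""
    pair_types = set()
    non_pairs = 0
    for seq in alignment:
        pair = (seq[i].upper(), seq[j].upper())
        if pair in CANONICAL:
            pair_types.add(pair)
        else:
            non_pairs += 1
    return len(pair_types), non_pairs

# Toy alignment: columns 1 and 8 change in sequence but always remain complementary.
alignment = [
    "GGCAUUCGCC",
    "GACAUUCGUC",
    "GCCAUUCGGC",
]
support = covariation_support(alignment, 1, 8)
```

Several distinct pair types with no non-pairing sequences, as at columns 1 and 8 here, is the pattern that supports a base pair in a comparative model. A single pair type only shows conservation.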

7.1.2.3 Tertiary structure determination

Just as the comparative structural model of RNA secondary structure was first built for tRNA molecules

(Madison et al., 1966), tertiary RNA structure determination was also pioneered using the structure‐

conservation approach (Levitt, 1969) and was soon after mastered with X‐ray crystallography (Kim

et al., 1974). X‐ray crystallography is now the method of choice for studying the tertiary structure of

biological molecules, including complex RNA (Ban et al., 2000; Wimberly et al., 2000). Although it is

capable of producing data of very high resolution, this method has many drawbacks. Producing

RNA crystals is time consuming, requires specialized equipment and skills, success is not guaranteed and

molecules are observed under artificial conditions. Moreover, producing suitable crystals often requires

molecular engineering to stabilize the structures (Ke and Doudna, 2004).

Another experimental method borrowed from studying the three dimensional structures of proteins is

NMR spectroscopy. Its advantage is that it provides information about the molecules in solution, but

like X‐ray crystallography it has high equipment and skill requirements, and additionally

it is limited to molecules of up to 100 nt (Furtig et al., 2003).

Automated prediction of RNA tertiary structure has been approached with different methods. Since

molecular dynamics simulations are prohibitively computationally demanding, alternative methods have

been developed. Structures have been built using phylogenetic information (Michel and Westhof,

1990), a simplified energy function (Das and Baker, 2007), assembly from nucleotide cyclic motifs

(Parisien and Major, 2008) or probabilistic modeling (Frellsen et al., 2009). Despite ongoing

improvements, automatically generated predictions mostly deviate substantially from the experimentally

solved structures (Laing and Schlick, 2010).

Considering the difficulties of obtaining high resolution structural data experimentally

and the limitations of computational predictions, it is advantageous to use easier to obtain low resolution

experimental data to guide molecular modeling. One of the approaches relies on small‐angle X‐

ray scattering (SAXS), which can be applied under the native conditions of RNA molecules to obtain low‐

resolution electron density maps, which are especially useful for studying conformational changes

(Lipfert and Doniach, 2007). Another approach, requiring only standard molecular biology laboratory

equipment, is to measure the hydroxyl radical reactivity of different nucleotides and use it as

a guide for 3D modeling refinement (Ding et al., 2012). Hydroxyl radical probing coupled with next

generation sequencing is the method described in Paper 2.

7.1.3 Interactions between RNA and antisense oligonucleotides

One of the reasons for studying RNA structure is its influence on the efficiency of antisense drugs. Interactions

between antisense oligonucleotides (ASOs) and RNA depend on multiple parameters such as sequence,

solvent parameters (usually physiological), RNA structure and bound proteins (Kedde et al., 2007). The

term RNA accessibility (not to be confused with the backbone solvent accessibility measured with the

hydroxyl radical probing) is used, which can be broadly defined as the ability of RNA “to form stable

complexes with complementary oligonucleotides” (Allawi et al., 2001). Various experiments were

proposed for assessing RNA accessibility such as measuring the oligonucleotide‐RNA association with

dialysis, arrays of oligonucleotides or detection by enzymatic reaction (RNAse H or reverse

transcriptase), as summarized in (Allawi et al., 2001). Importantly, ASOs targeted towards accessible

regions downregulate gene expression more efficiently (Allawi et al., 2001).

Apart from experimental methods, several computational approaches for assessing RNA accessibility

have been described. Some of them calculated accessibility as a difference between energy of ASO‐RNA

hybridization and probe intramolecular folding energy (Luebke et al., 2003) or RNA intramolecular

folding (cost of removing pairs in a given region) (Lu and Mathews, 2008). Others predict RNA structure

locally in the sliding window and assess the probability that a given region is base paired (Tafer et al.,

2008). Interestingly, the local structure prediction approach has been shown to be superior to the global

one (Lange et al., 2012). Paper 3 concerns the study of RNA‐ASO interactions.

7.1.4 Massive parallel sequencing

Recent years brought a revolution in DNA sequencing with so called High‐Throughput or Next‐

Generation Sequencing (NGS) technologies. Various NGS systems currently compete on the market, but

all of them are based on sequencing of the short stretches of the multiple DNA molecules

simultaneously (hence called massive parallel sequencing), yielding up to 4G reads per instrument per

run (Illumina HiSeq 2500). This unprecedented technological advance facilitated the emergence of entirely

new methods, such as genome sequencing, exome sequencing, RNA sequencing (Ozsolak and Milos,

2011), microRNA sequencing, crosslinking and immunoprecipitation sequencing (Hafner et al., 2010;

Konig et al., 2010; Licatalosi et al., 2008), chromatin immunoprecipitation and sequencing (Furey, 2012),

ribosome profiling (Ingolia et al., 2009), sequencing based RNA structure probing (Kertesz et al., 2010;

Underwood et al., 2010) and many other methods.

Samples, before sequencing with the Illumina technology (which was used throughout

the thesis), must be transformed into suitable sequencing libraries that can bind a flow cell, generate

clusters in a bridge PCR amplification (with primers covalently attached to the flow cell) and hybridize

with the sequencing primers. The sequencing can be performed sequentially with three different

primers, first for the first sequencing read, optional second for the second sequencing read if paired‐end

sequencing is performed and the third primer, which reads out the sample specific index and allows for

distinguishing different samples in multiplexed sequencing. The sequencing reaction is based on a

sequencing‐by‐synthesis approach. In each cycle primers hybridized to the clustered amplicons, which

are derived from a single molecule in the library, are extended by one nucleotide bearing a fluorescently

labeled extension terminator, with the fluorescent group being nucleotide‐specific. Next, the flow cell is

scanned for the colors of clusters, and the identity of the nucleotide attached to each cluster is saved

together with a quality score. Following scanning, the terminators are removed and the cycle is repeated.

The final result of the sequencing is a FASTQ file that contains, for each cluster, information about its

position within the flow cell, its sequence and the quality at each nucleotide.
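
A minimal sketch of reading such a file is given below. It assumes four-line FASTQ records, Phred+33 quality encoding and the colon-separated Illumina header layout used by CASAVA 1.8 and later (instrument, run, flow cell, lane, tile, x, y). Older pipelines use a different header, and the function names and the example record are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, qualities) from four-line FASTQ records.
    Qualities are decoded assuming Phred+33 encoding."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, [ord(c) - 33 for c in quality]

def cluster_position(header):
    """Extract (lane, tile, x, y) from a CASAVA 1.8 style header (assumed layout)."""
    fields = header.split(" ")[0].split(":")
    lane, tile, x, y = (int(v) for v in fields[3:7])
    return lane, tile, x, y

# Made-up record for illustration only.
example = io.StringIO("@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
```

For real data the same generator can be fed an open file handle instead of the `io.StringIO` object.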

7.1.5 Application of the massive parallel sequencing for RNA structure determination

Several protocols have been established aiming at harnessing massive parallel sequencing for RNA

secondary structure probing detection. All of them utilized traditional probing reagents (structure

sensitive nucleases or chemicals leading to the RNA strand cleavage or modification) but eliminated the

need for electrophoretic separation of nucleic acids by detecting the sites of modifications with

sequencing. In the year 2010 three competing approaches were published. Two of them, parallel

analysis of RNA structure (PARS) (Kertesz et al., 2010) and FragSeq (Underwood et al., 2010) are based

on limited (as in traditional RNA structure probing) nuclease digestion and detection of cleavage sites as

sites to which the sequencing adapter has been ligated. They used different enzymes and data analysis

schemes. In the PARS method, the ratio between cleavage extent of double‐strand specific nuclease V1

and single‐strand specific nuclease S1 has been used to determine the state of a given nucleotide. On

the other hand the FragSeq method used only one enzyme, a single‐strand specific nuclease P1, to

determine which bases are single stranded and compared the cleavage extent to the cleavages observed

in the untreated control. The third method, dsRNA‐seq (Zheng et al., 2010), focused on finding double

stranded RNA regions by extensive degradation of single stranded RNA with RNase I and sequencing the

remaining RNA. The three described methods were based on ligating the sequencing adapters to the

RNA at the site of probing. This approach cannot be applied if a non‐cleaving probe is to be used,

as in SHAPE probing. Resolving that issue was the objective of the next NGS based

method of RNA structure investigation – SHAPE‐Seq (Lucks et al., 2011). In SHAPE‐Seq the reverse

transcription terminates upon reaching the modification and the adapter is ligated to the cDNA

terminus. This method cannot be used for transcriptome‐wide studies, because it requires a specific

reverse transcription primer, which can anneal only to artificially introduced 3’ end cassettes. This

limitation has been resolved in two recently published papers which used DMS for in vivo RNA

secondary structure probing (Ding et al., 2013; Rouskin et al., 2013), with one comparing the extent of

terminations between treated and control sample (Ding et al., 2013) and the other taking advantage of

the novel selection protocol (Rouskin et al., 2013).
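
The ratio between V1 and S1 cleavage used in PARS, described earlier in this section, can be sketched as a per-nucleotide log ratio. The sketch assumes that the two count vectors are already normalized for sequencing depth and adds a pseudocount of 1 to avoid division by zero. Both choices, and the function name, are illustrative assumptions rather than the published scoring scheme.

```python
import math

def pars_like_score(v1_counts, s1_counts, pseudocount=1.0):
    """Per-nucleotide log2 ratio of double-strand (V1) to single-strand (S1) cleavage.
    Positive values suggest a paired nucleotide, negative values an unpaired one."""
    return [
        math.log2((v + pseudocount) / (s + pseudocount))
        for v, s in zip(v1_counts, s1_counts)
    ]

# Made-up counts for three nucleotides.
scores = pars_like_score([7, 0, 3], [0, 7, 3])
```

A single-enzyme design such as FragSeq would instead compare the nuclease-treated counts with those of the untreated control.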

7.2 Project motivations

As exemplified above, knowledge of RNA structure is key to understanding many biological

phenomena and constitutes an important parameter in rational ASO design (Vickers et al.,

2000). Computational methods for RNA secondary structure prediction are often useful for superficial

assessments of hypotheses but they suffer from many limitations. Moreover, methods for tertiary

structure prediction from sequence alone are even less reliable. The accuracy of predictions can be

increased when providing the structure building algorithms with experimentally obtained constraints for

both secondary and tertiary structure predictions (Ding et al., 2012; Reuter and Mathews, 2010).

Unfortunately, performing the traditional structure probing experiments is a time consuming task,

requiring at least standard molecular biology laboratory equipment and a separate analysis of each

molecule (in the case of long molecules the analysis must be split into smaller parts). Development of

massive parallel sequencing allowed simultaneous structural probing of complex mixtures of RNA

molecules, remarkably increasing the throughput and covering in one experiment millions of bases “which is

approximately 100‐fold more than all published RNA footprints to date” (Kertesz et al., 2010).

Inspired by the early applications of NGS for RNA structure probing (Kertesz et al., 2010; Lucks et al.,

2011; Underwood et al., 2010), we aimed at strengthening the field with the development of both

experimental and data analysis methods. First, we needed a system for sequencing library preparation

that is flexible, easy to adapt for other applications and compatible with the standard, multiplexed

Illumina sequencing. We describe its design in Paper 1.

Establishment of this method has opened multiple research opportunities for us. Recently published

development of the computational methods guiding the tertiary RNA structure predictions (Ding et al.,

2012) suggested that investigating the RNA three dimensional structures by detecting hydroxyl radical

footprinting (HRF) signal with NGS will open an avenue for structure predictions of multiple long

molecules simultaneously. We show the method of coupling HRF with massive parallel sequencing in

Paper 2.

Our collaboration with the pharmaceutical company Santaris Pharma A/S which specializes in the

development of Locked Nucleic Acid (LNA) based ASOs led us to investigate how the oligonucleotides

interact with transcripts. For that purpose we have again used our established sequencing protocol. The

realization that the signal would contain a very high level of noise led us to develop methods of

enriching for the desired signal, which we describe in Paper 3.

RNA structures are especially prominent within 3’ untranslated regions (3’ UTRs). 3’ UTRs are mRNA

segments known to be involved in the regulation of gene expression, and their function partly depends on

their specific fold (Bartel, 2009; Kuersten and Goodwin, 2003; Szostak and Gebauer, 2013; Wan et al.,

2014). Together with our collaborators from the University of California, Santa Cruz (who established the FragSeq

method) and Aarhus University (experienced with comparative analysis of RNA structure) we aimed at

developing a method for profiling the structures of 3’ UTRs. To utilize our combined experience we

planned an experiment that uses a library generation method similar to the one described in Paper 1

to map the nuclease cleavage sites (as in FragSeq), performed with RNA samples

from different species to allow the use of structure conservation information (Paper 4).

Studying RNA structure with NGS raises many issues on how to properly interpret the data, including

the need to resolve method specific biases, such as PCR bias (Weeks, 2011). Moreover, the custom

methods of library preparation (as applied throughout this thesis) require custom data analysis since the

questions and assumptions of the available programs do not fit the experimental design. To address

those issues we aimed at developing computational methods of correcting the PCR bias and of the signal

normalization. The novel PCR bias correction method is described in Paper 2 and is also applied in the

Paper 3. Regarding the data normalization, we found ourselves in two different situations – obtaining

paired‐end (Paper 2) or single‐end (Paper 4) reads, for which we have proposed two different, albeit

related, normalization workflows.

8 Summary of the results in the papers and their relation to the international state‐of‐the‐art

8.1 Paper 1

In the paper Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive

Parallel Sequencing we give a detailed protocol of sequencing library preparation intended for detecting

reverse transcription termination sites (RTTS), called here RTTS‐Seq. Traditionally, RTTS were detected

with the slab‐gel or capillary electrophoresis in a wide range of applications such as finding transcripts 5’

termini (Simpson and Brown, 1995), RNA secondary structure probing with the SHAPE or other reagents

(Weeks, 2010), tertiary RNA structure probing or RNA‐protein interactions footprinting with hydroxyl

radicals (Adilakshmi, 2006) or identification of nucleotide modifications (Motorin et al., 2007). Our NGS

based protocol can in principle be used with all of the abovementioned procedures, but allows for much

higher throughput and easier data analysis.

To perform the experiment, we carry out reverse transcription using a primer with

an Illumina adapter overhang. Having synthesized cDNA that terminates upon reaching the feature of

interest, we ligate the adapter to its 3’ end (RTTS) with a single‐strand DNA ligase. After the ligation, we

finish the library construction with a PCR that adds the sample specific index, which allows multiple

samples to be mixed and sequenced in a multiplexed run. The structure of our sequencing libraries is

shown in Figure 1.

Apart from the experimental workflow, we describe the data analysis procedure and publish necessary

scripts focusing on users without extensive bioinformatics experience. We explain how to perform the

initial processing, mapping, trimming and how to visualize the data in the popular UCSC Genome

Browser. We introduce the concept of trimming from the reads the nucleotides added by the reverse

transcriptase via its terminal transferase activity, which would otherwise shift the mapped signal

upstream in the RNA molecule.

The RTTS‐Seq is similar to the procedure described in the SHAPE‐Seq paper (Lucks et al., 2011), but is

compatible with the Illumina multiplexed paired‐end DNA sequencing and with the random priming,

alleviating the need for introducing the structure cassette in the probed RNA. A recently published

method aiming at finding RTTS with massive parallel sequencing, MAP‐Seq (Seetin et al., 2014), has a design very

similar to that of RTTS‐Seq but is designed to work with a fixed primer, which does not

allow for transcriptome‐wide searches. On the other hand, the MAP‐Seq protocol allows for skipping the

PCR step, avoiding some of the biases.

Figure 1. Schematic view of the Illumina sequencing library.

8.2 Paper 2

The paper Massive parallel sequencing based hydroxyl radical probing of RNA accessibility concerns

applying the method described in the Paper 1 for the tertiary RNA structure probing with hydroxyl

radical footprinting (HRF). The HRF is a well established method for measuring the nucleic acid backbone

solvent accessibility (Tullius and Greenbaum, 2005). Traditionally, the signal has been detected with the

electrophoresis of either end‐labeled, cleaved RNA molecule or of the primer extension product. Here,

by applying the modified RTTS‐Seq, we substitute the electrophoretic separation with the sequencing

allowing HRF‐Seq to probe multiple, long RNA molecules simultaneously.

The paper describes the analysis of two RNA molecules with the crystallographically solved three

dimensional structures – Bacillus subtilis RNase P specificity domain and Escherichia coli 16S ribosomal

RNA. The RNA molecules were probed with hydroxyl radicals and were used as templates for the

sequencing libraries preparation. Reverse transcription was performed with either a single primer

(RNase P) or with the random primers (16S ribosomal RNA). As in the RTTS‐Seq, we have ligated the

adapter to the cDNA 3’ end, amplified the libraries with PCR and sequenced with the paired‐end

protocol.

The major novelty of the paper comes from the proposed data analysis workflow. First, we have

mapped the pairs of reads to the analyzed molecules, defining the start and the end of the insert,

corresponding to the RTTS and the priming site, respectively. At this step, many inserts had the same

start and end positions, raising the question which copies are derived from true biological replicates and

which are simply PCR duplicates. To resolve that issue we have used a 7 nt random barcode introduced

during adapter ligation and developed a framework for calculating estimated unique counts (EUC) of

each repeated insert based on the random sampling of unequally probable barcodes. Working with EUC

instead of raw counts gave us the advantage of alleviating PCR bias and allowing for proper use of count

statistics. This is a similar outcome to that offered by the amplification free MAP‐Seq, but in our

case we avoid working with very small amounts of material, which can be troublesome in certain

applications.
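
The reasoning behind counting unique molecules from random barcodes can be sketched under a strong simplification: all 4^7 barcodes are assumed equally probable, whereas the framework in the paper accounts for unequally probable barcodes. Under that assumption, if n unique molecules share one insert, the expected number of distinct barcodes is K(1 - (1 - 1/K)^n) with K = 4^7, and the hypothetical function below inverts this relation.

```python
import math

def estimate_unique_count(observed_barcodes, barcode_length=7):
    """Estimate how many unique molecules produced the observed barcodes of one insert.
    Simplifying assumption: every barcode is equally likely (not the thesis model)."""
    n_possible = 4 ** barcode_length
    n_distinct = len(set(observed_barcodes))
    if n_distinct == 0:
        return 0.0
    if n_distinct >= n_possible:
        raise ValueError("all possible barcodes observed; the estimate is unbounded")
    # Invert E[distinct] = K * (1 - (1 - 1/K) ** n) for n.
    return math.log(1 - n_distinct / n_possible) / math.log(1 - 1 / n_possible)

# Three reads of the same insert, two of them PCR duplicates (same barcode).
estimate = estimate_unique_count(["AAAAAAA", "AAAAAAA", "CCCCCCC"])
```

With few distinct barcodes the estimate is close to the number of distinct barcodes. It rises above that number as the barcode space saturates, which is where raw distinct counts would underestimate the true number of molecules.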

We have defined the coverage at a given location as the sum of EUC of inserts spanning it, and

calculated the termination coverage ratio (TCR) by dividing the EUC of inserts terminating at a given

location by the coverage. To estimate the extent of cleavages induced by HRF at a given location, we

needed to consider spontaneous reverse transcription terminations. We have calculated the ΔTCR,

which is a difference between TCRs of a hydroxyl radical treated and control samples. As expected, ΔTCR

correlates with the ribose solvent accessibility as measured from the crystal structures. The concept of

ΔTCR is analogous to the concept of signal intensity presented in the QuSHAPE method (Karabiber et al.,

2013), but brings it to a realm of information‐rich NGS data.

[Figure 1 labels: flow cell binding; first and second read sequencing primer binding; DNA insert; sample specific index]
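
The definitions of coverage, TCR and ΔTCR given above can be written down as a short sketch. It assumes that each insert is a tuple (start, end, euc) in 0-based, inclusive coordinates, with start being the termination site and end the priming site, and that positions without coverage get a TCR of 0. The function names and the example numbers are hypothetical.

```python
def termination_coverage_ratio(inserts, length):
    """TCR per position: EUC of inserts terminating there divided by the
    summed EUC of all inserts spanning that position."""
    terminations = [0.0] * length
    coverage = [0.0] * length
    for start, end, euc in inserts:
        terminations[start] += euc
        for pos in range(start, end + 1):
            coverage[pos] += euc
    return [t / c if c > 0 else 0.0 for t, c in zip(terminations, coverage)]

def delta_tcr(treated, control, length):
    """Difference between the TCR of the treated and the control sample."""
    tcr_treated = termination_coverage_ratio(treated, length)
    tcr_control = termination_coverage_ratio(control, length)
    return [t - c for t, c in zip(tcr_treated, tcr_control)]

# Made-up inserts on a 5 nt molecule.
treated = [(0, 4, 2.0), (2, 4, 2.0)]
control = [(0, 4, 3.0), (2, 4, 1.0)]
result = delta_tcr(treated, control, 5)
```

In this toy example position 2 has a TCR of 0.5 in the treated sample and 0.25 in the control, so the treatment-specific signal is 0.25. Position 0 terminates all remaining cDNA in both samples and therefore gives no difference.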

8.3 Paper 3

Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides

(LNA‐Stop‐Seq) describes development of a method for mapping hybridization sites of an

oligonucleotide with a complex mix of transcripts. The antisense oligonucleotides (ASOs) form a new

class of pharmaceuticals, with two drugs approved for medical use by the U.S. Food and Drug

Administration – fomivirsen intended to treat cytomegalovirus retinitis and mipomersen targeting ApoB

transcript in patients with familial hypercholesterolaemia (Jones, 2011) and many more in clinical trials

(Rayburn and Zhang, 2008). Action of ASOs starts with the hybridization to their intended target RNA

and several mechanisms of action have been utilized, including target degradation and splicing or

function alteration. Efficient ASOs need to be chemically modified to prevent their degradation and to

increase potency. One of the proposed modifications is the use of LNA nucleotide analogues, which

protect from nucleases and increase the affinity (Koch et al., 2008). High affinity of the molecules leads

to the risk of causing hybridization‐dependent toxicity if the non‐targeted sequences are similar enough

to interact with the drug, creating off‐target effects (Lindow et al., 2012). Here we describe the process

of finding the off‐target binding sites of the potential LNA‐containing therapeutic molecule (Straarup et

al., 2010) in the mouse transcriptome. Proposed detection of the ASO‐RNA interaction sites is based on

the crosslinking of the hybridized oligonucleotides via 4‐thiothymidine (4‐thio‐T), biotin‐based selection

and detecting the locations with sequencing.

First, we describe various optimization steps, such as choice of the reverse transcriptase, way of

separating the non‐crosslinked oligonucleotides from the target (LNA modified oligonucleotides bind the

RNA with the affinity high enough to stop the reverse transcription even without the crosslinking),

deciding where in the oligonucleotide the 4‐thio‐T modification should be incorporated and for how

long the crosslinking should be performed.

Expecting the number of hybridization sites to be very limited as compared to the number of RNA 5’

ends, we needed to develop a method to enrich RTTS pool for the molecules that actually are derived

from the termination at the oligonucleotide rather than being derived from the mRNA 5’ ends or from

the spontaneous cDNA synthesis termination. We present two strategies of enrichment, both supported

by experimental evidence. One of the approaches is based on modifying RNA to bear 5’ phosphates and

degrading it with a 5’ phosphate dependent exonuclease, which terminates upon reaching the crosslinked

oligonucleotide. This method is related to the one proposed in the RNase R exonuclease based SHAPE

modification detection procedure (Steen et al., 2010), where the covalent adduct terminates the

exonucleolytic degradation. In our setting we observe the remaining RNA after that treatment to be

composed of the RNA part downstream from the crosslinked ASO.

Another enrichment approach is based on utilizing the CAGE selection system (Takahashi et al., 2012),

but instead of selecting for the biotinylated 5’ cap structures we select for the biotin‐modified RNA‐

crosslinked oligonucleotides. One of the crucial steps of the CAGE selection is the RNase I degradation of

RNA that is not protected by the cDNA. In our scenario, it was vital that the RNA fragment between the

cDNA 3’ end and the crosslinked, biotinylated ASO is protected from the cleavage, which was shown to

be the case. After the RNase I cleavage, the RNA‐cDNA hybrids are bound to the streptavidin beads via

the biotinylated oligonucleotide and only the cDNA molecules that extended up to the ASO are kept and

their RTTS are sequenced.

Finally, we have prepared the CAGE‐like selected sequencing library and the non‐selected control. The

non‐selected sample is comparable to the HRF‐Seq dataset, but instead of probing with the hydroxyl

radicals, probing with the oligonucleotide was performed, and the expected target site gives a clear

signal of reverse transcription terminations. Comparison of the non‐selected with the selected samples

shows that the selection removes a large portion of the background signal (as expected) but also

introduces peaks along the transcript that are difficult to interpret. The sequence of the used oligonucleotide can

be recapitulated from the enrichment profile, indicating that the selection enriches for hybridization

partners.

Interestingly, we were able to find certain clear spots of interaction that would have been difficult to

define using traditionally performed in silico screening (Lindow et al., 2012), which raises hopes that the

further analysis of the dataset would allow defining new rules of hybridization. It is worth noting that

the off‐targets as defined by the LNA‐Stop‐Seq would not necessarily affect the transcript level, as we

don’t check for the ability of the duplex to trigger the action. Moreover, we have only tested the

transcripts present in liver, possibly missing physiologically relevant interactions with transcripts from

other tissues, which was an issue raised when discussing the use of microarrays for finding off‐targets

(Lindow et al., 2012).

8.4 Paper 4

The last included manuscript, The search for functional RNA secondary structures within 3’

untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2), is

focused on parallel probing of the secondary structure of the mRNA 3’ UTRs. 3’ UTRs are platforms for

translational regulation of gene expression, with their structure playing an important role via e.g.

microRNA or protein binding modulation. This work borrows on one side from the established

experimental protocols of FragSeq (Underwood et al., 2010) and PARS (Kertesz et al., 2010) which

combined the enzymatic probing with the high‐throughput sequencing, and on the other side from the

EvoFold (Pedersen et al., 2006), the method that uses the evolutionary information for the functional

structures determination.

The presented method relies on in vitro RNA folding followed by probing with a single‐strand specific

nuclease P1 in two different concentrations, probing with a double‐strand specific RNase V1, and

random shattering with magnesium ions at elevated temperature. To the cleavage‐generated

5’ phosphates (magnesium shattering required the phosphorylation reaction) an adaptor is ligated and

the RNA is reverse transcribed using the oligo‐dT primer bearing the 5’ adaptor, which focuses our assay

on the 3’ regions of the mRNA molecules. Synthesized cDNA is used for a PCR and sequenced with the

Illumina single‐read protocol, reading out the nuclease cleavage positions. In total, we have probed four

liver RNA samples: human, dog and mouse poly(A), and the ribosome depleted mouse sample. The

multiple species experimental design allows harnessing not only nuclease probing information, but also

the evolutionary conservation.

After mapping, we observed that some of the reads contained the information about the priming site

location, and we used that for the data normalization. Upon initial data analysis we have assumed that

the signal distribution from the priming sites is a function of (1) exponential decay expected from the

fact that if a reverse transcription stops at a cleavage site it will not be able to detect the cleavage sites

upstream in the RNA molecule, and (2) the size selection, that lowered the chance of observing the

short products. This required applying a novel normalization scheme that would be able to translate the

observed read count to the cleavage efficiency. Inspired by the QuSHAPE method (Karabiber et al.,

2013) we have modeled the extension of cDNA molecules from the priming sites and for each position

have estimated the count of cDNA molecules reaching that position, which can be compared with the

observed number of reads ending at a given site.
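
The first of the two effects, the decay of signal with distance from the priming site, can be illustrated with a simplified sketch for a single priming site. It assumes that every cDNA is observed (no size selection and no full-length run-off products), so the number of cDNA molecules reaching a position is the number of reads ending at that position or further upstream. The function name and the counts are hypothetical, and the normalization in the paper is more elaborate.

```python
def stop_rates_from_single_primer(stops_by_distance):
    """stops_by_distance[d] is the number of reads whose cDNA ended d nucleotides
    upstream of the priming site. Returns, for each distance, the fraction of
    cDNA molecules reaching that position which stopped there."""
    remaining = float(sum(stops_by_distance))
    rates = []
    for stops in stops_by_distance:
        rates.append(stops / remaining if remaining > 0 else 0.0)
        remaining -= stops
    return rates

# Raw counts halve with distance although the stop rate is constant.
rates = stop_rates_from_single_primer([50, 25, 25])
```

In this example the raw counts drop from 50 to 25 while the estimated stop rate at the first two positions is the same. This is the reason raw read counts cannot be compared directly between positions at different distances from the priming site.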

We show the structural signal from three classes of RNA molecules, structured spiked‐in RNA (E. coli

transfer‐messenger RNA), known 3’UTR structure (selenocysteine insertion element SECIS) and known,

structured non‐coding RNA (U1 spliceosomal RNA). All three RNA molecules show clear correlation

between the known structure and the P1 and V1 signal. Interestingly, the signal for two SECIS

elements in selenoprotein P mRNA is consistent over all three species tested underscoring the

evolutionary perspective of the method, with very clear, high P1 cleavage rate for the apical loops. The

signal for U1 spliceosomal RNA, available in the enzymatically polyadenylated ribosome depleted mouse

RNA, has been compared with the previously compiled structure (Underwood et al., 2010) and shows

almost perfect agreement.

In the recent years we have witnessed the development of multiple methods of RNA structure probing

detected with massive parallel sequencing (Kertesz et al., 2010; Lucks et al., 2011; Underwood et al.,

2010; Wan et al., 2014). The proposed methods differ in the probing reagents used, the library

preparation protocols and the data analysis workflows. The most recently published methods describe the in vivo

probing approach, which is especially relevant, as the in vitro folded structures may not necessarily be

biologically relevant (Ding et al., 2013; Rouskin et al., 2013). Our way of improving the detection of

the biologically relevant structures is to combine the in vitro probing signal with the conservation signal.

This, in certain situations, may be superior to the in vivo probing approach, as the RNA in vivo may be

present in the functional state for only a limited fraction of the time, making it difficult to detect.

9 Conclusions and perspectives

We have presented four intertwined projects broadly related to investigating RNA structural properties

on the massive scale with the next generation sequencing. We have started with the presentation of the

easy to follow, generic method for sequencing libraries generation that was later applied towards

obtaining a global perspective of the RNA structure: its secondary and tertiary organization, as well as

intermolecular interactions between RNA and antisense oligonucleotides.

We provide insights into RNA structure probing with the NGS, describing biases, ways of tackling them

and the data normalization schemes. We have confirmed that the NGS approach is suitable for the RNA

structure determination, and given the proper data analysis it performs comparably well to the low

throughput, traditional counterparts. The vast amount of gathered data should make it possible to

refine the folding parameters used in the computational prediction programs as well as lead to the

better understanding of used reagents.

Given the rising popularity of using the NGS methods we expect the HRF‐Seq to find an immediate

application with the combination of the HRF‐driven tertiary structure prediction algorithms for the

large‐scale 3D modeling projects (Ding et al., 2012). Such a marriage would make the data analysis much

easier and more reliable by feeding the structural algorithm with the digital data of known uncertainty

(count statistics). The analysis of many, long molecules simultaneously would possibly allow a discovery

of new folding rules. Another, not yet explored avenue for the HRF‐Seq could be performing an

experiment that compares the same set of RNA molecules between different conditions, in which case

we expect the data to be of even higher quality, since sequence‐dependent biases should cancel out.

Moreover, the use of X‐ray radiation would allow us to apply the method for in vivo studies,

answering how well the in vitro probing experiments recapitulate the physiological state.
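To illustrate what "digital data of known uncertainty" could mean in practice, the following is a minimal sketch that assumes the read count at a position behaves like a Poisson variable, so that a count n has a standard deviation of about sqrt(n). The function names and the example counts are hypothetical and are not taken from the HRF‐Seq analysis; the second function shows, under the same assumption, how the uncertainty of a ratio between two conditions might be propagated.

```python
import math

def relative_uncertainty(count):
    """Relative standard deviation of a Poisson-distributed count (1/sqrt(n))."""
    if count <= 0:
        raise ValueError("count must be positive")
    return 1.0 / math.sqrt(count)

def log2_ratio_with_error(count_a, count_b):
    """Log2 ratio of two counts and its approximate standard error.

    First-order error propagation, assuming independent Poisson counts:
    var(ln(a/b)) is approximately 1/a + 1/b.
    """
    log_ratio = math.log2(count_a / count_b)
    std_err = math.sqrt(1.0 / count_a + 1.0 / count_b) / math.log(2)
    return log_ratio, std_err

# Hypothetical counts at one nucleotide in two conditions
ratio, err = log2_ratio_with_error(400, 100)
print(f"log2 ratio = {ratio:.2f} +/- {err:.2f}")
```

Under this assumption, positions with few reads carry a large relative uncertainty, which a downstream structural algorithm could use for weighting.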

In the FragSeq2 paper we describe the probing of RNA secondary structure in vitro, creating a dataset that combines a wealth of information from probing with different nucleases with the conservation signal obtained by probing three different species. So far we have performed the experiments and designed the data normalization procedure. The results are in agreement with the known structures, suggesting that the dataset may also contain information on novel structural elements. The next goal of the project is to perform holistic data mining using the nuclease cleavage data and the evolutionary information. Insights gathered during this analysis may lead us to develop a subsequent version of the combined nuclease‐conservation structure determination approach, in which we would extend the probed sequence space to cover whole transcripts.

We have shown that LNA‐Stop‐Seq can be successfully applied to find the sites of interaction between an ASO and RNA in vitro. We have performed only an initial data analysis, which suggests that we may be able to improve our understanding of this kind of interaction. On the other hand, we did not characterize whether the biotin and 4‐thio‐T influence the hybridization. The procedure can very easily be performed with oligonucleotides of different sequences or chemistries. It was developed with a future in vivo application in mind, where the oligonucleotides would be delivered to cultured cells via transfection. LNA‐Stop‐Seq is the first application of the very specific CAGE‐like selection outside of conventional cap‐trapping, suggesting that this protocol can be adapted to enrich for other interesting RNA properties.

It is worth noting that the FragSeq2 and LNA‐Stop‐Seq methods are parts of bigger collaborations. We have established the experimental protocols and the initial data processing schemes, and we expect future analysis to define new 3′ UTR structures and correlate them with cellular regulation mechanisms (e.g. microRNAs), as well as to define the rules governing the hybridization of LNA‐containing oligonucleotides.

10 References

Adilakshmi, T. (2006). Hydroxyl radical footprinting in vivo: mapping macromolecular structures with synchrotron radiation. Nucleic Acids Research 34, e64‐e64.

Alberts, B. (2002). Molecular biology of the cell, 4th edn (New York, Garland Science).

Allawi, H.T., Dong, F., Ip, H.S., Neri, B.P., and Lyamichev, V.I. (2001). Mapping of RNA accessible sites by extension of random oligonucleotide libraries with reverse transcriptase. RNA (New York, NY) 7, 314‐327.

Andronescu, M., Bereg, V., Hoos, H.H., and Condon, A. (2008). RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics 9, 340.

Bakowska‐Zywicka, K., and Tyczewska, A. (2009). The structure of the ribosome – short history. Biotechnologia 1, 14‐23.

Ban, N., Nissen, P., Hansen, J., Moore, P.B., and Steitz, T.A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289, 905‐920.

Bartel, D.P. (2009). MicroRNAs: Target Recognition and Regulatory Functions. Cell 136, 215‐233.

Butcher, S.E., and Pyle, A.M. (2011). The molecular interactions that stabilize RNA tertiary structure: RNA motifs, patterns, and networks. Acc Chem Res 44, 1302‐1311.

Cantara, W.A., Crain, P.F., Rozenski, J., McCloskey, J.A., Harris, K.A., Zhang, X., Vendeix, F.A., Fabris, D., and Agris, P.F. (2011). The RNA Modification Database, RNAMDB: 2011 update. Nucleic Acids Res 39, D195‐201.

Carthew, R.W., and Sontheimer, E.J. (2009). Origins and Mechanisms of miRNAs and siRNAs. Cell 136, 642‐655.

Crick, F. (1970). Central dogma of molecular biology. Nature 227, 561‐563.

Das, R., and Baker, D. (2007). Automated de novo prediction of native‐like RNA tertiary structures. Proc Natl Acad Sci U S A 104, 14664‐14669.

Ding, F., Lavender, C.A., Weeks, K.M., and Dokholyan, N.V. (2012). Three‐dimensional RNA structure refinement by hydroxyl radical probing. Nat Methods.

Ding, Y., Chan, C.Y., and Lawrence, C.E. (2004). Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32, W135‐141.

Ding, Y., Tang, Y., Kwok, C.K., Zhang, Y., Bevilacqua, P.C., and Assmann, S.M. (2013). In vivo genome‐wide profiling of RNA secondary structure reveals novel regulatory features. Nature.

Do, C.B., Woods, D.A., and Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics‐based models. Bioinformatics 22, e90‐98.

Doudna, J.A., and Cech, T.R. (2002). The chemical repertoire of natural ribozymes. Nature 418, 222‐228.

Frellsen, J., Moltke, I., Thiim, M., Mardia, K.V., Ferkinghoff‐Borg, J., and Hamelryck, T. (2009). A probabilistic model of RNA conformational space. PLoS Comput Biol 5, e1000406.

Froberg, J.E., Yang, L., and Lee, J.T. (2013). Guided by RNAs: X‐inactivation as a model for lncRNA function. J Mol Biol 425, 3698‐3706.

Furey, T.S. (2012). ChIP‐seq and beyond: new and improved methodologies to detect and characterize protein‐DNA interactions. Nat Rev Genet 13, 840‐852.

Furtig, B., Richter, C., Wohnert, J., and Schwalbe, H. (2003). NMR spectroscopy of RNA. Chembiochem 4, 936‐962.

Gan, H.H., and Gunsalus, K.C. (2013). Tertiary structure‐based analysis of microRNA‐target interactions. RNA 19, 539‐551.

Gesteland, R.F., Cech, T., and Atkins, J.F. (2006). The RNA world : the nature of modern RNA suggests a prebiotic RNA world, 3rd edn (Cold Spring Harbor, N.Y., Cold Spring Harbor Laboratory Press).

Gite, S.U., and Shankar, V. (1995). Single‐strand‐specific nucleases. Crit Rev Microbiol 21, 101‐122.

Gutell, R.R., Lee, J.C., and Cannone, J.J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr Opin Struct Biol 12, 301‐310.

Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jungkamp, A.‐C., Munschauer, M., et al. (2010). Transcriptome‐wide Identification of RNA‐Binding Protein and MicroRNA Target Sites by PAR‐CLIP. Cell 141, 129‐141.

Ingolia, N.T., Ghaemmaghami, S., Newman, J.R.S., and Weissman, J.S. (2009). Genome‐Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling. Science 324, 218‐223.

Jones, D. (2011). The long march of antisense. Nature reviews Drug discovery 10, 401‐402.

Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid, accurate, and best‐practices quantification of nucleic acid probing information, resolved by capillary electrophoresis. RNA 19, 63‐73.

Ke, A., and Doudna, J.A. (2004). Crystallization of RNA and RNA‐protein complexes. Methods 34, 408‐414.

Kedde, M., Strasser, M.J., Boldajipour, B., Vrielink, J.A.F.O., Slanchev, K., le Sage, C., Nagel, R., Voorhoeve, P.M., van Duijse, J., Ørom, U.A., et al. (2007). RNA‐Binding Protein Dnd1 Inhibits MicroRNA Access to Target mRNA. Cell 131, 1273‐1286.

Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278‐1284.

Kertesz, M., Wan, Y., Mazor, E., Rinn, J.L., Nutter, R.C., Chang, H.Y., and Segal, E. (2010). Genome‐wide measurement of RNA secondary structure in yeast. Nature 467, 103‐107.

Kim, S.H., Sussman, J.L., Suddath, F.L., Quigley, G.J., McPherson, A., Wang, A.H., Seeman, N.C., and Rich, A. (1974). The general structure of transfer RNA molecules. Proc Natl Acad Sci U S A 71, 4970‐4974.

Kim, V.N. (2005). MicroRNA biogenesis: coordinated cropping and dicing. Nat Rev Mol Cell Biol 6, 376‐385.

Kirsebom, L.A., and Ciesiolka, J. (2008). Pb2+‐induced Cleavage of RNA. In Handbook of RNA Biochemistry (Wiley‐VCH Verlag GmbH), pp. 214‐228.

Koch, T., Rosenbohm, C., Hansen, H.F., Hansen, B., Marie Straarup, E., and Kauppinen, S. (2008). Chapter 5 Locked Nucleic Acid: Properties and Therapeutic Aspects. In Therapeutic Oligonucleotides (The Royal Society of Chemistry), pp. 103‐141.

Konig, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J., Luscombe, N.M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17, 909‐915.

Kortmann, J., and Narberhaus, F. (2012). Bacterial RNA thermometers: molecular zippers and switches. Nat Rev Microbiol 10, 255‐265.

Kuersten, S., and Goodwin, E.B. (2003). The power of the 3' UTR: translational control and development. Nat Rev Genet 4, 626‐637.

Laing, C., and Schlick, T. (2010). Computational approaches to 3D modeling of RNA. J Phys Condens Matter 22, 283101.

Lange, S.J., Maticzka, D., Mohl, M., Gagnon, J.N., Brown, C.M., and Backofen, R. (2012). Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Research.

Leontis, N.B., and Westhof, E. (2001). Geometric nomenclature and classification of RNA base pairs. RNA 7, 499‐512.

Levitt, M. (1969). Detailed molecular model for transfer ribonucleic acid. Nature 224, 759‐763.

Licatalosi, D.D., Mele, A., Fak, J.J., Ule, J., Kayikci, M., Chi, S.W., Clark, T.A., Schweitzer, A.C., Blume, J.E., Wang, X., et al. (2008). HITS‐CLIP yields genome‐wide insights into brain alternative RNA processing. Nature 456, 464‐469.

Limbach, P.A., Crain, P.F., and McCloskey, J.A. (1994). Summary: the modified nucleosides of RNA. Nucleic Acids Res 22, 2183‐2196.

Lindell, M., Romby, P., and Wagner, E.G. (2002). Lead(II) as a probe for investigating RNA structure in vivo. RNA 8, 534‐541.

Lindow, M., Vornlocher, H.‐P., Riley, D., Kornbrust, D.J., Burchard, J., Whiteley, L.O., Kamens, J., Thompson, J.D., Nochur, S., Younis, H., et al. (2012). Assessing unintended hybridization‐induced biological effects of oligonucleotides. Nature Biotechnology 30, 920‐923.

Lipfert, J., and Doniach, S. (2007). Small‐angle X‐ray scattering from RNA, proteins, and protein complexes. Annu Rev Biophys Biomol Struct 36, 307‐327.

Lu, Z.J., and Mathews, D.H. (2008). OligoWalk: an online siRNA design tool utilizing hybridization thermodynamics. Nucleic Acids Res 36, W104‐108.

Lucks, J.B., Mortimer, S.A., Trapnell, C., Luo, S., Aviran, S., Schroth, G.P., Pachter, L., Doudna, J.A., and Arkin, A.P. (2011). Multiplexed RNA structure characterization with selective 2'‐hydroxyl acylation analyzed by primer extension sequencing (SHAPE‐Seq). Proceedings of the National Academy of Sciences of the United States of America 108, 11063‐11068.

Luebke, K.J., Balog, R.P., and Garner, H.R. (2003). Prioritized selection of oligodeoxyribonucleotide probes for efficient hybridization to RNA transcripts. Nucleic Acids Res 31, 750‐758.

Lunde, B.M., Moore, C., and Varani, G. (2007). RNA‐binding proteins: modular design for efficient function. Nature Reviews Molecular Cell Biology 8, 479‐490.

Madison, J.T., Everett, G.A., and Kung, H. (1966). Nucleotide sequence of a yeast tyrosine transfer RNA. Science 153, 531‐534.

Matera, A.G., Terns, R.M., and Terns, M.P. (2007). Non‐coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nat Rev Mol Cell Biol 8, 209‐220.

Mathews, D.H. (2004). Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 10, 1178‐1190.

McManus, C.J., and Graveley, B.R. (2011). RNA structure and the mechanisms of alternative splicing. Curr Opin Genet Dev 21, 373‐379.

Michel, F., and Westhof, E. (1990). Modelling of the three‐dimensional architecture of group I catalytic introns based on comparative sequence analysis. J Mol Biol 216, 585‐610.

Motorin, Y., Muller, S., Behm‐Ansmant, I., and Branlant, C. (2007). Identification of Modified Residues in RNAs by Reverse Transcription‐Based Methods. 425, 21‐53.

Ozsolak, F., and Milos, P.M. (2011). RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12, 87‐98.

Parisien, M., and Major, F. (2008). The MC‐Fold and MC‐Sym pipeline infers RNA structure from sequence data. Nature 452, 51‐55.

Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad‐Toh, K., Lander, E.S., Kent, J., Miller, W., and Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2, e33.

Perona, J.J., and Hadd, A. (2012). Structural diversity and protein engineering of the aminoacyl‐tRNA synthetases. Biochemistry 51, 8705‐8729.

Rayburn, E.R., and Zhang, R. (2008). Antisense, RNAi, and gene silencing strategies for therapy: mission possible or impossible? Drug Discov Today 13, 513‐521.

Reuter, J.S., and Mathews, D.H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11, 129.

Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., and Weissman, J.S. (2013). Genome‐wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature.

Seetin, M.G., Kladwang, W., Bida, J.P., and Das, R. (2014). Massively parallel RNA chemical mapping with a reduced bias MAP‐seq protocol. Methods Mol Biol 1086, 95‐117.

Seetin, M.G., and Mathews, D.H. (2012). RNA structure prediction: an overview of methods. Methods Mol Biol 905, 99‐122.

Serganov, A., and Nudler, E. (2013). A decade of riboswitches. Cell 152, 17‐24.

Simpson, C.G., and Brown, J.W. (1995). Primer extension assay. Methods in molecular biology (Clifton, N J ) 49, 249‐256.

Spitale, R.C., Crisalli, P., Flynn, R.A., Torre, E.A., Kool, E.T., and Chang, H.Y. (2013). RNA SHAPE analysis in living cells. Nat Chem Biol 9, 18‐20.

Steen, K.A., Malhotra, A., and Weeks, K.M. (2010). Selective 2'‐hydroxyl acylation analyzed by protection from exoribonuclease. J Am Chem Soc 132, 9940‐9943.

Steitz, T.A. (2008). A structural understanding of the dynamic ribosome machine. Nat Rev Mol Cell Biol 9, 242‐253.

Straarup, E.M., Fisker, N., Hedtjarn, M., Lindholm, M.W., Rosenbohm, C., Aarup, V., Hansen, H.F., Orum, H., Hansen, J.B.R., and Koch, T. (2010). Short locked nucleic acid antisense oligonucleotides potently reduce apolipoprotein B mRNA and serum cholesterol in mice and non‐human primates. Nucleic Acids Research 38, 7100‐7111.

Szostak, E., and Gebauer, F. (2013). Translational control by 3'‐UTR‐binding proteins. Brief Funct Genomics 12, 58‐65.

Tafer, H., Ameres, S.L., Obernosterer, G., Gebeshuber, C.A., Schroeder, R., Martinez, J., and Hofacker, I.L. (2008). The impact of target site accessibility on the design of effective siRNAs. Nature Biotechnology 26, 578‐583.

Takahashi, H., Kato, S., Murata, M., and Carninci, P. (2012). CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods in Molecular Biology (Clifton, NJ) 786, 181‐200.

Tinoco, I., Jr., and Bustamante, C. (1999). How RNA folds. J Mol Biol 293, 271‐281.

Tullius, T.D., and Greenbaum, J.A. (2005). Mapping nucleic acid structure by hydroxyl radical cleavage. Curr Opin Chem Biol 9, 127‐134.

Ulitsky, I., and Bartel, D.P. (2013). lincRNAs: genomics, evolution, and mechanisms. Cell 154, 26‐46.

Underwood, J.G., Uzilov, A.V., Katzman, S., Onodera, C.S., Mainzer, J.E., Mathews, D.H., Lowe, T.M., Salama, S.R., and Haussler, D. (2010). FragSeq: transcriptome‐wide RNA structure probing using high‐throughput sequencing. Nature Methods 7, 995‐1001.

Vickers, T.A., Wyatt, J.R., and Freier, S.M. (2000). Effects of RNA secondary structure on cellular antisense activity. Nucleic Acids Res 28, 1340‐1347.

Wan, Y., Kertesz, M., Spitale, R.C., Segal, E., and Chang, H.Y. (2011). Understanding the transcriptome through RNA structure. Nat Rev Genet 12, 641‐655.

Wan, Y., Qu, K., Zhang, Q.C., Flynn, R.A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R.C., Snyder, M.P., Segal, E., et al. (2014). Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706‐709.

Weeks, K.M. (2010). Advances in RNA structure analysis by chemical probing. Current Opinion in Structural Biology 20, 295‐304.

Weeks, K.M. (2011). RNA structure probing dash seq. Proc Natl Acad Sci U S A 108, 10933‐10934.

Wells, S.E., Hughes, J.M., Igel, A.H., and Ares, M., Jr. (2000). Use of dimethyl sulfate to probe RNA structure in vivo. Methods Enzymol 318, 479‐493.

Wimberly, B.T., Brodersen, D.E., Clemons, W.M., Jr., Morgan‐Warren, R.J., Carter, A.P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. (2000). Structure of the 30S ribosomal subunit. Nature 407, 327‐339.

Zheng, Q., Ryvkin, P., Li, F., Dragomir, I., Valladares, O., Yang, J., Cao, K., Wang, L.S., and Gregory, B.D. (2010). Genome‐wide double‐stranded RNA sequencing reveals the functional significance of base‐paired RNAs in Arabidopsis. PLoS Genet 6, e1001141.

Ziehler, W.A., and Engelke, D.R. (2001). Probing RNA structure with chemical reagents and enzymes. Curr Protoc Nucleic Acid Chem Chapter 6, Unit 6 1.

Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31, 3406‐3415.

11 Papers

11.1 Paper 1: Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing

The book chapter reprinted with kind permission from Springer Science and Business Media.

Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase

termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol (Springer and

Humana Press) vol. 1038, pp 213‐231. Edited by Noam Shomron

© Springer Science+Business Media New York 2013

Chapter 13

Detection of Reverse Transcriptase Termination Sites Using cDNA Ligation and Massive Parallel Sequencing

Lukasz J. Kielpinski, Mette Boyd, Albin Sandelin, and Jeppe Vinther

Abstract

Detection of reverse transcriptase termination sites is important in many different applications, such as structural probing of RNAs, rapid amplification of cDNA 5′ ends (5′ RACE), cap analysis of gene expression, and detection of RNA modifications and protein–RNA cross-links. The throughput of these methods can be increased by applying massive parallel sequencing technologies.

Here, we describe a versatile method for detection of reverse transcriptase termination sites based on ligation of an adapter to the 3′ end of cDNA with bacteriophage TS2126 RNA ligase (CircLigase™). In the following PCR amplification, Illumina adapters and index sequences are introduced, thereby allowing amplicons to be pooled and sequenced on the standard Illumina platform for genomic DNA sequencing. Moreover, we demonstrate how to map sequencing reads and perform analysis of the sequencing data with freely available tools that do not require formal bioinformatics training. As an example, we apply the method to detection of transcription start sites in mouse liver cells.

Key words: Reverse transcription, Termination, Sequencing, TS2126 RNA ligase, CAGE, Galaxy

1 Introduction

Detection of reverse transcriptase termination sites (RTTS) is a general strategy that can be used to detect different features of RNA, such as their ends [1], modifications [2], structure [3], and binding of proteins [4]. Historically, RTTS have been monitored by fragment analysis using radioactive or fluorescent labelling of the primer used for the reverse transcription and detection with denaturing gel or capillary electrophoresis, respectively. Alternatively, RTTS can be detected by ligating an adapter to the 3′ end of the terminated cDNA, cloning, and sequencing. While fragment analysis has been very successfully used to investigate many different RNA features, the decreasing cost of sequencing makes it increasingly more advantageous to use sequencing for detection of RTTS. It is therefore likely that existing RTTS-based methods will be adapted for sequencing and that new methods will be developed.

Noam Shomron (ed.), Deep Sequencing Data Analysis, Methods in Molecular Biology, vol. 1038, DOI 10.1007/978-1-62703-514-9_13, © Springer Science+Business Media New York 2013

The key step in the detection of RTTS by sequencing is to attach sequencing adapter sequences to the ends of the cDNA. Typically the 5′ adapter sequence is included as an overhang in the gene-specific or random primer used for the first-strand reaction. The next step is the ligation of an adapter to the 3′ end of the terminated cDNA, and several methods for doing this have been developed. In single-strand linker ligation a double-stranded adapter with a 3′ overhang is ligated to the free 3′ end of the RTTS cDNA using T4 DNA ligase [5]. Alternatively, a single-stranded adapter can be used for ligation with the thermostable TS2126 RNA ligase (CircLigase) [6]. The efficiency of both of these enzymes is somewhat biased by the sequence at the very 3′ end of the cDNA that has to be ligated (results not shown), but these biases are reproducible and are therefore not an issue if an appropriate control is used for normalization. Another issue is the ability of reverse transcriptase to add 1–3 untemplated nucleotides to the 3′ end of cDNAs. This occurs more efficiently at capped 5′ ends compared to 5′ ends ending in OH (typical for degraded RNA) [7] and has to be taken into account when sequences are mapped to the RNA being investigated. The added nucleotides allow the reverse transcriptase to perform template switching, which can be exploited to add an adaptor sequence to the 3′ end of cDNAs [8].

Some RTTS methods have successfully been adapted to massive parallel sequencing. Cap analysis of gene expression (CAGE) has been successfully used to identify transcription start sites (TSS) [9]. Originally the CAGE method was based on concatenation of CAGE tags and Sanger sequencing [10], but it has recently been adapted to massive parallel sequencing [1]. Another example is SHAPE-based probing of RNA structure, which has been widely and successfully used for investigating the structure of single RNAs using capillary electrophoresis [11]. Nevertheless, recent results demonstrating that populations of RNA molecules can be SHAPE probed in parallel using sequencing fuel hope that the throughput of structure probing can be increased [12]. These successful implementations of sequencing for RTTS detection suggest that RTTS methods generally can be adapted to the new sequencing technologies.

Here, we describe a general method for detecting RTTS based on the Illumina paired-end genomic DNA adapters, sequencing primer, and indexing reads. Samples can therefore be multiplexed with other samples containing the standard Illumina adaptors and used for both single- and paired-end sequencing. The method can easily be adapted to detect RTTS produced by any experimental protocol. In addition, we demonstrate in detail how to go from the raw sequencing reads to counts of RTTS mapped to the RNA being investigated and how to compare with the existing annotation and visualize the results in the UCSC genome browser. An overview of the entire protocol is shown in Fig. 1.

Fig. 1 Schematic outline of the analysis. The starting material consists of RNA molecules containing a feature of interest, which can cause reverse transcriptase termination. The RNA is reverse transcribed with a primer containing a 5′ adapter overhang. After cDNA purification, a second adapter is ligated to the 3′ ends of the obtained cDNA. Molecules containing both adapters serve as templates for a PCR, which adds all necessary elements for Illumina sequencing. After library sequencing, the resulting sequencing reads are mapped to sequences of interest (this could be the full genome or selected RNA sequences) and the locations of the reads' 5′ ends (corresponding to the feature of interest) are counted. The resulting RTTS count file can be used for further analysis, such as visualization in the UCSC genome browser, producing RTTS plots for specific RNA molecules, and comparing with the existing annotation

2 Materials

2.1 RNA Sample

1. Material to be analyzed: The RNA should be treated in such a way that reverse transcription terminates at the sites of interest. These could be RNA strand breaks, RNA modifications, RNA 5′ ends, or protein–RNA cross-links, among others.

2.2 Oligonucleotides

1. Oligonucleotide sequences are listed in Table 1. RT_random_primer and LIGATION_ADAPTER were HPLC purified, and the remaining oligonucleotides were PAGE purified.

2.3 Reverse Transcription and Purifications

1. PrimeScript™ Reverse Transcriptase including PrimeScript™ 5× buffer (Takara).

2. 10 mM dNTPs.

3. Sorbitol–trehalose mix (1.67 M sorbitol, 0.33 M trehalose).

4. Agencourt® AMPure® XP–PCR Purification (Beckman Coulter).

5. Agencourt® RNAClean® XP (Beckman Coulter).

6. 70 % EtOH.

7. 5 mM Na-citrate pH 6.

8. 10 mM Tris–HCl pH 8.3.

9. RNase H (New England Biolabs).

2.4 Linker Ligation

1. CircLigase (Epicentre).

2. 1 mM ATP (Epicentre).

3. CircLigase buffer (Epicentre).

4. 50 mM MnCl2 (Epicentre).

5. 50 % PEG 6000 (filter sterilized).

6. 5 M glycine betaine (filter sterilized).

2.5 PCR

1. Phusion® High-Fidelity DNA Polymerase (NEB).

2. 5× HF Phusion buffer (NEB).

3. 10 mM dNTPs.

4. H2O (PCR grade).

2.6 Quality Control

1. Agarose electrophoresis.

2. 1× TBE buffer.

3. Agarose.

4. 6× DNA loading buffer (Fermentas).

5. DNA size standard with 150 bp band (e.g., Ultra Low Range DNA ladder, Fermentas).


Table 1
Oligonucleotides used in this study

Name                          Primer sequence
RT_random_primer              AGACGTGTGCTCTTCCGATCTNNNNNNNNS
LIGATION_ADAPTER              5′ phosphate-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′ 3NHC3
PCR_forward                   AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.1_ATCACG    CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.2_CGATGT    CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.3_TTAGGC    CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.4_TGACCA    CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.5_ACAGTG    CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.6_GCCAAT    CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.7_CAGATC    CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.8_ACTTGA    CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.9_GATCAG    CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.10_TAGCTT   CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.11_GGCTAC   CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.12_CTTGTA   CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.13_AGTCAA   CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.14_AGTTCC   CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.15_ATGTCA   CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC   CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

All sequences are 5′–3′
The oligonucleotide sequences of the Illumina genomic DNA adapters are copyrighted by Illumina, Inc. 2006. All rights reserved
LIGATION_ADAPTER is a linker with clonable 5′ end and with an amino-blocked 3′ end
Index sequences are shown in bold


6. Stain G (Serva).

7. Agilent DNA 1000 Kit (Agilent Technologies).
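The indexing primers in Table 1 appear to follow a regular pattern: a constant 5′ part, the reverse complement of the index named in the primer, and a constant 3′ part. The sketch below reproduces that pattern; the pattern is inferred from the table entries rather than stated in the text, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Constant parts shared by all PCR_REVERSE_INDEX primers in Table 1
PREFIX = "CAAGCAGAAGACGGCATACGAGAT"
SUFFIX = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_index_primer(index):
    """Assemble an indexing primer from the index as written in the primer name."""
    return PREFIX + reverse_complement(index) + SUFFIX

# Index 1 (ATCACG) is embedded in the primer as CGTGAT
print(reverse_index_primer("ATCACG"))
```

Such a check can be useful when ordering additional index primers, since a single mistyped base in the index would cause reads to be assigned to the wrong sample or discarded.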

2.7 Equipment

1. Tubes.

(a) RT, purifications, ligation: 0.5 ml PCR tube (BRAND, 781310).

(b) PCR: 0.2 ml 8-Strip tubes (alpha laboratories, LW2500).

2. Thermocyclers.

(a) RT, ligation: MJ Research, PTC-200 for 0.5 ml tubes.

(b) PCR: BIORAD S1000.

3. Magnetic stand.

4. NanoDrop 1000.

5. Agilent 2100 Bioanalyzer.

3 Methods

3.1 Reverse Transcription (Modified from ref. 1)

1. Mix 1 μg of RNA starting material (could be in vitro-transcribed RNA or purified RNA) with 100 pmol of RT_random_primer in a 7.5 μl volume (optimal amount of primer can vary with the specific application). Heat denature for 5 min at 65 °C, and put on ice (see Note 1).

2. Prepare master mix. For one reaction take 7.5 μl 5× PrimeScript Buffer, 1.87 μl 10 mM dNTP, 7.5 μl sorbitol–trehalose mix (which is half of the concentration used in [1]), 9.38 μl H2O, and 3.75 μl PrimeScript enzyme. Add 30 μl master mix to RNA–primer, and mix by pipetting (see Note 2).

3. Incubate as follows: 25 °C, 10 min (skip this incubation if a gene-specific primer is used); 42 °C, 30 min; 50 °C, 10 min; 56 °C, 10 min; 60 °C, 10 min; and hold on 4 °C. The result of the reverse transcription is a cDNA carrying a 5′ adapter and terminating at the feature of interest (Fig. 2a).

4. Inactivate reverse transcriptase enzyme by incubating sample for 15 min at 70 °C, then place on ice, and add 1 μl RNase H enzyme (New England Biolabs, 5,000 U/ml). Incubate for 20 min at 37 °C to degrade the RNA (see Note 3).
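When several samples are processed in parallel, the master mix from step 2 has to be scaled. The following sketch sums the per-reaction volumes (which add up to the 30 μl stated in the protocol) and scales them for a given number of reactions. The 10 % pipetting excess and the function name are assumptions made for illustration and are not part of the protocol.

```python
# Per-reaction volumes (in microliters) from step 2 of Subheading 3.1
RT_MASTER_MIX_UL = {
    "5x PrimeScript buffer": 7.5,
    "10 mM dNTP": 1.87,
    "sorbitol-trehalose mix": 7.5,
    "H2O": 9.38,
    "PrimeScript enzyme": 3.75,
}

def scale_master_mix(per_reaction, n_reactions, excess=0.1):
    """Scale per-reaction volumes to n reactions.

    `excess` is an assumed extra fraction to compensate for pipetting losses.
    """
    factor = n_reactions * (1.0 + excess)
    return {name: round(vol * factor, 2) for name, vol in per_reaction.items()}

per_reaction_total = sum(RT_MASTER_MIX_UL.values())
print(f"Master mix per reaction: {per_reaction_total:.2f} ul")

for name, vol in scale_master_mix(RT_MASTER_MIX_UL, n_reactions=8).items():
    print(f"{name}: {vol} ul")
```

The same function could be applied to the ligation and PCR master mixes by supplying their per-reaction volumes.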

3.2 cDNA Purification (Modified from ref. 1)

1. Add 67.5 μl RNAClean XP beads (room temperature, well mixed) to reactions, and pipette mix. Incubate at room temperature for 30 min, vortexing every 10 min.

2. Put on magnetic stand for 5 min, and aspirate cleared solution.

3. 2× wash with 70 % ethanol (used volume depends on the tubes used; for 500 μl tubes use 400 μl ethanol).

4. Add 40 μl 5 mM Na-citrate (pH 6) preheated to 37 °C, and mix extensively by pipetting. Incubate for 10 min at 37 °C.


5. Place on magnetic stand, and transfer eluant to the new tube (see Note 4).

Fig. 2 Outline of library generation. (a) The first steps in library generation are reverse transcription and ligation of an adapter to the 3′ end of the cDNA, which corresponds to the location of the feature of interest. (b) In the subsequent PCR Illumina adapter sequences are added to produce a double-stranded DNA library that is ready for sequencing on the Illumina genomic DNA platform

3.3 cDNA Ligation

1. Prepare master mix. For one reaction take 1 μl of CircLigase buffer (Epicentre), 0.5 μl of 1 mM ATP, 0.5 μl 50 mM MnCl2, 2 μl of 50 % PEG 6000, 2 μl of 5 M betaine, 0.5 μl 100 μM LIGATION_ADAPTER, and 0.5 μl CircLigase enzyme. Mix well.

2. Split master mix into 7 μl aliquots, and add 3 μl of cDNA.

3. Incubate as follows: 60 °C, 2 h; 68 °C, 1 h; 80 °C, 10 min; and hold on 4 °C.

4. Add 10 μl H2O to increase volume.

5. Purify as in Subheading 3.2, but using Ampure XP beads (20 μl ligation reaction + 36 μl Ampure beads). Elute in 16 μl H2O. The result of the cDNA ligation step is a single-stranded cDNA containing adapters at both the 5′ and 3′ ends, which can be used for the subsequent PCR reaction (Fig. 2b).

3.4 PCR Amplification of Library

1. Prepare master mix. For one reaction take 3 μl PCR_forward 10 μM primer, 10 μl Phusion 5× HF buffer, 1 μl 10 mM dNTPs, 27.5 μl H2O, and 1 μl Phusion DNA polymerase. Mix well.

2. Split master mix into 42.5 μl aliquots, and add 2.5 μl of indexing primer (PCR_REVERSE_INDEX.##_NNNNNN) (see Note 5) and 5 μl purified linker-ligated cDNA. Start the PCR reaction program as follows: 98 °C, 3 min; (98 °C, 80 s; 64 °C, 15 s; 72 °C, 30 s) × 4; (98 °C, 80 s; 72 °C, 45 s) × 15; 72 °C, 5 min; and hold on 4 °C (see Note 6).

3. Agarose electrophoresis (see Note 7). Prepare 2 % agarose gel with a DNA stain (e.g., Stain G). Apply 5 μl of samples (add loading dye) and size standard, and run at 4 V/cm until bromophenol blue from loading dye has travelled approximately 2.5 cm. Visualize under UV light. You should see smears of products longer than 200 bp. Presence of amplified PCR product shorter than 150 bp is typically caused by low amounts of starting material, combined with small amounts of leftover reverse transcription primer in the ligation reaction, and is the result of amplification of directly ligated RT primer–LIGATION_ADAPTER molecules, which can be Illumina sequenced, but is uninformative (see Fig. 3a). To get rid of the short PCR product, try to redo the library with more starting material or alternatively perform agarose gel purification to remove the short PCR product. In case that no amplified library (smear) is detected at this step, perform small-scale PCR with different number of cycles and analyze the PCR reaction by agarose electrophoresis. Then repeat the PCR reaction with the lowest number of PCR cycles that allows for detection of the library on the gel. Optimal number of cycles depends on the amount of starting material.
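Because step 3 recommends adjusting the number of PCR cycles, it can help to keep the cycling program from step 2 in a form where the cycle count is explicit. The sketch below encodes the program as given in the protocol and computes the total number of cycles and the programmed hold time; the data layout and function names are hypothetical, and the time excludes temperature ramping.

```python
# Cycling program from step 2 of Subheading 3.4.
# Each step is (temperature in degrees C, hold time in seconds).
INITIAL_DENATURATION = (98, 180)
CYCLING_STAGES = [
    (4, [(98, 80), (64, 15), (72, 30)]),  # 4 three-step cycles
    (15, [(98, 80), (72, 45)]),           # 15 two-step cycles
]
FINAL_EXTENSION = (72, 300)

def total_cycles(stages):
    """Total number of amplification cycles across all cycling stages."""
    return sum(repeats for repeats, _ in stages)

def programmed_hold_seconds(initial, stages, final):
    """Sum of all programmed hold times (ramping between temperatures not included)."""
    cycling = sum(repeats * sum(hold for _, hold in steps) for repeats, steps in stages)
    return initial[1] + cycling + final[1]

print("cycles:", total_cycles(CYCLING_STAGES))
minutes = programmed_hold_seconds(INITIAL_DENATURATION, CYCLING_STAGES, FINAL_EXTENSION) / 60
print(f"programmed hold time: {minutes:.1f} min")
```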


3.5 Purification and Quantification of Library (See Note 7)

1. Ampure XP purification: as in Subheading 3.2, but use Ampure XP beads and add 72 μl beads to 40 μl PCR reaction. Elute in 20 μl preheated 10 mM Tris–HCl pH 8.3.

2. Measure concentration on NanoDrop (as dsDNA) and run Bioanalyzer DNA 1000. Perform smear analysis (side panel -> Global -> Advanced -> Smear analysis -> regions) with range 140–600 bp and use the molarity as a guideline for your sequencing order. The library should contain dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp (Fig. 3b) (see Note 8).

3. Samples can now be sequenced using standard Illumina genomic DNA sequencing and can be multiplexed with other samples made with the same adapters (genomic DNA) as long as they utilize different indexes (see Note 5).
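When libraries are pooled from a mass concentration rather than from the Bioanalyzer molarity (see Note 8), the conversion to molarity can be sketched as below. This is an illustration only: it assumes the common approximation of 660 g/mol per base pair of dsDNA, and the function names and example values are hypothetical.

```python
def dsdna_molarity_nM(conc_ng_per_ul, mean_length_bp, bp_mass=660.0):
    """Convert a dsDNA mass concentration to nM.

    Assumes an average mass of 660 g/mol per base pair (an approximation).
    """
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_length_bp)


def equimolar_volumes(libraries, target_fmol):
    """Volume (ul) of each library that contains target_fmol of molecules.

    libraries maps a name to (concentration in ng/ul, mean length in bp).
    1 nM equals 1 fmol/ul, so volume = fmol / nM.
    """
    return {
        name: target_fmol / dsdna_molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }


if __name__ == "__main__":
    example = {"index_01": (12.0, 320), "index_02": (8.5, 300)}  # hypothetical values
    for name, vol in equimolar_volumes(example, target_fmol=20).items():
        print(f"{name}: {vol:.2f} ul")
```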

3.6 Data Analysis Using Linux Command Line

1. Data analysis of massive parallel sequencing experiments can be a challenge for scientists without formal training in bioinformatics. Below we demonstrate in detail how to go from the sequencing output (FASTQ file) to an RTTS count file without assuming prior knowledge of bioinformatics, using tools available in Galaxy [13–16], including the Bowtie mapper for sequencing reads [17] and the FASTX toolkit [18]. However, using a Unix or an OS X machine with a command line interface is recommended for large projects. For those users, the analysis implemented in Subheadings 3.7–3.9 can be carried out using Bowtie and an awk script available at this URL: http://people.binf.ku.dk/~lukasz/SAM2counts.awk.

Fig. 3 Expected result from PCR amplification. (a) PCR products are first checked by agarose electrophoresis. A successfully prepared library should form a smear of molecules longer than 150 bp (lane 2). Presence of a band shorter than 150 bp (lane 1) indicates problems with library preparation (see step 3 of Subheading 3.4). (b) Library is purified and checked for size distribution on Agilent Bioanalyzer DNA 1000 chip. A successfully prepared library should have dsDNA molecules of varied length with a considerable fraction being above 200 bp and below 600 bp

3.7 Quality Check of Sequencing Reads

1. Log in to Galaxy (http://usegalaxy.org/) and create a new Galaxy history. Upload the relevant FASTQ files to Galaxy with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTQ file on your computer and click execute (see Note 9).

2. Check the integrity of the FASTQ files with the “FASTQ Groomer” tool found in the “NGS: QC and manipulation” tool category. For newer FASTQ files (Illumina 1.8 and later) the quality is encoded in Sanger format. Choose the Galaxy history item containing the FASTQ file, set “Input FASTQ quality scores type:” to Sanger, and click execute (see Note 10).

3. Compute FASTQ quality statistics with the “Compute quality statistics” tool found in the “NGS: QC and manipulation” tool category. Choose the groomed FASTQ file and click execute.

4. Plot the distributions of quality scores for the different sequencing cycles using the “Draw quality score boxplot” tool found in the “NGS: QC and manipulation” tool category. Choose the Galaxy history item containing the output of the “Compute quality statistics” tool and click execute. Look at the resulting boxplot by clicking on the eye icon next to the “Draw quality score boxplot” history item (see Fig. 4a). For most experiments, where the median quality does not become very low (fall below 25), it is unnecessary to filter the reads on quality. If the quality is very low, it may be an advantage to filter the reads using the “Filter by quality” tool found in the “NGS: QC and manipulation” tool category. Set the “Quality cut-off value” option to 20 and the “Percent of bases in sequence that must have quality equal to/higher than cut-off value” option to 90 and click execute.

5. Plot nucleotide distributions of the different sequencing cycles using the “Draw nucleotides distribution chart” tool found in the “NGS: QC and manipulation” tool category. Look at the resulting plot by clicking on the eye icon next to the “Draw nucleotides distribution chart” history item (see Fig. 4b). The nucleotide distributions are typically similar across the sequencing cycles, but if this is not the case, the library may not have sufficient complexity or may be contaminated with adapter–adapter ligation products.
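For readers working outside Galaxy, the quality filter of step 4 (cut-off 20, at least 90 % of bases at or above the cut-off) can be sketched in Python as follows. This is a minimal illustration, not the FASTX implementation; it assumes Sanger (Phred+33) encoding and unwrapped four-line FASTQ records, and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a Sanger-encoded (Phred+33) quality string into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def passes_filter(quality_string, cutoff=20, min_percent=90):
    """True if at least min_percent of the bases have quality >= cutoff."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    good = sum(1 for q in scores if q >= cutoff)
    return 100.0 * good / len(scores) >= min_percent


def filter_fastq(lines, cutoff=20, min_percent=90):
    """Return the four-line FASTQ records whose quality line passes the filter.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    records = [lines[i:i + 4] for i in range(0, len(lines) - len(lines) % 4, 4)]
    return [rec for rec in records if passes_filter(rec[3], cutoff, min_percent)]
```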

Fig. 4 Expected quality plots of sequencing reads. (a) Example of quality boxplot produced by Galaxy. The plot shows the median read quality in the different sequencing cycles. (b) Example of nucleotide distribution plot produced by Galaxy. The plot shows the percentage of the nucleotides in the different sequencing cycles. Deviation from uniform distribution in the first cycle reflects a combination of bias for specific nucleotides in terminal transferase activity of reverse transcriptase, bias in the TS2126 RNA ligase reaction and, in some cases, biased sequences of the genomic locations being mapped by the RTTS
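The per-cycle nucleotide percentages shown in panel (b) can be computed directly from the read sequences. The sketch below is an assumption-laden illustration (hypothetical function name, reads supplied as plain strings) rather than the code behind the Galaxy tool.

```python
from collections import Counter


def nucleotide_percentages(reads):
    """Return, for every sequencing cycle, the percentage of A, C, G, T and N.

    reads is a list of sequence strings; shorter reads simply do not
    contribute to cycles beyond their length.
    """
    n_cycles = max(len(r) for r in reads)
    table = []
    for cycle in range(n_cycles):
        counts = Counter(r[cycle] for r in reads if len(r) > cycle)
        total = sum(counts.values())
        table.append({base: 100.0 * counts.get(base, 0) / total for base in "ACGTN"})
    return table
```

A strong deviation from a flat profile in cycles after the first few would be the signal described in step 5 of Subheading 3.7.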


3.8 Mapping Reads with Bowtie

1. Depending on the nature of your experiment you can map your reads either to the entire genome relevant for the experiment or to one or more RNA sequences. The genomes of the most commonly investigated species are pre-installed in Galaxy, whereas mapping to one or more specific RNAs requires that the sequence(s) is uploaded to Galaxy as a FASTA file. If necessary, upload a FASTA file with the “Upload File from your computer” tool found in the “Get Data” tool category. Point to the location of the relevant FASTA file on your computer and click execute.

2. To map the reads, use the “Map with Bowtie for Illumina” tool found in the “NGS: Mapping” tool category. If mapping to a genome that is pre-indexed in Galaxy, choose “Use a built-in index” and the relevant genome. Otherwise choose “Use one from history” and select the history item containing the uploaded FASTA file. Next, select the history item containing the groomed (and filtered) FASTQ file under the “FASTQ file” option and choose “Full parameter list” in the “Bowtie settings to use” drop-down menu. Then change “Maximum number of mismatches permitted in the seed (-n)” to 3 and “Maximum permitted total of quality values at mismatched read positions (-e)” to 300 and choose “Use best” in the “Whether or not to make Bowtie guarantee that reported singleton alignments are ‘best’ in terms of stratum and in terms of the quality values at the mismatched positions (--best)” drop-down menu. Finally, map the reads by clicking Execute. The mapping may take a while depending on the size of the FASTQ file and the sequence to be mapped against (see Note 11).
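On the command line, the same settings correspond to the Bowtie options -n, -e and --best, with -S requesting SAM output. The sketch below only assembles the command and does not run it; the index and file names are hypothetical, the index is assumed to have been built beforehand with bowtie-build, and the options should be checked against the help text of the installed Bowtie version.

```python
import shlex


def bowtie_command(index, fastq, sam_out, seed_mismatches=3, max_quality_sum=300):
    """Assemble a Bowtie (version 1) command mirroring the Galaxy settings above.

    index is the basename of an index built with bowtie-build (assumed to exist).
    """
    return [
        "bowtie",
        "-n", str(seed_mismatches),   # mismatches permitted in the seed
        "-e", str(max_quality_sum),   # max total quality at mismatched positions
        "--best",                     # report the best alignment
        "-S",                         # write SAM output
        index, fastq, sam_out,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(bowtie_command("my_rna_index", "reads.fastq", "mapped.sam")))
```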

3.9 Preparing an RTTS Count File from SAM File

1. It is necessary to trim mapped reads that contain untemplated nucleotides added by reverse transcriptase (see Note 12). This trimming requires many Galaxy operations and we have therefore created a Galaxy workflow to perform this operation and subsequently count and sum RTTS. In this procedure reads are trimmed if they contain mismatches in the first three positions (Fig. 5). To download the workflow go to https://main.g2.bx.psu.edu/workflow/list_published and search for RTTS Mapper. Click on the workflow and import it into your own Galaxy account by clicking on “Import workflow” in the upper right corner. Alternatively, if a local instance of Galaxy is used, the RTTS Mapper workflow can be imported into Galaxy by clicking on “Workflow” on the top Galaxy bar and then on the “Upload and import workflow” button in the upper right corner. At this URL https://main.g2.bx.psu.edu/workflow/import_workflow, the workflow can be imported by providing the URL http://people.binf.ku.dk/~lukasz/Galaxy_RTTS_mapper.ga as the “Galaxy workflow URL” and clicking import. Also import a control file to your history by pasting this URL http://people.binf.ku.dk/~lukasz/RTTS_control.interval in the URL/Text window in the “Upload File from your computer” tool found in the “Get Data” tool category.

2. To prepare RTTS count files, use the RTTS Mapper workflow imported above. Click on “Workflow” on the top Galaxy bar and then on the “RTTS mapper” workflow and choose “Run.” Select the history item containing the SAM file from the Bowtie mapping for the “Select dataset to convert” option and the RTTS_control.interval file for the “Select control file” option and click “Run workflow” at the bottom of the page.

3. The resulting RTTS count files (for counts on the plus and minus strand, respectively) can be used for further analysis in R, Excel, or another data analysis program. The exact analysis will depend on the nature of the experiment performed. Below we provide tools for some common types of analysis using the freely available tool R, which can easily be installed on any computer platform [19] (see Note 13).
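The counting part of this procedure can be illustrated with a short Python sketch that tallies read 5′ ends per reference, strand and position from SAM records. It is not the RTTS Mapper workflow or the SAM2counts.awk script: it omits the trimming of untemplated nucleotides, and the function names and the output layout are assumptions.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment (M, D, N, = and X)."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def count_five_prime_ends(sam_lines):
    """Count read 5' ends per (reference, strand, 1-based position).

    Header lines and unmapped reads (FLAG bit 4) are skipped; reads with
    FLAG bit 16 are on the minus strand, so their 5' end is the rightmost
    aligned reference position.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 4:
            continue
        if flag & 16:
            counts[(rname, "-", pos + reference_span(cigar) - 1)] += 1
        else:
            counts[(rname, "+", pos)] += 1
    return counts
```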

3.10 Preparing Wig File and Visualizing in the UCSC Genome Browser

If the RTTS experiments have been mapped to a genome assembly, it will often be advantageous to visualize the results on the UCSC Genome Browser and compare with the many kinds of data available as tracks. To do this it is necessary to convert the RTTS file to the UCSC wig format.

Fig. 5 Schematic representation of the trimming performed by the RTTS mapper. After mapping the reads to the genome, the three 5′ terminal nucleotides of mapped reads (corresponding to 3′ ends of cDNA molecules) are evaluated for mismatches to the reference sequence and trimmed if necessary. The four possible scenarios are the following: full match (a), mismatch at the terminal position (b), position one (c), or position two (d) before the terminal position, in which cases we trim 0, 1, 2, or 3 positions, respectively (returned positions are indicated by the triangles). Red boxes: mismatched positions; white boxes: matched positions
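The trimming rule in the figure can be written as a small function. The sketch below is an interpretation, not the workflow's code: it assumes that when several of the three positions are mismatched, the read is trimmed up to the mismatch furthest from the 5′ terminus, and the function names are hypothetical.

```python
def positions_to_trim(mismatch_flags):
    """Number of 5' terminal read positions to trim.

    mismatch_flags holds three booleans for read positions 1, 2 and 3
    counted from the 5' end (True = mismatch to the reference).
    Returns 0 for a full match, otherwise the 1-based index of the
    mismatch furthest from the terminus (an assumption for reads with
    more than one mismatch).
    """
    trim = 0
    for index, mismatch in enumerate(mismatch_flags[:3], start=1):
        if mismatch:
            trim = index
    return trim


def trimmed_five_prime(position, strand, trim):
    """Shift the reported 5' end inward by the number of trimmed positions."""
    return position + trim if strand == "+" else position - trim
```

With this convention, scenarios (a) to (d) of the figure correspond to the flag patterns (False, False, False), (True, False, False), (False, True, False) and (False, False, True).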


1. Download RTTS count files to your local computer from the Galaxy server by clicking on the floppy disc icon for the relevant history items.

2. The RTTS count file can be converted to wig format by copy/pasting a small program (script) into R. Download the provided script from http://people.binf.ku.dk/~lukasz/wig_generator.R. Open the file in a text editor and modify it by changing the assignment of the variables “input_filename_plus” and “input_filename_min” to the names of the files produced by the Galaxy workflow.

3. Start R and change the working directory to the one containing the RTTS count files by writing setwd('path of file directory') in the console window and pressing enter, or by using the “Change dir” command found in the File menu. Then paste the edited script into the R console and hit enter. This will produce two new files named OUTPUTp.wig and OUTPUTm.wig in the same folder.

4. Go to the UCSC genome browser (http://www.genome.ucsc.edu/cgi-bin/hgGateway) and choose the species and assembly that were used for the mapping of the RTTS experiment in the drop-down menu. Then click “manage custom tracks,” browse the local drive for the wig files, and submit them one by one (after adding the first one press “add custom tracks”). Finally press “go to the genome browser” to see the RTTS counts visualized as a histogram at each genomic position (Fig. 6a).
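The conversion itself amounts to writing positions and counts in the UCSC variableStep wiggle layout. The sketch below shows the idea in Python rather than reproducing wig_generator.R; the input structure (a dictionary keyed by chromosome and 1-based position) and the function name are assumptions.

```python
def counts_to_wig(counts, track_name, negate=False):
    """Format RTTS counts as UCSC variableStep wiggle text.

    counts maps (chromosome, 1-based position) to a count. Set negate=True
    to write minus-strand counts as negative values.
    """
    lines = [f'track type=wiggle_0 name="{track_name}"']
    by_chrom = {}
    for (chrom, pos), n in counts.items():
        by_chrom.setdefault(chrom, []).append((pos, n))
    for chrom in sorted(by_chrom):
        lines.append(f"variableStep chrom={chrom}")
        for pos, n in sorted(by_chrom[chrom]):  # positions in ascending order
            lines.append(f"{pos} {-n if negate else n}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = {("chr1", 120): 3, ("chr1", 100): 5}  # hypothetical counts
    print(counts_to_wig(demo, "RTTS plus"), end="")
```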

3.11 Making Plots for Single RNAs

If the RTTS data was mapped to single RNAs (using a provided FASTA file) rather than the full genome, it will often be relevant to visualize the RTTS counts across each of the different RNAs.

1. Create a new folder and download to it the FASTA file that was used for mapping and the two RTTS count files from the Galaxy history by clicking on the floppy disc icon for the relevant history items. Change the file names of the RTTS count files to counts_plus.txt and counts_minus.txt.

2. Then download this R script http://people.binf.ku.dk/~lukasz/few_genes_histogram.r and open it in a text editor. Start R and set the working directory (as described in step 3 of Subheading 3.10) to the folder containing the FASTA file and the RTTS count files. To generate RNA-specific RTTS plots for each RNA in the FASTA file that has at least one read mapped, copy/paste the script to the R console window and hit enter (Fig. 6b).
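The data preparation behind such per-RNA plots can be sketched as follows. This is not the provided R script: it is a Python illustration that builds one plus-strand and one minus-strand count vector per RNA, with minus-strand counts stored as negative values and RNAs without reads skipped. The input structures and function names are assumptions.

```python
def fasta_lengths(fasta_text):
    """Return {sequence name: length} for a FASTA-formatted string."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def rtts_profiles(fasta_text, counts_plus, counts_minus):
    """Build per-position count vectors for every RNA with at least one read.

    counts_plus and counts_minus map (RNA name, 1-based position) to a count.
    """
    lengths = fasta_lengths(fasta_text)
    profiles = {}
    for strand, sign, counts in (("+", 1, counts_plus), ("-", -1, counts_minus)):
        for (name, pos), n in counts.items():
            if name not in lengths or not (1 <= pos <= lengths[name]):
                continue
            profile = profiles.setdefault(
                name, {"+": [0] * lengths[name], "-": [0] * lengths[name]}
            )
            profile[strand][pos - 1] += sign * n
    return profiles
```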

3.12 Comparing to Annotation Data

In some cases, it will be relevant to compare RTTS data to some kind of annotation to identify global trends. This can be done by summarizing the read counts around a set of locations. We have prepared an R script utilizing Bioconductor [20] for generating such a plot from the RTTS count wigs and an additional file containing either user-supplied genomic locations or RefSeq TSS.

Fig. 6 Example of output produced with the described protocol. Mouse liver RNA was analyzed with the described protocol, including an optional CAGE selection to enrich for RTTS corresponding to transcription start sites. (a) Output of Subheading 3.10. The sequencing data was mapped to the genome, RTTS counted, converted to wig file, and uploaded to the UCSC genome browser. Height of the bar at each genomic location corresponds to the number of read 5′ ends mapping to this location. Minus strand is shown with negative values using a different scale. (b) Output of Subheading 3.11. Reads were mapped to a single sequence (Hmgcs2 mRNA) and the count of 5′ ends at each location was plotted. Reads mapping to the positive strand are shown above zero, while those mapping to the negative strand are shown below zero. (c) Output of Subheading 3.12. Upper plot shows sum of reads at each distance from annotated TSS. High peak at position 1 results from many reads mapped to known TSS, while high peak at position 13 results from an alternative TSS for the highly expressed albumin transcript


1. Prepare a file with a set of locations that will serve as reference points for counting read locations. The file must contain three tab-delimited columns with headers named “seqnames” (the name of the chromosome), “position,” and “strand.” An example is given in the script. Positions must be 1-based (see Note 14).

2. Download the script from this URL http://people.binf.ku.dk/~lukasz/plot_around_locations_from_wig.r and open it in a text editor. In the text editor, edit the input file names to match the two wig files prepared as described in Subheading 3.10 and the position file prepared above. Also edit the genome assembly name and the size of the window surrounding the given positions that is used for summarizing read counts.

3. Start R, set the proper working directory, and copy/paste the script to the R console. This will produce a barplot of the RTTS counts relative to the positions given as reference (Fig. 6c).
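The summarization performed in this subheading can be sketched as a strand-aware sum of counts at each offset from the reference positions. The Python code below is an illustration under stated assumptions and not the provided R script: only counts on the same strand as the reference location are used, offset 0 denotes the reference position itself, positive offsets point downstream in the direction of transcription, and the function name is hypothetical.

```python
def sum_around_locations(counts, locations, window=50):
    """Sum RTTS counts at each offset from a set of reference positions.

    counts maps (chromosome, strand, 1-based position) to a count.
    locations is a list of (seqnames, position, strand) tuples, matching
    the three columns of the location file.
    """
    totals = {offset: 0 for offset in range(-window, window + 1)}
    for chrom, position, strand in locations:
        for offset in totals:
            # Downstream is towards larger coordinates on "+", smaller on "-".
            pos = position + offset if strand == "+" else position - offset
            totals[offset] += counts.get((chrom, strand, pos), 0)
    return totals
```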

4 Notes

1. The amount of starting material can be reduced if necessary. On the other hand, for samples that are to be used for CAGE selection a minimum of 5 μg of RNA is needed. The amount of reverse transcription primer should be scaled with the amount of RNA. The quality of the RNA starting material is very important as degraded RNA will produce background in any type of experiment based on detection of RTTS. Moreover, random priming typically produces more background than gene-specific priming. In CAGE experiments the non-full-length cDNAs are removed in a selection step, thereby effectively reducing the background, but in other applications a negative control sample is required and can be used to normalize for reverse transcriptase pretermination.

2. The priming sequence used in RT_random_primer (..NNNNNNNNS-3′) can be modified according to specific needs. In many cases, such as RNA structure probing, a gene-specific primer with the 5′ overhang sequence can be used (5′-AGACGTGTGCTCTTCCGATCT-“gene specific sequence”).

3. If CAGE selection is to be performed this step should be skipped.

4. Optional selection of full-length cDNA for CAGE analysis of TSS can be performed according to Subheadings 3.3–3.7 [without concentration] as described in [1] and results in a total volume of 34 μl cap-selected RNA.

5. Be careful with low-level pooling of indexes since proper sequencing requires that at each cycle there is at least one nucleotide read by the green laser (G or T) and one read by the red laser (A or C). See more at http://www.epibio.com/pdftechlit/312pl1211.pdf.

6. Using a long denaturation time in the PCR reaction helps alleviate GC bias and fosters reproducibility between different thermal cyclers [21].

7. To simplify the procedure and reduce the risk of contaminating laboratory space with generated libraries, one can, instead of running an agarose gel, analyze and quantify the PCR products on a Bioanalyzer DNA 1000 chip without prior purification. This allows pooling the crude reactions in the right proportions (it is advisable to add EDTA to the reactions before pooling to avoid index switching) and performing only a single Ampure XP purification.

8. If the prepared libraries have the same size distribution, it is possible to pool them based on the NanoDrop-measured concentration.

9. The output from sequencing is one or more FASTQ files containing the sequence reads and the corresponding quality scores. If several indexes were used for different experimental conditions, FASTQ files from each index should be analyzed individually. If using the main Galaxy server and dealing with large datasets (>2 GB), it is an advantage to use the Galaxy ftp upload. A tutorial can be found here: http://screencast.g2.bx.psu.edu/quickie_17_ftp_upload/flow.html. The analysis described below can be carried out on a local instance of Galaxy or on the main Galaxy server (http://usegalaxy.org/). When using the main server be sure to make a login so that your analysis is saved. Alternatively the analysis can be performed on a Unix/OS X machine in-house (see Subheading 3.6).

10. If the dataset consists of several FASTQ files, they can be merged into one file at this point with the “Concatenate datasets” tool found in the “Text Manipulation” tool category to facilitate the further analysis of the full dataset.

11. Other sequencing read mappers can be used instead of the Bowtie mapper. However, it is important not to use too stringent a cutoff for mapping, because a considerable fraction of reads contain untemplated sequence added by reverse transcriptase at the 5′ end. The stringency of mapping conditions should be considered individually for each experiment while taking the quality of the sequencing reads and the complexity of the sequences that are being mapped against into account. When mapping against short sequences the coverage towards the 3′ end can be improved by trimming sequencing reads from the 3′ end.


12. Reverse transcriptase will in some cases add extra untemplated nucleotides after terminating at the 5′ end of the RNA. This is especially pronounced when the 5′ end of the RNA is capped, which is the case for mRNAs. For the conditions described here, we find that the Primescript RT enzyme will add untemplated nucleotides in 81 % of cases for RTTS located closer than 50 nt to an annotated TSS (most of these presumably being capped), while the same is the case for 12 % of the RTTS located elsewhere. It is therefore necessary to trim the reads that have one or more mismatches in the first three mapped positions, which is implemented in the published workflow. In the cases where an untemplated nucleotide matches the genomic sequence, it is not possible to do trimming.

13. R can be freely downloaded for any platform at http://cran.r-project.org/. Scripts are written for version 2.15.

14. At this step the user must ensure that the numbers provided as locations of interest are in a 1-based coordinate system. This system is used, e.g., in the UCSC genome browser display window. Be aware that tables downloaded from the UCSC table browser are provided in a 0-based system. To use TSS information from such a table in the provided script one must add 1 to the starting positions. Read more on coordinate systems at http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms.
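The coordinate conversion in Note 14 can be sketched as follows. The Python function below is an illustration with hypothetical names; it assumes that the table rows hold 0-based, half-open intervals (start inclusive, end exclusive), in which case a plus-strand TSS is the start plus 1, while a minus-strand TSS is the end coordinate, which already equals the 1-based position of the last base.

```python
def tss_one_based(tx_start, tx_end, strand):
    """Return the 1-based TSS position from a 0-based, half-open interval.

    tx_start and tx_end are the transcript start and end as given in a
    0-based, half-open table row (an assumption about the input).
    """
    if strand == "+":
        return tx_start + 1
    if strand == "-":
        return tx_end
    raise ValueError(f"unknown strand: {strand!r}")


if __name__ == "__main__":
    # Hypothetical transcript coordinates, for illustration only.
    print(tss_one_based(999, 2000, "+"))
    print(tss_one_based(999, 2000, "-"))
```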

Acknowledgments

The research was funded by the Danish Council for Strategic Research, the Lundbeck Foundation and the Novo Nordisk Foundation. Morten Lindow and Susanna Obad, Santaris Pharma, provided mouse liver samples and RIKEN/Piero Carninci provided the updated CAGE protocol as well as advice ahead of publication.

References

1. Takahashi H, Kato S, Murata M et al (2012) CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. In: Deplancke B, Gheldof N (eds) Gene regulatory networks, vol 786. Humana, Totowa, NJ, pp 181–200

2. Motorin Y, Muller S, Behm-Ansmant I et al (2007) Identification of modified residues in RNAs by reverse transcription-based methods. Methods Enzymol 425:21–53. doi:10.1016/s0076-6879(07)25002-5

3. Mortimer SA, Weeks KM (2009) Time-resolved RNA SHAPE chemistry: quantitative RNA structure analysis in one-second snapshots and at single-nucleotide resolution. Nat Protoc 4(10):1413–1421. doi:10.1038/nprot.2009.126

4. Konig J, Zarnack K, Rot G et al (2010) iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol 17(7):909–915. doi:10.1038/nsmb.1838

5. Shibata Y, Carninci P, Watahiki A et al (2001) Cloning full-length, cap-trapper-selected cDNAs by using the single-strand linker ligation method. Biotechniques 30(6):1250–1254

6. Li TW, Weeks KM (2006) Structure-independent and quantitative ligation of single-stranded DNA. Anal Biochem 349(2):242–246. doi:10.1016/j.ab.2005.11.002

7. Hirzmann J, Luo D, Hahnen J et al (1993) Determination of messenger RNA 5′-ends by reverse transcription of the cap structure. Nucleic Acids Res 21(15):3597–3598

8. Zhu YY, Machleder EM, Chenchik A et al (2001) Reverse transcriptase template switching: a SMART approach for full-length cDNA library construction. Biotechniques 30(4):892–897

9. Carninci P, Kasukawa T, Katayama S et al (2005) The transcriptional landscape of the mammalian genome. Science 309(5740):1559–1563. doi:10.1126/science.1112014

10. Shiraki T, Kondo S, Katayama S et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781. doi:10.1073/pnas.2136655100

11. Weeks KM, Mauger DM (2011) Exploring RNA structural codes with SHAPE chemistry. Acc Chem Res 44(12):1280–1291. doi:10.1021/ar200051h

12. Lucks JB, Mortimer SA, Trapnell C et al (2011) Multiplexed RNA structure characterization with selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proc Natl Acad Sci U S A 108(27):11063–11068. doi:10.1073/pnas.1106501108

13. Giardine B, Riemer C, Hardison RC et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455. doi:10.1101/Gr.4086505

14. Goecks J, Nekrutenko A, Taylor J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86. doi:10.1186/Gb-2010-11-8-R86

15. Blankenberg D, Gordon A, Von Kuster G et al (2010) Manipulation of FASTQ data with Galaxy. Bioinformatics 26(14):1783–1785. doi:10.1093/bioinformatics/btq281

16. Blankenberg D, Von Kuster G, Coraor N et al (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10.11–21

17. Langmead B, Trapnell C, Pop M et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. doi:10.1186/Gb-2009-10-3-R25

18. Hannon-Lab, Gordon A (2010) FASTX-toolkit: FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/

19. R Foundation for Statistical Computing (2012) R: A language and environment for statistical computing, version 2.15.1. R Foundation for Statistical Computing, Vienna, Austria

20. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80

21. Aird D, Ross MG, Chen WS et al (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12(2):R18. doi:10.1186/gb-2011-12-2-r18


11.2 Paper 2: Massive parallel sequencing based hydroxyl radical probing of RNA accessibility

This is a pre-copy-editing, author-produced print of an article accepted for publication in Nucleic Acids Research following peer review. The definitive publisher-authenticated version will be available online

Massive parallel sequencing based hydroxyl radical probing of RNA accessibility

Lukasz Jan Kielpinski1, Jeppe Vinther1,*

1Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark * To whom correspondence should be addressed. Tel: +4535321264; Fax: +4535322128; Email: [email protected]

ABSTRACT

Hydroxyl Radical Footprinting (HRF) is a tried-and-tested method for analysis of the tertiary structure of RNA and for identification of protein footprints on RNA. The hydroxyl radical reaction breaks accessible parts of the RNA backbone, thereby allowing ribose accessibility to be determined by detection of reverse transcriptase termination sites. Current methods for HRF rely on reverse transcription from a single primer and detection of fluorescent fragments by capillary electrophoresis. Here, we describe an accurate and efficient massive parallel sequencing based method for probing RNA accessibility with hydroxyl radicals, called HRF-Seq. Using random priming and a novel barcoding scheme, we show that HRF-Seq dramatically increases the throughput of HRF experiments and facilitates the parallel analysis of multiple RNAs or experimental conditions. Moreover, we demonstrate that HRF-Seq data for the Escherichia coli 16S rRNA correlate well with the ribose accessible surface area as determined by X-ray crystallography and have a resolution that readily allows the difference in accessibility caused by exposure of one side of RNA helices to be observed.

INTRODUCTION

It is becoming clear that many RNA molecules from living cells and viruses have functions that do not depend on being translated, but rather on adopting intricate structures and binding to proteins (1,2). This is true for well characterized non-coding RNAs such as ribosomal, transfer and small nucleolar RNAs and viral RNA genomes, but also for more recently discovered non-coding RNA families, such as long non-coding RNAs and microRNAs. For many of the novel non-coding RNAs that have been discovered during the past decade, the function remains unknown and even for some of those that have been functionally characterized, details of the mechanism of action are lacking. In many cases, knowledge of the tertiary structure of these RNA molecules will be necessary to identify and understand their functions. Thus, there is a clear need for structure-probing methods that can deal with the increasing number of known RNA molecules in cells. Computational methods for prediction of tertiary RNA structure are improving (3), but they still demand large computational resources, cannot be used with long RNAs and have large root mean square deviations from the experimental structures (4). Moreover, experimental methods, such as X-ray crystallography and NMR, are especially challenging for long or flexible RNA molecules (4).

As an attractive alternative, the RNA backbone solvent accessibility can be mapped by hydroxyl radical footprinting (HRF) (5-7). The hydroxyl radical reacts with hydrogen atoms on the ribose C4’ and C5’ positions in parts of an RNA molecule exposed to the solvent, leading to RNA cleavage (8). The cleavage pattern can be visualized by electrophoresis of cDNA fragments produced by reverse transcription (6). Hydroxyl radicals can be conveniently produced in solution through the Fenton reaction between Fe(II)–EDTA and hydrogen peroxide (5) or inside cells using a synchrotron X-ray beam (9). HRF can therefore be applied to many different experimental conditions and allows changes in the tertiary structure or accessibility of the RNA to be determined by comparison of the abundance of fragments produced during reverse transcription. This type of comparison is relatively insensitive to the background produced by non-specific termination of reverse transcriptase and has successfully been used to identify the changes occurring during the folding of the RNA (10) and the binding of ligands to riboswitches (11) or to map protein binding sites on RNA (also called footprinting) (9,12). Alternatively, HRF data for RNA molecules can be compared to a non-hydroxyl radical treated control to normalize for background termination of reverse transcription and in this way produce a direct measure of the accessibility of the analyzed RNA molecule (6). Recently, it was demonstrated that such normalized HRF data anti-correlate with the number of through-space ribose neighbors, which is a measure that can be used to bias discrete molecular dynamics simulations of RNA tertiary structure prediction. Importantly, addition of the experimental data led to significant improvements in the accuracy of the predicted structures (13).

Historically, HRF data have been obtained with radioactive labelling of the reverse transcription primer, gel electrophoresis and phosphor imaging, but the current use of fluorescently labelled primers, capillary electrophoresis and automated data analysis has significantly improved the throughput of HRF experiments (14,15). Nevertheless, the capillary methods still deal with a single RNA at a time and typically provide data for only 300–400 nucleotides in a single experiment. Thus, the throughput of HRF could be dramatically improved if its readout could be adapted to using modern massive parallel sequencing technology. This has recently been shown to be possible for SHAPE probing of RNA secondary structure, allowing hundreds of in vitro transcribed RNA molecules to be analyzed in parallel using a single primer (16). Here, we use massive parallel sequencing together with random priming of reverse transcription and a novel barcoding and normalization scheme to dramatically improve the throughput of HRF experiments. The method allows the probing of purified RNAs and facilitates the parallel analysis of multiple RNAs or experimental conditions. Importantly, we demonstrate that HRF-Seq data correlate well with the ribose accessible surface area as determined by X-ray crystallography. The data have a resolution that readily allows the difference in accessibility caused by exposure of one side of RNA helices to be observed, suggesting that HRF-Seq can be applied in many different settings to gain insight into the functional relevance of tertiary RNA structures.

MATERIAL AND METHODS

Ribosome preparation

Ribosomes were purified from the E. coli MRE600 strain (gift of Birte Vester, University of Southern

Denmark) as previously described (17). Briefly, bacteria were grown in LB medium until OD600 was

approximately 0.7, transferred to +4°C for 15 min to slowly cool down, pelleted and stored frozen.

1.25 g of the pellet was resuspended in 3.125 ml buffer A (20 mM Tris-HCl pH 7 at 22°C, 10.5 mM

MgOAc, 100 mM NH4Cl, 0.5 mM EDTA and 3 mM 2-mercaptoethanol) and lysed twice with a French

press at 1000 psi. 125 µl DNase I (Fermentas) was added to 2.5 ml of lysate followed by 20 min

incubation on ice. The DNase treated lysate was centrifuged at 30000 g for 45 min and 1 ml of

supernatant was transferred onto 1 ml of 1.1 M sucrose made in buffer B (as buffer A, but with 0.5 M

NH4Cl) and centrifuged for 15 hours at 100000 g at 4°C. The pellet was washed with buffer A and

resuspended in 5 ml of buffer C (10 mM Tris-HCl pH 7, 10.5 mM MgOAc, 500 mM NH4Cl, 0.5 mM

EDTA and 7 mM 2-mercaptoethanol) followed by 16 hours centrifugation at 100000 g at 4°C. The

pellet was washed and dissolved in buffer EH (10 mM HEPES-Na pH 7.2, 10 mM MgOAc, 60 mM

NH4Cl, 3 mM 2-mercaptoethanol). Ribosomes were precipitated by addition of 81.25 µl ethanol to 125

µl ribosomes followed by incubation for 30 minutes at -80°C and centrifugation at 16000 g for 15 min.

The supernatant was removed and the pellet was dissolved in buffer EH lacking 2-mercaptoethanol.

Just before probing, ribosomes were diluted to 10 ng/µl (NanoDrop) and incubated for 5 minutes at 37°C.

RNase P specificity domain preparation

A plasmid containing the sequence of the RNase P specificity domain with a structure cassette as

previously described (16) was ordered as a gene synthesis from Eurofins MWG Operon. The plasmid

was linearized with BsaI-HF™ restriction enzyme (New England Biolabs) and used as a template for

an in vitro transcription reaction with T7 RNA polymerase, 0.7 mM rNTP, 6 mM MgCl2, 1 mM

spermidine, 5 mM DTT and 40 mM Tris-HCl pH 8. The reaction was incubated for 90 minutes at 37°C,

ethanol precipitated, centrifuged and resolved on a 5% polyacrylamide, 7M Urea, 1x TBE gel. The

RNA product was located with UV shadowing and the band was cut out and eluted from the gel

overnight in a buffer containing 250 mM NaAc and 1 mM EDTA in the presence of half of the volume

of phenol. The water phase was chloroform extracted and ethanol precipitated, followed by

centrifugation and resuspension in water. RNA was folded before probing as previously described (18)

with modifications. Briefly, 5.5 ng/µl RNA in 140 mM KCl and 20 mM Tris-HCl was incubated for 1

minute at 90°C and transferred to 37°C. After 15 minutes MgCl2 was added to the final concentration

of 2.5 mM (KCl and Tris-HCl concentrations kept constant) and the mixture was incubated for 5

minutes at 37°C.

Hydroxyl radical probing

Probing was performed according to the peroxidative Fenton chemistry protocol as previously

described (19). Briefly, three droplets, 2 µl each, with 5 mM ferrous ammonium sulfate-EDTA, 50 mM

sodium ascorbate and 1.5 % H2O2 were placed on the inside walls of a tube containing 100 µl of

prepared substrates (ribosomes or RNase P). The tubes were vigorously vortexed to mix the reagents

and after 60 seconds reactions were stopped by adding 318 µl ice-cold ethanol and 10 µg of glycogen.

The samples were incubated at -80°C for 30 min, centrifuged and resuspended in 12.5 µl H2O. Control

reactions were performed in parallel, but with addition of 6 µl H2O instead of the three aforementioned

droplets.

Sequencing library preparation

Sequencing libraries were prepared as previously described (20) with modifications. The sequences

of the primers used in this study are listed in Supplementary Table 1. Briefly, 1 µl of primer (10 µM of

RT_random_primer for ribosomes, 1.7 µM RT_structure_cassette for RNase P probing) was added to

5 µl of probed RNA, followed by incubation for 5 minutes at 65°C and transfer to ice. 14 µl of a master

mix was added to each reaction to obtain final concentrations of 50 mM HEPES pH 8.3, 75 mM KCl, 3

mM MgCl2, 0.5 mM dNTP, 0.67 M sorbitol, 0.13 M trehalose and 10 U/µl of PrimeScript Reverse

Transcriptase. The ribosome probing reactions were incubated for 30 sec. at 25°C, 30 min at 42°C,

10 min at 50°C, 10 min at 56°C, 10 min at 60°C and placed on ice. The RNase P probing reactions

were reverse transcribed using the same thermal conditions as used for the ribosome reaction, but

without the incubation at 25°C. The cDNA was recovered with RNAClean XP as described (20)

(ribosomes) or ethanol precipitation (RNase P) and resuspended in 25 µl 5 mM Na-citrate pH 6. The

cDNAs were diluted 200 times in H2O and 3 µl were mixed with 7 µl of a ligation master mix (prepared

by mixing 1 volume of CircLigase™ 10x buffer, 0.5 volume of 1 mM ATP, 50 mM MnCl2, CircLigase™

enzyme, 100 µM LIGATION_ADAPTER_RB oligonucleotide and 2 volumes of 50% PEG 6000 and 5

M betaine). The ligation reaction was incubated for 2 hours at 60°C, 1 hour at 68°C and 10 minutes at

80°C and purified with Ampure XP beads as described (20) and eluted in 16 µl H2O. 1 µl of 10 µM

PCR_REVERSE_INDEX primer and 14 µl of PCR master mix (1.2 volume of 10 µM PCR_forward

primer, 4 volumes of Phusion 5x HF buffer, 1.6 volume of 2.5 mM dNTPs, 6.8 volume of H2O, 0.4

volume of Phusion polymerase) were added to 5 µl of the ligated cDNA. The reactions were incubated

using the following temperature profile: (3 min, 98°C)x1, (80 sec, 98°C; 15 sec, 64°C; 30 sec, 72°C)x4,

(80 sec, 98°C; 45 sec, 72°C)x20, (5 min, 72°C)x1, purified with Ampure XP beads as described (20).

The PCR reactions were pooled and size selected on an E-gel 2% SizeSelect gel to retain the

products in the size range 200-600 bp, which were further concentrated on a PCR purification column

(Qiagen) and finally purified on Ampure XP beads before being sequenced on an Illumina HiSeq

system with the 2×100 paired-end protocol. The raw sequencing data is available at

http://people.binf.ku.dk/jvinther/data/HRF-Seq/

Gel electrophoresis detection of RNase P hydroxyl radical probing

The RNase P RNA was prepared and probed as described above for the sequencing-based detection.

After probing, the RNA was mixed with radioactively labelled (T4 polynucleotide kinase and ATP γ-32P)

RT_structure_cassette oligonucleotide, incubated at 65°C for 5 minutes and placed on ice. 4.5 µl of

the reverse transcription master mix (2 volumes of PrimeScript 5x buffer and of H2O, 0.5 volumes of

10 mM dNTP) was added to 5 µl of the RNA-primer mix. The sample was transferred to 42°C and

after 5 minutes of incubation, 0.5 µl PrimeScript enzyme was added and incubation was continued for

30 minutes, followed by ethanol precipitation with glycogen as carrier. A sequencing ladder sample

was prepared in parallel with untreated RNase P by adding 1 µl 5 mM ddATP to the reaction. The

samples were dissolved in formamide loading dye (92.5% formamide, 5 mM EDTA, 0.025%

bromophenol blue, 0.025% xylene cyanol), denatured (2 min, 90°C) and resolved on a 40 cm long, 8%

polyacrylamide, 7M Urea, 1x TBE gel at 45 W. After electrophoresis the gel was transferred onto

Whatman paper, dried, exposed to an imaging plate and scanned (Cyclone Storage Phosphor, Packard).

Pre-processing of sequencing reads

The Cutadapt utility (21) was used to remove contaminating adapter sequences (“-a

AGATCGGAAGAGCACACGTCT” for the first and “-a

AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT” for the second read in pair) and to filter out low

quality ends (“-q 17”). Using an awk script, the 7 nucleotide barcode was removed from the beginning

of the first read and saved in a separate file and the last 7 nucleotides from the end of the second read

were removed. Finally, pairs containing a read shorter than 15 nucleotides after trimming were filtered

out.
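
The awk script itself is not reproduced in this section. As a minimal sketch of the barcode handling it describes, assuming reads are held as plain strings and quality values are ignored, the three steps could look as follows in Python. The function name `split_barcode` and the returned tuple layout are illustrative assumptions, not part of the original pipeline.

```python
BARCODE_LEN = 7   # length of the random barcode (from the text)
MIN_LEN = 15      # minimum read length after trimming (from the text)


def split_barcode(read1, read2, barcode_len=BARCODE_LEN, min_len=MIN_LEN):
    """Return (barcode, trimmed_read1, trimmed_read2), or None if the pair is too short.

    The barcode is taken from the start of read1, and the last
    `barcode_len` nucleotides of read2 are dropped, as described in the text.
    """
    barcode = read1[:barcode_len]
    trimmed1 = read1[barcode_len:]
    trimmed2 = read2[:-barcode_len]
    # Discard the whole pair if either mate is shorter than the minimum length.
    if len(trimmed1) < min_len or len(trimmed2) < min_len:
        return None
    return barcode, trimmed1, trimmed2
```

Keeping the barcode next to the trimmed pair, rather than in a separate file, is only a convenience of this sketch.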

Assembly of E. coli MRE600 16S rRNA sequence

The pre-processed sequence pairs were used as input for Trinity (22) to assemble the strain specific

16S rRNA sequence. Comparison of the assembly to the sequence of chain A in the 3OFA PDB structure

identified 5 mutations (r.80a>c, r.89u>g, r.93u>c, r.183c>u and r.1498u>g).

Mapping read pairs to the strain-specific 16S rRNA sequence or the RNase P specificity domain sequence

The sequence pairs were mapped to the assembly-corrected 16S rRNA sequence or to the RNase P

specificity domain sequence using the Bowtie 2 program (23) with options “-N 1 -L 15 --norc -X 700”.

Untemplated nucleotides, putatively added via terminal transferase activity of reverse transcriptase,

were trimmed as described previously (20). For the analysis of 16S rRNA, pairs that spanned less

than 100 nt were discarded to reduce effects of size selection.

Estimated Unique Counts (EUC)

We defined a fragment as a pair of sites, 1) the termination site, which is the last reverse transcribed

RNA nucleotide and 2) the priming site, which is the first sequenced nucleotide of the second read.

The relationship between the EUC (‘n’) and the number of observed unique barcodes (‘k’) was calculated

using formula 1, which is an extension of a previously used method (24), but allowing different

barcodes to be ligated with different probabilities (‘Pi’). We calculated the frequency of the different

nucleotides at each position of the barcode using the observed set of barcodes from mapped

fragments having a read count within the three lowest quartiles of all fragments in the given dataset

(Supplementary Table 2). To estimate the Pi for each barcode in each performed ligation reaction, we

assumed that positions in the barcode are independent and multiplied the probabilities for all possible

sequence combinations. Finally, for each experiment we summed over all possible barcodes (‘m’) and

calculated the table of k(n) relationships, which was inverted to an n(k) table, rounded to the nearest integer,

and used to read out the EUC (‘n’) for the observed number of unique barcodes (‘k’) for each fragment.

Formula 1:

$$k(n) = \sum_{i=1}^{m} \left[ 1 - \left( 1 - P_i \right)^{n} \right]$$
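
To make the EUC calculation concrete, the following Python sketch builds the barcode probabilities from per-position nucleotide frequencies, evaluates the expected number of unique barcodes k(n) = sum over i of [1 - (1 - P_i)^n], and inverts the resulting table. The function names, the input layout and the inversion rule (choosing the n whose expected k lies closest to the observed k) are assumptions made for illustration; the text only states that the k(n) table was reversed and rounded.

```python
import itertools

import numpy as np


def barcode_probabilities(position_freqs):
    """Probability of every possible barcode, assuming independent positions.

    position_freqs: one dict per barcode position mapping nucleotide -> frequency.
    """
    per_position = [list(freqs.values()) for freqs in position_freqs]
    return np.array([np.prod(combo) for combo in itertools.product(*per_position)])


def expected_unique(n, probs):
    """Expected number of distinct barcodes (k) seen after n independent ligations."""
    return float(np.sum(1.0 - (1.0 - probs) ** n))


def euc_table(probs, max_n):
    """Map each observable k to the n whose expected k(n) is closest (assumed rule)."""
    ns = np.arange(max_n + 1)
    ks = np.array([expected_unique(n, probs) for n in ns])
    return {k: int(ns[np.argmin(np.abs(ks - k))]) for k in range(int(ks[-1]) + 1)}
```

For a 7 nt barcode, `position_freqs` would hold seven dictionaries, giving 4^7 = 16384 barcode probabilities. Skewed nucleotide frequencies lower the expected k for a given n, which is why the equal-frequency assumption underestimates the true count at high coverage.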

RNase P hydroxyl radical probing gel quantification and correlation with sequencing

The scanned gel image was quantified with ImageJ (25). The signals corresponding to nucleotides

117-221 in the RNase P RNA were manually assigned to the sequence by comparison to a ddATP

sequencing reaction run in parallel. For each band the maximal value was extracted, followed by

subtraction of the average signal intensity in the whole +/- 6 nt region to correct for unequal

background intensity over the gel length. To allow optimal comparison between sequencing EUC and

gel intensities, the sequencing data was not trimmed for untemplated additions to the 3’ end of the

cDNA by reverse transcriptase, because we expect these shifts in signal to be present in the gel

resolved fragments. For the plot in supplementary figure 2, we have used positions 117 to 186, which

were chosen due to band compression in the region before and the effect of size selection of the

sequencing library in the region after.
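
The local background correction of the gel signal can be sketched as below. This is an illustration only: it assumes that the per-band maxima are already available as one value per nucleotide, that the +/- 6 nt window includes the band itself, and that the window is truncated at the ends of the quantified region. None of these details are specified in the text, and the function name is hypothetical.

```python
import numpy as np


def subtract_local_background(band_max, half_window=6):
    """Subtract the mean signal of the surrounding +/- half_window bands from each band."""
    band_max = np.asarray(band_max, dtype=float)
    corrected = np.empty_like(band_max)
    for i in range(len(band_max)):
        lo = max(0, i - half_window)                     # truncate at the 5' edge
        hi = min(len(band_max), i + half_window + 1)     # truncate at the 3' edge
        corrected[i] = band_max[i] - band_max[lo:hi].mean()
    return corrected
```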

Number of through-space contacts in RNase P specificity domain calculation

To calculate the number of through-space ribose contacts, we have used chain B of the 1NBS pdb

structure (26) with the positions 121-124 structurally aligned from chain A of the same structure. Atom

locations were obtained from the PDB file and used to calculate ribose positions, defined as the mean

of the C1’, C2’, C3’, C4’ and O4’ positions. Next, we used the ribose bead locations to calculate the

number of ribose positions (excluding the neighbouring riboses) within distance of 14 Å from a given

ribose position.
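
A minimal sketch of this neighbour count is given below, assuming the atom coordinates have already been parsed from the PDB file into one dictionary per nucleotide, in sequence order. The function names and the input layout are illustrative. Two further assumptions are that "neighbouring riboses" means the nucleotides directly adjacent in sequence, and that a distance of exactly 14 Å counts as a contact.

```python
import numpy as np

RIBOSE_ATOMS = ("C1'", "C2'", "C3'", "C4'", "O4'")


def ribose_centers(nucleotides):
    """Mean position of the five ribose ring atoms for each nucleotide.

    nucleotides: list of dicts mapping atom name -> (x, y, z), in sequence order.
    """
    return np.array([np.mean([nt[a] for a in RIBOSE_ATOMS], axis=0) for nt in nucleotides])


def through_space_neighbors(centers, cutoff=14.0):
    """Number of ribose beads within `cutoff` Å, excluding self and sequence neighbours."""
    centers = np.asarray(centers, dtype=float)
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    idx = np.arange(len(centers))
    sequence_sep = np.abs(idx[:, None] - idx[None, :])
    close = (dist <= cutoff) & (sequence_sep > 1)
    return close.sum(axis=1)
```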

Solvent accessible surface area calculation

Solvent accessible surface area was calculated using the PyMOL get_area function with settings

dot_solvent=1, dot_density=3. For the RNase P specificity domain, chain B of 1NBS structure (chain

A for positions 120-125) (26) and a solvent radius of 1.4 Å was used, whereas chain A of 3OFA

structure in complex with 3OFC (27) and a solvent radius of 3 Å was used for 16S rRNA

(supplementary figure 1).

Running average of ∆TCR calculation

Termination count at a given position was calculated as the sum of the EUCs of fragments terminating

at the position. Effective coverage at a given position was calculated as the sum of the EUCs of the

fragments terminating at or spanning the position. In addition, for the ribosome analysis, fragments

were only used for calculation of effective coverage for a given position if the distance between the

position and the priming position was at least 100 nt. For RNase P the coverage was calculated using

all fragments, but only positions 87-186 were used for the subsequent analysis. A coverage cut-off

was set to the coverage that would provide a 90 % probability of observing at least one termination count

given the average cleavage probability (median ∆TCR). The Termination-Coverage ratio (TCR) of a

given position was calculated by dividing termination EUC by the effective coverage EUC. ∆TCR was

calculated according to formula 2. As a last step ∆TCR was smoothed with a moving average over a

window of 3 nucleotides and offset by 1 position upstream to reflect the fact that reverse transcription

terminates before the cleaved position.

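Formula 2:

$$\Delta\mathrm{TCR} = \max\left( \frac{\mathrm{TCR}_{\mathrm{treated}} - \mathrm{TCR}_{\mathrm{control}}}{1 - \mathrm{TCR}_{\mathrm{control}}},\ 0 \right)$$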
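
The calculation described in this subsection can be sketched in Python as follows. The sketch assumes that each fragment is given as a tuple (termination site, priming site, EUC) in 1-based RNA coordinates with the termination site 5' of the priming site, and it reads Formula 2 as ∆TCR = max((TCR_treated - TCR_control) / (1 - TCR_control), 0). All function names are hypothetical, and the coverage cut-off is interpreted as the smallest coverage c for which 1 - (1 - p)^c reaches 90 %.

```python
import math

import numpy as np


def termination_and_coverage(fragments, rna_length, min_distance=0):
    """Per-position termination counts and effective coverage (both as EUC sums)."""
    termination = np.zeros(rna_length + 1)  # index 0 is unused (1-based positions)
    coverage = np.zeros(rna_length + 1)
    for term, prime, euc in fragments:
        if prime - term < min_distance:
            continue  # fragment too short to pass size selection (assumed pre-filter)
        termination[term] += euc
        # Count the fragment only at positions far enough from its priming site.
        last = min(prime - min_distance, rna_length)
        coverage[term:last + 1] += euc
    return termination[1:], coverage[1:]


def coverage_cutoff(cleavage_prob, detection_prob=0.9):
    """Smallest coverage giving `detection_prob` of seeing at least one termination."""
    return math.ceil(math.log(1.0 - detection_prob) / math.log(1.0 - cleavage_prob))


def delta_tcr(term_t, cov_t, term_c, cov_c, min_coverage=1):
    """Background-corrected termination-coverage ratio; NaN where coverage is too low."""
    term_t, cov_t, term_c, cov_c = (
        np.asarray(a, dtype=float) for a in (term_t, cov_t, term_c, cov_c)
    )
    with np.errstate(divide="ignore", invalid="ignore"):
        tcr_t = term_t / cov_t
        tcr_c = term_c / cov_c
        raw = np.maximum((tcr_t - tcr_c) / (1.0 - tcr_c), 0.0)
    ok = (cov_t >= min_coverage) & (cov_c >= min_coverage) & (tcr_c < 1.0)
    return np.where(ok, raw, np.nan)


def smooth_and_shift(values, window=3, shift=1):
    """Centered moving average, then move the signal `shift` positions towards the 5' end."""
    values = np.asarray(values, dtype=float)
    half = window // 2
    smoothed = np.full(values.shape, np.nan)
    for i in range(half, len(values) - half):
        smoothed[i] = values[i - half:i + half + 1].mean()
    shifted = np.full(values.shape, np.nan)
    shifted[:len(values) - shift] = smoothed[shift:]
    return shifted
```

For the ribosome data, `min_distance=100` would reproduce the 100 nt rule, while the RNase P analysis corresponds to `min_distance=0`. Positions without sufficient coverage are returned as NaN so that they can be displayed as "no data".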

RESULTS

Reducing the biases in massive parallel sequencing based readout of HRF

As in classic HRF, our massive parallel sequencing strategy (HRF-Seq) is based on the detection of

reverse transcription termination sites, but instead of analyzing the sample on a gel or a capillary, we

ligate an adaptor to the 3’ end of the cDNA and PCR amplify using primers containing adaptor and

index sequences allowing massive parallel sequencing of many different conditions in a single lane on

the Illumina platform (16,20) (Figure 1). After paired end sequencing, the resulting reads can be

mapped to the investigated RNA to give the precise coordinates of the priming and probing event.

Compared with capillary analysis, the great advantage of using sequencing is increased throughput,

but sequencing methods also introduce additional experimental biases during ligation, PCR

amplification and sequencing steps (28). To reduce these biases, we introduced a 7 nucleotide

random barcode sequence in the 5’ end of the adaptor used for ligation. The barcode serves two

purposes. First, it has been shown that using an adaptor pool significantly reduces ligation bias in

small RNA cloning experiments using T4 RNA ligases (29) and we expect that the same is true for the

TS2126 RNA ligase (CircLigase™) used in this study. Second, the barcode serves as a label that is

added to each fragment before introduction of PCR and sequencing biases. At low coverage the

number of unique barcodes can be used directly to give the count for the specific fragment before the

PCR. At high coverage, it becomes more likely that the barcodes of the same sequence are ligated to

the same fragment multiple times (become saturated). Saturation occurs when the fragment count

exceeds the square root of the number of barcodes and will affect the accuracy of quantification (30).

By assuming that all the barcodes have equal probability of being attached to a given fragment, it is

possible to correct for saturation and calculate an Estimated Unique Count (EUC) (24). In our

experiments, the ligation adaptor is prepared by standard oligonucleotide synthesis as a pool of

oligonucleotides having 7 degenerate positions at the 5’ end. During our analysis, we realized that the

individual barcodes are present at very different frequencies in the barcode pool (Figure 2A), meaning

that the observed distribution of barcodes is modelled very poorly when equal barcode frequencies in

the barcode pool are assumed (Figure 2B). We therefore devised a novel strategy for estimating

individual fragment counts based on the method previously implemented by Fu et al. (24), but taking

into account that barcodes are present at different frequencies in the adaptor pool. In our strategy, the

underlying barcode frequencies in the adaptor pool are estimated by determining the nucleotide

frequencies observed at the seven different positions in the barcode after excluding fragments with

counts in the top quartile to avoid bias from clonal amplification of specific fragments. These

nucleotide frequencies are stable across our different experiments (Supplementary Table 2),

suggesting that they are accurate. Assuming independence among the positions in the barcode, we

then estimate the barcode frequencies by multiplication of the nucleotide frequencies. In simulation,

the estimated underlying barcode frequencies produce an observed distribution of barcodes that is

similar to the actual observed distribution, although the observed data still have a more extreme

distribution, probably because of the presence of PCR duplicates (Figure 2B). We applied this

normalization strategy to calculate EUC for HRF of a short in vitro transcribed RNA (specificity domain

from the Bacillus subtilis RNase P RNA) and for HRF of a long RNA purified from cells (Escherichia

coli 16S ribosomal RNA), both probed with hydroxyl radicals. For the RNase P specificity domain RNA,

we obtained high coverage resulting in saturation of barcodes. This is corrected using our strategy,

but not using simple barcode counting or by assuming equal barcode frequencies (Figure 2C). The

saturation of barcodes was not observed with the 16S rRNA, because of much lower coverage

(Figure 2D). By comparing the observed fragment counts with the EUC and stratifying by fragment

length, it is clear that for the RNase P RNA, most positions have no length dependent bias (counts

equal EUC) (Figure 2E). This is most likely because there is relatively little length difference between

the different fragments in the PCR. For some of the RNase P positions (the longest fragments), we

observe a bias, which is related to some of the barcodes containing deletions, leading to assignment

of RNase P sequence as part of the barcode and subsequent reduction in the barcode complexity

and underestimation of the EUC. This phenomenon will have a small, but significant effect on the

quality of our data and can be avoided in the future by extending the barcode and giving it a specific

signature that will allow true barcodes to be distinguished (30). For the 16S rRNA dataset, we

observe a striking overrepresentation of short fragments, which is most likely caused by PCR

amplification and sequencing biases (Figure 2F) and our barcode normalization strategy efficiently

corrects for this bias. For both the 16S rRNA and the RNase P RNA, the EUC calculated using

unequal barcode frequencies performs at least as well as the other normalization strategies when

comparing with accessibility data obtained from the crystal structures (Supplementary Table 3). The

superior performance of our method in determining the RNase P accessibility stems mainly from

saturation of barcodes for the fragments that reach the RNA fragment terminus, leading to

underestimation of signal in the other type of barcode normalization. In contrast, the 16S rRNA

coverage is lower, so that a simple count of unique barcodes allows the data to be normalized for

fragment length bias of PCR. Thus, our barcoding strategy corrects for fragment length bias and for

the barcode saturation that can occur at high coverage, allowing the strategy to be used regardless of

the level of coverage.
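
As a small worked example of barcode saturation under the simpler equal-frequency assumption, the sketch below computes the square-root threshold for a 7 nt barcode and inverts the expected-unique-barcode relationship k = m(1 - (1 - 1/m)^n) for n. This closed form is the equal-probability special case of the relationship used for the EUC; the function names are illustrative, and the code is not the implementation of reference (24).

```python
import math

N_BARCODES = 4 ** 7  # 7 degenerate positions give 16384 possible barcodes


def saturation_threshold(n_barcodes=N_BARCODES):
    """Fragment count above which repeated barcodes become likely (square-root rule)."""
    return math.sqrt(n_barcodes)


def euc_equal_frequencies(k, n_barcodes=N_BARCODES):
    """Estimated count n for k observed unique barcodes, assuming equal barcode frequencies."""
    if k >= n_barcodes:
        raise ValueError("all barcodes observed; the count cannot be estimated")
    return math.log(1.0 - k / n_barcodes) / math.log(1.0 - 1.0 / n_barcodes)
```

With 16384 barcodes the threshold is 128 fragments. Below it, the correction is negligible, which is consistent with simple barcode counting being sufficient for the low-coverage 16S rRNA data.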

Figure 1. Major experimental steps of the HRF-Seq method. Following hydroxyl radical probing, primers containing a 5’ Illumina adaptor overhang are extended by reverse transcriptase to positions of radical induced breaks. Adaptors containing a 7 nt barcode are ligated to the 3’ ends of cDNAs, followed by PCR amplification with primers containing Illumina compatible adaptor and index sequences. After size selection, the library is sequenced with the Illumina paired-end protocol to provide information on the positions of probing and priming.

Figure 2. Using barcodes to estimate unique counts. A) Observed barcode frequencies. Histogram showing the distribution of observed barcode frequencies in the hydroxyl radical treated RNase P experiment. The broken vertical line indicates the barcode frequency if all barcodes were present at equal frequencies. B) Estimation of barcode counts. The plot compares the observed barcode counts with simulated barcode counts as estimated by assuming equal barcode frequencies or the unequal barcode frequencies as estimated by our strategy. Data is from the hydroxyl radical treated RNase P experiment. C) Relationship between the number of observed unique barcodes and EUC for different types of barcode normalization strategies for the hydroxyl radical treated RNase P experiment. The vertical line shows the highest count observed in the experiment. D) Relationship between the number of observed unique barcodes and EUC for different types of barcode normalization strategies for the hydroxyl radical treated 16S rRNA. The vertical line shows the highest count observed in the experiment. E) Length dependent bias of fragments in the probing of the RNase P specificity domain RNA. F) Length dependent bias of fragments in the probing of the 16S rRNA.

HRF-Seq analysis of in vitro transcribed RNase P RNA

To validate our sequencing based output of HRF, we first compared the EUCs obtained for the

specificity domain of B. subtilis RNase P RNA with the output obtained with classical gel based HRF

using identical conditions and the same primer for reverse transcription. The footprinting signals from

the two methods are strongly correlated (R = 0.80), showing that the HRF-Seq EUC captures the

same signal as classical hydroxyl radical footprinting (Supplementary Figure 2). The HRF signal

(Figure 3A) contains both background signal caused by spontaneous termination of the reverse

transcriptase and a signal decay resulting from termination of reverse transcriptase before the probed

position. To normalize for the background, we implemented a slightly modified version of the

QuShape normalization method recently described by Weeks and colleagues for analysis of SHAPE

data (15). In line with the QuShape method, we estimate the coverage across the RNA by summing

the EUC for the fragments that reach or pass a given position (Figure 3B). The observed coverage is

a measure of number of reverse transcriptases reaching a given position. This can be used to

normalize the termination EUC to give a Termination-Coverage ratio (TCR), which is the fraction of

reverse transcriptases that will terminate at a given position. The TCR of the treated sample is

composed of probing signal and background signal, whereas the control samples’ TCR is composed

of background signal only. Comparing the sum of TCR for the control and treated experiments after

excluding the 5’ run off indicates that the treated RNase P sample contains 47 % background signal.

Assuming that background causes the same fraction of reverse transcriptases to terminate at a given

position in the control and treated sample, the probing signal can be normalized for spontaneous

termination of the reverse transcriptase by subtraction of the control sample TCR from the treated

sample TCR to give a normalized accessibility measure ∆TCR (see methods section for full

description). This is slightly different from the QuShape procedure, which assumes that the

background signal in the probed sample is a scaling of the signal observed in the control sample. The

median ∆TCR is a measure of the average hydroxyl radical induced cleavage probability and for

RNase P probing it is 0.0033 (Supplementary Figures 3A and 3B), corresponding to 1 hydroxyl radical

induced cleavage per 300 nt and approximately 34 % probability of observing a single hit on the RNA.

HRF data is known to have high background signal and in some cases, barcode assignment and

terminal transferase activity of reverse transcriptase can cause the signal to shift by one or two

nucleotides. In order to reduce the overall experimental noise, we therefore take advantage of the

accessibility of neighboring positions being highly correlated and calculate the moving average of

∆TCR in a 3 nucleotide window (Figure 3C). Comparing the moving average of ∆TCR with the moving

average of ribose accessibility calculated from the solved crystal structure for the RNase P specificity

domain RNA, we find a significant correlation (R = 0.55) (Figure 3D). This correlation is slightly higher

than previously observed for this RNA using traditional HRF based on capillary analysis (13).

Moreover, we also find that the moving average of ∆TCR anti-correlates with through-space ribose

neighbors (R = -0.57) as calculated from the RNase P crystal structure (Figure 3E), suggesting that

HRF-Seq data can be used to inform discrete molecular dynamics simulations of RNA tertiary

structure prediction (13). In the comparison with the crystal structure accessibility, we observe 4

positions (positions 99-102) that are clear outliers in our probing data, giving too high ∆TCR signal.

This region is a loop (Figure 3F) and the discrepancy between our data and the data from the crystal

structure probably reflects that this loop is more flexible and has a higher accessibility in solution.

Figure 3. HRF-Seq analysis of RNase P RNA specificity domain. A) Termination signal for HRF treated sample calculated as the sum of EUC for fragments terminating at a given position. B) Coverage for HRF treated sample. C) Normalized HRF-Seq signal calculated as the 3 nucleotide moving average of the termination coverage ratio for the HRF treated sample with the termination coverage ratio for the control sample subtracted. D) Correlation between the normalized HRF-Seq signal and a 3 nucleotide moving average of ribose accessibility from the published crystal structure (26) using a 1.4 Å probe. E) Correlation between the normalized HRF-Seq signal and the number of ribose through-space contacts from the published crystal structure (26). R values are calculated using the Pearson correlation. F) Normalized HRF-Seq signal displayed on the crystal structure of the RNase P RNA specificity domain (26); gray indicates no data.

Random primed HRF-Seq analysis of purified 16S rRNA

Next, we wanted to extend HRF-Seq to the analysis of long RNA molecules isolated from the cellular

environment. To make our strategy general and applicable to the entire transcriptome, we used

random primers for reverse transcription, rather than the single primer strategy that we used for the

RNase P experiments and that was previously used for SHAPE-Seq (16). We chose the E. coli 16S

ribosomal RNA for validation of our strategy, because of the high abundance of the ribosome and the

solved crystal structure (27). Native ribosomes including ribosomal proteins were purified and used for

HRF-Seq using random priming during reverse transcription to obtain signals for the entire 16S RNA

molecule in a single experiment. We also obtained data for the 23S rRNA, but because of low stability

during purification and high prevalence of posttranscriptional modifications that terminate reverse

transcription, only parts of the 23S rRNA were covered. After mapping the reads to the 16S rRNA, we

again used the barcodes present in the ligation adaptors to calculate the EUC for each observed

fragment (Figure 4A). The fragments can be collapsed to give EUC for each termination position

(Figure 4B). Knowing the EUC and the exact probing and priming position for each fragment, we can

calculate the effective coverage at each position by taking the size selection that occurs during

preparation of the sequencing library into account. In our set-up a fragment size cut-off of 100

nucleotides ensures that the effective coverage of a position is affected only by the molecules that

potentially could have been observed at the specific position given their priming site. The data for the

hydroxyl radical treated sample and the control were obtained using 5.7 % of an Illumina HiSeq lane.

For the treated sample, 12% of 5.2 million reads mapped to 16S and provided good coverage across

the large majority of the 16S rRNA (Figure 4C). Using the termination EUC and the effective coverage,

we then calculated TCR for the hydroxyl radical treated sample and the control experiment (Figure 4D).

Comparing the sum of TCR for the control and treated experiments after excluding the 5’ run off

indicates that the treated sample in this case contains 86 % background signal. Surprisingly, we

observe a few positions that have very high signal in the control compared to the treated sample

(most notably positions 330, 551, 552 and 1378). As the only difference between the treated and

control sample is the radical treatment, we speculate that these signals are the result of a nuclease

activity that co-purifies with the ribosome and becomes inactivated by the radical treatment. We

subtracted the control TCR from the treated TCR to give a ∆TCR value for each position. The median

∆TCR is 0.0018, which corresponds to 1 hydroxyl radical induced cleavage per 560 nt on average

(Supplementary Figures 3C and 3D). Finally, we applied the 3 nucleotide window moving average to

∆TCR to give accessibility values for the 16S E. coli rRNA. We find that the RNA accessibility

calculated from the ribosomal crystal structure (27) as a 3 nucleotide moving average of ribose

solvent accessibility using a solvent radius of 3 Å correlates with the HRF-Seq determined ∆TCR (R =

0.56) (Figure 4E). While the agreement between the crystal structure accessibility and the HRF-Seq

data in general is quite striking, 16S rRNA positions 723 and 729 show high signal in the HRF-Seq

data, but are inaccessible in the crystal structure. In the ribosome crystal, position 723 of the 16S

rRNA is bound and hidden from solvent by ribosomal protein S21 (RPS21) and RPS21 has previously

been shown to crosslink to position 723 (31). Interestingly, RPS21 is known to have a fast off rate and

exchange rapidly in reconstitution experiments (32) and is therefore likely to have been lost during

purification, which would explain the discrepancy between our data and the crystal structure at this

position. Positions 723 and 729 are located in a loop and the high HRF-Seq signal at position 729

compared to the crystal accessibility indicates that the loop changes its conformation when RPS21 is

absent, thereby exposing position 729 to the solvent. In general, however, the footprints of ribosomal

proteins and the large ribosomal subunit on the 16S surface are readily observed in HRF-Seq data

(Figure 5). As exemplified by position 723, the resolution of the HRF-Seq accessibility signal is high.

Zooming in on H16/H17, which run parallel to the long axis of the subunit and are located on a rather

flat surface, it is clear that HRF-Seq allows the difference in accessibility caused by exposure of one

side of RNA helices to be observed (Figure 6A). In fact, even for the entire 16S molecule, we observe a

strong correlation in accessibility signal for positions separated by one or two helical turns (Figure 6B),

probably because a significant fraction of 16S rRNA is helical and exposed on the surface. As

expected for accessibility footprinting there is no significant difference in the HRF-Seq signal for base-

paired positions compared to non-base-paired positions, but interestingly the probing signal of

positions that are Watson-Crick base-paired correlates with the probing signal of positions on the

opposite strand located downstream (offset by 2 and 3 bases) from the paired position (R=0.41 and

0.43, respectively). This is in perfect agreement with what one would expect from the accessible

surface area of riboses in helical structure with one side facing the solvent.
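The periodicity analysis described above can be illustrated with a minimal sketch that correlates a per-position signal with a copy of itself shifted by a given offset. The function names and the toy signal are hypothetical, and the period of 11 positions is only an assumed stand-in for roughly one A-form helical turn; this is not the analysis code used for Figure 6B.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def offset_correlation(signal, offset):
    """Correlate the signal at position i with the signal at position i + offset."""
    if offset <= 0 or offset >= len(signal) - 1:
        raise ValueError("offset must be between 1 and len(signal) - 2")
    return pearson(signal[:-offset], signal[offset:])

# Toy signal (assumption): accessibility oscillating with a period of 11 positions.
toy_signal = [math.cos(2 * math.pi * i / 11) for i in range(110)]

one_turn = offset_correlation(toy_signal, 11)   # positions one period apart
half_turn = offset_correlation(toy_signal, 5)   # positions about half a period apart
```

For such an idealized signal the correlation is high at offsets of one period and negative near half a period, which is the qualitative pattern expected for helices with one side facing the solvent.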


Figure 4. HRF‐Seq analysis of E. coli 16S rRNA. A) Sequenced fragments (EUC) from the treated (left) and control (right) sample mapped to the 16S rRNA sequence. The left terminus of each fragment corresponds to the reverse transcription termination site and the right terminus to the priming site. B) Sum of EUC termination signal at each position for the HRF treated and control samples. C) EUC based coverage for the HRF treated and control samples. D) Termination‐Coverage ratio (TCR) calculated by dividing the termination signal by the coverage for the treated and control samples. E) Top graph (red) shows the normalized HRF‐Seq signal calculated by subtracting the TCR for the control sample from the TCR obtained from the treated sample and taking the 3 nucleotide moving average. Bottom graph (blue) shows the area of ribose accessibility calculated from the crystal structure (27) as the 3 nucleotide moving average of the accessibility to a probe with 3 Å radius. R was calculated using the Pearson correlation.
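The calculation summarized in panels D and E can be sketched as follows. This is an illustrative sketch only: the function names and the toy counts are hypothetical, and the handling of positions without coverage and of the sequence ends (both returned as None) is an assumption rather than the procedure used to produce the figure.

```python
def termination_coverage_ratio(termination, coverage):
    """TCR per position: termination count divided by coverage (None if coverage is 0)."""
    return [t / c if c > 0 else None for t, c in zip(termination, coverage)]

def delta_tcr(tcr_treated, tcr_control):
    """Treated TCR minus control TCR; None where either value is missing."""
    return [None if a is None or b is None else a - b
            for a, b in zip(tcr_treated, tcr_control)]

def moving_average(values, window=3):
    """Centered moving average; None at the ends and where the window has missing data."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        if i < half or i + half >= len(values) or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out

# Toy counts for four positions (assumed values, for illustration only).
treated = termination_coverage_ratio([2, 10, 4, 0], [100, 100, 80, 50])
control = termination_coverage_ratio([1, 2, 4, 0], [100, 100, 80, 50])
signal = moving_average(delta_tcr(treated, control))
```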


Figure 5. Surface representation of 16S rRNA accessibility and HRF‐Seq data. A) Three views of the crystal structure of the RNA part of the 16S small ribosomal subunit colored with the moving average of ribose accessibility as measured from the crystal structure (27) using a 3 Å probe. P, H and S indicate the platform, head and shoulder of the ribosomal subunit as named in (34). B) Crystal structure of the 16S small ribosomal subunit colored with the normalized HRF‐Seq signal; gray indicates no data.


Figure 6. Periodicity of RNA accessibility. A) Close‐up of the positions 400‐500 of the 16S rRNA colored with the normalized HRF‐Seq signal. B) Pearson correlation between HRF‐Seq signal and ribose accessibility from the crystal structure for nucleotides separated by the indicated offset.

DISCUSSION

We present a new method for HRF of RNA backbone accessibility using massive parallel sequencing

as the readout. Our study demonstrates that this method has dramatically improved throughput

compared to classical capillary based methods and produces data that agree well with RNA ribose

accessible surface areas and through-space contacts determined by X-ray crystallography.

Importantly, we show that HRF-Seq makes it possible to analyze long RNA molecules and mixtures of

RNA molecules in parallel in a single tube by using random primers. To this end, we devised new

strategies for reducing PCR and sequencing biases based on barcodes in the ligation adaptor and on

data normalization using the probing and priming position information obtained during sequencing.

Both of these strategies could be implemented for other types of sequencing based probing methods.

During the final preparation of this manuscript, Das and colleagues published a method to reduce the

bias in probing experiments based on the detection of termination of reverse transcription also by

introducing barcodes, but only for in vitro transcribed RNAs with a single primer (33). An important

advantage of using massive parallel sequencing as readout for HRF experiments is the digital nature


of the data, which makes data processing relatively easy compared to the analysis of data obtained

by gel or capillary electrophoresis. Moreover, after mapping we find that a substantial fraction of the

reads (~20 % on average) have mismatches in the 3 positions corresponding to the very 3’ end of the

cDNA produced, which is indicative of untemplated nucleotides being added to the cDNA by the

terminal transferase activity of the reverse transcriptase. This causes a shift of signal in the 5’

direction of the RNA, which cannot be corrected when using gel and capillary based methods for data

readout. In contrast, using massive parallel sequencing readout, we can perform a simple trimming of

reads with terminal mismatches to correct the probing position for approximately 75 % of cases with

untemplated nucleotides added (20).
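The idea of trimming terminal mismatches can be sketched as below. The rule shown is an assumption made for illustration (trim up to and including the last mismatch found within the first three positions of the read end that corresponds to the 3’ end of the cDNA); the function name is hypothetical and the procedure actually used is the one described in (20).

```python
def trim_untemplated(read, reference, window=3):
    """Trim likely untemplated nucleotides from the start of a mapped read.

    `read` and `reference` are assumed to be aligned without gaps and oriented so
    that index 0 is the read end corresponding to the 3' end of the cDNA.
    Returns the trimmed read and the number of trimmed nucleotides, which is also
    the number of positions by which the termination site should be shifted.
    """
    n_trim = 0
    for i in range(min(window, len(read), len(reference))):
        if read[i] != reference[i]:
            n_trim = i + 1  # trim everything up to and including this mismatch
    return read[n_trim:], n_trim

no_addition = trim_untemplated("ACGTAC", "ACGTAC")
one_added = trim_untemplated("GACGTA", "AACGTA")
```

Mismatches further inside the read are left untouched in this sketch, since they are more likely to be sequencing errors or polymorphisms than untemplated additions.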

Hydroxyl radical footprinting is a versatile method that can be used to investigate changes in tertiary

RNA structure, identify protein footprints on RNA and guide the computational prediction of tertiary

RNA structure. Here, we compare a radical treated sample with a control sample to obtain an

accessibility signal that could be used for computational prediction of tertiary RNA structure by

calculating ∆TCR and averaging it over 3 positions. The averaging improves overall correlation

because of the high accessibility correlation between neighboring positions observed in the dataset (Figure

6B), but also blurs the fine details. In other types of experiments, such as typical footprinting

experiments, where two probed conditions are compared, the objective will be to determine specific

positions that have differential accessibility in the two conditions. In such cases, it would make sense to

analyze the data by comparing the coverage and termination EUCs of the two samples with the

Fisher exact test or a test based on the negative binomial distribution. In this way the coverage and

termination count will be taken into account in the calculation of the significant differences between

the two samples. Importantly, the use of X-rays allows hydroxyl radical footprinting to be performed

inside intact cells (9) and in kinetic studies of RNA folding (10). HRF-Seq should be

readily applicable to such types of analysis and we therefore expect that the throughput provided by

HRF-Seq will help pave the way for an increased understanding of the functional consequences of

RNA tertiary structure inside cells and the dynamics of RNA folding. In particular, HRF-Seq should

facilitate the probing of long RNA molecules, such as mRNAs, long ncRNAs and viral RNAs, for which

tertiary structure information currently is very limited.
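The comparison of two probed conditions with the Fisher exact test suggested above could look roughly like the following sketch. It uses only the Python standard library; the function names are hypothetical, and the layout of the 2x2 table (terminated versus read-through counts, with read-through taken as coverage minus termination) is an assumption about how the EUCs would be arranged.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

def differential_termination(term_1, cov_1, term_2, cov_2):
    """p-value for a difference in termination frequency at one position."""
    return fisher_exact_two_sided(term_1, cov_1 - term_1, term_2, cov_2 - term_2)

p_same = differential_termination(10, 100, 10, 100)
p_diff = differential_termination(40, 100, 10, 100)
```

When many positions are tested in parallel, the resulting p-values would additionally need a multiple testing correction.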

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.


ACKNOWLEDGEMENT

We are grateful to Jan Christiansen, who helped purify E. coli ribosomes and to Anders Krogh for

advice on the calculation of estimated unique counts. We thank the Danish National DNA Sequencing

Center for performing sequencing and the system administration at Section for Computational and

RNA Biology for providing computational infrastructure.

FUNDING

This work was supported by the Danish Council for Strategic Research [Center for Computational and

Applied Transcriptomics, DSF-10-092320]. LJK is funded by a PhD stipend from the Department of

Biology, University of Copenhagen. Funding for open access charge: the Danish Council for Strategic

Research.

REFERENCES

1. Wan, Y., Kertesz, M., Spitale, R.C., Segal, E. and Chang, H.Y. (2011) Understanding the

transcriptome through RNA structure. Nat Rev Genet, 12, 641-655.

2. Sharp, P.A. (2009) The centrality of RNA. Cell, 136, 577-580.

3. Cruz, J.A., Blanchet, M.F., Boniecki, M., Bujnicki, J.M., Chen, S.J., Cao, S., Das, R., Ding, F.,

Dokholyan, N.V., Flores, S.C. et al. (2012) RNA-Puzzles: a CASP-like evaluation of RNA

three-dimensional structure prediction. RNA, 18, 610-625.

4. Laing, C. and Schlick, T. (2010) Computational approaches to 3D modeling of RNA. J Phys

Condens Matter, 22, 283101.

5. Latham, J.A. and Cech, T.R. (1989) Defining the inside and outside of a catalytic RNA

molecule. Science, 245, 276-282.

6. Tullius, T.D. and Greenbaum, J.A. (2005) Mapping nucleic acid structure by hydroxyl radical

cleavage. Current opinion in chemical biology, 9, 127-134.

7. Brenowitz, M., R. Chance, M., Dhavan, G. and Takamoto, K. (2002) Probing the structural

dynamics of nucleic acids by quantitative time-resolved and equilibrium hydroxyl radical

‘footprinting’. Current Opinion in Structural Biology, 12, 648-653.


8. Balasubramanian, B., Pogozelski, W.K. and Tullius, T.D. (1998) DNA strand breaking by the

hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the

DNA backbone. Proceedings of the National Academy of Sciences, 95, 9738-9743.

9. Adilakshmi, T., Lease, R.A. and Woodson, S.A. (2006) Hydroxyl radical footprinting in vivo:

mapping macromolecular structures with synchrotron radiation. Nucleic Acids Res, 34, e64.

10. Sclavi, B., Sullivan, M., Chance, M.R., Brenowitz, M. and Woodson, S.A. (1998) RNA Folding

at Millisecond Intervals by Synchrotron Hydroxyl Radical Footprinting. Science, 279, 1940-1943.

11. Lipfert, J., Das, R., Chu, V.B., Kudaravalli, M., Boyd, N., Herschlag, D. and Doniach, S. (2007)

Structural Transitions and Thermodynamics of a Glycine-Dependent Riboswitch from Vibrio

cholerae. Journal of Molecular Biology, 365, 1393-1406.

12. Powers, T. and Noller, H.F. (1995) Hydroxyl radical footprinting of

ribosomal proteins on 16S ribosomal RNA. RNA,

1, 194-209.

13. Ding, F., Lavender, C.A., Weeks, K.M. and Dokholyan, N.V. (2012) Three-dimensional RNA

structure refinement by hydroxyl radical probing. Nature methods, 9, 603-608.

14. Yoon, S., Kim, J., Hum, J., Kim, H., Park, S., Kladwang, W. and Das, R. (2011) HiTRACE:

high-throughput robust analysis for capillary electrophoresis. Bioinformatics, 27, 1798-1805.

15. Karabiber, F., McGinnis, J.L., Favorov, O.V. and Weeks, K.M. (2013) QuShape: rapid,

accurate, and best-practices quantification of nucleic acid probing information, resolved by

capillary electrophoresis. RNA, 19, 63-73.

16. Lucks, J.B., Mortimer, S.A., Trapnell, C., Luo, S.J., Aviran, S., Schroth, G.P., Pachter, L.,

Doudna, J.A. and Arkin, A.P. (2011) Multiplexed RNA structure characterization with selective

2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq). Proceedings

of the National Academy of Sciences of the United States of America, 108, 11063-11068.

17. Spedding, G. (1990) Ribosomes and protein synthesis : a practical approach. IRL Press at

Oxford University Press, Oxford England ; New York.


18. Kjems, J., Egebjerg, J. and Christiansen, J. (1998) Analysis of RNA-protein complexes in vitro.

Elsevier, Amsterdam ; New York.

19. Shcherbakova, I. and Mitra, S. (2009) Hydroxyl-radical footprinting to probe equilibrium

changes in RNA tertiary structure. Methods in Enzymology, 468, 31-46.

20. Kielpinski, L.J., Boyd, M., Sandelin, A. and Vinther, J. (2013) Detection of reverse

transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods

Mol Biol, 1038, 213-231.

21. Martin, M. (2011) Cutadapt removes adapter sequences from high-throughput sequencing

reads. EMBnet J, 17, 10-12.

22. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X.,

Fan, L., Raychowdhury, R., Zeng, Q. et al. (2011) Full-length transcriptome assembly from

RNA-Seq data without a reference genome. Nat Biotechnol, 29, 644-652.

23. Langmead, B. and Salzberg, S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat

Methods, 9, 357-359.

24. Fu, G.K., Hu, J., Wang, P.-H. and Fodor, S.P.A. (2011) Counting individual DNA molecules by

the stochastic attachment of diverse labels. Proceedings of the National Academy of

Sciences, 108, 9026-9031.

25. Schneider, C.A., Rasband, W.S. and Eliceiri, K.W. (2012) NIH Image to ImageJ: 25 years of

image analysis. Nat Methods, 9, 671-675.

26. Krasilnikov, A.S., Yang, X., Pan, T. and Mondragon, A. (2003) Crystal structure of the

specificity domain of ribonuclease P. Nature, 421, 760-764.

27. Dunkle, J.A., Xiong, L., Mankin, A.S. and Cate, J.H. (2010) Structures of the Escherichia coli

ribosome with antibiotics bound near the peptidyl transferase center explain spectra of drug

action. Proc Natl Acad Sci U S A, 107, 17152-17157.

28. Weeks, K.M. (2011) RNA structure probing dash seq. Proceedings of the National Academy

of Sciences of the United States of America, 108, 10933-10934.


29. Jayaprakash, A.D., Jabado, O., Brown, B.D. and Sachidanandam, R. (2011) Identification and

remediation of biases in the activity of RNA ligases in small-RNA deep sequencing. Nucleic

Acids Res, 39, e141.

30. Casbon, J.A., Osborne, R.J., Brenner, S. and Lichtenstein, C.P. (2011) A method for counting

PCR template molecules with application to next-generation sequencing. Nucleic Acids Res,

39, e81.

31. Brimacombe, R., Atmadja, J., Stiege, W. and Schüler, D. (1988) A detailed model of the

three-dimensional structure of Escherichia coli 16 S ribosomal RNA in situ in the 30 S subunit.

Journal of Molecular Biology, 199, 115-136.

32. Bunner, A.E., Trauger, S.A., Siuzdak, G. and Williamson, J.R. (2008) Quantitative ESI-TOF

analysis of macromolecular assembly kinetics. Anal Chem, 80, 9379-9386.

33. Seetin, M.G., Kladwang, W., Bida, J.P. and Das, R. (2014) Massively Parallel RNA Chemical

Mapping with a Reduced Bias MAP-Seq Protocol. Methods Mol Biol, 1086, 95-117.

34. Schluenzen, F., Tocilj, A., Zarivach, R., Harms, J., Gluehmann, M., Janell, D., Bashan, A.,

Bartels, H., Agmon, I., Franceschi, F. et al. (2000) Structure of functionally activated small

ribosomal subunit at 3.3 angstroms resolution. Cell, 102, 615-623.


Supplementary information accompanying the paper:

Massive parallel sequencing based hydroxyl radical footprinting of RNA accessibility

Lukasz Jan Kielpinski1, Jeppe Vinther1,*,

1Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark

CONTENT Supplementary figures 1-3

Supplementary tables 1-3


Supplementary Figure 1. Impact of the solvent radius used for calculation of ribose accessible surface area on the correlation with HRF-Seq data. A) Correlation between the moving average of ΔTCR and the moving average of ribose accessible surface area calculated with different solvent radii for RNase P. The highest correlation was observed for a probe radius of 1.4 Å. B) Correlation between the moving average of ΔTCR and the moving average of ribose accessible surface area calculated with different solvent radii for 16S rRNA. The highest correlation was observed for a probe radius of 3 Å.


Supplementary Figure 2. Comparison of classical hydroxyl radical probing with HRF-Seq.

A) Autoradiogram of gel electrophoresis of RNase P hydroxyl radical probing (lower lane) with the

ddATP sequencing as size marker (upper lane). B) Quantification of the gel shown in panel A. C)

Termination EUC as obtained in the sequencing experiment for the treated RNase P sample. D) Correlation plot

between gel signal intensity and termination EUC. The R value shown is the Pearson correlation.


Supplementary Figure 3. ΔTCR values before averaging and zeroing.

A) Barplot of non-zeroed ΔTCR for the footprinting of the RNase P RNA. B) Distribution (excluding 5%

top and 5% bottom values) of non-zeroed ΔTCR for the footprinting of the RNase P RNA. Dashed, vertical

red lines represent the median ΔTCR. C) Barplot of non-zeroed ΔTCR for the footprinting of the 16S

rRNA. D) Distribution (excluding 5% top and 5% bottom values) of non-zeroed ΔTCR for the

footprinting of the 16S rRNA. Dashed, vertical red lines represent the median ΔTCR.


Oligonucleotide name           Oligonucleotide sequence (5’ to 3’)
RT_random_primer               AGACGTGTGCTCTTCCGATCTNNNNNNNNS
RT_structure_cassette          AGACGTGTGCTCTTCCGATCTGAACCGGACCGAAGCCCG
LIGATION_ADAPTER_RB            PHO-NNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3
PCR_forward                    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT
PCR_REVERSE_INDEX.14_AGTTCC    CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.16_CCGTCC    CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.22_CGTACG    CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
PCR_REVERSE_INDEX.24_GGTAGC    CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

Oligonucleotide sequences © 2007-2009 Illumina, Inc. All rights reserved.

Supplementary Table 1. Oligonucleotides used in the study
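As a small consistency check on the table, the index given in the name of each reverse PCR primer can be compared with the primer sequence. The sketch below assumes that the index named in the primer is embedded in the primer as its reverse complement; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

# Reverse PCR primers copied from Supplementary Table 1.
REVERSE_PRIMERS = {
    "PCR_REVERSE_INDEX.14_AGTTCC":
        "CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "PCR_REVERSE_INDEX.16_CCGTCC":
        "CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "PCR_REVERSE_INDEX.22_CGTACG":
        "CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
    "PCR_REVERSE_INDEX.24_GGTAGC":
        "CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT",
}

def index_matches(name, primer):
    """True if the reverse complement of the index in the name occurs in the primer."""
    index = name.rsplit("_", 1)[1]
    return reverse_complement(index) in primer

index_check = {name: index_matches(name, seq) for name, seq in REVERSE_PRIMERS.items()}
```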


Sample              Sequenced    Sequenced position
                    nucleotide   1     2     3     4     5     6     7
16S rRNA, Treated   A            0.26  0.26  0.24  0.27  0.3   0.42  0.08
                    C            0.3   0.33  0.33  0.31  0.28  0.28  0.55
                    G            0.19  0.17  0.19  0.19  0.17  0.1   0.09
                    T            0.25  0.24  0.24  0.23  0.24  0.19  0.28
16S rRNA, Control   A            0.25  0.25  0.24  0.28  0.31  0.43  0.08
                    C            0.3   0.33  0.32  0.31  0.28  0.27  0.58
                    G            0.2   0.18  0.19  0.19  0.17  0.1   0.09
                    T            0.25  0.24  0.25  0.22  0.24  0.2   0.26
RNase P, Treated    A            0.27  0.26  0.25  0.27  0.3   0.41  0.09
                    C            0.29  0.33  0.33  0.31  0.29  0.28  0.52
                    G            0.2   0.17  0.18  0.19  0.17  0.1   0.09
                    T            0.24  0.24  0.24  0.23  0.24  0.2   0.3
RNase P, Control    A            0.26  0.26  0.25  0.27  0.3   0.42  0.09
                    C            0.3   0.33  0.32  0.3   0.29  0.28  0.54
                    G            0.2   0.17  0.19  0.2   0.17  0.1   0.09
                    T            0.24  0.24  0.24  0.23  0.25  0.2   0.28

Supplementary Table 2. Nucleotide frequencies at each barcode position for each sample used to

calculate the barcode ligation probabilities.
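One way to turn the per-position frequencies into a ligation probability for a complete 7 nucleotide barcode is to multiply the frequencies of its nucleotides. The sketch below does this for the 16S rRNA treated sample; treating the barcode positions as independent is an assumption made for illustration, and the function names are hypothetical.

```python
from math import prod

# Nucleotide frequencies per barcode position, 16S rRNA treated sample
# (values copied from Supplementary Table 2).
FREQ_16S_TREATED = {
    "A": [0.26, 0.26, 0.24, 0.27, 0.30, 0.42, 0.08],
    "C": [0.30, 0.33, 0.33, 0.31, 0.28, 0.28, 0.55],
    "G": [0.19, 0.17, 0.19, 0.19, 0.17, 0.10, 0.09],
    "T": [0.25, 0.24, 0.24, 0.23, 0.24, 0.19, 0.28],
}

def barcode_probability(barcode, freq=FREQ_16S_TREATED):
    """Probability of a barcode, assuming independent positions."""
    return prod(freq[nt][i] for i, nt in enumerate(barcode))

def most_likely_barcode(freq=FREQ_16S_TREATED):
    """Barcode built from the most frequent nucleotide at each position."""
    n_positions = len(freq["A"])
    return "".join(max(freq, key=lambda nt: freq[nt][i]) for i in range(n_positions))

uniform = 0.25 ** 7                      # probability if all barcodes were equally likely
top_barcode = most_likely_barcode()
top_probability = barcode_probability(top_barcode)
```

The comparison with the uniform value illustrates why equal barcode frequencies are only an approximation: the most frequent barcodes are expected far more often than one in 4^7 ligation events.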


Sample     Counting reads   Counting unique barcodes   EUC equal barcode frequencies   EUC estimated barcode frequencies
RNase P    0.45             0.50                       0.53                            0.55
16S rRNA   0.49             0.56                       0.56                            0.56

Supplementary Table 3. Pearson correlation between HRF-Seq signal and ribose accessibility for

different methods of processing the sequencing data


11.3 Paper 3: Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA‐Stop‐Seq)


Transcriptome‐wide detection of binding sites of Locked Nucleic Acid containing oligonucleotides (LNA‐Stop‐Seq)

Abstract

Antisense oligonucleotides (ASOs) form a new class of promising drug candidates that act by hybridizing to

RNA molecules and exploit various cellular mechanisms for their function. Here, we describe the

development of a method for transcriptome‐wide characterization of ASO binding sites by finding the ASO

induced reverse transcription termination sites. First, we characterized several reverse transcriptase

enzymes and chose PrimeScript for the remaining experiments. Next, we optimized the

separation of hybridized oligonucleotides from RNA by gel filtration in formamide. Then, we

characterized the crosslinking of a 4‐thiothymidine (4‐thio‐T) modified oligonucleotide to the RNA. We

investigated two strategies for enriching the ASO‐terminated cDNA molecules. The first was based on

degradation of RNA molecules (or their parts) not protected by the crosslinked oligonucleotide. The second was

based on CAGE‐like selection of cDNA molecules terminated upon reaching a crosslinked, biotinylated

ASO. The second strategy was used to build the libraries for massive parallel sequencing. The motif generated

from the sequencing results recapitulates the sequence of the ASO used, and the overall signal shows

enrichment in the vicinity of the possible binding sites. However, a portion of the signal is of no

obvious origin and the analysis is ongoing.

Introduction

Antisense oligonucleotides have long been thought to have therapeutic potential and have lured researchers with

the promised ease of designing drugs by simply synthesizing a molecule with a sequence matching the

troublesome gene. Many strategies of action were proposed, including hybridization with microRNAs to

inhibit their function, blocking splicing machinery to modulate mRNA maturation or most commonly

degrading disease‐causing transcripts with siRNAs or gapmers (Kole et al., 2012; Stenvang et al., 2012). It

was recognized that to improve the drug properties such as delivery to the tissue of interest, hybridization

and stability, various modifications are required. One of the promising modifications is a substitution of

some or all of the nucleotides with the nucleotide analog – locked nucleic acid (LNA) (Koch et al., 2008)

which protects the oligonucleotide (ASO) from degradation by nucleases and significantly increases affinity

for the target. LNA is incorporated in many drug candidates that use strategies such as microRNA

inhibition by antisense hybridization (Lanford et al., 2009; Obad et al., 2011) or mRNA degradation with

a gapmer (Straarup et al., 2010), that is, a molecule with a DNA core (that recruits RNase H) and flanks

composed of LNA. In the case of siRNAs it was shown that they act not only on the intended targets but

also exhibit sequence‐ (Jackson et al., 2003; Lindow et al., 2012) or sequence‐non‐ (Olejniczak et al., 2010)

specific effects. In contrast, little is known about off‐target effects of LNA containing oligonucleotides,

which is of significant interest considering the current therapeutic developments of drugs based on the LNA

chemistry.


Several approaches for profiling the accessibility of RNA for interactions with

oligonucleotides have been published. These methods were based on hybridizing a target RNA with random

oligo(deoxy)nucleotides and detecting sites of efficient binding by dialysis, RNase H treatment or reverse

transcription priming. Alternatively, the RNA was hybridized to oligonucleotide‐coated arrays to detect

the oligonucleotides to which it can stably bind (summarized in (Allawi et al., 2001)).

Here we describe the development of a method, named LNA‐Stop‐Seq, which allows for the identification

of binding sites of an oligonucleotide (here, an LNA‐modified one) across the entire transcriptome. As a

proof of concept, we apply the LNA‐Stop‐Seq to find hybridization sites of a previously described gapmer,

which targets apolipoprotein B and reduces plasma level of non‐high‐density lipoprotein cholesterol

(Straarup et al., 2010). The method relies on crosslinking of the hybridized ASO bearing 4‐thiothymidine (4‐

thio‐T, Figure 1) to the transcripts and finding the specific sites of interactions with massive parallel

sequencing of reverse transcription terminations. 4‐thio‐T is an analog of the well‐characterized

crosslinking group 4‐thiouridine (4‐thio‐U), a naturally occurring nucleotide that crosslinks at

close range with both amino acids and nucleotides upon long‐wavelength UV (>320 nm) excitation. Crosslinking

sites on RNA can be detected by finding terminations of reverse transcription (Sontheimer, 1994). The

advantages of using a photocrosslinkable nucleotide for covalent binding of the ASO to its hybridization

sites include preservation of the ASO structure, stability (until irradiation) and, thanks to the long UV wavelength used,

minimal crosslinking between other groups present in the probed mixture (Meisenheimer and Koch,

1997).

Figure 1. 4‐thiothymidine structure

Materials and methods

Buffers used

2x RNA folding buffer (40 mM Tris‐HCl pH 7.8, 280 mM KCl) (Kjems et al., 1998)

2x RNA folding buffer – EDTA (40 mM Tris‐HCl pH 7.8, 280 mM KCl, 0.01 mM EDTA)

10x Mg for RNA folding (20 mM Tris‐HCl pH 7.8, 140 mM KCl, 25 mM MgCl2)

10x Mg for RNA folding – EDTA (20 mM Tris‐HCl pH 7.8, 140 mM KCl, 25 mM MgCl2, 0.005 mM EDTA)

Preparation of in vitro transcribed RNA

1. ApoB RNA fragment


The PCR product derived from the human genomic DNA with primers ApoBrev and ApoBfor+T7 using Pfu

DNA polymerase has been used as a template for transcription with T7 RNA polymerase followed by

polyacrylamide gel purification with UV shadowing for product visualization. Expected RNA sequence is

GGGAGAUUCUCCUUUAAAUCAAGUGUCAUCACACUGAAUACCAAUGCUGAACUUUUUAACCAGUCAGAUAUUG

UUGCUCAUCUCCUUUCUUCAUCUUCAUCUGUCAUUGAUGCACUGCAGUACAAAUUAGAGGGCACCACAAGAUU

GACAAGAAAAAGGGGAUUGAAGUUAGCCACAGCUCUGUCUCUGAGCA.

2. ApoB mutated fragments

Mutated fragments of ApoB were obtained in the same way as ApoB fragment, but PCR products were

synthesized with either ApoB‐rev‐A, ApoB‐rev‐C, ApoB‐rev‐G or ApoB‐rev‐T primer in pair with ApoBfor+T7.

Expected RNA sequence is

GGGAGAUUCUCCUUUAAAUCAAGUGUCAUCACACUGAAUACCAAUGCXGAACUUUUUAACCAGUCAGAUAUUG

UUGCUCAUC, where X indicates the mutated base and can be either A, C, G or U.

3. IGF‐II RNA fragment

Human IGF‐II fragment RNA was a gift from Jan Christiansen (its predicted structure is shown in Figure 8

of (Christiansen et al., 1994), but the 3’ end of the RNA molecule used is located downstream of the 3’ end

shown in the figure).

High resolution polyacrylamide electrophoresis

Thermally denatured samples were resolved on a preheated 1x TBE, 7 M urea polyacrylamide gel (concentration

given in the methods of the specific experiments) at 45‐50 W. Gels were transferred onto Whatman paper,

dried, and exposed to a phosphorimaging screen, which was subsequently scanned with the Cyclone Storage

Phosphor System (Packard), usually after 16 hours of exposure.

Choice of reverse transcriptase (Figure 2)

Reverse transcription reactions were performed according to the manufacturers' recommendations with 5’ end

labeled (T4 PNK with ATP γ‐32P) ApoB_PE for ApoB fragment or IGF2_PE.h‐p for IGF‐II fragment primers

except: (1) final volume of reactions was 18.75 µl and contained 1 µl of respective enzyme, (2) all reactions

were supplemented with 667 mM sorbitol and 133 mM trehalose, (3) initial mixture was prepared by

mixing 100 fmol labeled primer with 100 fmol RNA and 100 ng tRNA and, for the reactions marked “H”,

1 pmol ApoB‐str.dis5' (which is complementary to 5’ part of RNA) ASO, (4) Thermal conditions: initial

mixture was heated to 65°C (70°C for IGF‐II) for 5 min, transferred to ice, supplemented with master

mix and incubated as follows: 42°C for 10 min, ramp of 0.1°C/s to 50°C, held for 30 min, then 10 min at

56°C, 10 min at 60°C and cooled to 4°C. ThermoScript enzyme reactions were incubated at 50°C for 20 min,

65°C for 40 min, 85°C for 5 min and cooled to 4°C. Volume of reactions with IGF‐II fragment was scaled

down by a factor of 2. Samples were mixed with equal volume of either formamide (ApoB fragment) or

urea gel loading solution and were resolved with high resolution polyacrylamide electrophoresis.

LNA‐RNA hybrids separation assay (Figure 3A)

5 pmol of ApoB RNA fragment has been mixed with 5 µg of yeast tRNA and 10 pmol of either 6434 or 6435

ASOs, incubated 2 min at 65°C and kept at room temperature for 5 min followed by ethanol precipitation

and pellets resuspension in deionized formamide (84 µl). Columns (NucAway – Ambion, AutoSeq G‐50 – GE


Healthcare, illustra MicroSpin S‐300 – GE Healthcare) were either prepared with formamide (NucAway) or

buffer exchanged with formamide by 4x repeated application of 500 µl (S‐300) or 350 µl (AutoSeq) and

centrifugation at 730 g for 1 min. The RNA‐oligo mix in formamide (20 µl) was heat denatured (95°C for

2 min), applied on the columns and spun for 1 min at 730 g. The flow‐through was ethanol precipitated

(sample M1 and A with 4 volumes of ethanol, M2 and S with ethanol and sodium acetate – each sample

was precipitated with both protocols and chosen by the highest yield), dissolved in H2O, quantified with

NanoDrop, thermally denatured together with primer (100 fmol 5’ end labeled ApoB_PE; 70°C, 5 min) and

used for primer extension reaction (8.88 µl reactions with 1x PrimeScript buffer, 0.2 µl PrimeScript,

667 mM sorbitol and 133 mM trehalose, 0.5 mM dNTP, thermal conditions 42°C – 10 min , 50°C – 30 min,

56°C – 10 min, 60°C – 10 min and cooled to 4°C). After primer extension, the samples were ethanol

precipitated, pellets dissolved in 5 µl formamide loading dye out of which 2 µl were run on 10% high

resolution polyacrylamide gel. Signal quantification was performed with the ImageJ program (Schneider et

al., 2012) for each lane for the regions marked with the yellow bar by multiplying the mean signal from the

given region by its area and subtracting the mean background signal (as measured on a relatively large area

outside the sample region) multiplied by the measured region area.
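The background correction described above amounts to the following arithmetic, shown here as a minimal sketch. The function name and the numbers are hypothetical; the mean intensities and the area would be the values measured in ImageJ.

```python
def band_signal(mean_region, region_area, mean_background):
    """Integrated signal of a gel region after background subtraction.

    mean_region:     mean pixel intensity inside the quantified region
    region_area:     area of the quantified region (e.g. in pixels)
    mean_background: mean pixel intensity of an empty area of the gel
    """
    return mean_region * region_area - mean_background * region_area

# Example with assumed values: mean 120 over 50 pixels, background mean 20.
example_signal = band_signal(120.0, 50, 20.0)
```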

ASO‐RNA crosslinking (Figure 3B)

2.3 pmol of ApoB RNA fragment was mixed with 4 pmol respective ASO and 1056 ng yeast tRNA in 1.03x

PrimeScript buffer in 21 µl, heated to 65°C for 1 min and transferred to room temperature. 20 µl droplets

were spotted on Parafilm in Stratalinker 1800 equipped with EIKO F8T5 (Blacklight) lamps and irradiated for

approximately 23 minutes. Remaining liquid (due to evaporation only approximately 11 µl left) was ethanol

precipitated, resuspended in 22 µl deionized formamide and incubated 2 min at 95°C and 20 µl was applied

on formamide‐washed MicroSpin S‐300 columns as described in the description for Figure 3A. Flow‐

through was ethanol precipitated (with sodium acetate), pellet washed and dissolved in 10 µl H2O. 2.58 µl

was used for primer extension reaction and gel electrophoresis as described in the method for Figure 3A.

Base pairing with different nucleotide (Figure 3C)

1 pmol of an ApoB RNA fragment or of a mutated ApoB RNA fragment has been mixed with 1 µg of yeast

tRNA and 2 pmol of respective ASO in 1xRNA folding buffer in the volume of 18 µl. Samples were denatured

(1 min, 90°C) and transferred to room temperature. After cooling down, the magnesium concentration was

adjusted to 2.5 mM MgCl2 with 10x Mg for RNA folding buffer and the samples were crosslinked for 10 min

with NIS F8T5 Black Light bulbs in Stratalinker 1800, ethanol precipitated, resuspended in 20 µl deionized

formamide, purified with S‐300 columns as described in method for Figure 3B and analyzed by primer

extension and electrophoresis as described in method for Figure 3A.
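The magnesium adjustment can be checked with a short dilution calculation. The sketch assumes that the sample contains no magnesium before the adjustment and that nothing else is added; the function name is hypothetical, and the stock concentration of 25 mM MgCl2 is taken from the composition of the 10x Mg for RNA folding buffer.

```python
def stock_volume_for_final(sample_volume_ul, stock_conc_mm, final_conc_mm):
    """Volume of stock (µl) to add to a sample so the mixture reaches final_conc_mm.

    Derived from stock_conc * v = final_conc * (sample_volume + v), assuming the
    solute is absent from the sample before the addition.
    """
    if not 0 < final_conc_mm < stock_conc_mm:
        raise ValueError("final concentration must be between 0 and the stock concentration")
    return sample_volume_ul * final_conc_mm / (stock_conc_mm - final_conc_mm)

# 18 µl sample adjusted to 2.5 mM MgCl2 with the 25 mM stock.
mg_volume_18 = stock_volume_for_final(18.0, 25.0, 2.5)
# 36 µl sample, as in the folding reactions prepared for Figure 7.
mg_volume_36 = stock_volume_for_final(36.0, 25.0, 2.5)
```

Under these assumptions the 18 µl samples require 2 µl of the 10x buffer, and the 36 µl samples require 4 µl, which matches the volume stated for the Figure 7 experiment.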

Time‐gradient of crosslinking (Figure 4)

The experiment was performed as for Figure 3B, with the following modification: a 20 µl mix that contained 1 pmol

of ApoB RNA fragment, 5 pmol of ASO (6434 or 4‐thio‐1) and 1 µg of yeast tRNA in 1x PrimeScript buffer

was irradiated for varying times.

Enrichment with Terminator enzyme (Figure 5)

155 pmol cytidine‐3'‐phosphate was 5’ end labeled with 3.3 pmol ATP γ‐32P using T4 PNK enzyme in 1x T4

PNK buffer (Fermentas) by incubating in the volume of 5 µl for 40 min at 37°C followed by enzyme

inactivation at 70°C for 5 min. Obtained [5'‐32P]pCp (not purified) was ligated to the 3’ end of ApoB


fragment (used ~6 pmol) with T4 RNA ligase in 1x T4 RNA ligase buffer (Fermentas) supplemented with ATP

at 4°C overnight to yield a 3’ end labeled RNA molecule. Labeling was followed by NucAway purification

(Ambion) and 21.5 out of 38 µl of eluant was subject to RNA 5’polyphosphatase (Epicentre) treatment to

convert 5’ triphosphate to 5’ monophosphate (1x polyphosphatase buffer, 1 µl enzyme per 25 µl reaction,

37°C for 30 min) followed by one more NucAway purification. The RNA was split into 5 equal parts and

mixed with a respective ASO in 1x Terminator buffer A (Epicentre), heated to 65°C for 1 min and brought to

a room temperature for 5 min, crosslinked in open tubes with black light bulbs for 15 min. Volume was

adjusted with H2O to 10 µl and 5 µl of the reaction was transferred to the new tube for Terminator (5’

phosphate dependent exonuclease) digestion with 0.22 µl of the enzyme in 7.2 µl reaction (in 1x terminator

buffer A) for 30 min at 30°C followed by addition of 1 µg tRNA, ethanol precipitation and high resolution

10% polyacrylamide denaturing electrophoresis.

RNase I protection assay (Figure 6)

A body‐labeled ApoB RNA fragment was synthesized with T7 RNA polymerase from the same PCR template

as a non‐labeled ApoB RNA fragment with addition of UTP α‐32P. The RNA was DNase I treated (Ambion)

and purified on a NucAway column. The labeled RNA was mixed with 4‐thio‐7 ASO in 1x RNA folding buffer,

heated to 90°C for 1 min, incubated at 37°C for 15 min, supplemented with 10x Mg for RNA folding to

obtain 1x concentration, incubated for 5 more minutes at 37°C and crosslinked for 20 min with black light

bulbs followed by adjusting the volume with H2O. Irradiated RNA duplexed with ASO underwent digestion

with RNase H (NEB) with 10 mM DTT for 1 hour at 37°C followed by the ethanol precipitation enhanced by

the addition of tRNA. Products were resolved on 6% polyacrylamide denaturing gel, visualized by

autoradiography and band expected to be derived from the RNase H cleavage of RNA with crosslinked ASO

(comparison with clearly visible band on the cold gel) was cut out, the RNA eluted and precipitated. Such

prepared sample of RNA‐ASO crosslinked complex was split into two parts and used as a template for

primer extension reaction with cold ApoBrev primer, in which samples of RNA were mixed with 10 pmol

primer in the volume of 22.5 µl, incubated 5 min at 65°C and placed on ice. Reverse transcription master

mix was prepared by mixing 22.5 µl 5x PrimeScript buffer, 5.63 µl 10 mM dNTP, 45 µl sorbitol‐trehalose mix

(1.67 M and 0.33 M) and 15.38 µl H2O. The master mix was split into 2 times 88.5 µl, one supplemented

with 1.5 µl PrimeScript enzyme, one with 1.5 µl H2O and added to the RNA‐primer mix and incubated in the

same thermal conditions as reactions for Figure 3A. Reverse transcription was transferred on ice, stopped

by addition of 15 µl 50 mM EDTA and 3 µg tRNA. Each reaction was split into six times 20 µl and

supplemented with different amount (3, 1.5, 0.75 or 0.375 µl) of RNase I (Fermentas), phenol‐chloroform

extracted, ethanol precipitated and resolved on 10% polyacrylamide denaturing gel.

Demonstration of selection (Figure 7)

100 fmol of ApoB RNA fragment was mixed with 0, 100 or 1000 fmol of the respective ASO in a volume of

36 µl in 1x RNA folding buffer – EDTA, folded in a thermocycler following the program: 90°C for 1 min, ramp

0.1°C/s until 79°C, 79°C for 5 min, ramp 0.1°C/s until 74°C, 74°C for 10 min, ramp 0.1°C/s until 69°C, 69°C

for 5 min, ramp 0.1°C/s until 37°C, 37°C for 10 min, and supplemented with 4 µl 10x Mg for RNA folding –

EDTA. After folding, the samples were placed on Parafilm in a Stratalinker and irradiated for 10 min with EIKO

black light bulbs, collected, ethanol precipitated, dissolved in formamide, spun through S‐300 columns (as

in the description for Figure 3A) and reverse transcribed (as in the description for Figure 6, but scaled down to

25 µl per reaction and using only 3/10 of the correspondingly scaled amount of enzyme). After reverse

transcription, 4 µl of 0.17 µg/µl tRNA and 41.7 µl EDTA were added to each reaction, followed by

addition of 0.54 µl RNase I, incubation at 37°C for 30 min and purification with RNAClean XP beads (as

in (Kielpinski et al., 2013) but with elution in 25 µl 10 mM Tris‐HCl pH 8.3). A fraction of the sample (5 µl)

underwent selection as described in the CAGE protocol (Takahashi et al., 2012)[sections 3.6 and 3.7] but

scaled down by a factor of 7 (by volume). An adapter (LIG_DNA) was ligated to the 3’ end of cDNA from

selected (1 µl out of 16.25 µl eluted from the beads) and non‐selected (3 µl taken from purification after

RNase I treatment) samples using CircLigase as described in (Kielpinski et al., 2013) but using 10 times less

enzyme. Samples were purified (Ampure XP) and PCR amplified with LIG_PCR and ApoB_qPCR_R primers

for 35 cycles, resolved on 3% agarose, 1x TBE gel, stained with ethidium bromide and UV visualized.
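
As a rough check of how long the folding program above takes, the hold times and ramps can be summed. The sketch below is illustrative only: it assumes the thermocycler ramps at exactly the nominal 0.1°C/s (10 s per degree) and ignores the initial heating to 90°C; the names `STEPS` and `program_duration` are introduced here and are not part of the original protocol.

```python
# Folding program from the text: (hold temperature in deg C, hold time in s).
STEPS = [(90, 60), (79, 300), (74, 600), (69, 300), (37, 600)]

# A ramp of 0.1 deg C/s corresponds to 10 s per degree (assumed to be exact).
SECONDS_PER_DEGREE = 10


def program_duration(steps, seconds_per_degree):
    """Return the total run time in seconds: all holds plus the ramps between them."""
    hold = sum(seconds for _, seconds in steps)
    ramp = sum(
        abs(prev[0] - nxt[0]) * seconds_per_degree
        for prev, nxt in zip(steps, steps[1:])
    )
    return hold + ramp


total_s = program_duration(STEPS, SECONDS_PER_DEGREE)
print(f"holds + ramps: {total_s} s ({total_s / 60:.1f} min)")
```

Under these assumptions the program runs for about 40 min before the Mg supplementation step.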

Preparation of sequencing library

Mouse liver total RNA (Zyagen MR‐314) was poly(A) enriched using Ambion Poly(A) purist MAG kit with

1.6% yield. 200 ng of poly(A) RNA was folded with 0, 0.02, 0.2 or 2 pmol of ASO (4‐thio‐1 batch 2 or 4‐thio‐

1‐biotin – synthesized in parallel) in 1x RNA folding buffer – EDTA in 36 µl in a thermocycler following the

program: 90°C for 1 min, ramp 0.1°C/s until 79°C, 79°C for 5 min, ramp 0.1°C/s until 74°C, 74°C for 10 min,

ramp 0.1°C/s until 69°C, 69°C for 5 min, ramp 0.1°C/s until 37°C, 37°C for 10 min, add 4 µl 10x Mg for RNA

folding – EDTA, 37°C for 5 min, add oligo (in the volume of 2 µl, note that for this sample folding occurred in

34 µl in 1.06x RNA folding buffer – EDTA) to the “cofolded” sample, 37°C for 5 min, crosslink all samples for

10 min (drops on parafilm; no‐UV sample removed before irradiation) in Stratalinker 1800 with EIKO bulbs

and ethanol precipitate. The pellets were dissolved in 30 µl formamide (samples 11 and 12 in 8 µl H2O – no

column elution), RNA purified on S‐300 HR spin column as described in method for Figure 3A (the columns

were prepared by 2x spinning with 700 µl and 1x with 350 µl formamide) and subsequently ethanol

precipitated with glycogen as carrier and resuspended in 8 µl H2O. For the reverse transcription, 4 µl of RNA

was mixed with 1 µl 10 µM RT_15xN primer, incubated at 65°C for 5 min and cooled on ice, followed by

addition of 15 µl master mix prepared by mixing 4 volumes of 5x PrimeScript buffer, 1 volume of 10 mM

dNTP, 8 volumes of sorbitol‐trehalose mix (1.67 M and 0.33 M) and 2 volumes of PrimeScript enzyme. The

reactions were incubated in thermocycler following the program: 25°C for 30 sec, 42°C for 30 min, 50°C for

10 min, 56°C for 10 min, 60°C for 10 min and placed on ice. To each reaction 4 µl with 666 ng of tRNA and

167 nmol of EDTA and 0.5 µl RNase I was added followed by 30 min incubation at 37°C and RNAClean XP

purification as in RTTS‐Seq (Kielpinski et al., 2013) with elution in 25 µl of 10 mM Tris‐HCl pH 8.3. 20 µl from

each reaction was used for selection, which was a scaled‐down version of the reaction described in the CAGE

protocol (Takahashi et al., 2012). Briefly: 440 µg MPG streptavidin mix was incubated for 30 min with

132 µg tRNA, washed twice with wash buffer 1, resuspended in 352 µl wash buffer 1 and split into 40 µl

batches in low‐binding tubes to which 20 µl purified cDNA was added and allowed to bind for 30 min at

room temperature, followed by washes with buffers 1 (1x), 2 (1x), 3 (2x) and 4 (2x), and released with 30 µl

50 mM NaOH by incubating for 10 min. The supernatant was neutralized with 6 µl 1 M Tris‐HCl (pH 7) and

purified with 65 µl Ampure XP beads (mod. prot.) with elution in 8 µl H2O. Selected and non‐selected cDNA

was ligated with LIG_DNArandBARC oligonucleotide as described in (Kielpinski et al., 2013), purified with

Ampure XP beads (mod. prot.), eluted in 16 µl H2O. The ligated samples (5 µl) were used for PCR

amplification as described in (Kielpinski et al., 2013) in a total reaction volume of 20 µl using 22 cycles for

selected and 12 cycles for non‐selected samples (plus 4 initial three‐step cycles). After PCR, the 10x diluted amplified

libraries were quantified and quality checked on Bioanalyzer High Sensitivity chips and mixed by adding

5.38 µl of the selected samples, 10 µl of samples 9 and 10 and 4.27 µl of samples 11 and 12 onto 250 nmol of

EDTA, followed by purification with 137 µl Ampure XP according to the manufacturer’s protocol with elution

in 10 µl 10 mM Tris‐HCl, size selection on a 2% E‐Gel SizeSelect (Invitrogen) keeping fragments between 200 and 600

bp (buffer was collected from the lower chamber every 20 seconds of the electrophoresis run), volume

reduction with a Qiagen PCR purification kit with elution in 30 µl Tris‐HCl, obtaining a concentration of 5.5 ng/µl

(NanoDrop), and subsequent 94 nt long single‐read Illumina HiSeq sequencing multiplexed with another

sample from the laboratory (Jakob Rukov).
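
The reverse transcription master mix in this protocol is given as volume ratios (4:1:8:2) that add up to the 15 µl added per reaction. The sketch below shows how these ratios translate into pipetting volumes and final concentrations in the 20 µl reaction (4 µl RNA, 1 µl primer, 15 µl master mix). It is illustrative only: the function names and the `overage` parameter are introduced here, and the assignment of 1.67 M to sorbitol and 0.33 M to trehalose assumes the concentrations are listed in the same order as the components.

```python
# Master-mix composition in volume parts, as listed in the text.
PARTS = {
    "5x PrimeScript buffer": 4,
    "10 mM dNTP": 1,
    "sorbitol-trehalose mix": 8,
    "PrimeScript enzyme": 2,
}
MIX_PER_REACTION_UL = 15.0  # master mix added to each reaction
REACTION_UL = 20.0          # 4 ul RNA + 1 ul primer + 15 ul master mix


def master_mix(n_reactions, overage=0.0):
    """Component volumes (ul) for n reactions.

    `overage` is a hypothetical pipetting excess (e.g. 0.1 for 10%);
    the text does not state whether any excess was prepared.
    """
    scale = MIX_PER_REACTION_UL / sum(PARTS.values()) * n_reactions * (1 + overage)
    return {name: parts * scale for name, parts in PARTS.items()}


def final_conc(stock_conc, volume_ul, total_ul=REACTION_UL):
    """Final concentration after diluting `volume_ul` of stock into `total_ul`."""
    return stock_conc * volume_ul / total_ul


per_reaction = master_mix(1)
for name, volume in per_reaction.items():
    print(f"{name}: {volume:.2f} ul")

mix_ul = per_reaction["sorbitol-trehalose mix"]
print("buffer (x):", final_conc(5.0, per_reaction["5x PrimeScript buffer"]))
print("dNTP (mM):", final_conc(10.0, per_reaction["10 mM dNTP"]))
print("sorbitol (M, assumed order):", round(final_conc(1.67, mix_ul), 3))
print("trehalose (M, assumed order):", round(final_conc(0.33, mix_ul), 3))
```

With one part equal to 1 µl, each reaction ends up at 1x buffer and 0.5 mM dNTP.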

Massive parallel sequencing data analysis

Reads were pre‐processed with the Cutadapt utility (Martin, 2011) with options “-m 27 -a

AGATCGGAAGAGCACACGTCT -q 17”, followed by trimming and keeping the barcode (first 7 nt), TopHat2

(Kim et al., 2013) mapping to the mouse mm9 genome assembly and trimming untemplated nucleotides from

the beginning of the remaining read (Kielpinski et al., 2013). Estimated unique counts (EUC) of reads

sharing a reverse transcription termination site (RTTS) were calculated based on the number of unique

barcodes of all the reads mapping to a given location as described in Paper 2. The EUC per position was

displayed using a BedGraph track in the UCSC Genome Browser (Kent et al., 2002) as shown in Figure 9. Input

for the MEME motif discovery (Bailey and Elkan, 1994) was generated with a custom script that uses the

Bioconductor package (Gentleman et al., 2004) and extracts the sequence of 20 nt located upstream

from the position with the highest EUC in each RefSeq representation (in the case of positions with equal

counts on the same transcript, one was chosen randomly). cWords analysis (Rasmussen et al., 2013) was

performed on the web server (http://servers.binf.ku.dk/cwords/) on December 20th 2012 with the options

“Species:Mouse” and “Sequences:mRNA”. The input consisted of Ensembl gene IDs sorted (decreasing) by

the ratio of the number of reads mapping to a given transcript (longest isoform of each gene) in the tested

sample (indices 3, 5, 7) to the sum of reads mapping to this transcript in the control samples (indices 11 and

12). Motifs reported were the top motifs enriched in the up‐regulated genes. Plots for Figure 10B were

prepared according to section 3.12 in (Kielpinski et al., 2013) using the start positions of Bowtie

(Langmead et al., 2009) mappings of the oligonucleotide sequence as the annotation (options: “-y -S -a

-n 2 mm9 -c GCATTGGTATTCA”).
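
To illustrate the idea behind barcode‐based counting, the sketch below groups reads by termination site, counts distinct 7 nt barcodes, and corrects that count for barcode collisions. It is a simplified illustration, not the procedure from Paper 2: it assumes all 4^7 barcodes are equally likely, and the input format and function names are hypothetical.

```python
import math
from collections import defaultdict

BARCODE_LEN = 7                # random barcode length stated in the text
N_BARCODES = 4 ** BARCODE_LEN  # number of possible barcodes


def estimate_unique_count(n_observed, n_possible=N_BARCODES):
    """Estimate the number of molecules from the number of distinct barcodes.

    Assumption (not from the source): barcodes are drawn uniformly, so the
    expected number of distinct barcodes after n draws is
    m * (1 - (1 - 1/m)**n); this function inverts that relation.
    """
    if n_observed >= n_possible:
        return float("inf")  # barcode space saturated, estimate undefined
    return math.log(1 - n_observed / n_possible) / math.log(1 - 1 / n_possible)


def euc_per_site(reads):
    """Map each (chrom, strand, position) to its estimated unique count.

    `reads` is an iterable of (chrom, strand, rtts_position, barcode) tuples,
    a hypothetical format chosen for this example.
    """
    barcodes = defaultdict(set)
    for chrom, strand, pos, barcode in reads:
        barcodes[(chrom, strand, pos)].add(barcode)
    return {site: estimate_unique_count(len(b)) for site, b in barcodes.items()}


example = [
    ("chr12", "+", 8018380, "ACGTACG"),
    ("chr12", "+", 8018380, "ACGTACG"),  # same barcode: counted once
    ("chr12", "+", 8018380, "TTGCAAC"),
]
print(euc_per_site(example))
```

For sites with few distinct barcodes the corrected estimate is almost identical to the raw number of distinct barcodes; the correction only matters as the barcode space approaches saturation.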

RNA‐RNA interactions prediction (Figure 11)

RNA‐RNA interactions were predicted using RNAstructure v5.3 (Reuter and Mathews, 2010) and figures

were generated using VARNA v3.9 (Darty et al., 2009).

Results

Choice of a reverse transcriptase

In this study we aimed at finding the RNA‐ASO hybridization sites on the transcriptome‐wide scale by

finding the ASO‐induced reverse transcription termination sites (RTTS). To reduce biases and obtain a

signal of the highest possible quality, we first set out to choose which reverse transcriptase to use. First of

all, we found it important to select an enzyme that can efficiently pass through stable RNA structures

(Harrison et al., 1998). We performed primer extension reactions on the stable hairpins from

human IGF‐II mRNA (Christiansen et al., 1994) (Figure 2B). Comparison of seven commercially available

enzymes left us with SuperScript II, SuperScript III, PrimeScript and ThermoScript as being able to efficiently

pass through the structured RNA (Figure 2A, B). Another important consideration regarding the choice of

enzyme is its terminal transferase activity (Kulpa et al., 1997), which should be minimal to improve mapping

efficiency and precision. To test for that property, we ran a high resolution electrophoresis of primer

extension reactions performed with an in vitro transcribed RNA fragment as template. Use of the AccuScript

and AffinityScript enzymes yielded a very well defined full‐length product (Figure 2C), use of the PrimeScript

enzyme resulted in two main bands, use of the SuperScript enzymes gave rise to one main band and several

weaker ones, while ThermoScript‐derived cDNA molecules had the widest length distribution.

Moreover, an additional concern with the strategy was that performing a randomly primed reverse

transcription reaction in a complex mixture may cause interference between synthesized cDNA

molecules. This can happen if a newly synthesized strand terminates on a cDNA strand

synthesized upstream on the same RNA molecule. That interference would be minimized if the reverse

transcriptase had efficient strand displacement activity. To check for that property we carried

out the primer extension reaction in the presence of a DNA ASO hybridized to the RNA (upstream from the

site complementary to the labeled primer), which resulted in the detection of undesired shorter cDNA

molecules for AccuScript, AffinityScript and ThermoScript. The low resolution of the gel does not allow

distinguishing whether the early termination is related to the presence of RNase H activity or to inefficient strand

displacement (Figure 2D), but both would be undesired. The combination of the tests led us to choose

PrimeScript as the optimal enzyme for our assay.

Figure 2. Characterization of reverse transcriptases. (A) Gel electrophoresis of primer extension reactions on the highly structured IGF‐II RNA fragment (shown in panel B) with different reverse transcriptases. (B) Structure of the IGF‐II fragment after (Christiansen et al., 1994). The red arrow indicates the primer binding site (only the reverse transcribed part of the RNA is shown). (C) Heterogeneity of full length products of primer extension on the ApoB fragment. (D) Primer extension of the ApoB fragment with (H) or without (C) a complementary DNA oligonucleotide. Enzyme name abbreviations: SII – SuperScript II (Invitrogen), SIII – SuperScript III (Invitrogen), P – PrimeScript (TaKaRa), Ac – AccuScript (Agilent), Af – AffinityScript (Agilent), G – GoScript (Promega), T – ThermoScript (Invitrogen), N – no enzyme. The red rectangle indicates image resizing – given original proportions it is a square (which also applies to Figure 3 and Figure 4).

[Figure 2 image: gel panels A, C and D with lanes labeled by enzyme abbreviation (C and H lanes for each enzyme in panel D), and panel B showing the secondary structure of the IGF‐II fragment, numbered 1–141.]

Crosslinking characterization

Although our assay for finding the hybridization sites depends on reverse transcription termination upon

reaching the crosslinked ASO, we hypothesized that termination can also be induced by a hybridized but

not crosslinked ASO, especially by high‐affinity ASOs containing LNA. This creates a risk of observing reverse

transcription terminations at hybridization sites that were not occupied during crosslinking but were

taken by the ASO in the later steps of the protocol. Therefore, it was crucial to develop a method of

removing non‐covalently bound LNA ASOs before the reverse transcription reaction. Such a separation of ASO

from the bound RNA requires (1) dissociation and (2) subsequent physical separation, preferably based on

the different molecular sizes. According to a previous report (Pinder et al., 1974), formamide aids RNA duplex

melting and inhibits its reassociation upon cooling, properties that make it a suitable solvent for this

process. We compared several commercially available gel exclusion columns for size

separation of a formamide dissolved, heat‐denatured RNA‐ASO mix and assessed their performance with a

primer extension assay by measuring the ratio of the signal in the region surrounding the ASO binding site to

the signal of the full length product (Figure 3A). For the 8‐mer oligonucleotide (6435), all of the tested columns

gave similar results, reducing the signal observed in the non‐purified sample to approximately

background level, indicating efficient separation. On the other hand, only the formamide soaked illustra

MicroSpin S‐300 HR column gave comparably good results when separating the 13‐mer oligonucleotide from the

RNA, and we decided to use it for subsequent experiments.

Next, we wanted to determine how the crosslinking to RNA depends on the position of 4‐thio‐T incorporation

within the ASO. We performed crosslinking of 8 different ASOs, based on a 13‐mer and an 8‐mer, both

fully complementary to the in vitro transcribed RNA fragment. They were synthesized with 4‐thio‐T located

either internally, at their 5’ or 3’ end, or without any crosslinkable group. We then annealed the ASOs to the

RNA, crosslinked with long‐range UV and looked for primer extension terminations (Figure 3B). In

agreement with previous findings (Dubreuil et al., 1991), the internally incorporated 4‐thio‐T did not lead to

efficient crosslinking. Reactive groups incorporated at one of the termini crosslinked well to the RNA,

underscoring the requirement for structural flexibility. Based on this experiment we chose the 5’

incorporation site of 4‐thio‐T for our future experiments due to better defined reverse transcription

termination sites upstream of the ASO binding site, although we cannot exclude the possibility of distal

crosslinks outside of the surveyed region.

Since flexibility is required for the crosslinking, we decided to test whether the type of nucleotide that could

potentially base‐pair with 4‐thio‐T impacts the crosslinking efficiency. We performed an experiment

analogous to the one described above, but checking (1) crosslinking of the same ASO to four RNA targets

that differ by the single nucleotide that can possibly base‐pair with 4‐thio‐T and (2) a panel of different ASOs

positioned on the same target in such a way that the 4‐thio‐T can base‐pair with different nucleotides (Figure

3C). Gel autoradiography revealed that base pairing of the terminal nucleotide does not exclude the

possibility of crosslinking, but that slight changes in the positioning of the ASO on the target can have a big

impact on the crosslinking induced RTTS pattern.

We have also checked the dependence of crosslinking efficiency on irradiation time (Figure 4). Comparison

of signal strength upstream of the ASO binding to the full length product shows dose‐response (with

strongest response at the beginning of irradiation) for the samples with 4‐thio‐T and no effect in the

samples without 4‐thio‐T.

Figure 3. Characterization of crosslinking of 4‐thio‐T containing ASO to RNA by primer extension reaction on ApoB fragment. (A) A comparison of different ways of washing out non‐crosslinked hybridized oligonucleotides (N – no treatment, M1 and M2 – NucAway column (Ambion), S – illustra MicroSpin S‐300 (GE Healthcare), A – AutoSeq G‐50 (GE Healthcare)). Numbers above the lanes indicate the ratio between the signal in the oligonucleotide‐related termination (lower yellow bar) to the signal of the full length product (upper yellow bar). (B) Assessing the impact of the position of 4‐thio‐T incorporation in the oligonucleotide, N‐ no 4‐thio‐T, 5’‐ 4‐thio‐T at the 5’ end, 3’ – at the 3’ end, Int – 4‐thio‐T incorporated internally, 13‐mer – ASO design based on 6434, 8‐mer ‐ ASO design based on 6435 (see Table 1). (C) Impact of the identity of the nucleotide to which 4‐thio‐T can potentially basepair (indicated below the lanes) on crosslinking. In the figures (B) and (C), right panel is the copy of the left panel with the overlaid predicted region of hybridization (blue bar) and 4‐thio‐T location (red circle).

Figure 4. Impact of irradiation time on the crosslinking efficiency. Gel electrophoresis of a primer extension reaction on ApoB RNA fragment irradiated with UV for different amount of time with oligonucleotide with (4‐thio‐1) or without (6434) incorporated 4‐thiothymidine.

Enrichment strategies

Planning to apply our method to a transcriptome‐wide study, we realized that it would be highly

advantageous and economical to enrich for cDNA molecules that terminated at the ASO crosslinking site,

while removing the background signal, including cDNA terminated at the RNA 5’ ends. We explored

two enrichment strategies: (1) Terminator exonuclease digestion of RNA and (2) use of a biotin labeled ASO

with streptavidin binding of full length RNA‐cDNA complexes.

5’ phosphate dependent exonuclease enrichment

Based on a previous report indicating successful use of an exonuclease to detect RNA modifications

(Steen et al., 2010) we hypothesized that the use of the 5’ phosphate dependent Terminator exonuclease would

enable us to remove parts of RNA upstream from the crosslinked ASOs (Figure 5A). First, we needed to

show that the exonuclease would actually terminate upon reaching the crosslinked ASO. To test that,

we 3’ labeled an in vitro transcribed RNA molecule, converted its 5’ triphosphate to monophosphate to

make it a suitable substrate for the exonuclease, crosslinked it with different ASOs, treated it with Terminator and

resolved it electrophoretically on a high resolution polyacrylamide gel (Figure 5B). The results indicate that (1)

a non‐crosslinked ASO does not protect the RNA (lanes labeled “6434”) and (2) an ASO crosslinked via a group

attached to either its 5’ or 3’ end can terminate the exonucleolytic degradation of RNA (lanes “4‐thio‐

1”, “4‐thio‐2” and “4‐thio‐5”), largely increasing the ratio between crosslinked and the remaining full length

RNA (α/β and δ/γ) while preserving a large fraction of the intended target (δ/α) (Figure 5C, D).

Figure 5. Enrichment of RNA with crosslinked ASOs using 5’ phosphate dependent exonuclease (Terminator). (A) Strategy of enrichment. Mixture of RNA molecules with (1) or without (2) crosslinked ASO is treated with the Terminator exonuclease that fully digests species (2) but terminates on the crosslinked ASO from the species (1) yielding 3’ RNA fragments (3). (B) 3’ end labeled ApoB RNA fragment was crosslinked with various oligonucleotides and digested with the Terminator exonuclease. 5’ppp – sample not treated with 5’polyphosphatase hence bearing 5’ triphosphate which is not a substrate for used exonuclease, 5’p – not crosslinked RNA, 6434, 4‐thio‐1, 4‐thio‐2, 4‐thio‐5 – RNA crosslinked with one of the ASOs (see Table 1 for the sequences), C – no‐Terminator control, T – samples treated with Terminator. (C) Zoom‐in into 4‐thio‐1 crosslinked sample electrophoresis. Markings (1),(2) and (3) indicate suspected molecular species represented by a given band according to the model shown on the panel (A). (D) Ratios between quantified signals from boxes marked by Greek letters on the panel (C).

CAGE‐like selection

We have shown that the use of the Terminator digestion strategy can degrade parts of RNA upstream from the

crosslinked ASO, but it doesn’t solve the problem of possible background arising from the spontaneous

cDNA synthesis terminations downstream from the crosslinked ASO. To resolve that issue we have devised

another strategy for enrichment, based on the idea for enrichment of capped molecules as described in the

CAGE protocol (Kodzius et al., 2006), but instead of biotinylating the cap we have planned to use a 3’

biotinylated ASO (Figure 6A). Relying on the very high specificity of CAGE method (Takahashi et al., 2012),

we expected that the large majority of cDNA molecules after the stringent washing will be derived from the

cDNA molecules whose synthesis terminated on the crosslinked, biotinylated ASO. First, we needed to

check if RNase I used in the CAGE study would or wouldn’t cleave RNA between cDNA 3’ terminus and the

crosslinked ASO. In order to check that, we have prepared a body‐labeled RNA molecule with the

crosslinked ASO at its 5’ terminus, which was used as a template for a primer extension reaction with the

primer complementary to its 3’ terminus. We digested such prepared cDNA‐RNA hybrid with different

concentrations of RNase I and resolved the products on the denaturing high‐resolution polyacrylamide gel

(Figure 6B). As expected, RNase I degraded the RNA in the samples without the protective cDNA (“RT ‐“

lanes). Moreover, the bands in the digested samples with the protective cDNA clustered in four groups.

Analysis of the gel led us to conclude that (numbers in the brackets relate to the structures drawn on the

right side of Figure 6B) (1) the longest species is the full length RNA with the crosslinked ASO – which

suggests that the RNA on the cDNA‐ASO border was protected from RNase I cleavage, (2) the second

longest is the species for which the RNase I cleaved between crosslinked ASO and cDNA, suggesting that

the protection is not fully efficient. Sequence analysis of the RNA molecule revealed that 30 nt downstream

from the fully matched complementary site lies a partially complementary site (Figure 6C), which apparently

bound the ASO in our assay giving rise to the bands (3) and (4) as being analogous to bands (1) and (2). This

suspicion is strengthened by counting nucleotide bands in the degradation ladder in the [“RT ‐“, “RNase I ‐“]

lane which shows that clusters (1) and (3) are separated by 30 nt. We have shown that the RNA on which

cDNA and crosslinked ASO are hybridized is partially protected from RNase I cleavage at the border of cDNA

and ASO and that we can use this property for the CAGE‐like enrichment strategy.

Figure 6. CAGE‐like enrichment with biotinylated oligonucleotide. (A) Strategy of enrichment. The RNA with crosslinked ASO bearing biotin (blue circle) on the 3’ end is used in reverse transcription reaction yielding different products in the mix (red line – cDNA, black line – RNA, purple ‐ ASO) that are used as substrates for RNase I that cleaves RNA not protected by cDNA and finally for the capture of biotin‐conjugated oligonucleotides crosslinked to the RNA hybridized to the full length cDNA (not‐full‐length cDNA is washed away because of the RNase I cleavage between ASO and cDNA). (B) Gel electrophoresis of body‐labeled RNA with the oligonucleotide attached to its 5’ end that was (RT +) or was not (RT ‐) covered with cDNA and was digested with different concentrations of RNase I. Drawings on the right indicate suspected structure of the bands located at the same height. Note that only species (1) and (3) would be selected according to the model shown in (A). (C) Intended (left) and possible secondary (right) binding site of the oligonucleotide used in this experiment. Secondary binding site is responsible for emergence of species (3) and (4) shown on the panel (B).

To show that the biotin‐enrichment strategy can be used to enrich for cDNA molecules terminated before

crosslinked ASO over other species of cDNA molecules, we have performed an experiment with crosslinking

of a biotinylated ASO to an in vitro transcribed RNA molecule. In this experiment we reverse transcribed the

RNA crosslinked with different concentrations of ASOs with or without biotin on their 3’ end, followed by

RNase I treatment and selection on streptavidin beads. To the 3’ end of such selected cDNA an adapter was

ligated, the construct was PCR amplified with one primer matching the cDNA and the other complementary

to the ligated adapter (Figure 7). We have expected to observe two length species – (1) longer products

(expected length 183 bp) derived from cDNA molecules that reached the RNA terminus and (2) shorter

(expected length ~136 bp) that terminated before the crosslinked ASO. We have observed the biotin‐

dependent enrichment of the products of the second species, confirming that the CAGE‐like selection of

cDNA molecules terminated on the crosslinked ASO is feasible.

Figure 7. Demonstration of selection. ApoB RNA fragment was crosslinked with different concentrations of ASO (Oligo/RNA ratio) with (4‐thio‐1‐biotin) or without (4‐thio‐1) biotin, cDNA was synthesized with specific primer and underwent (S) or not (N) the selection on streptavidin beads, followed by linker ligation and PCR. The figure shows the agarose electrophoresis of PCR products. Arrows indicate products derived from the full‐length (1) or stopped at the ASO (2) cDNA.

Massive parallel sequencing based transcriptome‐wide search for ASO binding sites

Encouraged by the demonstration of working selection we decided to construct the transcriptome‐wide

map of ASO binding with the high‐throughput sequencing of selected cDNA molecules. We started by

crosslinking the poly(A) fraction of mouse liver RNA with different concentrations of biotinylated or non‐

biotinylated ApoB‐targeting ASO. Afterwards, we have performed the randomly primed reverse

transcription and the streptavidin based CAGE‐like selection (plus non‐selected controls). We have

expected to keep only cDNA molecules that reached the biotinylated ASO. Those cDNA molecules were

transformed into sequencing libraries and sequenced on the HiSeq 2000 sequencer with the protocol

described in Paper 1. The obtained reads were mapped and the estimated unique counts (EUC) were

calculated as described in Paper 2 (Figure 8, Table 2).

Figure 8. Workflow of sequencing library preparation. RNA (purple) is hybridized with the biotin (green) containing ASO (red), crosslinked and used for cDNA synthesis (blue). Following RNase I treatment, the mixture undergoes selection on the streptavidin beads, cDNA is released and adapter bearing 7 nt random barcode is ligated to the cDNA 3’ end. Subsequent PCR introduces sample‐specific index which allows multiplexed Illumina sequencing.

Sequencing and mapping statistics (Table 2) indicate that the number of sequencing reads in the selected

samples increases with more biotinylated ASO used (compare indices 3, 5, 7), as well as when allowing for

more favorable hybridization conditions (compare indices 5 and 8), a phenomenon that was also observed

with Bioanalyzer quantification of the prepared libraries (not shown). Moreover, we observed that the

selection reduced the number of contaminating reads derived from adapter ligation to the non‐

extended reverse transcription primer (see “Cutadapt trimmed” column), but significantly increased the

number of observed PCR duplicates (Barcode collapsed/Mappings column), which most likely stems from

the limiting amount of material left after the selection reaction. Since reverse transcription terminates

upon reaching the 5’ end of RNA templates, one would expect to see enrichment of read ends on the 5’ side of

the mRNA molecules, which is indeed observed for non‐selected samples (compare columns ‘5’ UTR’ and

‘3’ UTR’ in Table 2). Selection of the samples with biotinylated ASO equalized the coverage over transcripts,

but for an unknown reason the streptavidin selection of non‐biotinylated molecules (representing selection

noise) seems to bias end mappings towards 3’ UTRs. As a first confirmation that the mapped reads are

indeed associated with the studied ASO we looked at their distribution around the intended target

site located on the ApoB transcript (Figure 9A). Visual inspection revealed an ASO‐dependent signal just

downstream from the binding site in both the selected and non‐selected samples, but not in the selected

samples crosslinked to the non‐biotinylated ASO. The observed signal confirmed that the sequencing strategy

worked as expected. Inspection of regions located on the ApoB transcript further away from the intended

target site revealed many peaks of a height comparable to that around the intended target site (Figure

9B), suggesting that the dataset contains additional information that is difficult to interpret. To find out whether we can

observe the ASO binding signal on the transcriptome‐wide scale, we extracted the sequences (20 nt)

upstream of the position with the highest EUC in each transcript and used them as input for the motif

discovery software MEME (Bailey and Elkan, 1994), yielding motifs highly similar to the ASO used in the

study (Figure 10A). Additionally, the analysis of changes in the total number of reads mapped to a given

transcript between treated (indices 3, 5, 7) and control samples (indices 11, 12) with cWords

(Rasmussen et al., 2013) revealed that for the two higher concentrations of ASO used, the most enriched

motif in the genes upregulated by the selection is the motif derived from the used ASO (Figure 10A).

Interestingly, both motif discovery methods, MEME and cWords, identified the 5’ part (proximal to the

crosslinking group) of the ASO as the most significant. In the reversed approach, we used our

knowledge of the ASO sequence to find all matching sites in the genome (allowing up to 2 mismatches) and

calculated the sum of EUC as a function of distance from the ASO matched locations (Figure 10B). This

analysis confirmed that the signal density increases in close proximity to the ASO 5’ end both in

selected and non‐selected samples (with reduced background in selected samples) but not in the selected

sample with non‐biotinylated ASO.
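
The preparation of the MEME input described above (20 nt upstream of the highest‐EUC position in each transcript, with ties broken at random) can be sketched as follows. The original analysis used a custom script based on Bioconductor; the Python version below is only an illustration in transcript coordinates, and its input dictionaries, function names and the decision to skip peaks closer than 20 nt to the transcript start are assumptions made for this example.

```python
import random

WINDOW = 20  # nt upstream of the peak, as stated in the text


def upstream_windows(sequences, euc, window=WINDOW, rng=None):
    """Yield (transcript_id, sequence) for the window upstream of each EUC peak.

    sequences: {transcript_id: transcript-sense sequence string}
    euc:       {transcript_id: {0-based position: EUC}}
    Ties between equally high positions are resolved by a random choice.
    """
    rng = rng or random.Random()
    for tid, counts in euc.items():
        if not counts:
            continue
        top = max(counts.values())
        peak = rng.choice(sorted(p for p, c in counts.items() if c == top))
        start = peak - window
        if start < 0:
            continue  # assumption: peaks too close to the 5' end are skipped
        # The window ends just before the peak position itself.
        yield tid, sequences[tid][start:peak]


def to_fasta(records):
    """Format (id, sequence) pairs as FASTA text suitable as MEME input."""
    return "".join(f">{tid}\n{seq}\n" for tid, seq in records)


seqs = {"tx1": "A" * 10 + "CG" * 10 + "T" * 5}
counts = {"tx1": {30: 5.0, 12: 1.0}}
print(to_fasta(upstream_windows(seqs, counts, rng=random.Random(0))))
```

Passing a seeded `random.Random` instance makes the tie‐breaking reproducible between runs.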

Figure 9. Mapped signal around the intended target site. (A) Genome browser view of the EUC per nucleotide in the vicinity of the ASO intended target site in the mouse ApoB transcript. Region complementary to the used ASO is highlighted with orange bar and the 4‐thio‐T position marked with a green circle. (B) Zoomed‐out view centered on the region shown on the panel (A) (highlighted in green). Samples description: S/NS – selected, non‐selected; T/TB – oligonucleotide with 4‐thio‐T, with 4‐thio‐T and biotin, number of “+” indicates amount of oligonucleotide used. Full samples description in Table 2.

[Figure 9 image. Panel A: chr12:8,018,350-8,018,390; panel B: chr12:8,017,950-8,018,800. Track labels as listed in the figure: 3 (S, TB, +); 5 (S, TB, ++); 7 (S, TB, +++); 8 (S, TB, ++); 9 (NS, T, ++); 10 (NS, TB, ++); 11 (NS, -); 12 (NS, -); 2 (S, T, +); 4 (S, T, ++); 6 (S, T, +++).]


Figure 10. Analysis of sequencing data. (A) Motif recognized by MEME (logo), based on the location of the nucleotide with the highest EUC in each transcript, and by cWords (red bar under the logo), based on the set of transcripts up-regulated after selection. The sequence of the oligonucleotide used is shown below the logos (upper case – LNA, lower case – DNA, b – biotin, S – 4-thio-T). (B) Genome-wide sum of EUC for nucleotides within +/- 50 nt of the 5’ end of the ASO-matched sites (2 mismatches allowed). Sample descriptions are given in the caption of Figure 9.

General trends in the data, such as the ability to recover the ASO sequence by motif discovery and the genome-wide signal enrichment around matching sites, are encouraging, but the aim of this study is to find the precise locations of alternative ASO binding sites. The first clue to such a site came from the observation that the fraction of EUC mapping to the mitochondrial genome was much higher in the selected, biotin-containing samples than in the remaining samples (Table 2). The distribution of EUCs over the mitochondrial chromosome showed a high peak at position 5607 in the cytochrome c oxidase I (mt-Co1) transcript. In silico folding of the region preceding the peak showed a possible ASO binding site located approximately 65 nucleotides upstream of it and separated from it by a hairpin structure (Figure 11A). This observation hints that the detected signal comes from a nucleotide that was close to the crosslinking group in three-dimensional space, but not necessarily in the linear sequence. Another interesting observation regarding novel binding sites is the site in the highly expressed Fabp1 transcript (Figure 11B). In this case our assay recovered a binding site that, apart from mismatches, contained a bulged mRNA nucleotide, which makes computational detection of such sites much more challenging than detection of sites that differ from the perfect match by simple mismatches only.

Figure 11. Examples of detected ASO binding sites. (A) Predicted ASO binding to mt‐Co1 transcript (nucleotide count in chromosome M coordinates) and (B) to the mouse Fabp1 gene (nucleotide counting in RefSeq transcript coordinates). Red‐filled circles – ASO, yellow circles – 4‐thio‐T. Arrows indicate sites with very high EUC in the selected samples.

Discussion

The current state-of-the-art design of specific antisense sequences for drug discovery relies on a computational search of genome or transcript databases for similarities, which, to be accomplished in reasonable time, needs to use simplified folding rules (Tafer and Hofacker, 2008). This approach is practical for the initial screen of candidate molecules, but it leaves the risk that true off-target binding sites do not fulfill the algorithm criteria, as exemplified by the site found in Fabp1 (Figure 11B). The most popular strategy for determining ASO off-target sites is transcriptome profiling with identification of down-regulated genes that contain a sequence match to the ASO (Jackson et al., 2003). Such an approach is helpful in finding safe drug candidates, but it cannot discern true off-target interactions from secondary effects. Our strategy, in contrast, focuses solely on the existence of direct interactions between a transcript and the ASO. It is worth noting that pinpointing an interaction does not imply function, since hybridization may occur without leading to gene regulation. This suggests that the method is a supplement to, but not necessarily a substitute for, transcriptome profiling. The analysis of the obtained transcriptome-wide dataset is currently ongoing. We hope that the inclusion of samples folded with different oligonucleotide concentrations will allow for a better understanding of the binding thermodynamics at different locations, with strong binders being highly occupied even at low concentration and weak targets being bound only after a certain concentration is reached (as exemplified in Figure 10A). Moreover, comparison of samples differing only in the folding protocol should let us better understand the impact of preexisting RNA structures on


interactions with introduced ASOs. Furthermore, the results will show the possible RNA-ASO hybridization modes, which can be used to improve the existing off-target finding algorithms. We believe that the presented method is suitable for wide adoption in the antisense drug discovery community and will further our understanding of interactions between RNA molecules and oligonucleotides.

Acknowledgments

We thank our collaboration partners Morten Lindow and Peter Hagedorn from Santaris Pharma A/S for insightful discussions and for supplying the modified oligonucleotides used in this study.

References

Allawi, H.T., Dong, F., Ip, H.S., Neri, B.P., and Lyamichev, V.I. (2001). Mapping of RNA accessible sites by extension of random oligonucleotide libraries with reverse transcriptase. RNA (New York, NY) 7, 314-327.

Bailey, T.L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28‐36.

Christiansen, J., Kofod, M., and Nielsen, F.C. (1994). A guanosine quadruplex and two stable hairpins flank a major cleavage site in insulin‐like growth factor II mRNA. Nucleic Acids Res 22, 5709‐5716.

Darty, K., Denise, A., and Ponty, Y. (2009). VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25, 1974‐1975.

Dubreuil, Y.L., Expert‐Bezancon, A., and Favre, A. (1991). Conformation and structural fluctuations of a 218 nucleotides long rRNA fragment: 4‐thiouridine as an intrinsic photolabelling probe. Nucleic Acids Res 19, 3653‐3660.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.

Harrison, G.P., Mayo, M.S., Hunter, E., and Lever, A.M. (1998). Pausing of reverse transcriptase on retroviral RNA templates is influenced by secondary structures both 5' and 3' of the catalytic site. Nucleic Acids Res 26, 3433‐3442.

Jackson, A.L., Bartz, S.R., Schelter, J., Kobayashi, S.V., Burchard, J., Mao, M., Li, B., Cavet, G., and Linsley, P.S. (2003). Expression profiling reveals off‐target gene regulation by RNAi. Nat Biotechnol 21, 635‐637.

Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996‐1006.

Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol 1038, 213‐231.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36.

Kjems, J., Egebjerg, J., and Christiansen, J. (1998). Analysis of RNA‐protein complexes in vitro (Amsterdam ; New York, Elsevier).

Koch, T., Rosenbohm, C., Hansen, H.F., Hansen, B., Marie Straarup, E., and Kauppinen, S. (2008). Chapter 5 Locked Nucleic Acid: Properties and Therapeutic Aspects. In Therapeutic Oligonucleotides (The Royal Society of Chemistry), pp. 103‐141.


Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M., Fukuda, S., Tagami, M., Sasaki, D., Imamura, K., Kai, C., Harbers, M., et al. (2006). CAGE: cap analysis of gene expression. Nat Methods 3, 211‐222.

Kole, R., Krainer, A.R., and Altman, S. (2012). RNA therapeutics: beyond RNA interference and antisense oligonucleotides. Nat Rev Drug Discov 11, 125‐140.

Kulpa, D., Topping, R., and Telesnitsky, A. (1997). Determination of the site of first strand transfer during Moloney murine leukemia virus reverse transcription and identification of strand transfer‐associated reverse transcriptase errors. EMBO J 16, 856‐865.

Lanford, R.E., Hildebrandt‐Eriksen, E.S., Petri, A., Persson, R., Lindow, M., Munk, M.E., Kauppinen, S., and Orum, H. (2009). Therapeutic Silencing of MicroRNA‐122 in Primates with Chronic Hepatitis C Virus Infection. Science 327, 198‐201.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory‐efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25.

Lindow, M., Vornlocher, H.‐P., Riley, D., Kornbrust, D.J., Burchard, J., Whiteley, L.O., Kamens, J., Thompson, J.D., Nochur, S., Younis, H., et al. (2012). Assessing unintended hybridization‐induced biological effects of oligonucleotides. Nature Biotechnology 30, 920‐923.

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17, 10-12.

Meisenheimer, K.M., and Koch, T.H. (1997). Photocross‐linking of nucleic acids to associated proteins. Crit Rev Biochem Mol Biol 32, 101‐140.

Obad, S., dos Santos, C.O., Petri, A., Heidenblad, M., Broom, O., Ruse, C., Fu, C., Lindow, M., Stenvang, J., Straarup, E.M., et al. (2011). Silencing of microRNA families by seed‐targeting tiny LNAs. Nature Genetics 43, 371‐378.

Olejniczak, M., Galka, P., and Krzyzosiak, W.J. (2010). Sequence‐non‐specific effects of RNA interference triggers and microRNA regulators. Nucleic Acids Res 38, 1‐16.

Pinder, J.C., Staynov, D.Z., and Gratzer, W.B. (1974). Properties of RNA in formamide. Biochemistry 13, 5367‐5373.

Rasmussen, S.H., Jacobsen, A., and Krogh, A. (2013). cWords ‐ systematic microRNA regulatory motif discovery from mRNA expression data. Silence 4, 2.


Schneider, C.A., Rasband, W.S., and Eliceiri, K.W. (2012). NIH Image to ImageJ: 25 years of image analysis. Nat Methods 9, 671‐675.

Sontheimer, E.J. (1994). Site‐specific RNA crosslinking with 4‐thiouridine. Mol Biol Rep 20, 35‐44.

Steen, K.A., Malhotra, A., and Weeks, K.M. (2010). Selective 2'‐hydroxyl acylation analyzed by protection from exoribonuclease. J Am Chem Soc 132, 9940‐9943.

Stenvang, J., Petri, A., Lindow, M., Obad, S., and Kauppinen, S. (2012). Inhibition of microRNA function by antimiR oligonucleotides. Silence 3, 1.



Tafer, H., and Hofacker, I.L. (2008). RNAplex: a fast tool for RNA‐RNA interaction search. Bioinformatics 24, 2657‐2663.

Takahashi, H., Kato, S., Murata, M., and Carninci, P. (2012). CAGE (Cap Analysis of Gene Expression): A Protocol for the Detection of Promoter and Transcriptional Networks. In Gene Regulatory Networks, B. Deplancke, and N. Gheldof, eds. (Totowa, NJ, Humana Press), pp. 181‐200.


Tables

Table 1. Oligonucleotides used in the study

Name | Sequence | Remarks
ApoBrev | TGCTCAGAGACAGAGCTGTG | DNA
ApoBfor+T7 | CAGAGATGCATAATACGACTCACTATAGGGAGATTCTCCTTTAAATCAAGTGTCATCA | DNA
ApoB-PE | GATGAGCAACAATATCTGACTGG | DNA
IGF2_PE.h-p | TCCAACCGCCAGACTTCCCAC | DNA
ApoB-str.dis5' | GTGTGATGACACTTGATTTAAAGGAGAATCTCCC | DNA
6434 | GCatTgGtatTCA | see note
6435 | ATTGGTAT | see note
4-thio-1 | SGCatTgGtatTCA | see note
4-thio-1-biotin | SGCatTgGtatTCA-biotin | see note
4-thio-2 | GCatTgGtatTCAS | see note
4-thio-3 | GCatTgGSatTCA | see note
4-thio-4 | SATTGGTAT | see note
4-thio-5 | ATTGGTATS | see note
4-thio-6 | ATTGGSAT | see note
4-thio-7 | SGCattggtatTCA | see note
ApoB-rev-A | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCTGCATTGG | DNA
ApoB-rev-C | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCGGCATTGG | DNA
ApoB-rev-G | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCCGCATTGG | DNA
ApoB-rev-T | GATGAGCAACAATATCTGACTGGTTAAAAAGTTCAGCATTGG | DNA
LIG_DNA | PHO-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3 | DNA, modifications
LIG_DNArandBARC | PHO-NNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3NHC3 | DNA, modifications
LIG_PCR | ACACTCTTTCCCTACACGACGCT | DNA
RT_15xN | AGACGTGTGCTCTTCCGATCTNNNNNNNNNNNNNNN | DNA

Note (rows 6434 to 4-thio-7): Upper case – LNA, lower case – DNA; “S” indicates 4-thio-T. Oligonucleotides synthesized by Santaris Pharma A/S. ASOs 4-thio-1 to 4-thio-3 are analogs of the gapmer described in (Straarup et al., 2010) lacking the long stretch of consecutive DNA nucleotides in the middle. This will enable us to apply our assay in the in vivo setting, since otherwise RNase H directed degradation of the targets would occur.

Table 2. Sequenced samples and mapping statistics

Index | Selection | Oligo | Amount of oligo | No of reads | Cutadapt discarded (too short) | Mappings/Reads | (Barcode collapsed)/Mappings | chrM | 5' UTR | 3' UTR
1 | T | - | 0 | 7,558,169 | 1% | 41% | 2% | 8% | 1.05 | 2.96
2 | T | 4-thio-1 | 0.02 | 7,371,391 | 2% | 38% | 1% | 8% | 0.61 | 3.59
3 | T | 4-thio-1-biotin | 0.02 | 3,823,000 | 5% | 43% | 2% | 35% | 1.06 | 1.41
4 | T | 4-thio-1 | 0.2 | 5,704,255 | 2% | 41% | 2% | 8% | 1.32 | 2.99
5 | T | 4-thio-1-biotin | 0.2 | 20,405,141 | 1% | 51% | 2% | 41% | 0.97 | 1.58
6 | T | 4-thio-1 | 2 | 8,347,640 | 2% | 40% | 2% | 8% | 0.71 | 3.52
7 | T | 4-thio-1-biotin | 2 | 35,482,366 | 1% | 52% | 2% | 49% | 0.46 | 0.79
8 | T | 4-thio-1-biotin (added after folding) | 0.2 | 12,938,259 | 1% | 49% | 1% | 39% | 0.69 | 1.83
9 | F | 4-thio-1 | 0.2 | 13,126,453 | 31% | 40% | 58% | 10% | 3.71 | 0.38
10 | F | 4-thio-1-biotin | 0.2 | 15,299,916 | 26% | 43% | 57% | 11% | 3.86 | 0.39
11 | F | - (+UV) | 0 | 9,596,743 | 18% | 46% | 73% | 11% | 4.39 | 0.41
12 | F | - (-UV) | 0 | 10,898,260 | 13% | 51% | 70% | 10% | 4.61 | 0.39

Table 3. Abbreviations used in this chapter

Abbreviation | Meaning
4-thio-T | 4-thiothymidine
4-thio-U | 4-thiouridine
LNA | Locked Nucleic Acid
ASO | Antisense Oligonucleotide
RTTS | Reverse Transcription Termination Site


11.4 Paper 4: The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)


The search for functional RNA secondary structures within 3’ untranslated regions by enzymatic probing of liver transcripts from multiple species (FragSeq2)

Abstract

3’ untranslated regions of mRNA molecules are multifunctional platforms for posttranscriptional regulation of gene expression. Some of the regulatory mechanisms depend on the RNA folding into a specific secondary structure. Moreover, functional structural elements are often evolutionarily conserved. To investigate such structures, we have set up a massively parallel sequencing method based on detection of the cleavage sites of structure-specific nucleases (P1 and V1), combined with a novel normalization scheme, and applied it to human, dog and mouse liver transcripts. Our results are highly reproducible and largely agree with the known, conserved structures of selenocysteine insertion sequences. Additionally, an extra step of enzymatic polyadenylation before probing makes it possible to obtain data for the 3’ regions of other RNA classes, as exemplified by the clear structural signal seen for the U1 spliceosomal RNA. The presented results validate the applicability of the method for transcriptome-wide structure probing of 3’ regions and are a starting point for the ongoing search for functional structures.

Introduction

Messenger RNA (mRNA) molecules can be functionally divided into the 5’ cap, 5’ untranslated region (5’ UTR), coding region, 3’ untranslated region (3’ UTR) and poly(A) tail. The 3’ UTR is bordered on its 5’ end by the stop codon and on its 3’ end by the first adenosine of the poly(A) tail. Analysis of the RefSeq annotation (Pruitt et al., 2009) shows that the median length of 3’ UTRs in mouse and human is 787 and 866 nt, respectively. Interestingly, 3’ UTRs have a lower GC content than 5’ UTRs or coding regions in vertebrates (Zhang et al., 2004), but are nevertheless highly structured (Wan et al., 2014) and show the strongest enrichment for putative functional structures (Washietl et al., 2007). 3’ UTRs are rarely spliced, which is an adaptation to regulation by nonsense-mediated decay (Mignone et al., 2002), but their length may vary due to the use of alternative polyadenylation signals (Mayr and Bartel, 2009).

Since 3’ UTRs are not required to code for functional proteins (except for selenocysteine (Seeher et al., 2012)), they form a flexible platform for the emergence of regulatory features in the course of evolution. The encoded regulatory elements can affect protein expression via changes in mRNA stability or translatability, or can affect the cellular localization of the RNA.

One of the modes of posttranscriptional gene expression regulation is microRNA (miRNA) mediated

gene silencing (Bartel, 2009), which is especially efficient if the target site is localized within 3’ UTR, thus

avoiding interactions with ribosomes (Grimson et al., 2007). To understand the mRNA‐miRNA


interactions it is crucial to investigate the secondary structure of 3’UTRs, as the target accessibility for

base‐pairing is an important determinant of silencing efficiency (Kertesz et al., 2007; Wan et al., 2014).

A study analyzing proteome occupancy on mRNA molecules showed that large portions of 3’ UTRs interact with proteins (Baltz et al., 2012). The bound proteins modulate mRNA stability (Ray et al., 2013), translational efficiency (Morita et al., 2012) or mRNA localization (Jambhekar and Derisi, 2007). Although some protein-RNA interactions are based on the primary sequence, for others it is the RNA structure that is the determinant (Lunde et al., 2007). An especially interesting regulatory switch has been observed in the 3’ UTR of the p27 gene, where binding of the RNA-binding protein PUM1 modulates the RNA structure, allowing specific miRNAs to access the target site and downregulate p27 expression (Kedde et al., 2010), effectively creating an AND logic gate.

In recent years, several attempts at RNA secondary structure prediction on the transcriptome-wide scale, employing different approaches, have been published. Computational prediction strategies included finding minimum free energy structures with local folding (Hofacker et al., 2004), utilizing evolutionary conservation of the structure (Pedersen et al., 2006), or combining conservation with thermodynamic stability (Washietl et al., 2005). The advent of high-throughput sequencing allowed probing of complex mixtures of RNA molecules, first in vitro (Kertesz et al., 2010; Underwood et al., 2010) and more recently in vivo (Ding et al., 2013; Rouskin et al., 2013).

Here we present an approach for probing the secondary structure of the 3’ regions of in vitro folded liver-specific mRNA molecules from three species: mouse, dog and human. We believe that enzymatic probing of RNA coupled with an analysis of conservation will allow functional structures located in 3’ UTRs to be found.

Materials

Input RNA

mouse liver total RNA (Zyagen),

dog liver total RNA (Zyagen),

human liver total RNA (Ambion, AM7960),

ERCC RNA Spike-In Mix (Ambion, 4456740),

in vitro transcribed spike-in structured RNA mix (using equal weights of different RNA molecules as determined by UV spectrophotometry; molecules synthesized by Line Dahl Poulsen; sequences in Table 1)

Kits and reagents

Ribo-Zero™ Magnetic Kit (Human/Mouse/Rat) (Epicentre)

Poly(A)Purist™ MAG Kit (Ambion)

Agencourt RNAClean XP (Beckman Coulter)

Ampure XP (Beckman Coulter)

Poly(A) Polymerase, Yeast 600 U/µl and 5x buffer (USB)

Calf Intestinal Alkaline Phosphatase (CIAP) 20 U/µl (USB)


NEBuffer 3 (New England Biolabs)

Nuclease Stop Buffer (NSB) (380 mM NaOAc pH 5.2, 10 mM EDTA)

5x fragmentation buffer (250 mM Tris‐HCl pH 8, 25 mM MgCl2)

T4 Polynucleotide Kinase (T4 PNK) and buffer (New England Biolabs)

P1 nuclease dilution buffer (50% glycerol, 50 mM Tris‐HCl pH 7.5, 100 µM Zn(OAc)2)

V1 nuclease dilution buffer (50% glycerol, 10 mM Tris‐HCl pH 7.5, 200 mM KCl)

5x P1 buffer (250 mM Tris‐HCl pH 7.5, 750 mM NaCl, 25 mM MgCl2, 50 µM Zn(OAc)2)

5x V1 buffer (250 mM Tris‐HCl pH 7.5, 750 mM NaCl, 25 mM MgCl2)

T4 RNA Ligase 10 U/µl (Fermentas)

5x T4 RNA Ligase Buffer (250 mM Tris-HCl pH 7.6, 50 mM MgCl2, 50 mM DTT, 5 mM ATP) – as Fermentas T4 RNA Ligase buffer.

BSA 10 mg/ml (New England Biolabs)

PrimeScript enzyme and 5x buffer (Takara)

Phusion® High‐Fidelity DNA Polymerase and 5x HF Buffer (New England Biolabs)

100 bp DNA Ladder (New England Biolabs)

E‐Gel® SizeSelect™ 2% Gel (Invitrogen)

DNA purification spin columns (Zymo Research)

QIAquick PCR Purification Kit (Qiagen)

High Sensitivity DNA Kit (Agilent)

DNA 1000 Kit (Agilent)

Oligonucleotides listed in Table 2

Methods

Probing reagents calibration

50 µl solutions containing 0.5 ng/µl of fhlA220 and 0.5 ng/µl of Spot42 RNA fragments in 1x P1 or 1x V1 buffer were allowed to fold (55°C, 5 min; 37°C, 10 min), supplemented with 0.5 µl of the appropriate enzyme dilution and incubated at 37°C for 30 min, followed by addition of 150 µl of the nuclease stop buffer, phenol-chloroform extraction (with a double volume of phenol) and ethanol precipitation. For comparison, the RNA was fragmented in 1x fragmentation buffer at 95°C for 90 sec or 10 min, or incubated in 10 mM Tris-HCl on ice, followed by addition of 150 µl of the nuclease stop buffer and ethanol precipitation. The pellets were dissolved in 7 µl of 50 mM Tris-HCl pH 7 and analyzed on a Bioanalyzer RNA Pico chip.

Sequencing library preparation

Sequencing libraries were prepared in two rounds with slight differences. In the first round (prep. 1)

mouse liver poly(A) and dog liver poly(A) fractions were probed, in the second round (prep. 2) mouse

liver ribosome depleted and human liver poly(A) fractions were probed.

Poly(A) enrichment of mouse, dog and human liver total RNA was performed with Poly(A) Purist MAG

kit according to the manufacturer’s recommendations.


Depletion of ribosomal RNA from mouse liver total RNA was performed using RiboZero MAG kit

according to the manufacturer’s recommendations.

RNAClean XP and Ampure XP purifications were performed as described in (Kielpinski et al., 2013)

unless stated otherwise.

Phenol‐chloroform extractions were performed by addition of 1 volume of phenol pH 8 to 1 volume of

extracted liquid, vigorous shaking, transfer of aqueous phase to a new tube, addition of 1 volume of

chloroform, shaking and transfer of aqueous phase to a new tube.

Ethanol precipitations were performed by addition of 2.5 volumes of absolute ethanol (ice cold) to

1 volume of salt‐containing nucleic acid solution, incubation at ‐20°C overnight or at ‐80°C for 30 min,

centrifugation at 14000g for 30 min, removing the supernatant, washing the pellet in 1 ml of 70%

ethanol, short centrifugation at 14000g, removing the supernatant, air drying until no visible liquid is left

and dissolving in H2O.

Polyadenylation of 600 ng of structured spike‐in RNA molecules and of two batches of 432 ng of

ribosome depleted mouse liver RNA was performed in the presence of 0.5 mM ATP, 1x polyadenylation

buffer and 24 U/µl enzyme in the volume of 23.24 µl for 20 min at 37°C followed by RNAClean XP

purification with 15 minutes incubation and elution in 15 µl H2O, which resulted in concentrations of

35.1 ng/µl and 36.2 ng/µl (mouse liver ribozero fraction) and 41 ng/µl (structured spike-ins).

RNA dephosphorylation was performed with 1 U/µl of CIAP enzyme in 1x NEBuffer 3. Reactions

contained 660 ng RNA in total volume of 20 µl (prep. 1) or 770 ng RNA supplemented with 7.7 ng

polyadenylated structured spike‐ins and 7.7 µl of 10x diluted ERCC spike‐ins in total volume of 44 µl

(prep. 2). The reactions were incubated 50 min at 37°C followed by 10 min at 50°C followed by addition

of the NSB buffer (265 µl for 20 µl reactions, 283 µl for 44 µl reactions) and 5 mg/ml glycogen (12 µl and

14 µl), phenol‐chloroform extraction, collection of supernatant (250 µl and 292 µl), ethanol precipitation

and dissolving pellet in H2O (55 µl and 65.9 µl).

Structure probing was performed in PCR tubes in 6 different conditions using 10 µl of dephosphorylated

RNA for each. Different conditions are named throughout the chapter as: no treatment (NONE),

magnesium fragmentation, P1, P1/5, V1 and V1/5 (the V1/5 was employed only in prep. 2) and were

applied according to:

NONE – addition of 90 µl H2O and incubation on ice.

Magnesium fragmentation – addition of 2 µl H2O and 3 µl 5x fragmentation buffer, 90 sec incubation at 95°C, transfer to ice, addition of 2 µl 10x T4 PNK buffer, 2 µl of 10 mM ATP and 1 µl T4 PNK enzyme, and incubation for 30 min at 37°C followed by addition of 80 µl NSB. (Degradation of total RNA at the above ion concentrations in 50% formamide at 95°C resulted in a cleavage probability of 0.001/bond/minute.)

P1, V1, P1/5 and V1/5 probing – addition of 70 µl H2O and 20 µl of 5x P1 or 5x V1 buffer followed by 5 min incubation at 55°C and 10 min at 37°C. While the tubes were held in the thermocycler, 1 µl of the respective enzyme diluted in its dilution buffer was added (5 ng/µl of P1 for P1 probing, 1 ng/µl of P1 for P1/5 probing, 0.01 U/µl of V1 for V1 probing and 0.002 U/µl of V1 for V1/5 probing), followed by 30 min incubation at 37°C, transfer of the reaction to 300 µl of ice-cold NSB and immediate phenol-chloroform extraction with an extra volume of organic solvent (800 µl), ethanol precipitation and dissolving the pellet in 4 µl H2O.

Linker ligation was performed by adding 1 µl of 100 µM phosphoseqADAPT oligonucleotide to 4 µl of

probed RNA, heat denaturation (65°C for 5 min followed by transfer on ice) (Addo‐Quaye et al., 2008)

and adding master mix to final volume of 20 µl and concentrations of 1x T4 RNA Ligase buffer, 0.1 mg/ml

BSA, 10% DMSO, 0.5 U/µl T4 RNA Ligase, incubation at 37°C for 2 hours, purification with RNAClean XP

with a final elution in 20 µl H2O.

Reverse transcription was performed by mixing 9.5 µl of linker‐ligated RNA with 0.5 µl 10 µM

Adapter_oligo_dT oligonucleotide, incubation at 65°C for 5 minutes followed by transfer on ice, adding

9 µl of master mix composed of 4 µl 5x PrimeScript buffer, 4 µl of 2.5 mM dNTP and 1 µl H2O, incubation

at 42°C for 5 minutes, addition of 1 µl of PrimeScript enzyme and incubation at 42°C for 60 min followed

by 15 min at 72°C and purification with RNAClean XP beads with final elution in 20 µl H2O.

PCR amplification of cDNA was performed on 9.5 µl of purified cDNA in the total volume of 20 µl with

the concentrations of 0.6 µM multi1_short oligonucleotide, 0.5 µM respective INDEX#_long

oligonucleotide (see Table 3 for sample‐primer pairs), 1x Phusion HF buffer, 0.2 mM dNTP and 0.04 U/µl

Phusion polymerase. Thermocycling conditions were as follows: (3 min, 98°C)x1, (80 sec, 98°C; 15 sec,

64°C; 30 sec, 72°C)x4, (80 sec, 98°C; 45 sec, 72°C)x[15 for prep. 1; 18 for prep. 2], (5 min, 72°C)x1. PCR

reactions were assessed with 2% agarose electrophoresis.

Post‐PCR treatment of the prep. 1 started with adding 10 µl from each PCR reaction to one tube

containing 20 µl 50 mM EDTA followed by Ampure XP purification with 5 minutes incubation with beads

and elution in 50 µl 10 mM Tris‐HCl pH 8 which contained 53.8 ng/µl DNA (Nanodrop). Post‐PCR

treatment of the prep. 2 started with quantification of the amount of DNA in the 200‐600 bp range

(Bioanalyzer DNA 1000 Kit) and mixing the equimolar amount from each reaction (1/5 for NONE

treatment; total volume 65.7 µl) to one tube containing 13.14 µl 50 mM EDTA and subsequent

purification with Ampure XP beads.

Size selection of the combined sequencing libraries was performed on a 2% SizeSelect gel using 100 bp

DNA Ladder and collecting the molecules in the size range of 200‐500 bp (prep. 1) or 200‐600 bp (prep.

2) with collection and buffer replacement in the lower well every 20 sec. Collected DNA in the running

buffer was bound to DNA purification columns (Zymo Research for prep. 1, QIAGEN for prep. 2) and

eluted in 30 µl of 10 mM Tris‐HCl pH 8 (prep. 1) or pH 8.5 (prep. 2).

The size distribution of libraries was checked with a Bioanalyzer High Sensitivity (prep. 1) or DNA 1000

(prep. 2) kits and the libraries were sent for single read sequencing on Illumina HiSeq platform

multiplexed with another sample from the laboratory (Jakob Rukov’s sample containing a low

complexity library).


The library size distribution for the V1 treatment differed considerably between the two rounds (prep. 1 vs. prep. 2), with the RNA being much more digested in the second round. The probable cause is that the diluted enzyme was stored before probing in the first round, which is known to decrease the enzyme activity (Lowman and Draper, 1986), but not before the second probing (speculation: the time of dilution preparation was not noted). For the analysis, only the V1 treatment from the first round and the V1/5 treatment from the second round were used, and both are referred to as V1 treatments.

Data analysis

Reads processing

Initial processing

Sequencing results were obtained as demultiplexed FASTQ files (Cock et al., 2010). The first two nucleotides were trimmed from the sequences and quality scores using an awk script in order to remove the random nucleotides introduced during ligation. Reads matching the multiplexed low-complexity library were filtered out.
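The trimming step can be sketched in Python as follows. This is an illustrative equivalent, not the awk script used in the study; the function name is hypothetical and the sketch assumes standard four-line FASTQ records.

```python
def trim_fastq_prefix(lines, n=2):
    """Yield FASTQ lines with the first n bases and quality scores removed.

    Assumes four-line records: header, sequence, '+' separator, qualities.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):      # sequence line or quality line
            yield line[n:]
        else:                    # header or separator line, left unchanged
            yield line


# Toy record, made up for illustration.
record = ["@read1", "NNACGT", "+", "IIFFFF"]
trimmed = list(trim_fastq_prefix(record))
```

Trimming the sequence and the quality string by the same amount keeps the two lines at equal length, which downstream tools require.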

Reference sequences for mapping

Reads were mapped to the ENSEMBL transcripts sequences associated with genome assemblies

canFam3 (dog), hg19 (human) and mm9 (mouse) combined with the spike‐in sequences and, only for

dog mapping, with the NM_001115118.1 transcript (the Sepp1 transcript, for which the ENSEMBL annotation

does not agree with our observations).

Mapping the reverse transcription priming sites

For each RNA sample we used the magnesium fragmentation treatment to find the reverse transcription priming sites. Using the Cutadapt utility (Martin, 2011), reads whose right end had at least a 12 nucleotide match (allowing a 10% error rate) to the 5’ end of the Adapter_oligo_dT were retained, trimmed of any remaining “A” nucleotides at the right end with an awk script, and mapped to the transcripts using Bowtie (Langmead et al., 2009) with options “-a --norc --best --strata -S”. The number of primary alignment ends (understood as the first non-A nucleotide from the left side of the read) mapping to each transcript location was then counted and reported (awk script). For each gene, the one transcript isoform with the highest count of mapped priming sites was retained in the index for subsequent mappings.
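The trimming of residual “A” nucleotides and the counting of alignment ends can be sketched as below. The function names are hypothetical and this is not the awk code used in the study. The sketch assumes ungapped, forward-strand alignments in SAM format, so that the last aligned position equals POS plus the read length minus one; taking this position as the priming site is our reading of the text, not a confirmed detail.

```python
from collections import Counter


def strip_trailing_a(seq, qual):
    """Remove any run of A's at the right end of a read and its qualities."""
    trimmed = seq.rstrip("A")
    return trimmed, qual[:len(trimmed)]


def count_alignment_ends(sam_lines):
    """Count reads by (transcript, 1-based position of the last aligned base).

    Assumes ungapped forward-strand alignments, so end = POS + len(SEQ) - 1.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 4:            # FLAG bit 4: read is unmapped
            continue
        transcript, pos, seq = fields[2], int(fields[3]), fields[9]
        counts[(transcript, pos + len(seq) - 1)] += 1
    return counts


# Toy SAM records, made up for illustration.
sam = [
    "@HD\tVN:1.0",
    "r1\t0\ttxA\t10\t255\t5M\t*\t0\t0\tACGTC\tIIIII",
    "r2\t0\ttxA\t12\t255\t3M\t*\t0\t0\tGTC\tIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
ends = count_alignment_ends(sam)
```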

Mapping cleavage sites

For each RNA sample and each treatment, the RT adapter sequence was trimmed from the reads using Cutadapt with options “-a AAAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCT -m 25”. The magnesium fragmentation sample reads were then mapped with Bowtie (options “--norc -S -a --best --strata --chunkmbs 512”), and the count of mapped read ends at each transcript position was summed and reported. If multiple transcripts per gene received mappings, only the transcript with the highest count was retained (in the event of equal counts, the longest isoform was kept). The retained transcript sequences were used to construct a new Bowtie index, and all treatments from a given RNA sample were mapped to this index with options “--norc -S -a”; the count of mapped read ends at each transcript position was summed and reported.
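The per-gene transcript selection, including the tie-break on isoform length, can be sketched as follows. The function name and the input dictionaries are hypothetical; the sketch assumes that per-position read-end counts, a transcript-to-gene map and transcript lengths are already available.

```python
from collections import Counter


def best_transcript_per_gene(position_counts, gene_of, length_of):
    """Keep one transcript per gene: highest total count, ties to the longest.

    position_counts: {(transcript, position): count}
    gene_of:         {transcript: gene}
    length_of:       {transcript: length in nt}
    """
    totals = Counter()
    for (transcript, _pos), n in position_counts.items():
        totals[transcript] += n
    best = {}
    for transcript, total in totals.items():
        gene = gene_of[transcript]
        key = (total, length_of[transcript])   # tuple compare breaks ties
        if gene not in best or key > best[gene][0]:
            best[gene] = (key, transcript)
    return {gene: transcript for gene, (_key, transcript) in best.items()}


# Toy input, made up for illustration: txA and txB tie with 3 counts each.
counts = {("txA", 10): 2, ("txA", 40): 1, ("txB", 7): 3, ("txC", 5): 1}
gene_of = {"txA": "g1", "txB": "g1", "txC": "g2"}
length_of = {"txA": 900, "txB": 1500, "txC": 700}
kept = best_transcript_per_gene(counts, gene_of, length_of)
```

In the toy input the tie within gene g1 is resolved in favour of txB because it is the longer isoform.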

Normalization

Size-selection correctors estimation

To start the normalization procedure, the average distribution of structure-related read end counts relative to the priming positions is defined from a selected, “clean” set of priming sites. To define the “clean” set, read in the priming positions from the magnesium-treated sample and, for each priming position, define the region RU spanning from 600 to 25 nucleotides upstream of the priming site and the region RD spanning from 25 to 600 nucleotides downstream. To avoid the impact of interfering priming sites, keep only those priming sites for which the sum of priming counts in the RD and RU regions is lower than 1/100 of the count at the priming site in question. Next, discard the priming sites that lie closer than 600 nucleotides to the transcript 5’ end. Define clusters of the remaining priming sites that are located between each other’s RD and RU regions (that is, less than 25 nucleotides apart), and for each cluster calculate the number of mapped magnesium fragmentation cleavage counts in the merged RU region per priming count. Calculate the median of these ratios (excluding zeros) and discard all clusters that give rise to fewer cleavage sites than expected from the median ratio.
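The first two filters of the “clean” set definition can be sketched in Python as below. The sketch assumes that the priming counts of RU and RD are summed together before comparison with the 1/100 threshold and that the transcript 5’ end is at position 1; the function name and the example counts are hypothetical. Clustering and the median-ratio filter are not shown.

```python
def clean_priming_sites(priming_counts, five_prime_end=1,
                        near=25, far=600, max_fraction=0.01):
    """Return priming positions on one transcript that pass both filters.

    priming_counts maps a transcript position to its priming read count.
    """
    clean = []
    for site, count in priming_counts.items():
        # Priming counts of other sites lying 25-600 nt up- or downstream.
        neighbours = sum(
            c for pos, c in priming_counts.items()
            if near <= abs(pos - site) <= far
        )
        if neighbours >= count * max_fraction:
            continue  # an interfering priming site is nearby
        if site - five_prime_end < far:
            continue  # transcript 5' end is closer than 600 nt
        clean.append(site)
    return sorted(clean)


# Hypothetical priming counts on a single transcript.
example = {300: 1000, 1000: 500, 1300: 2, 2500: 300, 2600: 50}
kept = clean_priming_sites(example)
```

In this example only the site at position 1000 survives: the sites at 1300, 2500 and 2600 have strong neighbours, and the site at 300 is too close to the 5’ end.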

To create the average distribution of cleavage sites relative to the priming sites for each probing condition, proceed as follows. For each priming cluster (taken from the set of clean priming sites defined above), split the cleavage mappings among the cluster members in proportion to their priming site counts. At this point, the count of cleavage ends at each distance is known for each priming site. Then, for each priming site, divide all the counts by the square root of the sum of the counts (to prevent highly expressed sites from dominating the final distribution) and add the scaled counts to a table that holds, for each distance from the priming site, the sum over all clean priming sites. Additionally, an average distribution was calculated from the frequency of read lengths after trimming the poly(A) end or the reverse transcription primer adapter. The final average distribution was joined from the distribution estimated from the trimmed read lengths (up to nucleotide 65) and from the cluster-based distribution for positions further than 65 nt from the priming site (Figure 4A).
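The square-root scaling used when pooling priming sites can be sketched as follows. The data layout (one dictionary of distance to count per priming site) and the function name are assumptions made for illustration.

```python
import math
from collections import defaultdict


def average_distribution(per_site_counts):
    """Pool cleavage counts over priming sites with square-root scaling.

    per_site_counts is a list with one dict per priming site, mapping the
    distance from the priming site to the cleavage count at that distance.
    """
    table = defaultdict(float)
    for counts in per_site_counts:
        total = sum(counts.values())
        if total == 0:
            continue  # a site without cleavages contributes nothing
        scale = math.sqrt(total)
        for distance, count in counts.items():
            table[distance] += count / scale
    return dict(table)


# Hypothetical example: a highly covered site and a weakly covered site.
sites = [{100: 9, 200: 7}, {100: 2, 200: 2}]
pooled = average_distribution(sites)
```

With this scaling the first site (16 counts) contributes twice as much as the second (4 counts) rather than four times as much.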

Next, fit an exponential decay to each sample’s average distribution (for all treatments) in the region between 120 and 300 nt from the priming site. Then, for the magnesium fragmentation sample, extrapolate the exponential decay curve to all points in the average distribution (positions 25 to 600, Figure 4A), divide the extrapolated values by the observed values of the average distribution and smooth the quotients with the R loess function (Chambers and Hastie, 1992) to obtain the size-selection correctors for positions from 25 to “peak” (Figure 4B). The “peak” was manually assigned at a distance of 100 nt from the priming site for the samples from the first preparation (mouse and dog poly(A)) and at 80 nt for the second preparation (mouse ribozero, human poly(A)). In total, two sets of final correctors were calculated, one for preparation 1 and one for preparation 2; the final correctors were the average of the correctors from the two magnesium fragmentation samples in each preparation.
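A minimal sketch of the corrector estimation is given below. It assumes that the exponential decay is fitted by ordinary least squares on log-transformed values, which may differ from the fitting routine actually used, and it omits the loess smoothing step. The function names and the synthetic distribution are hypothetical.

```python
import math


def fit_exponential(distribution, lo=120, hi=300):
    """Fit y = a * exp(-b * x) on positions lo..hi via a log-linear fit."""
    xs = [x for x in range(lo, hi + 1) if distribution.get(x, 0) > 0]
    ys = [math.log(distribution[x]) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


def raw_correctors(distribution, a, b, start=25, peak=100):
    """Extrapolated fit divided by the observed value, positions start..peak."""
    return {x: a * math.exp(-b * x) / distribution[x]
            for x in range(start, peak + 1) if distribution.get(x, 0) > 0}


# Synthetic distribution: exponential decay, depleted below 100 nt to mimic
# the loss of short amplicons during size selection.
observed = {x: 1000 * math.exp(-0.01 * x) * min(1.0, x / 100)
            for x in range(25, 601)}
a, b = fit_exponential(observed)
correctors = raw_correctors(observed, a, b)
```

In the synthetic example the corrector at 50 nt is close to 2, because only half of the expected signal is observed at that distance.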

Modeling the number of cDNA molecules reaching a given nucleotide

i. Consider only those parts of transcripts that are within the RU of any priming site present in the magnesium fragmentation treatment for a given sample.

ii. Decompose RNA cleavage sites between priming sites.

From each priming site, project decomposition factors onto each position of its RU region. To calculate the decomposition factor, first divide the values of the exponential fit to the average distribution of a given treatment by the size-selection correctors at positions between 25 and “peak”, then multiply the resulting values by the number of reads mapped to the given priming site. Then, split each mapped cleavage site count between the different priming sites using the decomposition factors as weights. Finally, multiply the values assigned to a given priming site in the region 25 to “peak” by the appropriate size-selection corrector.

iii. Estimate the number of cDNA molecules reaching each position.

At this step, the number of mapped cleavage sites assigned to each priming site is known. Here we incorporate normalization based on the QuShape procedure (Karabiber et al., 2013). The cumulative sum of the cleavage site counts was calculated starting from the site most distal from the priming site, and the calculated numbers of reaching cDNA molecules in the region 25 to “peak” were then divided by the size-selection correctors. The scores for each transcript location from the different priming sites were summed. The obtained values are the modeled numbers of cDNA molecules reaching each location.
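Steps ii and iii can be sketched for a single cleavage count and a single priming site as follows. The sketch assumes that a cDNA terminating at a position is itself counted as reaching that position; the function names, weights and counts are hypothetical.

```python
def split_count(count, weights):
    """Split one cleavage count between priming sites by their weights."""
    total = sum(weights.values())
    return {site: count * weight / total for site, weight in weights.items()}


def reaching_cdna(counts_by_distance, correctors):
    """Cumulative sum from the most distal position towards the priming site.

    counts_by_distance maps distance from the priming site to the cleavage
    count assigned to that site; correctors holds size-selection correctors
    for distances between 25 and "peak" (missing distances use 1.0).
    """
    reaching = {}
    running = 0.0
    for distance in sorted(counts_by_distance, reverse=True):
        running += counts_by_distance[distance]
        reaching[distance] = running / correctors.get(distance, 1.0)
    return reaching


# Hypothetical decomposition factors for two priming sites.
shares = split_count(12.0, {"siteA": 3.0, "siteB": 1.0})

# Hypothetical counts assigned to one priming site.
coverage = reaching_cdna({30: 10.0, 150: 20.0, 400: 10.0}, {30: 4.0})
```

Summing the `reaching_cdna` output over all priming sites of a transcript would then give the modeled coverage at each location.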

The FragSeq 2.0 values were reported in a 4-column format: the 1st column is the transcript ENSEMBL identifier; the 2nd, comma-separated transcript positions; the 3rd, comma-separated cleavage counts at the positions given in the 2nd column; and the 4th, the comma-separated modeled numbers of cDNA molecules reaching the positions given in the 2nd column. The reported positions are the positions one nucleotide before the first sequenced nucleotide, which is consistent with nucleases P1 and V1 cutting 3’ of a single-stranded or stacked nucleotide, respectively (Kertesz et al., 2010; Underwood et al., 2010).
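A small parser for the 4-column format might look as follows. The column separator is not stated in the text; the sketch assumes tab-separated columns, and the function name and the example line (including the transcript identifier) are hypothetical.

```python
def parse_fragseq_line(line):
    """Parse one line of the 4-column report (tab-separated, by assumption)."""
    transcript, positions, cleavages, reaching = line.rstrip("\n").split("\t")
    return {
        "transcript": transcript,
        "positions": [int(p) for p in positions.split(",")],
        "cleavage_counts": [float(c) for c in cleavages.split(",")],
        "reaching_cdna": [float(r) for r in reaching.split(",")],
    }


# Hypothetical example line with two reported positions.
record = parse_fragseq_line("ENSMUST00000000001\t101,102\t3,0\t50.5,47.5\n")
```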

RNA secondary structure modeling

Random sequences were generated with a Python script; computational prediction of RNA secondary structure and calculation of sensitivity and positive predictive value were performed with RNAstructure version 5.4 (Reuter and Mathews, 2010). Secondary structure visualization was accomplished with VARNA version 3.9 (Darty et al., 2009).

Results

Library preparation and sequencing

Libraries in the FragSeq2 experiments were prepared in two runs, each consisting of probing two RNA samples under 5 different conditions (Figure 1A) by following the protocol depicted in Figure 1B. The first sequencing round included mouse and dog liver poly(A) RNA; the second included mouse liver ribosome-depleted RNA (enzymatically polyadenylated to create a priming site) and the human liver poly(A) fraction, both probed in the presence of spike-in molecules. Our chosen probing reagents, the single-strand-specific nuclease P1 (Romier et al., 1998) and the stacked-base-specific nuclease V1 (Ziehler and Engelke, 2001), leave 5’ phosphates on the cleaved RNA. These 5’ phosphates were used as handles to enrich for the ends created by the enzymes, by ligating the first adapter to them after probing. Moreover, dephosphorylation was performed before probing in order to remove endogenous 5’ phosphate residues that would otherwise have been ligated to the adapter.

Figure 1. Probed samples and experimental workflow. (A) The experiment was performed in two rounds, each time with RNA from two species and with 5 different probing conditions (P1d5 means P1 with 5x lower concentration). (B) Dephosphorylated RNA was probed with a structure specific endonuclease leaving 5’ phosphate to which the first adapter was ligated. The RNA was subsequently used as a template for reverse transcription with oligo‐dT‐adapter primer, cDNA was amplified with PCR introducing sample specific index, samples were pooled, size selected and sequenced on Illumina HiSeq2000 with 100 nt single‐read protocol.

Apart from being probed with two different concentrations of P1 nuclease and one concentration of V1 nuclease, samples were fragmented at elevated temperature with magnesium ions to a cleavage extent comparable to that in the probed samples, and were then 5’ phosphorylated to allow adapter ligation (hydrolysis with metal ions leaves a 5’ hydroxyl group (Forconi and Herschlag, 2009); note that this procedure also phosphorylates the endogenous, preexisting breaks). The random fragmentation pattern produced in this way was later used for data normalization and could possibly be applied for ligation bias correction. Lastly, an untreated sample (“None”) was prepared, for which no treatment was performed between the dephosphorylation and linker ligation steps; its results should therefore represent only the experimental noise.

To focus our assay on the 3’ regions of mRNA molecules, we designed an oligo-dT reverse transcription primer harboring the second adapter necessary for sequencing. This primer hybridizes at the beginning of the poly(A) tail and primes reverse transcription, which continues until it reaches the 5’ end of the first ligated Illumina adapter. The obtained cDNA was used as a template for PCR amplification with primers recognizing both adapters, creating a setup that highly enriches for nucleic acids primed by our introduced primer (as opposed to unspecific priming) and terminated at the adapter bound to the nuclease cleavage site (as opposed to background terminations). Amplified libraries (Figure 2) contained information on (1) the position of the first Illumina adapter ligation site, which corresponds to the nucleolytic cleavage of the RNA, and (2) the position of the priming site. We sequenced only one end of the construct, reading out information (1). Sequencing statistics are presented in Table 3.

[Figure 1 graphic. Panel A, 1st sequencing: mouse liver poly(A) RNA and dog liver poly(A) RNA; 2nd sequencing: mouse liver ribozero fraction (in vitro polyadenylated) and human liver poly(A) RNA, both including spike-ins; treatments: P1, V1, P1d5, Mg2+, None. Panel B, workflow: P1 cleavage, adapter ligation to 5’ phosphate, reverse transcription with the oligo-dT primer (drawn as NVTTTTTTTTTTTTTTTTTTT), PCR adding the index for sample identification, Illumina HiSeq2000 sequencing.]

Figure 2. Agarose electrophoresis of PCR amplified sequencing libraries. (A) Samples prepared for the 1st sequencing, (B) samples prepared for the 2nd sequencing.

Normalization

The oligo-dT priming strategy used in the experiments implies that the signal density over a transcript decays with increasing distance from the poly(A) tail. This made it necessary to normalize the data in order to compare nucleotides located at different positions within a transcript. To interpret the detected read end count at a given location in terms of probing efficiency, we needed to estimate the number of cDNA molecules that reached that location, in other words, what the observed count would be given 100% probing efficiency at that site. Our estimation was based on the QuShape procedure (Karabiber et al., 2013), in which the observed signal at a given position is divided by the summed signal of the cDNA molecules that passed this position. In our experiment it was not straightforward to apply this method because (1) multiple priming sites exist on a given transcript and the data do not show at which priming site a given read originated (as opposed to the situation described in Paper 2, where paired-end sequencing was used), and (2) size selection of the libraries against short amplicons means that the maximal possible count for short amplicons is lower than expected from the sum of cDNA molecules that passed this location. When searching for ways of normalizing the data, we found that some of the reads bear a poly(A) stretch or part of the reverse transcription adapter sequence at their end, enabling us to define priming sites. The priming sites found are, as expected, predominantly located at the borders between the 3’UTR and the poly(A) tail (Figure 3). Moreover, thanks to the random fragmentation sample, different priming sites can be quantitatively compared (an assumption) by counting the number of reads with which a given site was detected. After defining priming sites, we estimated the average density of read ends relative to priming sites (Figure 4A). Those two values, the strength of priming at a given site and the value of the average distribution at a given distance from the priming site, were used to decompose the structure-derived read end counts between priming sites located on the same transcript and to perform the QuShape-like normalization for each priming site separately, using size-selection correctors for positions close to the priming site. Finally, the modeled numbers of cDNA molecules reaching a given nucleotide from the different priming sites were summed and reported.

Figure 3. FragSeq2 priming occurs predominantly on polyadenylation sites. Detected number of priming events for mouse liver poly(A) sample displayed in the UCSC Genome Browser for (A) ApoB 3’ terminus (two overlapping poly(A) signals AAUAAA are underlined with red, dashed lines) and (B) serum albumin precursor (Alb) transcript.

Figure 4. Data normalization. (A) The average distribution of read ends from priming sites (black, solid line) with an exponential fit curve (red, dashed line). (B) The size selection correction values indicating a magnitude of difference between observed average distribution and extrapolated exponential fit.

[Figure 3 graphic, UCSC Genome Browser (mm9). Panel A: Apob, chr12, 20-base scale, positions 8,023,580 to 8,023,660, RefSeq Genes track, “Mapped read ends” track scaled 1 to 1901. Panel B: Alb, chr5, 5 kb scale, positions 90,895,000 to 90,905,000, RefSeq Genes track, “Mapped read ends” track scaled 1 to 39604.]

Spiked-in RNA molecules

After normalizing the data, we wanted to determine the minimal coverage that ensures robust reproducibility. We took advantage of the ERCC spike-in molecules present in both the mouse and the human preparations (2nd sequencing) and calculated the Pearson correlation coefficient for the P1 treatments, taking only positions with coverage above a given threshold (Figure 5A). This analysis showed that, in order to observe high correlation between technical replicates, one needs to apply a coverage cut-off of at least approximately 50 reaching cDNA molecules. The high correlation between the P1 and P1d5 treatments (Figure 5B) shows that the higher enzyme concentration did not lead to a major appearance of secondary cuts. Given that the P1 and V1 nucleases have opposite substrate specificities, we hypothesized that the signals obtained from the two should anti-correlate. We used the same test conditions as for Figure 5A and compared the P1 and V1 treatments (Figure 5C). To our surprise, the anti-correlation is negligible (the minimum in the tested range is approximately -0.04). This can stem from (1) the ERCC molecules not forming stable structures and instead existing as an ensemble of many different structures, a situation that would not favor strong anti-correlation, and/or (2) the general properties of the V1 nuclease, which is known not to be very helpful in finding double-stranded regions, sometimes cleaving close to, rather than within, the double-stranded region (Ziehler and Engelke, 2001). Interestingly, the end count/coverage ratios derived from magnesium fragmentation correlate weakly with both the P1- and V1-derived ratios (Figure 5D, E). Since the fragmentation should not depend on the RNA structure (it was performed at high temperature in a low ionic strength environment), the observed correlation likely stems from library construction biases, such as ligation or PCR biases, which are shared between the different libraries.
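The coverage cut-off analysis can be sketched as follows. The sketch assumes that a position is used only when its modeled coverage exceeds the cut-off in both compared samples; the function names and the example data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_at_cutoff(sample_a, sample_b, cutoff):
    """Correlate end count/coverage ratios at well-covered positions.

    Each sample maps a position to (end_count, coverage). Returns the
    correlation coefficient and the number of positions used.
    """
    shared = [p for p in sample_a
              if p in sample_b
              and sample_a[p][1] > cutoff and sample_b[p][1] > cutoff]
    ratios_a = [sample_a[p][0] / sample_a[p][1] for p in shared]
    ratios_b = [sample_b[p][0] / sample_b[p][1] for p in shared]
    return pearson(ratios_a, ratios_b), len(shared)


# Hypothetical data: position 4 is poorly covered and drops out at cut-off 50.
a = {1: (10, 100), 2: (30, 100), 3: (20, 100), 4: (1, 2)}
b = {1: (20, 200), 2: (60, 200), 3: (40, 200), 4: (2, 2)}
r, used = correlation_at_cutoff(a, b, cutoff=50)
```

Repeating the call over a range of cut-offs would give the two curves of each Figure 5 panel: the correlation coefficient and the number of positions used.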

Figure 5. Correlation between signals from different treatments of spiked‐in ERCC libraries. Pearson correlation coefficient (black continuous line) of the end count/coverage ratio of nucleotides of ERCC RNA molecules with coverage higher than coverage cut‐off (x‐axis, log‐scale) between the compared samples (number of positions used in the calculation indicated by red, dashed line and right y‐axis). Correlation between signal from ERCC spike‐in from (A) P1 mouse and P1 human RNA, (B) P1 with P1d5, (C) P1 with V1, (D) P1 with magnesium fragmentation, (E) V1 with magnesium fragmentation. (B, C, D, E – mouse samples)

[Figure 5 graphic. Panel titles: (A) P1(H) vs P1(M), (B) P1 vs P1d5, (C) P1 vs V1, (D) P1 vs Mg, (E) V1 vs Mg. In each panel the x-axis is the coverage cut-off (log scale, ticks 2 to 4096), the left y-axis is the Pearson correlation coefficient and the right y-axis is the number of positions.]

Apart from the ERCC RNA, the spiked-in RNA contained 8 in vitro transcribed RNA molecules with known, functional structures (Table 1). Most of the structural spike-in molecules were too short to give a signal of satisfactory quality (an effect of size selection), but the longest of them, Escherichia coli transfer-messenger RNA (tmRNA), showed a promising probing signal. The tmRNA signal was analyzed in the same way as that of all the other molecules and is visualized in Figure 6. First, mapping of the reads bearing part of the reverse transcription adapter from the magnesium fragmentation sample informed us about the utilization of priming sites (Figure 6A), and the positions of the ligation-proximal read ends informed us about the structure-related cleavages (Figure 6B for P1 probing). Using those two sets of mappings, the exponential fit to the average distribution of P1 probing and the size-selection corrector values (Figure 4), we modeled the coverage at each position of the molecule (Figure 6C). Finally, dividing the end counts by the coverage yielded the end count/coverage ratio (Figure 6D). The analogous procedure for calculating the end count/coverage ratio was repeated for the P1d5, V1 and magnesium fragmentation treatments (Figure 6E, F, G). As expected, the P1 and P1d5 treatments reveal distinct, high peaks in the single-stranded regions. The V1 treatment also produced distinct peaks, but they are not exclusively located in the double-stranded regions, underscoring the poorly characterized specificity of the enzyme (Ziehler and Engelke, 2001). The magnesium fragmentation pattern produces the most even ratios over the investigated fragment, although the distribution is not as flat as would be expected if there were no bias during fragmentation and library preparation.

Figure 6. Signal distribution over tmRNA spike‐in molecule. (A) Count of read ends containing priming information mapped at a given location, (B) read ends with structure‐related information for P1 probing, (C) modeled coverage, (D) ratio of structure‐related read ends count to coverage for P1, (E) P1d5, (F) V1 and (G) magnesium fragmentation sample. Cyan background indicates unpaired nucleotides in Escherichia coli tmRNA secondary structure (Zwieb et al., 2003).

[Figure 6 graphic. Stacked panels A to G share an x-axis of tmRNA position (ticks 0 to 350). Y-axes: (A) priming counts (Mg), (B) end counts (P1), (C) coverage (P1), (D) Ec/C ratio (P1), (E) Ec/C ratio (P1d5), (F) Ec/C ratio (V1), (G) Ec/C ratio (Mg), where Ec/C is the end count/coverage ratio.]

Selenoprotein P 3’UTR

Selenoprotein P (Sepp1) is a liver-secreted protein engaged in regulating whole-body selenium homeostasis. It is unusual in carrying multiple selenocysteine residues, the incorporation of which is mediated by two conserved stem-loop structures located in the 3’UTR, called selenocysteine insertion sequences (SECIS) (Burk and Hill, 2009). In the FragSeq2 experiment we obtained structural data for both SECIS 1 (closer to the 5’ end) and SECIS 2 (closer to the 3’ end) of the Sepp1 3’UTR (Figure 7A, C). A previously published analysis of SECIS elements defined the apical loop, helix II, the non-Watson-Crick base-paired quartet, the internal loop and helix I as conserved constituents of the structure (Figure 7B, D) (Walczak et al., 1996). Nuclease P1 consistently cleaved the apical loops of both SECIS elements in each of the analyzed species. The internal loop was efficiently recognized by nuclease P1 in the probing of dog SECIS 1, in which this loop is one nucleotide larger than in human or mouse SECIS 1, and it showed some signal in the P1 probing of mouse SECIS 1. The V1 nuclease signal supports the formation of helix II in dog and mouse SECIS 1 as well as in human and mouse SECIS 2. Helix I is supported by each V1 treatment except for human SECIS 1, where the overall signal is very low. Interestingly, the large apical loop in SECIS 1 was described to form a short internal helix (Fletcher et al., 2001) (Figure 7B, dashed lines), which is supported by nuclease V1 cleavages (dog and mouse) and decreased P1 signal (human, dog, mouse) around the probable interacting nucleotides.

Figure 7. Signal distribution over Sepp1 SECIS elements backs conservation of the secondary structure. (A) and (C) An alignment of (A) SECIS 1 and (C) SECIS 2 from human (H), dog (D) and mouse (M) with highlighted in yellow predicted conserved single stranded regions (Walczak et al., 1996) and the end count/coverage ratio from FragSeq2 experiment for P1d5 and V1 treatments for 3 different species. (B) and (D) 2D models of conserved SECIS structure of (B) dog SECIS 1 and (D) human SECIS 2 with arrows indicating nuclease cleavages scaled according to the end count/coverage ratio. Dashed lines between nucleotides 34, 35, 36 and 44, 43, 42 on panel (B) indicate previously described internal helix.

[Figure 7 graphic. Panels A and C show the alignments below together with bar plots of the end count/coverage ratio for P1d5 and V1 in human (H), dog (D) and mouse (M), along alignment positions 10 to 70. Panels B and D label the apical loop, helix II, quartet, internal loop and helix I on the 2D models (numbered 1 to 75 for dog SECIS 1 and 1 to 73 for human SECIS 2), with arrows marking P1d5 and V1 cleavages.]

SECIS 1 alignment (panel A):
H: UUCUAUUUGCUUUAAUGAGAAUAGAAACGUAAACUAUGACCUAGGGGUUUCUGUUGGAUAAUUAGCAGUUUAGAA
D: UUCUACUUGCAUUAAUGAAAACAGAGACAUAAACUAUGACCUAGGGGUUUCUGUUGGAUAGUUAGCAAUUUAGAA
M: UUCUAGUUACAUUAAUGAGAACAGAAACAUAAACUAUGACCUAGGGGUUUCUGUUGGAUAGCUUGUAAUUAAGAA

SECIS 2 alignment (panel C):
H: GUAUUUCCAUAGUCAAUGAUGGUU-UAAUAGGUAAACCAAACCCUAUAAACCUGACCUCCUUUAUGGUUAAUAC
D: GUAUUUCCAUAGUCAAUGAUGGUU-CAAUAGGUAAACUAAGUCCUAUAAACCUGAACUCCUAUAUGGUUAAUAC
M: GUAUUUCCAUAAUCAAUGAUGGUUUCA-UAGAGAAACUAAGUCCUAUGAACCUGACCUCUUUUAUGGCUAAUAC

Non-coding RNA molecules

The repertoire of RNA molecules with functional structures is arguably richer among non-coding RNAs than among polyadenylated mRNA molecules. In order to enrich our dataset for structured RNA molecules, in the second round of mouse sample sequencing we removed the ribosomal RNA (RiboZero) instead of selecting the RNA with oligo-dT coated beads. In this case the remaining RNA consisted not only of mRNA but also of many other classes of RNA molecules. To apply the same method of signal detection as was used for probing the 3’UTRs of mRNA molecules, we needed all of our molecules of interest to bear a poly(A) tail, which we added via in vitro polyadenylation. We hypothesized that the addition of the poly(A) tail would not influence the structure of the molecule, especially if the formed structure is stable. To check this assumption we performed computational folding of 9999 randomly generated 100 nt long sequences with and without an added 100 nt long poly(A) tail, and found that 75% of the predicted structures are identical after adding the poly(A) tail and that the mean sensitivity and mean positive predictive value are 90% and 91%, respectively. This simulation convinced us that polyadenylation before probing can be safely applied.
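The comparison of predicted structures with and without the poly(A) tail can be illustrated with the sketch below. It does not reproduce the original Python script or the RNAstructure scoring program: random sequences are drawn uniformly from A, C, G and U, base pairs are compared by exact match, and the structure predicted without the tail is taken as the reference. All names and the example base pairs are hypothetical.

```python
import random


def random_rna(length, rng):
    """Generate a random RNA sequence with uniform base composition."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def sensitivity_ppv(reference_pairs, predicted_pairs):
    """Sensitivity and positive predictive value from base-pair sets."""
    reference, predicted = set(reference_pairs), set(predicted_pairs)
    shared = len(reference & predicted)
    sensitivity = shared / len(reference) if reference else 1.0
    ppv = shared / len(predicted) if predicted else 1.0
    return sensitivity, ppv


rng = random.Random(0)
sequence = random_rna(100, rng)
tailed = sequence + "A" * 100  # same sequence with a 100 nt poly(A) tail

# Hypothetical base pairs predicted without and with the poly(A) tail.
without_tail = [(1, 20), (2, 19), (3, 18), (4, 17)]
with_tail = [(1, 20), (2, 19), (3, 18), (50, 120)]
sens, ppv = sensitivity_ppv(without_tail, with_tail)
```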

One of the well-characterized non-coding molecules for which we obtained a signal of high quality is the mouse U1 spliceosomal RNA, the structure of which (Figure 8B) we compared with our probing data (Figure 8A). The probing signal from treatments with both P1 concentrations was highly concentrated in the predicted loops, and the signal from V1 probing was located mainly in the helical regions, validating our method. Due to the included size selection step, there are no data for approximately the last 30 nt of the molecule. It is worth noting that our normalization scheme likely underestimates the coverage over short RNA molecules such as U1, because cDNA molecules reaching the RNA 5’ termini are not counted (endogenously present 5’ termini are not substrates for ligation in our setup).

Figure 8. U1 spliceosomal RNA. (A) The end count/coverage ratio for nucleotides of mouse U1 spliceosomal RNA for different probing conditions. Cyan background indicates unpaired nucleotides according to the model shown in panel (B). (B) U1 secondary structure model proposed in (Underwood et al., 2010) with the FragSeq2 nuclease cleavage data indicated by arrows.

[Figure 8 graphic. Panel A: stacked plots of the Ec/C (end count/coverage) ratio for P1, P1d5, V1 and Mg along U1 positions (x-axis ticks 0 to 150). Panel B: U1 secondary structure drawing numbered 1 to 164, with arrows marking P1d5 and V1 cleavages.]

Discussion

We have presented a strategy for probing complex mixtures of RNA that focuses on the 3’ UTR regions of mRNA molecules and can be expanded to probe the 3’ regions of other RNA molecules if preceded by enzymatic polyadenylation. We have devised and implemented a normalization strategy that decomposes the signal between the observed priming sites and models the behavior of cDNA pool extension and termination. The probing data are affected by library preparation biases but are highly reproducible between technical replicates.

The comparison of the obtained signal with examples from three classes of molecules, a spiked-in structured RNA (tmRNA), a conserved 3’ UTR regulatory element (SECIS) and a small nuclear RNA (U1), reveals that the signal is of high quality, with the P1 nuclease cleavages concentrated in single-stranded regions and the V1 cleavages in or close to the helical regions. Recently published results of in vivo structure probing (Rouskin et al., 2013) showed that in vitro folding results in mRNA being more structured than when present in the cells. Although performing in vivo RNA probing is very tempting, it may suffer from RNA being present in multiple conformations within cells, which would blur the signal derived from the functional structures. What is more, it is compatible with only a few probing reagents, limiting the probing toolset. The novelty of our strategy for finding functional RNA structures comes from combining enzymatic probing detected with massively parallel sequencing (modified from the predecessor of our method (Underwood et al., 2010)) with evolutionary conservation analysis (Pedersen et al., 2006). We investigated the structures of RNA molecules from mammals of three different orders, which radiated within a short timespan roughly 100 million years ago (Cannarozzi et al., 2007; Murphy et al., 2001). The probed transcripts were derived from the same organ (liver), and given that the three species are omnivores, we expect that they share some of the regulatory mechanisms exhibited on mRNA 3’UTR structures, as shown with the Sepp1 example. The choice of the studied organisms was supported by both dog (Karlsson and Lindblad-Toh, 2008) and mouse (Anderson and Ingham, 2003) being valuable model organisms. Liver was chosen because its transcripts are the most promising targets of Locked Nucleic Acid based antisense oligonucleotides (Janssen et al., 2013; Straarup et al., 2010), the design of which can be facilitated by knowledge of RNA structure.

Apart from the RNA structural data, the sequencing data obtained with the FragSeq2 procedure carry information defining priming sites, which can be used to find cleavage and polyadenylation sites with single-nucleotide resolution (Figure 3). What is more, inclusion of the magnesium fragmentation treatment allows gene expression to be measured. We did not include the untreated sample in the data analysis, but it could possibly be used for experimental noise correction, similarly to the use of the control sample for ΔTCR calculation in the attached Paper 2.

Currently, the gathered data are being analyzed by our collaborators with the aim of finding new structural elements present in the 3’ UTR regions. We expect that the nuclease probing data by themselves will be a useful constraint for RNA structure prediction for each species separately, as in previous transcriptome-wide structure determination projects. However, the real strength comes from the multi-species design. In this way we may detect both conserved and novel structural elements, allowing regulatory mechanisms to be uncovered. Another interesting way of looking at the data will be to correlate the structural signal with microRNA efficiency (as described in (Wan et al., 2014)) or with the occurrence of RNA modifications or editing (Dominissini et al., 2012; Peng et al., 2012).

Contributions

FragSeq2 is a collaborative project between the University of Copenhagen, the University of California, Santa Cruz and Aarhus University. Line Dahl Poulsen was involved in experiment planning and initial experiments; Andrew V. Uzilov was involved in experiment planning and data analysis; Jakob Skou Pedersen, Sudhakar Sahoo and Zsuzsanna Sükösd Etches are involved in data analysis; Sofie Salama, Jeppe Vinther and Jakob Skou Pedersen supervised the project.

References

Addo-Quaye, C., Eshoo, T.W., Bartel, D.P., and Axtell, M.J. (2008). Endogenous siRNA and miRNA targets identified by sequencing of the Arabidopsis degradome. Current Biology 18, 758-762.

Anderson, K.V., and Ingham, P.W. (2003). The transformation of the model organism: a decade of developmental genetics. Nat Genet 33 Suppl, 285‐293.

Baltz, A.G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold‐Brown, D., Drew, K., Milek, M., et al. (2012). The mRNA‐bound proteome and its global occupancy profile on protein‐coding transcripts. Molecular cell 46, 674‐690.

Bartel, D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell 136, 215‐233.

Burk, R.F., and Hill, K.E. (2009). Selenoprotein P‐expression, functions, and roles in mammals. Biochim Biophys Acta 1790, 1441‐1447.

Cannarozzi, G., Schneider, A., and Gonnet, G. (2007). A phylogenomic study of human, dog, and mouse. PLoS Comput Biol 3, e2.

Chambers, J.M., and Hastie, T. (1992). Statistical models in S (Pacific Grove, Calif., Wadsworth & Brooks/Cole Advanced Books & Software).

Cock, P.J., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research 38, 1767‐1771.

Darty, K., Denise, A., and Ponty, Y. (2009). VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics 25, 1974‐1975.

Ding, Y., Tang, Y., Kwok, C.K., Zhang, Y., Bevilacqua, P.C., and Assmann, S.M. (2013). In vivo genome‐wide profiling of RNA secondary structure reveals novel regulatory features. Nature.

Dominissini, D., Moshitch‐Moshkovitz, S., Schwartz, S., Salmon‐Divon, M., Ungar, L., Osenberg, S., Cesarkas, K., Jacob‐Hirsch, J., Amariglio, N., Kupiec, M., et al. (2012). Topology of the human and mouse m6A RNA methylomes revealed by m6A‐seq. Nature 485, 201‐206.

Fletcher, J.E., Copeland, P.R., Driscoll, D.M., and Krol, A. (2001). The selenocysteine incorporation machinery: interactions between the SECIS RNA and the SECIS‐binding protein SBP2. RNA 7, 1442‐1453.

Forconi, M., and Herschlag, D. (2009). Metal ion‐based RNA cleavage as a structural probe. Methods in Enzymology 468, 91‐106.

Grimson, A., Farh, K.K., Johnston, W.K., Garrett‐Engele, P., Lim, L.P., and Bartel, D.P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Molecular cell 27, 91‐105.

Hofacker, I.L., Priwitzer, B., and Stadler, P.F. (2004). Prediction of locally stable RNA secondary structures for genome‐wide surveys. Bioinformatics 20, 186‐190.

Jambhekar, A., and Derisi, J.L. (2007). Cis‐acting determinants of asymmetric, cytoplasmic RNA transport. RNA 13, 625‐642.

Janssen, H.L., Reesink, H.W., Lawitz, E.J., Zeuzem, S., Rodriguez‐Torres, M., Patel, K., van der Meer, A.J., Patick, A.K., Chen, A., Zhou, Y., et al. (2013). Treatment of HCV infection by targeting microRNA. N Engl J Med 368, 1685‐1694.

Karabiber, F., McGinnis, J.L., Favorov, O.V., and Weeks, K.M. (2013). QuShape: rapid, accurate, and best‐practices quantification of nucleic acid probing information, resolved by capillary electrophoresis. RNA 19, 63‐73.

Karlsson, E.K., and Lindblad‐Toh, K. (2008). Leader of the pack: gene mapping in dogs and other model organisms. Nat Rev Genet 9, 713‐725.

Kedde, M., van Kouwenhove, M., Zwart, W., Oude Vrielink, J.A., Elkon, R., and Agami, R. (2010). A Pumilio‐induced RNA structure switch in p27‐3' UTR controls miR‐221 and miR‐222 accessibility. Nature cell biology 12, 1014‐1020.

Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278‐1284.

Kertesz, M., Wan, Y., Mazor, E., Rinn, J.L., Nutter, R.C., Chang, H.Y., and Segal, E. (2010). Genome‐wide measurement of RNA secondary structure in yeast. Nature 467, 103‐107.

Kielpinski, L.J., Boyd, M., Sandelin, A., and Vinther, J. (2013). Detection of reverse transcriptase termination sites using cDNA ligation and massive parallel sequencing. Methods Mol Biol 1038, 213‐231.

Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009). Ultrafast and memory‐efficient alignment of short DNA sequences to the human genome. Genome biology 10, R25.

Lowman, H.B., and Draper, D.E. (1986). On the recognition of helical RNA by cobra venom V1 nuclease. The Journal of Biological Chemistry 261, 5396‐5403.

Lunde, B.M., Moore, C., and Varani, G. (2007). RNA‐binding proteins: modular design for efficient function. Nature reviews Molecular cell biology 8, 479‐490.

Martin, M. (2011). Cutadapt removes adapter sequences from high‐throughput sequencing reads. EMBnet.journal 17, 10‐12.

Mayr, C., and Bartel, D.P. (2009). Widespread shortening of 3'UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673‐684.

Mignone, F., Gissi, C., Liuni, S., and Pesole, G. (2002). Untranslated regions of mRNAs. Genome biology 3, REVIEWS0004.

Morita, M., Ler, L.W., Fabian, M.R., Siddiqui, N., Mullin, M., Henderson, V.C., Alain, T., Fonseca, B.D., Karashchuk, G., Bennett, C.F., et al. (2012). A novel 4EHP‐GIGYF2 translational repressor complex is essential for mammalian development. Molecular and cellular biology 32, 3585‐3593.

Murphy, W.J., Eizirik, E., Johnson, W.E., Zhang, Y.P., Ryder, O.A., and O'Brien, S.J. (2001). Molecular phylogenetics and the origins of placental mammals. Nature 409, 614‐618.

Pedersen, J.S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad‐Toh, K., Lander, E.S., Kent, J., Miller, W., and Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2, e33.

Peng, Z., Cheng, Y., Tan, B.C., Kang, L., Tian, Z., Zhu, Y., Zhang, W., Liang, Y., Hu, X., Tan, X., et al. (2012). Comprehensive analysis of RNA‐Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol 30, 253‐260.

Pruitt, K.D., Tatusova, T., Klimke, W., and Maglott, D.R. (2009). NCBI Reference Sequences: current status, policy and new initiatives. Nucleic acids research 37, D32‐36.

Ray, D., Kazan, H., Cook, K.B., Weirauch, M.T., Najafabadi, H.S., Li, X., Gueroussov, S., Albu, M., Zheng, H., Yang, A., et al. (2013). A compendium of RNA‐binding motifs for decoding gene regulation. Nature 499, 172‐177.


Romier, C., Dominguez, R., Lahm, A., Dahl, O., and Suck, D. (1998). Recognition of single‐stranded DNA by nuclease P1: high resolution crystal structures of complexes with substrate analogs. Proteins 32, 414‐424.

Rouskin, S., Zubradt, M., Washietl, S., Kellis, M., and Weissman, J.S. (2013). Genome‐wide probing of RNA structure reveals active unfolding of mRNA structures in vivo. Nature.

Seeher, S., Mahdi, Y., and Schweizer, U. (2012). Post‐transcriptional control of selenoprotein biosynthesis. Curr Protein Pept Sci 13, 337‐346.


Underwood, J.G., Uzilov, A.V., Katzman, S., Onodera, C.S., Mainzer, J.E., Mathews, D.H., Lowe, T.M., Salama, S.R., and Haussler, D. (2010). FragSeq: transcriptome‐wide RNA structure probing using high‐throughput sequencing. Nature Methods 7, 995‐1001.

Walczak, R., Westhof, E., Carbon, P., and Krol, A. (1996). A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA 2, 367‐379.

Wan, Y., Qu, K., Zhang, Q.C., Flynn, R.A., Manor, O., Ouyang, Z., Zhang, J., Spitale, R.C., Snyder, M.P., Segal, E., et al. (2014). Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505, 706‐709.

Washietl, S., Hofacker, I.L., and Stadler, P.F. (2005). Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A 102, 2454‐2459.

Washietl, S., Pedersen, J.S., Korbel, J.O., Stocsits, C., Gruber, A.R., Hackermuller, J., Hertel, J., Lindemeyer, M., Reiche, K., Tanzer, A., et al. (2007). Structured RNAs in the ENCODE selected regions of the human genome. Genome research 17, 852‐864.

Zhang, L., Kasif, S., Cantor, C.R., and Broude, N.E. (2004). GC/AT‐content spikes as genomic punctuation marks. Proceedings of the National Academy of Sciences of the United States of America 101, 16855‐16860.

Ziehler, W.A., and Engelke, D.R. (2001). Probing RNA structure with chemical reagents and enzymes. Curr Protoc Nucleic Acid Chem Chapter 6, Unit 6.1.

Zwieb, C., Gorodkin, J., Knudsen, B., Burks, J., and Wower, J. (2003). tmRDB (tmRNA database). Nucleic Acids Res 31, 446‐447.

Tables

Table 1. Structured RNA spike‐in molecules

Name Sequence

ryhB GGCGAUCAGGAAGACCCUCGCGGAGAACCUGAAAGCACGACAUUGCUCACAUUGCUUCCAGUAUUACUUAGCCAGCCGGGUGCUGGCUUUUACCUA

6S GUUUCUCUGAGAUGUUCGCAAGCGGGCCAGUCCCCUGAGCCGAUAUUUCAUACCACAAGAAUGUGGCGCUCCGCGGUUGGUGAGCAUGCUCGGUCCGUCCGAGAAGCCUUAAAACUGCGACGACACAUUCACCUUGAACCAAGGGUUCAAGGGUUACAGCCUGCGGCGGCAUCUCGGAGAUUCCACCUA

tmRNA GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGAAACCCAAGGUGCAUGCCGAGGGGCGGUUGGCCUCGUAAAAAGCCGCAAAAAAUAGUCGCAAACGACGAAAACUACGCUUUAGCAGCUUAAUAACCUGCUUAGAGCCCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGAUCGCGUGGAAGCCCUGCCUGGGGUUGAAGCGUUAAAACUUAAUCAGGCUAGUUUGUUAGUGGCGUGUCCGUCCGCAGCUGGCAAGCGAAUGUAAAGACUGACUAAGCAUGUAGUACCGAGGAUGUAGGAAUUUCGGACGCGGGUUCAACUCCCGCCAGCUCCAACCUA

DsrA GAACACAUCAGAUUUCCUGGUGUAACGAAUUUUUUAAGUGCUUCUUGCUUAAGCAAGUUUCAUCCCGACCCCCUCAGGGUCGGGAUUUACCUA

TPPapt GGACUCGGGGUGCCCUUCUGCGUGAAGGCUGAGAAAUACCCGUAUCACCUGAUCUGGAUAAUGCCAGCGUAGGGAAGUCACGGACCACCAGGUCAUUGCUUCUUCACGUUAUGGCAGGAGCAAACUAUGCAAGUCGACCUGCUGGGUUCAGCGCAAUCUGCGCACGACCUA

fhlA220 GGCAGCGUUACAUUCCCAUCCACUGGGGAAAGACGCGGCGCUGAUUGGUGAAGUGGUGGAACGUAAAGGUGUUCGUCUUGCCGGUCUGUAUGGCGUGAAACGAACCCUCGAUUUACCACACGCCGAACCGCUUCCGCGUAUAUGCUAAUAAAAUUCUAAAUCUCCUAUAGUUAGUCAAUGACCUUUUGCACCGCUUUGCGGUGCUUUCCUGGAAGAACAAAAUGUCAUAUACACCGAUGAGUGAUCUCGGACAACAAGGGUUGUUCGACAUCACUCGGACAACCUA

Spot42 GGUAGGGUACAGAGGUAAGAUGUUCUAUCUUUCAGACCUUUUACUUCACGUAAUCGGAUUUGGCUGAAUAUUUUAGCCGCCCCAGUCAGUAAUGACUGGGGCGUUUUUUAACCUA

Table 2. Oligonucleotides used in the study

Name Sequence Remarks

phosphoseqADAPT ACACUCUUUCCCUACACGACGCUCUUCCGAUCUNN RNA

Adapter_oligo_dT AGACGTGTGCTCTTCCGATCTTTTTTTTTTTTTTTTTTTVN DNA

multi1_short AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCT DNA

INDEX#_long CAAGCAGAAGACGGCATACGAGATxxxxxxGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT DNA; xxxxxx indicates the specific index sequence. See (Kielpinski et al., 2013) for more information.
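The degenerate positions in Table 2 can be made explicit with a short sketch. Assuming the standard IUPAC nucleotide codes (V = A, C or G; N = any base), the Python snippet below enumerates the concrete sequences encoded by the last five bases of Adapter_oligo_dT. The function name `expand` and the `IUPAC` dictionary are illustrative and not part of the original study.

```python
from itertools import product

# Standard IUPAC codes needed for the DNA oligonucleotides in Table 2
# (assumption: V = A, C or G; N = any of the four DNA bases).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG", "N": "ACGT"}


def expand(seq, alphabet=IUPAC):
    """Return every concrete sequence encoded by a degenerate sequence."""
    choices = [alphabet[base] for base in seq]
    return ["".join(combo) for combo in product(*choices)]


# Last five bases of Adapter_oligo_dT as listed in Table 2.
oligo_dt_tail = "TTTVN"
variants = expand(oligo_dt_tail)

print(len(variants))  # 3 options for V times 4 options for N
print(variants[:4])
```

Because V excludes T, none of the enumerated variants continues the T stretch at the V position.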

Table 3. Sequencing and mapping statistics

Sequencing name: 121019 121019 121019 121019 121019 121019 121019 121019 121019 121019

Sample: Mmus_PA Mmus_PA Mmus_PA Mmus_PA Mmus_PA Cfam_PA Cfam_PA Cfam_PA Cfam_PA Cfam_PA

Spike-ins: F F F F F F F F F F

Treatment: P1 V1 Mg NONE P1/5 P1 V1 Mg NONE P1/5

Index: 1 2 3 4 13 5 6 7 8 14

Reads: 26,184,914 31,436,703 13,620,819 1,859,822 9,418,061 24,230,982 22,656,312 22,802,173 2,508,975 18,657,694

Reads mapped to priming sites: - - 1,951,595 - - - - 1,375,814 - -

Reads mapped as cleavage sites: 17,538,091 19,790,932 9,588,606 1,116,842 7,243,654 9,418,105 7,864,929 8,748,960 961,805 7,950,991

% of reads mapped as cleavage sites: 66.98% 62.95% 70.40% 60.05% 76.91% 38.87% 34.71% 38.37% 38.33% 42.62%

Sequencing name: 130220 130220 130220 130220 130220 130220 130220 130220 130220 130220

Sample: Mmus_RZ Mmus_RZ Mmus_RZ Mmus_RZ Mmus_RZ Hsap_PA Hsap_PA Hsap_PA Hsap_PA Hsap_PA

Spike-ins: T T T T T T T T T T

Treatment: P1 P1/5 V1/5 Mg NONE P1 P1/5 V1/5 Mg NONE

Index: 1 3 4 5 6 7 9 10 11 12

Reads: 18,918,149 18,129,511 21,480,088 18,212,256 3,699,234 17,466,036 12,331,841 28,211,757 18,876,283 2,575,061

Reads mapped to priming sites: - - - 2,296,595 - - - - 3,191,888 -

Reads mapped as cleavage sites: 15,339,647 15,378,657 15,297,433 15,130,272 2,373,297 11,578,819 9,652,216 16,341,368 13,455,068 1,532,410

% of reads mapped as cleavage sites: 81.08% 84.83% 71.22% 83.08% 64.16% 66.29% 78.27% 57.92% 71.28% 59.51%
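The percentage row of Table 3 corresponds to the number of reads mapped as cleavage sites divided by the total number of reads. A minimal Python sketch, using the 121019 Mmus_PA counts copied from the table, reproduces these values; the variable and function names are illustrative only.

```python
# Counts copied from Table 3 (sequencing name 121019, sample Mmus_PA).
reads = {
    "P1": 26_184_914,
    "V1": 31_436_703,
    "Mg": 13_620_819,
    "NONE": 1_859_822,
    "P1/5": 9_418_061,
}
cleavage = {
    "P1": 17_538_091,
    "V1": 19_790_932,
    "Mg": 9_588_606,
    "NONE": 1_116_842,
    "P1/5": 7_243_654,
}


def percent_mapped(mapped, total):
    """Percentage of all reads that mapped as cleavage sites."""
    return 100.0 * mapped / total


percentages = {t: percent_mapped(cleavage[t], reads[t]) for t in reads}

for treatment, pct in percentages.items():
    print(f"{treatment}: {pct:.2f}%")
```

The same calculation applies to the other samples in the table by substituting their read counts.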
