CS5263 Bioinformatics

Lecture 21

RNA Secondary Structure Prediction

Road map

• Biological roles for RNA

• What’s “secondary structure”?

• How is it represented?

• Why is it important?

• How to predict?

Central dogma

The flow of genetic information

DNA → RNA → Protein

• Transcription: DNA → RNA

• Translation: RNA → Protein

• Replication: DNA → DNA

Classical Roles for RNA

• mRNA - Messenger RNA

• tRNA - Transfer RNA (~61 kinds, ~75 nt)

• rRNA - Ribosomal RNA (~4 kinds, 120-5k nt)

[Figure labels: Ribosome, Protein]

“Semi-classical” RNA

• snRNA - small nuclear RNA (splicing: U1, etc., 60-300 nt)

• RNaseP - tRNA processing (~300 nt)

• SRP - signal recognition particle; membrane targeting (~100-300 nt)

• tmRNA - resetting stalled ribosomes, destroying aberrant mRNA

• Telomerase - (200-400 nt)

• snoRNA - small nucleolar RNA (many varieties; 80-200 nt)

New Roles for RNA

• Riboswitch: an mRNA regulates its own activity

• siRNA (Nobel prize 2006, Fire & Mello)

• microRNAs

• saRNA: small activating RNA

• Hundreds of families
– Rfam release 1, 1/2003: 25 families, 55k instances
– Rfam release 7, 3/2005: 503 families, 300k instances

Example: Riboswitch

Non-coding RNAs

Dramatic discoveries in the last 5 years

• 100s of new families

• Many roles: regulation, transport, stability, catalysis, …

• 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA

Take-home message

• RNAs play many important roles in the cell beyond the classical roles
– Many of which are yet to be discovered

• RNA functions are determined by structures

RNA structure

• Primary: sequence

• Secondary: base-pairing

• Tertiary: 3D shape

RNA base-pairing

• Watson-Crick pairing
– C-G ~3 kcal/mole
– A-U ~2 kcal/mole

• “Wobble pair” G-U ~1 kcal/mole

• Non-canonical Pairs

tRNA structure

Secondary structure prediction

• Given: CAUUUGUGUACCU…

• Goal: [Figure: the secondary structure of the sequence]

• How can we compute that?

Terminology

[Figure: RNA secondary structure with labeled elements: hairpin loops, bulge loop, interior loops, multi-branched loop]

Pseudoknot

• Makes structure prediction hard. Not considered in most algorithms.

[Figure: pseudoknot example on the sequence 5’-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3’, positions numbered 10, 20, 30, 40]

The Nussinov algorithm

• Goal: maximizing the number of base-pairs

• Idea: Dynamic programming
– Loop matching
– Nussinov, Pieczenik, Griggs, Kleitman ’78

• Too simple for accurate prediction, but stepping-stone for later algorithms

Problem: find the RNA structure with the maximum (weighted) number of nested pairings

Nested: no pseudoknots

[Figure: nested pairings drawn on the example sequence ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG]

• Given sequence X = x1…xN

• Define DP matrix: F(i, j) = maximum number of base-pairs if xi…xj folds optimally
– Matrix is symmetric, so let i < j

• Can be summarized into two cases:
– (i, j) paired: optimal score is 1 + F(i+1, j-1)
– (i, j) unpaired: optimal score is max_k F(i, k) + F(k+1, j), for i ≤ k < j

• There are a number of other ways to summarize the cases, all equivalent

• F(i, i) = 0

• Recursion:

$$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

• S(xi, xj) = 1 if xi, xj can form a base-pair, and 0 otherwise
– Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1
– Or other types of scores (later)

• F(1, N) gives the optimal score for the whole sequence
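The scoring function S can be written down directly. The sketch below is one possible encoding, not the lecture's own code; the names `PAIR_WEIGHTS` and `pair_score` are hypothetical, and the weights are the generalized values listed above.

```python
# Hypothetical helper: S(xi, xj) from the slide, in unit or weighted form.
# frozenset makes the lookup symmetric, so S(A, U) == S(U, A).
PAIR_WEIGHTS = {
    frozenset("AU"): 2,  # Watson-Crick A-U
    frozenset("CG"): 3,  # Watson-Crick C-G
    frozenset("GU"): 1,  # wobble G-U
}


def pair_score(a, b, weighted=False):
    """Return S(a, b): 0 if the bases cannot pair, else 1 (or the weight)."""
    key = frozenset((a.upper(), b.upper()))
    if key not in PAIR_WEIGHTS:
        return 0
    return PAIR_WEIGHTS[key] if weighted else 1
```

With `weighted=False` this counts base pairs; with `weighted=True` it gives the generalized scores.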

How to fill in the DP matrix?

$$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

[Figure: DP matrix with zeros on the main diagonal; cell (i, j) is computed from cell (i+1, j-1) and from pairs of cells F(i, k), F(k+1, j)]

Fill in the matrix one diagonal at a time: j – i = 1, then j – i = 2, then j – i = 3, …, up to j – i = N – 1.

Minimum Loop length

• Sharp turns unlikely

• Let minimum length of hairpin loop be 1

• F(i, j) = 0 for j – i < 2

[Figure: example hairpin]

Algorithm

Initialization:
    F(i, i) = 0 for i = 1 to N
    F(i, i+1) = 0 for i = 1 to N-1

Iteration:
    For L = 2 to N-1
        For i = 1 to N – L
            j = i + L
            F(i, j) = max { F(i+1, j-1) + s(xi, xj),  max_{i ≤ k < j} [ F(i, k) + F(k+1, j) ] }

Termination:
    Best score is given by F(1, N)
    (Need to trace back; refer to the Durbin book)
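The fill step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: indices are 0-based, `can_pair` and `nussinov_fill` are hypothetical names, every allowed pair scores 1, and a non-empty sequence is assumed.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, C-G) and wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})


def nussinov_fill(seq, min_loop=1):
    """Fill the upper triangle of F; F[i][j] is the best pair count for seq[i..j]."""
    n = len(seq)
    F = [[0] * n for _ in range(n)]          # also covers F(i, j) = 0 for short spans
    for span in range(min_loop + 1, n):      # span = j - i, one diagonal at a time
        for i in range(n - span):
            j = i + span
            # Case 1: i and j pair with each other (score 0 if they cannot pair).
            best = F[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0)
            # Case 2: split the interval into two independent sub-structures.
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F


F = nussinov_fill("GGGAAAUCC")
print(F[0][len(F) - 1])  # optimal score for the whole sequence
```

The outer loop over `span` is what guarantees that F(i+1, j-1), F(i, k) and F(k+1, j) are already filled in when F(i, j) is computed.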

Complexity

For L = 2 to N-1
    For i = 1 to N – L
        j = i + L
        F(i, j) = max { F(i+1, j-1) + s(xi, xj),  max_{i ≤ k < j} [ F(i, k) + F(k+1, j) ] }

• Time complexity: O(N^3)

• Memory: O(N^2)
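A rough count behind the cubic bound, as a sketch: for each span L = j – i there are N – L cells, and each cell takes a maximum over at most L split points k, so the total work is bounded by

```latex
\sum_{L=1}^{N-1} (N-L)\,L \;=\; \frac{(N-1)\,N\,(N+1)}{6} \;=\; O(N^3)
```

The matrix itself has N × N entries, which gives the O(N^2) memory.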

Example

• RNA sequence: GGGAAAUCC

• Only count # of base-pairs
– A-U = 1
– G-C = 1
– G-U = 1

• Minimum hairpin loop length = 1

Filled DP matrix F(i, j) (row i, column j; only i ≤ j shown):

      G  G  G  A  A  A  U  C  C
G     0  0  0  0  0  0  1  2  3
G        0  0  0  0  0  1  2  3
G           0  0  0  0  1  2  2
A              0  0  0  1  1  1
A                 0  0  1  1  1
A                    0  0  0  0
U                       0  0  0
C                          0  0
C                             0

Optimal score: F(1, 9) = 3

[Figure: traceback through the matrix, with three alternative optimal structures of 3 base pairs each]
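The traceback is left to the Durbin book in the lecture; the sketch below shows one way it could look, under the same assumptions as the example (unit scores, minimum hairpin loop length 1). The names `can_pair` and `nussinov_fold` are hypothetical, and the traceback reports only one of the possibly several optimal structures, in dot-bracket notation.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, C-G) and wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})


def nussinov_fold(seq, min_loop=1):
    """Return (best pair count, one optimal structure in dot-bracket form)."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # Fill step: same recursion as on the slides, 0-based indices.
    F = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = F[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0)
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    # Traceback: revisit each interval and recover which case gave its score.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                         # too short to contain a pair
        if can_pair(seq[i], seq[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):                # otherwise some split explains the score
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return F[0][n - 1], "".join(structure)


print(nussinov_fold("GGGAAAUCC"))  # (3, '(((...)))')
```

Because the pairing case is tested first, this traceback returns the fully stacked structure for GGGAAAUCC; a different tie-breaking order would return one of the other optimal structures with the same score.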

Energy minimization

For L = 2 to N-1
    For i = 1 to N – L
        j = i + L
        E(i, j) = min { E(i+1, j-1) + e(xi, xj),  min_{i ≤ k < j} [ E(i, k) + E(k+1, j) ] }

(Same initialization as before: E(i, j) = 0 for j – i < 2.)

e(xi, xj) is the energy of xi base-pairing with xj

• Energies are negative values, so the score is minimized rather than maximized.

• More complex energy rules: energy depends on neighboring bases
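A minimal sketch of the energy version, assuming the approximate pairing energies from the base-pairing slide with a negative sign (C-G −3, A-U −2, G-U −1 kcal/mole). The names `PAIR_ENERGY` and `min_energy` are hypothetical, and bases that cannot pair are simply skipped in the pairing case, which is an assumption about how e(xi, xj) is defined for them.

```python
# Assumed energies in kcal/mole; illustrative values only.
PAIR_ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def min_energy(seq, min_loop=1):
    """Minimum total pairing energy over nested structures of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Bifurcation case: best split into two sub-intervals.
            best = min(E[i][k] + E[k + 1][j] for k in range(i, j))
            # Pairing case, only when xi and xj can actually pair.
            e = PAIR_ENERGY.get(frozenset((seq[i], seq[j])))
            if e is not None:
                best = min(best, E[i + 1][j - 1] + e)
            E[i][j] = best
    return E[0][n - 1]


print(min_energy("GGGAAAUCC"))  # -8.0
```

For GGGAAAUCC the minimum is reached with two G-C pairs and one A-U pair, so weighting the pairs prefers the A-U closing pair over the G-U pair that a pure pair count treats as equally good.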

The Zuker algorithm – main ideas

1. Instead of base pairs, pairs of base pairs (more accurate)

2. Separate score for bulges

3. Separate score for different-size & composition of loops

4. Separate score for interactions between stem & beginning of loop

5. Use additional matrices to remember the current state, similar to affine-gap alignment

Two popular implementations

• mFold by Zuker

• RNAfold in the Vienna package (Hofacker)
– Includes several useful utilities, such as structure comparison, searching, base-pairing probability from partition functions, etc.

Accuracy

• 50-70% for sequences up to 300 nt

• Not perfect, but useful

• Possible reasons:
– Energy rule not perfect: 5-10% error
– Many alternative structures within this error range
– Alternative structures do exist
– Structure may change in presence of other molecules

Comparative structure prediction

Given K homologous aligned RNA sequences:

Human aagacuucggaucuggcgacaccc

Mouse uacacuucggaugacaccaaagug

Worm aggucuucggcacgggcaccauuc

Fly ccaacuucggauuuugcuaccaua

Orc aagccuucggagcgggcguaacuc

If the bases at positions i and j are always complementary and covary across the sequences, then the two positions are likely to be paired in the structure

Mutual information

f_ab(i, j): fraction of sequences with base a at position i and base b at position j

f_a(i): fraction of sequences with base a at position i

$$M(i, j) = \sum_{a, b \in \{A, C, G, U\}} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i)\, f_b(j)}$$

Example: positions 3 and 13 of the alignment above

f_gc(3,13) = 3/5, f_cg(3,13) = 1/5, f_au(3,13) = 1/5

f_g(3) = 3/5, f_c(3) = 1/5, f_a(3) = 1/5

f_c(13) = 3/5, f_g(13) = 1/5, f_u(13) = 1/5

$$M(3, 13) = 0.6 \log_2 \frac{0.6}{0.6 \times 0.6} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2}$$
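The same calculation can be checked numerically. The sketch below assumes frequencies are fractions of sequences and that column indices are 1-based as in the example; `ALIGNMENT` and `mutual_information` are hypothetical names, and the sequences are the five aligned sequences given earlier.

```python
from collections import Counter
from math import log2

# The five aligned sequences from the comparative prediction example.
ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",  # Human
    "uacacuucggaugacaccaaagug",  # Mouse
    "aggucuucggcacgggcaccauuc",  # Worm
    "ccaacuucggauuuugcuaccaua",  # Fly
    "aagccuucggagcgggcguaacuc",  # Orc
]


def mutual_information(alignment, i, j):
    """M(i, j) for 1-based columns i and j, in bits."""
    n = len(alignment)
    col_i = [s[i - 1] for s in alignment]
    col_j = [s[j - 1] for s in alignment]
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    m = 0.0
    for (a, b), c in count_ij.items():   # only observed pairs contribute
        f_ab = c / n
        m += f_ab * log2(f_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return m


print(round(mutual_information(ALIGNMENT, 3, 13), 3))  # 1.371
```

Pairs (a, b) that never occur have f_ab = 0 and contribute nothing to the sum, which is why the loop runs only over observed pairs.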

Mutual information

• Also called covariance score

• M is high if base a in position i is always accompanied by base b in position j
– Does not require a to base-pair with b
– Advantage: can detect non-canonical base-pairs

• However, M = 0 if there is no mutation at all, even if the bases pair perfectly
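To see why a perfectly conserved pair scores zero, consider two columns where position i is always a and position j is always b. Then f_ab(i, j) = f_a(i) = f_b(j) = 1, and the only non-zero term of the sum is

```latex
M(i, j) = 1 \cdot \log_2 \frac{1}{1 \cdot 1} = 0
```

so covariation, not conservation, is what the score measures.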

One way around this is to combine covariance and energy scores

• Given a multiple alignment, can infer the structure that maximizes the sum of mutual information, by DP

• However, alignment is hard, since structure is often more important than sequence

In practice:
1. Get multiple alignment
2. Find covarying bases – deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2

A manual EM process!!

• Align then fold

• Align and fold

• Fold then align

Context-free Grammar for RNA Secondary Structure

• S = SS | aSu | cSg | uSa | gSc | L

• L = aL | cL | gL | uL | ε

[Figure: example structure and its parse for the sequence a c g g a g u g c c c g u]

Stochastic Context-free Grammar (SCFG)

• Probabilistic context-free grammar

• Probabilities can be converted into weights

• CFG vs SCFG is similar to regular grammar (RG) vs HMM

• S = SS
• S = aSu | uSa | L
• S = cSg | gSc | L
• S = uSg | gSu | L
• L = aL | cL | gL | uL | ε

$$F(i, j) = \max \begin{cases} e(x_i, x_j) + F(i+1, j-1) \\ L(i, j) \\ \max_k \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

L(i, j) = 0

SCFG Decoding

• Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
– CYK algorithm (Viterbi)
– The Nussinov and Zuker algorithms are essentially special cases of CYK
– CYK and SCFG are also used in other domains (NLP, compilers, etc.)

SCFG Evaluation

• Given a sequence and an SCFG model
– Estimate P(seq is generated by model), summing over all possible paths

• Inside-outside algorithm
– Analogous to forward-backward
– Inside: bottom-up parsing (P(xi..xj))
– Outside: top-down parsing (P(x1..xi-1 xj+1..xN))

• Can calculate base-pairing probability
– Analogous to posterior decoding
– Essentially the same idea implemented in the Vienna RNAfold package

SCFG Learning

• Covariance model: similar to profile HMMs
– Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
– EM on SCFG
– Inside-outside algorithm
– Efficiency is a bottleneck

• Has been successfully applied to predict tRNA genes and structures
– tRNAScan

Future directions

• Structure prediction
– Secondary
– Tertiary

• Structural comparison tools
– Structural alignment

• Structure search tools
– “RNA-BLAST”

• Structural motif finding
– “RNA-MEME”
