![Page 1: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/1.jpg)
CS5263 Bioinformatics
Lecture 21
RNA Secondary Structure Prediction
![Page 2: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/2.jpg)
Road map
• Biological roles for RNA
• What’s “secondary structure”?
• How is it represented?
• Why is it important?
• How to predict?
![Page 3: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/3.jpg)
Central dogma
The flow of genetic information
DNA → (transcription) → RNA → (translation) → Protein
DNA → DNA (replication)
![Page 4: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/4.jpg)
Classical Roles for RNA
• mRNA - messenger RNA
• tRNA - transfer RNA (~61 kinds, ~75 nt)
• rRNA - ribosomal RNA (~4 kinds, 120-5k nt)
Ribosome
Protein
RNA
![Page 5: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/5.jpg)
Classical Roles for RNA
• mRNA
• tRNA
• rRNA
Ribosome
![Page 6: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/6.jpg)
“Semi-classical” RNA
• snRNA - small nuclear RNA (splicing: U1, etc.; 60-300 nt)
• RNaseP - tRNA processing (~300 nt)
• SRP - signal recognition particle; membrane targeting (~100-300 nt)
• tmRNA - resetting stalled ribosomes, destroying aberrant mRNA
• Telomerase - (200-400 nt)
• snoRNA - small nucleolar RNA (many varieties; 80-200 nt)
![Page 7: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/7.jpg)
New Roles for RNA
• Riboswitch: an mRNA regulates its own activity
• siRNA (Nobel prize 2006, Fire & Mello)
• microRNAs
• saRNA: small activating RNA
• Hundreds of families
  – Rfam release 1, 1/2003: 25 families, 55k instances
  – Rfam release 7, 3/2005: 503 families, 300k instances
![Page 8: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/8.jpg)
Example: Riboswitch
![Page 9: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/9.jpg)
Non-coding RNAs
Dramatic discoveries in the last 5 years
• 100s of new families
• Many roles: regulation, transport, stability, catalysis, …
• 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA
![Page 10: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/10.jpg)
Take-home message
• RNAs play many important roles in the cell beyond the classical roles
  – Many of these roles are yet to be discovered
• RNA functions are determined by structures
![Page 11: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/11.jpg)
RNA structure
• Primary: sequence
• Secondary: base-pairing
• Tertiary: 3D shape
![Page 12: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/12.jpg)
RNA base-pairing
• Watson-Crick pairing
  – C-G ~3 kcal/mole
  – A-U ~2 kcal/mole
• “Wobble pair” G-U ~1 kcal/mole
• Non-canonical Pairs
![Page 13: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/13.jpg)
tRNA structure
![Page 14: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/14.jpg)
Secondary structure prediction
• Given: CAUUUGUGUACCU…
• Goal: (figure: the folded secondary structure of the sequence)
• How can we compute that?
![Page 15: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/15.jpg)
Terminology
• Hairpin loops
• Stems
• Bulge loop
• Interior loops
• Multi-branched loop
![Page 16: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/16.jpg)
Pseudoknot
• Makes structure prediction hard. Not considered in most algorithms.
(Figure: pseudoknot formed by the sequence below, with positions numbered from 5’ to 3’.)
5’-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3’
![Page 17: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/17.jpg)
The Nussinov algorithm
• Goal: maximizing the number of base-pairs
• Idea: dynamic programming
  – Loop matching
  – Nussinov, Pieczenik, Griggs, Kleitman ’78
• Too simple for accurate prediction, but a stepping stone for later algorithms
![Page 18: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/18.jpg)
The Nussinov algorithm
Problem:
Find the RNA structure with the maximum (weighted) number of nested pairings
Nested: no pseudoknot
(Figure: a nested secondary structure drawn for the sequence below.)
ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG
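The "nested" condition can be made concrete with a small check. The sketch below is only an illustration, not part of the lecture: it assumes a structure is given as a list of 1-based position pairs, and the function name `is_nested` is hypothetical. Two pairs (i, j) and (k, l) cross, forming a pseudoknot, when i < k < j < l.

```python
def is_nested(pairs):
    """Return True if no two base pairs cross, i.e. the structure has no pseudoknot.

    `pairs` is a list of (i, j) position tuples; this input format is an
    assumption made for the example.
    """
    # Normalize each pair so that i < j, then sort by the opening position.
    norm = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(norm):
        for k, l in norm[a + 1:]:
            # After sorting, k >= i; the pairs cross when k falls inside (i, j)
            # but l falls outside it.
            if i < k < j < l:
                return False
    return True
```

For example, the stacked pairs `[(1, 9), (2, 8), (3, 7)]` are nested, while `[(1, 5), (3, 8)]` cross and would be rejected.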
![Page 19: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/19.jpg)
The Nussinov algorithm
• Given sequence $X = x_1 \ldots x_N$
• Define DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \ldots x_j$ folds optimally
  – Matrix is symmetric, so let i < j
![Page 20: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/20.jpg)
The Nussinov algorithm
• Can be summarized into two cases:
  – (i, j) paired: optimal score is $1 + F(i+1, j-1)$
  – (i, j) unpaired: optimal score is $\max_k \left[ F(i, k) + F(k+1, j) \right]$
• There are a number of other ways to summarize, all equivalent
![Page 21: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/21.jpg)
The Nussinov algorithm
• $F(i, i) = 0$
• Recurrence:

$$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

• $S(x_i, x_j) = 1$ if $x_i$, $x_j$ can form a base-pair, and 0 otherwise
  – Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1
  – Or other types of scores (later)
• $F(1, N)$ gives the optimal score for the whole sequence
![Page 22: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/22.jpg)
How to fill in the DP matrix?
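$$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{k} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$

(Figure: upper-triangular DP matrix with zeros on the main diagonal. Cell (i, j) depends on cell (i+1, j–1), on the cells F(i, k) in row i, and on the cells F(k+1, j) in column j.)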
![Page 23: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/23.jpg)
How to fill in the DP matrix?
(Figure: the same recurrence and DP matrix, with zeros on the main diagonal; the diagonal j – i = 1 is filled first.)
![Page 24: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/24.jpg)
How to fill in the DP matrix?
(Figure: the same recurrence and DP matrix; the diagonal j – i = 2 is filled next.)
![Page 25: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/25.jpg)
How to fill in the DP matrix?
(Figure: the same recurrence and DP matrix; then the diagonal j – i = 3 is filled.)
![Page 26: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/26.jpg)
How to fill in the DP matrix?
(Figure: the same recurrence and DP matrix; the last cell filled is j – i = N – 1, i.e. F(1, N).)
![Page 27: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/27.jpg)
Minimum Loop length
• Sharp turns unlikely
• Let minimum length of hairpin loop be 1
• F(i, j) = 0 for j – i < 2

(Figure: DP matrix with zeros on the diagonals j – i = 0 and j – i = 1, next to an example hairpin.)
![Page 28: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/28.jpg)
Algorithm

Initialization:
    F(i, i) = 0 for i = 1 to N
    F(i, i+1) = 0 for i = 1 to N-1

Iteration:
    For L = 2 to N-1
        For i = 1 to N – L
            j = i + L
            F(i, j) = max { F(i+1, j-1) + s(x_i, x_j),  max_{i ≤ k < j} [ F(i, k) + F(k+1, j) ] }

Termination:
    Best score is given by F(1, N)
    (Need to trace back; refer to the Durbin book)
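The fill step above can be sketched in Python. This is a minimal illustration rather than a reference implementation: indices are 0-based, the helper names `can_pair` and `nussinov_fill` are made up for the example, and each allowed pair (A-U, C-G, G-U) scores 1. A minimum hairpin loop of 1 is assumed, as on the previous slide.

```python
def can_pair(a, b):
    """True for Watson-Crick (A-U, C-G) and wobble (G-U) pairs."""
    return {a, b} in ({"A", "U"}, {"C", "G"}, {"G", "U"})


def nussinov_fill(seq, min_loop=1):
    """Fill the DP matrix; F[i][j] is the maximum number of nested base pairs
    in seq[i..j] (0-based, inclusive). Cells with j - i <= min_loop stay 0."""
    n = len(seq)
    F = [[0] * n for _ in range(n)]
    # Fill diagonal by diagonal, from short spans to long ones.
    for span in range(min_loop + 1, n):          # span = j - i
        for i in range(n - span):
            j = i + span
            # Case 1: x_i pairs with x_j (score 1 if the bases can pair).
            best = F[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0)
            # Case 2: split the interval into two independent sub-intervals.
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F
```

With this sketch, `nussinov_fill("GGGAAAUCC")[0][8]` evaluates to 3, the score for the whole sequence. The three nested loops over span, i and k are where the cubic running time comes from.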
![Page 29: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/29.jpg)
Complexity
    For L = 2 to N-1
        For i = 1 to N – L
            j = i + L
            F(i, j) = max { F(i+1, j-1) + s(x_i, x_j),  max_{i ≤ k < j} [ F(i, k) + F(k+1, j) ] }

• Time complexity: O(N^3) (O(N^2) cells, each needing an O(N) maximization over k)
• Memory: O(N^2)
![Page 30: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/30.jpg)
Example
• RNA sequence: GGGAAAUCC
• Only count # of base-pairs
  – A-U = 1
  – G-C = 1
  – G-U = 1
• Minimum hairpin loop length = 1
![Page 31: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/31.jpg)
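DP matrix after initialization (rows i, columns j):

|   | G | G | G | A | A | A | U | C | C |
|---|---|---|---|---|---|---|---|---|---|
| G | 0 | 0 |   |   |   |   |   |   |   |
| G |   | 0 | 0 |   |   |   |   |   |   |
| G |   |   | 0 | 0 |   |   |   |   |   |
| A |   |   |   | 0 | 0 |   |   |   |   |
| A |   |   |   |   | 0 | 0 |   |   |   |
| A |   |   |   |   |   | 0 | 0 |   |   |
| U |   |   |   |   |   |   | 0 | 0 |   |
| C |   |   |   |   |   |   |   | 0 | 0 |
| C |   |   |   |   |   |   |   |   | 0 |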
![Page 32: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/32.jpg)
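DP matrix after filling the diagonal j – i = 2:

|   | G | G | G | A | A | A | U | C | C |
|---|---|---|---|---|---|---|---|---|---|
| G | 0 | 0 | 0 |   |   |   |   |   |   |
| G |   | 0 | 0 | 0 |   |   |   |   |   |
| G |   |   | 0 | 0 | 0 |   |   |   |   |
| A |   |   |   | 0 | 0 | 0 |   |   |   |
| A |   |   |   |   | 0 | 0 | 1 |   |   |
| A |   |   |   |   |   | 0 | 0 | 0 |   |
| U |   |   |   |   |   |   | 0 | 0 | 0 |
| C |   |   |   |   |   |   |   | 0 | 0 |
| C |   |   |   |   |   |   |   |   | 0 |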
![Page 33: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/33.jpg)
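DP matrix after filling the diagonal j – i = 3:

|   | G | G | G | A | A | A | U | C | C |
|---|---|---|---|---|---|---|---|---|---|
| G | 0 | 0 | 0 | 0 |   |   |   |   |   |
| G |   | 0 | 0 | 0 | 0 |   |   |   |   |
| G |   |   | 0 | 0 | 0 | 0 |   |   |   |
| A |   |   |   | 0 | 0 | 0 | 1 |   |   |
| A |   |   |   |   | 0 | 0 | 1 | 1 |   |
| A |   |   |   |   |   | 0 | 0 | 0 | 0 |
| U |   |   |   |   |   |   | 0 | 0 | 0 |
| C |   |   |   |   |   |   |   | 0 | 0 |
| C |   |   |   |   |   |   |   |   | 0 |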
![Page 34: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/34.jpg)
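DP matrix after filling the diagonal j – i = 4:

|   | G | G | G | A | A | A | U | C | C |
|---|---|---|---|---|---|---|---|---|---|
| G | 0 | 0 | 0 | 0 | 0 |   |   |   |   |
| G |   | 0 | 0 | 0 | 0 | 0 |   |   |   |
| G |   |   | 0 | 0 | 0 | 0 | 1 |   |   |
| A |   |   |   | 0 | 0 | 0 | 1 | 1 |   |
| A |   |   |   |   | 0 | 0 | 1 | 1 | 1 |
| A |   |   |   |   |   | 0 | 0 | 0 | 0 |
| U |   |   |   |   |   |   | 0 | 0 | 0 |
| C |   |   |   |   |   |   |   | 0 | 0 |
| C |   |   |   |   |   |   |   |   | 0 |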
![Page 35: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/35.jpg)
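Completed DP matrix; the optimal score is F(1, 9) = 3:

|   | G | G | G | A | A | A | U | C | C |
|---|---|---|---|---|---|---|---|---|---|
| G | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 |
| G |   | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 |
| G |   |   | 0 | 0 | 0 | 0 | 1 | 2 | 2 |
| A |   |   |   | 0 | 0 | 0 | 1 | 1 | 1 |
| A |   |   |   |   | 0 | 0 | 1 | 1 | 1 |
| A |   |   |   |   |   | 0 | 0 | 0 | 0 |
| U |   |   |   |   |   |   | 0 | 0 | 0 |
| C |   |   |   |   |   |   |   | 0 | 0 |
| C |   |   |   |   |   |   |   |   | 0 |

(Figure: three alternative structures with 3 base pairs each.)
• A-U, G-C, G-C stem with hairpin loop AA; the first G is unpaired
• G-U, G-C, G-C stem with hairpin loop AAA
• A-U, G-C, G-C stem with hairpin loop AA and a bulged G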
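The traceback that the algorithm slide leaves to the Durbin book can be sketched as follows. This is one possible version under stated assumptions: the function name `fold` is hypothetical, indices are 0-based, each allowed pair scores 1, and the result is written in dot-bracket notation, which the slides do not use. When several structures tie, it reports only the first one it finds.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def fold(seq, min_loop=1):
    """Return (best score, one optimal structure in dot-bracket notation)."""
    n = len(seq)

    def s(i, j):
        return 1 if (seq[i], seq[j]) in PAIRS else 0

    # Fill step: same recurrence as the Nussinov algorithm.
    F = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            F[i][j] = max([F[i + 1][j - 1] + s(i, j)] +
                          [F[i][k] + F[k + 1][j] for k in range(i, j)])

    # Traceback: revisit intervals and ask which case produced F[i][j].
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                              # too short to hold a pair
        if s(i, j) and F[i][j] == F[i + 1][j - 1] + 1:
            structure[i], structure[j] = "(", ")"  # x_i pairs with x_j
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):                     # otherwise find a split point
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return (F[0][n - 1] if n else 0), "".join(structure)
```

For the example sequence, `fold("GGGAAAUCC")` returns `(3, "(((...)))")`, which is the G-U, G-C, G-C stem with the AAA hairpin loop. The other two optimal structures have the same score and would be reached by breaking ties differently.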
![Page 39: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/39.jpg)
Energy minimization
    For L = 2 to N-1
        For i = 1 to N – L
            j = i + L
            E(i, j) = min { E(i+1, j-1) + e(x_i, x_j),  min_{i ≤ k < j} [ E(i, k) + E(k+1, j) ] }

e(x_i, x_j) is the energy of x_i base-pairing with x_j; as before, E(i, j) = 0 for j – i < 2.
• Energies are negative values, so we minimize rather than maximize.
• More complex energy rules: energy depends on neighboring bases
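As a rough illustration of the minimization form, the sketch below reuses the Nussinov loop with `min` in place of `max`. The pair energies are an assumption for the example: they are the approximate magnitudes from the base-pairing slide (C-G ~3, A-U ~2, G-U ~1 kcal/mole) with a negative sign, not a real thermodynamic parameter set. The names `ENERGY` and `min_energy` are hypothetical.

```python
# Illustrative pair energies in kcal/mole (assumed values, see lead-in).
ENERGY = {
    frozenset("CG"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}


def min_energy(seq, min_loop=1):
    """Return the minimum total pair energy over nested structures of seq."""
    n = len(seq)
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):           # span = j - i
        for i in range(n - span):
            j = i + span
            # Case 2 first: split into two independent sub-intervals.
            best = min(E[i][k] + E[k + 1][j] for k in range(i, j))
            # Case 1: x_i pairs with x_j, only if the bases can pair at all.
            pair = ENERGY.get(frozenset((seq[i], seq[j])))
            if pair is not None:
                best = min(best, E[i + 1][j - 1] + pair)
            E[i][j] = best
    return E[0][n - 1] if n else 0.0
```

Under these assumed energies, `min_energy("GGGAAAUCC")` gives -8.0: the two G-C pairs plus an A-U pair beat the structure that closes the loop with a G-U pair. That shows how weighting the pairs changes which of the equally sized structures is preferred.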
![Page 40: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/40.jpg)
Terminology
• Hairpin loops
• Stems
• Bulge loop
• Interior loops
• Multi-branched loop
![Page 41: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/41.jpg)
The Zuker algorithm – main ideas
1. Score pairs of adjacent (stacked) base pairs instead of individual base pairs (more accurate)
2. Separate score for bulges
3. Separate scores for loops of different size and composition
4. Separate score for interactions between the stem and the beginning of the loop
5. Use an additional matrix to remember the current state, similar to affine-gap alignment
![Page 42: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/42.jpg)
Two popular implementations
• mFold by Zuker
• RNAfold in the Vienna package (Hofacker)
  – Includes several useful utilities, such as structure comparison, searching, base-pairing probability from partition functions, etc.
![Page 43: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/43.jpg)
Accuracy
• 50-70% for sequences up to 300 nt
• Not perfect, but useful
• Possible reasons:
  – Energy rules are not perfect: 5-10% error
  – Many alternative structures lie within this error range
  – Alternative structures do exist
  – Structure may change in the presence of other molecules
![Page 44: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/44.jpg)
Comparative structure prediction
Given K homologous aligned RNA sequences:
Human aagacuucggaucuggcgacaccc
Mouse uacacuucggaugacaccaaagug
Worm aggucuucggcacgggcaccauuc
Fly ccaacuucggauuuugcuaccaua
Orc aagccuucggagcgggcguaacuc
If the ith and jth positions always hold complementary bases and covary (a change at one position is compensated by a change at the other), then they are likely to be paired
![Page 45: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/45.jpg)
Mutual information
$f_{ab}(i,j)$: fraction of sequences in which the pair a, b is in positions i, j
$f_a(i)$: fraction of sequences in which the base a is in position i

$$M(i,j) = \sum_{a,b \in \{A,C,G,U\}} f_{ab}(i,j) \log_2 \frac{f_{ab}(i,j)}{f_a(i)\, f_b(j)}$$

For positions 3 and 13 of the five aligned sequences above:
• $f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$
• $f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$
• $f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$

$$M(3,13) = 0.6 \log_2 \frac{0.6}{0.6 \times 0.6} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} = 1.37$$
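The worked value can be checked with a short script. This is a sketch for illustration only: the function name `mutual_information` is hypothetical, columns are addressed by 1-based position as on the slide, and frequencies are taken as fractions of the five aligned sequences.

```python
from collections import Counter
from math import log2

# The five aligned sequences from the comparative-prediction slide.
ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",  # Human
    "uacacuucggaugacaccaaagug",  # Mouse
    "aggucuucggcacgggcaccauuc",  # Worm
    "ccaacuucggauuuugcuaccaua",  # Fly
    "aagccuucggagcgggcguaacuc",  # Orc
]


def mutual_information(alignment, i, j):
    """Return M(i, j) in bits for 1-based alignment columns i and j."""
    n = len(alignment)
    col_i = [seq[i - 1] for seq in alignment]
    col_j = [seq[j - 1] for seq in alignment]
    f_i = Counter(col_i)                 # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))    # joint counts of (a, b)
    # Only observed (a, b) combinations contribute; unobserved ones have f_ab = 0.
    return sum(
        (c / n) * log2((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), c in f_ij.items()
    )
```

Calling `mutual_information(ALIGNMENT, 3, 13)` gives about 1.37 bits, matching the slide. Two fully conserved columns, such as 6 and 7, give 0.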
![Page 46: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/46.jpg)
Mutual information
• Also called covariance score
• M is high if base a in position i is always accompanied by base b in position j
  – Does not require a to base-pair with b
  – Advantage: can detect non-canonical base-pairs
• However, M = 0 if there is no mutation at all, even if the two positions form perfect base-pairs
• One way to get around this is to combine covariance and energy scores
![Page 47: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/47.jpg)
Comparative structure prediction
• Given a multiple alignment, we can infer the structure that maximizes the sum of mutual information, by DP
• However, alignment is hard, since structure is often more important than sequence
![Page 48: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/48.jpg)
Comparative structure prediction
In practice:
1. Get multiple alignment
2. Find covarying bases – deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2

A manual EM process!!
![Page 49: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/49.jpg)
Comparative structure prediction
• Align then fold
• Align and fold
• Fold then align
![Page 50: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/50.jpg)
Context-free Grammar for RNA Secondary Structure
• S = SS | aSu | cSg | uSa | gSc | L
• L = aL | cL | gL | uL | ε

(Figure: example sequence a c g g a g u g c c c g u, its folded structure, and its parse tree under this grammar.)
![Page 51: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/51.jpg)
Stochastic Context-free Grammar (SCFG)
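• Probabilistic context-free grammar
• Probabilities can be converted into weights
• CFG vs SCFG is similar to RG (regular grammar) vs HMM

Rules with weights (matching S(A, U) = 2, S(C, G) = 3, S(G, U) = 1 from the Nussinov scoring):
• S = SS (weight 0)
• S = aSu | uSa | L (weight 2)
• S = cSg | gSc | L (weight 3)
• S = uSg | gSu | L (weight 1)
• L = aL | cL | gL | uL | ε (weight 0)

Corresponding recurrence:

$$F(i, j) = \max \begin{cases} e(x_i, x_j) + F(i+1, j-1) \\ L(i, j) \\ \max_k \left[ F(i, k) + F(k+1, j) \right] \end{cases} \qquad L(i, j) = 0$$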
![Page 52: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/52.jpg)
SCFG Decoding
• Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
  – CYK algorithm (Viterbi)
  – The Nussinov and Zuker algorithms are essentially special cases of CYK
  – CYK and SCFG are also used in other domains (NLP, compilers, etc.)
![Page 53: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/53.jpg)
SCFG Evaluation
• Given a sequence and an SCFG model
  – Estimate P(seq is generated by model), summing over all possible paths
• Inside-outside algorithm
  – Analogous to forward-backward
  – Inside: bottom-up parsing ($P(x_i..x_j)$)
  – Outside: top-down parsing ($P(x_1..x_{i-1}\; x_{j+1}..x_N)$)
• Can calculate base-pairing probability
  – Analogous to posterior decoding
  – Essentially the same idea implemented in the Vienna RNAfold package
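The "summing over all possible paths" idea can be illustrated with a counting analogue of the inside recursion. The sketch below is an assumption-laden toy, not the inside algorithm for a trained SCFG and not the RNAfold partition function. It gives every structure weight 1, so the bottom-up sum simply counts the nested structures a sequence admits. The decomposition (the last base is either unpaired or paired with some earlier base k) is chosen so that each structure is counted exactly once. The name `count_structures` is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def count_structures(seq, min_loop=1):
    """Count nested secondary structures of seq, including the unpaired one."""
    n = len(seq)
    # C[i][j] = number of structures on seq[i..j]; short spans have exactly one.
    C = [[1] * n for _ in range(n)]

    def get(i, j):
        # An empty interval (i > j) has exactly one structure: the empty one.
        return C[i][j] if i <= j else 1

    for span in range(min_loop + 1, n):           # span = j - i, bottom-up
        for i in range(n - span):
            j = i + span
            total = get(i, j - 1)                 # x_j is unpaired
            for k in range(i, j - min_loop):      # x_k pairs with x_j
                if (seq[k], seq[j]) in PAIRS:
                    total += get(i, k - 1) * get(k + 1, j - 1)
            C[i][j] = total
    return C[0][n - 1] if n else 1
```

For instance, `count_structures("GAC")` is 2 (unpaired, or G paired with C), and `count_structures("GGCC")` is 4. Replacing the unit weights with rule probabilities or Boltzmann factors would turn the same sum-of-products pattern into a probability or partition function.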
![Page 54: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/54.jpg)
SCFG Learning
• Covariance model: similar to profile HMMs
  – Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
  – EM on SCFG
  – Inside-outside algorithm
  – Efficiency is a bottleneck
• Has been successfully applied to predict tRNA genes and structures
  – tRNAScan
![Page 55: CS5263 Bioinformatics](https://reader035.vdocuments.mx/reader035/viewer/2022062408/5681336a550346895d9a80b2/html5/thumbnails/55.jpg)
Future directions
• Structure prediction
  – Secondary
  – Tertiary
• Structural comparison tools
  – Structural alignment
• Structure search tools
  – “RNA-BLAST”
• Structural motif finding
  – “RNA-MEME”