outline rna search and whirlwind tour of ncrna...

Post on 29-May-2020

Page 1: Outline RNA Search and Whirlwind tour of ncRNA …courses.cs.washington.edu/courses/csep590a/06su/slides/...2 “RNA sequence analysis using covariance models” Eddy & Durbin Nucleic

RNA Search and Motif Discovery

Lecture 9, CSEP 590A

Summer 2006

Outline

• Whirlwind tour of ncRNA search & discovery
• Covariance Model Review
• Algorithms for Training
    “Mutual Information”
• Algorithms for searching
• Rigorous & heuristic filtering
• Motif discovery
• Wrap up
• Course Evals

1 gagcccggcc cgggggacgg gcggcgggat agcgggaccc cggcgcggcg gtgcgcttca
61 gggcgcagcg gcggccgcag accgagcccc gggcgcggca agaggcggcg ggagccggtg
121 gcggctcggc atcatgcgtc gagggcgtct gctggagatc gccctgggat ttaccgtgct
181 tttagcgtcc tacacgagcc atggggcgga cgccaatttg gaggctggga acgtgaagga
241 aaccagagcc agtcgggcca agagaagagg cggtggagga cacgacgcgc ttaaaggacc
301 caatgtctgt ggatcacgtt ataatgctta ctgttgccct ggatggaaaa ccttacctgg
361 cggaaatcag tgtattgtcc ccatttgccg gcattcctgt ggggatggat tttgttcgag
421 gccaaatatg tgcacttgcc catctggtca gatagctcct tcctgtggct ccagatccat
481 acaacactgc aatattcgct gtatgaatgg aggtagctgc agtgacgatc actgtctatg
541 ccagaaagga tacataggga ctcactgtgg acaacctgtt tgtgaaagtg gctgtctcaa
601 tggaggaagg tgtgtggccc caaatcgatg tgcatgcact tacggattta ctggacccca
661 gtgtgaaaga gattacagga caggcccatg ttttactgtg atcagcaacc agatgtgcca
721 gggacaactc agcgggattg tctgcacaaa acagctctgc tgtgccacag tcggccgagc
781 ctggggccac ccctgtgaga tgtgtcctgc ccagcctcac ccctgccgcc gtggcttcat
841 tccaaatatc cgcacgggag cttgtcaaga tgtggatgaa tgccaggcca tccccgggct
901 ctgtcaggga ggaaattgca ttaatactgt tgggtctttt gagtgcaaat gccctgctgg
961 acacaaactt aatgaagtgt cacaaaaatg tgaagatatt gatgaatgca gcaccattcc
1021 ...

The Human Parts List, circa 2001

3 billion nucleotides, containing:
• 25,000 protein-coding genes (only ~1% of the DNA)
• Messenger RNAs made from each
• Plus a double-handful of other RNA genes

Breakthrough of the Year

Noncoding RNAs
Dramatic discoveries in last 5 years
• 100s of new families
• Many roles: Regulation, transport, stability, catalysis, …

1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA

“RNA sequence analysis using covariance models”

Eddy & Durbin
Nucleic Acids Research, 1994, vol 22 #11, 2079-2088
(see also Ch 10 of Durbin et al.)

What

A probabilistic model for RNA families
• The “Covariance Model”
• ≈ A Stochastic Context-Free Grammar
• A generalization of a profile HMM

Algorithms for Training
• From aligned or unaligned sequences
• Automates “comparative analysis”
• Complements Nussinov/Zuker RNA folding

Algorithms for searching

Main Results

Very accurate search for tRNA
• (Precursor to tRNAscan-SE, current favorite)

Given sufficient data, model construction comparable to, but not quite as good as, human experts

Some quantitative info on importance of pseudoknots and other tertiary features

Probabilistic Model Search

As with HMMs, given a sequence, you calculate the likelihood ratio that the model could generate the sequence, vs a background model
You set a score threshold
Anything above threshold → a “hit”
Scoring:
• “Forward” / “Inside” algorithm: sum over all paths
• Viterbi approximation: find single best path (Bonus: alignment & structure prediction)

Example: searching for tRNAs

Alignment Quality

Comparison to TRNASCAN

Fichant & Burks (best heuristic then)
• 97.5% true positives
• 0.37 false positives per MB

CM A1415 (trained on trusted alignment)
• > 99.98% true positives
• < 0.2 false positives per MB

Current method-of-choice is “tRNAscan-SE”, a CM-based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons.

(Slightly different evaluation criteria)

Profile HMM Structure

Mj: Match states (20 emission probabilities)
Ij: Insert states (Background emission probabilities)
Dj: Delete states (silent - no emission)

CM Structure

A: Sequence + structure
B: the CM “guide tree”
C: probabilities of letters/pairs & of indels

Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)

Overall CM Architecture

One box (“node”) per node of guide tree
BEG/MATL/INS/DEL just like an HMM
MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base-pairs; BIF allows multiple helices

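CM Viterbi Alignment

\[
\begin{aligned}
x_i &= i\text{th letter of input} \\
x_{ij} &= \text{substring } i, \ldots, j \text{ of input} \\
T_{yz} &= P(\text{transition } y \to z) \\
E^{y}_{x_i, x_j} &= P(\text{emission of } x_i, x_j \text{ from state } y) \\
S^{y}_{ij} &= \max_{\pi} \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)
\end{aligned}
\]

\[
S^{y}_{ij} =
\begin{cases}
\max_z \left[ S^{z}_{i+1,j-1} + \log T_{yz} + \log E^{y}_{x_i, x_j} \right] & \text{match pair} \\
\max_z \left[ S^{z}_{i+1,j} + \log T_{yz} + \log E^{y}_{x_i} \right] & \text{match/insert left} \\
\max_z \left[ S^{z}_{i,j-1} + \log T_{yz} + \log E^{y}_{x_j} \right] & \text{match/insert right} \\
\max_z \left[ S^{z}_{i,j} + \log T_{yz} \right] & \text{delete} \\
\max_{i < k \le j} \left[ S^{y_{\text{left}}}_{i,k} + S^{y_{\text{right}}}_{k+1,j} \right] & \text{bifurcation}
\end{cases}
\]

Time O(qn^3), q states, seq len n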

Model Training

Mutual Information

\[
M_{ij} = \sum_{x_i, x_j} f_{x_i, x_j} \log_2 \frac{f_{x_i, x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2
\]

Max when no seq conservation but perfect pairing

MI = expected score gain from using a pair state

Finding optimal MI (i.e. opt pairing of cols) is hard(?)

Finding optimal MI without pseudoknots can be done by dynamic programming

Alignment (16 sequences, columns 1-9):

col:  1 2 3 4 5 6 7 8 9
      A G A U A A U C U
      A G A U C A U C U
      A G A C G U U C U
      A G A U U U U C U
      A G C C A G G C U
      A G C G C G G C U
      A G C U G C G C U
      A G C A U C G C U
      A G G U A G C C U
      A G G G C G C C U
      A G G U G U C C U
      A G G C U U C C U
      A G U A A A A C U
      A G U C C A A C U
      A G U U G C A C U
      A G U U U C A C U

Single-column letter counts:

col:   1   2   3   4   5   6   7   8   9
A     16   0   4   2   4   4   4   0   0
C      0   0   4   4   4   4   4  16   0
G      0  16   4   2   4   4   4   0   0
U      0   0   4   8   4   4   4   0  16

MI in bits for each column pair (row i, column j):

i\j    2     3     4     5     6     7     8     9
1      0     0     0     0     0     0     0     0
2            0     0     0     0     0     0     0
3                  0.30  0     1     2     0     0
4                        0.42  0.55  0.30  0     0
5                              1     0     0     0
6                                    1     0     0
7                                          0     0
8                                                0

[Spreadsheet detail not reproduced: dinucleotide counts f(xi,xj) and per-term log contributions for each column pair; both follow from the alignment.]

M.I. Example (Artificial)

Cols 1 & 9, 2 & 8: perfect conservation & might be base-paired, but unclear whether they are. M.I. = 0

Cols 3 & 7: No conservation, but always W-C pairs, so seems likely they do base-pair. M.I. = 2 bits.

Cols 7->6: unconserved, but each letter in 7 has only 2 possible mates in 6. M.I. = 1 bit.
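
The three observations above can be checked numerically. The following is a minimal Python sketch, assuming the 16 rows of the artificial alignment as transcribed from the example table; the function name `mutual_information` and the 1-based column indexing are illustrative choices, not something the lecture specifies.

```python
from collections import Counter
from math import log2

# The 16-sequence, 9-column artificial alignment from the M.I. example.
ALIGNMENT = [
    "AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
    "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
    "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
    "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU",
]

def mutual_information(seqs, i, j):
    """MI in bits between 1-based alignment columns i and j."""
    n = len(seqs)
    freq_i = Counter(s[i - 1] for s in seqs)              # single-column counts
    freq_j = Counter(s[j - 1] for s in seqs)
    freq_ij = Counter((s[i - 1], s[j - 1]) for s in seqs)  # pair counts
    mi = 0.0
    for (a, b), count in freq_ij.items():
        f_ab = count / n
        # Unobserved pairs contribute nothing, so only observed pairs are summed.
        mi += f_ab * log2(f_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

for i, j in [(1, 9), (3, 7), (6, 7)]:
    print(f"M[{i},{j}] = {mutual_information(ALIGNMENT, i, j):.2f} bits")
```

Frequencies here are raw counts with no pseudocounts, which is enough for this example; a real training procedure would likely need some smoothing for small alignments.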

MI-Based Structure-Learning

Find best (max total MI) subset of column pairs among i…j, subject to absence of pseudoknots

“Just like Nussinov/Zuker folding”

BUT, need enough data: enough sequences at right phylogenetic distance

Pseudoknots disallowed:

\[
S_{i,j} = \max
\begin{cases}
S_{i,j-1} \\
\max_{i \le k < j-4} \left[ S_{i,k-1} + M_{k,j} + S_{k+1,j-1} \right]
\end{cases}
\]
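
The pseudoknot-free recurrence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's implementation: `M` is taken to be an upper-triangular list-of-lists of MI values with 0-based indices, the function name `max_nested_mi` is hypothetical, and the `min_sep` parameter generalizes the constraint k < j - 4 so that short toy examples can be tried.

```python
def max_nested_mi(M, min_sep=4):
    """Max total MI over non-crossing column pairings.

    M[k][j] (k < j, 0-based) is the MI of columns k and j.
    Column j may pair with k only if k < j - min_sep.
    """
    n = len(M)
    S = [[0.0] * n for _ in range(n)]

    def s(i, j):
        # Empty or out-of-range intervals contribute nothing.
        return S[i][j] if 0 <= i <= j < n else 0.0

    for span in range(1, n):                 # fill by increasing interval length
        for i in range(n - span):
            j = i + span
            best = s(i, j - 1)               # case 1: column j left unpaired
            for k in range(i, j - min_sep):  # case 2: column j pairs with k
                best = max(best, s(i, k - 1) + M[k][j] + s(k + 1, j - 1))
            S[i][j] = best
    return S[0][n - 1] if n else 0.0
```

The triple loop gives O(n^3) time for n columns, the same shape as Nussinov-style folding. Recovering the chosen pairs would need a traceback, which is omitted here.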

Pseudoknots allowed:

\[
\left( \sum_{i=1}^{n} \max_j M_{i,j} \right) / 2
\]

Rfam – an RNA family DB
Griffiths-Jones, et al., NAR ’03, ’05

Biggest scientific computing user in Europe: 1000 cpu cluster for a month per release

Rapidly growing:
• Rel 1.0, 1/03: 25 families, 55k instances
• Rel 7.0, 3/05: 503 families, >300k instances

IRE (partial seed alignment):

Hom.sap. GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap. UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
Hom.sap. UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
Hom.sap. UUUAUC..AGUGACAGAGUUCACU.AUAAA
Hom.sap. UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap. AUUAUC..GGGAACAGUGUUUCCC.AUAAU
Hom.sap. UCUUGC..UUCAACAGUGUUUGGACGGAAG
Hom.sap. UGUAUC..GGAGACAGUGAUCUCC.AUAUG
Hom.sap. AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
Cav.por. UCUCCUGCUUCAACAGUGCUUGGACGGAGC
Mus.mus. UAUAUC..GGAGACAGUGAUCUCC.AUAUG
Mus.mus. UUUCCUGCUUCAACAGUGCUUGAACGGAAC
Mus.mus. GUACUUGCUUCAACAGUGUUUGAACGGAAC
Rat.nor. UAUAUC..GGAGACAGUGACCUCC.AUAUG
Rat.nor. UAUCUUGCUUCAACAGUGUUUGGACGGAAC
SS_cons  <<<<<...<<<<<......>>>>>.>>>>>

Rfam

Input (hand-curated):
• MSA “seed alignment”
• SS_cons
• Score Thresh T
• Window Len W

Output:
• CMscan results & “full alignment”
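
The SS_cons line encodes the consensus base pairs as nested brackets. A minimal Python sketch of how such a line could be turned into column pairs follows; the function name `parse_ss_cons` is hypothetical, and only the three characters that appear in the IRE example ('<', '>', '.') are handled.

```python
def parse_ss_cons(ss):
    """Return sorted 0-based (left, right) column pairs from a '<', '>', '.' line."""
    stack, pairs = [], []
    for pos, ch in enumerate(ss):
        if ch == "<":
            stack.append(pos)                  # open a pair, wait for its mate
        elif ch == ">":
            if not stack:
                raise ValueError(f"unmatched '>' at column {pos}")
            pairs.append((stack.pop(), pos))   # innermost open bracket is the mate
    if stack:
        raise ValueError(f"unmatched '<' at column {stack[-1]}")
    return sorted(pairs)

IRE_SS_CONS = "<<<<<...<<<<<......>>>>>.>>>>>"
for left, right in parse_ss_cons(IRE_SS_CONS):
    print(left, right)
```

Because a stack can only match nested brackets, this representation cannot express pseudoknots, which is consistent with the pseudoknot-free structures a CM models.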

Faster Genome Annotation of Non-coding RNAs Without Loss of Accuracy

Zasha Weinberg & W.L. Ruzzo
Recomb ’04, ISMB ’04, Bioinfo ’06

Covariance Model

Key difference of CM vs HMM: Pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here.

CM’s are good, but slow

[Three pipeline diagrams; recoverable labels:]
• Rfam Goal: EMBL, CM, hits, junk. 10 years, 1000 computers
• Rfam Reality: EMBL, BLAST, CM, hits, junk. 1 month, 1000 computers
• Our Work: EMBL, Ravenna, CM, hits. ~2 months, 1000 computers

Oversimplified CM (for pedagogical purposes only)

[Diagram: CM pair states, each emitting a left and a right symbol from {A, C, G, U, –}]

CM to HMM

CM: 25 emissions per state
HMM: 5 emissions per state, 2x states

Need: log Viterbi scores CM ≤ HMM

Key Issue: 25 scores → 10

[Diagram: CM pair state P next to HMM states L and R, each emitting one symbol from {A, C, G, U, –}]

Viterbi/Forward Scoring

Path π defines transitions/emissions
Score(π) = product of “probabilities” on π
NB: ok if “probs” aren’t, e.g. ∑ ≠ 1 (e.g. in CM, emissions are odds ratios vs 0th-order background)

For any nucleotide sequence x:
Viterbi-score(x) = max{ score(π) | π emits x }
Forward-score(x) = ∑{ score(π) | π emits x }
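
The two definitions differ only in replacing a max with a sum, which a small example makes concrete. The sketch below is in Python and uses a made-up two-state HMM; the state names, all probabilities, and the function name `viterbi_and_forward` are assumptions for illustration and do not come from the lecture.

```python
# Hypothetical two-state toy model; every number here is an illustrative assumption.
START = {"GC_rich": 0.5, "AU_rich": 0.5}
TRANS = {
    "GC_rich": {"GC_rich": 0.9, "AU_rich": 0.1},
    "AU_rich": {"GC_rich": 0.1, "AU_rich": 0.9},
}
EMIT = {
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "U": 0.1},
    "AU_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "U": 0.4},
}

def viterbi_and_forward(x, start, trans, emit):
    """Return (Viterbi-score, Forward-score) for a non-empty sequence x."""
    states = list(start)
    # Score of the one-symbol paths ending in each state.
    vit = {s: start[s] * emit[s][x[0]] for s in states}
    fwd = dict(vit)
    for sym in x[1:]:
        # Same recurrence: Viterbi keeps the best predecessor, Forward sums them all.
        vit = {s: emit[s][sym] * max(vit[r] * trans[r][s] for r in states)
               for s in states}
        fwd = {s: emit[s][sym] * sum(fwd[r] * trans[r][s] for r in states)
               for s in states}
    return max(vit.values()), sum(fwd.values())

v, f = viterbi_and_forward("GGCAU", START, TRANS, EMIT)
print(f"Viterbi = {v:.3e}, Forward = {f:.3e}")
```

With non-negative path scores the Viterbi score can never exceed the Forward score, since the best path is one term of the sum. The sketch works in probability space for readability; for long sequences one would normally switch to log space to avoid underflow.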

Key Issue: 25 scores → 10

Need: log Viterbi scores CM ≤ HMM

P_AA ≤ L_A + R_A        P_CA ≤ L_C + R_A
P_AC ≤ L_A + R_C        P_CC ≤ L_C + R_C
P_AG ≤ L_A + R_G        P_CG ≤ L_C + R_G
P_AU ≤ L_A + R_U        P_CU ≤ L_C + R_U
P_A– ≤ L_A + R_–        P_C– ≤ L_C + R_–
…

NB: HMM not a prob. model

[Diagram: CM pair state P next to HMM states L and R, each emitting one symbol from {A, C, G, U, –}]

Rigorous Filtering

Any scores satisfying the linear inequalities give rigorous filtering

Proof: CM Viterbi path score ≤ “corresponding” HMM path score ≤ Viterbi HMM path score (even if it does not correspond to any CM path)

P_AA ≤ L_A + R_A
P_AC ≤ L_A + R_C
P_AG ≤ L_A + R_G
P_AU ≤ L_A + R_U
P_A– ≤ L_A + R_–
…

Some scores filter better

P_UA = 1 ≤ L_U + R_A
P_UG = 4 ≤ L_U + R_G

Assuming ACGU ≈ 25%:
• Option 1: L_U = R_A = R_G = 2, so L_U + (R_A + R_G)/2 = 4
• Option 2: L_U = 0, R_A = 1, R_G = 4, so L_U + (R_A + R_G)/2 = 2.5
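
The comparison between the two options can be reproduced mechanically. The Python sketch below checks the linear inequalities and computes the average bound for the two-pair example; the helper names `is_rigorous` and `average_bound` are hypothetical, and equal weights for the listed pairs stand in for the slide's "ACGU ≈ 25%" assumption.

```python
# Pair-state scores from the example: P_UA = 1, P_UG = 4.
P = {("U", "A"): 1, ("U", "G"): 4}

def is_rigorous(P, L, R):
    """True if every CM pair score is bounded: P[a, b] <= L[a] + R[b]."""
    return all(score <= L[a] + R[b] for (a, b), score in P.items())

def average_bound(P, L, R):
    """Mean of L[a] + R[b] over the listed pairs (uniform weights assumed)."""
    return sum(L[a] + R[b] for (a, b) in P) / len(P)

OPTION_1 = ({"U": 2}, {"A": 2, "G": 2})
OPTION_2 = ({"U": 0}, {"A": 1, "G": 4})

for name, (L, R) in [("Option 1", OPTION_1), ("Option 2", OPTION_2)]:
    print(name, is_rigorous(P, L, R), average_bound(P, L, R))
```

Both options satisfy the inequalities, so both are rigorous; Option 2 simply gives a lower average HMM score on background sequence, which is why it filters better.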

Optimizing filtering

For any nucleotide sequence x:
Viterbi-score(x) = max{ score(π) | π emits x }
Forward-score(x) = ∑{ score(π) | π emits x }

Expected Forward Score
E(Li, Ri) = ∑ over all sequences x of Forward-score(x) · Pr(x), with Pr(x) under 0th-order background model
NB: E is a function of Li, Ri only

Optimization:
Minimize E(Li, Ri) subject to score Lin.Ineq.s
• This is heuristic (“forward↓ ⇒ Viterbi↓ ⇒ filter↓”)
• But still rigorous because “subject to score Lin.Ineq.s”

Calculating E(Li, Ri)

E(Li, Ri) = ∑x Forward-score(x) · Pr(x)

Forward-like: for every state, calculate expected score for all paths ending there, easily calculated from expected scores of predecessors & transition/emission probabilities/scores

Minimizing E(Li, Ri)

Calculate E(Li, Ri) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm

\[
\frac{\partial E(L_1, L_2, \ldots)}{\partial L_i}
\]

Estimated Filtering Efficiency (139 Rfam 4.0 families)

Filtering fraction   # families (compact)   # families (expanded)
< 10^-4              105                    110
10^-4 - 10^-2        8                      17
.01 - .10            11                     3
.10 - .25            2                      2
.25 - .99            6                      4
.99 - 1.0            7                      3

~100x speedup

Results: New ncRNA’s?

Name                    # found BLAST + CM   # found rigorous filter + CM   # new
Pyrococcus snoRNA       57                   180                            123
Iron response element   201                  322                            121
Histone 3’ element      1004                 1106                           102
Retron msr              11                   59                             48
Hammerhead I            167                  193                            26
Hammerhead III          251                  264                            13
U6 snRNA                1462                 1464                           2
U7 snRNA                312                  313                            1
Purine riboswitch       69                   123                            54
S-box                   128                  131                            3
U5 snRNA                199                  200                            1
U4 snRNA                283                  290                            7

Results: With additional work

Name                  # with BLAST + CM   # with rigorous filter series + CM   # new
Rfam tRNA             58609               63767                                5158
Group II intron       5708                6039                                 331
tRNAscan-SE (human)   608                 729                                  121
tmRNA                 226                 247                                  21
Lysine riboswitch     60                  71                                   11

And more…

“Additional work”

Profile HMM filters use no 2ary structure info
• They work well because, tho structure can be critical to function, there is (usually) enough primary sequence conservation to exclude most of DB
• But not on all families (and may get worse?)

Can we exploit some structure (quickly)?
• Idea 1: “sub-CM” (for some hairpins)
• Idea 2: extra HMM states remember mate (for some hairpins)
• Idea 3: try lots of combinations of “some hairpins”
• Idea 4: chain together several filters (select via Dijkstra)

Filter Chains

Fig. 2. Filter creation and selection. Filters for Rfam tRNA (RF00005) generated by the store-pair and sub-CM techniques and those selected for actual filtering are plotted by filtering fraction and run time. The CM runs at 3.5 secs/kbase. The four selected filters are run one after another, from highest to lowest fraction.

Heuristic Filters

• Rigorous filters optimized for worst case
• Possible to trade improved speed for small loss in sensitivity?
• Yes: profile HMMs as before, but optimized for average case
• “ML heuristic”: train HMM from the infinite alignment generated by the CM
• Often 10x faster, modest loss in sensitivity

Heuristic Filters

[Plots for three families: cobalamin (B12) riboswitch, tRNA, SECIS]
* rigorous HMM, not rigorous threshold

CMfinder: A Covariance Model Based RNA Motif Finding Algorithm
Bioinformatics, 2006, 22(4): 445-452

Zizhen Yao, Zasha Weinberg, Walter L. Ruzzo
University of Washington, Seattle

Searching for noncoding RNAs

CM’s are great, but where do they come from?
An approach: comparative genomics
• Search for motifs with common secondary structure in a set of functionally related sequences.

Challenges
• Three related tasks
    Locate the motif regions.
    Align the motif instances.
    Predict the consensus secondary structure.
• Motif search space is huge!
    Motif location space, alignment space, structure space.

Approaches

• Align sequences, then look for common structure
• Predict structures, then try to align them
• Do both together

Pitfall for sequence-alignment-first approach

Structural conservation ≠ Sequence conservation
Alignment without structure information is unreliable

[Figure: CLUSTALW alignment of SECIS elements with flanking regions; same-colored boxes should be aligned]

Approaches

• Align sequences, then look for common structure
• Predict structures, then try to align them
    single-seq struct prediction only ~60% accurate; exacerbated by flanking seq; no biologically-validated model for structural alignment
• Do both together
    Sankoff: good but slow
    Heuristic

Design Goals

• Find RNA motifs in unaligned sequences
• Seq conservation exploited, but not required
• Robust to inclusion of unrelated sequences
• Robust to inclusion of flanking sequence
• Reasonably fast and scalable
• Produce a probabilistic model of the motif that can be directly used for homolog search

CMfinder Outline

[Flow diagram; labels: Search, Folding predictions, Heuristics, Candidate alignment, CM, Realign, E step, M step]

M-step uses M.I. + folding energy for structure prediction

CMfinder Accuracy (on Rfam families with flanking sequence)

[Plot; recoverable labels: “/CW”]

A pipeline for RNA motif genome scans

[Flow diagram; labels: Bacillus subtilis genes, BLAST/CDD, Orthologous genes, Upstream sequences, Footprinter, Rank datasets, Top datasets, CMfinder, Motifs, Search, Genome database, Homologs]

Footprinter finds patterns of conservation

[Figure labels: 1B_SUBTILIS; Upstream of folC]

A blind test

tyrS T box structure

• 1st genome scan: 234 sequences
• 2nd genome scan: 447 sequences
• The motif turned out to be T box
• Match to Rfam T box family: 299 of 342
• False positives: 89/148 are probable (upstream of annotated tRNA-synthetase genes)

[Figure labels: Chloroflexus aurantiacus (Chloroflexi); Geobacter metallireducens, Geobacter sulfurreducens (δ-Proteobacteria); Symbiobacterium thermophilum]

CMfinder: 9 instances

Found by Scan: 447 hits

Some Preliminary Actino Results

8 of 10 Rfam families found

Rfam Family   Type (metabolite)        Rank        Accession   Function/metabolite
THI           riboswitch (thiamine)    4           RF00059     thiamin (pyrophosphate?) aka B1
ydaO-yuaA     riboswitch (unknown)     19          RF00379     osmotic shock; triggers AA transporters
Cobalamin     riboswitch (cobalamin)   21          RF00174     adenosylcobalamin (aka B12?)
SRP_bact      gene                     28          RF00169     signal recognition particle
RFN           riboswitch (FMN)         39          RF00050     flavin mononucleotide (FMN)
yybP-ykoY     riboswitch (unknown)     48          RF00080     unknown (diverse genes); called SraF in E. coli
gcvT          riboswitch (glycine)     53          RF00504     glycine
S_box         riboswitch (SAM)         401         RF00162     SAM: S-adenosyl methionine
tmRNA         gene                     Not found   RF00023     aka 10Sa RNA or SsrA; frees mRNAs from stalled ribosomes
RNaseP        gene                     Not found   RF00010     tRNA maturation; is a ribozyme in bacteria

Annotation on the gene rows: not cis-regulatory (got one anyway)

Gene Regulation: The MET Repressor

[Figure from Alberts, et al, 3e.; labels: SAM, DNA, Protein]

[Figure from Alberts, et al, 3e.]

Corbino et al., Genome Biol. 2005

The protein way

Riboswitch alternative

More Prelim Actino Results

Many others (not in Rfam) are likely real
Of top 50:
• known (Rfam, 23S): 10
• probable (Tbox, CIRCE, LexA, parP, pyrR): 7
• probable (ribosomal genes): 9
• potentially interesting: 12
• unknown or poor: 12

One bench-verified, 2 more in progress

Preliminary results of genome scan

Top 115 datasets (some are redundant)
• 13 T box, 22 riboswitches, 30 ribosomal genes
• RNase P, tRNA, CIRCE elements and other DNA binding sites

Gene   # motif   # hits   Rfam        # seed   # full   # TP   specificity   sensitivity
metK   13        150      S_box       71       151      145    0.967         0.960
ribB   9         106      RFN         48       114      97     0.915         0.851
folC   9         447      T_box       67       342      299    0.669         0.874
xpt    14        106      Purine      37       100      97     0.915         0.970
glmS   16        33       glmS        14       37       33     1.000         0.892
thiA   16        305      THI         237      366      305    1.000         0.833
ykoY   10        34       yybP-ykoY   74       127      33     0.971         0.260
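
The two rate columns follow from the count columns, which is a useful consistency check on the table. In the Python sketch below, the formulas are inferred from the numbers rather than stated in the lecture: "sensitivity" matches #TP / #full and "specificity" matches #TP / #hits (what is often called precision). The row tuples and the function name `rates` are illustrative.

```python
# (gene, hits, full, true positives, specificity, sensitivity) as in the table.
ROWS = [
    ("metK", 150, 151, 145, 0.967, 0.960),
    ("ribB", 106, 114, 97, 0.915, 0.851),
    ("folC", 447, 342, 299, 0.669, 0.874),
    ("xpt", 106, 100, 97, 0.915, 0.970),
    ("glmS", 33, 37, 33, 1.000, 0.892),
    ("thiA", 305, 366, 305, 1.000, 0.833),
    ("ykoY", 34, 127, 33, 0.971, 0.260),
]

def rates(hits, full, tp):
    """Return (specificity, sensitivity) under the inferred definitions."""
    specificity = tp / hits   # fraction of reported hits that are in the Rfam full set
    sensitivity = tp / full   # fraction of the Rfam full set that was recovered
    return specificity, sensitivity

for gene, hits, full, tp, spec, sens in ROWS:
    s, r = rates(hits, full, tp)
    print(f"{gene}: specificity {s:.3f} (table {spec}), sensitivity {r:.3f} (table {sens})")
```

For folC the counts agree with the blind-test figures: 299 of 342 Rfam T box members recovered among 447 hits.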

Summary

• ncRNA: apparently widespread, much interest
• Covariance Models: powerful but expensive tool for ncRNA motif representation, search, discovery
• Rigorous/Heuristic filtering: typically 100x speedup in search with no/little loss in accuracy
• CMfinder: CM-based motif discovery in unaligned sequences

Course Wrap Up

• What is DNA? RNA?
• How many Amino Acids are there?
• Did human beings, as we know them, develop from earlier species of animals?
• What are stem cells?
• What did Viterbi invent?
• What is dynamic programming?
• What is a likelihood ratio test?
• What is the EM algorithm?
• How would you find the maximum of f(x) = ax^3 + bx^2 + cx + d in the interval -10 < x < 25?

“High-Throughput BioTech”

Sensors
• DNA sequencing
• Microarrays/Gene expression
• Mass Spectrometry/Proteomics
• Protein/protein & DNA/protein interaction

Controls
• Cloning
• Gene knock out/knock in
• RNAi

Floods of data

“Grand Challenge” problems

CS Points of Contact

Scientific visualization
• Gene expression patterns

Databases
• Integration of disparate, overlapping data sources
• Distributed genome annotation in face of shifting underlying coordinates

AI/NLP/Text Mining
• Information extraction from journal texts with inconsistent nomenclature, indirect interactions, incomplete/inaccurate models, …

Machine learning
• System level synthesis of cell behavior from low-level heterogeneous data (DNA sequence, gene expression, protein interaction, mass spec,

Algorithms…

Frontiers & Opportunities

New data:
• Proteomics, SNP, array CGH, comparative sequence information, methylation, chromatin structure, ncRNA, interactome

New methods:
• graphical models? rigorous filtering?

Data integration
• many, complex, noisy sources

Frontiers & Opportunities

Open Problems:
• splicing, alternative splicing
• multiple sequence alignment (genome scale, w/ RNA etc.)
• protein & RNA structure
• interaction modeling
• network models
• RNA trafficking
• ncRNA discovery
• …

Exciting Times

Lots to do
Various skills needed

I hope I’ve given you a taste of it

Thanks!