Finding, Aligning and Analyzing Non Coding RNAs
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program
They are Everywhere…
And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions”
Who Are They?
– tRNA, rRNA, snoRNAs
– microRNAs, siRNAs
– piRNAs
– long ncRNAs (Xist, Evf, Air, CTN, PINK…)
How Many of Them?
– Open question
– 30,000 is a common guess
– Harder to detect than proteins
Searching
“…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”
ncRNAs Can Have Different Sequences and Similar Structures
ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
[Figure: secondary-structure drawings of the two sequences]
ncRNAs are Difficult to Align
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
Regular Alignment:
--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
   * *   *** * *   ***   *
Same Structure, Low Sequence Identity
Small Alphabet, Short Sequences: Alignments often Non-Significant
Obtaining the Structure of a ncRNA is difficult
Hard to Align The Sequences Without the Structure
Hard to Predict the Structures Without an Alignment
The Holy Grail of RNA Comparison: Sankoff’s Algorithm
Simultaneous Folding and Alignment
– Time Complexity: O(L^(2n))
– Space Complexity: O(L^(3n))
In Practice, for Two Sequences (time, memory):
– 50 nucleotides: 1 min., 6 M
– 100 nucleotides: 16 min., 256 M
– 200 nucleotides: 4 hours, 4 G
– 400 nucleotides: 3 days, 3 T
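The timings above can be checked with simple arithmetic. The sketch below assumes the running time grows as L^(2n) and takes the 50-nucleotide, 1-minute figure as its reference point; the function name and its parameters are illustrative, not part of any tool.

```python
def projected_minutes(length, n_seqs=2, ref_length=50, ref_minutes=1.0):
    """Extrapolate running time, assuming it grows as length ** (2 * n_seqs)."""
    return ref_minutes * (length / ref_length) ** (2 * n_seqs)


for length in (50, 100, 200, 400):
    minutes = projected_minutes(length)
    print(f"{length:>4} nt: {minutes:7.0f} min = {minutes / 60:5.1f} h = {minutes / 1440:4.2f} days")
```

Under this assumption the projection gives 16 minutes, about 4.3 hours and about 2.8 days for 100, 200 and 400 nucleotides, in line with the times quoted above.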
Forget about
– Multiple sequence alignments
– Database searches
The next best Thing: Consan
Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
– Tree-shaped HMMs
– Made sparse with constraints
The constraints are derived from the most confident positions of the alignment
Equivalent of Banded DP
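Banded dynamic programming is easiest to see on plain pairwise sequence comparison. The sketch below is not Consan: it is an ordinary edit-distance computation restricted to cells within `band` of the main diagonal, shown only to illustrate how constraints prune the DP matrix. The function name and the `band` parameter are illustrative.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b using only cells with |i - j| <= band.

    Returns infinity when the final cell lies outside the band.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return inf
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            best = inf
            if j == 0:
                best = i  # delete the first i characters of a
            if j - 1 in prev:  # match or substitution
                best = min(best, prev[j - 1] + (a[i - 1] != b[j - 1]))
            if j - 1 in cur:  # insertion
                best = min(best, cur[j - 1] + 1)
            if j in prev:  # deletion
                best = min(best, prev[j] + 1)
            cur[j] = best
        prev = cur
    return prev[m]


# The two example sequences from the slides.
seq1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
seq2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"
for band in (0, 2, 5, len(seq1)):
    print(band, banded_edit_distance(seq1, seq2, band))
```

A band of 0 only allows substitutions along the diagonal, so it returns the number of mismatching columns; widening the band can only lower the distance, at the cost of filling more cells.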
Consan for Databases: Infernal
Infernal is a Faster version of Consan
For Database Search
Still Very Slow
[Figure: Receiver operating characteristic (ROC) comparison of Infernal with BLAST]
BLAST: 360 s
Fast Infernal: 182 000 s
Slow Infernal: 5 320 000 s
Searching Databases for New RNAs
Rfam: In practice
Rfam contains RNA families
– Families → Multiple Sequence Alignment → Models
– Models are like Pfam Profiles
– Use Consan or Cmsearch rather than HMMer
– Much Slower
– Too expensive to search the models
– Models are used to build Rfam
– People usually BLAST Rfam
Where do Rfam Families Come From?
Infernal Requires a Model
A Model requires an MSA
The MSA requires a Family
It all starts with a BlastN
Rfam, Gardner et al. NAR 2008
Can we make BlastN more accurate?
BlastN is not very accurate because:
– Poor substitution models for Nucleic Acids
– Low information density (4 symbols)
BlastN assumes
– Equal evolution rates for all nucleotides
– Independence from Neighbors
Love Thy Neighbor
Measured Nearest Neighbor Dependencies on Rfam sequences
High Rate of CpG mutations
Measuring Di-Nucleotide Evolution
Each Nucleotide can be made more informative
It can incorporate the “name” of its Neighbor
– AA => a
– AG => b
– AC => c
– AT => d
– …
A 16 Letter alphabet can be used to recode all nucleotide sequences
We name these extended Nucleotides
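The recoding can be sketched in a few lines. The slides only give the first four codes (AA => a, AG => b, AC => c, AT => d). The sketch below assumes that the remaining twelve follow the same A, G, C, T order, that the neighbor is the next nucleotide, and that the last nucleotide (which has no neighbor) is dropped. All three are assumptions, not taken from the source, and the names `EXTENDED` and `recode` are illustrative.

```python
from itertools import product
from string import ascii_lowercase

ORDER = "AGCT"  # order implied by the four codes shown; extending it to 16 is an assumption

# Hypothetical table: AA->a, AG->b, AC->c, AT->d, GA->e, ..., TT->p
EXTENDED = {
    first + second: ascii_lowercase[k]
    for k, (first, second) in enumerate(product(ORDER, repeat=2))
}


def recode(seq):
    """Recode a nucleotide sequence into a 16-letter extended alphabet.

    Each nucleotide is replaced by the letter for (nucleotide, next nucleotide).
    U is treated as T; any other symbol (e.g. N) raises KeyError.
    """
    seq = seq.upper().replace("U", "T")
    return "".join(EXTENDED[seq[i:i + 2]] for i in range(len(seq) - 1))


print(recode("CCAGGCAAGACGGGACGAGAGTTGCCTGG"))
```

The output is one symbol shorter than the input and looks like a protein-style string, which is what lets protein tools process it.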
Blosum-R and eRNA
Substitutions?
How much does it cost to turn one nucleotide into another one?
Blosum/Pam style matrix
Matrices estimated on Rfam families
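A Blosum/Pam style score is a log-odds ratio: how often two symbols are seen aligned, relative to how often they would pair by chance. The sketch below is a generic illustration of that idea, not the procedure used to build Blosum-R; the toy pair counts are made up and the function name is illustrative.

```python
from collections import Counter
from math import log2


def log_odds_scores(aligned_pairs):
    """Return {(x, y): log2(q_xy / (p_x * p_y))} from aligned symbol pairs."""
    counts = Counter()
    for x, y in aligned_pairs:
        # Count both orientations so the resulting matrix is symmetric.
        counts[(x, y)] += 1
        counts[(y, x)] += 1
    total = sum(counts.values())
    q = {pair: c / total for pair, c in counts.items()}
    p = Counter()
    for (x, _), freq in q.items():
        p[x] += freq  # marginal frequency of each symbol
    return {(x, y): log2(freq / (p[x] * p[y])) for (x, y), freq in q.items()}


# Toy data (made up): symbols aligned column by column.
toy_pairs = [("A", "A")] * 3 + [("C", "C")] * 3 + [("A", "C")] * 2
scores = log_odds_scores(toy_pairs)
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

Pairs seen more often than chance get a positive score and rarer ones a negative score. Pairs never observed are simply absent here; a real matrix would need pseudocounts or a floor value for them.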
Using BlastR
When Nucleic Acids look like Proteins, They can be aligned with Protein Methods
– BlastN → BlastP
– BlastP with eRNA is BlastR
Validating Blast-R
Benchmarking BlastR
[Diagram: a Query is searched with Blast against Rfam; the hits (P, P, P, N) are ordered by E-value]
[Diagram: Rfam 001, Rfam 002, Rfam … are each searched with Blast against Rfam 001, Rfam 002, Rfam …, giving a ROC curve]
[Plot: ROC curve with axes True Positive and False Positives, annotated Good and Bad]
Area Under Curve
Small AUC Better
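The area can be computed directly from a ranked hit list. The sketch below assumes hits are sorted from best to worst E-value and flagged as true (same family as the query) or false. It also assumes the curve is drawn with false positives on the vertical axis and true positives on the horizontal one, which is one reading of the axis labels above and would explain why a small area is better. The function name is illustrative.

```python
def area_under_fp_vs_tp(ranked_is_true):
    """Normalized area under the false-positive vs. true-positive curve.

    ranked_is_true: booleans for hits sorted from best to worst score.
    0.0 means every true hit ranks above every false hit; 1.0 is the reverse.
    """
    n_true = sum(ranked_is_true)
    n_false = len(ranked_is_true) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false hit")
    false_seen = 0
    area = 0
    for is_true in ranked_is_true:
        if is_true:
            area += false_seen  # column height at this true-positive step
        else:
            false_seen += 1
    return area / (n_true * n_false)


print(area_under_fp_vs_tp([True, True, True, False]))
print(area_under_fp_vs_tp([True, False, True, False]))
```

With the conventional orientation (true-positive rate on the vertical axis) the area would be one minus this value, and larger would be better.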
BlastR vs The World
The 3 Components of BlastR
BlastP is better than BlastN
BlosumR makes BlastP a little bit better
And Faster
Blast: wuBlast
BlastR and Clustering
Given all Rfam in Bulk
How good is BlastR at reconstituting all the families?
[Plot: Sensitivity against 1-Specificity]
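One way to score such a clustering is over pairs of sequences: a pair linked by a hit counts as a true positive when both sequences belong to the same family and as a false positive otherwise. The sketch below assumes this pairwise definition, which the slides do not spell out; the function name, sequence ids and family labels are made up for illustration.

```python
from itertools import combinations


def pairwise_sensitivity_specificity(family_of, linked_pairs):
    """family_of: {sequence id: family}; linked_pairs: set of frozenset({id1, id2})."""
    tp = fp = fn = tn = 0
    for a, b in combinations(sorted(family_of), 2):
        same_family = family_of[a] == family_of[b]
        linked = frozenset((a, b)) in linked_pairs
        if same_family and linked:
            tp += 1
        elif same_family:
            fn += 1
        elif linked:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Toy example (made up): two families of two sequences, two hits above threshold.
family_of = {"s1": "fam_a", "s2": "fam_a", "s3": "fam_b", "s4": "fam_b"}
linked = {frozenset(("s1", "s2")), frozenset(("s2", "s3"))}
sens, spec = pairwise_sensitivity_specificity(family_of, linked)
print(f"sensitivity = {sens:.2f}, 1 - specificity = {1 - spec:.2f}")
```

Repeating the computation at several E-value thresholds gives one point per threshold on a Sensitivity against 1-Specificity plot.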
BlastR: In Practice
E-Value Threshold: 10^-20
[Figure panels: BlastN, BlastR]
Take Home
Searching Nucleotides is Difficult
BlastN is not a very good algorithm
Simple Adaptations can improve the situation
– Changing the algorithm (BlastP)
– Changing the Scoring Scheme (BlastP-Nuc)
– Changing the alphabet (BlastR)