17 ½ Weeks in Leipzig, Saxonia
Andreas Gruber
Institute for Theoretical Chemistry
University of Vienna
Idea
?
!
START: Leipzig, 1. 6. 2009
FINISH: Vienna, 1. 10. 2009
RNAz
!
Bacterial Self Killing System
Shigella sonnei Ss046                    8
Serratia proteamaculans 568              5
Escherichia coli IAI1                    1
Enterobacter sp. 638                     1
Salmonella enterica                      2
Edwardsiella ictaluri 93-146             1
Shigella boydii CDC 3083-94             10
Escherichia fergusonii ATCC 35469        4
Escherichia coli O157:H7 str. EC4115    16
Photobacterium profundum SS9             1
Proteus mirabilis HI4320                 1
Klebsiella pneumoniae 342                2
Escherichia coli str. K-12 substr        5
Photorhabdus luminescens                 3
Shigella flexneri 2a str. 2457T          8
Shigella flexneri 2a str. 301            6
Shigella boydii Sb227                    6
Klebsiella pneumoniae NTUH-K2044         2
Vibrio fischeri ES114                    1
Salmonella enterica                      1
Cronobacter sakazakii ATCC BAA-894       6
Vibrio vulnificus CMCP6                  1
Aliivibrio salmonicida LFI1238           3
Klebsiella pneumoniae                    2
Shigella dysenteriae Sd197               5
Shewanella baltica OS223                 2
Aeromonas salmonicida                    2
So far ...
functional ncRNAs
structured ncRNAs
ncRNA detection with RNAz
Why focus on structured RNAs?
Structured RNAs are the only class of functional RNAs that give at least some statistically relevant signals.
Thermodynamic stability
Structural conservation
Thermodynamic stability
$$\text{z-score} = \frac{E - \mu_{\text{Background}}}{\sigma_{\text{Background}}}$$
1) Generate randomized sequences of the same length and same base composition
2) Fold the sequences using RNAfold
3) Calculate μ and σ
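The three steps above can be sketched as follows. This is a minimal illustration and not RNAz code: `toy_energy` is a hypothetical stand-in for the folding energy that RNAfold would return, and the function names, the shuffle count, and the seed are assumptions made for the example.

```python
import random
import statistics


def shuffle_sequence(seq, rng):
    """Return a random permutation of seq (same length, same base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def toy_energy(seq):
    """Hypothetical stand-in for a folding energy, NOT a thermodynamic model.

    Scores complementary bases at mirrored positions, so hairpin-like
    sequences get a lower (more negative) value.
    """
    pair_score = {("G", "C"): -3.0, ("C", "G"): -3.0,
                  ("A", "U"): -2.0, ("U", "A"): -2.0,
                  ("G", "U"): -1.0, ("U", "G"): -1.0}
    n = len(seq)
    return sum(pair_score.get((seq[i], seq[n - 1 - i]), 0.0)
               for i in range(n // 2))


def z_score(seq, energy_fn, n_shuffles=1000, seed=0):
    """z = (E - mu) / sigma, with mu and sigma taken from shuffled sequences."""
    rng = random.Random(seed)
    # Steps 1 and 2: generate shuffled sequences and "fold" each of them.
    background = [energy_fn(shuffle_sequence(seq, rng))
                  for _ in range(n_shuffles)]
    # Step 3: mean and standard deviation of the background energies.
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    if sigma == 0:
        raise ValueError("background energies do not vary")
    return (energy_fn(seq) - mu) / sigma


hairpin = "GGGGGGGGAAAACCCCCCCC"
print(z_score(hairpin, toy_energy))  # negative: more stable than background
```

In a real setting `energy_fn` would wrap a call to RNAfold; the point of the sketch is only that every z-score costs `n_shuffles` folding runs.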
The Wash way
Explicit generation and folding of sequences is too costly!
Clue: Regression - train an SVM that does the job instead!
for length (50 ... 400 by 50) {
  for C+G (0.25 ... 0.75 by 0.05) {
    for A/(A+U) (0.25 ... 0.75 by 0.05) {
      for C/(C+G) (0.25 ... 0.75 by 0.05) {
        1) generate a synthetic sequence of the given length with nucleotide
           frequencies derived from C+G, A/(A+U), and C/(C+G)
        2) generate 1,000 shuffled sequences
        3) fold those 1,000 sequences and calculate μ and σ
      }
    }
  }
}
==> ~ 10,000 training examples to train an SVM for μ and σ, respectively.
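To see where the roughly 10,000 training examples come from, the grid can be enumerated directly. The sketch below only builds the grid points and the nucleotide frequencies derived from the three ratios; the function names are hypothetical, and sequence generation, shuffling, and folding are left out.

```python
def nucleotide_frequencies(gc, a_ratio, c_ratio):
    """Derive A/C/G/U frequencies from C+G, A/(A+U) and C/(C+G)."""
    au = 1.0 - gc
    return {
        "A": au * a_ratio,
        "U": au * (1.0 - a_ratio),
        "C": gc * c_ratio,
        "G": gc * (1.0 - c_ratio),
    }


def training_grid():
    """Yield (length, frequencies) for every point of the sampling grid."""
    # Integer steps avoid floating-point drift: 0.25, 0.30, ..., 0.75.
    fractions = [i / 100 for i in range(25, 80, 5)]
    for length in range(50, 401, 50):
        for gc in fractions:
            for a_ratio in fractions:
                for c_ratio in fractions:
                    yield length, nucleotide_frequencies(gc, a_ratio, c_ratio)


grid = list(training_grid())
print(len(grid))  # 8 lengths x 11^3 compositions = 10648
```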
Energy minimization is based on stacking energies!
It would be better to consider dinucleotide composition as well!
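As a small illustration of why dinucleotide composition carries extra information, the sketch below counts overlapping dinucleotides. The function name is hypothetical; the two example sequences have identical base composition but different dinucleotide content, which is what matters for stacking energies.

```python
from collections import Counter


def dinucleotide_frequencies(seq):
    """Relative frequencies of overlapping dinucleotides in seq."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dinuc: n / total for dinuc, n in counts.items()}


# Same mononucleotide composition (2 G, 2 C), different dinucleotides.
print(dinucleotide_frequencies("GGCC"))  # GG, GC and CC once each
print(dinucleotide_frequencies("CGCG"))  # CG twice, GC once
```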
Generate Sequences
Shuffle
Vary Length
Representative Set
μ, σ
Split to Subsets
Train
We did it!
Structural vs. sequence based alignments
On a genome wide scale
[Tally charts: Manja / agruber and Dom / agruber]
Alu repeats
J. Mattick
Retrotransposition
Pseudogenes
Expressed Sequence Tags (ESTs)
BLAST query: > 75% query coverage, > 95% identity
?
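The two thresholds can be applied to BLAST tabular output roughly as sketched below. The sketch assumes the standard 12-column tabular format (`-outfmt 6`), one HSP per line, and a separately supplied table of query lengths; the function name, the sequence IDs, and the length of 164 nt used for the example query are illustrative assumptions, not values from the talk.

```python
def passes_filter(line, query_lengths, min_coverage=75.0, min_identity=95.0):
    """Check one tabular BLAST hit against coverage and identity cutoffs.

    Expected columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    fields = line.rstrip("\n").split("\t")
    qseqid = fields[0]
    pident = float(fields[2])
    qstart, qend = int(fields[6]), int(fields[7])
    # Coverage of the query by this single HSP, in percent.
    coverage = 100.0 * (abs(qend - qstart) + 1) / query_lengths[qseqid]
    return coverage > min_coverage and pident > min_identity


# Hypothetical query length and hits, for illustration only.
lengths = {"U1": 164}
good = "U1\tEST1\t98.50\t150\t2\t0\t5\t154\t300\t449\t1e-70\t270"
short = "U1\tEST2\t99.00\t100\t1\t0\t1\t100\t10\t109\t1e-45\t180"
print(passes_filter(good, lengths), passes_filter(short, lengths))
```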
77% of ESTs with U1 snRNA signatures are chimeras
Partial match
Full length match
Chimera
What do we learn?
Chimeras are most likely artifacts introduced during library generation.
There are expressed RNA pseudogenes.
Small ncRNAs can be found in ESTs, but they are likely to be fused to something else.
In most cases only the protein component is annotated.
What's the plan?
There are lots of ESTs and there are lots of sequenced genomes!
A pipeline for detection of chimeric ESTs and hence expressed small ncRNAs is in production.
Some C. elegans snoRNAs have a tRNA-like promoter
Is there already a powerful box-A-box-B finder?
Yes, there is!
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. (1997) 25(5)
tRNAscan-SE
tRNAscan algorithm
* screen for putative box A and box B motifs
* then call the computationally more costly structure validation
All we need to do is take the tRNAscan FALSE POSITIVES!
snoRNAs and snoRNA predictions
Screen 200 nt upstream of each known snoRNA to identify those that have tRNA-like promoters
Screen 200 nt upstream of snoReport hits in P. pacificus to lend some hits additional reliability
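Cutting out the 200 nt upstream region has to respect the strand of the snoRNA. A minimal sketch, assuming 0-based half-open coordinates on the forward strand and a genome held as a plain string (the function names are hypothetical):

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_window(genome, start, end, strand, size=200):
    """Return up to `size` nt upstream of the feature [start, end).

    For '-' strand features, upstream lies to the right on the forward
    strand, so that region is taken and reverse-complemented.
    """
    if strand == "+":
        return genome[max(0, start - size):start]
    return reverse_complement(genome[end:end + size])


genome = "AAAACCCCGGGGTTTT"
print(upstream_window(genome, 8, 12, "+", size=4))  # CCCC
print(upstream_window(genome, 8, 12, "-", size=4))  # AAAA
```

The resulting windows would then be passed to the box A / box B screen.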
Genome-wide scale
Take tRNAscan FP-hits
Cut out 70 nt downstream or search for a poly-T stretch
Call blastclust and look at inter-species clusters
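One possible reading of the cutting step is sketched below: take at most 70 nt downstream of a tRNAscan false-positive hit, and stop early at the first poly-T stretch if there is one. The minimum run length of four T's, the forward-strand-only handling, and the function name are assumptions made for the example.

```python
import re


def downstream_candidate(genome, hit_end, max_len=70, min_t_run=4):
    """Return the candidate region downstream of a promoter hit.

    `hit_end` is the 0-based position just after the hit on the forward
    strand. The region ends at the first run of at least `min_t_run` T's
    (inclusive) or after `max_len` nt, whichever comes first.
    """
    window = genome[hit_end:hit_end + max_len]
    match = re.search("T{%d,}" % min_t_run, window)
    if match:
        return window[:match.end()]
    return window


genome = "ACGT" * 3 + "GGGCCCAATTTTTGGG"
print(downstream_candidate(genome, 12))  # GGGCCCAATTTTT
```

The extracted candidates could then be written to a FASTA file and clustered.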
Thanks to the whole Bierinformatik!
MOTIVATION
Support on almost everything I worked on
For being the “Mr. Nice Guy” at my back
For being Peter F. Stadler
Talking about RNA, “Gott und die Welt” (everything under the sun) ...