Genome Wide Searches for RNA Secondary Structure Motifs
Russell S. Hamilton
Davis Lab
Wellcome Trust Centre for Cell Biology
Drosophila melanogaster
Introduction: RNA Localization 2
[Figure labels: microtubules (+ and - ends), mRNA cis-acting signal, trans-acting factors, dynein]
RNA Localization is a mode of targeting various proteins to their site of function
Cis-acting signals in the mRNA are recognised by trans-acting factors bound to the dynein motor
Translation of the mRNA into protein is blocked during transport
The mRNA is anchored at the site of function before being translated to protein (Delanoue & Davis, 2005, Cell, in press)
gurken is localized to the dorso/anterior corner, where it forms a cap around the oocyte nucleus and establishes the dorso/ventral axis
gurken localization has been shown to be dynein dependent (MacDougall et al., 2003, Dev. Cell 4, 307-19)
The gurken localization signal has been mapped to a 64 nt element that is necessary and sufficient for localization (Van De Bor & Davis, 2004, Curr. Opin. Cell Biol. 16, 300-7)
gurken also localizes in the embryo
Introduction: gurken 3
[Figure: localizing mRNAs in the oocyte (osk, bcd, grk), with the D/V and A/P axes marked]
gurken encodes a TGFα homologue
Introduction: I Factor 4
Localized I Factor
nucleus
I Factor is a retrotransposon (or transposable element), which inserts itself into the genome of an organism
I Factor has been found to localize in a similar manner to gurken (Van De Bor, Hartswood, Jones, Finnegan & Davis)
The localization signal has been mapped to a 58nt signal necessary and sufficient for localization.
Van De Bor
Sequence Similarity: %ID = 34%
gurken   AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT 64
Ifactor  ---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA-- 58
            *      *   ***   **  ***      ***       * * *****  *      *
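The 34% figure is consistent with counting identical columns in the gurken / I Factor alignment and dividing by the alignment length (64 columns, gaps included). A minimal sketch, assuming that is how %ID was computed; the function name is illustrative and not taken from the slides:

```python
# Aligned rows copied from the slide ('-' marks a gap).
GURKEN = "AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT"
IFACTOR = "---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA--"


def percent_identity(a, b):
    """Return (identical columns, alignment length, %ID) for two aligned rows."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # A column counts as identical only if both rows carry the same base.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches, len(a), 100.0 * matches / len(a)


matches, columns, pid = percent_identity(GURKEN, IFACTOR)
print(matches, columns, round(pid))  # prints: 22 64 34
```

Dividing by the shorter sequence length (58) instead would give a higher value, so the choice of denominator matters when comparing %ID figures.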
Structural Similarity
V. Van De Bor, D. Finnegan, E. Hartswood and C. Jones
[Figure: the two predicted stem loops, each with regions labelled H, St, I1, B and I2]
gurken 64 nt stem loop
I Factor 58 nt stem loop
Are there more examples in the Drosophila genome using a similar mechanism of localization?
Search by secondary structure not sequence
Introduction: gurken and I Factor 5
[Diagram: genome sequences -> folded genome sequences -> comparison with grk & I Factor structures -> database]
Method Outline 6
RNALfold
Folds large genomic sequences, outputting stable structures of a given size
Similar to mfold, but optimised for folding on a genome-wide scale
2L chromosome arm genomic sequence
Stable structures
RNALfold Hofacker et al (2004) Bioinformatics 20, 191-198
Method: RNALFOLD 7
Window length is user defined
• Use 64 and 58 (grk & I Factor LEs)
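A hedged sketch of how the stable structures reported by a local folding run might be collected for the later comparison steps. It assumes each hit is printed as `<dot-bracket> (<energy>) <start>`, which should be checked against the RNALfold version in use (in the ViennaRNA command-line tool the maximum base-pair span is set with `-L`). The function name, the energy cut-off and the example lines are illustrative, not taken from the slides:

```python
import re

# Assumed hit-line shape: "<dot-bracket> (<energy>) <start>".
HIT = re.compile(r"^([.()]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s+(\d+)\s*$")


def parse_hits(lines, max_energy=0.0):
    """Keep hit lines whose energy is at or below max_energy (kcal/mol)."""
    hits = []
    for line in lines:
        m = HIT.match(line.strip())
        if m is None:
            continue  # sequence line, summary line, or anything unrecognised
        energy = float(m.group(2))
        if energy <= max_energy:
            hits.append({
                "structure": m.group(1),
                "energy": energy,
                "start": int(m.group(3)),
                "length": len(m.group(1)),
            })
    return hits


# Made-up lines for illustration only (not real RNALfold output).
example = [
    ".((((....)))). ( -2.80)  105",
    "((((((.....)))))) (-5.10)   12",
    "ACGUACGUACGU",
    " (-7.90)",
]
hits = parse_hits(example, max_energy=-3.0)
print(hits)
```

Only the second line passes the illustrative -3.0 cut-off; the sequence and summary lines are skipped because they do not match the assumed pattern.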
RNAdistance & RNAforester
• Structures are represented in bracket format: a minimal representation maintaining all structural characteristics
• Structures are then aligned (not by sequence) with the query structure, e.g. the gurken LE
• Scores can be weighted by sequence length and total number of base pairs
..(((((.....))))). Matches = + score
.-(-(((-....))))-. Mismatches = - score
( = base pair   . = unpaired base   - = gap
RNAdistance: Global Structure Comparison, Hofacker (1994) Monatsh. Chem. 125, 167-188
RNAforester: Local Structure Comparison, Hochsmann (2003) Proc. Comp. Sys. Bioinf. (CSB 2003)
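To make the match/mismatch idea concrete, here is a toy column-wise score over the two aligned bracket strings shown on the slide. This is a simplification for illustration only: it is not the edit-distance computation that RNAdistance or RNAforester perform, and the +1/-1 weights and the two normalisations are assumptions.

```python
def column_score(a, b, match=1, mismatch=-1):
    """Score two aligned bracket strings column by column (gaps count as mismatches)."""
    if len(a) != len(b):
        raise ValueError("aligned structures must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def base_pairs(structure):
    """Count base pairs in a dot-bracket string, checking that brackets balance."""
    depth = pairs = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            pairs += 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
    if depth != 0:
        raise ValueError("unbalanced '('")
    return pairs


query = "..(((((.....)))))."
target = ".-(-(((-....))))-."

raw = column_score(query, target)            # 14 matching columns, 4 gap columns
per_column = raw / len(query)                # weighted by alignment length
per_pair = raw / base_pairs(query)           # weighted by number of base pairs
print(raw, base_pairs(query))                # prints: 10 5
```

Normalising by length or by base-pair count lets scores from the 58 nt and 64 nt windows be compared on the same footing.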
Method: RNAdistance & RNAforester 8
Flexible secondary structure definition and searching algorithm
Two step process
Step 1. Create a structure description
Step 2. Use the description to find matching structures in a sequence database
Uses Mfold (and pknots) for secondary structure predictions
Output can be ranked by thermodynamic stability
User Defined Scoring
Based on if/then/else statements
e.g. if loop has 6-8 bases then score += 10 else score -= 10
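The loop-length rule can be sketched in Python as follows. This is not RNAMotif's own scoring language, only an illustration of the same if/then/else logic applied to a dot-bracket string; the function names and the regular expression are assumptions.

```python
import re


def hairpin_loop_lengths(structure):
    """Lengths of hairpin loops: runs of '.' enclosed directly by '(' and ')'."""
    return [len(m.group(1)) for m in re.finditer(r"\((\.+)\)", structure)]


def loop_score(structure, lo=6, hi=8, bonus=10, penalty=10):
    """Add `bonus` for each hairpin loop of lo..hi bases, subtract `penalty` otherwise."""
    total = 0
    for n in hairpin_loop_lengths(structure):
        if lo <= n <= hi:
            total += bonus
        else:
            total -= penalty
    return total


print(loop_score("((((......))))"))  # loop of 6 bases -> 10
print(loop_score("((((...))))"))     # loop of 3 bases -> -10
```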
Algorithm Summary
The description is converted to a tree structure
The secondary structure of the sequence being matched is also converted to a tree structure
The two trees are then matched
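A minimal sketch of the tree idea, assuming the structure is given in dot-bracket form. The node layout and the exact-shape comparison are illustrative choices; RNAMotif's internal representation is not described on the slide, and its matching allows length ranges rather than requiring identical trees.

```python
def to_tree(structure):
    """Convert a dot-bracket string into a nested tree of dicts."""
    root = {"kind": "root", "children": []}
    stack = [root]
    for ch in structure:
        if ch == "(":
            node = {"kind": "pair", "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)          # descend into the new base pair
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()                 # close the innermost open pair
        elif ch == ".":
            stack[-1]["children"].append({"kind": "unpaired", "children": []})
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


def shape(node):
    """Compact signature of a subtree, usable for equality checks."""
    return (node["kind"], tuple(shape(c) for c in node["children"]))


a = to_tree("((...))")
b = to_tree("((...))")
c = to_tree("((....))")
print(shape(a) == shape(b), shape(a) == shape(c))  # prints: True False
```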
Method: RNAMotif 9
Macke, T.J. et al. (2001) Nucl. Acids Res. 29, 4724-4735
Define base pairings allowed (in addition to Watson-Crick)
Define stems, loops, and bulges
• Including number of nucleotides
• Setting a range 0-N means it can either be present or not
Can also put in sequence constraints
• Including tolerated mismatches
Can search for pseudoknots, triplexes & quadruplexes
Very flexible method of describing secondary structures
Method: RNAMotif 10
4 description files so far…
1. Basic: 2900 hits. Matches both gurken and I factor LEs
2. Basic + score: 2900 hits. Scores nearer gurken as positive, scores nearer I factor as negative
3. Basic + score + seq constraint UU: 394 hits. UU in bulge present in both gurken and I factor
4. Basic + score + seq constraint UU + CAA/AAC: 151+ hits. CAA/AAC in stem 1 present in both gurken and I factor
loop 4-12 nt
stem 7-8 nt
stem 2-4 nt
stem 4 nt
bulge 3-5 nt
bulge 1-2 nt
bulge 0-1 nt
stem 5-9 nt
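The element ranges listed here can be read as a set of length constraints that a candidate structure must satisfy. The sketch below encodes them as plain Python data and checks a candidate against them. It is not RNAMotif descriptor syntax; the element order simply follows the order in which the labels appear, and the candidate lengths are hypothetical.

```python
# (kind, min nt, max nt), copied from the labels in the order listed.
DESCRIPTION = [
    ("loop", 4, 12),
    ("stem", 7, 8),
    ("stem", 2, 4),
    ("stem", 4, 4),
    ("bulge", 3, 5),
    ("bulge", 1, 2),
    ("bulge", 0, 1),   # a 0-N range means the element may be absent
    ("stem", 5, 9),
]


def fits(description, candidate):
    """True if every (kind, length) in candidate falls inside the matching range."""
    if len(candidate) != len(description):
        return False
    return all(
        kind == c_kind and lo <= n <= hi
        for (kind, lo, hi), (c_kind, n) in zip(description, candidate)
    )


# Hypothetical candidate: element lengths invented for illustration.
candidate = [("loop", 6), ("stem", 7), ("stem", 3), ("stem", 4),
             ("bulge", 4), ("bulge", 1), ("bulge", 0), ("stem", 6)]
too_long = [("loop", 13)] + candidate[1:]
print(fits(DESCRIPTION, candidate), fits(DESCRIPTION, too_long))  # prints: True False
```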
Method: RNAMotif 11
Sequence Databases
PC      1.3 x 10^7
CDS     3.0 x 10^7
TS      4.0 x 10^7
GN      1.2 x 10^8
TE      4.5 x 10^6
NC      4.8 x 10^6
3'-UTR  6.3 x 10^6
5'-UTR  3.9 x 10^6
RNALfold [3]: folds 2.8 x 10^8 nt at window lengths of 58 and 64 nt
RNAdistance [4]: each stable structure is compared to the gurken and I Factor LEs
RNAMotif [5]: stable structures are filtered by rule-based pattern matching
Matches database: web-based database interface
Candidates for experimental validation
Take all available sequence databases
Predict all stable secondary structures
Calculate similarity between grk/Ifactor and stable structures
Pattern match structures against an RNAMotif description
Results put in database and accessed via web interface
Method: Overview 12
Processing: 6 processing nodes (Pentium 4 HT, 1 GB RAM)
Data storage: RAID array file server, tape backup robot
Computational requirements are beyond desktop PCs
Main requirements are for processing power and enough storage space for the sequences being searched and the database of matching structures
Computational Infrastructure 13
Web server: linked to database
Development Platform
http://wcbweb.icmb.ed.ac.uk/~ilan/bioinformatics.html
To stop your browser crashing, you can limit the number of hits displayed
Filter by percentage of the sequence deemed to have low complexity
Select the RNAMotif structure description used in the searches
Narrow down the search by CG, TE, CR or individual identifiers
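The four search options map naturally onto a filtered, limited database query. The sketch below uses an in-memory SQLite table to illustrate this. The schema, column names and rows are hypothetical (the identifiers are placeholders, not real gene IDs), and the slides do not say which database engine sits behind the web interface.

```python
import sqlite3


def search(conn, description, max_low_complexity, id_prefix="", limit=100):
    """Return hits for one structure description, filtered and capped."""
    query = (
        "SELECT identifier, score, low_complexity_pct FROM hits "
        "WHERE description = ? AND low_complexity_pct <= ? AND identifier LIKE ? "
        "ORDER BY score DESC LIMIT ?"
    )
    params = (description, max_low_complexity, id_prefix + "%", limit)
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (identifier TEXT, description TEXT, "
    "score REAL, low_complexity_pct REAL)"
)
# Placeholder rows for illustration only.
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", [
    ("CG0001", "basic", 12.0, 5.0),
    ("CG0002", "basic", 30.0, 80.0),
    ("TE0001", "basic", 18.0, 0.0),
    ("CG0003", "basic+score", 25.0, 10.0),
])

cg_rows = search(conn, "basic", max_low_complexity=20.0, id_prefix="CG", limit=10)
all_rows = search(conn, "basic", max_low_complexity=20.0, limit=10)
print(cg_rows)
print(all_rows)
```

Applying the limit in the query, rather than after fetching, keeps the result page small regardless of how many structures match.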
Web Interface: Searching 14
RNAMotif raw output showing how sequence matches the structure description
Indicates if the sequence has regions of low complexity/repeat regions (option to filter these out)
RNAdistance scores displayed
Custom RNAMotif score
Web Interface: Search Results 15
Web Interface: Gene Mapping 16
Web Interface: Conservation Assessment 17
Results: Candidate Injections 18
We are currently in the process of injecting candidates from the database into oocytes and embryos to determine if the RNA is localized.
There have been suggestions that up to 20% of Drosophila genes may localize in the oocyte and/or embryo
So we want to show that our method is able to enrich for localizing genes
Results of candidate injections are stored in the database
Depending on the success of the experimental localization assays…
Expand the searches to:
• Other Drosophilid genomes (12 will be sequenced in the near future)
• Mammalian genomes, particularly human (will require considerable computational power)
Search for LINE/SINE elements in human (transposon equivalents)
Develop the web interface to enable real-time searches to be performed on genes/genomes of interest
• Requires massive computational power…
Future Work: Expanding Searches 19
Squid Protein
gurken mRNA is known to bind Squid protein
Used homology modelling to predict squid tertiary structure (~2.5 Å) (Hamilton & Soares)
RNA tertiary structure prediction
Secondary structure alone may not be sufficient for finding similar structures
Experimental Structure Determination
RNA + protein: X-ray and/or NMR
RNA only: NMR
Future Work: Tertiary Structure 20
RNA Binding Sites
Flexible Linker region
Squid homology model
RNA + protein 3D Structure
Staufen + RNA (Ramos et al., 2000, EMBO J. 19, 997-1009)
Long Term Future…
Support Vector Machines (SVMs)
Take sequence & structure for localizing and non-localizing matches (+ other data)
Algorithm learns how to separate localizing from non-localizing
Future Work: Machine Learning 21
The problem is that we don’t have enough data at the moment
However, with all the candidate injections we will hopefully generate enough data for localizing and non-localizing genes
Funding
Davis Lab: Ilan Davis, Veronique Van De Bor, Georgia Vendra, Hille Tekotte, Renald Delanoue, Carine Meignin, Alejandra Clark, Isabelle Kos, Richard Parton
Software
Acknowledgements 22
Finnegan Lab: David Finnegan, Eve Hartswood, Cheryl Jones
Bioinformatics Discussions: Alastair Kerr
Systems Administration: Paul Taylor
Homology Modelling: Dinesh Soares