RNAs in the human genome
Sam Griffiths-Jones, The Wellcome Trust Sanger Institute
Outline
• I. Non-coding RNA
  • The genome’s dark matter
  • Family classification
  • Genome annotation
• II. ncRNA genes in the human genome
  • Rogues’ gallery
  • miRNAs
  • Regulatory elements
ncRNA genes
• …. code for functional RNAs
• Many cellular machines contain RNA
  • Ribosome: rRNA
  • Spliceosome: snRNAs (U1, U2, U4, U5, U6)
  • Telomerase: telomerase RNA
  • SRP: SRP RNA
Gene sweep
• CSHL 2000-2003
• Rules
  • Bets cost $1 in 2000, $5 in 2001 and $20 in 2002
  • A gene is a set of connected transcripts. A transcript is a set of exons connected via transcription. At least one transcript must be expressed outside of the nucleus and one transcript must encode a protein.
  • One bet per person, per year
• Results
  • 165 bets
  • Mean 61710
  • Lowest 25947
  • Highest 153478
  • Answer: 21000
  • Winner: Lee Rowen
• http://www.ensembl.org/Genesweep/
ncRNA genes
• Genomic dark matter
  • Ignored by gene prediction methods
  • Not in EnsEMBL
  • Computational complexity
• ~10% of human gene count?
The RNA World
• Origin of life / central dogma paradox
  • DNA needs proteins to replicate
  • Proteins coded for by DNA
• RNA can be code and machinery
  • SELEX, aptamers
• RNAs are remnants
  • Ancient
  • Essential
Gene finding
• Rules
  • ATG (start codon)
  • TAA, TGA, TAG (stop codons)
  • GT…..AG (intron splice sites)
• Compositional features
  • Exon lengths
  • Intron lengths
  • Codon bias
  • General genomic properties
• Homology
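The signal-based rules above can be sketched for the simplest case: a single-exon open reading frame on the forward strand. This is only an illustration. The helper name `find_orfs` and the toy sequence are hypothetical, and real gene finders also model splice sites, composition and homology. An ncRNA gene has no ORF, so a scan like this passes straight over it.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) of forward-strand ORFs: ATG ... in-frame stop.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts both the ATG and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs

toy_gene = "CCATGAAATTTGGGTAACC"   # hypothetical sequence, illustration only
print(find_orfs(toy_gene))
print(find_orfs("GGAAGATCC"))      # a short hairpin-like sequence: no ORF found
```

The second call returns an empty list, which is the point of the slide: none of these signals exist for a gene whose product is the RNA itself.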
Protein sequence analysis
Query: 1   MKFYTIKLPKFLGGIVRAMLGSFRKD 26
           M+  TIKLPKFL  IVR   G+ + D
Sbjct: 390 MRIMTIKLPKFLAKIVRMFKGNKKSD 467
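As a small worked example, the identity of the alignment above can be counted directly from the two aligned sequences. This is a sketch only: the function name `identity` is hypothetical, and it counts exact matches, not the conservative substitutions that the alignment's match line marks with "+".

```python
query = "MKFYTIKLPKFLGGIVRAMLGSFRKD"
sbjct = "MRIMTIKLPKFLAKIVRMFKGNKKSD"

def identity(a, b):
    """Count identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, len(a)

matches, length = identity(query, sbjct)
# Rebuild the identity-only match line (no '+' for similar residues).
match_line = "".join(x if x == y else " " for x, y in zip(query, sbjct))

print(query)
print(match_line)
print(sbjct)
print(f"{matches}/{length} identical ({100 * matches / length:.0f}%)")
```

More than half of the positions are identical over this short stretch, which is why homology is such an effective signal for protein-coding genes.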
Why are families useful?
• Alignments of related sequences
• Phylogenetic trees
• Homologue detection
• Genome annotation
• Secondary structure prediction
S. cerevisiae     UCCUCGUGAGAGGG
P. canadensis     GUCUC.UGAGAGAU
P. strasburgensis CUCUC.UGAGAGAG
K. thermotolerans UUCUCGUGAGAGAA
SS                <<<<<....>>>>>
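The structure line in the alignment above says which columns pair, and the sequences show that the pairing is kept even where the bases change. A minimal sketch of that check follows. The names `paired_columns` and `pair_report` are hypothetical, and G-U wobble pairs are assumed to count as valid pairs.

```python
ALIGNMENT = {
    "S. cerevisiae":     "UCCUCGUGAGAGGG",
    "P. canadensis":     "GUCUC.UGAGAGAU",
    "P. strasburgensis": "CUCUC.UGAGAGAG",
    "K. thermotolerans": "UUCUCGUGAGAGAA",
}
SS = "<<<<<....>>>>>"

# Watson-Crick pairs plus G-U wobble (assumption: wobble counts as paired).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_columns(ss):
    """Map a '<' / '>' structure string to a sorted list of (i, j) column pairs."""
    stack, pairs = [], []
    for j, ch in enumerate(ss):
        if ch == "<":
            stack.append(j)
        elif ch == ">":
            pairs.append((stack.pop(), j))
    return sorted(pairs)

def pair_report(alignment, ss):
    """For each sequence, list the bases found at every paired column."""
    pairs = paired_columns(ss)
    return {name: [(seq[i], seq[j]) for i, j in pairs]
            for name, seq in alignment.items()}

report = pair_report(ALIGNMENT, SS)
for name, base_pairs in report.items():
    ok = all(bp in CANONICAL for bp in base_pairs)
    print(f"{name:18s}", " ".join(a + "-" + b for a, b in base_pairs), "ok" if ok else "BROKEN")
```

The outermost pair differs in every species, yet it can always form. Sequence identity alone would score those columns as mismatches, which is the motivation for models that score structure as well as sequence.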
RNA models
• Covariance models (profile-SCFGs)
  • Analogue to profile-HMMs
  • Statistical representation of the alignment with structure
  • Homologue detection
  • Multiple sequence alignment
  • (Sean Eddy)
Protein sequence analysis - HMMs
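Alignment:
ERELKKQKKLSNR
ERELKK..KQSNR
ERELKRQRKQSNR
KAAAQRQKMIKNR
[Figure: profile HMM built from the alignment, with match (M), insert (I) and delete (D) states between begin (B) and end (E) states. Sequence label in the figure: EREKKKRKQSNR]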
RNA sequence analysis - SCFGs
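Sequence:  G G A A G A U C C
Structure: < < < . . . > > >
[Figure: SCFG parse of the hairpin, with three pair-emitting states (MP) for the stem pairs G-C, G-C and A-U, and three left-emitting states (ML) for the loop bases]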
RNA models - problems
• Problems
  • Speed
  • Memory
  • Sensitivity
• Speed
  • 30 billion bases in DBs
  • O(N^3) wrt model length
  • small model 300 b/s
  • 28S rRNA 200 b/day
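To put the quoted rates in perspective, here is a back-of-the-envelope calculation of how long one full database search would take. It assumes a single processor running at exactly the rates on the slide, with "b" read as bases; the variable and function names are illustrative.

```python
DB_BASES = 30e9            # 30 billion bases in the databases
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

small_model_rate = 300                    # bases per second
rrna_rate = 200 / SECONDS_PER_DAY         # 200 bases per day, as bases per second

def search_years(db_bases, bases_per_second):
    """Years needed to scan db_bases at a constant rate on one processor."""
    return db_bases / bases_per_second / SECONDS_PER_DAY / DAYS_PER_YEAR

small_years = search_years(DB_BASES, small_model_rate)
rrna_years = search_years(DB_BASES, rrna_rate)

print(f"small model: {small_years:.1f} years")   # about 3.2 years
print(f"28S rRNA:    {rrna_years:,.0f} years")   # about 410,000 years
```

Even the small model needs years of CPU time for one pass, and a 28S rRNA-sized model is out of reach without heuristics or pre-filtering.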
Rfam 5.0
• http://www.sanger.ac.uk/Software/Rfam/
• http://rfam.wustl.edu/
• 176 ncRNA families
  • Structure annotated alignments
  • Species distributions
  • Keyword searches
  • Sequence searches
• >235000 regions in EMBL 76
ncRNA families
What we have:
• tRNA
• 5S, 5.8S rRNAs
• Spliceosomal RNAs
• SRP, RNaseP
• Telomerase, tmRNA, vault
• E. coli screens
• Some snoRNAs
• Some miRNAs
• Some UTR elements
• Self-splicing introns
• …… more
What we don’t:
• 18S, 23S rRNAs
• Other large things (Xist etc)
• Lots of snoRNAs
• Lots of miRNAs
• Many small families
• Unknowns
Genome annotation
• General
  • One tool fits all
  • Compute drain
  • Automatic
  • Eukaryotic complications
  • Comprehensive
  • Great for prokaryotes
• Specific
  • Heuristics
  • One family, one gene finder
  • Increased speed
  • Increased sensitivity
  • tRNAscan-SE, BRUCE, SRPscan, snoscan
Outline
• I. Non-coding RNA
  • The genome’s dark matter
  • Family classification
  • Genome annotation
• II. ncRNA genes in the human genome
  • Rogues’ gallery
  • miRNAs
  • Regulatory elements
microRNAs
• A novel class of ncRNA gene
• Products are ~22 nt RNAs
• Precursors are 70-100 nt hairpins
• Gene regulation by pairing to mRNA
• Unknown before 2001
Timeline
• Late 70’s – lin-4 and let-7 regulate developmental timing in worm
• 1993 – lin-4 codes for a ~22 nt RNA, complementary to 3’ UTR of lin-14
• 2000 – …. so does let-7 (stRNAs)
• 2000 – let-7 is conserved in bilaterally symmetric animals
• 2001 – ~100 miRNAs discovered by cloning in worm, fly and human
• 2002 – miRNAs conserved in plants
• 2002 – Science magazine’s breakthrough of the year
• 2002 – miRNA Registry established
• 2003 – miRNAs may account for 1% of total gene count in animals
• 2003 – a few targets of miRNAs identified
• 2004 – miRNA Registry has 719 miRNAs
[Figure: number of publications per year matching “miRNA” in PubMed, 1999 to 2004; y-axis from 0 to 140]
miRNA Registry 3.0
• Searchable database of published miRNAs
  • http://www.sanger.ac.uk/Software/Rfam/mirna/
  • 719 entries from human, mouse, rat, worm, fly, and plants
• Naming service
  • Pre-publication
  • Unique names for distinct miRNAs
  • Confidentiality for unpublished data
Genomic context
• 180 known miRNAs in human
  • 130 intergenic
    • 60 polycistronic
    • 70 monocistronic
  • 50 intronic
UTR elements in human
• IRE: regulation of iron metabolism
• SECIS: UGA -> SeC
• Histone 3’ UTR: 3’ end formation
• Vimentin 3’ UTR: mRNA localisation
• CAESAR: CTGF repression
• …. many more
ncRNAs in human genome
• tRNA 600
• 18S rRNA 200
• 5.8S rRNA 200
• 28S rRNA 200
• 5S rRNA 200
• snoRNA 300
• miRNA 250
• U1 40
• U2 30
• U4 30
• U5 30
• U6 20
• U4atac 5
• U6atac 5
• U11 5
• U12 5
• SRP RNA 1
• RNase P RNA 1
• Telomerase RNA 1
• RNase MRP 1
• Y RNA 5
• Vault 4
• 7SK RNA 1
• Xist 1
• H19 1
• BIC 1
• Antisense RNAs 1000s?
• Cis reg regions 100s?
• Others ?
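Summing the counted entries in the list above gives a rough check on the "10% of human gene count" estimate. This is a sketch under two assumptions: the open-ended entries (antisense RNAs, cis-regulatory regions, others) are left out, and the Genesweep answer of 21000 is used as the gene count for comparison.

```python
# Approximate copy numbers from the slide; entries marked "?" are excluded.
NCRNA_COUNTS = {
    "tRNA": 600, "18S rRNA": 200, "5.8S rRNA": 200, "28S rRNA": 200,
    "5S rRNA": 200, "snoRNA": 300, "miRNA": 250,
    "U1": 40, "U2": 30, "U4": 30, "U5": 30, "U6": 20,
    "U4atac": 5, "U6atac": 5, "U11": 5, "U12": 5,
    "SRP RNA": 1, "RNase P RNA": 1, "Telomerase RNA": 1, "RNase MRP": 1,
    "Y RNA": 5, "Vault": 4, "7SK RNA": 1, "Xist": 1, "H19": 1, "BIC": 1,
}
GENE_COUNT = 21_000   # assumption: the Genesweep answer quoted earlier in the talk

total = sum(NCRNA_COUNTS.values())
fraction = total / GENE_COUNT

print(f"{total} ncRNA genes, {100 * fraction:.1f}% of {GENE_COUNT}")
```

The counted families alone come to about a tenth of the gene count, so including even a fraction of the uncounted antisense and regulatory RNAs would push the figure above 10%.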
Summary
• ncRNA genes ….
  • have diverse and essential roles
  • may be relics of ancient RNA-based life
  • provide major computational challenges
  • are often ignored!
  • are >10% of human gene count?
• Family classifications are useful for ….
  • finding homologues
  • predicting structure
  • automatic genome annotation
Just plain weird
http://vaults.arc.ucla.edu/sci/sci_home.htm
• Vault is huge
  • 13 MDa
  • 30 x 55 nm
• Described in 1986
• 3 proteins
  • MVP
  • TEP1
  • vPARP
• vRNA
• Conserved in higher eukaryotes
Thanks
• Alex Bateman
• Mhairi Marshall
• Simon Moxon
• Ajay Khanna
• Sean Eddy
• Informatics support group
• Ian Holmes
• Bjarne Knudsen
• Robbie Klein
• David Bartel
• Tom Tuschl
• Victor Ambros
Bibliography
• Computational genomics of non-coding RNA genes. Sean R. Eddy, Cell 109:137-140 (2002)
• Non-coding RNAs: the architects of eukaryotic complexity. John S. Mattick, EMBO Reports 2:986-991 (2001)
• MicroRNAs: Genomics, biogenesis, mechanism and function. David P. Bartel, Cell 116:281-297 (2004)
• Rfam: An RNA family database. Sam Griffiths-Jones et al., Nucl. Acids Res. 31:439-441 (2003)
http://www.sanger.ac.uk/Software/Rfam/
http://www.stats.ox.ac.uk/~hein/HumanGenome/