RNAs in the human genome Sam Griffiths-Jones The Wellcome Trust Sanger Institute

RNAs in the human genome

Sam Griffiths-Jones

The Wellcome Trust Sanger Institute

Outline

• I. Non-coding RNA
  • The genome’s dark matter
  • Family classification
  • Genome annotation
• II. ncRNA genes in the human genome
  • Rogue’s gallery
  • miRNAs
  • Regulatory elements

T. thermophilus - Ramakrishnan et al., Cell, 2002

Protein/RNA genes

[Diagram: DNA → RNA → protein, with the protein step crossed out (X) for RNA genes]

ncRNA genes

• …. code for functional RNAs
• Many cellular machines contain RNA
  • Ribosome: rRNA
  • Spliceosome: snRNAs (U1, U2, U4, U5, U6)
  • Telomerase: telomerase RNA
  • SRP: SRP RNA

How many genes in the human genome?

Gene sweep

• CSHL 2000-2003
• Rules
  • $1 in 2000, $5 in 2001 and $20 in 2002
  • A gene is a set of connected transcripts. A transcript is a set of exons connected via transcription. At least one transcript must be expressed outside of the nucleus and one transcript must encode a protein.
  • One bet per person, per year
• Results
  • 165 bets
  • Mean 61710
  • Lowest 25947
  • Highest 153478
  • Answer: 21000
  • Winner: Lee Rowen
• http://www.ensembl.org/Genesweep/

ncRNA genes

• Genomic dark matter
  • Ignored by gene prediction methods
  • Not in EnsEMBL
  • Computational complexity

• ~10% of human gene count?

The RNA World

• Origin of life / central dogma paradox
  • DNA needs proteins to replicate
  • Proteins coded for by DNA
• RNA can be code and machinery
  • SELEX, aptamers
• RNAs are remnants
  • Ancient
  • Essential

Biological sequence analysis

• Protein: easy
• RNA: hard

Gene finding

• Rules
  • ATG
  • TAA, TGA, TAG
  • GT…..AG
• Compositional features
  • Exon lengths
  • Intron lengths
  • Codon bias
  • General genomic properties
• Homology
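The rule-based signals for protein-coding genes (a start codon, an in-frame stop codon) can be sketched in a few lines of Python. This is only an illustration of why such rules are easy to encode, not a real gene finder: the function name `find_orfs`, the `min_codons` threshold and the toy sequence are assumptions made for the example, and the sketch scans the forward strand only, ignoring splice sites (GT…..AG), compositional features and homology.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs (ATG ... in-frame stop).

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical toy sequence: ATG AAA TTT GGG TAA embedded in flanking bases.
toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy))
```

No comparably simple set of sequence rules is listed for RNA genes, which is the contrast the slide draws.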

Protein sequence analysis

Query: 1   MKFYTIKLPKFLGGIVRAMLGSFRKD 26
           M+  TIKLPKFL  IVR   G+ + D
Sbjct: 390 MRIMTIKLPKFLAKIVRMFKGNKKSD 467
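As a small worked example, the identity of this pairwise alignment can be counted directly from the two aligned strings. The helper name `percent_identity` is an assumption for illustration; the sequences are copied from the alignment on this slide, which contains no gaps.

```python
query = "MKFYTIKLPKFLGGIVRAMLGSFRKD"
sbjct = "MRIMTIKLPKFLAKIVRMFKGNKKSD"


def percent_identity(a, b):
    """Count identical columns in two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, 100.0 * matches / len(a)


matches, pct = percent_identity(query, sbjct)
print(f"{matches} of {len(query)} residues identical ({pct:.1f}%)")
```

At the protein level, similarity this strong is visible from the primary sequence alone.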

RNA sequence analysis

Why are families useful?

• Alignments of related sequences

• Phylogenetic trees

• Homologue detection

• Genome annotation

• Secondary structure prediction

S. cerevisiae      UCCUCGUGAGAGGG
P. canadensis      GUCUC.UGAGAGAU
P. strasburgensis  CUCUC.UGAGAGAG
K. thermotolerans  UUCUCGUGAGAGAA
SS                 <<<<<....>>>>>
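The alignment on this slide illustrates covariation: the bases in the stem differ between species, but every column pair marked `<` and `>` can still form a base pair. A minimal Python sketch of that check follows. The function name `pairs_from_structure` and the decision to accept G-U wobble pairs alongside Watson-Crick pairs are assumptions for illustration.

```python
# Watson-Crick pairs plus G-U wobble (the wobble choice is an assumption).
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pairs_from_structure(ss):
    """Turn a bracket string such as '<<<...>>>' into (i, j) column pairs."""
    stack, pairs = [], []
    for i, ch in enumerate(ss):
        if ch == "<":
            stack.append(i)
        elif ch == ">":
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure string")
    return sorted(pairs)


alignment = {
    "S. cerevisiae": "UCCUCGUGAGAGGG",
    "P. canadensis": "GUCUC.UGAGAGAU",
    "P. strasburgensis": "CUCUC.UGAGAGAG",
    "K. thermotolerans": "UUCUCGUGAGAGAA",
}
ss = "<<<<<....>>>>>"

pairs = pairs_from_structure(ss)
compatible = {
    name: all((seq[i], seq[j]) in ALLOWED for i, j in pairs)
    for name, seq in alignment.items()
}
for name, ok in compatible.items():
    print(f"{name:18s} all stem pairs compatible: {ok}")

# The outermost pair shows the covariation: different bases, still paired.
i, j = pairs[0]
outer = sorted({seq[i] + "-" + seq[j] for seq in alignment.values()})
print("outermost pair across species:", outer)
```

Sequence similarity alone would score these stem columns as mismatches, which is why structure-aware models are needed for RNA families.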

RNA models

• Covariance models (profile-SCFGs)
  • Analogue of profile-HMMs
  • Statistical representation of the alignment with structure
  • Homologue detection
  • Multiple sequence alignment
  • (Sean Eddy)

Protein sequence analysis - HMMs

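ERELKKQKKLSNR
ERELKK..KQSNR
ERELKRQRKQSNR
KAAAQRQKMIKNR

[Figure: profile-HMM built from this alignment, with begin (B), match (M), insert (I), delete (D) and end (E) states; example sequence: EREKKKRKQSNR]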

RNA sequence analysis - SCFGs

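GGAAGAUCC
<<<...>>>

[Figure: covariance model for this hairpin, with three pair states (MP) for the stem pairs G–C, G–C and A–U, and three left-emitting states (ML) for the loop]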

RNA models - problems

• Problems
  • Speed
  • Memory
  • Sensitivity
• Speed
  • 30 billion bases in DBs
  • O(N^3) wrt model length
  • Small model: 300 b/s
  • 28S rRNA: 200 b/day
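A back-of-the-envelope calculation makes the speed problem concrete. The Python sketch below assumes that "b" means bases, that the quoted rates are for a single processor, and that the whole 30-billion-base database is scanned once; the variable names are illustrative.

```python
DB_BASES = 30e9          # "30 billion bases in DBs"
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

# Small model: 300 bases per second.
small_model_days = DB_BASES / 300 / SECONDS_PER_DAY
small_model_years = small_model_days / DAYS_PER_YEAR

# 28S rRNA model: 200 bases per day.
rrna_days = DB_BASES / 200
rrna_years = rrna_days / DAYS_PER_YEAR

print(f"small model: {small_model_days:,.0f} days (~{small_model_years:.1f} years)")
print(f"28S rRNA:    {rrna_years:,.0f} years")
```

Under these assumptions a single small model needs roughly three CPU-years for one pass over the database, and a 28S rRNA model several hundred thousand CPU-years.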

Sanger supercomputers

Rfam 5.0

• http://www.sanger.ac.uk/Software/Rfam/
• http://rfam.wustl.edu/
• 176 ncRNA families
• Structure annotated alignments
• Species distributions
• Keyword searches
• Sequence searches
• >235000 regions in EMBL 76

ncRNA families

What we have:
• tRNA
• 5S, 5.8S rRNAs
• Spliceosomal RNAs
• SRP, RNaseP
• Telomerase, tmRNA, vault
• E. coli screens
• Some snoRNAs
• Some miRNAs
• Some UTR elements
• Self-splicing introns
• …… more

What we don’t:
• 18S, 23S rRNAs
• Other large things (Xist etc)
• Lots of snoRNAs
• Lots of miRNAs
• Many small families
• Unknowns

Genome annotation

• General
  • One tool fits all
  • Compute drain
  • Automatic
  • Eukaryotic complications
  • Comprehensive
  • Great for prokaryotes
• Specific
  • Heuristics
  • One family, one gene finder
  • Increased speed
  • Increased sensitivity
  • tRNAscan-SE, BRUCE, SRPscan, snoscan

Outline

• I. Non-coding RNA
  • The genome’s dark matter
  • Family classification
  • Genome annotation
• II. ncRNA genes in the human genome
  • Rogue’s gallery
  • miRNAs
  • Regulatory elements

International Human Genome Sequencing Consortium, Nature, 2001

X chromosome inactivation in mammals

[Diagram: XX and XY sex chromosomes; one X in XX cells is inactivated]

Dosage compensation

Xist – X inactive-specific transcript

Avner and Heard, Nat. Rev. Genetics 2001 2(1):59-67

International Human Genome Sequencing Consortium, Nature, 2001

microRNAs

• A novel class of ncRNA gene
  • Products are ~22 nt RNAs
  • Precursors are 70-100 nt hairpins
  • Gene regulation by pairing to mRNA
  • Unknown before 2001

Timeline

• Late 70’s – lin-4 and let-7 regulate developmental timing in worm

• 1993 – lin-4 codes for a ~22 nt RNA, complementary to 3’ UTR of lin-14

• 2000 – …. so does let-7 (stRNAs)

• 2000 – let-7 is conserved in bilaterally symmetric animals

• 2001 – ~100 miRNAs discovered by cloning in worm, fly and human

• 2002 – miRNAs conserved in plants

• 2002 – Science magazine’s breakthrough of the year

• 2002 – miRNA Registry established

• 2003 – miRNAs may account for 1% of total gene count in animals

• 2003 – a few targets of miRNAs identified

• 2004 – miRNA Registry has 719 miRNAs

[Figure: number of publications matching “miRNA” in PubMed per year, 1999 to 2004; y-axis from 0 to 140]

miRNA biogenesis

Adapted from DP Bartel, Cell 116:281-297(2004)

miRNA targets

DP Bartel, Cell 116:281-297 (2004)

PNAS 99:15524-15529(2002)

miRNA Registry 3.0

• Searchable database of published miRNAs
  • http://www.sanger.ac.uk/Software/Rfam/mirna/
• 719 entries from human, mouse, rat, worm, fly, and plants
• Naming service
  • Pre-publication
  • Unique names for distinct miRNAs
  • Confidentiality for unpublished data

Genomic context

180 known miRNAs in human

130 intergenic 50 intronic

60 polycistronic

70 monocistronic

ncRNA gene contexts

AAAAAAA

tRNA, snRNAs, SRP, RNase P …..

Xist

miRNAs

miRNAs, snoRNAs

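Inside-out genes

[Diagram: in a conventional gene the spliced exons encode protein; in inside-out genes (Gas5, UHG, U17HG, U19H) the snoRNA is the product and the spliced transcript goes to degradation]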

Cis-regulatory RNA elements: PrfA in Listeria

[Diagram: PrfA at 25 °C and at 37 °C, and its effect on virulence gene expression]

UTR elements in human

• IRE: regulation of iron metabolism
• SECIS: UGA -> SeC
• Histone 3’ UTR: 3’ end formation
• Vimentin 3’ UTR: mRNA localisation
• CAESAR: CTGF repression
• …. many more

ncRNAs in human genome

• tRNA 600

• 18S rRNA 200

• 5.8S rRNA 200

• 28S rRNA 200

• 5S rRNA 200

• snoRNA 300

• miRNA 250

• U1 40

• U2 30

• U4 30

• U5 30

• U6 20

• U4atac 5

• U6atac 5

• U11 5

• U12 5

• SRP RNA 1

• RNase P RNA 1

• Telomerase RNA 1

• RNase MRP 1

• Y RNA 5

• Vault 4

• 7SK RNA 1

• Xist 1

• H19 1

• BIC 1

• Antisense RNAs 1000s?

• Cis reg regions 100s?

• Others ?
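Adding up the approximate counts above gives a feel for how the total compares with the Gene sweep answer of 21000 quoted earlier in the talk. The Python sketch below copies the numbers from this slide; the open-ended entries (antisense RNAs, cis-regulatory regions, others) are left out, and using 21000 as the denominator is an assumption made for illustration.

```python
ncrna_counts = {
    "tRNA": 600, "18S rRNA": 200, "5.8S rRNA": 200, "28S rRNA": 200,
    "5S rRNA": 200, "snoRNA": 300, "miRNA": 250,
    "U1": 40, "U2": 30, "U4": 30, "U5": 30, "U6": 20,
    "U4atac": 5, "U6atac": 5, "U11": 5, "U12": 5,
    "SRP RNA": 1, "RNase P RNA": 1, "Telomerase RNA": 1, "RNase MRP": 1,
    "Y RNA": 5, "Vault": 4, "7SK RNA": 1, "Xist": 1, "H19": 1, "BIC": 1,
}
GENE_SWEEP_ANSWER = 21000  # figure taken from the Gene sweep slide

total = sum(ncrna_counts.values())
fraction = total / GENE_SWEEP_ANSWER
print(f"{total} ncRNA genes listed, about {fraction:.1%} of {GENE_SWEEP_ANSWER}")
```

The listed families alone come to just over 2100 genes, close to a tenth of the gene count, before any of the uncounted categories are added.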

Summary

• ncRNA genes ….
  • have diverse and essential roles
  • may be relics of ancient RNA-based life
  • provide major computational challenges
  • are often ignored!
  • >10% of human gene count?
• Family classifications are useful for ….
  • finding homologues
  • predicting structure
  • automatic genome annotation

Just plain weird

http://vaults.arc.ucla.edu/sci/sci_home.htm

• Vault is huge
  • 13 MDa
  • 30 x 55 nm
• Described in 1986
• 3 proteins
  • MVP
  • TEP1
  • vPARP
• vRNA
• Conserved in higher eukaryotes

Thanks

• Alex Bateman

• Mhairi Marshall

• Simon Moxon

• Ajay Khanna

• Sean Eddy

• Informatics support group

• Ian Holmes

• Bjarne Knudsen

• Robbie Klein

• David Bartel

• Tom Tuschl

• Victor Ambros

Bibliography

• Computational genomics of non-coding RNA genes. Sean R. Eddy, Cell 109:137-140 (2002)

• Non-coding RNAs: the architects of eukaryotic complexity. John S. Mattick, EMBO Reports 2:986-991 (2001)

• MicroRNAs: Genomics, biogenesis, mechanism and function. David P. Bartel, Cell 116:281-297 (2004)

• Rfam: An RNA family database. Sam Griffiths-Jones et al., Nucl. Acids Res. 31:439-441 (2003)

http://www.sanger.ac.uk/Software/Rfam/

http://www.stats.ox.ac.uk/~hein/HumanGenome/