March 9, 2007 Bologna, February 2007 1 the complexity of human genes The ENCODE Genes & Transcripts group Roderic Guigó Centre de Regulació Genòmica, Barcelona


the complexity of human genes

The ENCODE Genes & Transcripts group

Roderic Guigó Centre de Regulació Genòmica, Barcelona


genes and proteins

• One gene, one enzyme (Beadle and Tatum)

• The Central Dogma (Francis Crick)


from DNA to proteins

most of the transcriptional output of the human genome is localized in well defined genomic loci, which encode mRNAs that, when exported into the cytosol, are translated into proteins


• 1% of the genome: 44 regions

• target selection: a committee selected the sequence targets

– manual targets: a lot of information already available

– random targets: stratified by non-exonic conservation with mouse and by gene density


                                                                                      

[Figure: locations of the 44 ENCODE regions on chromosomes 1-22, X and Y. Manually selected regions: m001-m014. Randomly selected regions: r111-r114, r121-r123, r131-r133, r211-r213, r221-r223, r231-r233, r311-r313, r321-r324, r331-r334.]


• Long-range regulatory elements (enhancers, repressors/silencers, insulators)
• Cis-regulatory elements (promoters, transcription factor binding sites)
• DNA Replication
• DNase Hypersensitive Sites
• Genes and Transcripts
• Epigenetic


gencode: encyclopedia of genes and gene variants

• Roderic Guigó, IMIM-UPF-CRG
• Stylianos Antonarakis, Alexandre Reymond, Geneva
• Ewan Birney, EBI
• Michael Brent, WashU
• Lior Pachter, Berkeley
• Manolis Dermitzakis, Sanger
• Jennifer Ashurst, Tim Hubbard

identify all protein coding genes in the ENCODE regions:

• identify one complete mRNA sequence for at least one splice isoform of each protein coding gene.

• eventually, identify a number of additional alternative splice forms.


the gencode pipeline

1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION


THE GENCODE PIPELINE

manual curation: HAVANA (Sanger)
experimental verification: Geneva
bioinformatics: IMIM

• 2608 transcripts in 487 loci
• 137 transcripts in 53 non-coding loci
• 1097 coding transcripts and 1374 non-coding transcripts in 434 protein coding loci

most protein coding loci encode a mixture of protein coding and non-coding transcripts
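The counts on this slide can be cross-checked with a little arithmetic. The sketch below only re-adds the figures quoted above; the variable names are illustrative, and the two derived ratios are not stated on the slide.

```python
# Counts as quoted on the slide (GENCODE annotation of the ENCODE regions).
noncoding_loci = 53
coding_loci = 434
transcripts_in_noncoding_loci = 137
coding_transcripts = 1097
noncoding_transcripts_in_coding_loci = 1374

# The per-category counts should add up to the quoted totals (487 loci, 2608 transcripts).
total_loci = noncoding_loci + coding_loci
transcripts_in_coding_loci = coding_transcripts + noncoding_transcripts_in_coding_loci
total_transcripts = transcripts_in_noncoding_loci + transcripts_in_coding_loci

# Derived ratios (not on the slide): transcripts per protein coding locus,
# and the share of those transcripts that are non-coding.
per_coding_locus = transcripts_in_coding_loci / coding_loci
noncoding_share = noncoding_transcripts_in_coding_loci / transcripts_in_coding_loci

print(total_loci, total_transcripts)
print(round(per_coding_locus, 2), round(100 * noncoding_share, 1))
```

The totals agree with the slide, and more than half of the transcripts annotated in protein coding loci are themselves non-coding, which is the point of the closing statement.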


one gene - many proteins

very complex transcription units


chimeric tandem transcription / intergenic splicing


KUA and UEV, Thomson et al., Genome Research 2000


systematic search for functional chimeras in ENCODE:

• 165 tandem pairs in the same orientation
• 126 chimeric predictions obtained
• 96 tested, at least 4 positive

Parra et al., Genome Research 2006
Akiva et al., Genome Research 2006


Structural Effects of Pepsinogen C Alternative Splice Variant

Michael Tress & Alfonso Valencia, CNB, Madrid

Locus RP11-298J23.1 codes for pepsinogen C. The structure of pepsinogen C is 1htrA.

Isoform -003 is missing 80 residues with respect to pepsinogen C. Here the missing section of -003 is in light green.

The missing section in this isoform would remove the core from both subdomains of the structure. Both the N-terminal sub-domain (on the left) and the C-terminal sub-domain would have to refold.

This is the view from above looking down into the active cleft of the proteinase. Active site aspartates are shown in ball and stick. One of the two active site residues is in the missing section.

The symmetry apparent in this isoform suggests that although it will have to refold it may very well be able to reform into a single subdomain.


ITGB4B

11 supporting ESTs

Adam Frankish, Sanger


GENCODE vs OTHER GENE SETS

[Figure panels: ALL EXONS; CODING EXONS]


from the ENCODE Chromatin and Replication Group, John Stamatoyannopoulos


EGASP’05

• the complete annotation of 13 regions was released on January 30.
– the annotation of the remaining 31 regions was still being obtained, and it was withheld.

• gene prediction groups were asked to submit predictions for the remaining 31 regions by April 15.
– 18 groups participated, submitting 30 prediction sets.

• predictions were compared to the annotations in an NHGRI-sponsored workshop at the Wellcome Trust Sanger Institute, on May 6 and 7.


EGASP’05

• two main goals:

1. to assess how well automatic methods are able to reproduce the (costly) manual/computational/experimental gencode annotation

2. to assess how complete the gencode annotation is: are there still genes consistently predicted by computational methods that are missing from the annotation?


accuracy measures


accuracy at the coding exon level

[Figure legend: evidence-based; dual genome; “ab initio”]


accuracy at the exon level

[Figure legend: evidence-based; dual genome; “ab initio”]


programs are quite good at calling the protein coding exons (accuracy at 80%; not as good at calling all the transcribed exons), but the best of the programs correctly predict only 40% of the complete transcripts (considering only the coding fraction)
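The slides do not spell out the accuracy measures, so the following is a minimal sketch that assumes the definitions conventional in gene-prediction evaluation: an exon counts as correct only if both of its boundaries match an annotated exon exactly, sensitivity is the fraction of annotated exons predicted correctly, and specificity is the fraction of predicted exons that are correct. The function name and the toy coordinates are hypothetical.

```python
def exon_level_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (chromosome, start, end, strand) tuples; a predicted exon is a
    true positive only if it matches an annotated exon exactly.
    """
    annotated = set(annotated)
    predicted = set(predicted)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy example (made-up coordinates): five annotated exons, six predicted,
# four of which match exactly.
annotated = [
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 380, "+"),
    ("chr21", 500, 620, "+"),
    ("chr21", 700, 790, "+"),
    ("chr21", 900, 990, "+"),
]
predicted = [
    ("chr21", 100, 200, "+"),
    ("chr21", 300, 380, "+"),
    ("chr21", 500, 620, "+"),
    ("chr21", 705, 790, "+"),    # wrong start: counted as a miss and a false positive
    ("chr21", 900, 990, "+"),
    ("chr21", 1200, 1300, "+"),  # not in the annotation
]

sn, sp = exon_level_accuracy(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Under the same exact-match convention applied to whole transcripts, a single wrong exon boundary invalidates the transcript, which is one reason transcript-level accuracy can sit far below exon-level accuracy.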


many novel exons predicted:

- 8,634 unique exons predicted in intergenic regions
- we ranked the exons according to the accuracy of the predicting programs
- tested 238 exon pairs by RT-PCR in 24 tissues
- only 7 (less than 3%) were confirmed positive

PROGRAM              nbr of exons not overlapping   nbr of exons intergenic
                     annotated exons                to annotations
GENEMARK.2           2802                           2080
Ecgene               2661                           1545
AUGUSTUS.7           2629                           2580
EXONHUNTER.3         1869                           1373
DOGFISH-CE.4         1857                           1820
Genscan              1233                           900
GENEZILLA.2          1217                           849
Acembly              1162                           750
TWINSCAN-MARS.4      796                            388
FGENESH++.1          546                            452
SAGA.4               504                            390
Geneid               500                            331
SGP                  393                            245
ACEVIEW.3            377                            301
AUGUSTUS.2           372                            214
SPIDA.7              332                            309
AUGUSTUS.4           274                            142
N-SCAN.4             252                            183
N-SCAN.5             252                            183
Twinscan             232                            67
AUGUSTUS.1           215                            94
AUGUSTUS.3           206                            94
EXOGEAN.3            154                            84
PAIRAGON+N-SCAN.1    151                            128
PAIRAGON+N-SCAN.3    151                            128
JIGSAW.1             88                             47
ENSEMBL.3            86                             54
DOGFISH-CE.7         1                              1
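The RT-PCR confirmation rate quoted on this slide follows directly from the two counts given. A minimal check, with illustrative variable names:

```python
# Counts from the slide: exon pairs tested by RT-PCR and those confirmed.
tested_pairs = 238
confirmed_pairs = 7

confirmation_rate = confirmed_pairs / tested_pairs
print(f"{100 * confirmation_rate:.1f}% of tested exon pairs confirmed")
```

The rate comes out just under 3%, consistent with the "less than 3%" stated above.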


• Long-range regulatory elements (enhancers, repressors/silencers, insulators)
• Cis-regulatory elements (promoters, transcription factor binding sites)
• DNA Replication
• DNase Hypersensitive Sites
• Genes and Transcripts
• Epigenetic


TRANSCRIPTION OF PROCESSED POLY A+ RNA

based on a number of high-throughput technologies

                                  Nb of nucleotides covered   % nucleotides covered
Annotated exons                   1,650,821                   9.8%
transfrag/tar                     1,278,588                   9.3%
CAGE Tags*                        151,149                     0.5%
Ditags*                           24,939                      0.1%
TOTAL UNIQUE Transcribed Bases    2,355,238                   14.7%

Total # of nucleotides: 29,998,060

non repeat masked: 14,707,189

Table 1: Summary of Transcriptional Coverage of ENCODE Regions.

Column                               INTERROGATED         TOTAL (interrogated and uninterrogated)
1  Total Bases                       29998060             29998060
2  Total Interrogated Bases          14707189             14707189
   %                                 49.1                 49.1

PROCESSED TRANSCRIPTS (PT)
3  bp in Exons (%)*                  1447192 (9.8%)       1776157 (5.9%)
4  bp in CAGE tags (%)*              116013 (0.8%)        151149 (0.5%)
5  bp in PET (%)*                    19629 (0.1%)         24939 (0.1%)
6  bp in TF (%)*                     1369304 (9.3%)       1369611 (4.6%)
7  Total Bases in PT (%)*            2163303 (14.7%)      2519280 (8.4%)
8  Bases in PT (ESTs included) (%)*  3545358 (24.1%)      4826292 (16.1%)

PRIMARY TRANSCRIPTS
9  Bases in Exons and Introns (%)*   9496360 (64.6%)      17758738 (59.2%)
10 Bases with 5' RACE (%)*           11763410 (80.0%)     23318182 (77.7%)
11 Bases between PETs (%)*           9767311 (66.4%)      19658563 (65.5%)
12 Total Bases (%)*                  13618240 (92.6%)     27325931 (91.1%)
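The percentages in Table 1 appear to use different denominators for the two rows: interrogated bases (14,707,189) for the INTERROGATED row and all bases (29,998,060) for the TOTAL row. That reading is an inference from the numbers rather than something the slide states. The sketch below recomputes three of the columns under that assumption; the dictionary keys are illustrative labels.

```python
TOTAL_BASES = 29_998_060
INTERROGATED_BASES = 14_707_189

# (bases covered in the INTERROGATED row, bases covered in the TOTAL row)
covered = {
    "bp in exons": (1_447_192, 1_776_157),
    "total bases in PT": (2_163_303, 2_519_280),
    "total bases, primary transcripts": (13_618_240, 27_325_931),
}


def percent(numerator, denominator):
    """Coverage as a percentage, rounded to one decimal as in the table."""
    return round(100 * numerator / denominator, 1)


coverage = {
    name: (percent(interrogated, INTERROGATED_BASES), percent(total, TOTAL_BASES))
    for name, (interrogated, total) in covered.items()
}

for name, (pct_interrogated, pct_total) in coverage.items():
    print(f"{name}: {pct_interrogated}% interrogated, {pct_total}% total")
```

With these denominators the recomputed values match the table, including the 92.6% and 91.1% behind the statement that more than 90% of the ENCODE regions are covered by primary transcripts.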



tiling arrays reveal many novel sites of transcription

TRANSCRIPTION MAP of HL-60 DEVELOPMENTAL TIME COURSE (data by Tom Gingeras, Affymetrix)


characteristics of unannotated transfrags

• short: 78 bp on average, compared with 121 bp for exonic transfrags
• very GC-rich: 56% vs 42% in the background of unannotated regions
• lack splice sites
• no matches to protein or domain databases
• lack of selective constraints

HOWEVER:
• reproducible across cell lines
• supported by independent evidence of transcription (mostly unspliced ESTs)
• enriched for RNA structures
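The GC-richness comparison above is a base-composition ratio. A minimal sketch of how such a figure is computed for a sequence; the function name and the toy sequences are hypothetical, and ignoring ambiguous bases such as N is an assumption of this sketch rather than something the slide specifies.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    sequence = sequence.upper()
    gc = sum(sequence.count(base) for base in "GC")
    at = sum(sequence.count(base) for base in "AT")
    return gc / (gc + at) if (gc + at) else 0.0


# Toy sequences, not real transfrags.
print(gc_fraction("ACGT"))        # half of the bases are G or C
print(gc_fraction("GGCCNNGGCC"))  # N is ignored, every counted base is G or C
```

Averaging this fraction over all unannotated transfrags and over the surrounding unannotated sequence would give the two figures being compared.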


the RACE/array experiments

• 5’ RACE on 12 tissues
• primers in internal exons of 399 protein coding loci
• RACE products hybridized onto genome tiling arrays
– 4573 RACE exons detected, 2324 novel

Denoeud et al., “Prominent use of distal 5’ transcription start sites and discovery of a large number of additional exons in ENCODE regions”, accepted for publication, Genome Research


Target gene

5’ RACE from TMEM15 Gene (region Enr232) identifies several tissue specific distal 5’ exons.


distal RACEfrags are associated with independently predicted sites of transcription initiation


cloning and sequencing of RACEarray products


almost 30% of the sequenced products incorporate exons from upstream genes in chimeric structures


RT-PCR/arrays, cloning and sequencing

• 136 novel transcripts (29 chimeric) in 69 loci
• 71 potential new CDS in 37 loci (14 chimeric)
• 225 novel exons


CONCLUSIONS

• there is a substantial amount of transcription which does not appear to be associated with protein coding loci

• only a fraction of the transcript diversity of protein coding loci appears to have been surveyed so far.
– in particular, protein coding loci appear to have tissue-specific distal alternative transcriptional start sites

• ENCODE transcriptional landscape: a network of overlapping coding and non-coding transcripts, resulting in a continuum of transcription (more than 90% of the ENCODE regions are transcribed in at least one strand)


ACKNOWLEDGEMENTS

ENCODE GT GROUP

Stylianos Antonarakis (Geneva); Robert Baertsch (UCSC); Ian Bell (Affx); Ewan Birney (EBI); Robert Castelo (IMIM); Jill Cheng (Affx); Evelyn Cheung (Affx); Hiram Clawson (UCSC); France Denoeud (IMIM); Sujit Dike (Affymetrix); Jorg Drenkow (Affymetrix); Olof Emanuelsson (Yale); Paul Flicek (Sanger); Mark Gerstein (Yale); Srinka Ghosh (Affx); Jenn Harrow (Sanger); Greg Helt (Affx); Ivo Hofacker (U. Vienna); Tim Hubbard (Sanger); Phil Kapranov (Affx); Damian Keefe (EBI)

Jan Korbel (Yale); Julien Lagarde (IMIM); Jeff Long (Affx); Todd Lowe (UCSC); G. Madhavan (Affx); Anton Nekrutenko (Penn State); David Nix (Affx); Jakob Pedersen (UCSC); Alex Reymond (Geneva); Joel Rozowsky (Yale); Yijun Ruan (GIS); Albin Sandelin (RIKEN); Mike Snyder (Yale); Peter F. Stadler (U. Vienna); Kevin Struhl (Harvard); Hari Tammana (Affx); Scott Tenenbaum (SUNY, Albany); Chia Lin Wei (GIS); Matt Weirauch (UCSC); Deyou Zheng (Yale); Adam Frankish (Sanger)

Tom Gingeras (Affymetrix); Roderic Guigó (CRG)
