Page 1: Gencode Mar '10 Meeting: Pseudogene Project Update (lectures.gersteinlab.org/ppt/Gencode-20100309-Pseudogenes/Genco…)

Do not reproduce without permission

Gencode Mar '10 Meeting:

Pseudogene Project Update

Mark Gerstein

Illustration from Gerstein & Zheng (2006). Sci Am.

Page 2

Overall Flow: Pipeline Runs, Coherent Sets, Annotation, Transfer to Sanger

•  Overall Approach

1.  Overall Pipeline runs at Yale and UCSC, yielding raw pseudogenes

2.  Extraction of coherent subsets for further analysis and annotation

3.  Passing to Sanger for detailed manual analysis and curation

4.  Incorporation into final GENCODE annotation

5.  Pipeline modification

•  Chronology of Sets

1.  Encode Pilot 1%

2.  Ribosomal Protein pseudogenes

3.  Unitary pseudogenes (Hard)

4.  Glycolytic Pseudogenes

5.  Polymorphic Pseudogenes

6.  Pseudogenes Associated with SDs

Page 3

Specific Pseudogene Assignments: Glycolytic Pseudogenes (completed)

Page 4

Number of pseudogenes for each glycolytic enzyme (Processed/Duplicated)

[Liu et al. BMC Genomics ('09)]

GAPDH

Processed GAPDH pseudogenes in mammals form one of the largest pseudogene families, but their numbers are not obviously correlated with mRNA abundance.

Page 5

Number of pseudogenes for each glycolytic enzyme (Processed/Duplicated)

[Liu et al. BMC Genomics ('09)]

GAPDH: 60 Proc/2 Dup

Page 6

Distribution of human GAPDH pseudogenes

[Liu et al. BMC Genomics ('09, in press)]

60 Proc/2 Dup

Page 7

Age calculated based on the Kimura two-parameter model of nucleotide substitution

Approximate Age of GAPDH pseudogenes

[Liu et al. BMC Genomics ('09)]
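A minimal sketch of the Kimura two-parameter (K2P) distance that the age estimate rests on, assuming two already-aligned sequences of equal length. The function name `k2p_distance` and the toy sequences are hypothetical; the slide does not give the substitution rate used to convert distance into age, so that step is left out.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance (substitutions per site).

    Assumes seq1 and seq2 are aligned and of equal length.
    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Toy example (made-up sequences): one transition in ten sites.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

An age would follow by dividing such a distance by an assumed neutral substitution rate, which has to come from the cited paper rather than from this sketch.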

Page 8

Synteny of GAPDH pseudogenes

Synteny derived from local gene orthology

[Liu et al. BMC Genomics ('09)]

Page 9

Specific Pseudogene Assignments: Unitary Pseudogenes (completed)

Page 10

Pseudogenes

▪  Pseudogenes: nongenic DNA segments with high sequence similarity to functional genes

▪  Unitary pseudogenes: unprocessed pseudogenes with no functional counterparts

Figure: pseudogene types and how they arise

▪  Processed pseudogenes: Transcription, Transposition

▪  Duplicated pseudogenes: Duplication

▪  Unitary pseudogenes: In situ pseudogenization

Page 11

Identification pipeline

~16k human-mouse orthologs

~23k mouse proteins

~6k mouse proteins without human orthologs

~600 candidate human unitary pseudogene loci

76 human unitary pseudogenes

[Zhang et al. GenomeBiology (in press, '10)]
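The first narrowing step in the counts above (mouse proteins, minus those with a human ortholog) can be sketched as a set difference. This is only an illustration under assumptions: the slide lists counts, not the procedure, and the function name and identifiers below are made up. The later steps (mapping to candidate human loci and reducing ~600 candidates to 76) are not described on the slide and are not modelled here.

```python
def mouse_proteins_without_human_ortholog(mouse_proteins, ortholog_pairs):
    """Return mouse protein IDs that have no listed human ortholog.

    mouse_proteins: iterable of mouse protein IDs.
    ortholog_pairs: iterable of (mouse_id, human_id) tuples.
    """
    with_ortholog = {mouse_id for mouse_id, _human_id in ortholog_pairs}
    return sorted(set(mouse_proteins) - with_ortholog)


# Toy, made-up identifiers standing in for the ~23k proteins / ~16k orthologs.
mouse_proteins = ["mP1", "mP2", "mP3", "mP4"]
ortholog_pairs = [("mP1", "hP1"), ("mP3", "hP7")]

queries = mouse_proteins_without_human_ortholog(mouse_proteins, ortholog_pairs)
```

In this reading, `queries` would be the proteins searched against the human genome for remnants of a lost gene.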

Page 12

Relativity of unitary pseudogenes

[Zhang et al. GenomeBiology (in press, '10)]

Page 13

Unitary Pseudogene Families

Page 14

Dating the pseudogenization events

Page 15

Specific Pseudogene Assignments: Polymorphic Pseudogenes (in process)

Page 16

11 Polymorphic Pseudogenes

Page 17

Polymorphic pseudogenes (3 with allele frequency data)

3 SNPs not found to be under recent positive selection...

[Zhang et al. GenomeBiology (in press, '10)]

Page 18

Fst hierarchical clustering for rs4940595 in SERPINB11

...but population structure at rs4940595 (the difference in allele frequencies between populations) could be the result of different selective regimes acting on the same allele in different population subdivisions.
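As an illustration of the quantity being clustered, here is a minimal sketch of pairwise Fst for a single biallelic SNP, using the simple heterozygosity form Fst = (H_T - H_S) / H_T with equally weighted populations. The estimator actually used for the slide is not stated, and the population names and allele frequencies below are made up, not data for rs4940595.

```python
from itertools import combinations


def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) at a biallelic site."""
    return 2.0 * p * (1.0 - p)


def pairwise_fst(p1: float, p2: float) -> float:
    """Fst between two populations from their allele frequencies.

    Assumes equal population weights; returns 0.0 when the site is
    monomorphic across both populations (H_T == 0).
    """
    p_bar = (p1 + p2) / 2.0
    h_t = heterozygosity(p_bar)                            # pooled
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # within-population mean
    if h_t == 0.0:
        return 0.0
    return (h_t - h_s) / h_t


def fst_matrix(freqs: dict) -> dict:
    """Pairwise Fst for every pair of populations, keyed by sorted name pair."""
    return {
        (a, b): pairwise_fst(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


# Hypothetical allele frequencies, for illustration only.
freqs = {"pop_A": 0.2, "pop_B": 0.8, "pop_C": 0.2}
distances = fst_matrix(freqs)
```

A matrix like `distances` is the kind of input a hierarchical clustering routine would take; populations with similar allele frequencies (here `pop_A` and `pop_C`) end up at distance zero.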

Page 19

Specific Pseudogene Assignments: SD-associated Pseudogenes (in process)

Page 20

Segmental duplications (SDs)

•  Regions of the genome with ≥ 90% sequence identity and ≥ 1 kb in length

•  Based on neutral divergence, correspond to the last ~40 million years of human evolution

•  Comprise ~5-6% of the human genome

•  Enriched with genes (~18%) and pseudogenes (duplicated ~45%, processed ~22%)
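The SD definition above (≥ 90% identity, ≥ 1 kb) amounts to a two-threshold filter over pairwise genomic alignments. A minimal sketch, assuming each alignment is summarised by coordinates and a fractional identity; the `Alignment` record and its field names are hypothetical and the example rows are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One pairwise genomic self-alignment (hypothetical record layout)."""
    chrom: str
    start: int       # 0-based start
    end: int         # exclusive end
    identity: float  # fraction of identical aligned bases, 0.0 to 1.0

    @property
    def length(self) -> int:
        return self.end - self.start


def is_segmental_duplication(aln: Alignment,
                             min_identity: float = 0.90,
                             min_length: int = 1000) -> bool:
    """Apply the thresholds from the slide: >= 90% identity and >= 1 kb."""
    return aln.identity >= min_identity and aln.length >= min_length


# Made-up alignments for illustration.
alignments = [
    Alignment("chr1", 10_000, 12_500, 0.95),  # long and similar: kept
    Alignment("chr1", 50_000, 50_400, 0.99),  # too short
    Alignment("chr2", 70_000, 75_000, 0.85),  # too divergent
]
sds = [a for a in alignments if is_segmental_duplication(a)]
```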

Bailey et al., Science, 2002

Can the study of ψgenes in SDs provide information not obvious from an individual dataset?

Page 21

Nucleotide substitutions in ψgenes and SDs containing them

Most ψgenes show the same number of substitutions as the larger SD region containing them

- Duplication accompanied by disablement

- Followed by neutral rate of evolution

K2m: nucleotide substitutions per site, computed using Kimura’s two-parameter model

Figure labels: Parent gene; Duplicated ψgene

Page 22

Acknowledgements

Z Zhang, E Khurana, Y J Liu, YK Lam, S Balasubramanian, G Fang, N Carriero, R Robilotto, P Cayting, M Wilson, A Frankish, M Diekhans, R Harte, T Hubbard, J Harrow

Pseudogene.org