ashg2015 grc-pruitt
RefSeq curation and annotation of the reference human genome GRCh38
Kim D. Pruitt
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
www.ncbi.nlm.nih.gov/refseq/
RefSeq Background
• RefSeq provides:
  • Human genome annotation
  • Known transcripts & proteins (manually curated)
  • Model transcripts & proteins (annotation pipeline)
• Collaborations:
  • Genome Reference Consortium (GRC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Consensus CDS (CCDS) Collaboration (HAVANA curators)
  • RefSeqGene/Locus Reference Genomic (LRG)/LSDB
RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/
An NCBI project to provide reference sequence standards that incorporate current knowledge.
Archaea – Bacteria – Eukaryotes - Virus
Curation support of genic regions of the reference human assembly
• RefSeqGene and LRG collaboration
  • Genomic and cDNA standards for clinical reporting
  • Report potential issues to the GRC
• Consensus CDS collaboration
  • Stabilized human CDS annotation
  • Report potential issues to the GRC
• RefSeq
  • Curation of genes, transcript & protein records
  • Report potential issues to the GRC
  • Review GRC patch updates for gene annotation impact
Genome annotation leverages curation + computation
Genes:
• Type, location, length
Sequence:
• Accuracy, length
• Alternate splice products
• Functional annotation
Evidence-based genome annotation pipeline:
• Align curated RefSeqs
• Align transcripts, proteins
• Align RNA-Seq
• Filter best alignments
• Build model RefSeqs
• Assign accessions, GeneID
Manual Curation: Sequence - Literature
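To make the "filter best alignments" step concrete, here is a minimal sketch. It is not the NCBI pipeline's method: the record fields, the ranking by (identity, coverage), and all sample values are assumptions chosen for illustration.

```python
from collections import namedtuple

# Hypothetical alignment record; field names and values are illustrative only.
Alignment = namedtuple("Alignment", "query locus identity coverage")


def filter_best(alignments):
    """Keep one alignment per query, ranked by (identity, coverage)."""
    best = {}
    for aln in alignments:
        current = best.get(aln.query)
        if current is None or (aln.identity, aln.coverage) > (
            current.identity,
            current.coverage,
        ):
            best[aln.query] = aln
    return best


# Made-up placements for two transcripts.
alignments = [
    Alignment("tx1", "chrA:100-900", 0.97, 0.99),
    Alignment("tx1", "chrB:400-1200", 0.99, 1.00),
    Alignment("tx2", "chrA:5000-6200", 0.95, 0.90),
]

best = filter_best(alignments)
for query, aln in sorted(best.items()):
    print(query, aln.locus)
```

A real pipeline would weigh more evidence than two numbers; the point is only that each query ends up with a single preferred placement.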
                  Transcripts   Proteins
Known RefSeqs          50,540     39,363
Model RefSeqs         112,735     60,599

Annotated Genes         Count
Protein-coding         20,576
Non-coding             18,037
Pseudogene             12,474
Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
[Chart: number of updates per quarter, 2013 Q1 to 2015 Q3 (axis 0 to 1200); * marks GRCh38, 12/24/2013]
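The transition counts can be put in proportion with a little arithmetic. This sketch assumes the 47,031 / 2,916 / ~600 figures all refer to Known RefSeq transcripts, so that the 50,540 transcript count is the right denominator; the slides do not state this explicitly.

```python
# Counts quoted on the slides. Using the Known RefSeq transcript total as the
# denominator is an assumption made for this illustration.
known_transcripts = 50_540
identical_to_genome = 47_031
retain_difference = 2_916
pending_approx = 600  # the slide gives "~600", so this is approximate


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


accounted = identical_to_genome + retain_difference + pending_approx

print(f"identical to genome: {share(identical_to_genome, known_transcripts):.1f}%")
print(f"retain a difference: {share(retain_difference, known_transcripts):.1f}%")
print(f"pending (approx.):   {share(pending_approx, known_transcripts):.1f}%")
print(f"sum of categories:   {accounted} vs. {known_transcripts} Known transcripts")
```

Under that assumption the three categories sum to within a handful of records of the Known transcript total, which is what one would expect given the approximate "pending" figure.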
Updating RefSeq to match GRCh38
• Post GRCh38 review:
  • NM_173477 updated to match genome (NM_173477.4)
  • Model RefSeq XM_005257026.1 promoted to Known RefSeq
[Figure: alignment to GRCh38 and alignment to GRCh37]
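The accessions above carry both a version suffix and a prefix that distinguishes Known from Model records. A minimal sketch of reading them follows; the helper name `parse_accession` is hypothetical, and only the transcript and protein prefixes are covered.

```python
import re

# Prefix meanings follow the RefSeq accession convention (N* = Known,
# X* = Model). Only transcript and protein prefixes are listed here.
PREFIX_CLASS = {
    "NM": ("known", "mRNA"),
    "NR": ("known", "non-coding RNA"),
    "NP": ("known", "protein"),
    "XM": ("model", "mRNA"),
    "XR": ("model", "non-coding RNA"),
    "XP": ("model", "protein"),
}

ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_accession(text):
    """Split an accession like 'NM_173477.4' into its parts."""
    match = ACCESSION_RE.match(text)
    if match is None or match.group(1) not in PREFIX_CLASS:
        raise ValueError(f"not a recognised RefSeq accession: {text!r}")
    prefix, number, version = match.groups()
    status, molecule = PREFIX_CLASS[prefix]
    return {
        "accession": f"{prefix}_{number}",
        "version": int(version) if version else None,
        "status": status,
        "molecule": molecule,
    }


for acc in ("NM_173477.4", "XM_005257026.1"):
    print(parse_accession(acc))
```

An update to match the genome shows up as a version increment on the same accession, while promotion from Model to Known changes the prefix class.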
RefSeq curation & genome maintenance
GRCh37 issue: SCX duplication, MROH1 split
GRCh38 update: gap closed, MROH1 complete, one SCX gene
[Figure: GRCh38 and GRCh37 views; label: gap]
RefSeq curation & genome maintenance
• POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38
• This maintains the correct reading frame
[Figure: alignment to GRCh38]
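Why a 2 nt difference matters for the reading frame can be shown with a toy example. The sequences below are invented for illustration and are not POLR2A; the sketch only demonstrates that two extra nucleotides shift every downstream codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into complete codons, dropping any trailing 1-2 nt."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def first_stop(seq):
    """Return the index of the first in-frame stop codon, or None."""
    for index, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return index
    return None


# Toy coding sequence (hypothetical): ATG GCC AAA TGA
transcript_cds = "ATGGCCAAATGA"
# The same sequence with 2 extra nucleotides, standing in for a genome
# that differs from the transcript by a 2 nt indel.
genome_like = transcript_cds[:6] + "TG" + transcript_cds[6:]

print(codons(transcript_cds), "first stop at codon", first_stop(transcript_cds))
print(codons(genome_like), "first stop at codon", first_stop(genome_like))
```

In the toy case the two extra nucleotides create an earlier in-frame stop, which is the kind of outcome a retained transcript difference avoids.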
GRCh38 ALT LOCI and PATCHES
Pre-Patch & ALT review
• Polymorphic pseudogenes
• Haplotype & CNV variation
ALT-specific RefSeq records
Curator-stored placement data
Evidence-based genome annotation pipeline
Manual Curation
Assembly-ALT alignments
Alignment quality reports
Subsequent genome annotation build corrects the annotation
Interim alignment updates
Polymorphic pseudogenes
• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
• Curators store assembly placement information (chromosome versus ALT) in a local database
• This is used by annotation pipeline to ensure correct annotation
An example – GSTT cluster on chromosome 22:

Assembly Unit     GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
GRCh38 chr22      null     pseudo   coding   pseudo   null
ALT_REF_LOCI_1    coding   coding   coding   pseudo   pseudo
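The GSTT placement data can be held as a small lookup keyed by assembly unit. This is a sketch of one possible representation, not the curators' actual database; the dictionary layout and the helper `differing_genes` are assumptions, while the status values are taken from the table.

```python
# Status of each GSTT gene per assembly unit, as given on the slide.
# The nested-dictionary layout is an illustrative choice.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null",
        "GSTT2": "pseudo",
        "GSTT2B": "coding",
        "GSTTP1": "pseudo",
        "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding",
        "GSTT2": "coding",
        "GSTT2B": "coding",
        "GSTTP1": "pseudo",
        "GSTTP2": "pseudo",
    },
}


def differing_genes(unit_a, unit_b):
    """List genes whose status differs between two assembly units."""
    a, b = PLACEMENT[unit_a], PLACEMENT[unit_b]
    return sorted(gene for gene in a if a[gene] != b[gene])


for gene in differing_genes("GRCh38 chr22", "ALT_REF_LOCI_1"):
    print(
        gene,
        PLACEMENT["GRCh38 chr22"][gene],
        "->",
        PLACEMENT["ALT_REF_LOCI_1"][gene],
    )
```

Genes returned by `differing_genes` are the ones where annotation must be told which record belongs on the chromosome and which on the ALT locus.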
GSTT* variation, chromosome 22
• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more
• Accurate gene annotation is important to downstream users
[Figure: GSTT region on GRCh38 chr22 and GRCh38 ALT; labels: pseudogene, chr22 = null allele, coding allele]
ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer
GSTT2 polymorphism
[Figure: GSTT2 on GRCh38 chr22 and GRCh38 ALT; labels: AT splice donor, premature stop codon; GT splice donor, stop codon]
Data access
• Genes:
  • <…ncbi root url…>/gene/
  • ftp://ftp.ncbi.nlm.nih.gov/gene/
  • NCBI YouTube ‘Download genomic sequence for a gene’
    • https://www.youtube.com/watch?v=RHz2nZbzjpA
• RefSeq transcripts and proteins:
  • Links from NCBI Gene
  • Nucleotide/protein query:
    • human[organism] + use facets to specify RefSeq and molecule type
  • ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
• NCBI Genome Annotation
  • Links from NCBI Assembly or Genome resources
    • <ncbi>/assembly/ or <ncbi>/genome/
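The nucleotide query above is described for the web interface with facets. As an alternative, not mentioned on the slides, a similar query can be expressed as an NCBI E-utilities `esearch` URL. The sketch below only builds the URL and does not send it; the helper name and the choice of `srcdb_refseq[PROP]` and `biomol_mrna[PROP]` as stand-ins for the RefSeq and molecule-type facets are assumptions.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore", retmax=5):
    """Build an esearch URL for the given Entrez query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Query terms standing in for the web facets (an assumption, see lead-in).
term = "human[organism] AND srcdb_refseq[PROP] AND biomol_mrna[PROP]"
url = build_esearch_url(term)
print(url)
```

The printed URL can be opened in a browser or fetched with any HTTP client; result handling is left out on purpose.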
Genome FTP formats
• FASTA
  • genome, transcripts, proteins
• GenBank file format
  • genome, transcripts, proteins
• GFF genome annotation
• Feature table
  • features and locations in tabular format
• AGP, Assembly details & statistics
• Repeat masker results
• Md5checksums
• Documentation
  • README files
  • <ncbi>/genome/doc/ftpfaq/
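As an illustration of working with the GFF annotation files, here is a minimal parser for one feature line. It assumes the files follow the nine-column, tab-separated GFF3 layout; the sample line, including its source, coordinates, and attribute values, is made up.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical feature line; the values are not taken from a real file.
sample = "NC_000022.11\tExampleSource\tgene\t1000\t5000\t.\t+\t.\tID=gene-EXAMPLE;Name=EXAMPLE"
feature = parse_gff3_line(sample)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Name"])
```

Comment lines starting with `#` and multi-valued attributes are not handled here; a production parser would need both.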
Acknowledgements
RefSeq Curators
Annotation pipeline: Paul Kitts, Terence Murphy, Francoise Thibaud-Nissen
Eric Cox, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Dave Webb, Matt Wright
Susan Hiatt
Collaborators
• Elspeth Bruford (HGNC)
• Jen Harrow (HAVANA)
• Locus-Specific Databases
• Expert databases
• Individual scientists