church emory2013
Seminar at Emory, Sep 2013
Deanna M. Church Staff Scientist, NCBI
@deannachurch
The intersection of genome assembly and
variation management.
http://genomereference.org
Valerie Schneider, NCBI
Variation Resources Team at NCBI
Ming Ward, Lon Phan, Brad Holmes, Anna Glodek, Michael Kholodov, Rama Maiti, Juliana Sampson, David Shao, Eugene Shekhtman, Qiang Wang, Hua Zhang
Donna Maglott, Melissa Landrum, Jennifer Lee, George Riley, Ray Tully, Craig Wallin, Shanmuga Chitipiralla, Douglas Hoffman, Wonhee Jang, Ken Katz, Michael Ovetsky, Ricardo Villamarin
Tim Hefferon, John Lopez, John Garner, Chao Chen
Learning Objectives
Why the reference assembly matters for your analysis
How the reference assembly is changing
Tools and Resources to find data
Why should you care about the Reference Assembly?
Genes, NCBI Homo sapiens Annotation Release 105
Transcript
CDS
dbSNP Build 138 using annotation release 104
http://www.bioplanet.com/gcat
What is the Reference Assembly?
An assembly is a MODEL of the genome
BAC insert
BAC vector
Shotgun sequence
Assemble
GAPS
“finishers” go in to manually fill the gaps, often by PCR
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321
RP11-34P13 64E8 RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7
Gaps
NCBI36 (hg18)
GRCh37 (hg19)
NCBI35 (hg17)
GRCh37 (hg19)
AL139246.20
AL139246.21
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
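The contig-building steps above can be sketched roughly as follows. This is a toy illustration, not the GRC pipeline: the `Component` record, the `build_contig` helper, and the idea of expressing each switch point as a pair of offsets into the oriented component are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One tiling-path entry (hypothetical structure, for illustration only)."""
    accession: str    # clone accession.version, e.g. "AL139246.21"
    sequence: str     # sequence as submitted
    orientation: str  # "+" or "-" relative to the contig
    start: int        # 0-based offset, after orienting, where this component takes over
    end: int          # exclusive offset where the contig switches to the next component


def build_contig(tiling_path: List[Component]) -> str:
    """Instantiate one contig sequence from an ordered tiling path."""
    parts = []
    for comp in tiling_path:
        # Orientation consistency check: only "+" and "-" are meaningful here.
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"{comp.accession}: bad orientation {comp.orientation!r}")
        seq = comp.sequence if comp.orientation == "+" else reverse_complement(comp.sequence)
        # Switch points must fall inside the oriented component.
        if not 0 <= comp.start <= comp.end <= len(seq):
            raise ValueError(f"{comp.accession}: switch points out of range")
        parts.append(seq[comp.start:comp.end])
    return "".join(parts)
```

In this sketch the overlap between neighbouring clones is resolved entirely by the chosen switch points: each base of the contig comes from exactly one component.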
Switch point
Consensus sequence
NCBI36
nsv832911 (nstd68) Submitted on NCBI35 (hg17)
NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb distal on chr15
NC_000015.8 (chr15)
NC_000015.9 (chr15)
Removed from assembly
Added to assembly
HG-24
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
AC074378.4 AC079749.5
AC134921.2 AC147055.2
AC140484.1 AC019173.4
AC093720.2 AC021146.7
NCBI36 NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E TMPRSS11E2
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4 AC079749.5
AC134921.1 AC147055.2
AC093720.2 AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4 AC140484.1
AC019173.4 AC226496.2
AC021146.7
TMPRSS11E2
nsv532126 (nstd37)
GRCh37 (hg19)
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
UGT2B17 MHC MAPT
MHC (chr6)
Chr 6 representation (PGF)
Alt_Ref_Locus_2 (COX)
Data management and the Reference Assembly?
NC_000086.123456 CM001013.17 2
Mouse chrX: 34,800,000-34,890,000
Mouse chrX: 35,000,000-36,000,000
X
MGSCv3 MGSCv36
ABC14-1065514J1 Gaps Phase Length Date
FP565796.1 1 1 21-Oct-2009
FP565796.2 1 0 14-Oct-2010
FP565796.3 3 0 07-Nov-2010
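The accession.version identifiers on these slides (AL139246.20 vs AL139246.21, FP565796.1 through FP565796.3) carry the point that the same record can hold different sequence over time. A minimal sketch of keeping the version attached to the accession is shown below; the regular expression is a simplified assumption that covers the identifiers seen in this talk, not the full INSDC or RefSeq accession grammar, and the function names are made up for the example.

```python
import re
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


# Simplified pattern (an assumption): 1-4 capital letters, an optional
# underscore as in RefSeq prefixes, digits, then ".version".
_ACC_VER = re.compile(r"^([A-Z]{1,4}_?\d+)\.(\d+)$")


def parse_accession(text: str) -> VersionedAccession:
    """Split 'AL139246.21' into ('AL139246', 21); reject unversioned input."""
    match = _ACC_VER.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedAccession(match.group(1), int(match.group(2)))


def same_record_different_sequence(a: str, b: str) -> bool:
    """True when two identifiers name the same record at different versions."""
    first, second = parse_accession(a), parse_accession(b)
    return first.accession == second.accession and first.version != second.version
```

Comparing versions as integers rather than strings matters once a record passes version 9, as AL139246 has.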
hg19 GRCh37
mm8 MGSCv37
NCBIM37
danRer5 Zv7
chr21:8,913,216-9,246,964
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
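A bare coordinate string such as chr21:8,913,216-9,246,964 says nothing about which assembly, or even which organism, it refers to. One way to guard against that, sketched here under stated assumptions, is to refuse any region that is not qualified with an assembly name. The `parse_region` helper and the `UCSC_TO_GRC` table are hypothetical; the table holds only the human and zebrafish name pairs that appear on these slides.

```python
import re

# UCSC-style name -> assembly name, limited to pairs shown in this talk.
UCSC_TO_GRC = {
    "hg17": "NCBI35",
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "danRer5": "Zv7",
}

_REGION = re.compile(
    r"^(?:(?P<assembly>\S+)\s+)?(?P<chrom>chr\w+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$"
)


def parse_region(text: str):
    """Parse '<assembly> chrN:start-end' into (assembly, chrom, start, end)."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    if match.group("assembly") is None:
        # Coordinates without an assembly are ambiguous, so reject them.
        raise ValueError(f"region has no assembly name: {text!r}")
    name = match.group("assembly")
    assembly = UCSC_TO_GRC.get(name, name)  # normalise known aliases
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end: {text!r}")
    return assembly, match.group("chrom"), start, end
```

For example, `parse_region("danRer5 chr21:8,913,216-9,246,964")` and `parse_region("Zv7 chr21:8,913,216-9,246,964")` give the same result, while the unqualified string raises `ValueError`.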
http://www.ncbi.nlm.nih.gov/genome/assembly
GenBank vs RefSeq
GenBank            RefSeq
Submitter Owned    RefSeq Owned
Redundancy         Non-Redundant
Updated rarely     Curated
INSDC              Not INSDC
BRCA1
83 genomic records, 31 mRNA records, 27 protein records
3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
http://www.ncbi.nlm.nih.gov/refseq/rsg
http://www.lrg-sequence.org/
RefSeq Gene
L R
http://www.ncbi.nlm.nih.gov/genome/tools/remap
From Assembly 1 <-> Assembly 2
Assembly <-> RefSeqGene/LRG
Primary Assembly <-> Alternate loci
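Each of the mappings listed above amounts to projecting a position through an alignment between two sequences. The sketch below shows that idea on ungapped alignment blocks. It is not how NCBI Remap is implemented and does not call it; the `Block` structure, the `remap_position` function, and all coordinates used with them are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """An ungapped aligned segment between a source and a target sequence."""
    src_start: int            # 0-based start on the source sequence
    dst_start: int            # 0-based start on the target sequence
    length: int               # number of aligned bases
    same_strand: bool = True  # False if the target is reverse-oriented


def remap_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Project a 0-based source position onto the target.

    Returns None when the position falls outside every block, for example
    in sequence that exists on only one of the two assemblies.
    """
    for block in blocks:
        offset = pos - block.src_start
        if 0 <= offset < block.length:
            if block.same_strand:
                return block.dst_start + offset
            # Reverse orientation: count back from the end of the target block.
            return block.dst_start + block.length - 1 - offset
    return None
```

Returning `None` rather than a nearest-guess coordinate keeps unmappable positions visible, which is the case that matters when sequence has been added to or removed from an assembly.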
Variant Calling and the Reference Assembly
Kidd et al, 2007 APOBEC cluster
Part of chr22 assembly
Alternate locus for chr22
White: Insertion
Black: Deletion
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)
Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)
Doggett et al., 2006
CDC27
1KG Phase 1 Strict accessibility mask
SNP (all)
SNP (not 1KG)
Sudmant et al., 2010
Issues with the Reference Assembly
Dennis et al., 2012
1q32 1q21 1p21
1p21 patch alignment to chromosome 1
Fixing Rare/Incorrect Bases
Adding Novel Sequence
Karen Miga and Jim Kent arXiv:1307.0035
Preview of GRCh38 (scheduled Fall 2013)
TEX28 TKTL1
LOC101060233(opsin related)
LOC101060234(TEX28 related)
GRCh37 (current reference assembly)
NC_000023.10 (chrX)
NW_003871103.3
FAM23_MRC1 Region, chr10
Segmental Duplications
1KG accessibility Mask
Novel Patch 250 kb of artificial duplication
Adding Novel Sequence
GRCh37.p13: 120 Fix Patches, 60 Novel
Human Resolved for GRCh38
How to identify problem regions in the Reference Assembly
1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm
Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Oct 2013!)
Tiling Path
Sequence Bar
Segmental Duplications, Eichler Lab
1000 Genomes strict accessibility mask
Annotated clone assembly problems
dbSNP Build 138 based on annotation run 104
Model based paralogous sequence differences, NCBI annotation run #
Paralogous/pseudo gene alignments, NCBI annotation run #
Single Unique Nucleotide (SUN) map, Sudmant 2010
ClinVar Long Variations
GRC Curation Issues
ClinVar Short Variations