the human reference assembly
DESCRIPTION
The Human Reference Assembly. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch. Updating the assembly. Oh No! Not a new version of the human genome! GRCh37.p13 (160 regions: >3% of chromosomes).
![Page 1: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/1.jpg)
The Human Reference Assembly
Deanna M. Church, Staff Scientist, NCBI
@deannachurch
Short Course in Medical Genetics 2013
Updating the assembly
![Page 2: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/2.jpg)
Oh No! Not a new version of the human genome!
Updating the assembly
![Page 3: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/3.jpg)
Updating the assembly
![Page 4: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/4.jpg)
120 Fix PATCHES: Chromosome update in GRCh38
71 Novel PATCHES: Additional sequence added
(adds >5 Mb of novel sequence to the assembly)
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
GRCh37.p13 (160 regions: >3% of chromosomes)
Summer of 2013
![Page 5: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/5.jpg)
Assembly (e.g. GRCh37.p5)
Primary Assembly
Non-nuclear assembly unit (e.g. MT)
ALT 1, ALT 2, ALT 3, ALT 4, ALT 5, ALT 6, ALT 7, ALT 8, ALT 9
PAR
…
Patches
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)
Data Model
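A minimal sketch of the data model on this slide, written as Python dataclasses. The class and field names (`Assembly`, `AssemblyUnit`, `GenomicRegion`, `kind`) are illustrative assumptions, not NCBI's schema. The slide does not say which assembly units carry sequence for which genomic region, so that mapping is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class AssemblyUnit:
    name: str  # e.g. "Primary Assembly", "ALT 1", "Patches"
    kind: str  # hypothetical labels: "primary", "alt", "patch", "non-nuclear"


@dataclass
class GenomicRegion:
    name: str  # e.g. "MHC"; unit membership is not given on the slide


@dataclass
class Assembly:
    name: str
    units: list = field(default_factory=list)
    regions: list = field(default_factory=list)

    def units_of_kind(self, kind):
        """Return the names of all assembly units with the given kind."""
        return [u.name for u in self.units if u.kind == kind]


def build_example():
    """Build the assembly drawn on the slide (GRCh37.p5 as the example)."""
    units = [
        AssemblyUnit("Primary Assembly", "primary"),
        AssemblyUnit("MT", "non-nuclear"),
        AssemblyUnit("Patches", "patch"),
    ]
    units += [AssemblyUnit(f"ALT {i}", "alt") for i in range(1, 10)]
    regions = [
        GenomicRegion(n)
        for n in ("MHC", "UGT2B17", "MAPT", "ABO", "SMA", "PECAM1")
    ]
    return Assembly("GRCh37.p5", units, regions)
```

The point of the sketch is that an assembly is a container of several units rather than one flat list of chromosomes; `build_example().units_of_kind("alt")` lists the nine alternate-locus units.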
![Page 6: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/6.jpg)
Assembly: GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
Data Model
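Each unit on this slide carries a GenBank (GCA) and a RefSeq (GCF) accession, each with its own version number. The following is a small sketch of how such accession pairs could be parsed; the function and type names are hypothetical, and the pattern assumes the nine-digit form shown on the slide.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    source: str   # "GenBank" for GCA, "RefSeq" for GCF
    number: str   # nine-digit identifier, kept as text to preserve zeros
    version: int  # the part after the dot


_ACC_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_accession(text):
    """Parse one accession.version string such as 'GCF_000001405.17'."""
    match = _ACC_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    source = "GenBank" if prefix == "GCA" else "RefSeq"
    return Accession(source, number, int(version))


def parse_pair(text):
    """Extract every accession from a 'GCA_.../GCF_...' string.

    The separator is optional, so fused text from a slide export
    (no slash between the two accessions) is handled as well.
    """
    found = re.findall(r"GC[AF]_\d{9}\.\d+", text)
    return [parse_accession(acc) for acc in found]
```

For example, `parse_pair("GCA_000001405.6 /GCF_000001405.17")` yields a GenBank accession at version 6 and a RefSeq accession at version 17, which shows that the two databases version the same assembly independently.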
![Page 7: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/7.jpg)
GRCh38 is coming (September, 2013)
![Page 8: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/8.jpg)
![Page 9: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/9.jpg)
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
![Page 10: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/10.jpg)
![Page 11: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/11.jpg)
![Page 12: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/12.jpg)
http://genomereference.org
![Page 13: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/13.jpg)
Why does missing sequence matter?
GRCh37: Duplicon A
Sample genome: Duplicon A, Duplicon B
May or may not detect increased coverage, depending on sequencing depth and library quality (easier to find with new technologies than with old, low-throughput technologies)
G>A (allelic difference: true variant)
G>C (paralogous sequence variant: false positive)
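A toy sketch of the effect on this slide: when the reference holds only one copy of a duplication, reads from the missing copy are forced onto the copy that is present, and their paralogous differences look like variants. The sequences, the names (`DUPLICON_A`, `call_variants`), and the use of one full-length "read" per copy are all simplifying assumptions; a real aligner and caller work very differently.

```python
def mismatches(read, ref_seq):
    """List (position, ref_base, read_base) for every differing base."""
    return [(i, r, b) for i, (r, b) in enumerate(zip(ref_seq, read)) if r != b]


def call_variants(reads, reference):
    """Place each read on its best-matching contig and report mismatches."""
    calls = set()
    for read in reads:
        # "Alignment" here is only a choice of contig with fewest mismatches.
        best = min(reference, key=lambda n: len(mismatches(read, reference[n])))
        for pos, ref_base, alt_base in mismatches(read, reference[best]):
            calls.add((best, pos, ref_base, alt_base))
    return sorted(calls)


def substitute(seq, pos, base):
    """Return seq with the base at pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]


# Made-up 30 bp duplicon; copy B differs from copy A at one position (G>C).
DUPLICON_A = "ACGTTGCAAGCTTAGGCTAACGTGACCTGA"
DUPLICON_B = substitute(DUPLICON_A, 21, "C")

# The sample has both copies; its copy A carries a true allelic G>A.
SAMPLE_READS = [substitute(DUPLICON_A, 9, "A"), DUPLICON_B]

# Reference with copy B missing versus a reference that includes it.
INCOMPLETE_REF = {"dupA": DUPLICON_A}
COMPLETE_REF = {"dupA": DUPLICON_A, "dupB": DUPLICON_B}
```

Against `INCOMPLETE_REF`, `call_variants(SAMPLE_READS, INCOMPLETE_REF)` reports both the G>A and the G>C on `dupA`; against `COMPLETE_REF`, the read from copy B lands on `dupB` and only the true G>A remains.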
![Page 14: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/14.jpg)
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
CDC27
1KG Phase 1 Strict accessibility mask
SNP (all)
SNP (not 1KG)
![Page 15: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/15.jpg)
Sudmant et al., 2010
![Page 16: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/16.jpg)
Kidd et al., 2007
APOBEC cluster
Part of chr22 assembly
Alternate locus for chr22
White: Insertion
Black: Deletion
![Page 17: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/17.jpg)
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
![Page 18: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/18.jpg)
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
129S6/SvEvTac tiling path
Alignment to C57BL/6J chr1
B6 Genes
129S6/SvEvTac Genes
+ 32Kb in 129S6/SvEvTac
![Page 19: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/19.jpg)
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
129S6/SvEvTac Alt Locus Alignment (allelic)
FVB/N Transcript Alignment (paralog)
![Page 20: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/20.jpg)
129S6/SvEvTac Ren1
FVB Ren2 Tx
Paralogous diff
SNP + Paralogous diff
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
![Page 21: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/21.jpg)
Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36
Unlocalized in GRCh37
Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID
Alignment to Hydin1 CHM1_1.0, >99.9% ID
Doggett et al., 2006
![Page 22: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/22.jpg)
Dennis et al., 2012
1q32 1q21 1p21
1p21 patch alignment to chromosome 1
![Page 23: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/23.jpg)
Preview of GRCh38 (scheduled Fall 2013)
TEX28 TKTL1
LOC101060233 (opsin related)
LOC101060234 (TEX28 related)
GRCh37 (current reference assembly), chrX
![Page 24: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/24.jpg)
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-21
NCBI36 (hg18)
GRCh37 (hg19)
![Page 25: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/25.jpg)
NCBI35 (hg17)
GRCh37 (hg19)
AL139246.20
AL139246.21
![Page 26: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/26.jpg)
Fixing Rare/Incorrect Bases
![Page 27: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/27.jpg)
Fixing Rare/Incorrect Bases
![Page 28: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/28.jpg)
GRCh37B Sites for Update: n=1164
Sites with unique successful ctg: 1148 (98.6%)
Avg Length: 448 bp
Min/Max Success Length: 51/791 bp
Avg Coverage: 80x
Read Source (all contigs)
High coverage: 32%
Low coverage: 57%
Exome: 10%
Fixing Rare/Incorrect Bases
![Page 29: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/29.jpg)
A = 0.000, G = 1.000
rs4732519
![Page 30: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/30.jpg)
rs4732519
RP11 WGS reads
Private RP11 variant?
Missing in 1000G?
![Page 31: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/31.jpg)
FAM23_MRC1 Region, chr10
Segmental Duplications
1KG accessibility Mask
Novel Patch 250 kb of artificial duplication
![Page 32: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/32.jpg)
Genovese et al., 2013
![Page 33: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/33.jpg)
Adding Novel Sequence
![Page 34: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/34.jpg)
Adding Novel Sequence
Karen Hayden and Jim Kent
![Page 35: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/35.jpg)
Human Resolved for GRCh38
http://genomereference.org
![Page 36: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/36.jpg)
Richa Agarwala
MHC Alternate locus
Alignment to chr6
![Page 37: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/37.jpg)
![Page 38: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/38.jpg)
Making the assembly accessible to existing tools: masking
Query set: 439,109,084 NA12878 HiSeq reads
![Page 39: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/39.jpg)
Masking effectively blocks alignments in regions with high identity
Simulated reads from GRCh37.p9
• Unpaired reads
• 101 bp
• 1x coverage
• Default wgsim parameters
Masking parameters
• Percent Id: 100%
• Step size: 5 bp
• Minimum length: 101 bp
• Center SNPs in unmasked regions
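The slide lists the masking parameters but not the procedure, so the following is only one plausible reading, not the method actually used. It assumes the alt or patch sequence is already aligned base-for-base (ungapped) to the matching primary sequence, slides a window of the minimum length along it in fixed steps, and replaces any window that is 100% identical with `N`. The function name and signature are hypothetical, and the "center SNPs in unmasked regions" step is not implemented here.

```python
def mask_identical_windows(alt, primary, min_len=101, step=5):
    """Mask windows of `alt` that are 100% identical to `primary`.

    Assumes both sequences have equal length and are aligned position by
    position. Windows of `min_len` bases are tested every `step` bases;
    every base in an identical window is replaced with 'N'.
    """
    if len(alt) != len(primary):
        raise ValueError("sequences must be aligned to equal length")
    masked = [False] * len(alt)
    for start in range(0, len(alt) - min_len + 1, step):
        end = start + min_len
        if alt[start:end] == primary[start:end]:
            for i in range(start, end):
                masked[i] = True
    return "".join("N" if m else base for base, m in zip(alt, masked))
```

With the defaults taken from the slide, a 101 bp read drawn entirely from a masked stretch can no longer align to the alt sequence, so it aligns only to the primary assembly, while bases near a difference stay unmasked and remain available as an alignment target.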
![Page 40: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/40.jpg)
Masking improves alignments in regions with alternate loci or patches
![Page 41: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/41.jpg)
NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone
Masking effectively reduces the increase in NA12878 reads that have alignments with MAPQ=0 that occurs when the full assembly is used as an alignment substrate
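The MAPQ=0 comparison above can be reproduced on any alignment output in SAM format. This sketch counts the fraction of aligned reads with MAPQ 0; it relies only on the SAM column layout (FLAG in column 2, MAPQ in column 5, flag bit 0x4 for unmapped reads), and the function name is an assumption. It is not the pipeline used for the slide.

```python
def mapq_zero_fraction(sam_lines):
    """Return the fraction of aligned SAM records with MAPQ == 0.

    Header lines (starting with '@') and unmapped records (FLAG bit 0x4)
    are skipped. Returns 0.0 when there are no aligned records.
    """
    aligned = 0
    zero = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # unmapped read: MAPQ is not meaningful
        aligned += 1
        if mapq == 0:
            zero += 1
    return zero / aligned if aligned else 0.0
```

Running it once on alignments to the primary assembly alone, once on the full assembly, and once on the masked full assembly gives the three numbers the slide compares.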
![Page 42: The Human Reference Assembly](https://reader036.vdocuments.mx/reader036/viewer/2022062812/56816274550346895dd2e580/html5/thumbnails/42.jpg)
Take home messages
The assembly you use for analysis is an important part of your analysis package.
The reference assembly is not a set of linear sequences but can now represent allelic diversity.
Tools still need to catch up.
The human reference assembly is updating soon! (Remember: assemblies are not static if you are lucky!)