For Research Use Only. Not for use in diagnostic procedures. © Copyright 2020 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.
Copy-Number Variant Detection with PacBio Long Reads
Aaron M. Wenger, Armin Töpfer, Luke Hickey
Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025
Long Highly Accurate HiFi Reads
Start with high-quality double-stranded DNA
Ligate SMRTbell adapters and size select
Anneal primers and bind DNA polymerase
Circularized DNA is sequenced in repeated passes
The polymerase reads are trimmed of adapters to yield subreads
Consensus is called from subreads
[Figure 1b track labels: HG002 11 kb HiFi reads; HG002 2×250 bp reads; haplotype 1; haplotype 2; STRC]
Variant Detection with HiFi Reads
Figure 1. (a) PacBio Circular Consensus Sequencing produces HiFi reads (ref. 1) that are long (10-25 kb) and accurate (>99%). (b) HiFi reads align to difficult-to-map regions, including exons of the disease gene STRC. Region shown: GRCh37 chr15:43,891,619-43,911,196 (19 kb).
Existing long-read variant detection methods rely on de novo assembly or spanning reads. These methods are effective for SVs but miss many CNVs that involve long segmental duplications.
Copy-Number Variant (CNV) Detection with pbsv
CNVs in HG001 and COLO829T
References
1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 37(10):1155-1162.
2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 37(5):561-566.
3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection. bioRxiv. doi:10.1101/664623. [Preprint]
4. Poplin R et al. (2018). A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 36(10):983-987.
5. https://github.com/PacificBiosciences/pbsv
6. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 10(1):1784.
Thank you to Billy Rowell, David Scherer, and Pamela Bentley Mills for poster production support.
"We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]." – Chaisson et al. 2019 (ref. 6)
We extended the PacBio SV caller, pbsv, to detect CNVs using a combination of read clipping and depth.
pbsv CNV algorithm
1. Determine genome-wide coverage: median coverage of non-gap positions at mapping quality 60
2. Identify candidate CNV breakpoints: positions with multiple clipped reads
3. Evaluate coverage between adjacent candidate breakpoints: calculate z-score vs Poisson expectation
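The three steps can be sketched in a few lines of Python. This is an illustrative sketch only, not the pbsv implementation: the function names, the `min_clipped=2` threshold standing in for "multiple clipped reads", the per-position depth list, and the choice to treat the mean segment depth as a single Poisson observation (standard deviation equal to the square root of the expected coverage) are all assumptions made for the example. The toy depths are made up.

```python
import math
from statistics import median


def genome_wide_coverage(depths, usable):
    """Median depth over usable positions (non-gap, mapping quality 60)."""
    return median([d for d, ok in zip(depths, usable) if ok])


def candidate_breakpoints(clipped_read_counts, min_clipped=2):
    """Positions with at least min_clipped clipped reads (threshold is hypothetical)."""
    return sorted(pos for pos, n in clipped_read_counts.items() if n >= min_clipped)


def evaluate_segments(depths, breakpoints, expected):
    """Score mean depth between adjacent breakpoints against the expected coverage."""
    results = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        segment = depths[start:end]
        if not segment:
            continue
        mean_depth = sum(segment) / len(segment)
        # Assumption: a Poisson variable with mean `expected` has
        # standard deviation sqrt(expected).
        z = (mean_depth - expected) / math.sqrt(expected)
        results.append((start, end, mean_depth, z))
    return results


# Toy data: 30-fold coverage with a 100-position stretch at 60-fold.
depths = [30] * 100 + [60] * 100 + [30] * 100
usable = [True] * len(depths)
clips = {100: 3, 150: 1, 200: 4}  # position -> number of clipped reads

expected = genome_wide_coverage(depths, usable)
breaks = candidate_breakpoints(clips)
segments = evaluate_segments(depths, breaks, expected)
```

A large positive z-score for a segment suggests a gain and a large negative one a loss. The cutoff for making a call is not given on the poster and is left open here.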
Figure 2. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair. A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsv identifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage.
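One way to check the Poisson claim on other data is to compare the observed fraction of windows at each coverage value with the Poisson probability at the same mean. The sketch below assumes integer coverage values per window (for example, per 100 bp window) with a positive mean; the helper names are hypothetical and no plotting is done.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability of observing k given mean lam (lam > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def window_coverage_histogram(window_depths):
    """Fraction of windows observed at each integer coverage value."""
    counts = Counter(window_depths)
    total = len(window_depths)
    return {k: counts[k] / total for k in sorted(counts)}


def compare_to_poisson(window_depths):
    """List of (coverage, observed fraction, Poisson expectation) rows."""
    lam = sum(window_depths) / len(window_depths)
    hist = window_coverage_histogram(window_depths)
    return [(k, frac, poisson_pmf(k, lam)) for k, frac in hist.items()]
```

Regions with copy-number changes would show up as an excess of windows far from the mean, as the tumor sample does in the figure.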
[Figure 3 track labels: HG001 HiFi reads; Segdups; Repeats; Genes; HG001 pbsv CNV. Regions shown: GRCh37 chr19:50,579,869-50,659,719 (80 kb) and GRCh37 chr6:220,626-410,470 (189 kb).]
Figure 3. pbsv CNV calls in HG001 in (a) segmentally duplicated (segdup) sequence and (b) unique sequence.
Figure 4. Genomic distribution of pbsv CNV calls, by copy number (normal = 2). (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d) The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups.
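A summary like the one in Figure 4 can be tabulated from a list of CNV calls. The sketch below makes two assumptions that the poster does not state: copy number is estimated by scaling segment depth against the typical depth with a diploid baseline of 2, and each call is a hypothetical `(start, end, copy_number, in_segdup)` tuple. It is an illustration, not pbsv output handling.

```python
from collections import defaultdict


def estimate_copy_number(segment_depth, typical_depth, normal_copies=2):
    """Assumed model: copy number scales linearly with read depth."""
    return round(normal_copies * segment_depth / typical_depth)


def total_length_by_copy_number(calls):
    """Sum CNV length in Mb per (sequence category, copy number).

    calls: iterable of (start, end, copy_number, in_segdup) tuples.
    """
    totals = defaultdict(float)
    for start, end, copy_number, in_segdup in calls:
        category = "segdup" if in_segdup else "unique"
        totals[(category, copy_number)] += (end - start) / 1e6
    return dict(totals)
```

For example, under this model a segment at 45-fold depth in a 30-fold genome would be assigned copy number 3.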
Summary
• PacBio HiFi reads comprehensively detect variants in a human genome.
• The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures.
[Figure 4 panel and axis labels: panels a-d for HG001 and COLO829T (tumor), each split into segdup and unique sequence; x-axis: copy number (0 to ≥10); y-axis: total CNV length (Mb).]
[Figure 2 panel and axis labels: (a) HG001, HG002, HG005 and (b) COLO829N, COLO829T; x-axis: coverage; y-axis: percentage of 100 bp windows. (c) HG001 HiFi reads with Segdups and Repeats tracks at GRCh37 chr6:103,727,761-103,773,116 (45 kb); two positions marked "candidate breakpoint"; label: 25 kb.]
Variant calling with HiFi reads attains high precision and recall for single nucleotide variants (SNVs), indels, and structural variants (SVs).
Variant type      Precision   Recall
SNV               99.97%      99.95%
Indel (<50 bp)    99.10%      98.86%
SV (≥50 bp)       96.26%      94.93%
Table 1. Variant calling performance for 32-fold coverage of HG002 HiFi reads, measured against Genome in a Bottle (GIAB) benchmarks (refs. 2, 3). SNV and indel calling is with Google DeepVariant (ref. 4). SV calling is with pbsv (ref. 5).
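For readers unfamiliar with the two metrics in Table 1, precision is the fraction of calls that match the benchmark and recall is the fraction of benchmark variants that are called. A minimal sketch of the standard definitions follows; the function name is hypothetical, and the counts in any usage are the caller's own, not values from the poster.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```

With made-up counts of 90 true positives, 10 false positives, and 30 false negatives, this gives a precision of 0.9 and a recall of 0.75.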