copy number variant detection with pacbio long reads · variant calling with hifi reads attains...

1
For Research Use Only. Not for use in diagnostic procedures. © Copyright 2020 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners. Copy-Number Variant Detection with PacBio Long Reads Aaron M. Wenger, Armin Töpfer, Luke Hickey Pacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025 Long Highly Accurate HiFi Reads The polymerase reads are trimmed of adapters to yield subreads Consensus is called from subreads Start with high-quality double stranded DNA Ligate SMRTbell adapters and size select Anneal primers and bind DNA polymerase Circularized DNA is sequenced in repeated passes HG002 11 kb HiFi reads HG002 2×250 bp reads haplotype 1 haplotype 2 STRC Variant Detection with HiFi Reads Figure 1. (a) PacBio Circular Consensus Sequencing produces HiFi reads 1 that are long (10-25 kb) and accurate (>99%). (b) HiFi reads align to difficult-to-map regions, including exons of the disease gene STRC. GRCh37 chr15:43,891,619-43,911,196 (19 kb) Existing long read variant detection methods rely on de novo assembly or spanning reads. These methods are effective for SVs but miss many CNVs that involve long segmental duplications. Copy-Number Variant (CNV) Detection with pbsv CNVs in HG001 and COLO829T References 1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome . Nat Biotechnol. 84(2):125-140. 2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls . Nat Biotechnol. 37(5):561-566. 3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection . bioRxiv. doi:10.1101/664623. [Preprint] 4. Poplin R et al. (2018). A universal SNP and small-indel variant caller using deep neural networks . Nat Biotechnol. 36(10):983-987. 5. https://github.com/PacificBiosciences/pbsv 6. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 10(1):1784. Thank you to Billy Rowell, David Scherer, and Pamela Bentley Mills for poster production support. "We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]." – Chaisson et al. 2019 (ref. 6) We extended the PacBio SV caller, pbsv, to detect CNVs using a combination of read clipping and depth. Determine genome-wide coverage median coverage of non-gap positions at mapping quality 60 Identify candidate CNV breakpoints positions with multiple clipped reads Evaluate coverage between adjacent candidate breakpoints calculate z-score vs Poisson expectation pbsv CNV algorithm Figure 2. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair. A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsv identifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage. a b c HG001 HiFi reads Segdups Repeats GRCh37 chr19:50,579,869-50,659,719 (80 kb) HG001 pbsv CNV HG001 HiFi reads Segdups Repeats HG001 pbsv CNV GRCh37 chr6:220,626-410,470 (189 kb) Genes Genes Figure 3. pbsv CNV calls in HG001 in (a) segmentally duplicated (segdup) sequence and (b) unique sequence. a b Figure 4. Genomic distribution of pbsv CNV calls, by copy number (normal = 2). (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d) The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups. Summary PacBio HiFi reads comprehensively detect variants in a human genome. The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures. Copy number Copy number Total CNV length (Mb) HG001 COLO829T (tumor) segdup segdup unique unique b a d c 9 0 1 2 3 4 5 6 7 8 ≥10 9 0 1 2 3 4 5 6 7 8 ≥10 0 1 2 3 4 5 0 1 2 3 4 5 0 50 100 150 200 0 50 100 150 200 candidate breakpoint candidate breakpoint HG001 HiFi reads Segdups Repeats GRCh37 chr6:103,727,761-103,773,116 (45 kb) 25 kb Coverage Coverage 100 bp windows 100 bp windows 54 0 6 12 18 24 30 36 42 48 60 90 0 10 20 30 40 50 60 70 80 100 0% 1% 2% 3% 4% 5% 6% 0% 1% 2% 3% 4% 5% HG001 HG002 HG005 COLO829N COLO829T a b Variant calling with HiFi reads attains high precision and recall for single nucleotide variants (SNVs), indels, and structural variants (SVs). Variant type Precision Recall SNV 99.97% 99.95% Indel (<50 bp) 99.10% 98.86% SV (≥50 bp) 96.26% 94.93% Table 1. Variant calling performance for 32-fold coverage of HG002 HiFi reads, measured against Genome in a Bottle (GIAB) benchmarks 2,3 . SNV and indel calling is with Google DeepVariant 4 . SV calling is with pbsv 5 .

Upload: others

Post on 21-Sep-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Copy number variant detection with PacBio long reads · Variant calling with HiFi reads attains high precision and recall for single nucleotide variants (SNVs), indels, and structural

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2020 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. Femto Pulse and Fragment Analyzer are trademarks of Agilent Technologies Inc. All other trademarks are the sole property of their respective owners.

Copy-Number Variant Detection with PacBio Long ReadsAaron M. Wenger, Armin Töpfer, Luke HickeyPacific Biosciences, 1305 O'Brien Drive, Menlo Park, CA 94025

Long Highly Accurate HiFi Reads

The polymerase reads are trimmed of adapters to yield

subreads

Consensus is called from subreads

Start with high-quality double stranded DNA

Ligate SMRTbell adapters and size select

Anneal primers and bind DNA polymerase

Circularized DNA is sequenced in repeated passes

HG00211 kb HiFi

reads

HG002 2×250 bp

reads

hapl

otyp

e 1

hapl

otyp

e 2

STRC

Variant Detection with HiFi Reads

Figure 1. (a) PacBio Circular Consensus Sequencing produces HiFi reads1 that are long (10-25 kb) and accurate (>99%). (b) HiFi reads align to difficult-to-map regions, including exons of the disease gene STRC.

GRCh37 chr15:43,891,619-43,911,196 (19 kb)

Existing long read variant detection methods rely on de novo assembly or spanning reads. These methods are effective for SVs but miss many CNVs that involve long segmental duplications.

Copy-Number Variant (CNV)Detection with pbsv CNVs in HG001 and COLO829T

References1. Wenger AM, Peluso P et al. (2019). Accurate circular consensus long-read

sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 84(2):125-140.

2. Zook JM et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 37(5):561-566.

3. Zook JM et al. (2019). A robust benchmark for germline structural variant detection. bioRxiv. doi:10.1101/664623. [Preprint]

4. Poplin R et al. (2018). A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 36(10):983-987.

5. https://github.com/PacificBiosciences/pbsv6. Chaisson MJP et al. (2019). Multi-platform discovery of haplotype-resolved

structural variation in human genomes. Nat Commun. 10(1):1784.

Thank you to Billy Rowell, David Scherer, and Pamela Bentley Mills for poster production support.

"We determined that 57% and 15% of the copy number variable bases within segmental duplications detected by dCGH and Genome STRiP, respectively, were not in contigs resolved by de novo assembly [of long reads]."– Chaisson et al. 2019 (ref. 6)

We extended the PacBio SV caller, pbsv, to detect CNVs using a combination of read clipping and depth.

Determine genome-wide coveragemedian coverage of non-gap

positions at mapping quality 60

Identify candidate CNV breakpointspositions with multiple clipped reads

Evaluate coverage between adjacent candidate breakpoints

calculate z-score vs Poisson expectation

pbsv CNV algorithm

Figure 2. PacBio read coverage is Poisson distributed in autosomes for (a) samples from GIAB and (b) the normal sample from a tumor/normal pair. A tumor sample shows CNV regions of reduced and increased coverage. (c) To call CNVs, pbsv identifies candidate CNV breakpoints with multiple clipped reads and then evaluates read depth between adjacent breakpoints compared to the genome-wide typical coverage.

a b

c

HG001HiFi

reads

Segdups

Repeats

GRCh37 chr19:50,579,869-50,659,719 (80 kb)

HG001pbsv CNV

HG001HiFi

reads

Segdups

Repeats

HG001pbsv CNV

GRCh37 chr6:220,626-410,470 (189 kb)

Genes

Genes

Figure 3. pbsv CNV calls in HG001 in (a)segmentally duplicated (segdup) sequence and (b) unique sequence.

a

b

Figure 4. Genomic distribution of pbsv CNV calls, by copy number (normal = 2). (a, b) Most CNVs in the "healthy" genome HG001 involve segdups. (c, d)The tumor genome COLO829T has many more CNVs than HG001 and most fall outside of segdups.

Summary

• PacBio HiFi reads comprehensively detect variants in a human genome.

• The pbsv variant caller identifies CNVs in HiFi reads using read clipping and depth signatures.

Copy number Copy number

Tota

l CN

V le

ngth

(Mb)

HG001 COLO829T (tumor)

segdup segdup

unique unique

b

a

d

c

90 1 2 3 4 5 6 7 8 ≥10 90 1 2 3 4 5 6 7 8 ≥100

1

2

3

4

5

0

1

2

3

4

5

0

50

100

150

200

0

50

100

150

200

candidate breakpoint

candidate breakpoint

HG001HiFi

reads

Segdups

Repeats

GRCh37 chr6:103,727,761-103,773,116 (45 kb)

25 kb

CoverageCoverage

100

bp w

indo

ws

100

bp w

indo

ws

540 6 12 18 24 30 36 42 48 60 900 10 20 30 40 50 60 70 80 1000%

1%

2%

3%

4%

5%

6%

0%

1%

2%

3%

4%

5%HG001HG002HG005

COLO829NCOLO829T

a

b

Variant calling with HiFi reads attains high precision and recall for single nucleotide variants (SNVs), indels, and structural variants (SVs).

Variant type Precision Recall

SNV 99.97% 99.95%

Indel (<50 bp) 99.10% 98.86%

SV (≥50 bp) 96.26% 94.93%

Table 1. Variant calling performance for 32-fold coverage of HG002 HiFi reads, measured against Genome in a Bottle (GIAB) benchmarks2,3. SNV and indel calling is with Google DeepVariant4. SV calling is with pbsv5.