170326 giab abrf

Genome in a Bottle: So you’ve sequenced a genome – how well did you do? Justin Zook and Marc Salit NIST Genome-Scale Measurements Group Joint Initiative for Metrology in Biology (JIMB) March 26, 2017


Page 1: 170326 giab abrf

Genome in a Bottle: So you’ve sequenced a genome – how well did you do?

Justin Zook and Marc Salit

NIST Genome-Scale Measurements Group

Joint Initiative for Metrology in Biology (JIMB)

March 26, 2017

Page 2: 170326 giab abrf

Genome in a Bottle Consortium: Authoritative Characterization of Human Genomes

Generic measurement process: Sample → gDNA isolation → Library Prep → Sequencing → Alignment/Mapping → Variant Calling → Confidence Estimates → Downstream Analysis

• gDNA reference materials to evaluate performance
– materials characterized for their variants against a reference sequence, with confidence estimates

• established consortium to develop reference materials, data, methods, performance metrics

www.slideshare.net/genomeinabottle

Page 3: 170326 giab abrf

In September, we released 4 new GIAB RM Genomes.

• PGP Human Genomes
– AJ son
– AJ trio
– Asian son

• Parents also characterized

Page 4: 170326 giab abrf

We also released a Microbial Genome RM

This Reference Material (RM) is intended for validation, optimization, process evaluation, and performance assessment of whole genome sequencing.

• Salmonella Typhimurium
• Pseudomonas aeruginosa
• Staphylococcus aureus
• Clostridium sporogenes

Page 5: 170326 giab abrf

Bringing Principles of Metrology to the Genome

• Reference materials
– DNA in a tube you can buy from NIST
• Extensive state-of-the-art characterization
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology

Page 6: 170326 giab abrf

NIST Reference Materials

Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A

Page 7: 170326 giab abrf

Data for GIAB PGP Trios

Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics | | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome | | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR | | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Linked reads | 30-45x/individual | on FTP | SNPs/SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.05x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son | | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly

Page 8: 170326 giab abrf

Paper describing data…

• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab

Page 9: 170326 giab abrf

Principles of Integration Process

• Form sensitive variant calls from each dataset

• Define “callable regions” for each callset

• Filter calls from each method with annotations unlike concordant calls

• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis

• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs

Page 10: 170326 giab abrf

Integration Methods to Establish Benchmark Variant Calls

Candidate variants → Concordant variants → Find characteristics of bias → Arbitrate using evidence of bias → Confidence Level

Zook et al., Nature Biotechnology, 2014.

Page 11: 170326 giab abrf

NEW: Reproducible integration pipeline with new calls for NA12878 and PGP Trios on GRCh37 and GRCh38!

Page 12: 170326 giab abrf

Evolution of high-confidence calls

Calls | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only | Variants Phased
v2.19 | 2.22 Gb | 3153247 | 352937 | 3030703 | 87 | 404 | 1018795 | 0.3%
v3.2.2 | 2.53 Gb | 3512990 | 335594 | 3391783 | 57 | 52 | 657715 | 3.9%
v3.3 | 2.57 Gb | 3566076 | 358753 | 3441361 | 40 | 60 | 608137 | 8.8%
v3.3.2 | 2.58 Gb | 3691156 | 487841 | 3529641 | 47 | 61 | 469202 | 99.6%

5-7 errors in NIST

1-7 errors in NIST

~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files

Page 13: 170326 giab abrf

Global Alliance for Genomics and Health Benchmarking Task Team

• Developed standardized definitions for performance metrics like TP, FP, and FN.

• Developing sophisticated benchmarking tools
• Integrated into a single framework with standardized inputs and outputs

• Standardized bed files with difficult genome contexts for stratification

https://github.com/ga4gh/benchmarking-tools

Variant types can change when decomposing or recomposing variants:

Complex variant:
chr1 201586350 CTCTCTCTCT CA

DEL + SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A

Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team
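One way to see why representation matters is to apply two representations to the same reference and compare the resulting haplotypes. The sketch below is only an illustration: the flanking sequence, the 0-based offsets, and the `apply_variants` helper are all hypothetical, and the deletion is written with a nine-base REF allele so that it does not overlap the SNP record.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) records to a reference string.

    Positions are 0-based offsets into `reference` (an assumption of this
    sketch; VCF itself uses 1-based coordinates).
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variant records")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged bases before the variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


# Toy reference: hypothetical flanking bases around a CT repeat.
REFERENCE = "GG" + "CTCTCTCTCT" + "AA"

# One complex record covering the whole repeat.
complex_repr = [(2, "CTCTCTCTCT", "CA")]
# Deletion plus SNP; the deletion stops one base short of the SNP.
decomposed_repr = [(2, "CTCTCTCTC", "C"), (11, "T", "A")]

haplotype_a = apply_variants(REFERENCE, complex_repr)
haplotype_b = apply_variants(REFERENCE, decomposed_repr)
print(haplotype_a, haplotype_b, haplotype_a == haplotype_b)
```

Both representations yield the same haplotype, yet a record-by-record comparison would count one complex variant in the first case and one deletion plus one SNP in the second.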

Page 14: 170326 giab abrf

Workflow output

Benchmarking example: NA12878 / GiaB / 50X / PCR-Free / Hiseq2000

https://illumina.box.com/s/vjget1dumwmy0re19usetli2teucjel1

Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team

Page 15: 170326 giab abrf

Benchmarking Tools

Standardized comparison, counting, and stratification with Hap.py + vcfeval

https://precision.fda.gov/

https://github.com/ga4gh/benchmarking-tools
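The counting step behind these tools can be sketched in a few lines. This is a deliberately simplified illustration, not how hap.py or vcfeval work internally: it matches records by exact (chrom, pos, ref, alt) identity rather than by haplotype, and the function names, the tuple layout, and the example calls are all assumptions made for the sketch.

```python
def in_regions(chrom, pos, regions):
    """Return True if (chrom, pos) falls in any half-open [start, end) interval."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def benchmark(truth_calls, query_calls, confident_regions):
    """Count TP/FP/FN by exact record match, restricted to confident regions."""
    truth = {v for v in truth_calls if in_regions(v[0], v[1], confident_regions)}
    query = {v for v in query_calls if in_regions(v[0], v[1], confident_regions)}
    tp = len(truth & query)   # calls present in both sets
    fp = len(query - truth)   # query calls absent from the benchmark
    fn = len(truth - query)   # benchmark calls missed by the query
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


# Hypothetical calls as (chrom, pos, ref, alt) tuples.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 900, "G", "A")]
query = [("chr1", 100, "A", "G"), ("chr1", 300, "T", "C"),
         ("chr1", 900, "G", "A"), ("chr1", 950, "T", "G")]
confident = [("chr1", 50, 400)]  # hypothetical high-confidence interval

stats = benchmark(truth, query, confident)
print(stats)
```

Calls at positions 900 and 950 fall outside the confident interval, so they are neither rewarded nor penalized here.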

Page 16: 170326 giab abrf

FN rates high in some tandem repeats

[Figure: FN rate vs. average (axis ticks 0.3x, 1x, 3x, 10x, 30x) for tandem repeats of 11 to 50 bp and 51 to 200 bp, each split into 2bp, 3bp, and 4bp unit repeats]

Page 17: 170326 giab abrf

GA4GH benchmarking on GitHub

• In-progress benchmarking standards document: doc/standards
• Description of intermediate formats: doc/ref-impl
• Truthset descriptions and download links: resources/high-confidence-sets
• Stratification bed files and descriptions: resources/stratification-bed-files
• Python code for HTML reporting and running benchmarks: reporting/basic

Please contribute / join the discussion!

https://github.com/ga4gh/benchmarking-tools

Credit: Peter Krusche, Illumina
GA4GH Benchmarking Team

Page 18: 170326 giab abrf

Benchmarking stats can be difficult to interpret
Example: decoy-like regions

“Decoy” sequence for GRCh37
• Created to capture reads that are from sequences that are not in the GRCh37 reference assembly, which otherwise can cause FPs
• We only include calls in decoy-homologous regions if they have clear support in both 10X haplotypes
• We look at error rates for bwa-GATK without using the decoy

SNP benchmarking stats vs. different callsets (BWA/GATK, no decoy)

Metric | vs. 2.18 | vs. 3.3.2 | vs. PG
Precision | 91% | 67% | 93%
Recall | 99.8% | 99.4% | 93%
Outside bed | 91% | 92% | 78%

• v3.3.2 best at identifying FP SNPs
– 43% of FPs in decoy (only 0.5% of TPs)
• PG best at identifying FN SNPs
– Mostly clustered, unclear variants in difficult-to-map regions

Page 19: 170326 giab abrf

Benchmarking stats can be difficult to interpret
Example: FN SNPs in coding regions

RefSeq Coding Regions
• Studies often focus on variants in coding regions
• We look at FN SNP rates for bwa-GATK using the decoy

SNP benchmarking stats vs. PG and 3.3.2
• 97.98% sensitivity vs. PG
– FNs predominantly in low MQ and/or segmental duplication regions
– ~80% of FNs supported by long or linked reads
• 99.96% sensitivity vs. NISTv3.3.2
– 62x lower FN rate than vs. PG
• As always, true sensitivity is unknown

Page 20: 170326 giab abrf

True accuracy is hard to estimate, especially in difficult regions
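The fold-difference in FN rate follows from the two sensitivities, since FN rate = 1 - sensitivity. A minimal sketch using only the rounded percentages quoted on the slide; with these rounded inputs the ratio comes out near 50x rather than 62x, so the slide's figure was presumably computed from unrounded counts that are not reproduced here. The function name is an assumption of the sketch.

```python
def fn_rate(sensitivity_percent):
    """Convert a sensitivity given in percent into an FN rate (a fraction)."""
    return 1.0 - sensitivity_percent / 100.0


fn_vs_pg = fn_rate(97.98)     # benchmarked against PG
fn_vs_nist = fn_rate(99.96)   # benchmarked against NIST v3.3.2
fold = fn_vs_pg / fn_vs_nist  # how many times lower the FN rate is vs. NIST

print(f"FN rate vs. PG:   {fn_vs_pg:.4%}")
print(f"FN rate vs. NIST: {fn_vs_nist:.4%}")
print(f"fold difference:  {fold:.1f}x")
```

Because the NIST-based FN rate is so small, the ratio is very sensitive to the last reported digit of the sensitivity, which is one more reason to report counts alongside percentages.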

Page 21: 170326 giab abrf

Approaches to Benchmarking Variant Calling

• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time

Page 22: 170326 giab abrf

Challenges in Benchmarking Variant Calling

• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
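For the last point, one common choice for a proportion such as sensitivity or precision is the Wilson score interval. The slides do not say which interval to use, so this is a sketch under that assumption; the function name and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical stratum: 48 of 50 benchmark variants detected.
low, high = wilson_interval(48, 50)
print(f"sensitivity 96.0%, 95% CI {low:.1%} to {high:.1%}")
```

With only 50 variants in a stratum, an observed 96% sensitivity is compatible with a true value anywhere from roughly 87% to 99%, which is why small strata need intervals rather than point estimates.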

Page 23: 170326 giab abrf

How can we extend this approach to structural variants?

Similarities to small variants
• Collect callsets from multiple technologies
• Compare callsets to find calls supported by multiple technologies

Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly standardized, especially when complex
• Comparison tools in infancy

Page 24: 170326 giab abrf

Preliminary process for integrated deletions

• Merge deletions within 1kb
• Rank calls by closeness of predicted size to median size and select call in each region from best callset
• Find calls supported by 2+ technologies with size within 20%
• Filter calls overlapping seg dups, reference N’s, or with call with predicted size 2x larger

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_DraftIntegratedDeletionsgt19bp_v0.1.8

Size | <50bp | 50-100bp | 100-1000bp | 1kb-3kb | >3kbp
Pre-filtered calls | 2627 | 1600 | 2306 | 385 | 389
Post-filtered calls | 2548 | 1448 | 1996 | 297 | 262
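The merge, support, and selection steps above can be sketched as follows. This is not the GIAB pipeline: the clustering rule, the tuple layout, the function names, and the example calls are assumptions, the "best callset" ranking is reduced to closeness to the median size, and the final filter (segmental duplications, reference N's, 2x larger calls) is omitted. Only the three thresholds come from the slide.

```python
from statistics import median

MERGE_DISTANCE = 1000   # "within 1kb" from the slide
SIZE_TOLERANCE = 0.20   # "size within 20%" from the slide
MIN_TECHNOLOGIES = 2    # "2+ technologies" from the slide


def cluster_deletions(calls):
    """Group (chrom, start, end, tech) calls lying within MERGE_DISTANCE."""
    clusters = []
    cluster_end = None
    for call in sorted(calls):
        chrom, start, end, _tech = call
        if (clusters and clusters[-1][0][0] == chrom
                and start - cluster_end <= MERGE_DISTANCE):
            clusters[-1].append(call)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([call])
            cluster_end = end
    return clusters


def integrate(calls):
    """Return one representative call per cluster with multi-technology support."""
    selected = []
    for cluster in cluster_deletions(calls):
        median_size = median(end - start for _c, start, end, _t in cluster)
        # Keep calls whose size is within the tolerance of the cluster median.
        similar = [
            call for call in cluster
            if abs((call[2] - call[1]) - median_size) <= SIZE_TOLERANCE * median_size
        ]
        if len({call[3] for call in similar}) < MIN_TECHNOLOGIES:
            continue  # not supported by enough technologies
        # Representative: the call whose size is closest to the median.
        selected.append(min(similar, key=lambda c: abs((c[2] - c[1]) - median_size)))
    return selected


# Hypothetical deletion calls from several technologies.
calls = [
    ("chr1", 10000, 10300, "illumina"),
    ("chr1", 10020, 10330, "pacbio"),
    ("chr1", 10050, 10900, "bionano"),   # size disagrees with the other two
    ("chr1", 50000, 50120, "illumina"),  # single-technology call
    ("chr2", 7000, 7500, "pacbio"),
    ("chr2", 7010, 7490, "10x"),
]
integrated = integrate(calls)
print(integrated)
```

The single-technology call on chr1 is dropped, and the oversized chr1 call does not count toward support because its size differs from the cluster median by more than 20%.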

Page 25: 170326 giab abrf

Proposed SV integration process

SV Discovery
• Calls with REF and ALT sequence
• Imprecise SV calls

SV Comparison
• Sequence-based comparison
• Comparison of all candidate calls (SURVIVOR/svcompare)

SV Corroboration
• SV corroboration methods (e.g., svviz, nabsys, bionano, Illumina population, lumpy?)

Form SV benchmark calls
• Heuristics to form tiers of benchmark SVs
• Machine learning to form benchmark SVs

SV Refinement
• SV sequence refinement (Parliament, Spiral Genetics, PBRefine, graphs?)

Evaluate/optimize benchmark calls
• Manually curate alignments around a subset of calls; ask community for feedback

Paper about calls and comparisons?

Page 26: 170326 giab abrf

Draft de novo assemblies for AJ Son

Data | Method | Contig N50 | Scaffold N50 | Number Scaffolds | Total Size
PacBio | Falcon | 5.3 Mb | 5.3 Mb | 13231 | 3.04 Gb
PacBio | PBcR | 4.5 Mb | 4.5 Mb | 12523 | 2.99 Gb
PacBio+BioNano | Falcon+BioNano | 6.1 Mb | 59.4 Mb | 10591 | 3.27 Gb
PacBio+Dovetail | Falcon+HiRise | 5.3 Mb | 12.9 Mb | 12459 | 3.04 Gb
PacBio+Dovetail | PBcR+HiRise | 4.1 Mb | 20.6 Mb | 10491 | 2.99 Gb
Illumina | DISCOVAR | 81 kb | 149 kb | 1.06M | 3.13 Gb
Illumina+Dovetail | DISCOVAR+HiRise | 85 kb | 12.9 Mb | 1.03M | 3.15 Gb
10X | Supernova | 106 kb | 15.2 Mb | 1360 | 2.73 Gb

Credits for assemblies:
• Ali Bashir, Mt. Sinai
• Jason Chin, PacBio
• Alex Hastie, BioNano
• Serge Koren, NHGRI
• Adam Phillippy, NHGRI
• Kareina Dill, Dovetail
• Noushin Ghaffari, TAMU
• 10X Genomics

Assembly-based SV calls:
• MSPAC
• Assemblytics
• PBRefine

IMPORTANT NOTE: These are draft assemblies and statistics should not be used to compare quality of assembly methods.
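For readers unfamiliar with the N50 columns: N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch with made-up lengths (the function name and all numbers are hypothetical, not taken from the assemblies above):

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembled bases
            return length


# Hypothetical contigs, and the same bases after joining some into scaffolds.
contig_lengths = [80, 70, 50, 40, 30, 20]
scaffold_lengths = [150, 90, 50]  # 80+70, 50+40, 30+20

contig_n50 = n50(contig_lengths)
scaffold_n50 = n50(scaffold_lengths)
print(contig_n50, scaffold_n50)
```

Joining contigs into scaffolds leaves the total size unchanged but raises N50, which matches the pattern in the rows where a scaffolding method is added to the same underlying assembly.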

Page 27: 170326 giab abrf

New Samples

Additional ancestries
• Shorter term
– Use existing PGP individual samples
– Use existing integration pipeline
• Data-based selection
– Proportion of potential genomes from different ancestries
• 3 to 8 new samples
• Longer term
– Recruit large family
– Recruit trios from other ancestry groups

Cancer samples
• Longer term
• Make PGP-consented tumor and normal cell lines from same individual
• Select tumor with diversity of mutation types

Page 28: 170326 giab abrf
Page 29: 170326 giab abrf

Acknowledgements

• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
– Lesley Chapman

• Genome in a Bottle Consortium
• GA4GH Benchmarking Team

• FDA
– Liz Mansfield
– Zivana Tezak
– David Litwack

Page 30: 170326 giab abrf

For More Information

www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails

github.com/genome-in-a-bottle – Guide to GIAB data & ftp

www.slideshare.net/genomeinabottle

www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser

Data: http://www.nature.com/articles/sdata201625

Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools

Public workshops
– Possible SV integration mini-workshop in 2017
– Next large workshop early 2018

NIST/JIMB postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]