150224 giab 30 min generic slides

Genome in a Bottle: So you’ve sequenced a genome – how well did you do? February 2015 Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Upload: genomeinabottle

Post on 16-Jul-2015


TRANSCRIPT

Page 1: 150224 giab 30 min generic slides

Genome in a Bottle: So you’ve sequenced a genome – how well did you do?

February 2015

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

Page 2: 150224 giab 30 min generic slides

Whole genome sequencing technologies disagree about 100,000’s of variants

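[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3. Each region is labeled with # SNPs (% of SNPs detected by any platform). Region values as listed on the slide: 3,198,316 (80.05%); 125,574 (3.14%); 230,311 (5.76%); 121,440 (3.04%); 208,038 (5.21%); 71,944 (1.80%); 39,604 (0.99%).]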
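The percentages on this slide can be reproduced from the counts alone. The short sketch below sums the seven region counts and recomputes each share; the variable and function names are illustrative, and the slide text does not say which count belongs to which Venn region, so the code makes no assumption about that.

```python
# Region counts copied from the slide, in the order they are listed.
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# Every SNP detected by at least one platform falls in exactly one region.
total = sum(region_counts)


def percent_of_total(count: int, total: int) -> float:
    """Share of all detected SNPs that fall in one region, in percent."""
    return round(100.0 * count / total, 2)


percentages = [percent_of_total(c, total) for c in region_counts]

# SNPs outside the largest region, i.e. the calls behind the
# "100,000's of variants" in the slide title.
outside_largest = total - max(region_counts)

print(total)            # 3995227
print(percentages[0])   # 80.05
print(outside_largest)  # 796911
```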

Page 3: 150224 giab 30 min generic slides

Bioinformatics programs also disagree

O’Rawe et al. Genome Medicine 2013, 5:28

Page 4: 150224 giab 30 min generic slides

NIST-hosted Genome in a Bottle Consortium

• Infrastructure for performance assessment of NGS
  – support science-based regulatory oversight
• No widely accepted set of metrics to characterize the fidelity of variant calls from NGS…
• Genome in a Bottle Consortium is developing standards to address this…
  – well-characterized human genomes as Reference Materials (RMs)
    • characterized and disseminated by NIST
  – tools and methods to use these RMs
    • Global Alliance for Genomics and Health Benchmarking Team

http://genomeinabottle.org

Page 5: 150224 giab 30 min generic slides

Genome in a Bottle Consortium Development

• NIST met with sequencing technology developers to assess standards needs
  – Stanford, June 2011
• Open, exploratory workshop
  – ASHG, Montreal, Canada
  – October 2011
• Small, invitational workshop at NIST to develop consortium for human genome reference materials
  – FDA, NCBI, NHGRI, NCI, CDC, Wash U, Broad, technology developers, clinical labs, CAP, PGP, Partners, ABRF, others
  – developed draft work plan
  – April 2012
• Open, public meetings of GIAB
  – August 2012 at NIST
  – March 2013 at Xgen
  – August 2013 at NIST
  – January 2014 at Stanford
  – August 2014 at NIST
  – January 2015 at Stanford
• Website
  – www.genomeinabottle.org

Page 6: 150224 giab 30 min generic slides

Others working in this space…

Well-characterized genomes

• Illumina Platinum Genomes

• CDC GeT-RM

• Korean Genome Project

• Human Longevity, Inc.

• Hydatidiform mole haploid cell line

• Genome Reference Consortium

Performance Metrics

• Global Alliance for Genomics and Health Benchmarking Team

• NCBI/CDC GeT-RM Browser

• GCAT website

Page 7: 150224 giab 30 min generic slides

NIST Plays a Role in the First FDA Authorization for Next-Generation Sequencer

November 20, 2013

Page 8: 150224 giab 30 min generic slides

Measurement Process

Sample

gDNA isolation

Library Prep

Sequencing

Alignment/Mapping

Variant Calling

Confidence Estimates

Downstream Analysis

• gDNA reference materials will be developed to characterize performance of a part of process
  – materials will be certified for their variants against a reference sequence, with confidence estimates

[Figure labels: generic measurement process; Pre-Analytical steps; Analytical steps; Clinical Interpretation]

Page 9: 150224 giab 30 min generic slides

Putting “Genomes” in Bottles

• NIST worked with GIAB to select genomes
• Current genomes
  – NA12878 HapMap sample as Pilot sample
    • part of 17-member pedigree
  – 2 trios from PGP
    • Ashkenazim
    • Asian

[Figure: CEPH Utah Pedigree 1463. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. 11 children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893.]

Page 10: 150224 giab 30 min generic slides

NIST Human Genome RMs in the pipeline

• All 10 µg samples of DNA isolated from multistage large growth cell cultures
  – all are intended to act as stable, homogeneous references suitable for use in regulated applications
  – all genomes also available from Coriell repository
• Pilot Genome
  – ~8400 tubes
• Ashkenazim Jewish Trio
  – ~10000 son; ~2500 each parent
• Asian Trio
  – ~10000 son; parents not yet planned as NIST RM

Page 11: 150224 giab 30 min generic slides

Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform
  – take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms


Page 12: 150224 giab 30 min generic slides

Pilot Genome: Integrate 14 Datasets from 5 platforms


Page 13: 150224 giab 30 min generic slides

Integration Methods to Establish Benchmark Variant Calls

[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a "Potential Bias" region indicated.]

Dataset       Site A   Site B   Site C
Dataset #1    0/0      0/0      1/1
Dataset #2    0/1      0/1      1/1
Dataset #3    0/0      0/1      1/1
Integration   0/0      0/1      Uncertain

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Page 14: 150224 giab 30 min generic slides

Integration Methods to Establish Benchmark Variant Calls

Candidate variants

Concordant variants

Find characteristics of bias

Arbitrate using evidence of bias

Confidence Level

Zook et al., Nature Biotechnology, 2014.
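The arbitration step can be illustrated with the Site A to Site C genotypes from the integration table. The sketch below is a deliberately simplified stand-in for the published method: it drops any dataset flagged as biased at a site and accepts a genotype only if the remaining datasets agree. The bias flags are hypothetical, since the slide text does not state which datasets show evidence of bias at each site; they were chosen so that the output matches the "Integration" row.

```python
UNCERTAIN = "Uncertain"


def integrate_site(calls, biased_datasets):
    """Arbitrate one site.

    calls: dict mapping dataset name -> genotype string (e.g. "0/1").
    biased_datasets: names of datasets with evidence of bias at this site.
    Returns the agreed genotype of the unbiased datasets, or UNCERTAIN if
    no unbiased dataset remains or the unbiased datasets disagree.
    """
    trusted = [gt for name, gt in calls.items() if name not in biased_datasets]
    if trusted and len(set(trusted)) == 1:
        return trusted[0]
    return UNCERTAIN


# Genotypes copied from the integration table.
site_calls = {
    "Site A": {"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/0"},
    "Site B": {"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/1"},
    "Site C": {"Dataset #1": "1/1", "Dataset #2": "1/1", "Dataset #3": "1/1"},
}

# Hypothetical bias evidence (an assumption, not taken from the slide).
bias_flags = {
    "Site A": {"Dataset #2"},
    "Site B": {"Dataset #1"},
    "Site C": {"Dataset #1", "Dataset #2", "Dataset #3"},
}

integrated = {
    site: integrate_site(calls, bias_flags[site])
    for site, calls in site_calls.items()
}
print(integrated)
```

Note that under these assumed flags Site C stays uncertain even though all three datasets agree, because no dataset is free of bias evidence there.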

Page 15: 150224 giab 30 min generic slides

Assigning confidence to genotypes

High-confidence sites

• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement

• At least some methods have no evidence of bias

• Inherited as expected

Less confident sites

• In a region known to be difficult for current technologies

• State reasons for lower confidence

• If a site is near a low confidence site, make it low confidence
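The last rule, demoting sites that sit close to a low-confidence site, is easy to express in code. The sketch below assumes sites on a single chromosome and a hypothetical 50 bp window; the slide does not specify the distance actually used.

```python
import bisect


def propagate_low_confidence(sites, window=50):
    """Demote confident sites that lie within `window` bp of a low-confidence site.

    sites: list of (position, is_confident) tuples on one chromosome.
    window: neighbourhood size in bp (an assumed value, not from the slides).
    Returns a new list with the same positions and updated flags.
    """
    low_positions = sorted(pos for pos, confident in sites if not confident)
    updated = []
    for pos, confident in sites:
        if confident:
            # First low-confidence position at or after the window's left edge.
            i = bisect.bisect_left(low_positions, pos - window)
            if i < len(low_positions) and low_positions[i] <= pos + window:
                confident = False
        updated.append((pos, confident))
    return updated


example_sites = [(100, True), (120, False), (160, True), (400, True)]
print(propagate_low_confidence(example_sites))
# [(100, False), (120, False), (160, False), (400, True)]
```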

Page 16: 150224 giab 30 min generic slides

Challenges with assessing performance

• All variant types are not equal

• All regions of the genome are not equal

• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance

• Genotypes fall in 3+ categories (not positive/negative)

– standard diagnostic accuracy measures not well posed


Page 17: 150224 giab 30 min generic slides

Challenge in variant comparison: Complex variants have multiple correct representations

[Figure: alignments of the same region from BWA, ssaha2, CGTools, and Novoalign shown against the reference (Ref), with labels "T insertion" and "TCTCT insertion".]

                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
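One way to see why a record-by-record comparison overcounts false positives is to compare the haplotype sequences that two call sets produce instead of their raw records. The sketch below is an illustration of that idea, not the realignment method used for the numbers above; the toy reference sequences and coordinates (0-based) are invented for the example.

```python
def apply_variants(reference, variants):
    """Build the haplotype produced by applying variants to a reference string.

    variants: iterable of (pos, ref, alt) with 0-based positions; the
    variants must not overlap and each ref allele must match the reference.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # the alternate allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def same_haplotype(reference, calls_a, calls_b):
    """True if both call sets describe the same resulting sequence."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


# A CT insertion in a CT repeat, placed at the left or the right end.
repeat_ref = "GACTCTCTA"
left_aligned = [(1, "A", "ACT")]
right_aligned = [(7, "T", "TCT")]

# One MNP versus the same change written as two adjacent SNPs.
mnp_ref = "ACGT"
as_mnp = [(1, "CG", "TA")]
as_snps = [(1, "C", "T"), (2, "G", "A")]

print(set(left_aligned) == set(right_aligned))                   # False
print(same_haplotype(repeat_ref, left_aligned, right_aligned))   # True
print(same_haplotype(mnp_ref, as_mnp, as_snps))                  # True
```

A practical comparison tool would restrict this to a local window around each variant and handle diploid genotypes; both are omitted here for brevity.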

Page 18: 150224 giab 30 min generic slides

Global Alliance for Genomics and Health Benchmarking Task Team

• Formed June 2014 to develop methods and tools for comparing variant calls to a benchmark
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
  – Comparison engine
  – Pluggable web interface with modules for:
    • Reporting/calculation of metrics
    • Visualization/user interface
• Working with Genome in a Bottle Consortium to host data and calls from their well-characterized genomes

www.bioplanet.com/gcat

Example User Interface
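As a rough illustration of how TP, FP, and FN counts relate to a benchmark, the sketch below compares a query call set with a benchmark call set by exact match, restricted to high-confidence regions. It is a simplified assumption for teaching purposes and not the Benchmarking Task Team's standardized definitions; the variant tuples, region layout, and function names are all hypothetical.

```python
def in_regions(chrom, pos, regions):
    """True if pos lies in a half-open [start, end) interval on chrom."""
    return any(start <= pos < end for start, end in regions.get(chrom, []))


def benchmark(query, truth, confident_regions):
    """Count TP/FP/FN by exact match of (chrom, pos, ref, alt) tuples.

    Calls outside the confident regions are ignored on both sides,
    so they count neither as errors nor as successes.
    """
    q = {v for v in query if in_regions(v[0], v[1], confident_regions)}
    t = {v for v in truth if in_regions(v[0], v[1], confident_regions)}
    tp, fp, fn = len(q & t), len(q - t), len(t - q)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


truth_calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 900, "G", "A")}
query_calls = {("1", 100, "A", "G"), ("1", 300, "T", "C"), ("1", 950, "A", "C")}
confident = {"1": [(0, 500)]}

metrics = benchmark(query_calls, truth_calls, confident)
print(metrics)
# {'TP': 1, 'FP': 1, 'FN': 1, 'precision': 0.5, 'recall': 0.5}
```

The calls at positions 900 and 950 fall outside the confident region and are not scored, which is the effect described earlier: excluding difficult sites raises apparent accuracy.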

Page 19: 150224 giab 30 min generic slides

Stratifying Performance

• Measure performance for different types of variants in different sequence contexts
  – Types of variants
    • SNPs
    • indels of different sizes
    • complex variants
    • structural variants
  – Sequence contexts
    • Homopolymers
    • STRs
    • Duplications
  – Functional context
    • Exome vs genome, etc.
  – Data characteristics
    • Coverage
    • Mapping quality
• Challenge of smaller gene panels vs genome sequencing
  – one RM may not have a sufficient number of examples of different classes of variants or sequence contexts
  – likely need more samples with specific types of variants
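Stratification by variant type can be sketched as a per-class sensitivity calculation. The size bins and class names below are assumptions made for the example; the slides do not define specific strata boundaries.

```python
from collections import defaultdict


def variant_type(ref, alt):
    """Assign a coarse variant class from allele lengths (bins are assumed)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "MNP/complex"
    return "indel 1-5 bp" if size <= 5 else "indel >5 bp"


def stratified_sensitivity(truth, query):
    """Sensitivity (fraction of benchmark variants found) per variant class.

    truth: iterable of (chrom, pos, ref, alt) benchmark variants.
    query: set of (chrom, pos, ref, alt) called variants.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for variant in truth:
        stratum = variant_type(variant[2], variant[3])
        total[stratum] += 1
        if variant in query:
            found[stratum] += 1
    return {stratum: found[stratum] / total[stratum] for stratum in total}


truth_variants = [
    ("1", 100, "A", "G"),
    ("1", 200, "C", "T"),
    ("1", 300, "A", "AT"),
    ("1", 400, "ACGTACGT", "A"),
]
called = {("1", 100, "A", "G"), ("1", 300, "A", "AT")}

by_type = stratified_sensitivity(truth_variants, called)
print(by_type)
# {'SNP': 0.5, 'indel 1-5 bp': 1.0, 'indel >5 bp': 0.0}
```

With only four benchmark variants, each stratum has one or two examples, which mirrors the concern above: a small panel or a single RM may not contain enough variants of each class to estimate per-class performance.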

Page 20: 150224 giab 30 min generic slides

NCBI/CDC GeT-RM Browser

• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/

• Allows visualization of questionable calls

Page 21: 150224 giab 30 min generic slides

Initial uses of high-confidence NIST-GIAB genotypes for NA12878

• NIST has released several versions of high-confidence genotypes for its pilot RM
• These data are presently being used for benchmarking
  – prior to release of RMs
  – SNPs & indels
    • ~77% of the genome

Page 22: 150224 giab 30 min generic slides

Using Genome in a Bottle calls to benchmark clinical exome sequencing at Mount Sinai School of Medicine

“We evaluate a set of NA12878 technical replicates against GIAB for each new pipeline version.”

Page 23: 150224 giab 30 min generic slides

Benchmarking somatic variant calling at Qiagen

Page 24: 150224 giab 30 min generic slides

Implications of Technical Accuracy in Medical Genome Sequencing

• Collaboration with Euan Ashley group at Stanford
• What is accuracy for functional variants?
• How much of the exome falls in high confidence regions?
• “Black list” in databases
• Sensitivity – WExS (95%) < WGS (98%)
  • especially splicing
  – genome < nonsyn < syn
  – Most exome FNs caused by low coverage
  – Most WGS FNs caused by filtering
• Only 81% of ClinVar pathogenic or likely pathogenic SNPs fall in high-confidence regions
  – Lots of work to do!
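A figure like the 81% above comes from asking what fraction of a set of sites falls inside the high-confidence regions. The sketch below shows that calculation for one chromosome, assuming BED-style half-open, 0-based intervals that are sorted and non-overlapping; the coordinates are invented and the function names are illustrative.

```python
import bisect


def build_index(regions):
    """Split sorted, non-overlapping (start, end) intervals into two lists."""
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    ends = [end for _, end in regions]
    return starts, ends


def is_covered(index, pos):
    """True if pos lies in some half-open [start, end) interval."""
    starts, ends = index
    i = bisect.bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def fraction_covered(index, positions):
    """Fraction of positions that fall inside the indexed regions."""
    if not positions:
        raise ValueError("no positions given")
    return sum(is_covered(index, pos) for pos in positions) / len(positions)


high_confidence = build_index([(0, 100), (200, 300)])
site_positions = [50, 150, 250, 299, 300]

print(fraction_covered(high_confidence, site_positions))  # 0.6
```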

Page 25: 150224 giab 30 min generic slides

Overview of NIST RM Development

Timeline columns: Q4 2014, Q1 2015, Q2 2015, Q3 2015, Q4 2015

HG-001/NA12878 (“Pilot” Genome), milestones in order:
– Release NIST RM8398; Preliminary large deletions
– Refined Structural Variants

HG-002 to HG-004 (Ashkenazim trio), milestones in order:
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– Preliminary SNPs/indels; 120x-150x PacBio data; “moleculo”; mate-pair; CG-LFR
– Refined SNPs/indels; Preliminary SVs
– Refined Structural Variants
– NIST RMs 8391/8392 release

HG-005 (son in Asian trio), milestones in order:
– Illumina, Complete Genomics, Ion, BioNano, homogeneity/stability
– “moleculo”; mate-pair; CG-LFR
– Preliminary SNPs/indels
– Refined SNPs/indels; Refined Structural Variants
– NIST RM8393 release

Page 26: 150224 giab 30 min generic slides

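Ashkenazim Jewish PGP RM Trio

Table columns: Dataset, Characteristics, Coverage, Availability, Good for…

Illumina Paired-end
– Characteristics: 150x150bp
– Coverage: ~300x/individual
– Availability: Fastq on ftp
– Good for: SNPs/indels/some SVs

Illumina Long Mate pair
– Characteristics: ~6000 bp insert
– Coverage: ~40x/individual
– Availability: Feb-Mar 2015
– Good for: SVs

Illumina “moleculo”
– Characteristics: Custom library
– Coverage: ~30x by long fragments
– Availability: Feb-Mar 2015
– Good for: SVs/phasing/assembly

Complete Genomics
– Coverage: 100x/individual
– Availability: On ftp
– Good for: SNPs/indels/some SVs

Complete Genomics LFR
– Coverage/Availability: ??
– Good for: SNPs/indels/phasing

Ion Proton
– Characteristics: Exome
– Coverage: 1000x/individual
– Availability: On SRA
– Good for: SNPs/indels in exome

BioNano Genomics
– Availability: Feb 2015
– Good for: SVs/assembly

PacBio
– Characteristics: ~10kb reads
– Coverage: ~120-150x on AJ trio
– Availability: Finished ~Mar 2015
– Good for: SVs/phasing/assembly/STRs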

Page 27: 150224 giab 30 min generic slides

Asian PGP trio

• Similar sequencing to Ashkenazim trio except for PacBio

• Only son will be NIST RM

Page 28: 150224 giab 30 min generic slides

Future Directions

Germline mutations

• Difficult regions/variants
  – Long-read technologies
  – Forming an analysis group
• Tools for assessing performance
  – How to stratify performance and understand biases?

Somatic mutations

• Pilot interlaboratory study to assess comparability of spike-ins

• Commercial members developing FFPE cell lines

• Participants interested in mixing different RMs

Page 29: 150224 giab 30 min generic slides

How to get involved

• Use our integrated SNP/indel genotypes for NA12878 and give us feedback
  – Cells and DNA currently available from Coriell
  – NIST RM available April 2015
• Join our new Analysis group
  – Use Long-read technologies
  – Structural Variant calls
  – De novo assembly
  – Help create the best-ever characterized trio
• Attend our biannual workshops (January in CA, August in MD)
• Develop tools/metrics with Global Alliance for Genomics and Health Benchmarking Team

Page 30: 150224 giab 30 min generic slides

Acknowledgments

• FDA – Elizabeth Mansfield, HPC staff

• HSPH

• GCAT - David Mittelman, Jason Wang

• Francisco De La Vega

• Illumina - Mike Eberle

• Personalis - Deanna Church

• NCBI – Chunlin Xiao

• Celera - Andrew Grupe

• Genome in a Bottle
  – www.genomeinabottle.org

– New members welcome!

– Sign up for email newsletters
