establishing reference samples for detection of somatic … · 1 establishing reference samples for...

22
Establishing reference samples for detection of somatic 1 mutations and germline variants with NGS technologies 2 3 Li Tai Fang 1* , Bin Zhu 2* , Yongmei Zhao 3* , Wanqiu Chen 4 , Zhaowei Yang 4,30 , Liz Kerrigan 5 , Kurt 4 Langenbach 5 , Maryellen de Mars 5 , Charles Lu 6 , Kenneth Idler 6 , Howard Jacob 6 , Ying Yu 7 , Luyao 5 Ren 7 , Yuanting Zheng 7 , Erich Jaeger 8 , Gary Schroth 8 , Ogan D. Abaan 8 , Justin Lack 3 , Tsai-Wei Shen 3 , 6 Keyur Talsania 3 , Zhong Chen 4 , Seta Stanbouly 4 , Jyoti Shetty 9 , Bao Tran 9 , Daoud Meerzaman 10 , Cu 7 Nguyen 10 , Virginie Petitjean 11 , Marc Sultan 11 , Margaret Cam 12 , Tiffany Hung 13 , Eric Peters 13 , 8 Rasika Kalamegham 13 , Sayed Mohammad Ebrahim Sahraeian 1 , Marghoob Mohiyuddin 1 , Yunfei 9 Guo 1 , Lijing Yao 1 , Lei Song 2 , Hugo YK Lam 1 , Jiri Drabek 14,15 , Roberta Maestro 15,16 , Daniela 10 Gasparotto 15,16 , Sulev Kõks 15,17,18 , Ene Reimann 15,18 , Andreas Scherer 19,15 , Jessica Nordlund 20,15 , 11 Ulrika Liljedahl 20,15 , Roderick V Jensen 21 , Mehdi Pirooznia 22 , Zhipan Li 23 , Chunlin Xiao 24 , Stephen 12 Sherry 24 , Rebecca Kusko 25 , Malcolm Moos 26 , Eric Donaldson 27 , Zivana Tezak 28 , Baitang Ning 29 , Jing 13 Li 30 , Penelope Duerken-Hughes 31 , Huixiao Hong 29# , Leming Shi 7# , Charles Wang 4,31# , Wenming 14 Xiao 28,29# , and The Somatic Working Group of SEQC-II Consortium 15 16 1 Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., 1301 Shoreway 17 Road, Suite #300, Belmont, CA 94002; 2 Division of Cancer Epidemiology and Genetics, National 18 Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Bethesda, Maryland 19 20892, USA.; 3 Advanced Biomedical and Computational Sciences, Biomedical Informatics and 20 Data Science Directorate, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, 21 Frederick, MD21701; 4 Center for Genomics, Loma Linda University School of Medicine, 11021 22 Campus St., Loma Linda, CA 92350; 5 ATCC (American Type Culture Collection), 10801 University 23 Blvd, Manassas, VA 20110; 6 Computational Genomics, Genomics Research Center (GRC), AbbVie, 24 1 North Waukegan Road, North Chicago, IL 60064; 7 State Key Laboratory of Genetic Engineering, 25 School of Life Sciences and Shanghai Cancer Center, Fudan University, 2005 SongHu Road, 26 Shanghai, China 200438; 8 Core applications group, Product development, Illumina Inc , 200 27 Lincoln Centre Dr. Foster City, CA 94404; 9 Genomics Laboratory, Cancer Research Technology 28 Program, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, 29 MD21701; 10 Computational Genomics Research, Center for Biomedical Informatics and 30 Information Technology (CBIIT), National Cancer Institute, 9609 Medical Center Drive, Rockville, 31 MD 20850; 11 Biomarker development, Novartis Institutes for Biomedical Research, Fabrikstrasse 32 10, CH-4056 Basel, Switzerland; 12 CCR Collaborative Bioinformatics Resource (CCBR), Office of 33 Science and Technology Resources, Center for Cancer Research, NCI, Bldg 37, Rm 3041C, 37 34 Convent Drive, Bethesda, MD 20892; 13 Companion Diagnostics Development, Oncology 35 Biomarker Development, Genentech, 1 DNA Way, South San Francisco, CA 94080; 14 IMTM, 36 Faculty of Medicine and Dentistry, Palacky University Olomouc, Hnevotinska 5, 77900 Olomouc, 37 the Czech Republic; 15 member of EATRIS ERIC- European Infrastructure for Translational Medicine; 38 16 Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, National Cancer Institute, Unit of 39 Oncogenetics and Functional Oncogenomics, Via Gallini 2, 33081 Aviano (PN), Italy; 17 Perron 40 Institute for Neurological and Translational Science, Verdun St, Nedlands, Western Australia, 41 6009, Australia; 18 the Centre for Molecular Medicine and Innovative Therapeutics, Murdoch 42 University, Murdoch, 6150, Western Australia; 19 Institute for Molecular Medicine Finland (FIMM), 43 made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also The copyright holder for this preprint (which was not this version posted May 13, 2019. ; https://doi.org/10.1101/625624 doi: bioRxiv preprint

Upload: others

Post on 28-Aug-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

Establishing reference samples for detection of somatic 1

mutations and germline variants with NGS technologies 2

3 Li Tai Fang1*, Bin Zhu2*, Yongmei Zhao3*, Wanqiu Chen4, Zhaowei Yang4,30, Liz Kerrigan5, Kurt 4 Langenbach5, Maryellen de Mars5, Charles Lu6, Kenneth Idler6, Howard Jacob6, Ying Yu7, Luyao 5 Ren7, Yuanting Zheng7, Erich Jaeger8, Gary Schroth8, Ogan D. Abaan8, Justin Lack3, Tsai-Wei Shen3, 6 Keyur Talsania3, Zhong Chen4, Seta Stanbouly4, Jyoti Shetty9, Bao Tran9, Daoud Meerzaman10, Cu 7 Nguyen10, Virginie Petitjean11, Marc Sultan11, Margaret Cam12, Tiffany Hung13, Eric Peters13, 8 Rasika Kalamegham13, Sayed Mohammad Ebrahim Sahraeian1, Marghoob Mohiyuddin1, Yunfei 9 Guo1, Lijing Yao1, Lei Song2, Hugo YK Lam1, Jiri Drabek14,15, Roberta Maestro15,16, Daniela 10 Gasparotto15,16, Sulev Kõks15,17,18, Ene Reimann15,18, Andreas Scherer19,15, Jessica Nordlund20,15, 11 Ulrika Liljedahl20,15, Roderick V Jensen21, Mehdi Pirooznia22, Zhipan Li23, Chunlin Xiao24, Stephen 12 Sherry24, Rebecca Kusko25, Malcolm Moos26, Eric Donaldson27, Zivana Tezak28, Baitang Ning29, Jing 13 Li30, Penelope Duerken-Hughes31, Huixiao Hong29#, Leming Shi7#, Charles Wang4,31#, Wenming 14 Xiao28,29#, and The Somatic Working Group of SEQC-II Consortium 15 16 1Bioinformatics Research & Early Development, Roche Sequencing Solutions Inc., 1301 Shoreway 17 Road, Suite #300, Belmont, CA 94002; 2Division of Cancer Epidemiology and Genetics, National 18 Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Bethesda, Maryland 19 20892, USA.; 3Advanced Biomedical and Computational Sciences, Biomedical Informatics and 20 Data Science Directorate, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, 21 Frederick, MD21701; 4Center for Genomics, Loma Linda University School of Medicine, 11021 22 Campus St., Loma Linda, CA 92350; 5ATCC (American Type Culture Collection), 10801 University 23 Blvd, Manassas, VA 20110; 6Computational Genomics, Genomics Research Center (GRC), AbbVie, 24 1 North Waukegan Road, North Chicago, IL 60064; 7State Key Laboratory of Genetic Engineering, 25 School of Life Sciences and Shanghai Cancer Center, Fudan University, 2005 SongHu Road, 26 Shanghai, China 200438; 8Core applications group, Product development, Illumina Inc , 200 27 Lincoln Centre Dr. Foster City, CA 94404; 9Genomics Laboratory, Cancer Research Technology 28 Program, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, 29 MD21701; 10Computational Genomics Research, Center for Biomedical Informatics and 30 Information Technology (CBIIT), National Cancer Institute, 9609 Medical Center Drive, Rockville, 31 MD 20850; 11Biomarker development, Novartis Institutes for Biomedical Research, Fabrikstrasse 32 10, CH-4056 Basel, Switzerland; 12CCR Collaborative Bioinformatics Resource (CCBR), Office of 33 Science and Technology Resources, Center for Cancer Research, NCI, Bldg 37, Rm 3041C, 37 34 Convent Drive, Bethesda, MD 20892; 13Companion Diagnostics Development, Oncology 35 Biomarker Development, Genentech, 1 DNA Way, South San Francisco, CA 94080; 14IMTM, 36 Faculty of Medicine and Dentistry, Palacky University Olomouc, Hnevotinska 5, 77900 Olomouc, 37 the Czech Republic; 15member of EATRIS ERIC- European Infrastructure for Translational Medicine; 38 16Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, National Cancer Institute, Unit of 39 Oncogenetics and Functional Oncogenomics, Via Gallini 2, 33081 Aviano (PN), Italy; 17Perron 40 Institute for Neurological and Translational Science, Verdun St, Nedlands, Western Australia, 41 6009, Australia; 18 the Centre for Molecular Medicine and Innovative Therapeutics, Murdoch 42 University, Murdoch, 6150, Western Australia; 19Institute for Molecular Medicine Finland (FIMM), 43

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 2: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

HiLIFE, P.O.Box 20, FI-00014 University of Helsinki, Finland; 20Department of Medical Sciences, 44 Molecular Medicine and Science for Life Laboratory, Uppsala University, Molekylär Medicin, Box 45 1432, BMC, Uppsala, 75144 Sweden; 21Department of Biological Sciences, Virginia Tech, Life 46 Sciences 1, 970 Washington St., Blacksburg, VA 24061; 22Bioinformatics and Computational 47 Biology Core, National Heart Lung and Blood Institute, National Institutes of Health, 12 SOUTH 48 DR. Bethesda MD 20892; 23Sentieon Inc., 465 Fairchild Drive, Suite 135, Mountain View CA 94043; 49 24National Center for Biotechnology Information, National Library of Medicine, National 50 Institutes of Health,45 Center Drive, Bethesda, Maryland 20894; 25Immuneering Corporation, 51 One Broadway 14th Fl Cambridge MA 02142 USA; 26The Center for Biologics Evaluation and 52 Research, U.S. Food and Drug Administration, FDA, Silver Spring, Maryland; 27Division of Antiviral 53 Products, Office of Antimicrobial, Center for Drug Evaluation and Research, FDA, Silver Spring, 54 Maryland; 28The Center for Devices and Radiological Health, U.S. Food and Drug Administration, 55 FDA, Silver Spring, Maryland; 29Bioinformatics branch, Division of Bioinformatics and Biostatistics, 56 National Center for Toxicological Research, Food and Drug Administration, 3900 NCTR Road, 57 Jefferson, AR 72079; 30Departemnt of Allergy and Clinical Immunology, State Key Laboratory of 58 Respiratory Disease, Guangzhou Institute of Respiratory Health, the First Affiliated Hospital of 59 Guangzhou Medical University, Guangzhou, Guangdong, 510182, P. R. China; 31Department of 60 Basic Science, Loma Linda University School of Medicine, Loma Linda, CA 92350 61

62 * Contributed equally 63

# To whome correspondence should be addressed to: [email protected], 64 [email protected], [email protected] and [email protected] 65 66

Abstract 67

68 We characterized two reference samples for NGS technologies: a human triple-negative 69

breast cancer cell line and a matched normal cell line. Leveraging several whole-genome 70 sequencing (WGS) platforms, multiple sequencing replicates, and orthogonal mutation detection 71 bioinformatics pipelines, we minimized the potential biases from sequencing technologies, 72 assays, and informatics. Thus, our “truth sets” were defined using evidence from 21 repeats of 73 WGS runs with coverages ranging from 50X to 100X (a total of 140 billion reads). These “truth 74 sets” present many relevant variants/mutations including 193 COSMIC mutations and 9,016 75 germline variants from the ClinVar database, nonsense mutations in BRCA1/2 and missense 76 mutations in TP53 and FGFR1. Independent validation in three orthogonal experiments 77 demonstrated a successful stress test of the truth set. We expect these reference materials and 78 “truth sets” to facilitate assay development, qualification, validation, and proficiency testing. In 79 addition, our methods can be extended to establish new fully characterized reference samples 80 for the community. 81

82 83

Introduction 84

85 In oncology, accurate somatic mutation detection is essential to diagnose cancer, pinpoint 86

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 3: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

targeted therapies, predict survival, and identify resistance mutations. Despite the recent 87 explosion of technological advancements, many studies have reported difficulties in obtaining 88 consistent and concordant somatic mutation calls from individual platforms or pipelines1–3, which 89 hampers clinical validation and advancement of these biomarkers. 90

91 As more sequencing technologies can detect clinically actionable somatic mutations for 92

oncology, the need grows stronger for benchmark samples with known “ground-truth” variants. 93 Such a publicly available sample set would allow platform and pipeline developers to quantify 94 accuracy of somatic mutation calls, study reproducibility across platforms or pipelines, perform 95 validation usig orthogonal techniques, and calibrate best practices of protocols and methods. The 96 FDA has released a guidance on the use of NGS technologies for in vitro diagnosis of suspected 97 germline diseases4, in which well-characterized reference materials are recommended to 98 establish NGS test performance. 99

100 In the absence of well-characterized samples with somatic mutations, normal samples 101

such as the Platinum Genome5, HapMap6 cell lines, or Genome in a Bottle (GiaB) consortium 102 materials7,8 are often used in clinical test development and validation of somatic applications. 103 Also there are some gene-specific reference samples available, such as KRAS in the WHO 1st 104 International Reference Panel9, or from synthetic materials10. Such samples do not adequately 105 address cancer-specific quality metrics such as somatic mutation variant allele frequency (VAF), 106 heterogeneity, tumor mutation burden (TMB), etc. Therefore, cancer reference samples with an 107 abundance of well-defined genetic alterations characterized across the whole genome are highly 108 desirable and urgent needed. 109

110 Previous attempt has characterized a cancer cell line (from metastatic melanoma) that 111

inquired somatic mutations (SNV/indels) in exon regions only. Germline variants and somatic 112 mutations across the rest of the genome were not defined11. In addition, this dataset is 113 distributed under dbGAP-controlled access, limiting its accesibility and utility. In fact, a recent 114 landscape analysis of currently available somatic variant reference samples published by the 115 Medical Devices Innovation Consortium (MDIC) did not identify any reference mutation sets that 116 can be used to evaluate the somatic mutation calling accuracy on a whole-genome basis12. 117 118

To fulfill this unmet need, we chose a pair of cell lines, HCC1395 (triple-negative breast 119 cancer) and HCC395BL (B lymphocytes) from the same donor, supplied by the American Type 120 Culture Collection (ATCC). These two specific cell lines were chosen because they are rich in 121 testable features (CNVs, SNVs, indels, SVs, and genome rearrangements13), and may have a 122 potential to serve as a long-term, publicly available, and renewable reference samples with 123 appropriate consent from donor. Using multiple next generation sequencing (NGS) platforms, 124 sequencing centers, and various bioinformatics analysis pipelines we profiled these tumor-125 normal matching cell lines. Thus, we minimized biases that were specific to any platform, 126 sequencing center, or bioinformatic algorithm, to create a list of high-confidence mutation calls 127 across the whole genome, here called the “truth set.” A subset of these calls was further 128 confirmed with orthogonal targeted sequencing and Whole Exome Sequencing (WES). We also 129 sequenced a series of titrations between HCC1395 and HCC1395BL genomic DNA (gDNA) to 130

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 4: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

confirm candidate somatic SNV/indels. 131 132 We defined truth sets containing somatic mutations and germline variants in a paired cell 133

lines, HCC1395/HCC1395BL, with methods that minimized potential bias from library preparation, 134 sequencing center, or bioinformatics pipeline. While the “truth set” germline variants in 135 HCC1395BL can be used for benchmarking germline variant detection, the “truth set” somatic 136 mutations in HCC1395 can be used for benchmarking cancer mutation detection with VAF as low 137 as 5%. Many of variants and mutations have clinical implications. In the coding regions, a total of 138 193 somatic mutations are documented in the COSMIC database and 8 germline variants are 139 annotated as pathogenic in the ClinVar database. Interestingly, there is a nonsense somatic 140 mutation in the BRCA2 gene and a nonsense germline variant in the BRAC1 gene. Other hotspot 141 somatic mutations are also observed in the TP53 and FGFR1 genes. Thus, we believe these paired 142 cell lines may be highly valuable for those looking for reference samples to benchmark products 143 in detection of mutations in these four genes. 144 145

Results 146

147 Massive data generated to characterize the reference samples 148 149

To provide reference samples for the community well into the future, a matched pair, 150 HCC1395 and HCC1395BL was selected for profiling14. Previous studies of this triple negative 151 breast cancer cell line have revealed the existence of many somatic structural and ploidy 152 changes13, which are confirmed by our cell karyotype and cytogenetic analysis (Suppl. Fig S1, S2). 153 Several attempts have been made to identify SNVs and small indels15–17. Given that appropriate 154 consent from the donor has been obtained for tumor HCC1395 and normal HCC1395BL for the 155 purposes of genomic research, we sought to characterize this pair of cell lines as publicly available 156 reference samples for the NGS community. In this manuscript, we focued our efforts on germline 157 and somatic SNVs and indels. By performing numerous sequencing experiments with multiple 158 platforms at different sequencing centers, we obtained high-confidence call sets of both somatic 159 and germline SNVs and indels (Table 1). Larger structural variants and copy number analysis will 160 be included in a separate manuscript that will discuss these fundings in greater detail. 161 162 Initial Determination of Somatic Mutation Call Set 163

164 High-confidence somatic SNVs and indels were obtained based primarily on 21 pairs of 165

tumor-normal Whole Genome Sequencing (WGS) replicates from six sequencing centers; 166 sequencing depth ranged from 50X to 100X (see manuscript DOI: 10.1101/626440). Each of the 167 21 tumor-normal sequencing replicates was aligned with BWA MEM18, Bowtie219, and 168 NovoAlign20 to create 63 pairs of tumor-normal Binary Sequence Alignment/Map (BAM) files. Six 169 mutation callers (MuTect221, SomaticSniper22, VarDict23, MuSE24, Strelka225, and TNscope26) were 170 applied to discover somatic mutation candidates for each pair of tumor-normal BAM files (Fig. 1). 171 SomaticSeq27 was then utilized to combine the call sets and classify the candidate mutation calls 172 into “PASS”, “REJECT”, or “LowQual”. Four confidence levels (HighConf, MedConf, LowConf, and 173 Unclassified) were determined based on the cross-aligner and cross-sequencing center 174

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 5: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

reproducibility of each mutation call. HighConf and MedConf calls were grouped together as the 175 “truth set” (also known as high-confidence somatic mutations). The call set in its entirety is 176 referred to as the “super set” which includes low-confidence (LowConf) and likely false positive 177 (Unclassified) calls. For low-VAF (Variant Allele Frequency) calls, a HiSeq data set with 300× 178 coverage and a NovaSeq data set with 380× coverage were employed to rescue initial LowConf 179 and Unclassified calls into the truth set. The details are described in the Methods. 180

181 A breakdown of the four confidence levels is displayed in Fig. 2a. In the truth set, HighConf 182

calls consist of 94% of the SNVs and 79% of the indels. LowConf calls typically do not have enough 183 “PASS” classifications across the 63 data sets to be included in the truth set. Variants calls labeled 184 as Unclassified are not reproducible and likely false positives, with more “REJECT” classifications 185 than “PASS”. The vast majority of the calls in the super set are either HighConf or Unclassified. In 186 other words, super set calls tend to be either highly reproducible or not at all reproducible. 187

188 In general, HighConf calls were classified as “PASS” in the vast majority of the data sets, 189

with no variant read in the matched normal and high mapping quality scores. MedConf calls 190 tended to be low-VAF (VAF ≲ 0.10) variants. Due to stochastic sampling of low frequency variants, 191 MedConf calls were not reproduced as highly across different sequencing replicates as HighConf 192 calls. LowConf calls (not a part of truth set) tended to have VAF near or below our detection limits 193 (VAF ≲ 0.05). Distinguishing the LowConf calls with sequencing noise is challenging because they 194 were not reproduced enough to be high-confidence calls (Fig. 2b). 195

196 Independent AmpliSeq confirmation of Call Set 197

198 We randomly selected 450 SNV and 21 indel calls of different confidence levels from the 199

super set and performed PCR-based AmpliSeq with approximately 2000× depth for tumor and 200 normal cells on an Illumina MiSeq sequencer. As we treated the AmpliSeq data set as a 201 confirmatory experiment, simple rules were devised to determine whether a variant call was 202 deemed positively confirmed, not confirmed, or uninterpretable based on the presence or 203 absense of somatic mutation evidence in the AmpliSeq data. Overall, positively confirmed calls 204 had at least 100 variant-supporting reads in the tumor but had no variant read in the normal 205 sample, despite sequencing depths of 600× or more in the normal. Not confirmed calls either 206 had no more variant-supporting read than the expected from base call errors, and/or had 207 VAF≥10% in the normal cells. Uninterpretable calls did not satisfy the criteria for either positive 208 or no validation, either because they did not have enough read depth (<50) or had fewer than 209 10 variant-supporting reads. (See Methods for details). 210

211 Both HighConf and MedConf SNV calls had very high coverage in validation and thus had 212

impressive validation rates (99% and 92%) (Table 2). There were only three HighConf SNV calls 213 that were not confirmed by AmpliSeq. Two of them had germline signals below the detection 214 limit of 50× in the WGS, and the third one was likely an actual somatic mutation missed by 215 AmpliSeq. There were only seven “positively confirmed” Unclassified SNV calls. Four of those 216 seven were either a part of di-nucleotide change or had deletions within 1 bp of the call. The 217 other three had low mapping quality scores (MQ), which drove the categorization of 218

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 6: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

“Unclassified”. This result suggests that some of the “positively confirmed” Unclassified calls 219 might be false positives after all, but it also exposes the limitations of our truth set with regard 220 to complex variants and low mappability regions. LowConf and Unclassified calls (not part of the 221 truth set) also had higher fractions of uninterpretable calls, which consist of low-coverage 222 genomic positions or ambiguous variant signals. In addition, there were also 17 HighConf, 2 223 MedConf, 1 LowConf, and 1 Unclassified indel calls re-sequenced by AmpliSeq. The only not 224 confirmed HighConf indel call was caused by a germline signal. The lone Unclassified indel call 225 was not confirmed (we expect Unclassified calls to be not confirmed). For the inquisitive reader, 226 these discrepant calls (i.e., not confirmed HighConf calls and confirmed Unclassified calls) are 227 discussed in greater detail in the Supplementary Material. 228

229 The VAF calculated from the truth set correlated highly with the VAF calculated from 230

AmpliSeq data set, especially for HighConf calls (Fig. 2c). On the other hand, almost all the data 231 points at the bottom of the graph (i.e., VAF = 0 by AmpliSeq) are Unclassified calls (red). It 232 suggests that despite high VAFs (from 21 WGS replicates) for some of the calls, they were 233 categorized correctly as Unclassified (implying likely false positives). In addition, a large number 234 of uninterpretable Unclassified calls (red X’s) lying at the bottom suggest those were correctly 235 labeled as Unclassified in addition to the not confirmed ones (open red circles). Moreover, some 236 of the seven “positively confirmed” Unclassified calls had dubious supporting evidence. Taken 237 together, these results suggest that the actual true positive rate for the Unclassified calls may be 238 even lower than the validation rate (11%) we reported here. The indel equivalent is portrayed in 239 Suppl. Fig S7a. 240

241 Orthogonal Confirmation of Call Set with WES on Ion Torrent 242

243 We have also sequenced the tumor-normal pair with Whole Exome Sequencing (WES) on 244

the Ion Torrent S5 XL sequencer with the Agilent SureSelect All Exon + UTR v6 hybrid capture. 245 The sequencing depths for the HCC1395 and HCC1395BL were 34× and 47× , respectively. 246 Results from this Ion Torrent sequencing were leveraged to evaluate high-VAF SNV calls (Table 1 247 and 2). HighConf and MedConf SNV calls had high positive validation rates (99% and 89%). 248 However, because the Ion Torrent sequencing was performed at much lower depth, nearly 50% 249 of the calls were deemed uninterpretable (compared with 16% for AmpliSeq, despite having 250 AmpliSeq custom target enriched for low-confidence calls vs. WES). The trend of higher 251 uninterpretable fraction with lower confidence level calls was even more pronounced in this data 252 set because the coverage was too low to confirm or invalidate many low-VAF calls. The validation 253 rate for MedConf calls (predominantly low-VAF calls) may have suffered due to low coverage. 254

255 The VAF correlation between truth set and Ion Torrent WES (R=0.928) is lower than that 256

between truth set and AmpliSeq (R=0.958), although the vast majority of the HighConf SNV calls 257 in Ion Torrent data still stay within the 95% confidence interval area (Fig. 2d). 258

259 There are uninterpretable Unclassified calls (red X’s) at the bottom for high-VAF calls, 260

which is again highly suggestive that the true positive rate for Unclassified calls may be lower 261 than the reported validation rate (25%) for Ion Torrent data as well. The indel equivalent is 262

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 7: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

included in Suppl. Fig. S7b. 263 264

Independent Confirmation of Call Set with WES on HiSeq 265 266 We used 14 HiSeq WES replicates from six sequencing centers to evaluate the 267

concordance between these data sets and the WGS data sets employed to construct the truth 268 set. While the WES data sets were not sequenced from orthogonal platforms, they provide 269 insights in terms of the reproducibility of our call sets in different library preparations. The scatter 270 plot between the super set derived VAF and medium HiSeq WES-derived VAF is presented in Fig. 271 2e. Almost all truth set (HighConf and MedConf calls) variants had consistent VAFs calculated 272 from both sources. 273

274 Again, simple rules were implemented for validation with the WES data as well (Table 2). 275

The validation rate for HighConf, MedConf, LowConf, and Unclassified SNV calls by WES were 276 100%, 98.4%, 93.1%, and 42.4%. These validation rates are higher than other methods because 277 these WES data were sequenced on the same platform and sequencing centers as those used to 278 build the truth set. Thus, the truth set variant calls are reproducible in WES, though these data 279 sets do not eliminate sequencing center or platform specific artifacts that may exist in both WGS 280 and WES data sets. The indel equivalent is the subject of Suppl. Fig. S7c. 281 282 Validation with tumor content titration series 283

284 To evaluate the effects of tumor purity, we pooled HCC1395 DNA with HCC1395BL DNA 285

at different ratios to create a range of admixtures representing tumor purity levels of 100%, 75%, 286 50%, 20%, 10%, 5%, and 0%. For each tumor DNA dilution point, we performed WGS on a HiSeq 287 4000 with 300 × total coverage by combining three repeated runs (manuscript DOI: 288 10.1101/626440). We plotted the VAF fitting score between the expected values based on the 289 super set vs. the observed values at each tumor fraction (Fig. 2f). For real somatic mutations, 290 their observed VAF should scale linearly with tumor fraction in the tumor-normal titration series. 291 In contrast, the observed VAF for sequencing artifacts or germline variants will not scale in this 292 fashion. Fig. 2f shows that the fitting scores for HighConf and MedConf calls are much higher than 293 LowConf and Unclassified calls across all VAF brackets, indicating that the HighConf and MedConf 294 calls contain far more real somatic mutations than LowConf and Unclassified calls. The formula 295 [Eq. 2] for the fitting score is described in the Methods. 296

297 Definition and Confirmation of Germline SNVs/Indels in matched normal 298

299 For the 21 WGS sequencing replicates of HCC1395BL (aligned with BWA MEM, Bowtie2, 300

and NovoAlign to create 63 BAM files) we employed four germline variant callers, i.e., 301 FreeBayes28, Real Time Genomics (RTG)29, DeepVariant30, and HaplotypeCaller31, to discover 302 germline variants (SNV/indels). To consolidate all the calls, a generalized linear mixed model 303 (GLMM) was fit for each set of SNV calls which are sequenced at different centers on various 304 replicates, aligned by the three aligners, and discovered by the four callers. We estimated the 305 SNVs/indel call probability (SCP) averaged across four factors (sequencing center, sequencing 306

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 8: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

replicate, aligner, and caller), and examined the variance of SCP across these factors. The SNV 307 candidates considered were called at least four times (out of a maximum of 21x3x4=252 times) 308 by various combination of the four factors. The frequency histogram of the averaged SCPs 309 demonstrates a bimodal pattern (Fig. 3a). The vast majority of SNV calls (97%) had SCPs either 310 below 0.1 (57%) or above 0.9 (40%). Only a small minority of calls (3%) lie between 0.1 and 0.9. 311 This indicates when SNVs were repeatedly sequenced and called, only a small proportion of them 312 would be recurrently called as SNVs, and those recurrent calls were in fact highly recurrent. 313

314 Each of our germline SNV or indel calls had annotated SCP. See the Methods and Eq. 2 for 315

details. Suppl. Table S7 demonstrates that, of the highest-confidence calls (SCP=1, i.e, they were 316 called everywhere), the validation rates were approximately 99% for SNV and 98% for indels by 317 Illumina MiSeq, and 98% and 97% for Ion Torrent. Of the 11 SNV with SCP below 0.5, all were not 318 confirmed by MiSeq. Other calls had intermediate validation rates. 319

320 Figs 3b and 3c display that the vast majority of confirmed germline VAF was around 50% 321

and 100%. A considerable number of lower-confidence germline SNV calls clustered around 20% 322 VAF in non-exonic regions (Fig. 3b), with a large proportion of them being uninterpretable during 323 validation. Scatter plots for indels are qualitatively similar (Suppl. Fig. S13). 324

325 SNV Functional Relevance and TMB Benchmarks 326

327 Among the truth set somatic mutations, 186 COSMIC SNVs and 7 COSMIC indels are in 328

the coding region. One hotspot somatic mutation of particular biological significance is a TP53 329 c.128G>A (COSMIC99023, chr17:7675088 C>T, VAF>99%), which causes an amino acid change 330 p.Arg43His that leads to the inactivation of TP53 tumor suppressive function32. In addition, there 331 is also a stop gain mutation in BRCA2 c.4777G>T (COSMIC13843, chr13:32339132 G>A), which 332 causes a nonsense at p.Glu1593*, though it is only a heterozygous variant with VAF of 37.5%. 333 Furthermore, there is a missense mutation in FGFR1 c.473C>T (COSM1456963, chr8: 38428420 334 G>A, VAF>99%). 335 336

Of the over 3.5 million high-confidence germline variants discovered in HCC1395BL, 9,016 337 of them are in the ClinVar database. Most of them were annotated as “benign” or “like benign”; 338 however, 8 SNVs were annotated as “pathogenic” (Suppl. Table S9). One germline variant likely 339 to substantially increase the risk of an affected patient to develop breast cancer is a premature 340 stop gain in BRCA1 (chr17:43057078, c.5251C>A, p.Arg1751*, ClinVar #55480, OMIM Entry 341 #604370). The lifetime risk of breast cancer for carriers of this variant is 80 to 90%33. The 342 premature stop codon deactivates BRCA1’s function to repair DNA double-strand breaks. It is one 343 of the most common germline variants among breast cancer patients. HCC1395 has both BRCA1 344 and TP53 completely inactivated, one from germline and one acquired somatically. The loss of 345 two critical tumor suppressor genes likely contributed to tumorigenesis. A full list of COSMIC 346 somatic mutations and ClinVar germline variants in the coding region is provided in Supplemental 347 File 2. 348 349

Tumor mutational burden (TMB) is defined as the number of non-synonymous somatic 350

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 9: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

variants per unit area of the genome, i.e., typically the number of non-synonymous mutations 351 per Mbps34. Recent literature increasingly has reported correlations between TMB and response 352 to anti-PD(L)-1 immunotherapy treatment35. The “gold standard” to measure TMB is to perform 353 tumor-normal WES and find the total number of non-synonymous mutations (all in the coding 354 regions). Due to the high cost and time required for WES, researchers are trying to infer TMB 355 with much smaller and less expensive targeted oncology panels. One way to increase the 356 statistical power of a much smaller panel is to measure all somatic mutations, including 357 synonymous mutations, which is expected to correlate highly with frequency of non-synonymous 358 mutations if we believe most somatic mutations, especially in high-TMB patients, occur more-or-359 less randomly. We inferred TMB with various commercially available target panels. The 360 uncertainties of mutation rate (calculated as the 95% binomial confidence interval) inferred by 361 smaller oncology panels are quite large, so we advise caution when attempting to infer TMB from 362 targeted oncology panels (Suppl. Table S10). 363 364 Defining Genome Callable Regions 365

366 Accurate variant calling requires an abundance of high-quality reads aligned accurately to 367

the genomic coordinates in question. False positives are overwhelmingly enriched in genomic 368 regions where the alignments are challenging, base call qualities are low, and/or reported 369 coverage is far from the mean or median36. There are parts of the human genome that cannot be 370 covered by current technologies (Fig. 4a). To obtain the callable regions, we ran GATK CallableLoci 371 on each of the 63 HCC1395 and HCC1395BL BAM files to identify regions of low coverage (<10), 372 ultra-high coverage (8× the mean coverage of the sample), difficult to map (MQ<20), poor 373 reads (Base Quality Score BQ<20), or with N in the reference genome. We then created 374 consensus callable regions that we deemed callable for our truth set. A limitation of our callable 375 regions and our truth set is that they were defined and relied on short-read sequencing 376 technologies (i.e., Illumina sequencers), because currently only high-accuracy short-read 377 technologies are fit for somatic variant detection due to their low VAF. Variant calls outside the 378 consensus callable regions were labeled NonCallable in the super set and truth set to warn users 379 of these potential problems (details in Methods). NonCallable regions consisted of approximately 380 8% of the genome but contained over 34% of all Unclassified calls and 23% of all LowConf calls in 381 the super set (Suppl. Table S6). 382

383 The consensus callable regions consist of a total of 2.73 billion bps (Fig. 4b). In comparison 384

with GiaB NA12878 genome’s more strictly defined high-confidence (HC) regions7, 88% of our 385 consensus callable regions are in common with GiaB’s HC regions. On the other hand, 98% of 386 GiaB’s HC regions are a part of our conensus callable regions. Unlike GiaB’s HC which exclude 387 regions with structural variations as well as regions where variant calls are inconsistent with 388 pidigree or regions with unexplained pipeline inconsnstencies, when there were disagreements 389 in a variant call from various sequencing data, we did not exclude the region. Instead, we 390 attempted to resolve these discrepancies. When there were nearby structural changes, we relied 391 on machine learning algorithms to resolve these challenging events. As a result, our conensus 392 callable regions included some difficult genomic regions, such as human leukocyte antigen (HLA) 393 and olfactory receptor genes which contain high homologus sequences. The confidence (or the 394

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 10: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

lack thereof) we hold for each variant call is annotated on a per call basis. We have demonstrated 395 some benchmarking results with different regions in the Supplementary, section 1.10. 396

397 398

Discussion 399

400 Through a community effort, we generated a high confidence somatic mutation call set 401

with limit of detection (LOD) at 5% VAF (Fig. 2b). To employ as an accuracy benchmark, we 402 recommend considering the variant calls labeled with both HighConf and MedConf as true 403 positives. These true positive variants can be used to assess sensitivity, i.e., the fraction of those 404 variants detected by a pipeline. On the other hand, variant calls labeled as Unclassified plus any 405 unspecified genomic coordinates are likely false positives. LowConf calls could not be confidently 406 determined here and should be blacklisted for current accuracy evaluation. LowConf calls had 407 validation rates around 50%, and often had VAF below our 50× depth detection limit. They 408 represent opportunities for future work to ascertain their actual somatic status. 409

410 The confidence level of each variant call was determined by the “PASS” classifications 411

provided by SomaticSeq across different sequencing centers with different aligners (see 412 Methods). If a variant was not detected by any caller in a data set, it was considered “Missing” in 413 that data set, which is common for low-VAF calls due to stochastic sampling. For most calls, 414 however, they either had “PASS” classifications or “REJECT or Missing” classifications, but not 415 both. Few variant candidates had a large number of “PASS and REJECT” classifications (Suppl. Fig. 416 S6a). HighConf calls had many “PASS” classifications, very few “REJECT” classifications, and a full 417 range of VAFs. MedConf calls had fewer “PASS” calls (still high), still very few “REJECT” 418 classifications, but were mostly low-VAF, which explains the lower number of “PASS” calls. 419 LowConf calls had even fewer “PASS” calls than MedConf though they overlaped significantly, 420 and also a low number of “REJECT” classifications. LowConf calls tended to have even lower VAF 421 than MedConf, around or below our detection limit (Fig. 2b). Only Unclassified calls suffered a 422 significant number of “REJECT” classifications, and they also displayed a full range of VAF. The 423 performance of Unclassified calls indicated that SomaticSeq labeled them “REJECT” due to poor 424 mapping, poor alignment, germline risk, or causes other than lack of variant reads. HighConf and 425 Unclassified calls are far apart in all of the metrics describedabove. 426

427 Variant re-sequencing with AmpliSeq (Suppl. Fig. S6c) pointed to a high validation rate for 428

HighConf and MedConf calls. Suppl. Fig. S6c also contains a cluster of Unclassified and LowConf 429 calls in the middle of the XY plane, representing calls with some conflicts (i.e., large number of 430 “PASS” and “REJECT” calls). 431

432 Each time a human cell divides, somatic mutations could be introduced by replication 433

errors. Somatic mutations can occur much more frequently in cancer cells with malfunctioning 434 DNA repair systems. It is not feasible to detect extremely low-VAF somatic mutations because 435 they may appear in few tumor cells. Our ”truth set” for somatic mutation was built upon WGS 436 with 50×-100X coverages, and thus it was designed to detect somatic mutations limited to 5% of 437 VAF. Variants with low-VAF (≤12%) were cross-referenced with two data sets with depths over 438

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 11: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

300× to ascertain their presence. While we do not expect our truth set to be 100% accurate or 439 100% comprehensive, the AmpliSeq and Ion Torrent data sets demonstrated combined 99% and 440 91% validation rates for HighConf and MedConf SNV calls, respectively. AmpliSeq also showed a 441 94% validation rate for HighConf indel calls. VAF of 5% represents the lower detection limit of the 442 first release of the somatic mutation truth set, even though there are many true mutations with 443 VAF under that threshold. We recommend that if using this truth set as a benchmark, novel 444 variant calls (i.e., variants calls not present in our super set) with VAF<5% should be blacklisted 445 from the accuracy calculations because we cannot confidently determine their status. Due to 446 losses of chr6p, chr16q, and chrX in HCC1395BL (Suppl. Fig S1, S2), somatic mutations in these 447 regions were excluded. 448

449 For the first time, tumor-normal paired “reference samples” with a whole-genome 450

characterized somatic mutation and germline “truth sets” are available to the community. Our 451 samples, data sets, and the list of known somatic mutations can serve as a public resource for 452 evaluating NGS platforms and pipelines. The massive and diverse amount of sequencing data 453 generated from multiple platforms at multiple sequencing centers can help tool developers to 454 create and validate new algorithms and to build more accurate artificial intelligence (AI) models 455 for somatic mutation detection. The reference samples and call set presented here can help in 456 assay development, qualification, validation, and proficiency testing. Such community defined 457 tumor-normal paired reference samples can be helpful in quality assessment by clinical 458 laboratories engaged in NGS, data exchange between laboratories, characterization of gene 459 therapy products, and premarket review of NGS-based products. Furthermore, the methodology 460 used in this study can be extended to establish truth sets for additional cancer reference samples. 461 Other reference sample efforts may be able to build on the data sets we established or consider 462 using these samples as a genomic background for other reference samples. 463

464 465

Methods 466

See Online Methods 467 468

Acknowledgements 469

We thank Justin Zook of the National Institute of Standard Technology for advice in establishing 470 reference samples and truth set; Sivakumar Gowrisankar of Novartis, Susan Chacko of the Center 471 for Information Technology, the National Institute of Health for their assistance with data 472 transfer; Dr. Jun Ye of Sentieon for providing the Sentieon software package. We also appreciate 473 Dr. Laufey Amundadottir of the Division of Cancer Epidemiology and Genetics, National Cancer 474 Institute (NCI), National Institutes of Health (NIH), for the sponsorship and the usage of the NIH 475 Biowulf cluster; Drs Reena Phillip, Yun-Fu Hu, Sharon Liang, and You Li of the Center for Devices 476 and Radiological Health, U.S. Food and Drug Administration, for their advices on study design and 477 manuscript writing; and Seven Bridges for providing storage and computational support on the 478 Cancer Genomic Cloud (CGC). The CGC has been funded in whole or in part with Federal funds 479 from the National Cancer Institute, National Institutes of Health, Contract No. 480 HHSN261201400008C and ID/IQ Agreement No. 17X146 under Contract No. 481

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 12: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

HHSN261201500003I. Chunlin Xiao and Steve Sherry were supported by the Intramural Research 482 Program of the National Library of Medicine, National Institutes of Health. This work also used 483 the computational resources of the NIH Biowulf cluster (http://hpc.nih.gov). Original data was 484 also backed up on the servers provided by Center for Biomedical Informatics and Information 485 Technology (CBIIT), NCI. We would also like to thank the partially support from the Ardmore 486 Institute of Health (AIH) grant (2150141) and Dr. Charles A. Sims' gift to Loma Linda University 487 (LLU) Center for Genomics. The LLU Center for Genomics is partially supported by AIH grant 488 (2150141) and Charles A. Sims’ gift. 489 490

Disclaimer 491

This is a research study, not intended to guide clinical applications. The views presented in this 492 article do not necessarily reflect current or future opinion or policy of the US Food and Drug 493 Administration. Any mention of commercial products is for clarification and not intended as 494 endorsement. 495 496

Data availability 497

All raw data (FASTQ files) are available on NCBI’s SRA database (SRP162370). The truth set for 498 somatic mutations in HCC1395, VCF files derived from individual WES and WGS runs, and source 499 codes are available on NCBI’s ftp site (ftp://ftp-500 trace.ncbi.nlm.nih.gov/seqc/ftp/Somatic_Mutation_WG/). Some alignment files (BAM) are also 501 available on Seven Bridges’ s Cancer Genomics Cloud (CGC) platform. 502

503

Author contributions 504

Study conceived and designed by: W.X., C.W., L.S., H.H., E.D., Z.T., B.N., W.T., R.J. 505 506 Biosample preparation: L.K., K.L., M. M., T.H., W.C. 507 508 NGS library preparation and sequencing: W.C., Z.C., S.S., K.I., H.J., E.J., G.S., S.S., J.S., P.K., J.B., B.T., 509 V.P., M.S., T.H., E.P., R.K., J.D., P.V., R.M., D.G., S.K., E.R., A.S., J.N., U.L., J.W., J.L., P.PH. 510 511 Data analysis: L.T.F., W.X., B.Z., Y.Z., Z.Y., C.L., O.A., L.S., J.L., T.S., K.T., D.M., C.N., M.C., S.M.S., 512 M.M., Y.G., L.Y., H.L., M.P., Z.L. 513 514 Data management: W.X., C.X., S. S. 515 516 Manuscript writing: L.T.F., W.X., R.K., M.M., C.X., S.S. 517 518 Project management: W.X. 519 520

References 521

522 1. Hofmann, A. L. et al. Detailed simulation of cancer exome sequencing data reveals differences 523

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 13: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

and common limitations of variant callers. BMC Bioinformatics 18, 8 (2017). 524

2. Krøigård, A. B., Thomassen, M., Lænkholm, A.-V., Kruse, T. A. & Larsen, M. J. Evaluation 525

of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted 526

Deep Sequencing Data. PLOS ONE 11, e0151664 (2016). 527

3. Shi, W. et al. Reliability of Whole-Exome Sequencing for Assessing Intratumor Genetic 528

Heterogeneity. Cell Rep. 25, 1446–1457 (2018). 529

4. Tezak, Z. & Berger, A. Considerations for Design, Development, and Analytical Validation 530

of Next Generation Sequencing (NGS) – Based In Vitro Diagnostics (IVDs) Intended to Aid 531

in the Diagnosis of Suspected Germline Diseases. (2018). 532

5. Eberle, M. A. et al. A reference data set of 5.4 million phased human variants validated by 533

genetic inheritance from sequencing a three-generation 17-member pedigree. Genome Res. 27, 534

157–164 (2017). 535

6. The International HapMap Project. Nature 426, 789 (2003). 536

7. Zook, J. M. et al. Integrating human sequence data sets provides a resource of benchmark SNP 537

and indel genotype calls. Nat. Biotechnol. 32, 246–251 (2014). 538

8. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference 539

calls. Nat. Biotechnol. 1 (2019). doi:10.1038/s41587-019-0074-6 540

9. National Institute for Biological Standards and Control. WHO Reference Panel 1st 541

International Reference Panel for genomic KRAS codons 12 and 13 mutations NIBSC code: 542

16/250. (World Health Organization). 543

10. Oncospan Reference Standard HD827. Available at: 544

https://www.horizondiscovery.com/reference-standards/type/oncospan. (Accessed: 17th April 545

2019) 546

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 14: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

11. Craig, D. W. et al. A somatic reference standard for cancer genome sequencing. Sci. Rep. 6, 547

24607 (2016). 548

12. Zehnbauer, B. et al. MDIC SRS Report: Somatic Variant Reference Samples for NGS. 549

(2019). 550

13. Popova, T. et al. Ploidy and Large-Scale Genomic Instability Consistently Identify Basal-like 551

Breast Carcinomas with BRCA1/2 Inactivation. Cancer Res. 72, 5454–5462 (2012). 552

14. Gazdar, A. F. et al. Characterization of paired tumor and non-tumor cell lines established from 553

patients with breast cancer. Int. J. Cancer 78, 766–774 (1998). 554

15. Kalyana-Sundaram, S. et al. Gene Fusions Associated with Recurrent Amplicons Represent a 555

Class of Passenger Aberrations in Breast Cancer. Neoplasia 14(8), 702–708 (2012). 556

16. Zhang, J. et al. INTEGRATE: gene fusion discovery using whole genome and transcriptome 557

data. Genome Res. 26, 108–118 (2016). 558

17. Robinson, D. R. et al. Functionally recurrent rearrangements of the MAST kinase and Notch 559

gene families in breast cancer. Nat. Med. 17, 1646–1651 (2011). 560

18. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 561

ArXiv13033997 Q-Bio (2013). 562

19. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 563

357–359 (2012). 564

20. NovoCraft Technologies Sdn Bhd. NovoAlign. Available at: 565

http://www.novocraft.com/products/novoalign/. 566

21. Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous 567

cancer samples. Nat. Biotechnol. 31, 213–219 (2013). 568

22. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome 569

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 15: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

sequencing data. Bioinformatics 28, 311–317 (2012). 570

23. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in 571

cancer research. Nucleic Acids Res. 44, e108 (2016). 572

24. Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model 573

improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 574

17, 178 (2016). 575

25. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. 576

Methods 15, 591 (2018). 577

26. Freed, D., Pan, R. & Aldana, R. TNscope: Accurate Detection of Somatic Mutations with 578

Haplotype-based Variant Candidate Detection and Machine Learning Filtering. bioRxiv 579

250647 (2018). doi:10.1101/250647 580

27. Fang, L. T. et al. An ensemble approach to accurately detect somatic mutations using 581

SomaticSeq. Genome Biol. 16, 197 (2015). 582

28. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. 583

ArXiv12073907 Q-Bio (2012). 584

29. Real Time Genomics (RTG) Variant Caller. Available at: https://www.realtimegenomics.com/. 585

30. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. 586

Nat. Biotechnol. 36, 983–987 (2018). 587

31. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. 588

bioRxiv 201178 (2017). doi:10.1101/201178 589

32. Soussi, T., Leroy, B. & Taschner, P. E. M. Recommendations for Analyzing and Reporting 590

TP53 Gene Variants in the High-Throughput Sequencing Era. Hum. Mutat. 35, 766–778 591

(2014). 592

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 16: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

33. OMIM Clinical Synopsis - #604370 - BREAST-OVARIAN CANCER, FAMILIAL, 593

SUSCEPTIBILITY TO, 1; BROVCA1. Available at: 594

https://www.omim.org/clinicalSynopsis/604370. (Accessed: 22nd March 2019) 595

34. Meléndez, B. et al. Methods of measurement for tumor mutational burden in tumor tissue. 596

Transl. Lung Cancer Res. 7, 661–667 (2018). 597

35. Goodman, A. M. et al. Tumor Mutational Burden as an Independent Predictor of Response to 598

Immunotherapy in Diverse Cancers. Mol. Cancer Ther. 16, 2598–2608 (2017). 599

36. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. 600

Bioinformatics 30, 2843–2851 (2014). 601

602 603

NGS technologies platforms # of reads (coverage)

HCC1395 HCC1395BL

Initial WGS

HiSeq 57 billion (2,800X) 57 billion (2,800X)

NovaSeq 13 billion (650X) 13 billion (650X)

10X Genomics 20 billion (1,000X) 20 billion (1,000X)

PacBio 20 million (50X) 20 million (50X)

Validation

WGS-tumor content

HiSeq 7.6 billion (380X) 7.6 billion (380X)

WES HiSeq 5 billion (12,500X) 5 billion (12,500X)

Ion Torrent 67 million (34X) 82 million (47X)

AmpliSeq MiSeq 3.3 million (2000X) 3.3 million (2000X)

Table 1. Massive data from multiple NGS platforms was obtained to derive and confirm germline 604 and somatic variants in HCC1395 and HCC1395BL 605

606 Validation Platform

Variant Type Category Total

Number Fraction Interpretable

Validation Rate (Interpretable)

Validation Rate (Total)

AmpliSeq Deep Sequencing

SNV

HighConf 247 (237/247) 96.0% (234/237) 98.7% 94.7%

MedConf 40 (37/40) 92.5% (34/37) 91.9% 85.0%

LowConf 58 (41/58) 70.7% (22/41) 53.7% 37.9%

Unclassified 105 (62/105) 59.0% (7/62) 11.3% 6.7%

INDEL

HighConf 17 (17/17) 100.0% (16/17) 94.1% 94.1%

MedConf 2 (2/2) 100.0% (2/2) 100.0% 100.0%

LowConf 1 (0/1) 0.0% (0/0) NA nan%

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 17: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

Unclassified 1 (1/1) 100.0% (0/1) 0.0% 0.0%

Ion Torrent WES

SNV

HighConf 703 (629/703) 89.5% (623/629) 99.0% 88.6%

MedConf 43 (27/43) 62.8% (24/27) 88.9% 55.8%

LowConf 134 (39/134) 29.1% (28/39) 71.8% 20.9%

Unclassified 802 (155/802) 19.3% (39/155) 25.2% 4.9%

INDEL

HighConf 31 (25/31) 80.6% (22/25) 88.0% 71.0%

MedConf 15 (7/15) 46.7% (6/7) 85.7% 40.0%

LowConf 8 (0/8) 0.0% (0/0) NA nan%

Unclassified 36 (8/36) 22.2% (6/8) 75.0% 16.7%

WES

SNV

HighConf 1074 (1068/1074) 99.4% (1068/1068) 100% 99.4%

MedConf 64 (63/64) 98.4% (62/63) 98.4% 96.9%

LowConf 197 (144/197) 73.1% (134/144) 93.1% 68.0%

Unclassified 1218 (436/1218) 35.8% (184/436) 42.4% 15.1%

INDEL

HighConf 45 (43/45) 95.6% (43/43) 100% 95.6%

MedConf 17 (17/17) 100.0% (17/17) 100% 100.0%

LowConf 13 (10/13) 76.9% (9/10) 90% 69.2%

Unclassified 54 (19/54) 35.2% (14/19) 73.7% 25.9%

607 Table 2: Validation of SNVs of different confidence levels by three different methods 608

Figure legends 609

610

Figure 1: Schematic of the bioinformatics pipeline used to define the confidence levels of the 611 super set and truth set (see Online Methods for detail) 612

613 Figure 2: Initial definition of somatic mutation truth set and subsequent validation. (a) A 614 breakdown of the four confidence levels in the super set. (b) Histograms of VAF for SNVs (top) 615 and Indels (bottom) calls. (c) Validation of initial definition of somatic mutation truth set with 616 AmpliSeq. Solid circles are variant calls that were positively confirmed. Open circles are variants 617 that were not confirmed. X’s are when validation data were deemed uninterpretable due to low 618 depth or unclear signal. The dashed lines at the diagonal represent the 95% binomial confidence-619 interval of observed VAF given the actual VAF, calculated based on 2000× depth for AmpliSeq. 620 The figure shows very high correlation between VAF estimated from super set data and validation 621 data for HighConf calls (R=0.958). Many Unclassified data points lie at the bottom, implying that 622 those calls were not real mutations despite the large number of apparent variant-supporting 623 reads in the super set data. X-axis: VAF calculated from the super set. Y-axis: VAF calculated from 624 AmpliSeq data. (d) Validation of the initial definition of the somatic mutation truth set with Ion 625 Torrent WES. The 95% binomial confidence-interval dash lines were calculated based on 34× 626 depth for Ion Torrent. R=0.928 for HighConf calls. (e) Validation of initial definition of somatic 627 mutation truth set with 12 repeats of WES on the HiSeq platform. Y-axis: median VAF calculated 628 based on 12 HiSeq WES replicates. The 95% binomial confidence-interval dashed lines were 629 calculated based on 150× depth for HiSeq WES. R=0.992 for HighConf calls. (f) Average tumor 630 purity fitting scores for the VAF of each SNV across the four different confidence levels vs. the 631 observed VAF in the tumor-normal titration series. The formula for fitting scores is described in 632

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 18: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

Eq. 1 in the Online Methods. 633

634

Figure 3: Initial definition of germline variants and validation. (a) Histogram of SNV call probability 635 for germline variants identified by four callers from 63 BAM files. (b) VAF scatter plot of germline 636 SNVs by the truth set and AmpliSeq. R=0.986 for SCP=1 calls. (c) VAF scatter plot of germline SNVs 637 by the truth set and Ion Torrent WES. R=0.758 for SCP=1 calls. 638

639

Figure 4: Genome coverage and high-confidence regions on reference genome GRCh38. a) 640 Genome coverage comparison between three technologies. Inner track: PacBio. Middle track: 641 10X Genomics. Outer track: Illumina. Red line: HCC1395. Green line: HCC1395BL. b) Genome 642 regions coverage by Illumina short reads in comparison to NA12878. Inner track: NA12878. 643 Middle track: the callable regions in HCC1395 and HCC1395BL. Outer track: gene density 644

645

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 19: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 20: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

a b c

d e f

VAF

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 21: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

a b

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint

Page 22: Establishing reference samples for detection of somatic … · 1 Establishing reference samples for detection of somatic 2 mutations and germline variants with NGS technologies 3

a

b c

made available for use under a CC0 license. certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also

The copyright holder for this preprint (which was notthis version posted May 13, 2019. ; https://doi.org/10.1101/625624doi: bioRxiv preprint