sept2016 smallvar illumina_platinumgenomes
TRANSCRIPT
COMPANY CONFIDENTIAL – FOR INTERNAL USE ONLY © 2016 Illumina, Inc. All rights reserved. Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the US and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.
QC-ing and merging truth data Michael Eberle September 15, 2016
Pedigree validation (alone)
● Using inheritance lets us systematically incorporate “good” variants from different callers
- Systematic errors are unlikely to result in pedigree-consistent calls
● Platinum Genomes (PG) merges calls from six different workflows
- Small variant call set includes 4,862,204 SNVs and 758,540 indels
● Compared to single samples or trios, this call set will…
- Mostly fail calls co-segregating with germline or cell line CNVs
- Fail germline or cell line de novo mutations
● Even with the pedigree check, problems can still arise
- Same variant may be included twice (merging problem)
- Systematic errors may create pedigree-consistent calls (SV edges)
- Incorrect ploidy may still produce “platinum” variants
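To illustrate why systematic errors rarely survive a pedigree check, the sketch below tests whether a set of called genotypes fits a known pattern of haplotype transmission. It is a minimal sketch under stated assumptions, not the Platinum Genomes implementation: the function name `pedigree_consistent`, the `transmissions` encoding, and the restriction to biallelic sites in a two-parent family are all hypothetical choices made for the example.

```python
from itertools import product

def pedigree_consistent(father_gt, mother_gt, child_gts, transmissions):
    """Return True if some assignment of alleles to the four parental
    haplotypes reproduces every called genotype.

    Genotypes are tuples of 0 (ref) / 1 (alt) alleles, e.g. (0, 1).
    transmissions[i] = (f, m) means child i inherited the father's
    haplotype f and the mother's haplotype m (each 0 or 1). The
    transmission pattern is assumed to be known already.
    """
    for fa in product((0, 1), repeat=2):        # alleles on the father's two haplotypes
        if sorted(fa) != sorted(father_gt):
            continue
        for mo in product((0, 1), repeat=2):    # alleles on the mother's two haplotypes
            if sorted(mo) != sorted(mother_gt):
                continue
            if all(sorted((fa[f], mo[m])) == sorted(gt)
                   for gt, (f, m) in zip(child_gts, transmissions)):
                return True
    return False

# Four hypothetical children covering all combinations of parental haplotypes.
transmissions = [(0, 0), (0, 1), (1, 0), (1, 1)]

# A het variant in the father segregates to the children carrying one of his haplotypes.
true_variant = pedigree_consistent((0, 1), (0, 0),
                                   [(0, 1), (0, 1), (0, 0), (0, 0)], transmissions)

# "Het in every sample" cannot be explained by any haplotype assignment.
all_het = pedigree_consistent((0, 1), (0, 1), [(0, 1)] * 4, transmissions)
print(true_variant, all_het)
```

With enough children, an artefact has to mimic a specific segregation pattern to pass, which is why pedigree-consistent systematic errors are expected to be rare.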
[Figure: pedigree diagram, members labeled 77 to 93]
Validating & filtering variants with k-mers
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG ALT
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG REF
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTG
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATAT
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTT
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGCCAGGAAATTTG
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGC
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGCCATTTGAAAAGGTATAAGTTCTGGAAGGTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CCATTTGTAAGGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGCCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
GTTCTGGAAGCTTAACAACGCCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG CAACGGCCGCCGTCAAAAATGAAATCCTAATCTTTGGCAGGAACTTTG
AAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCTTAATCTTTGGCAGGAACTTTG TATAGGTTCTGGAGGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CCATTTGTAAAGGTATAGGGTCTGCAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGA
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAATTGATATCCTA
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
CCATTTGAAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAA
CGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG TGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG
9 reads do not fully span the 51-mer
4 reads contain base errors in the 51-mer
Validate the REF with 6 k-mers and ALT with 5 k-mers
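The counting idea on this slide can be sketched as follows: build the 51-mer centered on the variant for each allele, then count the reads that contain it exactly. This is a minimal sketch, not the actual k-mer application. The REF and ALT strings are copied from the slide, but the reads are synthetic (derived by truncating or mutating those strings), the helper names are hypothetical, and reverse-complement reads are ignored for brevity.

```python
ALT = "CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCTGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG"
REF = "CCATTTGTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTG"
K = 51

def centered_kmer(seq, pos, k=K):
    """Return the k-mer of seq centered on position pos (k must be odd)."""
    half = k // 2
    if pos < half or pos + half >= len(seq):
        raise ValueError("flanking sequence too short for a centered k-mer")
    return seq[pos - half:pos + half + 1]

def count_support(reads, kmer):
    """Count reads that contain the k-mer exactly (no mismatches allowed)."""
    return sum(kmer in read for read in reads)

# Locate the single base that differs between REF and ALT.
var_pos = [i for i, (r, a) in enumerate(zip(REF, ALT)) if r != a][0]
ref_kmer = centered_kmer(REF, var_pos)
alt_kmer = centered_kmer(ALT, var_pos)

# Synthetic reads for illustration only.
reads = [
    ALT[:69],                   # spans the 51-mer: supports ALT
    ALT[:58],                   # ends inside the 51-mer: no support
    REF[:80],                   # spans the 51-mer: supports REF
    REF[:43],                   # ends inside the 51-mer: no support
    ALT[:20] + "G" + ALT[21:],  # base error inside the 51-mer: no support
]

alt_support = count_support(reads, alt_kmer)
ref_support = count_support(reads, ref_kmer)
print(f"REF k-mers: {ref_support}, ALT k-mers: {alt_support}")
```

Requiring an exact match is what excludes both the truncated reads and the reads with base errors from the supporting counts.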
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAG
TAAAGGTATAGGGTCTGGAAGCTTAACAACGGCCGCCGTC AAAAAGATATC
TAAAAGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACCTTG
TAAAGGTATAGGTTCTGGAAGCTTAAAAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAAATTTGTCTTTCC
TAAAGGTATAGGTTCTGGAAGCCTAACAACGGCCGCCGTCAA
TAAAGGTATAGGTTCTGGAAACTTAACAACGGCCGCCGTAAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCCTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCCGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAAAGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCCTAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
AAGCTTAACAACGGCCGCCGTCAAAATTGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
ACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC CGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
AGGTTCTGGAAGCTTAACAAAGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC TCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAAC
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCT
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTT
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAAT
CGTAAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC AAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC
Inconsistent variants (homopolymer example)
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 1
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAAATGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC REF
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTCAAAA TGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 2
TAAAGGTATAGGTTCTGGAAGCTTAACAACGGCCGCCGTC AAA TGATATCCTAATCTTTGGCAGGAACTTTGTCTTTCC ALT 1+2
Validated the REF allele with 6 k-mers but 0 ALT k-mers
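The homopolymer case can be reproduced in a few lines. The sketch below uses a substring of the slide's REF sequence around the A-run and shows that the two placements of a 1 bp deletion give the same haplotype, while applying both gives a 2 bp deletion that matches neither. The helper `delete_base` and the variable names are hypothetical, introduced only for this illustration.

```python
def delete_base(seq, pos):
    """Return seq with the base at index pos removed."""
    return seq[:pos] + seq[pos + 1:]

# Substring of the slide's REF sequence around the homopolymer.
REF = "CGGCCGCCGTCAAAAATGATATCCTAATC"
run_start = REF.index("AAAAA")   # first A of the run
run_end = run_start + 4          # last A of the run

alt1 = delete_base(REF, run_start)   # deletion placed at the left end (ALT 1)
alt2 = delete_base(REF, run_end)     # deletion placed at the right end (ALT 2)
# Delete the higher index first so the lower index stays valid.
alt_both = delete_base(delete_base(REF, run_end), run_start)   # ALT 1+2

print(alt1 == alt2)   # the two representations describe the same haplotype
print(alt_both)       # two bases removed: a haplotype no read supports
```

If both representations are merged as separate variants on the same haplotype, the predicted sequence is the ALT 1+2 haplotype, so no ALT k-mers are found even though the underlying 1 bp deletion may be real.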
K-mer reporting on consistent variants
● Number of GT errors within the 13-member pedigree
- GT Pass means that there is one k-mer supporting each allele/haplotype
§ Homozygous calls require two supporting k-mers
● Number of k-mer errors in the four founders
- Pass means that the haplotype-predicted k-mer is observed in the founder
● Normalized count for each k-mer
- Number of times each k-mer is observed in the 13-member pedigree divided by the number of predicted haplotypes
§ e.g. if six family members have the haplotype and we count the k-mer 60 times, then the normalized count is 60/6 = 10
● Minimum normalized count for each variant
- K-mer with the lowest normalized count for each variant
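The normalized count and the per-variant minimum can be written down directly. This is a minimal sketch: the function names and the `(count, carriers)` tuple layout are assumptions for the example, and the second k-mer's numbers are made up to show the minimum.

```python
def normalized_count(kmer_count, n_haplotypes):
    """Observed k-mer count divided by the number of predicted haplotypes."""
    if n_haplotypes <= 0:
        raise ValueError("k-mer must be predicted on at least one haplotype")
    return kmer_count / n_haplotypes

def min_normalized_count(kmers):
    """Lowest normalized count over all k-mers of one variant.

    kmers is an iterable of (kmer_count, n_haplotypes) pairs.
    """
    return min(normalized_count(count, n) for count, n in kmers)

# Slide example: six family members carry the haplotype, k-mer seen 60 times.
print(normalized_count(60, 6))

# Hypothetical variant with two k-mers; the weaker one sets the minimum.
print(min_normalized_count([(60, 6), (28, 7)]))
```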
[Figure: pedigree diagram, members labeled 77 to 93]
K-mer filtering removes ~195k “artefacts”
1 Failed/Passing indicates whether all pedigree GTs are supported by k-mers
2 Failed/Passing indicates whether all founder k-mers are observed in the proper founders
Adding content with k-mers
● Some k-mer failures due to conflicting representations
● Nearby variants that are missed can cause k-mer failures
● We only consider pedigree-consistent variants
● Working on a modified k-mer application that can take in many variants for hypothesis testing
- Merge PG and GIAB-specific variants (or other putative variants)
- Resolve conflicting representations
- Recover pedigree-failed variants
● Need more complete truth data & improved k-mer GT-ing & assessment algorithms
Beyond “platinum” variants
Recoverable SNVs
● Identified 334,652 “high-quality” SNV calls that are not pedigree consistent
- High quality = SNV called by >1 pipeline with consistent GTs (when called) and every sample contains a GT call
● Broke these into four categories based on likely cause
- CAT1 (191,087) = het in every sample (duplication or paralogous sequence)
- CAT2 (3,861) = GTs consistent with a deletion in the pedigree
- CAT3 (49,800) = variant called in only one sample (cell line de novo)
- CAT4 (25,501) = all others (duplications and cell line artefacts)
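The four categories can be sketched as a small classifier. This is a simplified illustration, not the method used for the counts above: it assumes a single two-parent family, biallelic genotypes as tuples of 0/1, input that has already failed the pedigree check, and precedence in the order CAT1 to CAT4. The "consistent with a deletion" test is approximated by allowing any homozygous call to be hemizygous (one allele plus a deletion allele `"-"`) and re-checking Mendelian inheritance per child. All function names are hypothetical.

```python
from itertools import product

def mendelian_ok(father, mother, child):
    """True if the child could take one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def with_hemizygous(gt):
    """A homozygous call may really be one allele over a deletion ('-')."""
    a, b = gt
    return [gt, (a, "-")] if a == b else [gt]

def deletion_consistent(father, mother, children):
    """True if some hemizygous reinterpretation makes every child Mendelian."""
    for f, m in product(with_hemizygous(father), with_hemizygous(mother)):
        if all(any(mendelian_ok(f, m, c) for c in with_hemizygous(child))
               for child in children):
            return True
    return False

def classify_failed_snv(father, mother, children):
    """Assign a pedigree-failed SNV to CAT1..CAT4 (input assumed to have failed)."""
    gts = [father, mother, *children]
    if all(a != b for a, b in gts):
        return "CAT1"                      # het in every sample
    if deletion_consistent(father, mother, children):
        return "CAT2"                      # explained by a segregating deletion
    if sum(1 in gt for gt in gts) == 1:
        return "CAT3"                      # variant called in only one sample
    return "CAT4"                          # everything else

print(classify_failed_snv((0, 1), (0, 1), [(0, 1)] * 3))
print(classify_failed_snv((0, 0), (1, 1), [(1, 1), (0, 1)]))
print(classify_failed_snv((0, 0), (0, 0), [(0, 1), (0, 0)]))
print(classify_failed_snv((0, 0), (0, 0), [(0, 1), (0, 1)]))
```

In the second example the child called 1/1 is read as alt over a deletion inherited from the father, which is the pattern that makes a deletion the likely cause.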
CAT1 failed SNVs: Duplications
● See an excess of “all-het” failed SNVs (red) versus consistent SNVs (blue)
● Depth confirms that these are likely true variants
- Could incorporate these as 2 ref & 2 alt or 1 ref & 3 alt…
● ~34% of CAT1 SNVs overlap population duplications from Sudmant et al. [2015]
[Figure panels: NA12878, NA12877, Children]
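One way to choose between "2 ref & 2 alt" and "1 ref & 3 alt" in a duplicated region is to compare the observed allele fraction with the fractions each dosage predicts. The sketch below picks the dosage with the highest binomial likelihood. It is an illustration only: the function name, the `error` floor, the assumption that the total copy number is already known, and the read counts in the examples are all hypothetical.

```python
from math import log

def best_alt_copies(ref_reads, alt_reads, copy_number, error=0.01):
    """Return the number of alt copies (0..copy_number) that best explains
    the read counts under a simple binomial model.

    error keeps the expected alt fraction away from exactly 0 or 1 so that
    a few miscalled bases do not make a dosage impossible.
    """
    best_copies, best_loglik = None, float("-inf")
    for alt_copies in range(copy_number + 1):
        p = min(max(alt_copies / copy_number, error), 1 - error)
        # The binomial coefficient is the same for every dosage, so it is omitted.
        loglik = alt_reads * log(p) + ref_reads * log(1 - p)
        if loglik > best_loglik:
            best_copies, best_loglik = alt_copies, loglik
    return best_copies

# Four copies, balanced reads: 2 ref & 2 alt.
print(best_alt_copies(30, 30, copy_number=4))
# Four copies, ~75% alt reads: 1 ref & 3 alt.
print(best_alt_copies(15, 45, copy_number=4))
```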
CAT2 failed SNVs: Deletions
● Clusters of CAT2 failed SNVs (red) identify deletions in this pedigree
● Depth confirms that these are deletions in the pedigree
- This deletion is common (~15%) in the population
[Figure panels: NA12878, NA12877, Children]
Incorporating CNVs into pedigree check
● Many SNVs and indels that fail the pedigree check are true variants
- Many also show population-level Hardy-Weinberg (HW) deviations consistent with the predicted call
● Can modify the pedigree check to include CNVs to create a more complete call set
- Straightforward for deletions
- Duplications will require additional “B-Allele frequency” consideration
● Need to improve CNV characterization first
- Improved breakpoint resolution
- Tandem duplications will work but not translocations
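The population-level HW deviation mentioned above can be quantified with a standard chi-square comparison of observed and expected genotype counts. The sketch below is illustrative only: the function name is hypothetical and the genotype counts are made up, chosen to contrast a site in equilibrium with the extreme "everyone is het" pattern expected when a duplication is called as a diploid SNV.

```python
def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)   # reference allele frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical population in equilibrium.
print(hwe_chi_square(25, 50, 25))
# Hypothetical all-het site, as expected over a fixed duplication.
print(hwe_chi_square(0, 100, 0))
```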