kmerstream streaming algorithms for k-mer abundance estimation páll melsted @pmelsted joint work...
TRANSCRIPT
KMERSTREAM
Streaming algorithms for k-mer abundance estimation
Páll Melsted @pmelstedjoint work with Bjarni V. Halldórsson
Error rates vs. Quality values
What error rates can we expect from NGS Specifically whole genome sequencing with
Illumina sequencing technology
How informative are quality values Rubbish? Worth using for analysis?
Quality values
A probability estimate that the basecall is correct
@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C
Phred scale, Pr[base call incorrect] ~ 10-Q/10
!=33, bad basecall
Error rates
What percentage of basecalls are correct How to estimate
Align reads to a reference Count mismatches and non-alignments Correct for snps and variants.
Reference free Whole genome assembly?
K-mer counting
Count k-mers, want k large, say ~31.
GATTTGGGGTTCAAAGCAGTAGAT ATT TTT TTG TGG ...
GAT 2ATT 3TTT 2TTG 3TGG 2...
ATTTGGGGTTGATT ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT
Errors and k-mers
Basecall errors impact many k-mersGATTTGGGGTTCAAAGCAGTAGATTT ATTTG TTTGG TTGGG TGGGG GGGGT
... AAGCAG AGCAGT GCAGTA
GATTTGGGGTTCAAAGCAGTAGATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC
Errors and k-mers
Basecall errors are not independent Multiple errors more likely Ends of reads contain more errors
K-mer error rate underestimates true basecall error rate Discounts reads with many errors or errors
at the ends Can be off by a factor of 2
Frequency histograms
Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1
Naïve method
Assumptions: Sampling from a genome of size G Poisson distribution, Poi(λ), of coverage of
each position Each k-mer sampled is an error with prob ε
independently. When we sample an error k-mer, it is
replaced by a single nucleotide substitution at random
Naïve model
Probability that a k-mer has coverage 1 ε Pr[error k-mer has cov 1]
+ (1-ε) Pr[true k-mer has cov 1]
ε 1-ε
TGAC
TGACTGGC
Genome length GSample random position
Produce correct k-merIntroduce one error
Frequency moments
From the frequency histogram we define fi = number of k-mers with coverage i f1 = number of singletons F0 = number of distinct k-mers
= Σ fi
F1 = number of all k-mers = Σ i fi
Computing the moments
Count all k-mers? – very memory intensive
Sample k-mers (à la KmerGenie) Streaming algorithm, KmerStream
Estimates f1, F0, F1 directly without storing any k-mers
Accuracy can be specified (default ~2%)
KmerStream
Very fast, 5-10s per million reads Low memory overhead, ~11M One pass over the dataset
Uses hashing to sample k-mers adaptively Lossy counting similar to Bloom filter Does not keep track of k-mers
2-3x faster than KmerGenie, 10x better memory
Validation
Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times.
KmerStream estimates True kmer counts
Real data
Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage.
KmerStream run for all samples, model fit to estimate k-mer error rates for k=31
Quality cutoff
Keep only k-mers in reads where quality is above q.
Run for q = 0, 13, 20, 30.
Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001
K-mer error rates
Moving from q0 to q13 huge improvementq20 to q30 not recommended, 50% samples increased error rate
Wrap up
Quality values are informative Can get speed up by prioritizing processing
based on quality values e.g. alignment
Error rates are highly variable Quality value cutoffs can be done on a
case by case basis with minimal overhead.