kmerstream streaming algorithms for k-mer abundance estimation páll melsted @pmelsted joint work...

KMERSTREAM

Streaming algorithms for k-mer abundance estimation

Páll Melsted @pmelstedjoint work with Bjarni V. Halldórsson

Error rates vs. Quality values

What error rates can we expect from NGS Specifically whole genome sequencing with

Illumina sequencing technology

How informative are quality values Rubbish? Worth using for analysis?

Quality values

A probability estimate that the basecall is correct

@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>C

Phred scale, Pr[base call incorrect] ~ 10-Q/10

!=33, bad basecall

Error rates

What percentage of basecalls are correct How to estimate

Align reads to a reference Count mismatches and non-alignments Correct for snps and variants.

Reference free Whole genome assembly?

K-mer counting

Count k-mers, want k large, say ~31.

GATTTGGGGTTCAAAGCAGTAGAT ATT TTT TTG TGG ...

GAT 2ATT 3TTT 2TTG 3TGG 2...

ATTTGGGGTTGATT ATT TTT TTG TGG GGG GGT GTT TTG TGA GAT ATT

Errors and k-mers

Basecall errors impact many k-mersGATTTGGGGTTCAAAGCAGTAGATTT ATTTG TTTGG TTGGG TGGGG GGGGT

... AAGCAG AGCAGT GCAGTA

GATTTGGGGTTCAAAGCAGTAGATTT ATTTG TTTGG TTGGG TGGGG GGGGT GGGTT GGTTC

Errors and k-mers

Basecall errors are not independent Multiple errors more likely Ends of reads contain more errors

K-mer error rate underestimates true basecall error rate Discounts reads with many errors or errors

at the ends Can be off by a factor of 2

Frequency histograms

Sequencing at normal coverage, ~30x, most true k-mers will have high coverage and most error k-mers will have coverage of 1

Naïve method

Assumptions: Sampling from a genome of size G Poisson distribution, Poi(λ), of coverage of

each position Each k-mer sampled is an error with prob ε

independently. When we sample an error k-mer, it is

replaced by a single nucleotide substitution at random

Naïve model

Probability that a k-mer has coverage 1 ε Pr[error k-mer has cov 1]

+ (1-ε) Pr[true k-mer has cov 1]

ε 1-ε

TGAC

TGACTGGC

Genome length GSample random position

Produce correct k-merIntroduce one error

Frequency moments

From the frequency histogram we define fi = number of k-mers with coverage i f1 = number of singletons F0 = number of distinct k-mers

= Σ fi

F1 = number of all k-mers = Σ i fi

Fitting the model

3 unknown parameters G, λ, ε 3 k-mer frequency statistics, f1, F1, F0

Computing the moments

Count all k-mers? – very memory intensive

Sample k-mers (à la KmerGenie) Streaming algorithm, KmerStream

Estimates f1, F0, F1 directly without storing any k-mers

Accuracy can be specified (default ~2%)

KmerStream

Very fast, 5-10s per million reads Low memory overhead, ~11M One pass over the dataset

Uses hashing to sample k-mers adaptively Lossy counting similar to Bloom filter Does not keep track of k-mers

2-3x faster than KmerGenie, 10x better memory

Validation

Sampled reads from PhiX sequencing lane at 30x coverage, repeated 1000 times.

KmerStream estimates True kmer counts

Real data

Sequenced at deCODE genetics, 2656 individuals, sequenced at 10x to 30x coverage.

KmerStream run for all samples, model fit to estimate k-mer error rates for k=31

K-mer error rates

Quality cutoff

Keep only k-mers in reads where quality is above q.

Run for q = 0, 13, 20, 30.

Should correspond to upper bound on error of1.0, 0.05, 0.01, 0.001

K-mer error rates

Moving from q0 to q13 huge improvementq20 to q30 not recommended, 50% samples increased error rate

Wrap up

Quality values are informative Can get speed up by prioritizing processing

based on quality values e.g. alignment

Error rates are highly variable Quality value cutoffs can be done on a

case by case basis with minimal overhead.

Thank you

Paper on bioRxiv Code on github.com/pmelsted/KmerStream

Ph.D. position available “Streaming algorithms for whole genome assembly.”

kmerstream streaming algorithms for k-mer abundance estimation páll melsted @pmelsted joint work...

Documents