all kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes
Post on 22-Jun-2015
All kmers are not created equal: finding the signal from the noise in large-scale metagenomes.
Will Trimble, Metagenomic Annotation Group, Argonne National Laboratory
BEACON seminar April 23, 2014 MSU
Apology: I speak biology with an accent
• I spent six years in dark rooms with lasers.
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to answer questions with ambiguous data.
• Shoveling data from the data-producing machine into the data-consuming furnace.
Outline
• Sequences are different (math)
• How much did my sequencing run give me? kmerspectrumanalyzer (graphs)
• How much did I sample? nonpareil-k (graphs)
• Pretty pictures: thumbnailpolish (micrographs)
Sequences are different
• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD
Instrument readings, spectra, micrographs: not categorical.
Low-throughput categorical data: categories are sound.
High-throughput sequence data: categorization is an art.
10^7 channels, 10^3 channels, 10^11 channels
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.
What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA
Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”
Searching
We know what to do with these puzzles. You go to this website, and type it in…
How long do reads need to be to recognize them?
To do what? To place them on a reference genome? This can be turned into a math problem, which I will illustrate with a search-engine analogy.
How long do reads need to be?
Information (Shannon, 1948, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data. It has profound applicability in pattern matching and modeling.
Logarithmic measurements have units!
$$ H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right) $$
A word on the sign of the entropy
• A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences of length L.
• When we learn something (say, we collect some genomes and expect our new sequences to look like them), we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.
How long do phrases need to be?
Exercise: Pick a book from your bookshelf. Pick an arbitrary page and an arbitrary line.
for n in 1..10
    type the first n words into Google Books, quoted.
    break if Google identifies your book.
• Information content of English words: H_word is ca. 12 bits per word.
• Size of Google Books? Big libraries have a few 10^7 books, each with 10^5 indexed words, so the database holds about 10^12 words. log2(database size) = log2(10^12) = log2(2^39.9) ≈ 40 bits.
• So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in Google's index.
Try it.
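The back-of-envelope estimate for phrases can be checked in a few lines. This is only a sketch of the arithmetic on the slide: the function name `items_needed` is made up for illustration, and the inputs (10^12 indexed words, 12 bits per word) are the rough figures quoted above, not measurements.

```python
import math

def items_needed(database_size, bits_per_item):
    """Return (bits needed to address the database, items needed to supply them)."""
    address_bits = math.log2(database_size)
    # Each item (word) contributes bits_per_item bits; round up to whole items.
    return address_bits, math.ceil(address_bits / bits_per_item)

# Rough figures from the slide: 10^12 indexed words, ca. 12 bits per English word.
bits, words = items_needed(1e12, 12)
print(f"{bits:.1f} bits to address the index, so about {words} words")
```

The rounding up matters: 3.3 words are not enough in practice, so the estimate lands on 4.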
It most often takes 4 words.
Not all phrases are equally distinctive.
How long do reads need to be?
• Maximum information content of base pairs: H_read = 2L bits per length-L sequence.
• Most long kmers are distinct. For a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2) ≈ 34 bits.
• So we expect that when 2L > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
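The same arithmetic for reads, as a minimal sketch. It assumes the uniform "random sequence model" (2 bits per base), so it gives a lower bound; real genomes contain repeats and will need longer reads in places. The function name `min_read_length` and the two smaller genome sizes are illustrative choices, not values from the talk.

```python
import math

def min_read_length(genome_size_bp, bits_per_base=2.0):
    """Smallest L such that bits_per_base * L covers log2(genome size)."""
    return math.ceil(math.log2(genome_size_bp) / bits_per_base)

# 1e10 bp is the figure used on the slide; 5e6 and 3e9 are illustrative
# (roughly bacterial-sized and human-sized genomes).
for size in (5e6, 3e9, 1e10):
    print(f"G = {size:.0e} bp: need at least {min_read_length(size)} bases")
```

For G = 10^10 this reproduces the 17 bases quoted above, since 4^16 is smaller than 10^10 and 4^17 is larger.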
The data deluge
• Technological breakthroughs in the mid-2000s led to inexpensive collection of tens of gigabytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST).
Picture, if you will, a hiseq flowcell
• Pairs of microbial genomes
• Microbial transcriptomes + replicates
• Environmental isolate genomes
• Environmental extract sequencing
• Preparation-intensive sequencing
• Eukaryotic sequencing
• Eukaryotic sequencing for variants
What’s in there?
Let’s count kmers!
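For readers who have not counted kmers before, here is a minimal sketch of what that means and how the kmer spectrum is derived from the counts. It is a toy: it holds everything in memory, does not merge reverse complements, and is not the counting code used by kmerspectrumanalyzer. The function names and the three example reads are invented for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map abundance -> number of distinct kmers seen that many times."""
    return Counter(counts.values())

# Toy reads, invented for illustration.
reads = ["ACGTACGTAC", "CGTACGTACG", "TTTTACGTAC"]
spectrum = kmer_spectrum(count_kmers(reads, k=5))
for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

The spectrum is the histogram plotted in the slides that follow: abundance on one axis, number of distinct kmers with that abundance on the other.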
The kmer spectrum.
[Figure: number of kmers versus 21mer abundance for a microbial genome.]
Annotations on the spectrum, from rare to abundant:
• low-abundance kmers are errors
• the main peak contains most of the genome
• the high-abundance peak contains multicopy genes
• really high-abundance stuff is often artifacts
Ranked kmer spectrum
[Figure: 21mer abundance versus kmer rank (cumulative sum of number of kmers), annotated from rare to abundant.]
Ranked kmers consumed
[Figure: fraction of observed kmers versus 21mer abundance, annotated from rare to abundant.]
The data fraction is unusually stable.
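One way to read the "ranked kmers consumed" plot is as a cumulative sum over the spectrum: for each abundance, what fraction of all kmer observations comes from kmers at least that abundant. The sketch below computes that curve. The direction of accumulation (most abundant first) is an assumption about how the plot was drawn, and both the function name and the toy spectrum are invented.

```python
def fraction_consumed(spectrum):
    """spectrum maps abundance -> number of distinct kmers with that abundance.

    Returns (abundance, fraction) pairs, most abundant first, where fraction
    is the share of all kmer observations with at least that abundance.
    """
    total = sum(abundance * n for abundance, n in spectrum.items())
    curve = []
    running = 0
    for abundance in sorted(spectrum, reverse=True):
        running += abundance * spectrum[abundance]
        curve.append((abundance, running / total))
    return curve

# Toy spectrum: many singletons (errors), a main peak, a few repeats.
toy = {1: 1000, 20: 500, 200: 5}
for abundance, fraction in fraction_consumed(toy):
    print(abundance, round(fraction, 3))
```

Weighting by abundance is what makes this a fraction of the data rather than a fraction of the distinct kmers.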
Different kinds of data have different spectra
Redundancy is good
• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction and diginorm somewhat amusingly strive for opposite ends.
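To see why the third bullet is a real problem, here is a rough memory estimate for an exact kmer table. It assumes each kmer is packed at 2 bits per base and stored with a 4-byte counter; both are illustrative assumptions, and real hash tables add overhead on top of this. The function name `kmer_table_bytes` is hypothetical.

```python
def kmer_table_bytes(distinct_kmers, k, counter_bytes=4):
    """Lower bound on memory for an exact kmer -> count table."""
    key_bytes = (2 * k + 7) // 8  # 2 bits per base, rounded up to whole bytes
    return distinct_kmers * (key_bytes + counter_bytes)

# 10 billion distinct 31mers, as in the bullet above.
gigabytes = kmer_table_bytes(10_000_000_000, 31) / 1e9
print(f"at least {gigabytes:.0f} GB before any hash-table overhead")
```

Under these assumptions the keys and counters alone nearly fill the memory budget mentioned in the bullet.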
Abundance-based inferences are better in the high-abundance part of the data.
kmerspectrumanalyzer: infer genome size and depth
$$ P_{NO}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NB}_{pdf}(x; \mu = cn, \alpha = s/n) $$
A generalization of the mixed-Poisson model to estimate how much sequence is in each peak.
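The simpler model being generalized here, a mixture of Poisson peaks, can be sketched directly. This is not the negative-binomial model from the slide and not kmerspectrumanalyzer's fitting code: it only evaluates a mixture in which the peak for copy number n has mean c·n and weight a_n. The weights are treated as fractions summing to 1, and the values of `c` and `weights` are invented for illustration.

```python
import math

def poisson_pmf(x, mean):
    """Poisson probability of observing count x, computed in log space."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def mixed_poisson(x, c, weights):
    """Mixture of Poisson peaks; weights[n - 1] is a_n, the peak at copy number n."""
    return sum(a * poisson_pmf(x, c * n) for n, a in enumerate(weights, start=1))

c = 30.0                    # assumed coverage depth of single-copy sequence
weights = [0.9, 0.08, 0.02] # assumed shares at copy number 1, 2, 3
model = [mixed_poisson(x, c, weights) for x in range(301)]
peak = max(range(len(model)), key=model.__getitem__)
print("main peak near abundance", peak)
```

Fitting would mean adjusting c and the a_n until the model matches an observed spectrum; the negative-binomial version adds a shape parameter so peaks can be wider than Poisson.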
Fig 2: Counting kmers tells you genome size
[Figure: estimated genome size (kb) versus complete genome size (kb); both axes run from 0 to 10000.]
…for single genomes, most of the time.
So much for calibration data.
The kink does measure error
[Figure: artificial E. coli data with varying substitution error rates: 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, 0.1%.]
But I want to sequence everything! OK, we can count kmers in everything too.
kmerspectrumanalyzer summarizes the distribution and estimates genome size and coverage depth.
How much novelty is in my dataset?
How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog.
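$$ \mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 0)} $$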
We can calculate this efficiently using the kmer spectrum.
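A sketch of how the non-unique fraction could be evaluated from a kmer spectrum. It rests on a reading of the formula that the slide does not spell out: r_i is a kmer abundance, n_i the number of distinct kmers with that abundance, ε the subsampling fraction, and the two Poisson terms form a ratio (the probability of seeing a kmer at least twice, given that it is seen at least once). This is not the Nonpareil-k source code, and the toy spectrum is invented.

```python
import math

def nonunique_fraction(epsilon, spectrum):
    """spectrum maps abundance r_i -> number of distinct kmers n_i.

    Requires epsilon > 0 and all abundances > 0.
    """
    total = sum(n * r for r, n in spectrum.items())
    result = 0.0
    for r, n in spectrum.items():
        lam = epsilon * r                       # expected count after subsampling
        p_ge1 = -math.expm1(-lam)               # 1 - Poisson CDF at 0
        p_ge2 = p_ge1 - lam * math.exp(-lam)    # 1 - Poisson CDF at 1
        result += (n * r / total) * (p_ge2 / p_ge1)
    return result

toy = {1: 1000, 20: 500, 200: 5}
for eps in (0.001, 0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, toy), 4))
```

Sweeping ε traces out a rarefaction-style curve from the spectrum alone, with no alignment step, which is where the speed advantage would come from.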
Nonpareil: model of sequence coverage
Nonpareil-k: kmer rarefaction
summary of sequence diversity
Nonpareil uses subset-against-all alignment to find out how much of the dataset is unique.
Nonpareil-k crunches the kmer spectrum to approximate the unique fraction, 300x faster.
Nonpareil-k: stratify datasets by coverage distribution
Most of the dataset is likely contained in the assembly.
Assembly is likely to miss or attenuate the large unique fraction of the dataset.
kmer spectra reveal sequencing problems
• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact and near-exact duplicate reads
• Unusually high error rate: indicated by a low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis
Figure 1c: PCO2 vs alpha diversity
[Figure: color_matrix[, "alpha-diversity"] versus eigen_vectors[, "PCO2"]. Linear fits: All: y = -259839.54*x + 209.62, R^2 = 0.29; Gut: y = -275950.37*x + 118.73, R^2 = 0.78; Oral: y = -369610.24*x + 298.39, R^2 = 0.7]
Figure 1d
HMP / quantile norm / euclidean / colored by alpha
MG-RAST API, R package matR
Hey kid, you want some unlabeled data?
Figure 2a
Figure 2b
Hey kid, you want some pretty ordinations?
Generalities from the kmer counting mines
• Many datasets have as much as 5-45% of the sequence yield in adapters.
• FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find).
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
• Shannon entropy is oversensitive to errors. Higher-order Rényi entropy is more stable.
kmer statistical summaries
• H0 kmer richness (VERY BAD)
• H1 Shannon entropy (BAD)
• H2 Rényi entropy / Simpson index (GOOD)
• observation-weighted coverage (BAD)
• observation-weighted size (BAD)
• observation-median coverage (GOOD)
• observation-median size (GOOD)
• fraction in top 100 kmers (USEFUL)
• fraction unique (OK but requires size correction)
Most of these give answers that vary so strongly with sampling depth as to be unusable. Observation-weighted fraction-of-data metrics behave fairly well: fractions of the data with particular properties are stable with respect to sampling.
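The ranking of H0, H1 and H2 above can be illustrated with a toy example. The sketch uses the standard definitions (H0 is the log of richness, H1 the Shannon entropy, H2 the order-2 Rényi entropy, which is minus the log of the Simpson index), all in bits. The abundance lists are invented: four kmers seen 100 times each, then the same four plus forty singleton "error" kmers.

```python
import math

def renyi_summaries(abundances):
    """Return (H0, H1, H2) in bits for a list of per-kmer abundances."""
    counts = [c for c in abundances if c > 0]
    total = sum(counts)
    probs = [c / total for c in counts]
    h0 = math.log2(len(probs))                   # richness
    h1 = -sum(p * math.log2(p) for p in probs)   # Shannon entropy
    h2 = -math.log2(sum(p * p for p in probs))   # order-2 Renyi (Simpson)
    return h0, h1, h2

clean = [100] * 4
noisy = [100] * 4 + [1] * 40   # same signal plus 40 singleton error kmers

for name, data in (("clean", clean), ("noisy", noisy)):
    h0, h1, h2 = renyi_summaries(data)
    print(f"{name}: H0={h0:.2f} H1={h1:.2f} H2={h2:.2f}")
```

Adding the singletons moves H0 the most, H1 less, and H2 least, which is the sense in which the higher-order entropy is more stable against errors.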
thumbnailpolish
http://www.mcs.anl.gov/~trimble/flowcell/
Sometimes the sequencer has a bad day.
Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens
Formerly of Yale: Howard Ochman, David Williams
Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas
Observation: Most scientists seem to be self-taught in computing.
Observation: Most scientists waste a lot of time using computers inefficiently.
Adina and I volunteer with
We teach scientists how to get more done
Woods Hole
Tufts
U. Chicago
U. Chicago
UIC