PyCon 2011 talk - ngram assembly with Bloom filters
TRANSCRIPT
Handling ridiculous amounts of data with probabilistic data structures
C. Titus Brown, Michigan State University
Computer Science / Microbiology
Resources
http://www.slideshare.net/c.titus.brown/
Webinar: http://oreillynet.com/pub/e/1784
Source: github.com/ctb. N-grams (this talk): khmer-ngram; DNA (the real cheese): khmer
khmer is implemented in C++ with a Python wrapper, which has been awesome for scripting, testing, and general development. (But man, does C++ suck…)
Lincoln Stein
Sequencing capacity is outpacing Moore’s Law.
Hat tip to Narayan Desai / ANL
We don’t have enough resources or people to analyze data.
Data generation vs data analysis
It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.
(Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)
…x1000 sequencers
Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge?
Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume.
Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
Awesomeness
Easy stuff like Google Search
Life’s too short to tackle the easy problems – come to academia!
A brief intro to shotgun assembly
Overlapping fragments:

    It was the best of times, it was the wor
    , it was the worst of times, it was the
    isdom, it was the age of foolishness
    mes, it was the age of wisdom, it was th

…assemble into:

    It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
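A toy sketch of the overlap intuition, not the talk's code: greedily merge each fragment onto the growing text at its longest suffix/prefix overlap. The fragment strings below are made up for illustration.

```python
def merge(a, b):
    # find the longest suffix of a that is a prefix of b, then join
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return a + b[n:]
    return a + b  # no overlap: just concatenate

fragments = [
    "It was the best of times, it was the wor",
    "the worst of times, it was the age of wisdom",
    "age of wisdom, it was the age of foolishness",
]

text = fragments[0]
for frag in fragments[1:]:
    text = merge(text, frag)

print(text)
# It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
```

Real assemblers can't rely on neat pairwise overlaps like this, which is why the talk moves to k-mers and graphs.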
Define a hash function (word => num)
```python
def hash(word):
    assert len(word) <= MAX_K

    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n

    return value
```
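In other words, the hash reads the word as a little-endian base-128 number: each character contributes ord(ch) * 128**n. A quick check of that, with the function repeated (and an assumed MAX_K) so the snippet runs standalone:

```python
MAX_K = 32  # assumed bound; any value >= the longest word works here

def hash(word):
    # interpret the word as a little-endian base-128 number
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

print(hash('a'))    # 97
print(hash('ab'))   # 97 + 98*128 = 12641
```

Because every ord(ch) is below 128 for ASCII, distinct words map to distinct values; collisions only appear later, when the value is reduced modulo a table size.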
```python
class BloomFilter(object):
    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        # insert; ignore collisions
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all(ht[val % size] for (size, ht) in self.tables)
```
Storing words in a Bloom filter

```
>>> x = BloomFilter([1001, 1003, 1005])
>>> 'oogaboog' in x
False
>>> x.add('oogaboog')
>>> 'oogaboog' in x
True

>>> x = BloomFilter([2])
>>> x.add('a')
>>> 'a' in x        # no false negatives
True
>>> 'b' in x
False
>>> 'c' in x        # …but false positives
True
```
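Not on the slides, but a standard back-of-the-envelope estimate applies to this multiple-table scheme (one hash per table): after inserting n items, each table of size m has roughly a 1 - e^(-n/m) chance of having any given bit set, and a never-inserted word is a false positive only if every table says yes.

```python
from math import exp

def false_positive_rate(tablesizes, n_items):
    # probability that all tables report a set bit for a never-inserted word
    p = 1.0
    for size in tablesizes:
        p *= 1.0 - exp(-n_items / float(size))
    return p

print(false_positive_rate([1001, 1003, 1005], 100))   # roughly 1e-3
```

This is why the tiny BloomFilter([2]) above misfires on 'c' while the ~1000-slot tables behave well at these loads: adding tables or enlarging them drives the product down fast.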
Storing text in a Bloom filter
```python
class BloomFilter(object):
    ...
    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i+self.k])
```
![Page 17: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/17.jpg)
```python
def next_words(bf, word):
    # try all 1-ch extensions
    prefix = word[1:]
    for ch in bf.allchars:
        word = prefix + ch
        if word in bf:
            yield ch

# descend into all successive 1-ch extensions
def retrieve_all_sentences(bf, start):
    word = start[-bf.k:]

    n = -1
    for n, ch in enumerate(next_words(bf, word)):
        ss = retrieve_all_sentences(bf, start + ch)
        for sentence in ss:
            yield sentence

    if n < 0:
        yield start
```
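The demos on the next slides call retrieve_first_sentence, which never appears on the slides. One plausible reading (an assumption: follow only the first surviving one-character extension at each step) is sketched below, with the earlier pieces repeated, plus an assumed allchars alphabet, so it runs standalone.

```python
import string

MAX_K = 32
DEFAULT_K = 8

def hash(word):
    assert len(word) <= MAX_K
    return sum(ord(ch) * 128**n for n, ch in enumerate(word))

class BloomFilter(object):
    # assumed alphabet for extensions; the slides never show allchars
    allchars = string.ascii_lowercase + string.punctuation + ' '

    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all(ht[val % size] for size, ht in self.tables)

    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])

def next_words(bf, word):
    prefix = word[1:]
    for ch in bf.allchars:
        if prefix + ch in bf:
            yield ch

def retrieve_first_sentence(bf, start):
    # follow only the first surviving 1-character extension at each step
    sentence = start
    while True:
        extensions = list(next_words(bf, sentence[-bf.k:]))
        if not extensions:
            return sentence
        sentence += extensions[0]
```

With tables of size ~1000 and short inserted texts, this reproduces the slide demos in the common case; with small tables, false positives can send it down a wrong branch, which is exactly the retrieval-error failure shown later.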
Storing and retrieving text
```
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('foo bar baz bif zap!')
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> print retrieve_first_sentence(x, 'foo bar ')
foo bar baz bif zap!
>>> print retrieve_first_sentence(x, 'the quic')
the quick brown fox jumped over the lazy dog
```
![Page 20: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/20.jpg)
Sequence assembly
```
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('the quick brown fox jumped ')
>>> x.insert_text('jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic')
the quick brown fox jumped over the lazy dog
```
(This is known as the de Bruijn graph approach to assembly; c.f. Velvet, ABySS, SOAPdenovo)
![Page 21: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/21.jpg)
Repetitive strings are the devil
```
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('na na na, batman!')
>>> x.insert_text('my chemical romance: na na na')
>>> retrieve_first_sentence(x, 'my chemical')
'my chemical romance: na na na, batman!'
```
![Page 22: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/22.jpg)
Note, it’s a probabilistic data structure
Retrieval errors:

```
>>> x = BloomFilter([1001, 1003])   # small Bloom filter…
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic')
'the quick brY'
```
![Page 23: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/23.jpg)
Assembling DNA sequence
• Can’t directly assemble with the Bloom filter approach (false connections, and it also lacks many convenient graph properties)
• But we can use the data structure to grok graph properties and eliminate/break up data:
  – Eliminate small graphs (no false negatives!)
  – Disconnected partitions (parts -> map reduce)
  – Local graph complexity reduction & error/artifact trimming

…and then feed into other programs.

This is a data-reducing prefilter.
![Page 24: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/24.jpg)
Right, but does it work??
• Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500).
…compare with not at all on a 512 GB RAM machine.
• Error/repeat trimming on a tricky worm genome: reduction from 170 GB resident / 60 hrs to 54 GB resident / 13 hrs
![Page 25: PyCon 2011 talk - ngram assembly with Bloom filters](https://reader036.vdocuments.mx/reader036/viewer/2022062319/554a249ab4c90542548b48e5/html5/thumbnails/25.jpg)
How good is this graph representation?
• V. low false positive rates at ~2 bytes/k-mer;
  – Nearly exact human genome graph in ~5 GB.
  – Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome)
• Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter)
• Did I mention it’s constant memory? And independent of word size?
• …only works for de Bruijn graphs
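A back-of-the-envelope check of those numbers (assumption: a flat ~2 bytes of Bloom-filter table per k-mer stored):

```python
bytes_per_kmer = 2        # the slide's ~2 bytes/k-mer figure

human_kmers = 3e9         # roughly one k-mer per base of the human genome
soil_kmers = 50e9         # the soil-metagenome estimate above

human_gb = human_kmers * bytes_per_kmer / 1e9
soil_gb = soil_kmers * bytes_per_kmer / 1e9

print(human_gb)   # 6.0, in the ballpark of the ~5 GB human-genome figure
print(soil_gb)    # 100.0, why constant per-k-mer memory matters at scale
```

The point of "constant memory" is visible here: the table size is chosen up front, so memory grows with the number of distinct k-mers you must represent, not with read depth or word size.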
Thoughts for the future
• Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics
• Synopsis data structures & algorithms (which incl. probabilistic data structures) are a neat approach to parsing problem structure.
• Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.
Groxel view of knot-like region / Arend Hintze
Acknowledgements

The k-mer gang:
• Adina Howe
• Jason Pell
• Rosangela Canino-Koning
• Qingpeng Zhang
• Arend Hintze
Collaborators:
• Jim Tiedje (Il padrino)
• Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI)
• Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.