![Page 1: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/1.jpg)
Data Science at Scale: Scaling Up by Scaling Down and Out (to Disk)
Prashant Pandey
Berkeley Lab/UC Berkeley
![Page 2: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/2.jpg)
Sequence Read Archive (SRA) database growth
[Figure: log-scale plot of SRA growth; series: Total bases, Open access]
SRA contains a lot of diversity information.
Goal: perform sequence searches on the database.
![Page 3: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/3.jpg)
Scalability is the bottleneck for data science
[Figure: log-scale plot of SRA growth; series: Total bases, Open access; annotation: current index size (few TBs)]
Data science applications look at only a small portion of the data.
![Page 4: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/4.jpg)
Scalable data systems → Scalable data science
[Figure: log-scale plot of SRA growth; series: Total bases, Open access; annotation: current index size (few TBs)]
Goal: index data at exabyte scale.
My goal as a researcher is to build scalable data systems to accelerate and scale data science applications.
![Page 5: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/5.jpg)
Three approaches to handle massive data
![Page 6: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/6.jpg)
Three approaches to handle massive data
Shrink it
Goal: make data smaller to fit in RAM
Techniques:
● Compact & succinct data structures
● Filters, e.g., Bloom, quotient, etc.
![Page 7: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/7.jpg)
Three approaches to handle massive data
Organize it
Goal: organize data in a disk-friendly way
Techniques:
● B-tree
● Bε-tree
● LSM-tree
![Page 8: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/8.jpg)
Three approaches to handle massive data
Distribute it
Goal: partition and distribute data on multiple nodes
Techniques:
● Distributed hash table
● Distributed key-value store
![Page 9: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/9.jpg)
Research output
Data structures & Algorithms
● (Counting) Quotient Filter: SIGMOD ‘17, arXiv ‘17
● Buffered Count-Min Sketch: ESA ‘18
● Order Min Hash: ISMB ‘19
![Page 10: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/10.jpg)
Research output
File systems
● BεtrFS file system: FAST ‘15, TOS ‘15, FAST ‘16, TOS ‘16, SPAA ‘19
● Bε-tree
![Page 11: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/11.jpg)
Research output
Computational biology
● Squeakr, deBGR, Mantis, Rainbowfish, MST-Mantis: ISMB ‘17, WABI ‘17, BIOINFORMATICS ‘17, RECOMB ‘18, Cell Systems ‘18, RECOMB ‘19, JCB ‘20
● LSM-Mantis, VariantStore: bioRxiv ‘20, bioRxiv ‘21
● Distributed k-mer counting: IPDPS ‘21
![Page 12: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/12.jpg)
Research output
Stream processing
● LERTs: arXiv ‘19, SIGMOD ‘20
![Page 13: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/13.jpg)
In this talk
Shrink it
![Page 14: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/14.jpg)
In this talk
Shrink it
Organize it
![Page 15: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/15.jpg)
Dictionary data structure
A dictionary maintains a set S from universe U.
A dictionary supports membership queries on S.
[Figure: elements a, b, c, d of the universe; S = {a, c}]
membership(a): true
membership(b): false
membership(c): true
membership(d): false
![Page 16: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/16.jpg)
Filter data structure
A filter is an approximate dictionary.
A filter supports approximate membership queries on S.
[Figure: elements a, b, c, d of the universe; S = {a, c}]
membership(a): true
membership(b): false
membership(c): true
membership(d): true (false positive)
![Page 17: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/17.jpg)
A filter guarantees a false-positive rate ε
if q ∈ S, return true with probability 1 (true positive)
if q ∉ S, return false with probability ≥ 1 − ε (true negative)
if q ∉ S, return true with probability ≤ ε (false positive)
One-sided errors
![Page 18: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/18.jpg)
False-positive rate enables filters to be compact
[Figure: dictionary vs. filter]
![Page 19: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/19.jpg)
False-positive rate enables filters to be compact
[Figure: dictionary (large) vs. filter (small)]
For most practical purposes: ε = 2%; a Bloom filter requires ≈ 8 bits/item.
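The ≈ 8 bits/item figure can be checked against the standard Bloom filter sizing formula. The worked equation below is a sketch that assumes an optimally tuned Bloom filter with m bits for n items; the symbols m and n are introduced here for illustration and do not appear on the slide.

```latex
\frac{m}{n} = \frac{\log_2(1/\varepsilon)}{\ln 2} \approx 1.44\,\log_2(1/\varepsilon),
\qquad
k = \frac{m}{n}\,\ln 2 = \log_2(1/\varepsilon)

\varepsilon = 0.02:\quad
\frac{m}{n} \approx 1.44 \times \log_2(50) \approx 1.44 \times 5.64 \approx 8.1 \text{ bits/item},
\qquad
k \approx 5.64 \ (\text{rounded to } 6 \text{ hash functions})
```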
![Page 20: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/20.jpg)
Classic filter: The Bloom filter [Bloom ‘70]
Bloom filter: a bit array + k hash functions
[Figure: S = {a, c}; b and d are not in S]
![Page 21: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/21.jpg)
Classic filter: The Bloom filter [Bloom ‘70]
Bloom filter: a bit array + k hash functions (here k = 2)
[Figure: S = {a, c}; b and d are not in S]
Insert a: h1(a) = 1, h2(a) = 3
Insert c: h1(c) = 5, h2(c) = 3
![Page 22: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/22.jpg)
Classic filter: The Bloom filter [Bloom ‘70]
Bloom filter: a bit array + k hash functions (here k = 2)
Query b: h1(b) = 2, h2(b) = 5 → true negative
![Page 23: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/23.jpg)
Classic filter: The Bloom filter [Bloom ‘70]
Bloom filter: a bit array + k hash functions (here k = 2)
Query d: h1(d) = 1, h2(d) = 3 → false positive
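The walk-through on the last three slides can be replayed in a few lines of Python. This is a minimal sketch, not the implementation from any of the cited systems: the class name `BloomFilter`, the helper `make_hash`, and the 8-bit array size are assumptions made for illustration, while the toy hash values are the ones shown on the slides.

```python
import hashlib


def make_hash(seed):
    """Return a hash function salted with `seed` (hypothetical helper, not from the slides)."""
    salt = seed.to_bytes(8, "little")

    def h(item):
        data = str(item).encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little")

    return h


class BloomFilter:
    """A bit array plus k hash functions."""

    def __init__(self, num_bits, hash_functions):
        self.bits = [False] * num_bits
        self.hash_functions = hash_functions

    def _positions(self, item):
        return [h(item) % len(self.bits) for h in self.hash_functions]

    def insert(self, item):
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = True

    def may_contain(self, item):
        # True if all k bits are set: never wrong for inserted items,
        # occasionally wrong (false positive) for items never inserted.
        return all(self.bits[pos] for pos in self._positions(item))


# Toy hash values copied from the slides (k = 2); the array size 8 is assumed.
H1 = {"a": 1, "b": 2, "c": 5, "d": 1}
H2 = {"a": 3, "b": 5, "c": 3, "d": 3}

toy = BloomFilter(num_bits=8, hash_functions=[H1.__getitem__, H2.__getitem__])
toy.insert("a")
toy.insert("c")
```

Querying `toy.may_contain("b")` finds bit 2 unset and answers no, while `toy.may_contain("d")` finds bits 1 and 3 already set by `a` and `c` and answers yes, which is the false positive from the slide. For real keys, `BloomFilter(1024, [make_hash(1), make_hash(2)])` swaps the lookup tables for salted hashes.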
![Page 24: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/24.jpg)
Bloom filters are ubiquitous (> 4300 citations)
● Storage systems
● Networking
● Streaming applications
● Computational biology
● Databases
![Page 25: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/25.jpg)
Bloom filters have suboptimal asymptotics
[Table comparing the Bloom filter with the optimal on: space, CPU cost, data locality]
![Page 26: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/26.jpg)
Applications often work around Bloom filter limitations

| Limitations | Workarounds |
|---|---|
| No deletes | Rebuild |
| No resizes | Guess N, and rebuild if wrong |
| No filter merging or enumeration | ??? |
| No values associated with keys | Combine with another data structure |

Bloom filter limitations increase system complexity, waste space, and slow down application performance.
![Page 27: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/27.jpg)
Quotienting is an alternative to Bloom filters [Knuth. Sorting and Searching Vol. 3, ‘97]
● Store fingerprints compactly in a hash table.
  ○ Take a fingerprint h(x) for each element x.
● Only source of false positives:
  ○ Two distinct elements x and y, where h(x) = h(y)
  ○ If x is stored and y isn’t, query(y) gives a false positive
[Figure: a p-bit fingerprint]
![Page 28: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/28.jpg)
Storing fingerprints compactly
b(x) = location in the hash table
t(x) = tag stored in the hash table
[Figure: the p-bit fingerprint splits into a q-bit bucket index b(x) and an r-bit tag t(x)]
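The split of a fingerprint into a bucket index and a tag is plain bit arithmetic. The sketch below assumes the common convention that the high q bits form b(x) and the low r bits form t(x); the function names and the choice of BLAKE2b as the hash are illustrative, not taken from the slides.

```python
import hashlib


def fingerprint(item, p):
    """Hash `item` to a p-bit fingerprint (p <= 64). BLAKE2b is an arbitrary choice."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") & ((1 << p) - 1)


def split(fp, r):
    """Split a fingerprint into (bucket index b, tag t): the high bits and the low r bits."""
    return fp >> r, fp & ((1 << r) - 1)


def join(bucket, tag, r):
    """Recover the full fingerprint from the slot position and the stored tag."""
    return (bucket << r) | tag


# Example with p = 8 bits, q = 4 bucket bits, r = 4 tag bits.
R_BITS = 4
bucket, tag = split(0b10110110, R_BITS)
```

Because the bucket index is implied by where the tag sits in the table, only the r tag bits need to be stored, and `join` shows that nothing is lost.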
![Page 29: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/29.jpg)
Storing fingerprints compactly
Collisions in the hash table?
[Figure: x and y map to the same bucket; tags t(x) and t(y) compete for one slot]
![Page 30: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/30.jpg)
Storing fingerprints compactly
Collisions in the hash table?
● Linear probing
● Robin Hood hashing
[Figure: t(y) collides with t(x) and is shifted to the next slot]
![Page 31: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/31.jpg)
Storing fingerprints compactly
Collisions in the hash table?
● Linear probing
● Robin Hood hashing
After shifting: does t(y) belong to slot 4 or slot 5?
![Page 32: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/32.jpg)
Resolving collisions in the QF [Bender ‘12, Pandey ‘17]
● The QF uses two metadata bits to resolve collisions and identify the home bucket
● The metadata bits group tags by their home bucket
[Figure: slots holding t(u), t(v), t(w), t(x), t(y), with metadata bits above the slots]
![Page 33: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/33.jpg)
Resolving collisions in the QF [Bender ‘12, Pandey ‘17]
insert v
[Figure: after the insert, slots hold t(u), t(v), t(v), t(w), t(x), t(y), with metadata bits above the slots]
![Page 34: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/34.jpg)
Resolving collisions in the QF [Bender ‘12, Pandey ‘17]
The metadata bits enable us to identify the slots holding the contents of each bucket.
![Page 35: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/35.jpg)
Quotienting enables many features in the QF
● Good cache locality
● Efficient scaling out-of-RAM
● Deletions
● Enumerability/Mergeability
● Resizing
● Maintains count estimates
● Uses variable-sized encoding for counts [Counting quotient filter]
  ○ Asymptotically optimal space: O(∑ |C(x)|)
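Several of these features follow from one fact: a stored tag plus its bucket index reconstructs the whole fingerprint. The sketch below illustrates deletion, counting, enumeration, merging, and resizing under that idea. It is deliberately not the quotient filter's layout: it keeps tags in a Python dict of `Counter` objects instead of a flat slot array with metadata bits, and the class name `QuotientedCounts` and its methods are hypothetical.

```python
from collections import Counter


class QuotientedCounts:
    """Tags grouped by home bucket, with a count per tag (illustrative only)."""

    def __init__(self, q, r):
        self.q, self.r = q, r   # q bucket-index bits, r tag bits
        self.buckets = {}       # bucket index -> Counter mapping tag -> count

    def _split(self, fp):
        fp &= (1 << (self.q + self.r)) - 1
        return fp >> self.r, fp & ((1 << self.r) - 1)

    def insert(self, fp, count=1):
        b, t = self._split(fp)
        self.buckets.setdefault(b, Counter())[t] += count

    def count(self, fp):
        b, t = self._split(fp)
        return self.buckets.get(b, Counter())[t]

    def delete(self, fp):
        """Remove one occurrence; possible because each fingerprint has its own entry."""
        b, t = self._split(fp)
        tags = self.buckets.get(b)
        if tags is None or tags[t] == 0:
            return False
        tags[t] -= 1
        if tags[t] == 0:
            del tags[t]
        if not tags:
            del self.buckets[b]
        return True

    def fingerprints(self):
        """Enumerate (fingerprint, count) pairs in sorted fingerprint order."""
        for b in sorted(self.buckets):
            for t in sorted(self.buckets[b]):
                yield (b << self.r) | t, self.buckets[b][t]

    def merge(self, other):
        """Merge another structure by re-inserting its enumerated fingerprints."""
        for fp, c in other.fingerprints():
            self.insert(fp, c)

    def resized(self, new_q):
        """Return a copy with new_q bucket bits; the total p = q + r stays fixed."""
        out = QuotientedCounts(new_q, self.q + self.r - new_q)
        out.merge(self)
        return out
```

A Bloom filter cannot support any of these operations directly, because a set bit does not say which item set it; here every entry is a recoverable fingerprint.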
![Page 36: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/36.jpg)
The quotient filter has theoretical advantages over the Bloom filter
[Table comparing the quotient filter, the Bloom filter, and the optimal on: space, CPU cost, data locality]
Quotient filters use less space than Bloom filters for all practical configurations.
![Page 37: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/37.jpg)
Quotient filters use less space than Bloom filters for all practical configurations
Bloom filter: ~1.44 log(1/ε) bits/element.
Quotient filter: ~2.125 + log(1/ε) bits/element.
[Figure: space vs. accuracy; highlighted region: false-positive rate < 1/64 (≈ 0.0156)]
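The two per-element formulas are easy to compare numerically. The sketch below takes them at face value and assumes the logarithms are base 2; the function names are illustrative, and effects the formulas leave out (such as the load factor of the table) are ignored.

```python
import math


def bloom_bits_per_element(eps):
    """~1.44 log2(1/eps) bits per element, as stated on the slide."""
    return 1.44 * math.log2(1 / eps)


def quotient_bits_per_element(eps):
    """~2.125 + log2(1/eps) bits per element, as stated on the slide."""
    return 2.125 + math.log2(1 / eps)


# Compare the two at a coarse, a moderate, and a fine false-positive rate.
for eps in (1 / 16, 1 / 64, 1 / 1024):
    bloom = bloom_bits_per_element(eps)
    qf = quotient_bits_per_element(eps)
    print(f"eps = 1/{round(1 / eps)}: Bloom {bloom:.2f} bits, QF {qf:.3f} bits")
```

The fixed 2.125-bit overhead is amortized as ε shrinks, while the Bloom filter's 1.44 multiplier keeps growing, so the quotient filter is the smaller of the two at 1/64 and below.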
![Page 38: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/38.jpg)
Quotient filters perform better than (or similar to) other non-counting filters
● Insert performance is similar to the state-of-the-art non-counting filters
● Query performance is significantly faster at low load factors and slightly slower at higher load factors
[Figure: two panels, Inserts and Lookups]
![Page 39: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/39.jpg)
Quotient filter’s impact in computer science
Computational biology
1. Squeakr
2. deBGR
3. Mantis
4. SPAdes assembler
5. Khmer software
6. MQF
7. VariantStore
Databases/Systems
1. Counting on GPUs
2. Concurrent filters
3. Anomaly detection
4. BetrFS file system
Industry
1. VMware
2. Nutanix
3. Apocrypha
4. Hyrise
5. A data security startup
[Figure: QF at the center, linked to: databases, sequence search index, stream analysis, key-value stores, QF on GPUs, deduplication, graph representation, assembler (SPAdes)]
Theoretically well-founded data structures can have a big impact on multiple subfields across academia and industry.
![Page 40: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/40.jpg)
Learned “Shrink it”. Now “Organize it”
![Page 41: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/41.jpg)
Open problem in stream processing
● A high-speed stream of key-value pairs arriving over time
● Goal: report every key as soon as it appears T times without missing any
● Firehose benchmark (Sandia National Lab) simulates the stream: https://firehose.sandia.gov/
![Page 42: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/42.jpg)
Why should we care about this problem?
● Defense systems for cyber security monitor high-speed streams for malicious traffic → catch all malicious events
● Malicious traffic forms a small portion of the stream → small reporting threshold
● Automated systems take defensive actions for every reported event → minimize false positives
![Page 43: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/43.jpg)
Timely event detection problem
● A stream of elements arrives over time
[Figure: stream s_1, s_2, …, s_t along a time axis]
![Page 44: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/44.jpg)
Timely event detection problem
● An event occurs at time t if s_t occurs exactly T times in (s_1, s_2, …, s_t)
[Figure: stream s_1, s_2, …, s_t along a time axis, with time t marked]
![Page 45: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/45.jpg)
Timely event detection problem
Suppose T = 4
[Figure: stream s_1, s_2, …, s_t along a time axis; the event is marked at time t]
![Page 46: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/46.jpg)
Timely event detection problem
● In the timely event-detection problem (TED), we want to report all events shortly after they occur.
[Figure: with T = 4, the event at time t is followed shortly by its report]
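The event definition can be made concrete with an exact, in-memory detector. This is only a baseline sketch to pin down the semantics: it keeps one counter per distinct element, which is the memory cost the rest of the talk works to avoid, and the function name `detect_events` is illustrative.

```python
from collections import defaultdict


def detect_events(stream, threshold):
    """Yield (t, element) at the moment an element's count reaches exactly `threshold`.

    Positions t are 1-based, matching s_1, s_2, ..., s_t on the slides.
    """
    counts = defaultdict(int)
    for t, element in enumerate(stream, start=1):
        counts[element] += 1
        if counts[element] == threshold:
            yield t, element


# With T = 4, "a" occurs at positions 1, 3, 5, 7, so its event happens at t = 7.
events = list(detect_events("abacabaa", threshold=4))
```

Reporting happens in the same step as the event, so this detector has zero delay and no errors; the difficulty in the streaming setting is getting the same guarantees when the counters do not fit in RAM.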
![Page 47: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/47.jpg)
Features we need in the solution
● Stream is large (e.g., terabytes) and high-speed (millions/sec) → high-throughput ingestion
![Page 48: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/48.jpg)
Features we need in the solution
● Events are high-consequence real-life events → no false negatives, few false positives; timely (real-time) reporting
[Slide annotation: Sampling]
![Page 49: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/49.jpg)
Features we need in the solution
● Malicious traffic forms a small portion of the stream → very small reporting thresholds
![Page 50: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/50.jpg)
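One-pass streaming has errors
● Heavy hitter problem: report every element that occurs at least φN times in a stream of N elements
● An exact one-pass solution requires Ω(N) space
[Figure: stream processed in RAM]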
![Page 51: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/51.jpg)
One-pass streaming has errors
● Approximate solution: report every element with count ≥ φN and no element with count < (φ−ε)N [Alon et al. 96, Berinde et al. 10, Bhattacharyya et al. 16, Bose et al. 03, Braverman et al. 16, Charikar et al. 02, Cormode et al. 05, Demaine et al. 02, Dimitropoulos et al. 08, Larsen et al. 16, Manku et al. 02]
● Space: Ω(1/ε)
Maintain count estimates in RAM [Misra & Gries ‘82]
Real time with false-positives!
![Page 52: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/52.jpg)
One-pass streaming has errors
For Sandia, φN is a small constant (e.g., 24), so Ω(1/ε) is very, very large!
Can’t solve in RAM for very small φ
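The count estimates referred to here can be illustrated with the Misra-Gries summary. The sketch below is a textbook version rather than code from the talk; the parameter `k` is an assumption of this example, chosen so that the summary holds at most k − 1 counters and each estimate undercounts by at most N/k.

```python
def misra_gries(stream, k):
    """One-pass frequency summary with at most k - 1 counters.

    Each returned estimate satisfies: true_count - N/k <= estimate <= true_count,
    where N is the stream length. Elements missing from the result have estimate 0.
    """
    counters = {}
    for element in stream:
        if element in counters:
            counters[element] += 1
        elif len(counters) < k - 1:
            counters[element] = 1
        else:
            # No free counter: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# "a" occurs 6 times in a stream of 10; with k = 3 the estimate may undercount by up to 10/3.
summary = misra_gries("aabacadaae", k=3)
```

To push the error down to εN the summary needs about 1/ε counters. When the reporting threshold φN is a small constant such as 24, ε must be below 24/N, so the number of counters grows with the length of the stream and no longer fits in RAM.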
![Page 53: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/53.jpg)
One-pass solution has:
[Checklist of required features: high-throughput ingestion; no false-negatives and few false-positives; timely reporting (real-time); very small reporting thresholds]
One-pass streaming reports in real time but with false positives, and it cannot be solved in RAM for very small φ.
![Page 54: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/54.jpg)
Two-pass streaming isn’t real-time
● A second pass over the stream can get rid of errors
● Store the stream on SSD and access it later
[Figure: first pass in RAM; second pass over the stream stored on SSD]
Scales to very small φ but offline!
![Page 55: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/55.jpg)
Two-pass solution has:
[Checklist of required features: high-throughput ingestion; no false-negatives and few false-positives; timely reporting (real-time); very small reporting thresholds]
Two-pass streaming gets rid of the errors and scales to very small φ, but it is offline rather than real-time.
![Page 56: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/56.jpg)
If data is stored: why not access it?
[Figure: RAM and SSD]
Why wait for the second pass?
![Page 57: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/57.jpg)
Idea: combine streaming and external memory (EM)
Streaming model + external-memory algorithms
Use an efficient external-memory counting data structure to scale the Misra-Gries algorithm to SSDs
![Page 58: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/58.jpg)
External memory model [Aggarwal+Vitter ‘88]
● How computations work:
  ○ Data is transferred in blocks between RAM and disk.
  ○ The number of block transfers dominates the running time.
● Goal: minimize the number of block transfers
  ○ Performance bounds are parameterized by block size B, memory size M, and data size N.
[Figure: RAM of size M exchanging blocks of size B with disk]
![Page 59: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/59.jpg)
Cascade filter: write-optimized quotient filter [Bender et al. ‘12, Pandey et al. ‘17]
[Figure: levels 0 to L = log(N/M); a counting QF of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
● The Cascade filter efficiently scales out-of-RAM
● It accelerates insertions at some cost to queries
59
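A toy model can make the insert/query trade-off concrete. In the sketch below, Python `Counter` objects stand in for the counting quotient filters, and the class name, the `growth` factor, and the capacities are all hypothetical. Inserts touch only the RAM level and are flushed in bulk into the smallest level that can hold everything above it, while a query has to probe every level.

```python
from collections import Counter


class ToyCascadeFilter:
    """Toy model: Counters stand in for counting quotient filters."""

    def __init__(self, ram_capacity, growth=2, num_disk_levels=3):
        # Level 0 models RAM; level i holds up to ram_capacity * growth**i keys.
        self.capacities = [ram_capacity * growth**i
                           for i in range(num_disk_levels + 1)]
        self.levels = [Counter() for _ in self.capacities]

    def insert(self, key):
        self.levels[0][key] += 1
        if len(self.levels[0]) >= self.capacities[0]:
            self._flush()

    def _flush(self):
        # Merge levels 0..i into the smallest level i that can hold them all.
        merged = Counter()
        for i, level in enumerate(self.levels):
            merged.update(level)
            if i > 0 and len(merged) <= self.capacities[i]:
                for j in range(i):
                    self.levels[j] = Counter()
                self.levels[i] = merged
                return
        raise MemoryError("every level is full")

    def query(self, key):
        # A query must look in every level, which is what makes it costly.
        return sum(level[key] for level in self.levels)


cf = ToyCascadeFilter(ram_capacity=2)
for key in ["a", "b", "a", "c", "d", "a"]:
    cf.insert(key)
```

A real cascade filter merges sorted quotient filters sequentially on flash; the dictionaries here only mimic the level structure.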
![Page 60: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/60.jpg)
Cascade filter operations
● Insert
● Query
60
![Page 61: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/61.jpg)
Cascade filter operations
● Insert: < 1 I/O per observation
● Query
61
![Page 62: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/62.jpg)
Cascade filter operations
● Insert: < 1 I/O per observation
● Query: > 1 I/O per observation
62
![Page 63: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/63.jpg)
Cascade filter doesn’t have real-time reporting
● Insert: < 1 I/O per observation
● Query: > 1 I/O per observation
But every insert is also a query in real-time reporting!
63
![Page 64: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/64.jpg)
Cascade filter doesn’t have real-time reporting
● Insert: < 1 I/O per observation
● Query: > 1 I/O per observation
But every insert is also a query in real-time reporting!
Traditional cascade filter doesn’t solve the problem!
64
![Page 65: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/65.jpg)
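Idea: reporting with bounded delay
We define the time stretch of a report to be
$\text{Time stretch} = 1 + \alpha = 1 + \dfrac{\text{Delay}}{\text{Lifetime}}$
[Figure: timeline marking the 1st occurrence, the Tth occurrence, and the reporting time; Lifetime L spans the 1st to the Tth occurrence, Delay D spans the Tth occurrence to the reporting time]
65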
![Page 66: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/66.jpg)
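Idea: reporting with bounded delay
We define the time stretch of a report to be
$\text{Time stretch} = 1 + \alpha = 1 + \dfrac{\text{Delay}}{\text{Lifetime}}$
[Figure: timeline marking the 1st occurrence, the Tth occurrence, and the reporting time; Lifetime L spans the 1st to the Tth occurrence, Delay D spans the Tth occurrence to the reporting time]
Main idea: the longer the lifetime of an item, the more leeway we have in reporting it
66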
![Page 67: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/67.jpg)
Leveled External-Memory Reporting Table (LERT) [Pandey ‘20]
● Given a stream of size N and φ > the amortized cost of solving real-time event detection is
● For a constant 𝛼, can support arbitrarily small thresholds φ with amortized cost
Takeaway: Online reporting comes at the cost of throughput but almost online reporting is essentially free!
67
![Page 68: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/68.jpg)
Leveled External-Memory Reporting Table (LERT) [Pandey ‘20]
● Given a stream of size N and φ > the amortized cost of solving real-time event detection is
● For a constant 𝛼, can support arbitrarily small thresholds φ with amortized cost
Takeaway: Online reporting comes at the cost of throughput but almost online reporting is essentially free!
Can achieve timely reporting at effectively the optimal insert cost; no query cost
68
![Page 69: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/69.jpg)
Evaluation
● Empirical timeliness
● High-throughput ingestion
69
![Page 70: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/70.jpg)
Evaluation: empirical time stretch
Average time stretch is 43% smaller than the theoretical upper bound.
70
![Page 71: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/71.jpg)
Evaluation: scalability
The insertion throughput increases as we add more threads.
We can achieve > 13M insertions/sec.
71
![Page 72: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/72.jpg)
LERT: supports scalable and real-time reporting
● Stream is large (e.g., terabytes) and high-speed (millions/sec)
● Events are high-consequence real-life events
● Malicious traffic forms a small portion of the stream
High throughput ingestion
No false-negatives; few false-positives
Timely reporting (real-time)
Very small reporting thresholds
72
![Page 73: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/73.jpg)
Future work overview
73
Data structures & Algorithms
Scalable Data Systems
Data Science
![Page 74: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/74.jpg)
Future work: Data Structures & Algorithms
Goal: Overcome decades-old data structure trade-offs using modern hardware and new algorithmic paradigms
74
![Page 75: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/75.jpg)
Trade-off 1: Insertion throughput degrades with load factor
Insertion throughput vs load factor of state-of-the-art filters
Many update-intensive applications (e.g., network caches, data analytics, etc.) maintain filters at high load factors
75
![Page 76: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/76.jpg)
Trade-off 1: Insertion throughput degrades with load factor
[Figure: insertion throughput vs. load factor, annotated with a 16X drop and a 4X drop]
Performance suffers due to the high overhead of collision resolution
76
![Page 77: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/77.jpg)
Combining techniques + new hardware
Combining hashing techniques (Robin Hood + 2-choice hashing)
Using ultra-wide vector operations (AVX512-BW)
77
![Page 78: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/78.jpg)
Combining techniques + new hardware
Combining hashing techniques (Robin Hood + 2-choice hashing)
Using ultra-wide vector operations (AVX512-BW)
78
Constant high performance from empty to full
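To illustrate just the 2-choice ingredient (not the Robin Hood or AVX-512 parts, and not the speaker's actual design), here is a minimal sketch in which each key has two candidate buckets and goes into the emptier one. The helper names and the SHA-256-based bucket choice are assumptions made for the example.

```python
import hashlib


def _bucket(key, seed, num_buckets):
    """Hypothetical hash: derive a bucket index from a seeded SHA-256 digest."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def two_choice_insert(buckets, key):
    """Place key in the less loaded of its two candidate buckets."""
    a = _bucket(key, 0, len(buckets))
    b = _bucket(key, 1, len(buckets))
    target = a if len(buckets[a]) <= len(buckets[b]) else b
    buckets[target].append(key)
    return target


def two_choice_contains(buckets, key):
    """A lookup only needs to inspect the two candidate buckets."""
    a = _bucket(key, 0, len(buckets))
    b = _bucket(key, 1, len(buckets))
    return key in buckets[a] or key in buckets[b]


buckets = [[] for _ in range(8)]
for i in range(32):
    two_choice_insert(buckets, f"k{i}")
```

Choosing the emptier of two buckets keeps bucket loads close to even, which limits the collision-resolution work that causes throughput to drop at high load factors.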
![Page 79: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/79.jpg)
Future work: Data Systems
Goal: Build a population-scale index on variation data to enable downstream apps to gain quick insights into variants
79
![Page 80: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/80.jpg)
Country-scale sequencing efforts produce huge amounts of sequencing data
[Figure: pipeline from individuals through sequencing (raw sequencing data), assembly (assembled data), and variant calling to genomic variants in Variant Call Format (VCF)]
● 1000 Genomes project [https://www.internationalgenome.org/]
● The Cancer Genome Atlas (TCGA) [https://portal.gdc.cancer.gov/]
● Genotype-Tissue Expression (GTEx) [https://gtexportal.org/home/]
80
![Page 81: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/81.jpg)
Variation data analysis can improve downstream applications
● Population-level disease analysis
● Genome-wide association studies
● Personalized medicine
● Cancer remission-rate prediction
● Colocalization analysis
● PCR primer design
● Genome assembly
[Figure: individuals → sequencing & assembly → population genomes]
Example queries:
● Return all positions with variants in a gene
● List all people with sequence S in a gene
● Count the number of variants in a gene
● List all people with > N variants in a gene
● For person P, return the closest variant from position X
81
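To make a few of these queries concrete, the sketch below runs them against a tiny in-memory list of (person, position) variant calls. The data, the gene interval, and every function name are made up for illustration; a population-scale index would answer the same questions without scanning all calls.

```python
# Hypothetical toy data: (person, position) variant calls and one gene interval.
variants = [("P1", 120), ("P1", 340), ("P2", 125), ("P2", 900), ("P3", 130)]
gene = (100, 400)  # inclusive start and end coordinates


def positions_with_variants(variants, gene):
    """Return all positions with variants in a gene."""
    lo, hi = gene
    return sorted({pos for _, pos in variants if lo <= pos <= hi})


def count_variants(variants, gene):
    """Count the number of distinct variant positions in a gene."""
    return len(positions_with_variants(variants, gene))


def people_with_more_than(variants, gene, n):
    """List all people with more than n variants in a gene."""
    lo, hi = gene
    per_person = {}
    for person, pos in variants:
        if lo <= pos <= hi:
            per_person[person] = per_person.get(person, 0) + 1
    return sorted(p for p, c in per_person.items() if c > n)


def closest_variant(variants, person, x):
    """For one person, return the variant position closest to x."""
    own = [pos for p, pos in variants if p == person]
    return min(own, key=lambda pos: abs(pos - x)) if own else None
```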
![Page 82: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/82.jpg)
Indexing in multiple coordinates is challenging
Reference-only indexes map positions only in the reference coordinate system
Pan-genome analysis involves queries based on sample coordinate systems
[Figure: coordinate mappings vs. number of samples]
Maintaining thousands of mappings increases computational complexity and memory footprint
Limits scalability to population-scale data
82
![Page 83: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/83.jpg)
Indexing in multiple coordinates is challenging
Reference-only indexes map positions only in the reference coordinate system
Pan-genome analysis involves queries based on sample coordinate systems
[Figure: coordinate mappings vs. number of samples]
Maintaining thousands of mappings increases computational complexity and memory footprint
Limits scalability to population-scale data
Most existing systems don’t support multiple coordinate systems. The ones that do don’t scale beyond a few thousand samples.
83
![Page 84: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/84.jpg)
An inverted index on the pan-genome graph
● Partition the variation graph based on coordinate ranges
● Store partitions on disk
● Succinct index for reference coordinate system
● Local-graph exploration to map position from reference to sample coordinate
Position index
Variation graph
Queries often require loading 1-2 partitions
84
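The claim that a query touches only one or two partitions follows from partitioning by coordinate range. A minimal sketch, assuming partitions are described by a sorted list of start coordinates (the name `partitions_for_range` and the boundaries are hypothetical):

```python
import bisect


def partitions_for_range(starts, lo, hi):
    """Indices of partitions overlapping [lo, hi].

    starts is sorted; partition i covers [starts[i], starts[i + 1]).
    """
    first = bisect.bisect_right(starts, lo) - 1
    last = bisect.bisect_right(starts, hi) - 1
    return list(range(max(first, 0), last + 1))


starts = [0, 1000, 2000, 3000]
inside_one = partitions_for_range(starts, 2500, 2600)
across_boundary = partitions_for_range(starts, 950, 1050)
```

A query range that is short relative to the partition width either falls inside one partition or straddles a single boundary, so at most two partitions need to be loaded from disk.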
![Page 85: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/85.jpg)
Future work: Data Science for genomics
Goal: Classification of metagenomic reads and identification of novel species using graph neural networks (GNN)
85
![Page 86: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/86.jpg)
Metagenomic classification pipeline
86
Binning
Profiling
[Ye et al. 2019]
![Page 87: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/87.jpg)
Existing techniques offer low recall
87
Low recall
Classification is done based only on the read contents
![Page 88: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/88.jpg)
Use overlap relationship between reads
88
[Figure: pipeline with stages labeled Metagenomic reads, Minimap2, Overlap graph, Assign taxonomic labels to reads, Semi-supervised learning GNN, and Taxonomic binning of reads. Node features: tetranucleotide frequency, GC bias, and taxonomic embedding]
● Generate overlap graph: reads → nodes & overlaps → edges
● Node features → tetranucleotide frequency of reads
● Reference-based mapping as ground truth labels
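As a sketch of what the node features could look like, the code below computes a 256-dimensional tetranucleotide frequency vector and a GC fraction for one read. The function names are illustrative, plain GC content is used as a stand-in for the slide's "GC bias", and nothing here reflects the speaker's actual feature pipeline.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_freq(read):
    """Normalized counts of every 4-mer in the read (windows with N are ignored)."""
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def gc_content(read):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in read) / len(read) if read else 0.0


features = tetranucleotide_freq("AAAAA")
```

Each read's vector would then be attached to its node in the overlap graph before training the GNN.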
![Page 89: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/89.jpg)
Overlap graph + graph neural network (GNN)
89
Higher recall
Can achieve high recall using graph learning
![Page 90: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/90.jpg)
Conclusion
● Scalability of data management systems will be the biggest challenge in the future
● Changing hardware gives rise to new algorithmic paradigms
We need to redesign existing data structures to take full advantage of modern hardware and rebuild data systems to efficiently support future data science.
[Figure: layered stack]
Data Science at Scale: ML, Genomics, Cyber Sec., NLP
Data Systems
Scale down, Scale to disk, Scale out
Modern hardware: Vector inst., GPU, NVM, SSD
Data structures & Algorithms
https://prashantpandey.github.io
90
![Page 91: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/91.jpg)
91
![Page 92: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/92.jpg)
92
![Page 93: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/93.jpg)
Backup slides
93
![Page 94: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/94.jpg)
Quotient filter design
h(x) --> h0(x) || h1(x)
[Figure: abstract representation with 2^q buckets; implementation with 2^q slots and 2 metadata bits per slot (occupieds, runends)]
![Page 95: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/95.jpg)
![Page 96: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/96.jpg)
![Page 97: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/97.jpg)
![Page 98: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/98.jpg)
![Page 99: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/99.jpg)
Quotient filter design
h(x) --> h0(x) || h1(x)
[Figure: abstract representation with 2^q buckets; implementation with 2^q slots and 2 metadata bits per slot (occupieds, runends)]
![Page 100: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/100.jpg)
Quotient filters can also be exact
● Quotient filters store h(x) exactly
● To store x exactly, use an invertible hash function
● For n elements and a p-bit hash function:
Space usage: ~ $p - \log_2 n$ bits/element
100
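The space bound comes from quotienting: the high q bits of the hash choose a slot and only the remaining p − q bits are stored. A minimal sketch, assuming h0 is the quotient and h1 the remainder, and that the table has 2^q slots with q ≈ log₂ n (function names are illustrative):

```python
def split_hash(h, p, q):
    """Split a p-bit hash into a q-bit quotient and a (p - q)-bit remainder."""
    r = p - q
    quotient = h >> r                # selects one of 2**q slots
    remainder = h & ((1 << r) - 1)   # the only part stored in the slot
    return quotient, remainder


def join_hash(quotient, remainder, p, q):
    """Recover the original p-bit hash from slot index and stored remainder."""
    return (quotient << (p - q)) | remainder


quotient, remainder = split_hash(0b10110110, p=8, q=3)
```

Because the slot index and the stored remainder together recover h(x) without loss, the filter stores h(x) exactly while spending only about p − q bits per element, plus the metadata bits.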
![Page 101: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/101.jpg)
Cascade filter: write-optimized quotient filter [Bender et al. ‘12, Pandey et al. ‘17]
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total; efficient merging between levels]
101
● The Cascade filter efficiently scales out-of-RAM
● It accelerates insertions at some cost to queries
![Page 102: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/102.jpg)
Cascade filter: flushing [Bender et al. ‘12, Pandey et al. ‘17]
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total; efficient merging between levels]
Items are initially inserted in the RAM level
102
![Page 103: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/103.jpg)
Cascade filter: flushing [Bender et al. ‘12, Pandey et al. ‘17]
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total; efficient merging between levels]
103
When RAM is full, items are flushed to the smallest on-disk level i that has space for the items in levels 0 to i-1
![Page 104: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/104.jpg)
![Page 105: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/105.jpg)
![Page 106: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/106.jpg)
![Page 107: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/107.jpg)
![Page 108: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/108.jpg)
Cascade filter: query [Bender et al. ‘12, Pandey et al. ‘17]
[Figure: Query(x) probing levels 0 to L = log(N/M); a level of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
108
A query operation requires a lookup in each non-empty level
![Page 109: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/109.jpg)
Time-stretch LERT
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
Divide each level into 1 + 1/𝛼 equal-sized bins.
109
![Page 110: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/110.jpg)
Time-stretch LERT
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
When a bin is full, items move to the adjacent bin
110
![Page 111: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/111.jpg)
![Page 112: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/112.jpg)
Time-stretch LERT
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
Last bin flushed to first bin of the next level
112
![Page 113: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/113.jpg)
Time-stretch LERT
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
Last bin flushed to first bin of the next level
While flushing, consolidate counts; report an item if its count hits the threshold
113
![Page 114: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/114.jpg)
Time-stretch LERT
[Figure: levels 0 to L = log(N/M); a quotient filter of size M in RAM, levels of size Mr^1 to Mr^L on flash; N items in total]
Last bin flushed to first bin of the next level
While flushing, consolidate counts; report an item if its count hits the threshold
Main idea: an item is not put on a deeper level until it has “aged sufficiently”
114
![Page 115: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/115.jpg)
Time-stretch LERT I/O complexity
Optimal insert cost for a write-optimized data structure
115
![Page 116: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/116.jpg)
Time-stretch LERT I/O complexity
Extra cost because we only move one bin during a flush. Constant loss for constant 𝛼
Optimal insert cost for a write-optimized data structure
116
![Page 117: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/117.jpg)
Trade-off 2: “One-size-fits-all” approach leaves performance on the table
Static: LIGRA [Shun & Blelloch ‘13]
Dynamic: ASPEN [Dhulipala et al. ‘19]
[Table: add_edge and get_neighbors costs for LIGRA and ASPEN]
Neighbor access requires at least two cache misses
For dynamic, all operations have a log factor
117
![Page 118: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/118.jpg)
Trade-off 2: “One-size-fits-all” approach leaves performance on the table
Static: Compressed Sparse Rows [Shun & Blelloch ‘13] (LIGRA)
Dynamic: C-tree [Dhulipala et al. ‘19] (ASPEN)
[Table: add_edge and get_neighbors costs for LIGRA and ASPEN]
Neighbor access requires at least two cache misses
For dynamic, all operations have a log factor
118
Static → Fast computations; no updates
Dynamic → Slower computations; updates
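A minimal Compressed Sparse Rows sketch shows where the static side of the trade-off comes from. The function names mirror the operations on the slide but are otherwise illustrative. `get_neighbors` performs two dependent reads (the offsets array, then the neighbor array), which is one way to read the "two cache misses" remark, and there is no cheap `add_edge` because inserting an edge shifts everything after it.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from a list of (src, dst) edges."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):       # prefix sums give each row's start
        offsets[v + 1] += offsets[v]
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()        # next free position in each row
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


def get_neighbors(offsets, neighbors, v):
    """Read offsets[v] and offsets[v + 1], then one contiguous neighbor slice."""
    return neighbors[offsets[v]:offsets[v + 1]]


offsets, neighbors = build_csr(3, [(0, 1), (0, 2), (2, 0), (1, 2)])
```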
![Page 119: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/119.jpg)
Real world graphs are often skewed
High variance in the degree distribution
● Dynamic partitioning of vertices based on the degree
● Separate structures for each partition to minimize cache misses
119
![Page 120: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/120.jpg)
Dynamic partitioning + hierarchical structure
High variance in the degree distribution
● Dynamic partitioning of vertices based on the degree
● Separate structures for each partition to minimize cache misses
120
![Page 121: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/121.jpg)
Dynamic partitioning + hierarchical structure
High variance in the degree distribution
● Dynamic partitioning of vertices based on the degree
● Separate structures for each partition to minimize cache misses
121
Terrace: Fast updates
Terrace: Faster computations
![Page 122: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/122.jpg)
Scalable data systems → Scalable data science
122
Massive data: Biology, Astrophysics, Chemistry, Cyber monitoring, Internet of Things, Environmental science, ...
Data systems
Data Science: Machine Learning, Raw sequence search, Graph analytics, Cyber monitoring, Weather predictions, Personalized medicine, ...
My goal as a researcher is to build scalable data systems to accelerate and scale data science applications
![Page 123: Data Science at Scale - Prashant PandeyApocrypha 4. Hyrise 5. A data security startup Theoretically well-founded data structures can have a big impact on multiple subfields across](https://reader035.vdocuments.mx/reader035/viewer/2022071419/611731c7046588623e61546e/html5/thumbnails/123.jpg)
Our contribution
Combine streaming and EM algorithms to solve the real-time event detection problem
Streaming model
External memory algorithms
123