![Page 1: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/1.jpg)
Space-Efficient Data Structures for Top-k Completion
Giuseppe Ottaviano, Università di Pisa
Bo-June (Paul) Hsu, Microsoft Research
WWW 2013
![Page 2: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/2.jpg)
String auto-completion
![Page 3: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/3.jpg)
Scored string sets
Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores
Example: p = “tr”, k = 2 → (triangle, 9), (trie, 5)
three 2
trial 1
triangle 9
trie 5
triple 4
triply 3
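The query definition above can be pinned down with a brute-force reference implementation. This is only a minimal sketch for checking expected answers, not one of the data structures compared in the talk; the names `top_k_completions` and `SCORED` are made up for illustration.

```python
import heapq

# The scored string set from the slide.
SCORED = {"three": 2, "trial": 1, "triangle": 9, "trie": 5, "triple": 4, "triply": 3}

def top_k_completions(scored, prefix, k):
    """Return the k highest-scoring (string, score) pairs whose string starts with prefix."""
    matches = ((s, score) for s, score in scored.items() if s.startswith(prefix))
    # nlargest returns its result sorted from highest to lowest score.
    return heapq.nlargest(k, matches, key=lambda pair: pair[1])

print(top_k_completions(SCORED, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

It scans every string on every query, which is exactly the cost the structures in the rest of the talk are meant to avoid.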
![Page 4: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/4.jpg)
Space-Efficiency
• Scored string sets can be very large
  – Hundreds of millions of queries for web search auto-suggest
• Must fit in RAM for fast access
  – Need space-efficient solutions!
• We compare three solutions
  – RMQ Trie, based on Range Minimum Queries
  – Completion Trie, based on a modified trie with variable-sized pointers
  – Score-Decomposed Trie, based on succinct data structures
![Page 5: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/5.jpg)
RMQ TRIE (RT)
![Page 6: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/6.jpg)
RMQ Trie
• Lexicographic order → strings starting with given prefix in a contiguous range
• If we can find the max in a range, it is top-1
• Range is split in two subranges, can proceed recursively using a heap to retrieve top-k
three 2
trial 1
triangle 9
trie 5
triple 4
triply 3
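The range-splitting idea above can be sketched as follows. This is a hedged illustration, not the talk's implementation: a sorted Python list stands in for the trie, `argmax` is a linear scan standing in for the succinct RMQ structure, and the sentinel character in `prefix_range` assumes no stored string contains U+10FFFF. All function names are hypothetical.

```python
import bisect
import heapq

STRINGS = ["three", "trial", "triangle", "trie", "triple", "triply"]  # sorted
SCORES = [2, 1, 9, 5, 4, 3]                                           # aligned with STRINGS

def prefix_range(sorted_strings, prefix):
    """Half-open index range [lo, hi) of the strings that start with prefix."""
    lo = bisect.bisect_left(sorted_strings, prefix)
    # Assumption: U+10FFFF never occurs in the data, so it sorts after any completion.
    hi = bisect.bisect_left(sorted_strings, prefix + "\U0010ffff")
    return lo, hi

def argmax(scores, lo, hi):
    """Stand-in for the RMQ structure: position of the max score in [lo, hi)."""
    return max(range(lo, hi), key=scores.__getitem__)

def top_k(sorted_strings, scores, prefix, k):
    heap = []  # entries are (-score, index of max, range lo, range hi)

    def push(a, b):
        if a < b:
            i = argmax(scores, a, b)
            heapq.heappush(heap, (-scores[i], i, a, b))

    push(*prefix_range(sorted_strings, prefix))
    out = []
    while heap and len(out) < k:
        neg, i, a, b = heapq.heappop(heap)
        out.append((sorted_strings[i], -neg))
        push(a, i)      # subrange to the left of the reported maximum
        push(i + 1, b)  # subrange to the right of it
    return out

print(top_k(STRINGS, SCORES, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

Each reported completion costs one heap pop and at most two range-max queries, so the quality of the RMQ structure dominates the running time.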
![Page 7: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/7.jpg)
RMQ Trie
• To store the strings we can use any data structure that keeps the strings sorted
  – We use a compressed trie
• To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure
  – Needs only 2.6 additional bits per score, answers queries in O(log n) time
• This is a standard strategy, but not very fast. We use it as a baseline.
![Page 8: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/8.jpg)
COMPLETION TRIE (CT)
![Page 9: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/9.jpg)
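(Scored) compacted tries
[Figure: compacted trie over the scored string set below. Each node has a node label (ε where the label is empty) and is reached from its parent through a branching character; the scores 2, 1, 9, 5, 4, 3 are attached to the strings.]
three 2
trial 1
triangle 9
trie 5
triple 4
triply 3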
![Page 10: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/10.jpg)
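Completion Trie
[Figure: the same compacted trie over the scored string set (three 2, trial 1, triangle 9, trie 5, triple 4, triply 3). The internal nodes are additionally annotated with scores 4, 9, 9, 9, each matching the maximum score in that node's subtree.]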
![Page 11: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/11.jpg)
Completion Trie
[Figure: the annotated trie redrawn as nodes carrying a label and a score.]
t 9
hree 2
ri 9
a 9
e 5
pl 4
e 4
y 3
ngle 9
l 1
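One way the per-node scores can drive a top-k query is a best-first search with a priority queue. The sketch below is written under assumptions and is not the talk's code. It rebuilds the ten nodes listed on this slide by hand, treats each node's score as the maximum score in its subtree, and assumes strings end only at leaves, which holds for this example. `Node`, `find_locus` and `top_k` are illustrative names.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str   # branching character followed by the rest of the node label
    score: int   # maximum score of any string in this subtree
    children: list = field(default_factory=list)  # empty for a leaf

ROOT = Node("t", 9, [
    Node("hree", 2),
    Node("ri", 9, [
        Node("a", 9, [Node("ngle", 9), Node("l", 1)]),
        Node("e", 5),
        Node("pl", 4, [Node("e", 4), Node("y", 3)]),
    ]),
])

def find_locus(root, prefix):
    """Return (node, path string) for the highest node whose path covers prefix, else None."""
    node, spelled = root, root.label
    while True:
        if spelled.startswith(prefix):
            return node, spelled
        if not prefix.startswith(spelled):
            return None
        for child in node.children:
            if child.label[0] == prefix[len(spelled)]:
                node, spelled = child, spelled + child.label
                break
        else:
            return None

def top_k(root, prefix, k):
    locus = find_locus(root, prefix)
    if locus is None:
        return []
    node, spelled = locus
    heap = [(-node.score, spelled, node)]  # path strings are unique, so nodes are never compared
    out = []
    while heap and len(out) < k:
        neg, spelled, node = heapq.heappop(heap)
        if not node.children:  # a leaf is a complete string
            out.append((spelled, -neg))
        for child in node.children:
            heapq.heappush(heap, (-child.score, spelled + child.label, child))
    return out

print(top_k(ROOT, "tr", 2))  # [('triangle', 9), ('trie', 5)]
```

Because a node's score bounds every score below it, a leaf popped from the heap cannot be beaten by anything still queued, so completions come out in decreasing score order.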
![Page 12: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/12.jpg)
Completion Trie
• Scores encoded differentially (either from parent or previous sibling)
• Pointers and score deltas encoded with variable bytes
• All node information in the same stream, favoring cache-efficiency
[Figure: the node stream, each node with its label and score: t 9, hree 2, ri 9, a 9, e 5, pl 4, e 4, y 3, ngle 9, l 1]
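The combination of differential scores and variable-byte coding can be illustrated with a small sketch. The slides do not give the exact byte layout, so this uses a generic 7-bits-per-byte scheme (low group first, high bit set when more bytes follow) as an assumption, and it encodes each child's score as a delta from the previous sibling, with the first child taken relative to the parent. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups; the high bit marks a continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def vbyte_decode(buf, pos=0):
    """Decode one integer starting at pos; return (value, position after it)."""
    n = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return n, pos

def encode_children(parent_score, child_scores):
    """child_scores must be in non-increasing order and bounded by parent_score."""
    out = bytearray()
    prev = parent_score
    for score in child_scores:
        out += vbyte_encode(prev - score)  # small non-negative delta
        prev = score
    return bytes(out)

def decode_children(parent_score, buf, count):
    scores, pos, prev = [], 0, parent_score
    for _ in range(count):
        delta, pos = vbyte_decode(buf, pos)
        prev -= delta
        scores.append(prev)
    return scores

# Children of node "ri" (score 9) on the slide: a 9, e 5, pl 4.
encoded = encode_children(9, [9, 5, 4])
print(list(encoded))                    # [0, 4, 1]
print(decode_children(9, encoded, 3))   # [9, 5, 4]
```

Since the deltas are small for most nodes, each typically fits in a single byte, while rare large values still decode correctly at the cost of extra bytes.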
![Page 13: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/13.jpg)
SCORE-DECOMPOSED TRIE (SDT)
![Page 14: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/14.jpg)
Trees as balanced parentheses
()
()
() ()
(()()())
(()(()()()))
2n bits are sufficient (and necessary) to represent a tree
Can support O(1) operations with 2n + o(n) bits
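The parentheses encoding can be reproduced with a few lines of code. This is a minimal sketch in which a tree is a nested Python list of children, and `find_close` is a plain linear scan. The O(1) operations mentioned above need the additional o(n)-bit index, which is not implemented here. The function names are illustrative.

```python
def to_bp(children):
    """Balanced-parentheses string of a tree given as a nested list of children."""
    return "(" + "".join(to_bp(child) for child in children) + ")"

def find_close(bp, i):
    """Position of the parenthesis matching the open one at i (linear scan)."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

def subtree_size(bp, i):
    """Number of nodes in the subtree whose open parenthesis is at i."""
    return (find_close(bp, i) - i + 1) // 2

leaf = []
tree = [leaf, [leaf, leaf, leaf]]  # root with a leaf child and a child that has three leaves
bp = to_bp(tree)
print(bp)                   # (()(()()()))
print(len(bp))              # 12, i.e. 2 bits per node for 6 nodes
print(subtree_size(bp, 3))  # 4
```

Each node contributes exactly one open and one close parenthesis, which is where the 2n figure comes from.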
![Page 15: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/15.jpg)
Score-decomposed trie
• Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012]
• Parentheses-based representation of trees
• Dictionary-compression of node labels
![Page 16: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/16.jpg)
Score-decomposed trie
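[Figure: the scored compacted trie for the example set, next to its root path “triangle” (score 9) with children h,2 e,5 p,4 l,1, and the sequences that encode this node:]
L : t1ri2a1ngle
BP: ( ((( )
B : h epl
R : 2 541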
![Page 17: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/17.jpg)
three 2
trial 1
triangle 9
trie 5
triple 4
triply 3
![Page 18: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/18.jpg)
Score compression
... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...
3 bits/value 11 bits/value 16 bits/value
• Data structure to store scores in RT and SDT
• Packed-blocks array
  – “Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta,…
• Divide the array into fixed-size blocks
• Encode the values of each block with the same number of bits
• Store separately the block offsets
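The three steps above can be sketched as a small class. This is an illustration under assumptions rather than the layout used in the talk: a Python integer serves as the bit buffer, values are non-negative, and the block size of 8 is a guess that happens to reproduce the 3-bit and 11-bit widths shown for the first sixteen values on the slide. `PackedBlocks` and its attributes are hypothetical names.

```python
class PackedBlocks:
    """Fixed-size blocks, one bit width per block, block offsets stored separately."""

    def __init__(self, values, block_size=8):
        self.block_size = block_size
        self.n = len(values)
        self.widths = []   # bits per value, one entry per block
        self.offsets = []  # starting bit position of each block
        self.bits = 0      # all blocks concatenated into one big integer
        pos = 0
        for start in range(0, len(values), block_size):
            block = values[start:start + block_size]
            width = max(v.bit_length() for v in block)  # widest value decides
            self.widths.append(width)
            self.offsets.append(pos)
            for v in block:
                self.bits |= v << pos
                pos += width
        self.total_bits = pos

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        block, j = divmod(i, self.block_size)
        width = self.widths[block]
        start = self.offsets[block] + j * width
        return (self.bits >> start) & ((1 << width) - 1)

# First sixteen values shown on the slide.
values = [3, 5, 1, 2, 3, 0, 0, 1, 2, 4, 1900, 1, 1, 2, 3, 2]
packed = PackedBlocks(values)
print(packed.widths)      # [3, 11]
print(packed[10])         # 1900
print(packed.total_bits)  # 112
```

Random access needs only the block's offset and width, so a lookup costs a constant number of operations regardless of how the widths vary between blocks.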
![Page 19: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/19.jpg)
Score compression
... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000 ...
3 bits/value 11 bits/value 16 bits/value
• Can be unlucky
  – Each block may contain a large value
• But scores are power-law distributed
• Also, tree-wise monotone sorting
• On average, 4 bits per score
![Page 20: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/20.jpg)
Space
Dataset gzip CT SDT RT
QueriesA 27% 57% 30% 31%
QueriesB 25% 48% 26% 27%
URLs 24% 57% 26% 27%
Unigrams 39% 43% 35% 37%
![Page 21: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/21.jpg)
Space
![Page 22: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/22.jpg)
Time
Time per returned completion on a top-10 query
![Page 23: Space-Efficient Data Structures for Top-k Completion](https://reader030.vdocuments.mx/reader030/viewer/2022032806/5681349a550346895d9b9020/html5/thumbnails/23.jpg)
Thanks for your attention!
Questions?