dictiona - itu.dk€¦ · smaller and f aster: succinct data structures session 1: intro, bit v...
TRANSCRIPT
Smaller and Faster: Su in t Data Stru turesSession 1: Intro, Bit Ve torsJérémy Barbay(some �gures by Gonzalo Navarro)1/ 24
OutlineThe CourseSu in t Data Stru turesWhy?How?What?Bit Ve tors and Di tionariesDi tionaries in a pre-grad ourse.Whi h Di tionary DS do you know?Di tionary as a Bit Ve torHomework 3/ 24
Stru ture of the Course1. Basi Con epts; Bit Ve tors and Stati Di tionaries;2. Trees; Graphs;3. Permutations; Fun tions;4. Strings; Binary Relations;5. Labeled Obje ts; Overview;Talk From Sorting to Compression
4/ 24
ForewordWarning◮ Not a spe ialist.◮ Not te hni ally in lined.◮ Five Sessions is very short.
5/ 24
ForewordWarning◮ Not a spe ialist.◮ Not te hni ally in lined.◮ Five Sessions is very short.Why do I give those le tures?◮ Tea hing = learning better.◮ Did not �nd a good (English) tutorial.◮ Su in t Data Stru tures get to be pra ti al.
5/ 24
What you should get from it◮ List of useful referen es;◮ Sele tion of te hni al examples;◮ Some general ulture about te hniques to apply to yourproblem.
6/ 24
OutlineThe CourseSu in t Data Stru turesWhy?How?What?Bit Ve tors and Di tionariesDi tionaries in a pre-grad ourse.Whi h Di tionary DS do you know?Di tionary as a Bit Ve torHomework 7/ 24
Su in t Data Stru turesWhy?Se ondary Memory arguments◮ Ma hines run faster,◮ Memory gets larger,◮ A ess gets slower.
8/ 24
Su in t Data Stru turesWhy?(Old) Se ondary Memory arguments◮ Ma hines run faster,◮ Memory gets larger,◮ A ess gets slower.More spe i� ally◮ Lower bounds (e.g. Beame and Fi h (1999))◮ Provably optimal (e.g. Golynski (2007))
8/ 24
The Memory Hierar hyCurrent Situation◮ Rough numbers:
◮ A few CPU registers, less than 1 nanose ond.◮ A few KBs of a he L1, a few 10 nanose unds.◮ A few MBs of a he L2, a few 30 nanose unds.◮ A few GBs of RAM, a few 60 nanose unds.◮ A few TBs of dis o, a few 10 milise unds of laten y, more afew 500 nanose unds per word transfered.
◮ The memory hierar hy is more and more relevant!�The di�eren e in time between a essing RAM and a essinga hard drive is omparable to the di�eren e between a essinga book on the shelves on your o� e and getting to the otherend of the globe and oming ba k.� 9/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
◮ Spa e to ode;10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
◮ Spa e to ode;◮ Spa e to represent;
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
◮ Spa e to ode;◮ Spa e to represent;◮ Time to pre ompute the representation;
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
◮ Spa e to ode;◮ Spa e to represent;◮ Time to pre ompute the representation;◮ Time to navigate and sear h.
10/ 24
Su in t Data Stru turesHow?◮ An obje t hosen among N is optimally en oded by ⌈lgN⌉bits, but rarely usable as su h (e.g. trees).◮ We explore the relation between:
◮ Spa e to ode;◮ Spa e to represent;◮ Time to pre ompute the representation;◮ Time to navigate and sear h.
◮ What is the smallest amount of spa e and timeneeded to pre ompute a representation allowing tonavigate and sear h su h an obje t?10/ 24
Su in t Data Stru turesExample: ordinal trees◮ 2n − Θ(lg n) bits to ode an ordinal tree;
11/ 24
Su in t Data Stru turesExample: ordinal trees◮ 2n − Θ(lg n) bits to ode an ordinal tree;◮ Using pointers
◮ 4n lg n ≈ 128× n bits represents◮ and navigates it in onstant time,◮ with O(n) prepro essing time.
11/ 24
Su in t Data Stru turesExample: ordinal trees◮ 2n − Θ(lg n) bits to ode an ordinal tree;◮ Using pointers
◮ 4n lg n ≈ 128× n bits represents◮ and navigates it in onstant time,◮ with O(n) prepro essing time.
◮ Using te hniques from [Sadakane (2008)℄◮ 2n + n/ lg n ∈ 2n + o(n) bits (< 3n in pra ti e) represents◮ and navigates and sear hes it in onstant time,◮ with O(n) prepro essing time.
11/ 24
Su in t Data Stru turesExample: ordinal trees◮ 2n − Θ(lg n) bits to ode an ordinal tree;◮ Using pointers
◮ 4n lg n ≈ 128× n bits represents◮ and navigates it in onstant time,◮ with O(n) prepro essing time.
◮ Using te hniques from [Sadakane (2008)℄◮ 2n + n/ lg n ∈ 2n + o(n) bits (< 3n in pra ti e) represents◮ and navigates and sear hes it in onstant time,◮ with O(n) prepro essing time.(All results in a θ(lg n) bits ar hite ture.) 11/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;◮ A Su in t Index uses o(lgN) bits and a ess to theunmodi�ed data;
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;◮ A Su in t Index uses o(lgN) bits and a ess to theunmodi�ed data;◮ An Ultra-Su in t En oding uses H(I ) + o(lgN) bits to ompress the data;
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;◮ A Su in t Index uses o(lgN) bits and a ess to theunmodi�ed data;◮ An Ultra-Su in t En oding uses H(I ) + o(lgN) bits to ompress the data;◮ A Self-Index uses O(H(I )) bits to ompress and index the data;
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;◮ A Su in t Index uses o(lgN) bits and a ess to theunmodi�ed data;◮ An Ultra-Su in t En oding uses H(I ) + o(lgN) bits to ompress the data;◮ A Self-Index uses O(H(I )) bits to ompress and index the data;
12/ 24
Su in t Data Stru turesWhat?◮ An obje t hosen among N is optimally oded by ⌈lgN⌉ bits,but rarely usable as su h (e.g. trees).◮ What is the smallest amount of spa e and time needed torepresent, navigate and sear h su h an obje t?◮ Several type of answers:
◮ A Su in t En oding uses ⌈lgN⌉ + o(lgN) bits for everything;◮ A Su in t Index uses o(lgN) bits and a ess to theunmodi�ed data;◮ An Ultra-Su in t En oding uses H(I ) + o(lgN) bits to ompress the data;◮ A Self-Index uses O(H(I )) bits to ompress and index the data;Note that:
◮ Su in t En oding ⇐ Su in t Index ⇐ Compressed Index;◮ Su in t En oding ⇐ Compressed Self Index. 12/ 24
OutlineThe CourseSu in t Data Stru turesWhy?How?What?Bit Ve tors and Di tionariesDi tionaries in a pre-grad ourse.Whi h Di tionary DS do you know?Di tionary as a Bit Ve torHomework 13/ 24
Di tionary ADT (from CS240, Waterloo)◮ Container of key-element pairs◮ Required operations:
◮ insert( k,e ),◮ remove( k ),◮ find( k ),◮ isEmpty().
14/ 24
Di tionary ADT (from CS240, Waterloo)◮ Container of key-element pairs◮ Required operations:
◮ insert( k,e ),◮ remove( k ),◮ find( k ),◮ isEmpty().
◮ When an order is provided:◮ losestKeyBefore( k ),◮ losestElemAfter( k ),◮ rank(k): nb. of values smaller than k in the set,◮ sele t(r): r -th smallest number in the set.Note: No dupli ate keys 14/ 24
Spe i� Di tionary ADTs and their DS
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen e
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)◮ AVL
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)◮ AVL◮ (2, 4) Trees
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)◮ AVL◮ (2, 4) Trees◮ B-Trees
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)◮ AVL◮ (2, 4) Trees◮ B-TreesValued ◮ Hash Tables
15/ 24
Spe i� Di tionary ADTs and their DSUnordered ◮ Array◮ Sequen eOrdered ◮ Array◮ Sequen e (Skip Lists)◮ Binary Sear h Tree (BST)◮ AVL◮ (2, 4) Trees◮ B-TreesValued ◮ Hash Tables◮ Extendible Hashing
15/ 24
Notes on Computation Model◮ Di tionaries and Sets are implemented (almost) identi ally.◮ Fo us on stati data stru tures and operators
◮ losestKeyBefore( k ), losestElemAfter( k )◮ rank(k), sele t(r).
◮ Advan ed implementations address the other operations.◮ We will onsider the Θ(lg n)-RAM model, where operations on
Θ(lg n) bits are supported in onstant time.16/ 24
Di tionary as Bit Ve torString Su in t En odings support◮ bin_rank(α, x): nb. of α-o urren es before pos. x ;◮ bin_sele t(α, r): position of r -th α-o urren e.Example: 0 0 0 1 0 0 0 1 0 0◮ bin_rank(1, 6) = ?
◮ bin_sele t(1, 2) = ?
17/ 24
Di tionary as Bit Ve torString Su in t En odings support◮ bin_rank(α, x): nb. of α-o urren es before pos. x ;◮ bin_sele t(α, r): position of r -th α-o urren e.Example: 0 0 0 1 0 0 0 1 0 0◮ bin_rank(1, 6) = 1◮ bin_sele t(1, 2) = ?
17/ 24
Di tionary as Bit Ve torString Su in t En odings support◮ bin_rank(α, x): nb. of α-o urren es before pos. x ;◮ bin_sele t(α, r): position of r -th α-o urren e.Example: 0 0 0 1 0 0 0 1 0 0◮ bin_rank(1, 6) = 1◮ bin_sele t(1, 2) = 8
17/ 24
General Te hniqueSummaries of Summaries◮ Pre ompute (and ode) the fun tion for a small sele tion ofvalues;◮ For some in-between values, ode the di�eren e;◮ For the smallest sequen es of left in-between values, use agreedy approa h.Bender and Fara h-Colton (2004) gave a veryni e writing of the te hnique. Even though theywere not aiming at su in t data stru ture butrather small pre omputing time, they a hievethe same goal, and explain it in a ni e in re-mental way. 18/ 24
Spe i� Te hniqueBasi s◮ Binary Rank through re ursive de�nition of the bit ve tor insuperblo s and blo s;◮ Binary Sele t through re ursive de�nition of the parameterspa e in sparse and dense blo s.Further Improvements◮ Raman et al. (2002) showed how to ompress to the binary entropy.◮ Golynski (2006) redu ed the spa erequired through the ommon use of ounting index. 19/ 24
Binary Entropyif there are n0 zeroes and n1 ones in a bit ve tor B(n0 + n1 = n = |B |)
H0(B) =n0n lg nn0 +
n1n lg nn1=
1n lg( nn0) + O ( lg nn )
20/ 24
Binary RankTrivial Solution: Constant Time◮ Divide B in blo s of b = (log n)/2 bits.◮ Pre ompute in an array R [1, n/b] the values of bin_rank atthe begining of ea h blo ◮ Pre ompute an array T [0, 2b − 1][0, b − 1], su h thatT [x , r ] = total number of 1's in x [1, r ]where x is a blo of b bits.◮ Compute bin_rank(B , i) as follows:
◮ De ompose i = q · b + r , 0 ≤ r < b;◮ The number of 1's till position qb is R[q];◮ The number of 1's in B[qb + 1, qb + r ] isT [B[qb + 1, qb + b], r ];◮ The binary rank is obtained in onstant time throughR[q] + T [B[qb + 1..qb + b], r ] 21/ 24
Binary RankTrivial Solution: Linear Spa e◮ The array R usesnb log n =
n(log n)/2 · log n = 2n bits
◮ The array T , indexed on [0, b − 1], uses2b · b · log b = 2 log n2 · log n2 · (log log n − 1)≤ 12√n log n log log n = o(n) bits.For instan e, for n = 232, T requires only 512 KB.
◮ This solution takes 3n + o(n) bits. 22/ 24
Binary RankTrivial Solution
23/ 24
Binary RankSu in t Solution◮ Divide B in superblo s ofs =
(log n)22 = b log n bits.◮ Pre ompute in an array S [1, n/s] the values of bin_rank atthe begining of ea h superblo , relatively to the value ofbin_rank at the begining of ea h blo :R [q] = bin_rank(B , qb) − S [⌊q/ log n⌋]
= bin_rank(B , qb) − bin_rank(B , ⌊q/ log n⌋ · log n)24/ 24
Binary RankSu in t Solution: sublinear indexing spa e◮ Ea h value of bin_rank in S requires log n bits, so that Suses in totalns · log n =
n(log n)2/2 · log n =
2nlog n = o(n) bits◮ Ea h value of bin_rank in R is less than s = O(log n)2 andrequires 2 log log n bits, so that R uses in totalnb · 2 log log n =
n(log n)/2 · 2 log log n
=4n · log log nlog n = o(n) bits. 25/ 24
Binary RankSu in t Solution: onstant time◮ How to ompute bin_rank(B , i) ?
◮ De ompose i = q · b + r , 0 ≤ r < b.◮ De ompose i = q′ · s + r ′, 0 ≤ r ′ < s.◮ The sum of three tabulated value yields the result:bin_rank(B , i) = S [q′] + R[q] + T [B[qb + 1..qb + b], r ]
26/ 24
Binary RankSu in t Solution
27/ 24
Binary Sele tSele t in onstant time
◮ Based on sampling on arguments , not on B .◮ More omplex te hnique from bin_rank, based on thepartitioning of blo s into dense and sparse blo s.◮ Solution in o(n) = O(n/ lg lg n).
28/ 24
Binary Sele t◮ Divide the arguments of bin_sele t in superblo s ofs = lg2 n arguments.◮ Ea h blo orresponds to s ones in an interval of distin t sizein B : a blo is sparse if spanning at leasts lg n lg lg n = lg3 n lg lg n positions; and dense otherwise.◮ Note that all sparse blo s require in total at mostO(n/ lg lg n). bits.
29/ 24
Binary Sele t◮ Pre ompute a bit ve tor E [1, n/s] indi ating whi h superblo sare sparse.◮ Join the answers to bin_sele t in the sparse superblo s inan array of arrays R :bin_sele t(B , i) = R [bin_rank(E , ⌊i/s⌋)] [1 + (i mod s)]for ea h position i su h that E [⌊i/s⌋] = 1.◮ Add the starting position in B of ea h dense superblo in anarray P [1, n/s].◮ Ea h superblo is spanning at most s lg n lg lg n = lg3 n lg lg npositions in B . 30/ 24
Binary Sele t◮ Divide ea h dense superblo in blo s of b = (lg lg n)2arguments.◮ Apply the same te hnique re ursively inside ea h densesuperblo .◮ Require lg(lg3 n lg lg n) ≤ 4 lg lg n bits to ode ea h positioninside a dense superblo .◮ A blo is sparse if spanning at least 4b(lg lg n)2 positions in B ,and dense otherwise.◮ Combine all the (relative) answers of the sparse blo s inside thedense superblo s in an array, using in total O(n/ lg lg n) bits.
31/ 24
Binary Sele t◮ Indi ate whi h blo s are sparse in an array E ′[1, s/b].◮ Store the starting position of ea h dense superblo in an arrayP ′[1, s/b] using O(n/ log log n) bits.◮ Ea h dense blo spans at most 4(log log n)4 = o(log n)positions, so that they are supported in onstant time throughtable look-up.
32/ 24
Binary Sele t◮ How to ompute bin_sele t(B , i)?
◮ De ompose i = q · s + r .◮ If E [q] = 1, i falls in a sparse superblo , return R[...].◮ Otherwise, i falls in a dense superblo starting in P[q].◮ De ompose r = q′ · b + r ′.◮ If E ′q[q′] = 1, i falls in a sparse blo , return P[q] + R ′[...].◮ Otherwise, i falls in a dense blo starting in p = P[q] + P ′[q′].◮ Sear h for the r ′-th one in B[p...p + 4(log log n)4] throughtable look-up.
33/ 24
Binary Sele t
34/ 24
OutlineThe CourseSu in t Data Stru turesWhy?How?What?Bit Ve tors and Di tionariesDi tionaries in a pre-grad ourse.Whi h Di tionary DS do you know?Di tionary as a Bit Ve torHomework 36/ 24
HomeworkBalan ed Parenthesis: (()())◮ How does this relate to bit ve tors?◮ What kind of operators would you support on it?◮ How do you support them?Ordinal trees◮ How does this relate to BP?◮ What kind of operators would you support on it?◮ How do you support them?
37/ 24
Stru ture of the Course1. Basi Con epts; Bit Ve tors and Stati Di tionaries;2. Trees; Graphs;3. Permutations; Fun tions;4. Strings; Binary Relations;5. Labeled Obje ts; Overview;Talk From Sorting to Compression
38/ 24
BibliographyBeame, P. and Fi h, F. E. (1999). Optimal bounds for theprede essor problem. In Pro eedings of the 31st annual ACMsymposium on Theory of omputing, pages 295�304.Bender, M. A. and Fara h-Colton, M. (2004). The level an estorproblem simpli�ed. Theor. Comput. S i., 321(1):5�12.Golynski, A. (2006). Optimal lower bounds for rank and sele tindexes. In Pro eedings of the International Colloquium onAutomata, Languages and Programming (ICALP).Golynski, A. (2007). Upper and Lower Bounds for Text IndexingData Stru tures. PhD thesis, University of Waterloo.Raman, R., Raman, V., and Rao, S. S. (2002). Su in t indexabledi tionaries with appli ations to en oding k-ary trees andmultisets. In Pro eedings of the 13th Annual ACM-SIAMSymposium on Dis rete algorithms, pages 233�242. 39/ 24