succinct data structures: theory and practice · 2012. 3. 15. · problems: in theory develop...
TRANSCRIPT
Motivation and Context
Succinct Data Structures: Theory and Practice
March 16, 2012
— Succinct Data Structures: Theory and Practice 1/15
Motivation and Context
Contents
1 Motivation and ContextMemory HierarchySuccinct Data StructuresBasics
— Succinct Data Structures: Theory and Practice 2/15
Motivation and Context
1 Motivation and ContextMemory HierarchySuccinct Data StructuresBasics
— Succinct Data Structures: Theory and Practice 3/15
Motivation and Context
Succinct data structures
Data structure D
representation of an objectX
+ operations on X
Example: Rank-bit-vector
bit vector b of length n(0,1,0,1,1,0,1,1)
(0,0,1,1,2,3,3,4)
in n bits space
+access b[i ] in O(1) timerank(i) =
∑i−1j=0 b[j ] in O(n) time
Succinct data structure D
Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice 4/15
Motivation and Context
Succinct data structures
Data structure D
representation of an objectX
+ operations on X
Example: Rank-bit-vector
bit vector b of length n(0,1,0,1,1,0,1,1)
(0,0,1,1,2,3,3,4)
in n bits space
+access b[i ] in O(1) timerank(i) =
∑i−1j=0 b[j ] in O(n) time
Succinct data structure D
Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice 4/15
Motivation and Context
Succinct data structures
Data structure D
representation of an objectX
+ operations on X
Example: Rank-bit-vector
bit vector b of length n(0,1,0,1,1,0,1,1)
(0,0,1,1,2,3,3,4)
in n + n log n bits space
+access b[i ] in O(1) timerank(i) =
∑i−1j=0 b[j ] in O(1) time
Succinct data structure D
Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice 4/15
Motivation and Context
Succinct data structures
Data structure D
representation of an objectX
+ operations on X
Example: Rank-bit-vector
bit vector b of length n(0,1,0,1,1,0,1,1)
(0,0,1,1,2,3,3,4)
in n + n log n bits space
+access b[i ] in O(1) timerank(i) =
∑i−1j=0 b[j ] in O(1) time
Succinct data structure D
Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.
— Succinct Data Structures: Theory and Practice 4/15
Motivation and Context
Succinct data structures
Can succinct data structures replace classic uncompressed datastructures in practice?
Less memory ⇒ fewer CPU cycles !?
Less memory ⇒ less costs !?
Problems:
in theorydevelop succinct data structures
in practiceconstants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement
— Succinct Data Structures: Theory and Practice 5/15
Motivation and Context
Succinct data structures
Can succinct data structures replace classic uncompressed datastructures in practice?
Less memory ⇒ fewer CPU cycles !?
CPU
L1-Cache
L2-Cache
L3-Cache
DRAM
Disk
≈ 100 B
≈ 10 KB
≈ 512 KB
≈ 1-8 MB
≈ 4 GB
≈ x · 100 GB
≈ 1 CPU cycle
≈ 5 CPU cycles
≈ 10-20
≈ 20-100
≈ 100-500
≈ 106
Less memory ⇒ less costs !?
Problems:in theory
develop succinct data structuresin practice
constants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement
— Succinct Data Structures: Theory and Practice 5/15
Motivation and Context
Succinct data structures
Can succinct data structures replace classic uncompressed datastructures in practice?
Less memory ⇒ fewer CPU cycles !?Less memory ⇒ less costs !?Instance name main memory price per hour
Micro 613.0 MB 0.02 US$High-Memory Quadruple Extra Large 68.4 GB 2.00 US$
Pricing of Amazons Elastic Cloud Computing (EC2) service in July 2011.
Problems:in theory
develop succinct data structuresin practice
constants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement
— Succinct Data Structures: Theory and Practice 5/15
Motivation and Context
Succinct data structures
Can succinct data structures replace classic uncompressed datastructures in practice?
Less memory ⇒ fewer CPU cycles !?
Less memory ⇒ less costs !?
Problems:
in theorydevelop succinct data structures
in practiceconstants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement
— Succinct Data Structures: Theory and Practice 5/15
Motivation and Context
Memory Hierarchy
Moore’s Law
Image source: Wikipedia
“The number of transistors that can beplaced inexpensively on an integratedcircuit doubles approximately every twoyears”
— Succinct Data Structures: Theory and Practice 6/15
Motivation and Context
Memory Hierarchy
Reality
Processor speed increasing Disks have not evolved at the same pace
Image source: www.intel.com
— Succinct Data Structures: Theory and Practice 7/15
Motivation and Context
Memory Hierarchy
Image source: Wikipedia
— Succinct Data Structures: Theory and Practice 8/15
Motivation and Context
Memory Hierarchy
Some Numbers
A few CPU registers, less than 1 nanosecond.
A few KBs of L1 cache, about 10 nanoseconds.
A few MBs of L2 cache, about 30 nanoseconds.
A few GBs of RAM, about 60 nanoseconds.
A few TBs of disk, about 10 milliseconds.
— Succinct Data Structures: Theory and Practice 9/15
Motivation and Context
Succinct Data Structures
There are Data Structures:
Modify to use less space than the original one.
That is not compression?
No: It must provide efficient algorithms for simulating itsoperations.
Why use it, if the memory is so cheap?
Improve performance given the memory hierarchy, especially ifwe operate on RAM something that would need the disk.
— Succinct Data Structures: Theory and Practice 10/15
Motivation and Context
Things that will be cover
We will review progress in various succinct compact datastructures.
These will give us theoretical and practical tools to takeadvantage of the memory hierarchy in the design ofalgorithms and data structures.
We will see compact structures for:
Manipulate bit sequencesManipulate symbols sequencesNavigating treesPattern matching and text searchNavigating graphs
We will also see hashing applications, sets, partial sums,geometry, permutations, and more.
— Succinct Data Structures: Theory and Practice 11/15
Motivation and Context
Zero-order Entropy
Average number of bits needed to represent a symbol of a text T ifeach symbol receives always the same code.
H0(T ) = −∑c∈Σ
ncn
logncn
where nc is the number of occurrences of the symbol c in T and nis the length of T .
— Succinct Data Structures: Theory and Practice 12/15
Motivation and Context
K-order Entropy
If we can encode each symbol depending on the context in which itappears, we can achieve better compression ratios.
Hk(T ) =∑s∈Σk
|T s |n
H0(T s)
where T s is the sequence of symbols preceded by the context s inT .
— Succinct Data Structures: Theory and Practice 13/15
Motivation and Context
Entropies of the Pizza&Chili 200MB test cases
dblp.xml dna english proteins rand k128 sources
k Hk CT/n Hk CT/n Hk CT/n Hk CT/n Hk CT/n Hk CT/n0 5.257 0.0000 1.974 0.0000 4.525 0.0000 4.201 0.0000 7.000 0.0000 5.465 0.00001 3.479 0.0000 1.930 0.0000 3.620 0.0000 4.178 0.0000 7.000 0.0000 4.077 0.00002 2.170 0.0000 1.920 0.0000 2.948 0.0001 4.156 0.0000 6.993 0.0001 3.102 0.00003 1.434 0.0007 1.916 0.0000 2.422 0.0005 4.066 0.0001 5.979 0.0100 2.337 0.00124 1.045 0.0043 1.910 0.0000 2.063 0.0028 3.826 0.0011 0.666 0.6939 1.852 0.00825 0.817 0.0130 1.901 0.0000 1.839 0.0103 3.162 0.0173 0.006 0.9969 1.518 0.02506 0.705 0.0265 1.884 0.0001 1.672 0.0265 1.502 0.1742 0.000 1.0000 1.259 0.05097 0.634 0.0427 1.862 0.0001 1.510 0.0553 0.340 0.4506 0.000 1.0000 1.045 0.08508 0.574 0.0598 1.834 0.0004 1.336 0.0991 0.109 0.5383 0.000 1.0000 0.867 0.12559 0.537 0.0773 1.802 0.0013 1.151 0.1580 0.074 0.5588 0.000 1.0000 0.721 0.170110 0.508 0.0955 1.760 0.0051 0.963 0.2292 0.061 0.5699 0.000 1.0000 0.602 0.2163
— Succinct Data Structures: Theory and Practice 14/15
Motivation and Context
Calculation of H1(T) in linear time
15
$
7
dumulmum$
11
m$
3
ndumulm
um$
lmu
14
$9
m$
1
ndumulm
um$
lmu
12m$
4
ndumulm
um$
um
6ndumulm
um$
10m$
2
ndumulm
um$
lmu
13
$
8
m$
0
ndumulm
um$
ulm
um
5
ndumulm
um$
u
+5H0([1, 4])+5H0([1, 4])
+6H0([2, 3, 1])+6H0([2, 3, 1])
+0
+0
+0
+0
— Succinct Data Structures: Theory and Practice 15/15