Encoding = (Data Structures) - (Data)
Rajeev Raman
University of Leicester
SPIRE 2015, King’s College London
Introduction Encoding Data Structures Asymptotically Optimal Encodings Minimal Encodings Conclusions
RMQ problem
Problem Statement
Given a static array A[1..n], pre-process A to answer queries:
RMQ(l, r): return $\max_{l \le i \le r} A[i]$.
A = 47 43 97 24 46 33 34 85 67 18 83 45
RMQ(5, 10) = 85.
Data Structuring Problems
This is a data structuring problem.
• Pre-process input data (here array A) to answer a long series of queries.
• Want to minimize:
  1. Space usage of data structure.
  2. Query time.
  3. Time/space for pre-processing.
• In this talk we assume the input data is static, i.e. it does not change between queries.
Solution to RMQ Problem: Cartesian Tree
The Cartesian tree of A [Vuillemin CACM’80] is a binary tree.
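A = 47 43 97 24 46 33 34 85 67 18 83 45
[Figure: Cartesian tree of A. Root 97; its left child is 47 (with right child 43) and its right child is 85. Under 85, the left child is 46 (left child 24; right child 34, whose left child is 33) and the right child is 83 (left child 67, whose right child is 18; right child 45).]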
• Place largest value at root of tree.
• Recurse on sub-arrays to left and right.
• RMQ is the lowest common ancestor (LCA) of interval endpoints.
• n-node binary tree can support LCA in O(n) space and O(1) time. [Harel/Tarjan SICOMP’84]
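The construction and the LCA-based query on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LCA is found by walking parent pointers (time proportional to tree depth, not the O(1) of Harel/Tarjan), values are assumed distinct, and the names `cartesian_tree` and `rmq_value` as well as the ordering of the 12-element example array (inferred from the tree figure) are choices made for this example.

```python
def cartesian_tree(A):
    """Build the max-Cartesian tree of A with a stack, in O(n) time.

    Returns (root, parent) using 0-based indices; parent[root] == -1.
    """
    parent = [-1] * len(A)
    stack = []                      # indices on the current right spine
    for i, x in enumerate(A):
        last = -1
        while stack and A[stack[-1]] < x:
            last = stack.pop()      # smaller values end up in i's left subtree
        if last != -1:
            parent[last] = i
        if stack:
            parent[i] = stack[-1]   # i becomes the right child of the stack top
        stack.append(i)
    return stack[0], parent


def rmq_value(A, parent, l, r):
    """Maximum of A[l..r] (1-based, inclusive) as the LCA of the endpoints."""
    ancestors = set()
    v = l - 1
    while v != -1:                  # collect l and all of its ancestors
        ancestors.add(v)
        v = parent[v]
    v = r - 1
    while v not in ancestors:       # climb from r until the paths meet
        v = parent[v]
    return A[v]


A = [47, 43, 97, 24, 46, 33, 34, 85, 67, 18, 83, 45]
root, parent = cartesian_tree(A)
print(rmq_value(A, parent, 5, 10))  # 85
```

The parent-pointer walk keeps the sketch short; a constant-time LCA structure would replace `rmq_value` without changing the tree.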
Compressing RMQ
• O(n) space = O(n) words = Ω(n lg n) bits, where lg = $\log_2$.
• Many applications where using O(n) words is way too much.
  • Suffix tree on a string of n bits occupies O(n) words.
• The same is true for many applications of RMQ.
  • Can reconstruct A by asking RMQ(i, i) queries.
  • In general A can’t be compressed below Ω(n lg n) bits.
  • In specific applications (e.g. LCP array), A can be compressed, but then accessing A[i] is slow.
Can we do better?
The RMQ Problem Redefined
Given a static array A[1..n], pre-process A to answer queries:
RMQ(l, r) = $\arg\max_{l \le i \le r} A[i]$.
A = 47 43 97 24 46 33 34 85 67 18 83 45
RMQ(5, 10) = 8.
Often the value of A[i] is not needed.
Encoding RMQ
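RMQ(l, r) = $\arg\max_{l \le i \le r} A[i]$.
[Figure: the Cartesian tree with nodes labelled by array position. Root 3; its left child is 1 (with right child 2) and its right child is 8. Under 8, the left child is 5 (left child 4; right child 7, whose left child is 6) and the right child is 11 (left child 9, whose right child is 10; right child 12).]
• Shape of Cartesian tree is enough to answer modified RMQ queries.
  • A is not necessary!
• There are ≤ $4^n$ distinct binary trees on n nodes.
  • Shape can be encoded in ≤ $\lg 4^n = 2n$ bits.
  • Concrete encoding: 11 01 00 11 11 00 10 00 11 01 00 00.
• Data structures using 2n + o(n) bits, O(1) query time. [Fischer/Heun SICOMP’11], [Davoodi et al. COCOON’12].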
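To make the "A is not necessary" point concrete, the sketch below encodes only the tree shape and then answers arg-max queries from the bits alone. The convention used here is an assumption: two bits per node in preorder, meaning (has left child, has right child). With the array ordering inferred from the tree figure it reproduces the bit string quoted on the slide. The function names are hypothetical, values are assumed distinct, and the query walks down the tree rather than running in O(1) time.

```python
def encode_shape(A):
    """Encode the max-Cartesian tree shape of A: 2 bits per node, preorder."""
    pairs = []

    def rec(lo, hi):
        if lo > hi:
            return
        m = max(range(lo, hi + 1), key=lambda i: A[i])
        pairs.append(("1" if m > lo else "0") + ("1" if m < hi else "0"))
        rec(lo, m - 1)
        rec(m + 1, hi)

    rec(0, len(A) - 1)
    return " ".join(pairs)


def size(tree):
    return tree[2] if tree else 0


def decode_shape(code):
    """Rebuild the tree as nested tuples (left, right, subtree_size)."""
    it = iter(code.split())

    def build():
        has_left, has_right = next(it)
        left = build() if has_left == "1" else None
        right = build() if has_right == "1" else None
        return (left, right, 1 + size(left) + size(right))

    return build()


def rmq_position(tree, l, r):
    """arg max over positions l..r (1-based), using only the decoded shape."""
    offset = 0                          # number of positions left of this subtree
    while True:
        left, right, _ = tree
        pos = offset + size(left) + 1   # inorder rank of the subtree root
        if pos < l:
            tree, offset = right, pos
        elif pos > r:
            tree = left
        else:
            return pos                  # topmost node inside [l, r] is the LCA


A = [47, 43, 97, 24, 46, 33, 34, 85, 67, 18, 83, 45]
code = encode_shape(A)
print(code)                                     # 11 01 00 11 11 00 10 00 11 01 00 00
print(rmq_position(decode_shape(code), 5, 10))  # 8
```

Any other array with the same Cartesian tree shape yields the same bit string, which is exactly why the values can be discarded.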
Encoding Data Structures
[Figure: diagram with labels INPUT, PREPROC, Encoding, QUERY, RESULT.]
• Preprocess input data to answer a long series of queries.
• Preprocessing creates an encoding and deletes input.
Encodings = (Data Structures) − (Data)
• Queries only read encoding.
• Minimize: encoding size and query time.
• Non-trivial encodings must be smaller than original input data.
Encoding: Effective Entropy
Encoding ≡ determining effective entropy.
• Extensive literature on succinct and compressed data structures.
• Entropy: “information content of data.”
• Effective entropy is “the information content of the data structure” [Golin et al. TCS]:
  • Given a set of objects S, a set of queries Q.
  • Let C be the set of equivalence classes on S induced by Q (x, y ∈ S are equivalent if they cannot be distinguished by queries in Q).
A = 1 3 2, B = 2 3 1
Arrays A and B cannot be distinguished by RMQ queries.
  • We want to store x in $\lceil \lg |C| \rceil$ bits.
  • Can define expected effective entropy as well.
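For small n the equivalence classes can simply be enumerated. The brute-force sketch below (hypothetical helper names, arrays restricted to permutations so that values are distinct) records the answer to every RMQ(l, r) query as a signature and counts distinct signatures. For RMQ the counts are the Catalan numbers, i.e. the number of binary tree shapes on n nodes.

```python
import math
from itertools import permutations


def rmq_signature(A):
    """Answers to all arg-max queries RMQ(l, r); equal signatures = same class."""
    n = len(A)
    return tuple(
        max(range(l, r + 1), key=lambda i: A[i])
        for l in range(n)
        for r in range(l, n)
    )


def count_classes(n):
    """|C| for RMQ over all permutations of n distinct values."""
    return len({rmq_signature(p) for p in permutations(range(n))})


def bits_needed(n):
    """ceil(lg |C|): the effective entropy in bits."""
    return math.ceil(math.log2(count_classes(n)))


print([count_classes(n) for n in range(1, 6)])    # [1, 2, 5, 14, 42]
print(rmq_signature((1, 3, 2)) == rmq_signature((2, 3, 1)))  # True
```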
Overview of Talk
• Overview of recent encoding results.
• Asymptotically optimal encodings
  • Range Top-k [Grossi et al. ESA’13, Gawrychowski/Nicholson ICALP’15]
  • 2D Range Maximum [Brodal et al. Algor.’12] [Brodal et al. ESA’13]
  • Range Majority [Navarro/Thankachan CPM’14]
  • Range Selection [Navarro et al. FSTTCS’14, GN ICALP’15]
  • Range Maximum Sum Query [Nicholson/Gawrychowski, CPM ’15]
  • 2D NLVs [Jo et al. WALCOM’15]
  • Nondirectional NLV [Nicholson/Raman, CPM ’15]
  • NLV + Range Max/Min [Jo/Satti, COCOON ’15]
• Minimal encodings
  • RMQs [Fischer/Heun, SICOMP’11] [Davoodi et al. PTRS-A ’14]
  • Range Second Maximum [Davoodi et al. PTRS-A ’14]
  • Bidirectional NLVs [Fischer, TCS’11]
  • Range Min-Max [Gawrychowski/Nicholson, ICALP ’15]
  • 2D Range Maximum, m = 2 [Golin et al. TCS]
Encoding Nearest Larger Values (NLV)
Problem Definition
Given array A[1..n] of distinct values, encode A to answer
NLV(i): return j s.t. A[j] > A[i] and |j − i| is minimized.
9 11 2 0 1 8 5 6 4 10 7 3
NLV(6) = 3
• Can obtain NLVs in both directions from Cartesian tree:
• Unfortunately, NLVs in both directions ≡ RMQ.
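As a reference for the definition, here is a small sketch that computes the nearest larger value on each side with one stack pass and then picks the closer of the two. The slide does not say how to break ties when both sides are equally close; preferring the left side is an assumption made here, as are the function names and the 6-element test array. Values are assumed distinct.

```python
def nearest_larger_sides(A):
    """For each 0-based index: nearest larger index to the left and to the right."""
    n = len(A)
    left, right = [None] * n, [None] * n
    stack = []                          # indices with decreasing values
    for i in range(n):
        while stack and A[stack[-1]] < A[i]:
            right[stack.pop()] = i      # A[i] is the first larger value to the right
        left[i] = stack[-1] if stack else None
        stack.append(i)
    return left, right


def nlv(A, i):
    """NLV(i) for 1-based i; None if A[i] is the maximum. Ties go left (assumed)."""
    left, right = nearest_larger_sides(A)
    l, r = left[i - 1], right[i - 1]
    if l is None and r is None:
        return None
    if r is None or (l is not None and (i - 1) - l <= r - (i - 1)):
        return l + 1
    return r + 1


A = [3, 7, 2, 5, 1, 6]                  # hypothetical example array
print([nlv(A, i) for i in range(1, 7)])  # [2, None, 2, 2, 4, 2]
```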
Unidirectional NLVs
NLV(i): return j s.t. A[j] > A[i] and |j − i| is minimized.
• Can we modify the Cartesian tree?
• Eliminate zig-zags!
• How many binary trees with no zig-zags of degree-1 nodes?
Counting Zig-Zag Free Binary Trees [Iacono]
• Change the encoding of degree-1 nodes:
[Figure: example of re-encoded degree-1 nodes, with bit labels 01 and 10.]
• Any encoding is a string over A = 01, B = 10, C = 00, D = 11.
  • AA does not appear in the string.
• Number of strings of length n, S(n), satisfies:
  $S(n) = \underbrace{3S(n-1)}_{B,\,C,\,D} + \underbrace{3S(n-2)}_{AB,\,AC,\,AD}$
• Gives $\lg S(n) \sim n \lg((3 + \sqrt{21})/2) \sim 1.93n < 2n$ bits.
• Adding forbidden patterns $AB^{*}A$ gets ∼ 1.8999n bits.
• Easy to support operations.
• Same result obtained using a succinct Patricia trie, and much optimization [Nicholson/Raman, CPM’15].
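The recurrence and its growth rate are easy to check numerically. The sketch below assumes the base cases S(0) = 1 and S(1) = 4 (the slide does not state them) and compares the recurrence against brute-force enumeration. The dominant root of x² = 3x + 3 is (3 + √21)/2, whose base-2 logarithm evaluates to roughly 1.92, in line with the rounded ∼1.93n on the slide.

```python
import math
from itertools import product


def count_no_AA(n):
    """S(n): strings of length n over {A, B, C, D} that do not contain 'AA'."""
    if n == 0:
        return 1
    prev2, prev1 = 1, 4                 # assumed base cases S(0), S(1)
    for _ in range(n - 1):
        prev2, prev1 = prev1, 3 * prev1 + 3 * prev2
    return prev1


def brute_force(n):
    """Direct enumeration, only feasible for small n."""
    return sum("AA" not in "".join(s) for s in product("ABCD", repeat=n))


growth = (3 + math.sqrt(21)) / 2        # dominant root of x^2 = 3x + 3
bits_per_node = math.log2(growth)       # bits per tree node in the limit

print([count_no_AA(n) for n in range(5)])   # [1, 4, 15, 57, 216]
print(round(bits_per_node, 4))
```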
What’s the exact bound?
• Upper bound ∼ 1.89n.
• Lower bound by exhaustive enumeration ∼ 1.31n.
• Number of distinguishable configurations (equivalence classes):
  n                 1  2  3   4   5    6    7     8     9    10
  # configurations  1  2  5  14  40  116  341  1010  3009  9012
  This sequence is not in oeis.org.
• Counting up to n = 40 suggests rate of growth $n^{O(1)} 3^n$, giving ∼ $n \lg 3 \approx 1.58n$ bits. [Hoffmann, personal communication.]
Encoding Range Selection
Problem Definition
Given A[1..n] and κ, encode A to answer the query:
select(k, l, r): return the position of the k-th largest value in A[l..r], for any k ≤ κ.
• Non-encoding results by many authors including [Brodal and Jørgensen, ISAAC’09], [Jørgensen/Larsen, SODA’11], [Chan/Wilkinson, SODA’13].
• O(n log n) bits, O(lg k / lg lg n) time [CW SODA’13], optimal time for $n (\lg n)^{O(1)}$ bits of space [JL SODA’11].
Lower Bound on Encoding Size
Proposition
Any encoding for range selection must take Ω(n lg κ) bits.
Proof: The index can encode n/κ independent permutations over κ elements ⇒ Ω((n/κ) · κ lg κ) bits = Ω(n lg κ) bits.
For example (κ = 3):
A = 3 1 2 | 2 3 1 | 1 2 3 | ···
Can trivially recover A from its encoding:
select(2, 4, 6) = 4 ⇒ A[4] = 2.
Note: κ must be known at construction time.
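The recovery step in the proof can be played out directly. In this sketch `select` is a naive reference implementation standing in for any encoding that answers the query, and `recover_blocks` (a hypothetical name) rebuilds the array using queries only. It assumes n is a multiple of κ and that each block is a permutation of 1..κ, as in the example.

```python
def select(A, k, l, r):
    """Position (1-based) of the k-th largest value in A[l..r]; naive reference."""
    positions = sorted(range(l, r + 1), key=lambda i: A[i - 1], reverse=True)
    return positions[k - 1]


def recover_blocks(query, n, kappa):
    """Rebuild A from select queries alone, one block of kappa positions at a time."""
    A = [0] * n
    for start in range(1, n + 1, kappa):
        end = start + kappa - 1
        for k in range(1, kappa + 1):
            pos = query(k, start, end)
            A[pos - 1] = kappa - k + 1  # k-th largest of a permutation of 1..kappa
    return A


A = [3, 1, 2, 2, 3, 1, 1, 2, 3]
print(select(A, 2, 4, 6))                                         # 4
print(recover_blocks(lambda k, l, r: select(A, k, l, r), 9, 3))   # same as A
```

Since every choice of n/κ block permutations can be recovered, any encoding must distinguish all of them, which gives the Ω(n lg κ) bound.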
Encoding Range Selection [GN ’15]
Consider the 1-sided case: all queries of the form select(k, l, n). Example assumes κ = 3.
A:              0 9 3 4 2 5 6 8 1
counts:         8 0 4 3 3 2 1 0 0
capped counts:  3 0 3 3 3 2 1 0 0
• For each i, count # values to right that are greater.
• Cap all values to κ.
• Claim: we know the sorted order among all positions with counts < κ.
• Positions with count = κ are never the answer to a select(k, l, n) query.
• We can answer select(k, l, n) queries using these counts, which occupy n lg(κ + 1) bits.
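One way to see why the capped counts suffice is to rebuild the sorted order while scanning right to left. A position with count c < κ has exactly c larger values to its right, and all of those also have counts below κ, so it belongs at index c of the order built so far. The sketch below follows that reasoning; it is not the query algorithm of [GN ’15], the function names are hypothetical, and values are assumed distinct.

```python
def capped_counts(A, kappa):
    """For each i: number of larger values to the right of i, capped at kappa."""
    n = len(A)
    return [
        min(kappa, sum(A[j] > A[i] for j in range(i + 1, n)))
        for i in range(n)
    ]


def select_suffix(counts, kappa, k, l):
    """Position (1-based) of the k-th largest value in A[l..n], from counts only.

    Requires k <= kappa and k <= n - l + 1.
    """
    order = []                          # uncapped positions, decreasing by value
    for i in range(len(counts) - 1, l - 2, -1):
        c = counts[i]
        if c < kappa:
            order.insert(c, i + 1)      # exactly c larger values lie to the right
    return order[k - 1]


A = [0, 9, 3, 4, 2, 5, 6, 8, 1]
counts = capped_counts(A, 3)
print(counts)                           # [3, 0, 3, 3, 3, 2, 1, 0, 0]
print(select_suffix(counts, 3, 2, 3))   # 7 (the value 6 is second largest in A[3..9])
```

The quadratic counting loop and the list insertions are for clarity only; they say nothing about the query time of the real structure.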
Encoding Range Selection [GN ’15]
Extend to the general 2-sided case.
• Let $S_r$ be the 1-sided encoding for A[1..r], for r = 1, ..., n.
  • $S_r$ answers all queries of the form select(k, l, r).
• $\delta_{r+1}$ = # counts in the encoding of $S_r$ that are incremented to get $S_{r+1}$.
• Example, κ = 3, $\delta_{10} = 3$:
  A:       0 9 3 4 2 5 6 8 1 7
  S_10:    3 0 3 3 3 3 2 0 1 0
• Knowing $\delta_{r+1}$ suffices to get $S_{r+1}$ from $S_r$.
• $Z = 0^{\delta_1} 1\, 0^{\delta_2} 1 \ldots 0^{\delta_n} 1$ is an encoding of all $S_1, \ldots, S_n$.
• Z has at most κn 0s and n 1s: there are ≤ $\binom{(\kappa+1)n}{n}$ distinct Z’s.
• Encoding of size $\lg \binom{(\kappa+1)n}{n} \sim n \lg(\kappa + 1) + n \lg e$ bits. This is essentially optimal!
• Query time: $O(\kappa^6 (\log n)^{2+\varepsilon})$ vs. $O(\log k / \log\log n)$.
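The following sketch builds Z and decodes it again, to illustrate why δ alone is enough: the counts incremented when a new value arrives belong to the uncapped positions with smaller values, which are exactly the last δ entries of the known sorted order. This is a naive illustration with hypothetical function names, not the query procedure of [GN ’15]; it decodes $S_r$ from scratch on every query and assumes distinct values.

```python
def uncapped_order(counts, kappa):
    """0-based positions with count < kappa, sorted by decreasing value."""
    order = []
    for i in range(len(counts) - 1, -1, -1):
        if counts[i] < kappa:
            order.insert(counts[i], i)
    return order


def encode_Z(A, kappa):
    """Z = 0^{delta_1} 1 0^{delta_2} 1 ... 0^{delta_n} 1."""
    counts, pieces = [], []
    for r, x in enumerate(A):
        delta = 0
        for i in range(r):
            if counts[i] < kappa and A[i] < x:
                counts[i] += 1
                delta += 1
        counts.append(0)
        pieces.append("0" * delta + "1")
    return "".join(pieces)


def decode_S(Z, kappa, r):
    """Recover the capped counts S_r from Z alone."""
    deltas = [len(run) for run in Z.split("1")[:r]]
    counts = []
    for delta in deltas:
        order = uncapped_order(counts, kappa)
        for i in order[len(order) - delta:]:   # the delta smallest uncapped values
            counts[i] += 1
        counts.append(0)
    return counts


def select(Z, kappa, k, l, r):
    """Position (1-based) of the k-th largest value in A[l..r], using only Z."""
    counts = decode_S(Z, kappa, r)
    return uncapped_order(counts[l - 1:], kappa)[k - 1] + l


A = [0, 9, 3, 4, 2, 5, 6, 8, 1, 7]
Z = encode_Z(A, 3)
print(decode_S(Z, 3, 10))       # [3, 0, 3, 3, 3, 3, 2, 0, 1, 0]
print(select(Z, 3, 1, 1, 10))   # 2
```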
Encoding Range Selection: Fast DS
• View A geometrically in 2D: A[i] = y ⇒ (i, y).
• Use idea of shallow cutting for top-k [JL SODA’11].
• Take set of n given points and decompose into O(n/κ) slabs each containing O(κ) points such that:
  • For any 2-sided query select(l, r) ∃ slab such that it and two other adjacent slabs contain the top κ elements in A[l..r].
Encoding Range Selection: Fast DS
Create κ-shallow cutting. For the O(κ) points in each slab, store a range selection DS: O(κ lg κ) bits per slab, or O(n lg κ) bits overall (asymptotically optimal).
1. Find resolving slab for given query [Grossi et al. ESA’13].
2. Use slab’s range selection data structure to answer query.
  • Slab’s points are numbered 1..O(κ), input query and answer are in 1..n.
  • Storing global coordinates of points in a slab takes O(κ lg n) bits per slab or O(n lg n) bits overall.
3. Develop a representation of slabs which can space-efficiently:
  3.1 in O(lg κ / lg lg n) time, perform predecessor search for l and r among x coordinates in a slab.
    • Map query range to range among slab’s points.
  3.2 in O(1) time, retrieve the i-th largest x-coordinate in the slab.
    • Convert answer back to “global” coordinates.
Theorem [Navarro et al. FSTTCS’14]
There is an encoding that uses O(n lg κ) bits of space and supports range selection in O(lg k / lg lg n) time.
2D NLV
Problem Statement
Given an n × n matrix A, preprocess to answer:
NLV(p): if p = (i, j), return q = (i′, j′) s.t. A[q] > A[p] and $|p - q|_1 = |i - i'| + |j - j'|$ is minimized.
[Figure: example 6 × 6 matrix with rows and columns indexed 0 to 5. Pictures © Pat Nicholson]
If elements of A are distinct, explicitly store pointers (a pointer of length i takes O(lg i) bits), overall $O(n^2)$ bits [Jayapaul et al. IWOCA’14].
Can’t point directly to the answer when elements of A are non-distinct: this requires $\Omega(n^2 \lg n)$ bits, which is uninteresting.
Note: Jayapaul et al. gave an $O(n^2 \lg \lg n)$-bit encoding.
Encoding 2D NLV
Theorem [Jo et al. WALCOM’15]
There is an encoding of NLVs of a 2D matrix A that uses $O(n^2)$ bits and answers queries in O(1) time, even when elements of A are not distinct.
• Encoding idea is simple:
  • Suppose wlog that NLV(p) = q is to the right and above p. If there is a position p′ to the right of p in p’s row but not to the right of q, then p points to p′. Else, look for p′′ above p in column. If neither p′ nor p′′ exist then point to q.
• 1D NLV problem closely related to RMQ problem.
• Encoding 2D-RMQ requires $\Omega(n^2 \lg n)$ bits [Demaine et al. ICALP’09].
Minimal Encodings
1. Pre-process given data to obtain encoding E , discard input.
2. E should precisely characterize the query: the number of distinct E’s should equal the number of distinguishable data instances using the query (|C|).
3. Create succinct DS on E, using lg |C| (1 + o(1)) bits. Second pre-processing should not access input.
[Figure: pipeline diagram with labels INPUT, PREPROC, Encoding, PREPROC, DS, QUERY, RESULT.]
Advantages
• Optimal space.
• Only information in DS is what can be obtained from queries.
  • “Minimal-knowledge” data structures: contain only information strictly necessary to answer queries.
Minimal Encodings for RMQ
Problem Definition
Given A[1..n], preprocess to answer:
RMQ(l, r): return $\arg\max_{l \le i \le r} A[i]$.
[Figure: Cartesian tree with nodes labelled by array positions 1 to 12, rooted at position 3.]
• Shape of Cartesian tree precisely describes all possible RMQs [Fischer/Heun, SICOMP’11].
• Pre-process A, output Cartesian tree, delete A.
Minimal Encodings for R2MQ
Problem Definition
Given A[1..n], encode A to answer:
R2MQ(l, r): return $\arg\max_{i \in \{l, \ldots, r\} \setminus \{\mathrm{RMQ}(l, r)\}} A[i]$.
[Figure: extended Cartesian tree with bracketed node labels [10], [6], [3], [1].]
• Need to merge inner spines of Cartesian tree.
• Precisely described by “extended Cartesian tree”.
• Space needed is asymptotically ∼ 2.76n bits [Gawrychowski and Nicholson, ICALP’15].
Minimal Encodings for the Bidirectional NLV Problem
Problem Definition
Given A[1..n], encode A to answer:
BNLV(i): return j > i such that A[j] > A[i] and j − i is minimized, and j′ < i such that A[j′] > A[i] and i − j′ is minimized.
A = 3 7 2 4 4 8 5 4 3 4 4 3
• When A has distinct values, this is just Cartesian trees.
• When A has equal values, described by a subclass of Schröder trees [Fischer, TCS’11].
• Number of n-node Schröder trees is ≤ $(3 + 2\sqrt{2})^n \approx 2^{2.54n}$.
• Encoding using about 2.54n bits.
Minimal Encodings for Range Min-Max Queries
Problem Definition
Given A[1..n], encode A to answer:
Range-Min-Max(l, r): return both $\arg\max_{i \in \{l, \ldots, r\}} A[i]$ and $\arg\min_{i \in \{l, \ldots, r\}} A[i]$.
Minimal encoding by [Gawrychowski and Nicholson, ICALP’15]:
• Precisely characterized by Baxter permutations.
  • There do not exist 1 ≤ l < i < r ≤ n such that:
    $\pi(i+1) < \pi(l) < \pi(r) < \pi(i)$ (pattern 2-41-3)
    or
    $\pi(i) < \pi(r) < \pi(l) < \pi(i+1)$ (pattern 3-14-2).
  • If A is a Baxter permutation, it can be recovered using Range-Min-Max queries.
• Number of Baxter permutations on [n] = $2^{3n}/n^{O(1)}$, gives 3n − O(lg n) encoding size.
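The two forbidden patterns translate directly into a brute-force membership test. The sketch below (hypothetical function names, 0-based indices internally) checks the definition as stated on the slide and counts Baxter permutations for small n. It is meant only to make the definition concrete; it does not construct the encoding.

```python
from itertools import permutations


def is_baxter(p):
    """True if p avoids the patterns 2-41-3 and 3-14-2."""
    n = len(p)
    for i in range(1, n - 2):           # p[i], p[i + 1] are the adjacent middle pair
        a, b = p[i], p[i + 1]
        for l in range(i):
            for r in range(i + 2, n):
                if b < p[l] < p[r] < a:     # pattern 2-41-3
                    return False
                if a < p[r] < p[l] < b:     # pattern 3-14-2
                    return False
    return True


def count_baxter(n):
    """Number of Baxter permutations of 1..n, by exhaustive enumeration."""
    return sum(is_baxter(p) for p in permutations(range(1, n + 1)))


print([count_baxter(n) for n in range(1, 7)])   # [1, 2, 6, 22, 92, 422]
print(is_baxter((2, 4, 1, 3)), is_baxter((3, 4, 1, 2)))  # False True
```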
Conclusions and Open Problems
Conclusions:
• Introduced the notion of encoding DS.
• Minimal encodings are combinatorially interesting and have good privacy properties.
Wide range of open problems:
• Challenging data structuring open problems:
  • Asymptotically optimal 2D RMQ encoding of [Brodal et al. ESA’13] does not support efficient 2D RMQ queries.
  • Optimal top-k encoding of [Gawrychowski and Nicholson ICALP’15] does not support efficient queries.
• Determining minimal encodings for a number of problems.
• Pre-processing time: ideally want O(n) time preprocessing.
• Apply encoding DS to reducing the space usage of “normal” DS. [cf. Chan and Wilkinson, SODA’13]