encoding survey

Encoding = (Data Structures) - (Data)

Rajeev Raman

University of Leicester

SPIRE 2015, King’s College London

Introduction · Encoding Data Structures · Asymptotically Optimal Encodings · Minimal Encodings · Conclusions

RMQ problem

Problem Statement

Given a static array A[1..n], pre-process A to answer queries:

RMQ(l, r): return max_{l ≤ i ≤ r} A[i].

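[Figure: example array A holding the 12 values 43, 97, 46, 85, 67, 18, 45, 24, 83, 47, 33, 34, listed in extraction order rather than array order.]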

RMQ(5, 10) = 85.


Data Structuring Problems

This is a data structuring problem.

• Pre-process input data (here array A) to answer a long series of queries.

• Want to minimize:
  1. Space usage of data structure.
  2. Query time.
  3. Time/space for pre-processing.

• In this talk we assume the input data is static, i.e. it does not change between queries.


Solution to RMQ Problem: Cartesian Tree

The Cartesian tree of A [Vuillemin CACM’80] is a binary tree.

[Figure: the example array A and its Cartesian tree, rooted at the maximum value 97.]

• Place largest value at root of tree.

• Recurse on sub-arrays to left and right.

• RMQ is the lowest common ancestor (LCA) of interval endpoints.

• n-node binary tree can support LCA in O(n) space and O(1) time. [Harel/Tarjan SICOMP’84]


Compressing RMQ

• O(n) space = O(n) words = Ω(n lg n) bits¹.

• Many applications where using O(n) words is way too much.
  • Suffix tree on a string of n bits occupies O(n) words.

• The same is true for many applications of RMQ.
  • Can reconstruct A by asking RMQ(i, i) queries.
  • In general A can’t be compressed below Ω(n lg n) bits.
  • In specific applications (e.g. LCP array), A can be compressed, but then accessing A[i] is slow.

Can we do better?

¹ lg = log_2.


The RMQ Problem Redefined

Given a static array A[1..n], pre-process A to answer queries:

RMQ(l, r) = arg max_{l ≤ i ≤ r} A[i].

[Figure: the example array A from the RMQ problem slide.]

RMQ(5, 10) = 8.

Often the value of A[i] is not needed.


Encoding RMQ

RMQ(l, r) = arg max_{l ≤ i ≤ r} A[i].

[Figure: shape of the Cartesian tree of A, with its 12 nodes labelled 1 to 12.]

• Shape of Cartesian tree is enough to answer modified RMQ queries.
  • A is not necessary!

• There are ≤ 4^n distinct binary trees on n nodes.
  • Shape can be encoded in ≤ lg 4^n = 2n bits.
  • Concrete encoding: 11 01 00 11 11 00 10 00 11 01 00 00.

• Data structures using 2n + o(n) bits, O(1) query time. [Fischer/Heun SICOMP’11], [Davoodi et al. COCOON’12].
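The idea that the tree shape alone answers arg-max queries can be sketched in a few lines. This is only an illustration, not the construction of the cited papers: it builds the max-Cartesian tree with a stack, writes two bits per node in preorder (has-left, has-right), and then answers queries from the bits alone. The preorder bit convention, the function names and the array `A` (a hypothetical arrangement of the slide's twelve values) are assumptions. Queries here walk down from the root, so they take time proportional to the tree depth rather than O(1).

```python
def encode_shape(a):
    """Two bits per node (has-left, has-right), nodes listed in preorder."""
    n = len(a)
    left, right, stack = [-1] * n, [-1] * n, []
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] < a[i]:
            last = stack.pop()              # smaller values end up below a[i]
        left[i] = last
        if stack:
            right[stack[-1]] = i
        stack.append(i)
    codes, todo = [], [stack[0]]            # stack[0] is the position of the maximum
    while todo:
        v = todo.pop()
        codes.append(f"{int(left[v] >= 0)}{int(right[v] >= 0)}")
        todo += [c for c in (right[v], left[v]) if c >= 0]   # left child is popped first
    return " ".join(codes)


def decode_shape(bits):
    """Rebuild child pointers; a node is named by its inorder rank (0-based array position)."""
    codes = iter(bits.split())
    left, right = {}, {}

    def build(lo):                          # lo = smallest array position in this subtree
        has_left, has_right = next(codes)
        l_root, l_size = build(lo) if has_left == "1" else (None, 0)
        v = lo + l_size
        r_root, r_size = build(v + 1) if has_right == "1" else (None, 0)
        left[v], right[v] = l_root, r_root
        return v, l_size + 1 + r_size

    root, _ = build(0)
    return root, left, right


def rmq(shape, l, r):
    """1-based arg-max query: first node on the root path whose position lies in [l, r]."""
    root, left, right = shape
    v, l, r = root, l - 1, r - 1
    while not l <= v <= r:
        v = left[v] if v > r else right[v]
    return v + 1


# Hypothetical arrangement of the slide's values (an assumption, see above).
A = [46, 43, 97, 24, 67, 18, 45, 85, 47, 33, 83, 34]
bits = encode_shape(A)
shape = decode_shape(bits)      # from here on, A is no longer consulted
print(bits)
print(rmq(shape, 5, 10))
```

Under these assumptions the printed bit string coincides with the concrete encoding quoted on the slide, which suggests (but does not prove) that the slide uses the same preorder convention.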


Encoding Data Structures

[Figure: diagram with boxes INPUT, PREPROC, Encoding, QUERY and RESULT.]

• Preprocess input data to answer a long series of queries.

• Preprocessing creates an encoding and deletes input.

Encodings = (Data Structures) − (Data)

• Queries only read encoding.

• Minimize: encoding size and query time.

• Non-trivial encodings must be smaller than original input data.


Encoding: Effective Entropy

Encoding ≡ determining effective entropy.

• Extensive literature on succinct and compressed data structures.

• Entropy: “information content of data.”

• Effective Entropy is “the information content of the data structure” [Golin et al. TCS]:
  • Given a set of objects S, a set of queries Q.
  • Let C be the set of equivalence classes on S induced by Q (x, y ∈ S are equivalent if they cannot be distinguished by queries in Q).

A = 1 3 2    B = 2 3 1

Arrays A and B cannot be distinguished by RMQ queries.

  • We want to store x in ⌈lg |C|⌉ bits.
  • Can define expected effective entropy as well.
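For small n the equivalence classes can be counted by brute force. The sketch below is an illustration under two assumptions: S is restricted to permutations (distinct values, so arg max is unambiguous) and Q is the set of all arg-max RMQ queries. It groups arrays by the answers they give to every query and compares the number of classes with the Catalan numbers, which count binary tree shapes. The function names are hypothetical.

```python
from itertools import permutations
from math import ceil, comb, log2


def rmq_signature(a):
    """Answers to every query RMQ(l, r), l <= r, as one tuple (0-based positions)."""
    n = len(a)
    return tuple(max(range(l, r + 1), key=a.__getitem__)
                 for l in range(n) for r in range(l, n))


def count_classes(n):
    """Number of permutations of size n that RMQ queries can tell apart."""
    return len({rmq_signature(p) for p in permutations(range(n))})


for n in range(1, 7):
    classes = count_classes(n)
    catalan = comb(2 * n, n) // (n + 1)
    # columns: n, |C|, Catalan number, effective entropy ceil(lg |C|)
    print(n, classes, catalan, ceil(log2(classes)))
```

The arrays A = 1 3 2 and B = 2 3 1 from the slide land in the same class because `rmq_signature` returns the same tuple for both.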


Overview of Talk

• Overview of recent encoding results.

• Asymptotically optimal encodings
  • Range Top-k [Grossi et al. ESA’13, Gawrychowski/Nicholson ICALP’15]
  • 2D Range Maximum [Brodal et al. Algor.’12] [Brodal et al. ESA’13]
  • Range Majority [Navarro/Thankachan CPM’14]
  • Range Selection [Navarro et al. FSTTCS’14, GN ICALP’15]
  • Range Maximum Sum Query [Nicholson/Gawrychowski, CPM ’15]
  • 2D NLVs [Jo et al. WALCOM’15]
  • Nondirectional NLV [Nicholson/Raman, CPM ’15]
  • NLV + Range Max/Min [Jo/Satti, COCOON ’15]

• Minimal encodings
  • RMQs [Fischer/Heun, SICOMP’11] [Davoodi et al. PTRS-A ’14]
  • Range Second Maximum [Davoodi et al. PTRS-A ’14]
  • Bidirectional NLVs [Fischer, TCS’11]
  • Range Min-Max [Gawrychowski/Nicholson, ICALP ’15]
  • 2D Range Maximum, m = 2 [Golin et al. TCS]


Encoding Nearest Larger Values (NLV)

Problem Definition

Given array A[1..n] of distinct values, encode A to answer

NLV(i): return j s.t. A[j] > A[i] and |j − i| is minimized.

9 11 2 0 1 8 5 6 4 10 7 3

NLV(6) = 3

• Can obtain NLVs in both directions from Cartesian tree:

• Unfortunately, NLVs in both directions ≡ RMQ.


Unidirectional NLVs

NLV(i): return j s.t. A[j ] > A[i ] and |j − i | is minimized.

• Can we modify the Cartesian tree?

• Eliminate zig-zags!

• How many binary trees with no zig-zags of degree-1 nodes?


Counting Zig-Zag Free Binary Trees [Iacono]

• Change the encoding of degree-1 nodes:

[Figure: a chain of degree-1 nodes and its re-encoding; bit labels 011010 and 01.]

• Any encoding is a string over A = 01, B = 10, C = 00, D = 11.
  • AA does not appear in the string.

• Number of strings of length n, S(n), satisfies:

S(n) = \underbrace{3 S(n-1)}_{B, C, D} + \underbrace{3 S(n-2)}_{AB, AC, AD}

  • Gives log S(n) ∼ n · log((3 + √21)/2) ∼ 1.93n < 2n bits.
  • Adding forbidden patterns AB*A gets ∼ 1.8999n bits.
  • Easy to support operations.
  • Same result obtained using a succinct Patricia trie, and much optimization [Nicholson/Raman, CPM’15].
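The recurrence can be checked numerically. The sketch below is an illustration with hypothetical function names: it counts strings over {A, B, C, D} that avoid AA both by the recurrence and by exhaustive enumeration, taking S(0) = 1 and S(1) = 4 as base cases, and evaluates the constant lg((3 + √21)/2) that governs the growth of S(n).

```python
from itertools import product
from math import log2, sqrt


def count_by_recurrence(n):
    """S(n) = 3 S(n-1) + 3 S(n-2), with S(0) = 1 and S(1) = 4."""
    if n == 0:
        return 1
    s_prev, s = 1, 4
    for _ in range(n - 1):
        s_prev, s = s, 3 * s + 3 * s_prev
    return s


def count_by_brute_force(n):
    """Count length-n strings over A, B, C, D with no occurrence of AA."""
    return sum("AA" not in "".join(w) for w in product("ABCD", repeat=n))


rate = log2((3 + sqrt(21)) / 2)          # bits per symbol in the limit
for n in range(1, 8):
    print(n, count_by_recurrence(n), count_by_brute_force(n))
print(rate, log2(count_by_recurrence(200)) / 200)
```

The last line compares the limiting constant with the empirical value lg S(200)/200; both stay below the 2 bits per node of the plain tree encoding.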


What’s the exact bound?

• Upper bound ∼ 1.89n.

• Lower bound by exhaustive enumeration ∼ 1.31n.

• Number of distinguishable configurations (equivalence classes):

n:                1  2  3  4   5   6    7    8     9     10
# configurations: 1  2  5  14  40  116  341  1010  3009  9012

This sequence is not in oeis.org.

• Counting up to n = 40 suggests rate of growth n^{O(1)} 3^n, giving ∼ n log 3 ≈ 1.58n bits. [Hoffmann, personal communication.]
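A short calculation relates the table to the quoted bounds. Reading the ∼ 1.31n lower bound as lg(# configurations)/n at n = 10 is an assumption, although the arithmetic is consistent with it; the successive ratios in the table also approach 3 from below, in line with the conjectured 3^n growth.

```python
from math import log2

# Counts copied from the table above, for n = 1..10.
counts = [1, 2, 5, 14, 40, 116, 341, 1010, 3009, 9012]

for n, c in enumerate(counts, start=1):
    print(n, c, round(log2(c) / n, 3))       # bits per element needed at this n

ratios = [b / a for a, b in zip(counts, counts[1:])]
print([round(x, 3) for x in ratios])          # growth factor from n to n + 1
print(round(log2(3), 3))                      # bits per element if growth is 3^n
```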


Encoding Range Selection

Problem Definition

Given A[1..n] and κ, encode A to answer the query:

select(k, l, r): return the position of the k-th largest value in A[l..r], for any k ≤ κ.

• Non-encoding results by many authors including [Brodal and Jørgensen, ISAAC’09], [Jørgensen/Larsen, SODA’11], [Chan/Wilkinson, SODA’13].

• O(n log n) bits, O(lg k / lg lg n) time [CW SODA’13], optimal time for n (lg n)^{O(1)} bits of space [JL SODA’11].


Lower Bound on Encoding Size

Proposition

Any encoding for range selection must take Ω(n lg κ) bits.

Proof: The index can encode n/κ independent permutations over κ elements ⇒ Ω((n/κ) · κ lg κ) bits = Ω(n lg κ) bits.

For example (κ = 3).

A = 3 1 2 2 3 1 1 2 3 · · ·

Can trivially recover A from its encoding.

select(2, 4, 6) = 4 ⇒ A[4] = 2.

Note: κ must be known at construction time.
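The recovery argument in the proof can be made concrete. The sketch below is an illustration with hypothetical names: `select` is a naive reference implementation standing in for any encoding that answers the query, and `recover` rebuilds an array consisting of n/κ blocks, each a permutation of 1..κ, using select queries only. It assumes n is a multiple of κ.

```python
def select(a, k, l, r):
    """1-based position of the k-th largest value in a[l..r] (naive reference)."""
    order = sorted(range(l, r + 1), key=lambda i: a[i - 1], reverse=True)
    return order[k - 1]


def recover(select_fn, n, kappa):
    """Rebuild the block-permutation array from select queries alone."""
    a = [0] * n
    for start in range(1, n + 1, kappa):
        for k in range(1, kappa + 1):
            pos = select_fn(k, start, start + kappa - 1)
            a[pos - 1] = kappa + 1 - k       # the k-th largest of 1..kappa
    return a


A = [3, 1, 2, 2, 3, 1, 1, 2, 3]              # the slide's example, kappa = 3


def oracle(k, l, r):
    return select(A, k, l, r)


print(select(A, 2, 4, 6))
print(recover(oracle, len(A), 3))
```

Since every one of the (κ!)^{n/κ} such arrays is recovered exactly, any encoding needs at least lg (κ!)^{n/κ} = Ω(n lg κ) bits.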


Encoding Range Selection [GN ’15]

Consider the 1-sided case: all queries of the form select(k, l, n). Example assumes κ = 3.

0 9 3 4 2 5 6 8 1

• For each i , count # values to right that are greater.

• Cap all values to κ.

• Claim: we know the sorted order among all positions with counts < κ.

• Positions with count = κ are never the answer to a select(k, l, n) query.

• We can answer select(k, l, n) queries using these counts, which occupy n log(κ + 1) bits.




A:                      0 9 3 4 2 5 6 8 1
# greater to the right: 8 0 4 3 3 2 1 0 0
capped at κ = 3:        3 0 3 3 3 2 1 0 0
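The 1-sided scheme can be sketched as follows. This is a simple illustration with hypothetical function names, not the data structure of [GN ’15]: it computes the capped counts naively and answers select(k, l, n) by scanning the counts from right to left, so a query takes linear time. It relies on the observation that a position with count c < κ has exactly c larger values to its right, all of which are among the current top-κ of the suffix, so the position belongs at index c of the list sorted by decreasing value.

```python
def capped_counts(a, kappa):
    """For each i: number of larger values to the right of i, capped at kappa."""
    n = len(a)
    return [min(kappa, sum(a[j] > a[i] for j in range(i + 1, n)))
            for i in range(n)]


def select_suffix(counts, kappa, k, l):
    """1-based position of the k-th largest value in A[l..n], using counts only."""
    top = []                                  # positions by decreasing value
    for i in range(len(counts), l - 1, -1):
        c = counts[i - 1]
        if c < kappa:                         # capped positions are never an answer
            top.insert(c, i)
            del top[kappa:]                   # keep only the top-kappa of the suffix
    return top[k - 1] if k <= len(top) else None


A = [0, 9, 3, 4, 2, 5, 6, 8, 1]
counts = capped_counts(A, 3)
print(counts)
print(select_suffix(counts, 3, 2, 3))
```

After `counts` is computed the array `A` is not consulted again, which is the sense in which the counts form an encoding.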






Extend to the general 2-sided case.

• Let S_r be the 1-sided encoding for A[1..r], for r = 1, ..., n.
  • S_r answers all queries of the form select(k, l, r).

• δ_{r+1} = # counts in the encoding of S_r that are incremented to get S_{r+1}.

• Example, κ = 3, δ_{10} = 3:

A:      0 9 3 4 2 5 6 8 1 7
S_{10}: 3 0 3 3 3 3 2 0 1 0

• Knowing δ_{r+1} suffices to get S_{r+1} from S_r.

• Z = 0^{δ_1} 1 0^{δ_2} 1 ... 0^{δ_n} 1 is an encoding of all S_1, ..., S_n.

• Z has at most κn 0s and n 1s: there are ≤ \binom{(κ+1)n}{n} distinct Z’s.

• Encoding of size lg \binom{(κ+1)n}{n} ∼ n lg(κ + 1) + n lg e bits. This is essentially optimal!

• Query time: O(κ^6 (log n)^{2+ε}) vs. O(log k / log log n).
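The bit string Z can be built and decoded in a few lines. The sketch below is an illustration with hypothetical function names and makes no attempt at the stated query time: `encode` computes each δ naively, and `decode` recovers S_r from Z alone, using the fact that the new value exceeds exactly the δ smallest values among the uncapped positions, whose sorted order is itself derivable from the counts.

```python
from math import comb, log2


def uncapped_order(counts, kappa):
    """Positions with count < kappa, by decreasing value (derived from counts only)."""
    order = []
    for i in range(len(counts) - 1, -1, -1):
        if counts[i] < kappa:
            order.insert(counts[i], i)     # exactly counts[i] larger values to the right
    return order


def encode(a, kappa):
    """Z = 0^{delta_1} 1 0^{delta_2} 1 ... 0^{delta_n} 1."""
    counts, z = [], []
    for r, x in enumerate(a):
        delta = 0
        for i in range(r):
            if counts[i] < kappa and a[i] < x:
                counts[i] += 1
                delta += 1
        counts.append(0)
        z.append("0" * delta + "1")
    return "".join(z)


def decode(z, kappa, r):
    """Recover S_r, the capped counts for A[1..r], from Z alone."""
    counts = []
    for run in z.split("1")[:r]:
        order = uncapped_order(counts, kappa)
        for i in order[len(order) - len(run):]:   # the len(run) smallest uncapped values
            counts[i] += 1
        counts.append(0)
    return counts


def bound_bits(n, kappa):
    """lg of the number of bit strings with n ones and kappa * n zeros."""
    return log2(comb((kappa + 1) * n, n))


A = [0, 9, 3, 4, 2, 5, 6, 8, 1, 7]
Z = encode(A, 3)
print(Z)
print(decode(Z, 3, 9))
print(decode(Z, 3, 10))
print(len(Z), bound_bits(len(A), 3))
```

Storing Z verbatim costs one bit per symbol; reaching the lg \binom{(κ+1)n}{n} bound requires compressing Z further, which this sketch does not do.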


Encoding Range Selection: Fast DS

• View A geometrically in 2D: A[i] = y ⇒ (i, y).

• Use idea of shallow cutting for top-k [JL SODA’11].

• Take set of n given points and decompose into O(n/κ) slabs each containing O(κ) points such that:
  • For any 2-sided query select(l, r) ∃ slab such that it and two other adjacent slabs contain the top κ elements in A[l..r].


Encoding Range Selection: Fast DS

Create κ-shallow cutting. For the O(κ) points in each slab, store a range selection DS: O(κ lg κ) bits per slab, or O(n lg κ) bits overall (asymptotically optimal).

1. Find resolving slab for given query [Grossi et al. ESA’13].

2. Use slab’s range selection data structure to answer query.
   • Slab’s points are numbered 1..O(κ), input query and answer are in 1..n.
   • Storing global coordinates of points in a slab takes O(κ lg n) bits per slab or O(n lg n) bits overall.

3. Develop a representation of slabs which can space-efficiently:
   3.1 in O(lg κ / lg lg n) time, perform predecessor search for l and r among x-coordinates in a slab.
       • Map query range to range among slab’s points.
   3.2 in O(1) time, retrieve the i-th largest x-coordinate in the slab.
       • Convert answer back to “global” coordinates.

Theorem [Navarro et al. FSTTCS’14]

There is an encoding that uses O(n lg κ) bits of space and supports range selection in O(lg k / lg lg n) time.


2D NLV

Problem Statement

Given an n × n matrix A, preprocess to answer:

NLV(p): if p = (i, j), return q = (i′, j′) s.t. A[q] > A[p] and |p − q|_1 = |i − i′| + |j − j′| is minimized.

[Figure: example 6 × 6 grid, rows and columns indexed 0 to 5.]

If elements of A are distinct, explicitly store pointers (a pointer of length i in O(lg i) bits), overall O(n^2) bits. [Jayapaul et al. IWOCA’14] Jayapaul et al. gave an O(n^2 lg lg n)-bit encoding.

Pictures © Pat Nicholson



Can’t point directly to the answer when elements of A are non-distinct: this requires Ω(n^2 lg n) bits, which is uninteresting.

Note: Jayapaul et al. gave an O(n^2 lg lg n)-bit encoding.


Encoding 2D NLV

Theorem [Jo et al. WALCOM’15]

There is an encoding of NLVs of a 2D matrix A that uses O(n^2) bits and answers queries in O(1) time, even when elements of A are not distinct.

• Encoding idea is simple:
  • Suppose wlog that NLV(p) = q is to the right and above p. If there is a position p′ to the right of p in p’s row but not to the right of q, then p points to p′. Else, look for p′′ above p in its column. If neither p′ nor p′′ exists, then point to q.

• 1D NLV problem closely related to RMQ problem.

• Encoding 2D-RMQ requires Ω(n^2 lg n) bits [Demaine et al. ICALP’09].


Minimal Encodings

1. Pre-process given data to obtain encoding E , discard input.

2. E should precisely characterize the query: # distinct Es should equal # distinguishable data instances using the query (|C|).

3. Create succinct DS on E, using lg |C| (1 + o(1)) bits. Second pre-processing should not access input.

[Figure: pipeline with boxes INPUT, PREPROC, Encoding, PREPROC, DS, QUERY and RESULT.]

Advantages

• Optimal space.

• Only information in DS is what can be obtained from queries.
  • “Minimal-knowledge” data structures: contain only information strictly necessary to answer queries.


Minimal Encodings for RMQ

Problem Definition

Given A[1..n], preprocess to answer:

RMQ(l, r): return arg max_{l ≤ i ≤ r} A[i].

[Figure: shape of the Cartesian tree of A, with its 12 nodes labelled 1 to 12.]

• Shape of Cartesian tree precisely describes all possible RMQs. [Fischer/Heun, SICOMP’11]

• Pre-process A, output Cartesian tree, delete A.


Minimal Encodings for R2MQ

Problem Definition

Given A[1..n], encode A to answer:

R2MQ(l, r): return arg max_{i ∈ {l, ..., r} − {RMQ(l, r)}} A[i].

[Figure: tree with bracketed node labels [10], [6], [3] and [1].]

• Need to merge inner spines of Cartesian tree.

• Precisely described by “extended Cartesian tree”.

• Space needed is asymptotically ∼ 2.76n bits [Gawrychowski and Nicholson, ICALP’15].


Minimal Encodings for the Bidirectional NLV Problem

Problem Definition


BNLV(i): return j > i such that A[j] > A[i] and j − i is minimized, and j′ < i such that A[j′] > A[i] and i − j′ is minimized.

3 7 2 4 4 8 5 4 3 4 4 3

• When A has distinct values, this is just Cartesian trees.

• When A has equal values, described by a subclass of Schröder trees [Fischer, TCS’11].

• Number of n-node Schröder trees is ≤ (3 + 2√2)^n < 2^{2.54n}.

• Encoding using < 2.54n bits.
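The BNLV query itself is easy to compute with a monotone stack, which is the same stack discipline that builds a Cartesian tree. The sketch below is only a reference implementation of the query on the slide's example array, with a hypothetical function name; it is not the Schröder-tree encoding. Positions are 1-based and `None` means no strictly larger value exists on that side.

```python
def bnlv_all(a):
    """For each position i: (nearest larger to the right, nearest larger to the left)."""
    n = len(a)
    right, left = [None] * n, [None] * n
    stack = []
    for j in range(n):                       # left-to-right pass fills `right`
        while stack and a[stack[-1]] < a[j]:
            right[stack.pop()] = j + 1       # a[j] is the first strictly larger value
        stack.append(j)
    stack = []
    for j in range(n - 1, -1, -1):           # right-to-left pass fills `left`
        while stack and a[stack[-1]] < a[j]:
            left[stack.pop()] = j + 1
        stack.append(j)
    return list(zip(right, left))


A = [3, 7, 2, 4, 4, 8, 5, 4, 3, 4, 4, 3]     # example array from the slide
answers = bnlv_all(A)
print(answers[3])                             # BNLV(4)
print(answers[5])                             # BNLV(6): the maximum has no larger value
```

Equal values are never reported as answers, because the comparison is strict; this is where the structure departs from a plain Cartesian tree.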


Minimal Encodings for Range Min-Max Queries

Problem Definition


Range-Min-Max(l, r): return both arg max_{i ∈ {l, ..., r}} A[i] and arg min_{i ∈ {l, ..., r}} A[i].

Minimal encoding by [Gawrychowski and Nicholson, ICALP’15]:

• Precisely characterized by Baxter permutations.
  • There do not exist 1 ≤ l < i < r ≤ n such that
    π(i + 1) < π(l) < π(r) < π(i)   (pattern 2-41-3)
    or
    π(i) < π(r) < π(l) < π(i + 1)   (pattern 3-14-2).

• If A is a Baxter permutation, it can be recovered using Range-Min-Max queries.

• Number of Baxter permutations on [n] = 2^{3n}/n^{O(1)}, which gives an encoding size of 3n − O(lg n) bits.
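The pattern-avoidance definition can be checked by brute force for small n. The sketch below is an illustration with hypothetical function names: `is_baxter` tests the two forbidden patterns exactly as stated above, and `baxter_formula` evaluates a closed form for the Baxter numbers that comes from the combinatorics literature rather than from the slide, so treat it as an outside assumption that the brute-force count cross-checks.

```python
from itertools import permutations
from math import comb, log2


def is_baxter(p):
    """True if p avoids the patterns 2-41-3 and 3-14-2 (the '41' and '14' are adjacent)."""
    n = len(p)
    for i in range(n - 1):                   # adjacent pair p[i], p[i + 1]
        for l in range(i):
            for r in range(i + 2, n):
                if p[i + 1] < p[l] < p[r] < p[i]:     # 2-41-3
                    return False
                if p[i] < p[r] < p[l] < p[i + 1]:     # 3-14-2
                    return False
    return True


def baxter_brute(n):
    return sum(is_baxter(p) for p in permutations(range(n)))


def baxter_formula(n):
    """Closed form for the Baxter numbers (assumed from the literature)."""
    total = sum(comb(n + 1, k - 1) * comb(n + 1, k) * comb(n + 1, k + 1)
                for k in range(1, n + 1))
    return total // (comb(n + 1, 1) * comb(n + 1, 2))


for n in range(1, 7):
    print(n, baxter_brute(n), baxter_formula(n))
print(log2(baxter_formula(2000)) / 2000)     # bits per element, approaching 3 from below
```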


Conclusions and Open Problems

Conclusions:

• Introduced the notion of encoding DS.

• Minimal encodings are combinatorially interesting and have good privacy properties.

Wide range of open problems:

• Challenging data structuring open problems:
  • Asymptotically optimal 2D RMQ encoding of [Brodal et al. ESA’13] does not support efficient 2D RMQ queries.
  • Optimal top-k encoding of [Gawrychowski and Nicholson ICALP’15] does not support efficient queries.

• Determining minimal encodings for a number of problems.

• Pre-processing time — ideally want O(n) time preprocessing.

• Apply encoding DS to reducing the space usage of “normal” DS. [cf. Chan and Wilkinson, SODA’13]
