succinct data structures: theory and practice · 2012. 3. 15. · problems: in theory develop...

21
Motivation and Context Succinct Data Structures: Theory and Practice March 16, 2012 — Succinct Data Structures: Theory and Practice 1/15

Upload: others

Post on 11-Mar-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct Data Structures: Theory and Practice

March 16, 2012

— Succinct Data Structures: Theory and Practice 1/15

Page 2: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Contents

1 Motivation and ContextMemory HierarchySuccinct Data StructuresBasics

— Succinct Data Structures: Theory and Practice 2/15

Page 3: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

1 Motivation and ContextMemory HierarchySuccinct Data StructuresBasics

— Succinct Data Structures: Theory and Practice 3/15

Page 4: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Data structure D

representation of an objectX

+ operations on X

Example: Rank-bit-vector

bit vector b of length n(0,1,0,1,1,0,1,1)

(0,0,1,1,2,3,3,4)

in n bits space

+access b[i ] in O(1) timerank(i) =

∑i−1j=0 b[j ] in O(n) time

Succinct data structure D

Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice 4/15

Page 5: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Data structure D

representation of an objectX

+ operations on X

Example: Rank-bit-vector

bit vector b of length n(0,1,0,1,1,0,1,1)

(0,0,1,1,2,3,3,4)

in n bits space

+access b[i ] in O(1) timerank(i) =

∑i−1j=0 b[j ] in O(n) time

Succinct data structure D

Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice 4/15

Page 6: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Data structure D

representation of an objectX

+ operations on X

Example: Rank-bit-vector

bit vector b of length n(0,1,0,1,1,0,1,1)

(0,0,1,1,2,3,3,4)

in n + n log n bits space

+access b[i ] in O(1) timerank(i) =

∑i−1j=0 b[j ] in O(1) time

Succinct data structure D

Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice 4/15

Page 7: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Data structure D

representation of an objectX

+ operations on X

Example: Rank-bit-vector

bit vector b of length n(0,1,0,1,1,0,1,1)

(0,0,1,1,2,3,3,4)

in n + n log n bits space

+access b[i ] in O(1) timerank(i) =

∑i−1j=0 b[j ] in O(1) time

Succinct data structure D

Space of D is close the information theoretic lower bound torepresent X , while operations can still be performed efficient.

— Succinct Data Structures: Theory and Practice 4/15

Page 8: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Can succinct data structures replace classic uncompressed datastructures in practice?

Less memory ⇒ fewer CPU cycles !?

Less memory ⇒ less costs !?

Problems:

in theorydevelop succinct data structures

in practiceconstants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement

— Succinct Data Structures: Theory and Practice 5/15

Page 9: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Can succinct data structures replace classic uncompressed datastructures in practice?

Less memory ⇒ fewer CPU cycles !?

CPU

L1-Cache

L2-Cache

L3-Cache

DRAM

Disk

≈ 100 B

≈ 10 KB

≈ 512 KB

≈ 1-8 MB

≈ 4 GB

≈ x · 100 GB

≈ 1 CPU cycle

≈ 5 CPU cycles

≈ 10-20

≈ 20-100

≈ 100-500

≈ 106

Less memory ⇒ less costs !?

Problems:in theory

develop succinct data structuresin practice

constants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement

— Succinct Data Structures: Theory and Practice 5/15

Page 10: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Can succinct data structures replace classic uncompressed datastructures in practice?

Less memory ⇒ fewer CPU cycles !?Less memory ⇒ less costs !?Instance name main memory price per hour

Micro 613.0 MB 0.02 US$High-Memory Quadruple Extra Large 68.4 GB 2.00 US$

Pricing of Amazons Elastic Cloud Computing (EC2) service in July 2011.

Problems:in theory

develop succinct data structuresin practice

constants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement

— Succinct Data Structures: Theory and Practice 5/15

Page 11: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct data structures

Can succinct data structures replace classic uncompressed datastructures in practice?

Less memory ⇒ fewer CPU cycles !?

Less memory ⇒ less costs !?

Problems:

in theorydevelop succinct data structures

in practiceconstants in O(1)-time terms are higho(n)-space term is not negligiblecomplex data structures are hard to implement

— Succinct Data Structures: Theory and Practice 5/15

Page 12: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Memory Hierarchy

Moore’s Law

Image source: Wikipedia

“The number of transistors that can beplaced inexpensively on an integratedcircuit doubles approximately every twoyears”

— Succinct Data Structures: Theory and Practice 6/15

Page 13: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Memory Hierarchy

Reality

Processor speed increasing Disks have not evolved at the same pace

Image source: www.intel.com

— Succinct Data Structures: Theory and Practice 7/15

Page 14: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Memory Hierarchy

Image source: Wikipedia

— Succinct Data Structures: Theory and Practice 8/15

Page 15: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Memory Hierarchy

Some Numbers

A few CPU registers, less than 1 nanosecond.

A few KBs of L1 cache, about 10 nanoseconds.

A few MBs of L2 cache, about 30 nanoseconds.

A few GBs of RAM, about 60 nanoseconds.

A few TBs of disk, about 10 milliseconds.

— Succinct Data Structures: Theory and Practice 9/15

Page 16: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Succinct Data Structures

There are Data Structures:

Modify to use less space than the original one.

That is not compression?

No: It must provide efficient algorithms for simulating itsoperations.

Why use it, if the memory is so cheap?

Improve performance given the memory hierarchy, especially ifwe operate on RAM something that would need the disk.

— Succinct Data Structures: Theory and Practice 10/15

Page 17: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Things that will be cover

We will review progress in various succinct compact datastructures.

These will give us theoretical and practical tools to takeadvantage of the memory hierarchy in the design ofalgorithms and data structures.

We will see compact structures for:

Manipulate bit sequencesManipulate symbols sequencesNavigating treesPattern matching and text searchNavigating graphs

We will also see hashing applications, sets, partial sums,geometry, permutations, and more.

— Succinct Data Structures: Theory and Practice 11/15

Page 18: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Zero-order Entropy

Average number of bits needed to represent a symbol of a text T ifeach symbol receives always the same code.

H0(T ) = −∑c∈Σ

ncn

logncn

where nc is the number of occurrences of the symbol c in T and nis the length of T .

— Succinct Data Structures: Theory and Practice 12/15

Page 19: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

K-order Entropy

If we can encode each symbol depending on the context in which itappears, we can achieve better compression ratios.

Hk(T ) =∑s∈Σk

|T s |n

H0(T s)

where T s is the sequence of symbols preceded by the context s inT .

— Succinct Data Structures: Theory and Practice 13/15

Page 20: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Entropies of the Pizza&Chili 200MB test cases

dblp.xml dna english proteins rand k128 sources

k Hk CT/n Hk CT/n Hk CT/n Hk CT/n Hk CT/n Hk CT/n0 5.257 0.0000 1.974 0.0000 4.525 0.0000 4.201 0.0000 7.000 0.0000 5.465 0.00001 3.479 0.0000 1.930 0.0000 3.620 0.0000 4.178 0.0000 7.000 0.0000 4.077 0.00002 2.170 0.0000 1.920 0.0000 2.948 0.0001 4.156 0.0000 6.993 0.0001 3.102 0.00003 1.434 0.0007 1.916 0.0000 2.422 0.0005 4.066 0.0001 5.979 0.0100 2.337 0.00124 1.045 0.0043 1.910 0.0000 2.063 0.0028 3.826 0.0011 0.666 0.6939 1.852 0.00825 0.817 0.0130 1.901 0.0000 1.839 0.0103 3.162 0.0173 0.006 0.9969 1.518 0.02506 0.705 0.0265 1.884 0.0001 1.672 0.0265 1.502 0.1742 0.000 1.0000 1.259 0.05097 0.634 0.0427 1.862 0.0001 1.510 0.0553 0.340 0.4506 0.000 1.0000 1.045 0.08508 0.574 0.0598 1.834 0.0004 1.336 0.0991 0.109 0.5383 0.000 1.0000 0.867 0.12559 0.537 0.0773 1.802 0.0013 1.151 0.1580 0.074 0.5588 0.000 1.0000 0.721 0.170110 0.508 0.0955 1.760 0.0051 0.963 0.2292 0.061 0.5699 0.000 1.0000 0.602 0.2163

— Succinct Data Structures: Theory and Practice 14/15

Page 21: Succinct Data Structures: Theory and Practice · 2012. 3. 15. · Problems: in theory develop succinct data structures in practice constants in O(1)-time terms are high o(n)-space

Motivation and Context

Calculation of H1(T) in linear time

15

$

7

dumulmum$

11

m$

3

ndumulm

um$

lmu

14

$9

m$

1

ndumulm

um$

lmu

12m$

4

ndumulm

um$

um

6ndumulm

um$

10m$

2

ndumulm

um$

lmu

13

$

8

m$

0

ndumulm

um$

ulm

um

5

ndumulm

um$

u

+5H0([1, 4])+5H0([1, 4])

+6H0([2, 3, 1])+6H0([2, 3, 1])

+0

+0

+0

+0

— Succinct Data Structures: Theory and Practice 15/15