

Page 1: The SBC-Tree: An Index for Run-Length Compressed Sequences

The SBC-Tree: An Index for Run-Length Compressed Sequences

Mohamed El-tabakh¹, Wing-Kai Hon², Rahul Shah³, Walid Aref¹, Jeffrey Vitter¹

¹ Department of Computer Science, Purdue University
² Department of Computer Science, National Tsing Hua University
³ Department of Computer Science, Louisiana State University

Page 2: The SBC-Tree: An Index for Run-Length Compressed Sequences

Outline

• Introduction
• Related Work
• SBC-Tree Structure
• SBC-Tree Operations
• Theoretical and Experimental Analysis
• Summary

Page 3: The SBC-Tree: An Index for Run-Length Compressed Sequences

Introduction: Why Compression?

We deal with massive amounts of data: scientific databases, …

Text and sequence formats are very common

Compression techniques gain significant importance because they achieve:

• Significant storage reduction
• Reduced buffer requirements
• Fewer I/Os

>>> Enhanced overall system performance

Page 4: The SBC-Tree: An Index for Run-Length Compressed Sequences

Introduction: Objective

Current databases do not support data compression
• Operate over the raw data
• Store, index, and search the decompressed sequences

Objective
• Compress the sequences
• Store, index, and search the compressed sequences

The main challenge is how to operate on the compressed data without decompressing it

This is more challenging for external-memory processing

Page 5: The SBC-Tree: An Index for Run-Length Compressed Sequences

Processing Compressed Sequences: Related Work (1)

1. Processing compressed data in main memory

A. Amir and G. Benson. Efficient two-dimensional compressed matching. In DCC, 1992.

A. Amir, G. Benson, and M. Farach. Let sleeping files lie: pattern matching in Z-compressed files. In SODA, 1994.

A. Apostolico, G. M. Landau, and S. Skiena. Matching for run-length encoded strings. Journal of Complexity, 1999.

T. Bell, M. Powell, A. Mukherjee, and D. Adjeroh. Searching BWT compressed text with the Boyer-Moore algorithm and binary search. In DCC, 2002.

V. Freschi and A. Bogliolo. Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism. Information Processing Letters, 2004.

• Searching compressed data is addressed in main memory
• Operations covered: substring matching, longest common subsequence, edit distance

Page 6: The SBC-Tree: An Index for Run-Length Compressed Sequences

Processing Compressed Sequences: Related Work (2)

2. Processing compressed data in DBMSs

M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A column-oriented DBMS. In VLDB, 2005.

D. Abadi, S. Madden. Compression in Column Oriented Databases. In SIGMOD, 2006.

Run-length encoding of a column in a database table: the value 20 repeated 100 times is stored as the pair (20, 100)

Operations such as SUM can be applied directly over the compressed data

More complex operations have not been addressed yet:
• Indexing RLE-compressed sequences
• Substring searching
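The remark that SUM can run directly over run-length encoded data can be illustrated with a small Python sketch. The function name `rle_sum` and the list-of-pairs layout are assumptions made for this illustration; they are not taken from C-Store or any other system cited on the slide.

```python
def rle_sum(runs):
    """Sum a run-length encoded column without expanding it.

    `runs` is a list of (value, count) pairs, e.g. (20, 100) for the
    value 20 repeated 100 times (hypothetical layout for this sketch).
    """
    return sum(value * count for value, count in runs)


# The slide's example: a column holding the value 20 one hundred times.
raw_column = [20] * 100
encoded = [(20, 100)]

print(rle_sum(encoded))                     # 2000
print(rle_sum(encoded) == sum(raw_column))  # True
```

The compressed version touches one pair instead of one hundred values, which is the kind of saving the slide attributes to operating on compressed data.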

Page 7: The SBC-Tree: An Index for Run-Length Compressed Sequences

What is SBC-Tree?

SBC-Tree (String B-tree for Compressed sequences)

An index for Run-Length Encoding (RLE) compressed sequences

Supports prefix, range, and substring matching

Optimal theoretical bounds for:
• External memory space complexity
• Search I/O requirements
>> Relative to the size of the compressed sequences

Page 8: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-Tree: An Index for RLE Compressed Sequences

Run-Length Encoding (RLE)
• Replaces each run of consecutive repeated characters with the character and its frequency
• Effective with small alphabets

S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH
RLE(S) = L10 E6 L4 E3 H18
>> S has 41 suffixes
>> RLE(S) has 5 RLE-suffixes

Each character-frequency pair (e.g., L10) is an RLE-char.

RLE-suffixes:
• L10 E6 L4 E3 H18
• E6 L4 E3 H18
• L4 E3 H18
• E3 H18
• H18

1. Store the compressed sequences
2. Index the RLE-suffixes
3. Perform substring operations efficiently
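The encoding and the RLE-suffixes of the example sequence can be reproduced with a few lines of Python. This is only a sketch; the helper names `rle_encode`, `rle_suffixes`, and `show` are invented for this illustration and do not come from the SBC-tree implementation.

```python
from itertools import groupby


def rle_encode(s):
    """Return the RLE-chars of s as (character, frequency) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_suffixes(runs):
    """Return every suffix of the run list that starts at a run boundary."""
    return [runs[i:] for i in range(len(runs))]


def show(runs):
    """Format runs in the slide's notation, e.g. 'L10 E6'."""
    return " ".join(f"{ch}{freq}" for ch, freq in runs)


S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18
runs = rle_encode(S)

print(len(S))        # 41 plain suffixes, one per character
print(show(runs))    # L10 E6 L4 E3 H18
for suffix in rle_suffixes(runs):
    print(show(suffix))
```

The loop prints the five RLE-suffixes, so the index holds 5 entries for this sequence instead of 41.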

Page 9: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Structure

Two-level index structure:
• String B-tree: indexes the RLE-suffixes
• Two-dimensional index (e.g., R-tree): built on top of the leaves of the String B-tree

[Figure: a String B-tree (with its root) whose leaves carry a numeric tag assigned to each suffix, together with a two-dimensional index (e.g., R-tree) whose axes are "Tags" and "Preceding character".]

Page 10: The SBC-Tree: An Index for Run-Length Compressed Sequences

String B-tree Overview [P. Ferragina and R. Grossi, Journal of the ACM, 1999]

S = LLLLLLLLLLEEEEEELLLLEEEHHHHHHHHHHHHHHHHHH

1. Generate all suffixes of S
2. Insert the suffixes into the String B-tree (ordered alphabetically)
3. Store logical pointers, such as (S,11) or (S,21), instead of the key sequences
4. Several optimizations achieve optimal theoretical bounds for:
   • External memory space complexity
   • Search I/O requirements
   >> Relative to the size of the raw (decompressed) sequences

[Figure: String B-tree leaves holding logical pointers such as (S,21), (S,12), (S,11), (S,13), with offsets 11 and 21 marked on S.]
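The idea of storing logical pointers rather than key strings can be sketched as follows. This is a flat, in-memory stand-in for the leaf level only, assuming 1-based offsets as in the pointers (S,11) and (S,21); it does not model the B-tree nodes or the optimizations mentioned in step 4.

```python
S = "L" * 10 + "E" * 6 + "L" * 4 + "E" * 3 + "H" * 18

# One logical pointer (1-based offset) per suffix, ordered alphabetically
# by the suffix it refers to. Only the offsets are kept, not the strings.
pointers = sorted(range(1, len(S) + 1), key=lambda off: S[off - 1:])

print(len(pointers))   # 41
print(pointers[:4])    # [11, 12, 13, 21]
```

The first entries point into the two runs of E, which sort before H and L. Each entry costs a fixed-size offset regardless of how long the suffix is.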

Page 11: The SBC-Tree: An Index for Run-Length Compressed Sequences

String B-tree over RLE-suffixes

The String B-tree CANNOT be used directly to index RLE-suffixes, because the RLE-suffixes are only a subset of all suffixes

RLE-suffixes of S:
• L10 E6 L4 E3 H18
• E6 L4 E3 H18
• L4 E3 H18
• E3 H18
• H18

• We indexed only a subset of the suffixes (the RLE-suffixes)
• Searching for “L10 E6 L3”: Found
• Searching for “L5 E6 L3”: Not Found (implicit in L10 E6 L4 E3 H18)
• Searching for “E3 L4”: Not Found (implicit in E6 L4 E3 H18)

[Figure: String B-tree over the five RLE-suffixes, each labeled with its sort order; the leaves hold the logical pointers (S,1), (S,4), (S,6), (S,8), (S,10) into the stored sequence L10 E6 L4 E3 H18.]

Page 12: The SBC-Tree: An Index for Run-Length Compressed Sequences

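SBC-Tree over RLE-suffixes

Query Pattern Mapping Rule: a substring query pattern $P = x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$ is mapped into $P' = x_1^{f_1+} x_2^{f_2} \dots x_n^{f_n}$, where $x^{f}$ denotes the RLE-char written xf in the examples (e.g., L10) and $f_1+$ means a frequency of at least $f_1$

RLE-suffixes:
• L10 E6 L4 E3 H18
• E6 L4 E3 H18
• L4 E3 H18
• E3 H18
• H18

• Searching for “L5 E6 L3” (L5+ E6 L3): Found
• Searching for “E3 L4” (E3+ L4): Found

Challenge: the answer set is no longer consecutive in the index tree, so answering a query may take an unbounded number of I/Os

[Figure: index entries L5 E6 L3, L5 H2, L5 K10 L3, L6 E6 L3; L5 E6 L3 and L6 E6 L3 match the query L5+ E6 L3, while L5 H2 and L5 K10 L3 are not part of the query answer.]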
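The mapping rule and the non-consecutive answer set can be checked with a short Python sketch. Two points are assumptions of this illustration rather than statements from the slides: the last run of the pattern is also matched as "at least" (the slide's example finds L3 inside L4, which implies it), and index entries are ordered by (character, frequency) pairs. The names `parse`, `matches`, and `occurs` are hypothetical.

```python
import re


def parse(text):
    """Parse 'L5 E6 L3' into [('L', 5), ('E', 6), ('L', 3)]."""
    return [(ch, int(freq)) for ch, freq in re.findall(r"([A-Za-z])(\d+)", text)]


def matches(pattern, suffix):
    """True if the pattern occurs ending its first run where suffix's first run ends."""
    n = len(pattern)
    if len(suffix) < n:
        return False
    for i in range(n):
        (p_ch, p_freq), (s_ch, s_freq) = pattern[i], suffix[i]
        if p_ch != s_ch:
            return False
        if i == 0 or i == n - 1:
            if s_freq < p_freq:      # boundary runs: "at least"
                return False
        elif s_freq != p_freq:       # inner runs: exact
            return False
    return True


def occurs(pattern, runs):
    """Check the pattern against every RLE-suffix of runs."""
    return any(matches(pattern, runs[i:]) for i in range(len(runs)))


runs = parse("L10 E6 L4 E3 H18")
print(occurs(parse("L5 E6 L3"), runs))   # True
print(occurs(parse("E3 L4"), runs))      # True

# The challenge: in sorted order the matches are not adjacent.
keys = sorted(parse(k) for k in ["L6 E6 L3", "L5 K10 L3", "L5 E6 L3", "L5 H2"])
print([matches(parse("L5 E6 L3"), k) for k in keys])  # [True, False, False, True]
```

The last line shows two matching entries separated by two non-matching ones, which is why a plain range scan over the first-level index cannot bound the I/O cost.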

Page 13: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree: Insertion Procedure

Given an RLE sequence $S = \Omega^{1} x_1^{f_1} x_2^{f_2} \dots x_n^{f_n}$

1. Insert S as the first suffix into the SBC-tree first level
2. For each $1 \le i \le n$, insert the RLE-suffix $x_i^{1} x_{i+1}^{f_{i+1}} \dots x_n^{f_n}$ into the SBC-tree first level, and assign it a position tag $T$ (the tag assignment problem)
3. Insert the point $(T, f_i)$ into the SBC-tree second level

[Figure: a String B-tree (with its root) whose leaves carry a numeric tag assigned to each suffix, together with a two-dimensional index (e.g., R-tree) whose axes are "Tags" and "Preceding character".]

Page 14: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree: Substring Searching

Given a query $Q = y_1^{f_1} y_2^{f_2} \dots y_m^{f_m}$

1. Map Q into $Q' = y_1^{f_1+} y_2^{f_2} \dots y_m^{f_m}$
2. Search the String B-tree for $Q'' = y_1^{1} y_2^{f_2} \dots y_m^{f_m}$; this returns (min_tag, max_tag) as a contiguous range
3. Search the SBC-tree second level, within that tag range, for suffixes with frequency $\ge f_1$

[Figure: the String B-tree search yields the tag range from Min_tag to Max_tag; in the two-dimensional index (e.g., R-tree), with axes "Suffix tag" and "Preceding RLE-char", the answer set is the region between Min_tag and Max_tag at frequency f1 and above.]
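The two-step search can be emulated in memory with sorted lists. This is a minimal sketch under several assumptions that are not from the slides: tags are simply ranks in the sorted key order, a linear scan over the tag range stands in for the two-dimensional index, the last run of the query is matched as "at least", and the start marker Ω is omitted. The names `build` and `search` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build(runs):
    """Index one RLE sequence given as a list of (character, frequency) tuples.

    First level: keys x_i^1 x_{i+1}^{f_{i+1}} ... in sorted order (tag = rank).
    Second level: for each tag, the frequency f_i that was replaced by 1.
    """
    entries = []
    for i, (ch, freq) in enumerate(runs):
        key = ((ch, 1),) + tuple(runs[i + 1:])
        entries.append((key, freq, i))
    entries.sort()
    keys = [key for key, _, _ in entries]
    freqs = [freq for _, freq, _ in entries]
    starts = [i for _, _, i in entries]
    return keys, freqs, starts


def search(index, query):
    """Return the run positions where the RLE query occurs."""
    keys, freqs, starts = index
    y1, f1 = query[0]
    q2 = ((y1, 1),) + tuple(query[1:])                 # Q''
    upper = q2[:-1] + ((q2[-1][0], float("inf")),)     # last run: any larger frequency
    min_tag = bisect_left(keys, q2)
    max_tag = bisect_right(keys, upper)                # exclusive bound
    # Stand-in for the second-level range query: tag in range, frequency >= f1.
    return sorted(starts[t] for t in range(min_tag, max_tag) if freqs[t] >= f1)


index = build([("L", 10), ("E", 6), ("L", 4), ("E", 3), ("H", 18)])
print(search(index, [("L", 5), ("E", 6), ("L", 3)]))   # [0]
print(search(index, [("E", 3), ("L", 4)]))             # [1]
print(search(index, [("E", 4), ("H", 18)]))            # []
```

Replacing the first frequency by 1 is what makes step 2 a contiguous range: all candidates share the same key prefix, and the frequency condition moves to the second dimension.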

Page 15: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-Tree: Example

P = A5 E3 B4
P’ = A5+ E3 B4
P’’ = A1 E3 B4

Page 16: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-Tree Variants

3-sided structure [L. Arge, V. Samoladas, J. Vitter, PODS 1999]
• External memory structure based on priority search tree and B-tree
• Answers 3-sided range queries in 2D space
• Provides optimal worst-case theoretical bounds for external memory space complexity, insertion and deletion, and 3-sided range queries

R-tree
• Widely available in DBMSs
• Provides good performance in practice
• Does not have worst-case theoretical bounds for searching

One-Level SBC-tree
• Removes the second-level structure
• Disadvantage: queries scan many tuples outside the answer set

Page 17: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Theoretical Bounds

• Optimal external-memory space complexity: $O(N/B)$
• Optimal substring, prefix, and range searching in $O(\log_B N + (|p| + T)/B)$ I/O operations
• Insertion and deletion in $O(m \log_B (N + m))$ amortized I/O operations

Parameter   Definition
B           Disk page size
N           Total length of the RLE-compressed sequences
T           Query output size
|p|         Length of the RLE-compressed query pattern
m           Length of the RLE-compressed sequence to be inserted or deleted

Page 18: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Implementation

SBC-tree (R-tree variant) is implemented inside PostgreSQL

Query operators:
• ^^ (substring search)
• @@ (prefix search)
• ==<< ==>> (range search)

CREATE TABLE sequences (id INT, RLE_seq VARCHAR);

CREATE INDEX ON sequences USING sbctree (RLE_seq);

SELECT id FROM sequences
WHERE RLE_seq ^^ 'A5H7N2';

(^^ is the substring searching operator)

Page 19: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Performance Analysis: Storage Requirements

Up to an order of magnitude saving in storage

Comparing SBC-tree performance relative to String B-tree over uncompressed sequences

Datasets
• SwissProt (protein secondary structure), alphabet size = 3
• WalMart (sales profile time series), alphabet size = 5
• Temperature (time series of sensor readings), alphabet size = 52

Page 20: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Performance Analysis: Insertion

Around 30% saving in insertion

Comparison baseline and datasets as in the storage analysis: String B-tree over uncompressed sequences; SwissProt, WalMart, Temperature

Page 21: The SBC-Tree: An Index for Run-Length Compressed Sequences

SBC-tree Performance Analysis: Searching

• Retains the optimal search performance (only the query answer is retrieved)
• Some additional overhead because of the two-level structure

Comparison baseline and datasets as in the storage analysis: String B-tree over uncompressed sequences; SwissProt, WalMart, Temperature

Page 22: The SBC-Tree: An Index for Run-Length Compressed Sequences

Summary

• Addressed the challenge of storing and operating on compressed data inside DBMSs without decompression
• Introduced the SBC-tree as an index for Run-Length Encoded (RLE) compressed sequences
• The SBC-tree has optimal theoretical bounds for external memory space complexity and search I/O requirements
• Implemented inside PostgreSQL

Page 23: The SBC-Tree: An Index for Run-Length Compressed Sequences

Thank you

Mohamed Eltabakh ([email protected])