c-store: integrating compression and execution jianlin feng school of software sun yat-sen...

24
C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Upload: oscar-harrison

Post on 05-Jan-2016

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

C-Store: Integrating Compression and Execution

Jianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYMar 20, 2009

Page 2: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

High Compressibility in Column Store Each attribute is stored in a separate column.

A Column Store can not only use traditional compression techniques Dictionary encoding, Huffman Encoding, etc

But also can use column-oriented techniques Run-Length Encoding

Page 3: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Benefits of Compression in DBMS Reduces the Size of Data

Improve I/O Performance Reducing seek times: data are stored nearer

to each other. Reducing transfer times: there is less data Increasing buffer hit rate: buffer can hold

larger fraction of data

Page 4: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

How to Query a Compressed Column? De-compress the data

Query the compressed column directly? Run-Length Encoding

Page 5: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

A Simple Example

In column C1, the value “42” appears 1000 times consecutively.

Assume C1 is compressed by Run-Length Encoding

Query: SUM(C1) ==42 * 1000

Page 6: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

History of Compression in DBMS 80s:

Focus on compression ratio 90s:

CPU cost of compressing/decompressing should be less than the savings of reducing the size of data.

Now:CPU speed increases much faster than memory speed and disk speed. Light-weight Compression is good

Page 7: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Reducing CPU cost on compressed data(Graefe and Shapiro, 1991) Lazy Decompression

Data is decompressed only if needed to be operated on.

Query the compressed data directly Exact-match Comparison, Natural Join

If the constant portion of the predicate is compressed in the same way as the data

Page 8: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

New Work in C-Store

Simultaneously apply an operation on multiple values in a single column.

Introduces a novel architecture for passing compressed data between query operators. Minimizes operator code complexity While maximizes chances for direct operation on

compressed data.

Page 9: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Review: Basic Concepts of C-Store A logical table is physically represented as a

set of projections. Each projection consists of a set of columns

Columns are stored separately, along with a common sort order defined by SORT KEY.

Each column appears in at least one projection.

A column can have different sort orders if it is stored in multiple projections.

Page 10: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

An example of C-Store Projection LINEITEM(shipdate, quantity, retflag, suppkey |

shipdate, quantity, retflag) First sorted by shipdate Second sorted by quantity Third sorted by retflag

Sorting increases locality of data. Favors Compression Techniques such as Run-Length

Encoding

Page 11: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

C-Store Operators vs. Relational Operators Selection

Produce bitmaps that can be efficiently combined. Mask

Materialize a set of values from a column and a bitmap. Permute

Reorder a column using a join index. Projection

Is free to project a column. Two columns in the same order can be concatenated for fr

ee. Join

Produces positions rather than values.

Page 12: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Join over Two Columns: An Example

Page 13: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Compressed Query Execution:Two Classes for Each New Compression Technique Compression Block

Encapsulates an intermediate representation for compressed data.

DataSource operator Reads in compressed pages from disk and conver

ts them into compression blocks.

Page 14: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

A Compression Block

contains a buffer of the column data in compressed format

Provides an API that allows the buffer to be accessed in several ways.

Page 15: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Accessing Properties of Compression Block isOneValue()

Returns whether or not the block contains just one value (and many positions for that value).

isValueSorted() Returns whether or not the block’s values are sort

ed. isPosContig()

Returns whether or not the block a consecutive subeset of a column.

Page 16: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Properties of Compression Block:for Various Encoding Schemes.

Page 17: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Iterator Access: where decompression cannot be avoided. getNext()

Transiently decompresses the next value in the compression block

Returns that value along with the position in the uncompressed column.

asArray() Decompresses the entire compression block And returns a pointer to an array of data in the un

compressed column type.

Page 18: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Block Information Methods (1):Getting Data without Decompression For Run-Length Encoding

A compression block consists of a single RLE triple of the form (value, start_pos, run_length)

getSize(): Returns run_length;

getStartValue(): Returns value;

getEndPosition(): Returns (start_pos + run_length -1);

Page 19: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Block Information Methods (2):Getting Data without Decompression For bitmaps

A compression block is a consecutive subset of a bitmap for a single value.

getSize() : Returns the number of on bits in the block (i.e., a bit strin

g). getStartValue() :

Returns the value with which the bit string is associated. getEndPosition() :

Returns the position of the last on bit in the bit string.

Page 20: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Compression-Aware Optimization Natural Join

An input column is compressed by Run-Length Encoding,

The other input column is uncompressed Do the join directly

Reduce the number of operations by a factor of k, where k is the run-length of the RLE triple.

Count

Page 21: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009
Page 22: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009
Page 23: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

Summary: Integrating Compression and Execution Operate directly on compressed data

whenever possible Using compression blocks as an intermediate

representation of data. Degenerate to a lazy decompression scheme

when decompression cannot be avoided Iterating through values in a compression block.

Reduce query executor complexity By abstracting general properties of compression

techniques.

Page 24: C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009

References

1. Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran and Stan Zdonik. C-Store: A Column Oriented DBMS , VLDB, 2005. (http://db.csail.mit.edu/projects/cstore/vldb.pdf)

2. Daniel J. Abadi, Samuel R. Madden, and Miguel C. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems.In SIGMOD, June, 2006, Chicago, USA. http://db.csail.mit.edu/projects/cstore/abadisigmod06.pdf