TRANSCRIPT
Wringing a Table Dry: Using CSVZIP to Compress a Relation to its Entropy
Vijayshankar Raman & Garret Swart
IBM Almaden Research Center
© 2006 IBM Corporation
Oxide is cheap, so why compress?
Make better use of memory
– Increase capacity of an in-memory database
– Increase effective cache size of an on-disk database
Make better use of bandwidth
– I/O and memory bandwidth are expensive to scale
– ALU operations are cheap and getting cheaper
Minimize storage and replication costs
Why compress relations?
Relations are important for structured information
Text, video, audio, and image compression are more advanced than relational compression
Statistical and structural properties of the relation can be exploited to improve compression
Relational data have special access patterns
– Don’t just “inflate”: we need to run selections, projections, and aggregations on the compressed form
Our results
Near optimal compression of relational data
– Exploits data skew, column correlations and lack of ordering
– Theory: Compress m i.i.d. tuples to within 4.3m bits of entropy (the theory does not count dictionary sizes)
– Practice: between 8x and 40x compression
Scanning compressed relational data
– Directly perform projections, equality and range selections, and joins on entropy-compressed data
– Cache efficient dictionary usage
– Query short circuiting
This Talk
[Diagram: CSVZIP flow. Raw Data → Analyze → Meta Data & Dictionaries → Compress → Compressed Data → Query → Results; Update merges New Raw Data back into the compressed data and dictionaries.]
Analyze to determine compression plan
Compress to reduce size
Execute many queries over compressed data
Periodically update data and dictionaries
Sources of Redundancy in Relations
Column value space much smaller than domain
– |C| << |domain(C)|
– Type-specific transformations, dictionaries
– Example: {“Apple”, “Pear”, “Mango”} stored in CHAR(10)
Skew in value frequency
– H(C) << lg |C|
– Entropy encoding (e.g. Huffman codes)
– Example: 90% of fruits are “Apple” (worked numbers below)
Column correlations within a tuple
– H(C1, C2) << H(C1) + H(C2)
– Column co-coding
– Example: Mangos are mainly sold in August
Incidental tuple ordering
– H({T1, T2, …, Tm}) ≈ H(T1, T2, …, Tm) – m lg m
– Sort and delta code
Tuple correlations
– If correlated tuples share common columns, sort first on those columns
– Example: Mango buyers also buy paper towels
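As a worked illustration of the skew inequality above (the 5%/5% split between “Pear” and “Mango” is assumed; only the 90% figure comes from the example):

    H(C) = 0.9 lg(1/0.9) + 0.05 lg(1/0.05) + 0.05 lg(1/0.05)
         ≈ 0.137 + 0.216 + 0.216 ≈ 0.57 bits per value

versus lg |C| = lg 3 ≈ 1.58 bits for a fixed-length code over the three values, and 80 bits for the raw CHAR(10).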
Compression Process: Step 1
[Diagram: the input tuple (Male, John, 08/10/06, Mango) passes through a type-specific transform that splits the date into day of week (Sat), week (w35), and year (2006), then through co-code transforms that combine Male/John/Sat into one column code and w35/Mango into another. Each resulting column code is Huffman-encoded against its dictionary: 101101011 (p = 1/512), 001 (p = 1/8, the year 2006), and 01011101 (p = 1/512). Concatenating them gives the tuple code 10110101100101011101.]

First-name frequencies (dictionary excerpt):
Michael 4.2%, David 3.8%, James 3.6%, Robert 3.5%, John 3.5%,
William 2.5%, Mark 2.4%, Richard 2.3%, Thomas 1.9%, Steven 1.5%

Sale day of week by gender:
        Mon  Tue  Wed  Thu  Fri  Sat  Sun
Male     3%   4%  10%   6%  23%  42%  12%
Female   4%   5%   9%  15%  17%  28%  22%
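A minimal sketch (names assumed, not from the paper) of how the variable-length column codes are packed into a delimiter-free tuple code:

    #include <stdint.h>
    #include <stddef.h>

    /* Bit writer over a zero-initialized output buffer. */
    typedef struct {
        uint8_t *buf;    /* output bytes, must start zeroed */
        size_t bitpos;   /* bits written so far */
    } BitWriter;

    /* Append the low `len` bits of `code`, most significant bit first. */
    static void appendBits(BitWriter *w, uint32_t code, int len) {
        for (int i = len - 1; i >= 0; i--) {
            if (code & (1u << i))
                w->buf[w->bitpos >> 3] |= (uint8_t)(1u << (7 - (w->bitpos & 7)));
            w->bitpos++;
        }
    }

Appending 101101011 (9 bits), 001 (3 bits), and 01011101 (8 bits) with this routine reproduces the 20-bit tuple code above.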
Compression Process: Step 2
[Diagram: the tuple codes are sorted; the first tuple code is appended to the compression block verbatim, and each subsequent code is replaced by its delta from the previous code, which is Huffman-encoded against a delta dictionary and appended. In the example, the first tuple code 10110101110001011101 is followed by deltas 0000000000000000001, 00000000000000000001, and 0000000000000000101, which Huffman-encode to 000, 010, and 1110.]
Look Ma, no delimiters! 101101011100010111010000101110
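A minimal sketch of the differencing step, assuming (for illustration only) that tuple codes fit in 64 bits; the real codes are arbitrary-length bit strings:

    #include <stdint.h>
    #include <stddef.h>

    /* Replace each sorted tuple code with its delta from the previous
     * one; deltas[0] is the first code itself. Because the input is
     * sorted, the deltas are small and skewed, so they Huffman-encode
     * into very short codes. */
    void deltaEncode(const uint64_t sorted[], uint64_t deltas[], size_t m) {
        uint64_t prev = 0;
        for (size_t i = 0; i < m; i++) {
            deltas[i] = sorted[i] - prev;
            prev = sorted[i];
        }
    }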
Compression Results
P1 – P6: Various projections of TPC-H tables
P7: SAP SEOCOMPODF
P8: TPC-E Customer
Huffman Code Scan operations
SELECT SUM(price) FROM Sale WHERE week(saleDate) = 23 AND fruit = 'Mango' AND year(saleDate) BETWEEN 1997 AND 2005
Scan this: 1011010110010101110101001
– Skip over the 1st column: need only its length
– Range compare on the 2nd column: year in 1997 to 2005
– Equality compare on the 3rd column: week = 23, fruit = Mango
– Decode the 4th column for aggregation
Segregated Coding: Faster operations, same compression
– Assign Huffman codes in order of length: |code(v)| < |code(w)| ⟹ code(v) < code(w)
– Sort codes within a length: |code(v)| = |code(w)| ⟹ (v < w ⟺ code(v) < code(w))
– (A code-assignment sketch follows the table below)
Year Code
2003 000
2006 001
1997 010000
1999 010001
2000 010010
2004 010011
1998 0101000
2001 0101001
2002 0101010
2005 0101011
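A minimal sketch of the assignment rule (names assumed): given how many codes of each length the Huffman construction produced, segregated coding makes every shorter code numerically smaller than every longer one, with consecutive codes inside each length, as in the Year table above.

    #include <stdint.h>

    /* count[l] = number of codes of length l, l = 1..maxLen. On return,
     * firstCode[l] is the numeric value of the first code of length l;
     * the i-th value of that length (in sorted value order) gets code
     * firstCode[l] + i. */
    void assignCodes(const int count[], int maxLen, uint32_t firstCode[]) {
        uint32_t code = 0;
        for (int l = 1; l <= maxLen; l++) {
            firstCode[l] = code;
            code = (code + (uint32_t)count[l]) << 1;
        }
    }

For the Year table, count[3] = 2, count[6] = 4, count[7] = 4 yields firstCode[3] = 000, firstCode[6] = 010000, and firstCode[7] = 0101000, matching the codes shown.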
Segregated Coding: Computing Code Length
One code length ⟹ constant function
– #define codeLen(w) 6
Largest code length << lg(L1 cache size) ⟹ use a lookup table indexed by the top bits (a table-construction sketch follows)
– #define codeLen(w) codeTable[(w) >> 26]
Otherwise compare the input with the largest left-justified code of each length (constants taken from the table, padded with 1s)
– #define codeLen(w) \
      ((w) <= 0b00111111…) ? 3 \
    : ((w) <= 0b01001111…) ? 6 \
    : ((w) <= 0b01010111…) ? 7 \
    : …
(Year → code table repeated from the previous slide.)
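A sketch of how the lookup table might be built (sizes and names assumed): with a maximum code length of 6 and codes left-justified in a 32-bit word, the table is indexed by the word's top 6 bits.

    #include <stdint.h>

    static uint8_t codeTable[64];

    /* cumMax[l] = largest 6-bit left-justified value covered by any code
     * of length <= l (carry the previous entry forward for lengths with
     * no codes). The length of index i is the smallest l with
     * i <= cumMax[l]. */
    void buildCodeTable(const uint8_t cumMax[7]) {
        for (int i = 0; i < 64; i++) {
            int len = 1;
            while (len < 6 && (uint8_t)i > cumMax[len])
                len++;
            codeTable[i] = (uint8_t)len;
        }
    }

    #define codeLen(w) (codeTable[(w) >> 26])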
Segregated Coding: Range Query
SELECT * WHERE col BETWEEN 112 AND 302
[Figure: a column of sorted values with their segregated codes; the matching values form one contiguous code range per code length.]
switch (codeLen(w)) {
case 3: return w >> 28 != 0;
case 4: return w >= 0b0111000000000000 && w <= 0b1000111111111111;
case 5: return w >= 0b1011000000000000 && w <= 0b1101111111111111;
}
Advantages of Segregated Coding
Find code length quickly
– No access to dictionary
Fast Range query
– No access to dictionary for constant ranges
Cache Locality
– Because values are sorted by code length, commonly used values are clustered near the beginning of the array
– The beginning of the array is most likely to be in cache, improving the cache hit ratio
Query Short Circuiting
Reuse predicates and values that depend on unchanged columns
Sorting causes many unchanged columns
[Example: previous tuple code 1011010111000011000 + delta 0000000000000000101 = next tuple code 1011010111000011101. The codes share the common prefix 1011010111000011, so the Gender/FName and Year columns are unchanged and the predicates Sex = Male, Name = John, and Year ≥ 2005 are reused from the previous tuple.]
Reduces instructions but adds a branch!
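A minimal sketch of detecting the unchanged prefix, assuming 64-bit tuple codes and the GCC/Clang count-leading-zeros builtin:

    #include <stdint.h>

    /* Number of leading bits shared by consecutive tuple codes. Any
     * column lying wholly inside this prefix is unchanged, so its
     * decoded value and predicate results can be reused. */
    static inline int commonPrefixBits(uint64_t prev, uint64_t next) {
        uint64_t diff = prev ^ next;
        return diff ? __builtin_clzll(diff) : 64;
    }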
Selected Prior Work
Entropy coding
– Shannon (1948), Huffman (1952)
Arithmetic coding
– Abramson (1963), Pasco and Rissanen (1976)
Row or page coding
– Compress each row or page independently; decompress on page load or row touch. Compression code is localized. [Oracle, DB2, IMS]
Column-wise coding
– Each column value gets a fixed-length code from a per-column dictionary. [Sybase IQ, C-Store, MonetDB]
– Pack multiple short values into 16-bit quantities and decode them as a unit to save CPU [Abadi/Madden/Ferreira]
Delta coding
– Sort and difference, or remove common prefixes from adjacent codes [inverted indices, B-trees, C-Store]
Text coding
– “gzip”-style coding using n-grams, Huffman codes, and sliding dictionaries [Ziv, Lempel, Welch, Katz]
Order-preserving codes
– Allow range queries at a cost in compression [Hu/Tucker, Antoshenkov/Murray/Lomet, Zandi/Iyer/Langdon]
Lossy coding
– Model-based lossy compression: SPARTAN, vector quantization
Work in Progress
Analysis to find the best:
– Dictionaries that fit in L2 cache size
– Set of columns to co-code
– Column ordering for sort
Generate code for efficient queries on x86-64, Power5 and Cell
– Don’t interpret meta-data at run time
– Utilize architecture features
Update
– Incremental update of dictionaries. Background merge of new rows.
Release of CSVZIP utilities
Observations
Entropy decoding uses less I/O, but more ALU ops than conventional decoding
– Our technique removes the cache as a problem
– Have to squeeze every ALU op: hardware trends are in our favor
Variable-length codes make vectorization and out-of-order execution hard
– Exploit compression block parallelism instead
These techniques can be exploited in a column store
Backup Slides
Entropy Encoding on a Column Store
Don’t build tuple code: Treat tuple as vector of column codes and sort lexicographically
Columns early in the sort: run-length encoded deltas (see the sketch after this list)
Columns in the middle of the sort: Entropy encoded deltas
Columns late in the sort: Concatenated column codes
Independently break columns into compression blocks
Make dictionaries bigger because only one dictionary is in use at a time
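A minimal sketch (names assumed) of the run-length case for columns early in the sort order:

    #include <stdint.h>
    #include <stddef.h>

    /* Collapse runs of equal column codes into (value, run length)
     * pairs; returns the number of pairs written. Columns early in the
     * sort order consist of long runs, so this is very effective there. */
    size_t rleEncode(const uint32_t in[], size_t n,
                     uint32_t vals[], uint32_t runs[]) {
        size_t k = 0;
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && in[j] == in[i])
                j++;
            vals[k] = in[i];
            runs[k] = (uint32_t)(j - i);
            k++;
            i = j;
        }
        return k;
    }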
Entropy: A measure of information content
Entropy of a random variable R
– The expected number of bits needed to represent the outcome of R
– H(R) = ∑_{r ∈ domain(R)} Pr(R = r) lg(1 / Pr(R = r))
Conditional entropy of R given S
– The expected number of bits needed to represent the outcome of R given we already know the outcome of S.
– H(R | S) = ∑_{s ∈ domain(S)} ∑_{r ∈ domain(R)} Pr(R = r & S = s) lg(1 / Pr(R = r & S = s)) − H(S)
If R is a random relation of size n, then R is a multi-set of random variables {T1, …, Tn}, where each random tuple Ti is a cross product of random attributes Ci,1 × … × Ci,k
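A small sketch (not from the paper) evaluating the definition above on an empirical histogram of column values:

    #include <math.h>

    /* H(C) in bits per value from occurrence counts of a column's n
     * distinct values: the sum of p * lg(1/p) over nonzero counts. */
    double columnEntropy(const unsigned counts[], int n) {
        double total = 0.0, h = 0.0;
        for (int i = 0; i < n; i++)
            total += counts[i];
        for (int i = 0; i < n; i++) {
            if (counts[i] == 0)
                continue;
            double p = counts[i] / total;
            h -= p * log2(p);
        }
        return h;
    }

Counts of {90, 5, 5} give ≈ 0.57 bits per value, matching the skewed fruit example.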
The Entropy of a Relation
We define a random relation R of size m over D as a random variable whose outcomes are multi-sets of size m where each element is chosen identically and independently from an arbitrary tuple distribution D. The results are dependent on H(D) and thus on the optimal encoding of tuples chosen from D.
– If we do a good job of co-coding and Huffman coding, then the tuple codes are entropy coded: They are random bit strings whose length depends on the distribution of the column values but whose entropy is equal to their length
Lemma 2: The entropy of a random relation R of size m over a distribution D is at least m H(D) − lg m!
Theorem 3: The algorithm presented compresses a random relation R of size m to within H(R) + 4.3m bits, provided m > 100
Proof of Lemma 2
Let V be a random vector of m tuples i.i.d. over distribution D, whose outcomes are sequences of m tuples t1, …, tm.
Since the tuples are i.i.d., H(V) = m H(D).
Augment V by appending to each tuple its index, so that ti carries the value i, and define R1 as the set consisting of exactly those augmented tuples. H(R1) = m H(D), as there is a bijection between the outcomes of R1 and those of V.
But the random multi-set R (the relation) is the projection of the set R1 that drops the indices, and at most m! equal-probability outcomes of R1 project to each outcome of R, so H(R1) ≤ H(R) + lg m!, and thus H(R) ≥ m H(D) − lg m!
Proof sketch of Theorem 3
Lemma 1 says: if R is a random multi-set of m values drawn uniformly from 1..m and m > 100, then H(delta(sort(R))) < 2.67 m.
But our values come from an arbitrary distribution, so we work by cases:
– For values longer than lg m bits, truncate to lg m bits; since entropy-coded bits are effectively random, this gives a uniform distribution over the range.
– For values shorter than lg m bits, append random bits, again obtaining a uniform distribution.