Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann
LECTURE 4: INDEX COMPRESSION (10.10.2012)
Overview
1. Dictionary Compression
2. Zipf's Law
3. Posting List Compression
4. Gamma Codes
5. Golomb Code
6. Index Compression in Practice
Vocabulary Growth: Heaps' Law
§ Can we assume there is an upper bound on vocabulary size?
§ Not really: the vocabulary will keep growing with collection size.
§ Heaps' law: M = k · T^β
§ M is the size of the vocabulary, T is the number of tokens in the collection.
§ Typical values for the parameters are 30 ≤ k ≤ 100 and β ≈ 0.5.
§ Empirical law: Heaps' law is linear in log-log space, i.e., the simplest possible relationship between collection size and vocabulary size (a worked example follows below).
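As a quick illustration, here is a minimal Python sketch of Heaps' law as a vocabulary-size estimator. The parameter values k ≈ 44 and β ≈ 0.49 are the fit reported for Reuters-RCV1 in Manning et al., Introduction to Information Retrieval; they are an assumption here, not part of the slides.

```python
# Heaps' law: M = k * T^beta estimates vocabulary size M from token count T.
# k = 44, beta = 0.49 are the Reuters-RCV1 fit (assumption, not from the
# slides); typical ranges are 30 <= k <= 100 and beta ~ 0.5.

def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, beta: float = 0.49) -> int:
    """Estimated number of distinct terms in a collection of num_tokens tokens."""
    return round(k * num_tokens ** beta)

if __name__ == "__main__":
    for T in (10**6, 10**7, 10**8):
        print(f"T = {T:>11,d} tokens -> M ~ {heaps_vocabulary_size(T):,d} terms")
```

For T = 10^8 tokens (roughly RCV1's size) this predicts roughly 370,000 terms, in the right ballpark of the 400,000-term Reuters dictionary used later in the lecture.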
Dictionary Hash Table
[Figure: the dictionary stored as a hash table. A fixed (known) hash function maps each term string to a hash value (e.g., mountain → 549283471, ETHZ → 398437231, class → 234443989, weather → 770209991), which is reduced to one of the buckets 0, 1, 2, ..., n; terms that land in the same bucket are chained in collision lists. Storage is still needed for the token strings themselves.]
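The structure in the figure can be sketched in a few lines of Python. This is an illustrative toy, not the lecture's implementation: the bucket count and the use of Python's built-in hash() as the fixed hash function are my own choices.

```python
# Toy dictionary hash table with chaining: a fixed hash function maps term
# strings to buckets; colliding terms are kept in per-bucket collision lists.

NUM_BUCKETS = 1024  # arbitrary choice for the sketch

def bucket_of(term: str) -> int:
    return hash(term) % NUM_BUCKETS   # hash() stands in for the fixed function

buckets: list[list[tuple[str, int]]] = [[] for _ in range(NUM_BUCKETS)]

def insert(term: str, term_id: int) -> None:
    buckets[bucket_of(term)].append((term, term_id))  # chain on collision

def lookup(term: str) -> int | None:
    for stored_term, term_id in buckets[bucket_of(term)]:
        if stored_term == term:   # compare full strings within the collision list
            return term_id
    return None

for i, t in enumerate(["mountain", "ETHZ", "class", "weather"]):
    insert(t, i)
assert lookup("ETHZ") == 1 and lookup("glacier") is None
```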
Example: Space Estimate
Example block size k = 4. Without blocking, we used 4 × 3 = 12 bytes for term pointers per group of four terms; with blocking, we now use 3 bytes for one pointer plus 4 × 1 byte for indicating the length of each term.
We save 12 − (3 + 4) = 5 bytes per block. Total savings: 400,000/4 × 5 bytes = 0.5 MB.
This reduces the size of the Reuters dictionary from 7.6 MB to 7.1 MB.
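A quick sanity check of this arithmetic (a sketch; the 400,000-term vocabulary and the byte counts are from the slide):

```python
# Space savings of dictionary blocking with block size k = 4: replace
# k 3-byte term pointers by 1 pointer plus k 1-byte term-length fields.
VOCAB_SIZE = 400_000   # Reuters dictionary terms (from the slide)
K = 4                  # block size
POINTER_BYTES = 3
LENGTH_BYTES = 1

before = K * POINTER_BYTES                 # 12 bytes per block
after = POINTER_BYTES + K * LENGTH_BYTES   # 7 bytes per block
saved = (VOCAB_SIZE // K) * (before - after)
print(f"saved {saved / 1e6:.1f} MB")       # -> saved 0.5 MB
```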
Zipf’s Law
We have characterized the growth of the vocabulary in collections with Heaps' law.
We also want to know how many frequent vs. infrequent terms we should expect in a collection.
In natural language, there are a few very frequent terms and very many very rare terms.
Zipf's law: the i-th most frequent term has frequency proportional to 1/i, i.e., cf_i ∝ 1/i, where cf_i is the collection frequency: the number of occurrences of t_i in the collection.
Equivalently: cf_i = c · i^k, or log cf_i = log c + k · log i (with k = −1 and a constant c).
Zipf's law is an example of a power law.
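As an illustration (not from the slides), a short sketch that estimates the Zipf exponent k as the least-squares slope of log cf_i against log i:

```python
import math

def zipf_exponent(collection_freqs: list[int]) -> float:
    """Least-squares slope of log(cf_i) vs. log(i); close to -1 if Zipf holds."""
    cfs = sorted(collection_freqs, reverse=True)       # rank by frequency
    xs = [math.log(rank) for rank in range(1, len(cfs) + 1)]
    ys = [math.log(cf) for cf in cfs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Perfectly Zipfian frequencies cf_i = c / i give a slope of about -1:
print(zipf_exponent([round(1_000_000 / i) for i in range(1, 1001)]))
```

On real data the fitted slope deviates from −1 at the head and tail of the distribution, which is the imperfect fit mentioned on the next slide.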
Example: Zipf’s Law
The fit is not perfect for Reuters-RCV1.
What matters is the key insight:
few frequent terms, many rare terms.
Posting List Compression
The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly. A posting for our purposes is a doc-id.
For Reuters (800,000 documents), we would use 32 bits per doc-id when using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 20 bits per doc-id. Our goal: use a lot less than 20 bits per doc-id.
Gap Encoding of Doc-IDs
Each postings list is ordered in increasing order of doc-id.
Example postings list: computer: 283154, 283159, 283202, ...
It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43.
Gap-encoded postings list: computer: ..., 5, 43, ...
Gaps for frequent terms are small. Thus: we can encode small gaps with fewer than 20 bits (see the sketch below).
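A minimal sketch of gap encoding and decoding (illustrative; the function names are my own):

```python
# Gap (delta) encoding of a sorted postings list: store the first doc-id,
# then the difference to the previous doc-id for each later posting.

def gap_encode(doc_ids: list[int]) -> list[int]:
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps: list[int]) -> list[int]:
    doc_ids, total = [], 0
    for g in gaps:               # running sum restores absolute doc-ids
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [283154, 283159, 283202]
assert gap_encode(postings) == [283154, 5, 43]
assert gap_decode(gap_encode(postings)) == postings
```

Each gap can then be stored with a variable-length code, which is what the rest of the lecture develops.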
Variable Length Encoding
Aim: For "arachnocentric" and other rare terms, we will use about 20 bits per gap (= posting).
For "the" and other very frequent terms, we will use about 1 bit per gap (= posting).
In order to implement this, we need to devise some form of variable length encoding.
Use few bits for small gaps, many bits for large gaps.
Variable Byte Code
Used by many commercial/research systems. A good low-tech blend of variable-length coding and sensitivity to alignment (contrast with the bit-level codes discussed later).
Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.
If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
Else: encode the higher-order bits in 7-bit chunks first, then use one or more additional bytes to encode the lower-order bits using the same algorithm.
At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). A sketch follows below.
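A minimal Python sketch of this scheme, following the convention above where the continuation bit marks the last byte of each number (the function names are my own):

```python
def vb_encode_number(n: int) -> bytes:
    """Variable byte code: 7 data bits per byte, high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # continuation bit c = 1 on the last byte
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:           # c = 0: more bytes follow for this number
            n = n * 128 + byte
        else:                    # c = 1: last byte of this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = b"".join(vb_encode_number(g) for g in gaps)
assert vb_decode(encoded) == gaps
```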
Gamma Codes
§ Even better compression is possible with a bit-level code; the gamma code is the best known of these.
§ Represent a gap G as a pair of length and offset.
§ Offset is the gap in binary, with the leading bit chopped off. For example, 13 → 1101 → 101.
§ Length is the length of the offset. For 13 (offset 101), this is 3.
§ Encode length in unary code: 1110.
§ The gamma code of 13 is the concatenation of length and offset: 1110101.
Unary Code
§ Represent n as n 1s with a final 0.
§ Unary code for 3 is 1110.
§ Unary code for 40 is 11111111111111111111111111111111111111110.
Length of Gamma Code
§ The length of offset is ⌊log2 G⌋ bits.
§ The length of length is ⌊log2 G⌋ + 1 bits.
§ So the length of the entire code is 2 × ⌊log2 G⌋ + 1 bits; gamma codes are always of odd length.
§ Gamma codes are within a factor of 2 of the optimal encoding length log2 G.
§ This assumes equal-probability gaps; the actual gap distribution is highly skewed.
Gamma Codes: Alignment
§ Machines have word boundaries: 8, 16, 32 bits.
§ Compressing and manipulating at individual-bit granularity can slow down query processing.
§ Variable byte alignment is potentially more efficient.
§ Regardless of efficiency, variable byte is conceptually simpler, at little additional space cost.
Gamma Code: Encode
[Figure: encoding procedure, taken from en.wikipedia.org/wiki/Elias_gamma_coding]
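The slide showed the encoding algorithm from the Wikipedia page; here is a minimal Python sketch of gamma encoding (my own illustration, not the slide's code):

```python
def gamma_encode(g: int) -> str:
    """Elias gamma code of g >= 1 as a bit string: unary(length) + offset."""
    binary = bin(g)[2:]                 # e.g., 13 -> '1101'
    offset = binary[1:]                 # chop off the leading bit: '101'
    length = "1" * len(offset) + "0"    # unary code of the offset length: '1110'
    return length + offset              # '1110101'

assert gamma_encode(13) == "1110101"
assert gamma_encode(1) == "0"           # empty offset; unary code of 0
```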
Gamma Code: Decode
[Figure: decoding procedure, taken from en.wikipedia.org/wiki/Elias_gamma_coding]
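And a matching decoder sketch under the same assumptions:

```python
def gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes back into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # read the unary length part
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2))  # re-attach the chopped leading bit
    return gaps

# '1110101' = 13, '0' = 1, '101' = 3:
assert gamma_decode("1110101" + "0" + "101") == [13, 1, 3]
```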
Shannon Limit
§ Is it possible to derive codes that are optimal (under certain assumptions)?
§ What is the optimal average code length for a code that encodes each integer (gap length) independently?
§ Lower bound on the average code length: the Shannon entropy H(X) = −Σ_x P(x) log2 P(x).
§ Asymptotically optimal codes (for finite alphabets): arithmetic coding, Huffman codes.
Bernoulli Model
§ Assumption: term occurrences are Bernoulli events.
§ Notation:
  § n: number of documents
  § m: number of terms in the vocabulary
  § N: total number of (unique) term-document occurrences
§ Probability of term t_j occurring in document d_i: p = N/(nm).
§ Each term-document occurrence is an independent event.
§ The probability of a gap of length x is then given by the geometric distribution: P(gap = x) = (1 − p)^(x−1) · p.
Local Bernoulli Model
§ If the length of each postings list is known, a Bernoulli model can be fit to each individual inverted list.
§ Frequent words are then coded with a smaller parameter b, infrequent words with a larger b (see the Golomb sketch below).
§ The term frequency needs to be encoded as well (use a gamma code for it).
§ The local Bernoulli model outperforms the global Bernoulli model in practice (method of choice!).
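The Golomb code slides themselves did not survive in this transcript, so for context here is a minimal sketch of Golomb encoding, the standard code for the geometric gap distribution above. The construction (quotient in unary, remainder in truncated binary) and the parameter rule b ≈ ln(2)/p are the textbook treatment, stated here as an assumption about what the missing slides covered:

```python
import math

def to_bits(value: int, width: int) -> str:
    return format(value, "b").zfill(width) if width > 0 else ""

def golomb_encode(x: int, b: int) -> str:
    """Golomb code of gap x >= 1 with parameter b: unary quotient,
    truncated-binary remainder."""
    q, r = (x - 1) // b, (x - 1) % b
    k = (b - 1).bit_length()       # ceil(log2 b)
    u = (1 << k) - b               # remainders that get the short (k-1)-bit form
    unary = "1" * q + "0"
    return unary + (to_bits(r, k - 1) if r < u else to_bits(r + u, k))

def golomb_parameter(p: float) -> int:
    """Standard choice b ~ ln(2)/p for gap probability p (Bernoulli model)."""
    return max(1, round(math.log(2) / p))

for gap in (1, 2, 3, 4):
    print(gap, golomb_encode(gap, b=3))  # -> 00, 010, 011, 100
```

With a small b, the short gaps of frequent terms get short codes; with a large b, the code spends fewer unary bits on the long gaps of infrequent terms.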
Block-Based Index Format
A block-based, variable-length format reduces both space and CPU cost.
It reduced index size by ~30% and is much faster to decode.
CPU Optimized Compression
The block index format gives very good compression, but is CPU-intensive to decode.
A better format: a single flat position space.
§ Data structures on the side keep track of document boundaries.
§ Posting lists are just lists of delta-encoded positions.
§ These need to be compact (we can't afford a 32-bit value per occurrence) ...
§ ... but they also need to be very fast to decode.
Improved Byte-Aligned Variable-Length Encodings
Varint encoding:
§ 7 bits per byte, with a continuation bit.
§ Con: decoding requires lots of branches, shifts, and masks.
Idea: encode the byte length in the low 2 bits (see the sketch below).
§ Better: fewer branches, shifts, and masks.
§ Con: limited to 30-bit values, and still some shifting to decode.
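A minimal sketch of the second idea (an illustration, not any production format: I assume the low 2 bits of the first byte store the byte count minus one and the remaining bits store the value, little-endian):

```python
def tagged_encode(n: int) -> bytes:
    """Encode n < 2**30; the low 2 bits of the first byte hold (num_bytes - 1)."""
    assert 0 <= n < (1 << 30), "limited to 30-bit values"
    tagged = n << 2                          # make room for the 2-bit length tag
    num_bytes = max(1, (tagged.bit_length() + 7) // 8)
    tagged |= num_bytes - 1                  # store the byte length in the tag
    return tagged.to_bytes(num_bytes, "little")

def tagged_decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos): one length read, no per-byte branching."""
    num_bytes = (data[pos] & 0b11) + 1
    tagged = int.from_bytes(data[pos:pos + num_bytes], "little")
    return tagged >> 2, pos + num_bytes

for v in (0, 63, 64, 100_000, (1 << 30) - 1):
    buf = tagged_encode(v)
    assert tagged_decode(buf) == (v, len(buf))
```

Compared with varint's per-byte continuation test, the decoder learns the full byte length from a single 2-bit tag, which removes the data-dependent branches at the cost of capping values at 30 bits.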