Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann
LECTURE 4: INDEX COMPRESSION (10.10.2012)
Overview
1. Dictionary Compression
2. Zipf's Law
3. Posting List Compression
4. Gamma Codes
5. Golomb Code
6. Index Compression in Practice
Vocabulary Growth: Heaps' Law
§ Can we assume there is an upper bound on vocabulary size?
§ Not really: the vocabulary will keep growing with collection size.
§ Heaps' law: M = k · T^β
§ M is the size of the vocabulary, T is the number of tokens in the collection.
§ Typical values for the parameters are 30 ≤ k ≤ 100 and β ≈ 0.5.
§ Empirical law: Heaps' law is linear in log-log space, i.e., the simplest possible relationship between collection size and vocabulary size (a worked example follows below).
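As a quick illustration, here is a minimal Python sketch of Heaps' law as a vocabulary-size estimator. The parameter values k ≈ 44 and β ≈ 0.49 are the fit reported for Reuters-RCV1 in Manning et al., Introduction to Information Retrieval; they are an assumption here, not part of the slides.

```python
# Heaps' law: M = k * T^beta estimates vocabulary size M from token count T.
# k = 44, beta = 0.49 are the Reuters-RCV1 fit (assumption, not from the
# slides); typical ranges are 30 <= k <= 100 and beta ~ 0.5.

def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, beta: float = 0.49) -> int:
    """Estimated number of distinct terms in a collection of num_tokens tokens."""
    return round(k * num_tokens ** beta)

if __name__ == "__main__":
    for T in (10**6, 10**7, 10**8):
        print(f"T = {T:>11,d} tokens -> M ~ {heaps_vocabulary_size(T):,d} terms")
```

For T = 10^8 tokens (roughly RCV1's size) this predicts roughly 370,000 terms, in the right ballpark of the 400,000-term Reuters dictionary used later in the lecture.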
Dictionary Hash Table
[Figure: the dictionary stored as a hash table. A fixed (known) hash function maps each term string to a hash value (e.g., mountain → 549283471, ETHZ → 398437231, class → 234443989, weather → 770209991), which is reduced to one of the buckets 0, 1, 2, ..., n; terms that land in the same bucket are chained in collision lists. Storage is still needed for the token strings themselves.]
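The structure in the figure can be sketched in a few lines of Python. This is an illustrative toy, not the lecture's implementation: the bucket count and the use of Python's built-in hash() as the fixed hash function are my own choices.

```python
# Toy dictionary hash table with chaining: a fixed hash function maps term
# strings to buckets; colliding terms are kept in per-bucket collision lists.

NUM_BUCKETS = 1024  # arbitrary choice for the sketch

def bucket_of(term: str) -> int:
    return hash(term) % NUM_BUCKETS   # hash() stands in for the fixed function

buckets: list[list[tuple[str, int]]] = [[] for _ in range(NUM_BUCKETS)]

def insert(term: str, term_id: int) -> None:
    buckets[bucket_of(term)].append((term, term_id))  # chain on collision

def lookup(term: str) -> int | None:
    for stored_term, term_id in buckets[bucket_of(term)]:
        if stored_term == term:   # compare full strings within the collision list
            return term_id
    return None

for i, t in enumerate(["mountain", "ETHZ", "class", "weather"]):
    insert(t, i)
assert lookup("ETHZ") == 1 and lookup("glacier") is None
```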
Example: Space Estimate
Example block size k = 4. Without blocking, we used 4 × 3 = 12 bytes for term pointers per group of four terms; with blocking, we now use 3 bytes for one pointer plus 4 × 1 byte for indicating the length of each term.
We save 12 − (3 + 4) = 5 bytes per block. Total savings: 400,000/4 × 5 bytes = 0.5 MB.
This reduces the size of the Reuters dictionary from 7.6 MB to 7.1 MB.
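A quick sanity check of this arithmetic (a sketch; the 400,000-term vocabulary and the byte counts are from the slide):

```python
# Space savings of dictionary blocking with block size k = 4: replace
# k 3-byte term pointers by 1 pointer plus k 1-byte term-length fields.
VOCAB_SIZE = 400_000   # Reuters dictionary terms (from the slide)
K = 4                  # block size
POINTER_BYTES = 3
LENGTH_BYTES = 1

before = K * POINTER_BYTES                 # 12 bytes per block
after = POINTER_BYTES + K * LENGTH_BYTES   # 7 bytes per block
saved = (VOCAB_SIZE // K) * (before - after)
print(f"saved {saved / 1e6:.1f} MB")       # -> saved 0.5 MB
```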
Zipf’s Law
We have characterized the growth of the vocabulary in collections with Heaps' law.
We also want to know how many frequent vs. infrequent terms we should expect in a collection.
In natural language, there are a few very frequent terms and very many very rare terms.
Zipf's law: the i-th most frequent term has frequency proportional to 1/i, i.e., cf_i ∝ 1/i, where cf_i is the collection frequency: the number of occurrences of t_i in the collection.
Equivalently: cf_i = c · i^k, or log cf_i = log c + k · log i (with k = −1 and a constant c).
Zipf's law is an example of a power law.
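As an illustration (not from the slides), a short sketch that estimates the Zipf exponent k as the least-squares slope of log cf_i against log i:

```python
import math

def zipf_exponent(collection_freqs: list[int]) -> float:
    """Least-squares slope of log(cf_i) vs. log(i); close to -1 if Zipf holds."""
    cfs = sorted(collection_freqs, reverse=True)       # rank by frequency
    xs = [math.log(rank) for rank in range(1, len(cfs) + 1)]
    ys = [math.log(cf) for cf in cfs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Perfectly Zipfian frequencies cf_i = c / i give a slope of about -1:
print(zipf_exponent([round(1_000_000 / i) for i in range(1, 1001)]))
```

On real data the fitted slope deviates from −1 at the head and tail of the distribution, which is the imperfect fit mentioned on the next slide.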
Example: Zipf’s Law
The fit is not perfect for Reuters-RCV1.
What matters is the key insight:
few frequent terms, many rare terms.
Posting List Compression
The postings file is much larger than the dictionary, by a factor of at least 10.
Key desideratum: store each posting compactly. A posting for our purposes is a doc-id.
For Reuters (800,000 documents), we would use 32 bits per doc-id when using 4-byte integers.
Alternatively, we can use log2 800,000 ≈ 20 bits per doc-id. Our goal: use a lot less than 20 bits per doc-id.
Gap Encoding of Doc-IDs
Each postings list is ordered in increasing order of doc-id.
Example postings list: computer: 283154, 283159, 283202, ...
It suffices to store gaps: 283159 − 283154 = 5, 283202 − 283159 = 43.
Gap-encoded postings list: computer: ..., 5, 43, ...
Gaps for frequent terms are small. Thus: we can encode small gaps with fewer than 20 bits (see the sketch below).
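A minimal sketch of gap encoding and decoding (illustrative; the function names are my own):

```python
# Gap (delta) encoding of a sorted postings list: store the first doc-id,
# then the difference to the previous doc-id for each later posting.

def gap_encode(doc_ids: list[int]) -> list[int]:
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps: list[int]) -> list[int]:
    doc_ids, total = [], 0
    for g in gaps:               # running sum restores absolute doc-ids
        total += g
        doc_ids.append(total)
    return doc_ids

postings = [283154, 283159, 283202]
assert gap_encode(postings) == [283154, 5, 43]
assert gap_decode(gap_encode(postings)) == postings
```

Each gap can then be stored with a variable-length code, which is what the rest of the lecture develops.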
Variable Length Encoding
Aim: For "arachnocentric" and other rare terms, we will use about 20 bits per gap (= posting).
For "the" and other very frequent terms, we will use about 1 bit per gap (= posting).
In order to implement this, we need to devise some form of variable length encoding.
Use few bits for small gaps, many bits for large gaps.
Variable Byte Code
Used by many commercial/research systems. A good low-tech blend of variable-length coding and sensitivity to alignment (contrast with the bit-level codes discussed later).
Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.
If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
Else: encode the higher-order bits in 7-bit chunks first, then use one or more additional bytes to encode the lower-order bits using the same algorithm.
At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). A sketch follows below.
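A minimal Python sketch of this scheme, following the convention above where the continuation bit marks the last byte of each number (the function names are my own):

```python
def vb_encode_number(n: int) -> bytes:
    """Variable byte code: 7 data bits per byte, high bit set on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # continuation bit c = 1 on the last byte
    return bytes(out)

def vb_decode(data: bytes) -> list[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:           # c = 0: more bytes follow for this number
            n = n * 128 + byte
        else:                    # c = 1: last byte of this number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

gaps = [824, 5, 214577]
encoded = b"".join(vb_encode_number(g) for g in gaps)
assert vb_decode(encoded) == gaps
```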
Gamma Codes
§ Even better compression is possible with a bit-level code; the gamma code is the best known of these.
§ Represent a gap G as a pair of length and offset.
§ Offset is the gap in binary, with the leading bit chopped off. For example, 13 → 1101 → 101.
§ Length is the length of the offset. For 13 (offset 101), this is 3.
§ Encode length in unary code: 1110.
§ The gamma code of 13 is the concatenation of length and offset: 1110101.
Unary Code
§ Represent n as n 1s with a final 0.
§ Unary code for 3 is 1110.
§ Unary code for 40 is 11111111111111111111111111111111111111110.
Length of Gamma Code
§ The length of offset is ⌊log2 G⌋ bits.
§ The length of length is ⌊log2 G⌋ + 1 bits.
§ So the length of the entire code is 2 × ⌊log2 G⌋ + 1 bits; gamma codes are always of odd length.
§ Gamma codes are within a factor of 2 of the optimal encoding length log2 G.
§ This assumes equal-probability gaps; the actual gap distribution is highly skewed.
Gamma Codes: Alignment
§ Machines have word boundaries: 8, 16, 32 bits.
§ Compressing and manipulating at individual-bit granularity can slow down query processing.
§ Variable byte alignment is potentially more efficient.
§ Regardless of efficiency, variable byte is conceptually simpler, at little additional space cost.
Gamma Code: Encode
[Figure: encoding procedure, taken from en.wikipedia.org/wiki/Elias_gamma_coding]
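The slide showed the encoding algorithm from the Wikipedia page; here is a minimal Python sketch of gamma encoding (my own illustration, not the slide's code):

```python
def gamma_encode(g: int) -> str:
    """Elias gamma code of g >= 1 as a bit string: unary(length) + offset."""
    binary = bin(g)[2:]                 # e.g., 13 -> '1101'
    offset = binary[1:]                 # chop off the leading bit: '101'
    length = "1" * len(offset) + "0"    # unary code of the offset length: '1110'
    return length + offset              # '1110101'

assert gamma_encode(13) == "1110101"
assert gamma_encode(1) == "0"           # empty offset; unary code of 0
```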
Gamma Code: Decode
[Figure: decoding procedure, taken from en.wikipedia.org/wiki/Elias_gamma_coding]
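And a matching decoder sketch under the same assumptions:

```python
def gamma_decode(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes back into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # read the unary length part
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2))  # re-attach the chopped leading bit
    return gaps

# '1110101' = 13, '0' = 1, '101' = 3:
assert gamma_decode("1110101" + "0" + "101") == [13, 1, 3]
```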
Shannon Limit
§ Is it possible to derive codes that are optimal (under certain assumptions)?
§ What is the optimal average code length for a code that encodes each integer (gap length) independently?
§ Lower bound on the average code length: the Shannon entropy H(X) = −Σ_x P(x) log2 P(x).
§ Asymptotically optimal codes (for finite alphabets): arithmetic coding, Huffman codes.
Bernoulli Model
§ Assumption: term occurrences are Bernoulli events.
§ Notation:
  § n: number of documents
  § m: number of terms in the vocabulary
  § N: total number of (unique) term-document occurrences
§ Probability of term t_j occurring in document d_i: p = N/(nm).
§ Each term-document occurrence is an independent event.
§ The probability of a gap of length x is then given by the geometric distribution: P(gap = x) = (1 − p)^(x−1) · p.
Local Bernoulli Model
§ If the length of each postings list is known, a Bernoulli model can be fit to each individual inverted list.
§ Frequent words are then coded with a smaller parameter b, infrequent words with a larger b (see the Golomb sketch below).
§ The term frequency needs to be encoded as well (use a gamma code for it).
§ The local Bernoulli model outperforms the global Bernoulli model in practice (method of choice!).
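The Golomb code slides themselves did not survive in this transcript, so for context here is a minimal sketch of Golomb encoding, the standard code for the geometric gap distribution above. The construction (quotient in unary, remainder in truncated binary) and the parameter rule b ≈ ln(2)/p are the textbook treatment, stated here as an assumption about what the missing slides covered:

```python
import math

def to_bits(value: int, width: int) -> str:
    return format(value, "b").zfill(width) if width > 0 else ""

def golomb_encode(x: int, b: int) -> str:
    """Golomb code of gap x >= 1 with parameter b: unary quotient,
    truncated-binary remainder."""
    q, r = (x - 1) // b, (x - 1) % b
    k = (b - 1).bit_length()       # ceil(log2 b)
    u = (1 << k) - b               # remainders that get the short (k-1)-bit form
    unary = "1" * q + "0"
    return unary + (to_bits(r, k - 1) if r < u else to_bits(r + u, k))

def golomb_parameter(p: float) -> int:
    """Standard choice b ~ ln(2)/p for gap probability p (Bernoulli model)."""
    return max(1, round(math.log(2) / p))

for gap in (1, 2, 3, 4):
    print(gap, golomb_encode(gap, b=3))  # -> 00, 010, 011, 100
```

With a small b, the short gaps of frequent terms get short codes; with a large b, the code spends fewer unary bits on the long gaps of infrequent terms.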
Block-Based Index Format
A block-based, variable-length format reduces both space and CPU cost.
It reduced index size by ~30% and is much faster to decode.
CPU Optimized Compression
The block index format gives very good compression, but is CPU-intensive to decode.
A better format: a single flat position space.
§ Data structures on the side keep track of document boundaries.
§ Posting lists are just lists of delta-encoded positions.
§ These need to be compact (we can't afford a 32-bit value per occurrence) ...
§ ... but they also need to be very fast to decode.
Improved Byte-Aligned Variable-Length Encodings
Varint encoding:
§ 7 bits per byte, with a continuation bit.
§ Con: decoding requires lots of branches, shifts, and masks.
Idea: encode the byte length in the low 2 bits (see the sketch below).
§ Better: fewer branches, shifts, and masks.
§ Con: limited to 30-bit values, and still some shifting to decode.
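A minimal sketch of the second idea (an illustration, not any production format: I assume the low 2 bits of the first byte store the byte count minus one and the remaining bits store the value, little-endian):

```python
def tagged_encode(n: int) -> bytes:
    """Encode n < 2**30; the low 2 bits of the first byte hold (num_bytes - 1)."""
    assert 0 <= n < (1 << 30), "limited to 30-bit values"
    tagged = n << 2                          # make room for the 2-bit length tag
    num_bytes = max(1, (tagged.bit_length() + 7) // 8)
    tagged |= num_bytes - 1                  # store the byte length in the tag
    return tagged.to_bytes(num_bytes, "little")

def tagged_decode(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos): one length read, no per-byte branching."""
    num_bytes = (data[pos] & 0b11) + 1
    tagged = int.from_bytes(data[pos:pos + num_bytes], "little")
    return tagged >> 2, pos + num_bytes

for v in (0, 63, 64, 100_000, (1 << 30) - 1):
    buf = tagged_encode(v)
    assert tagged_decode(buf) == (v, len(buf))
```

Compared with varint's per-byte continuation test, the decoder learns the full byte length from a single 2-bit tag, which removes the data-dependent branches at the cost of capping values at 30 bits.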