Inverted File Compression in Managing Gigabytes. Course: Information Retrieval. Lecture: Pusan National University...

Post on 05-Jan-2016


Inverted File Compression

In Managing Gigabytes

Course: Information Retrieval
Lecture: Kwon Hyuk-chul (권혁철), Pusan National University

Inverted File Compression

• Inverted file entry
– $\langle t; f_t; [d_1, d_2, \ldots, d_{f_t}] \rangle$
• $t$ : term, $f_t$ : number of documents containing $t$
• $d_k$ : document number, where $d_k < d_{k+1}$
– < elephant; 8; [3, 5, 20, 21, 23, 76, 77, 78] >
=> < elephant; 8; [3, 2, 15, 1, 2, 53, 1, 1] >
• gap = $d_{k+1} - d_k$
• Two compression classes
– Global methods vs. local methods
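The gap transformation in the elephant entry can be sketched in a few lines of Python. The function names `to_gaps` and `from_gaps` are illustrative only, and the sketch assumes document numbers start at 1, so that the first gap is simply $d_1$.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Turn a strictly increasing list of document numbers into gaps."""
    gaps = []
    previous = 0  # assumption: numbering starts at 1, so the first gap is d_1
    for d in doc_ids:
        if d <= previous:
            raise ValueError("document numbers must be strictly increasing")
        gaps.append(d - previous)
        previous = d
    return gaps


def from_gaps(gaps):
    """Recover the document numbers as running sums of the gaps."""
    return list(accumulate(gaps))


doc_ids = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(doc_ids)
print(gaps)             # [3, 2, 15, 1, 2, 53, 1, 1]
print(from_gaps(gaps))  # [3, 5, 20, 21, 23, 76, 77, 78]
```

Small gaps dominate for frequent terms, which is why the codes on the following slides give short codewords to small integers.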

Summary of coding methods

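Gap x   Unary code    γ code     δ code       Golomb code
                                              b=3       b=6
1       0             1          1            0 0       0 00
2       10            010        010 0        0 10      0 01
3       110           011        010 1        0 11      0 100
4       1110          00100      011 00       10 0      0 101
5       11110         00101      011 01       10 10     0 110
6       111110        00110      011 10       10 11     0 111
7       1111110       00111      011 11       110 0     10 00
8       11111110      0001000    00100 000    110 10    10 01
9       111111110     0001001    00100 001    110 11    10 100
10      1111111110    0001010    00100 010    1110 0    10 101

Note: the γ and δ columns are written in the 0-prefixed form. The examples on the γ and δ slides use the same codes with the unary prefix bits inverted, e.g. 1110 001 instead of 0001 001 for x = 9.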

Unary code

• Simple method
– fixed-length binary representation of a positive integer
– $\lceil \log_2 N \rceil$ bits
• Unary code
– a gap $x$ is represented by $x - 1$ one-bits followed by a single zero-bit
– $l_x = (x - 1) + 1 = x$, $\Pr[x] = 2^{-x}$
– e.g., $x = 9$ => 11111111 0
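A minimal sketch of the unary code as defined above, using strings of "0" and "1" for readability rather than packed bits. The names `unary_encode` and `unary_decode` are hypothetical helpers, not taken from the slides.

```python
def unary_encode(x):
    """Encode a positive integer as x-1 one-bits followed by a zero-bit."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return "1" * (x - 1) + "0"


def unary_decode(bits, pos=0):
    """Decode one unary codeword starting at pos; return (value, next_pos)."""
    end = bits.index("0", pos)  # position of the terminating zero-bit
    return end - pos + 1, end + 1


print(unary_encode(9))                   # 111111110
print(unary_decode("111111110" + "10"))  # (9, 9)
```

The codeword length equals $x$, so this code is only attractive when almost all gaps are very small.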

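γ code

• γ code
– a unary code for $1 + \lfloor \log_2 x \rfloor$, followed by a binary code of $\lfloor \log_2 x \rfloor$ bits for $x - 2^{\lfloor \log_2 x \rfloor}$
– $l_x = 1 + 2\lfloor \log_2 x \rfloor$, $\Pr[x] \approx 1 / (2x^2)$
– e.g., $x = 9$: $\lfloor \log_2 x \rfloor = 3$, $x - 2^{\lfloor \log_2 x \rfloor} = 1$ => 1110 001
– $V = \langle 1, 2, 4, 8, 16, \ldots \rangle$ or $V = \langle 1, 2, 2, 4, 4, 4, 8, \ldots \rangle$ or ….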
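The γ code construction can be sketched as follows, in the 1-prefixed form used by the x = 9 example. The function names are illustrative, and bit strings are used instead of a packed bit stream to keep the sketch short.

```python
def gamma_encode(x):
    """Unary code for 1 + floor(log2 x), then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    k = x.bit_length() - 1                 # floor(log2 x)
    prefix = "1" * k + "0"                 # unary code for the value 1 + k
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return prefix + suffix


def gamma_decode(bits, pos=0):
    """Decode one gamma codeword starting at pos; return (value, next_pos)."""
    k = 0
    while bits[pos] == "1":                # count the one-bits of the prefix
        k += 1
        pos += 1
    pos += 1                               # skip the terminating zero-bit
    remainder = int(bits[pos:pos + k], 2) if k else 0
    return (1 << k) + remainder, pos + k


print(gamma_encode(9))          # 1110001
print(gamma_decode("1110001"))  # (9, 7)
```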

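δ code

• δ code
– similar in form to the γ code
– uses a γ code instead of the unary code for $1 + \lfloor \log_2 x \rfloor$, followed by a binary code of $\lfloor \log_2 x \rfloor$ bits for $x - 2^{\lfloor \log_2 x \rfloor}$
– $l_x = 1 + 2\lfloor \log_2 (1 + \lfloor \log_2 x \rfloor) \rfloor + \lfloor \log_2 x \rfloor$, $\Pr[x] \approx 1 / (2x (\log x)^2)$
– e.g., $x = 9$ => 11000 001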
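A sketch of the δ code built on top of a γ coder, again in the 1-prefixed form of the x = 9 example. All function names are illustrative. The γ helpers are repeated so that the block runs on its own.

```python
def gamma_encode(x):
    """Gamma code: unary code for 1 + floor(log2 x), then the low bits."""
    k = x.bit_length() - 1
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return "1" * k + "0" + suffix


def gamma_decode(bits, pos=0):
    k = 0
    while bits[pos] == "1":
        k += 1
        pos += 1
    pos += 1
    remainder = int(bits[pos:pos + k], 2) if k else 0
    return (1 << k) + remainder, pos + k


def delta_encode(x):
    """Gamma code for 1 + floor(log2 x), then floor(log2 x) binary bits."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    k = x.bit_length() - 1
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return gamma_encode(k + 1) + suffix


def delta_decode(bits, pos=0):
    """Decode one delta codeword starting at pos; return (value, next_pos)."""
    k_plus_1, pos = gamma_decode(bits, pos)
    k = k_plus_1 - 1
    remainder = int(bits[pos:pos + k], 2) if k else 0
    return (1 << k) + remainder, pos + k


print(delta_encode(9))  # 11000001
```

For small gaps the δ code is no shorter than the γ code. It only becomes shorter for large gaps, because the length prefix grows with $\log \log x$ rather than $\log x$.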

Global Bernoulli model

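• $\Pr[x] = (1-p)^{x-1} p$, where $p = f / (N \cdot n)$ is the probability that a given term occurs in a given document ($N$ documents, $n$ distinct terms, $f$ pointers in total)
• Golomb code
– a unary code of $q + 1$ bits followed by a binary code of $\lfloor \log_2 b \rfloor$ or $\lceil \log_2 b \rceil$ bits for $r$
– $q = \lfloor (x - 1) / b \rfloor$, $r = x - q \cdot b - 1$
– $b^A = \lceil \log(2 - p) / -\log(1 - p) \rceil \approx 0.69 \cdot (N \cdot n / f)$
– e.g., b = 3: r = 0 (0), 1 (10), 2 (11)
b = 6: r = 0 (00), 1 (01), 2 (100), 3 (101), 4 (110), 5 (111)
for x = 9 and b = 3: q = 2, r = 2, hence 110 11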
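The Golomb construction above can be sketched as below. The helper names (`truncated_binary`, `golomb_encode`, `golomb_parameter`) are illustrative. The remainder is coded with the shorter $\lfloor \log_2 b \rfloor$-bit codewords assigned to the smallest remainders, which is the assignment that reproduces the b = 3 and b = 6 examples.

```python
import math


def truncated_binary(r, b):
    """Code r in 0..b-1 using floor(log2 b) or ceil(log2 b) bits."""
    if b == 1:
        return ""                       # a single remainder needs no bits
    k = (b - 1).bit_length()            # ceil(log2 b)
    short = (1 << k) - b                # how many remainders get k-1 bits
    if r < short:
        return format(r, "b").zfill(k - 1)
    return format(r + short, "b").zfill(k)


def golomb_encode(x, b):
    """Unary code for q + 1, then the truncated binary code of r."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be positive integers")
    q = (x - 1) // b
    r = x - q * b - 1
    return "1" * q + "0" + truncated_binary(r, b)


def golomb_parameter(p):
    """b = ceil(log(2 - p) / -log(1 - p)) for 0 < p < 1."""
    return math.ceil(math.log(2 - p) / -math.log(1 - p))


print(golomb_encode(9, 3))  # 11011
print(golomb_encode(9, 6))  # 10100
```

With b = 1 the remainder part disappears and the code reduces to the unary code.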

Global “observed frequency” model

• Based on the observed frequencies of gap sizes
• Uses arithmetic or Huffman coding
• In theory
– a better compression method
• In practice
– only slightly better than the γ and δ codes
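One way to see what an observed-frequency model can achieve is to compute the entropy of the observed gap distribution, which is the average codeword length an ideal arithmetic coder would approach. The sketch below is only an illustration under simplifying assumptions: it uses the gaps of the single elephant entry rather than a whole index, and it ignores the cost of storing the frequency table. The function name is hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_gap(gaps):
    """Empirical entropy (bits per gap) of an observed gap distribution."""
    counts = Counter(gaps)
    total = len(gaps)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


gaps = [3, 2, 15, 1, 2, 53, 1, 1]
print(round(entropy_bits_per_gap(gaps), 3))  # 2.156
```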

Local Bernoulli model

• The frequency of term t, $f_t$, is known
– so a Bernoulli model can be used on each individual inverted file entry
• Very common words are encoded with b = 1
– tantamount to a bitvector
– thus an inverted file can never be worse than a bitvector
• Necessary to store the parameter $f_t$
– so that b can be derived from it during decoding
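A rough sketch of how a per-term parameter could be derived from $f_t$. It assumes, as an illustration, that the local model sets $p = f_t / N$ and reuses the Golomb parameter formula from the global model. The collection size of 100 documents in the usage lines is made up for the example.

```python
import math


def local_golomb_parameter(f_t, n_docs):
    """Golomb parameter for one term, assuming p = f_t / n_docs."""
    p = f_t / n_docs
    if p >= 1.0:
        return 1                        # term occurs in every document
    return max(1, math.ceil(math.log(2 - p) / -math.log(1 - p)))


def bits_with_b1(gaps):
    """With b = 1 each gap x costs exactly x bits (x-1 ones and a zero)."""
    return sum(gaps)


print(local_golomb_parameter(8, 100))   # 8
print(local_golomb_parameter(50, 100))  # 1
print(bits_with_b1([3, 2, 15, 1, 2, 53, 1, 1]))  # 78
```

With b = 1 the total cost of an entry equals its last document number, which is at most the number of documents. This is the sense in which the entry is never worse than a bitvector.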

Skewed Bernoulli model

• Vector for the Bernoulli model: $V_G = \langle b, b, b, \ldots \rangle$
• Skewed vector: $V_T = \langle b, 2b, 4b, \ldots, 2^i b, \ldots \rangle$

• slightly worse than the Golomb code
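The vectors on this slide can be read as bucket sizes: bucket k covers the next $v_k$ gap values, the bucket index is sent in unary, and the offset within the bucket in binary. The sketch below is written under that reading. The function names are illustrative, and the offset is coded with the same truncated binary code as the Golomb remainder. Under these assumptions the vector ⟨1, 2, 4, 8, …⟩ reproduces the γ code and ⟨b, b, b, …⟩ reproduces the Golomb code.

```python
from itertools import count, repeat


def truncated_binary(r, size):
    """Code r in 0..size-1 using floor(log2 size) or ceil(log2 size) bits."""
    if size == 1:
        return ""
    k = (size - 1).bit_length()
    short = (1 << k) - size
    if r < short:
        return format(r, "b").zfill(k - 1)
    return format(r + short, "b").zfill(k)


def vector_encode(x, bucket_sizes):
    """Unary bucket index, then the offset of x inside that bucket."""
    base = 0                                 # largest value covered so far
    for k, size in enumerate(bucket_sizes, start=1):
        if x <= base + size:
            return "1" * (k - 1) + "0" + truncated_binary(x - base - 1, size)
        base += size
    raise ValueError("x is outside the range covered by bucket_sizes")


def gamma_vector():
    return (1 << i for i in count())         # <1, 2, 4, 8, ...>


def golomb_vector(b):
    return repeat(b)                         # <b, b, b, ...>


def skewed_vector(b):
    return (b << i for i in count())         # <b, 2b, 4b, ...>


print(vector_encode(9, gamma_vector()))    # 1110001
print(vector_encode(9, golomb_vector(3)))  # 11011
print(vector_encode(9, skewed_vector(3)))  # 10111
```

Because the buckets of the skewed vector double in size, large gaps need far fewer unary bits than with a fixed b, at the price of longer offsets.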

[Figure: word positions in the Bible: (a) bridegroom; (b) Jezebel; (c) twelfth]

Local hyperbolic model

• $\Pr[x] = c / x$ for $x = 1, 2, \ldots, m$, with normalising constant $c = 1 / (\log_e(m+1) + 0.5772)$
– $m$ is the largest gap
• Better performance
• More complex to implement
• Requires the use of arithmetic coding

Mean gap: $\sum_{x=1}^{m} x \Pr[x] = \frac{m}{\log_e(m+1) + 0.5772} = \frac{N}{f_t}$
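A small numerical sketch of the hyperbolic model. It assumes that $m$ is chosen so that the model's mean gap, $m / (\log_e(m+1) + 0.5772)$, reaches the average gap $N / f_t$, and it finds the smallest such integer by bisection. The function names and the target value 12.5 (for example N = 100, $f_t$ = 8) are illustrative.

```python
import math

EULER_CONSTANT = 0.5772  # the constant used on the slide


def hyperbolic_probs(m):
    """Pr[x] = c / x for x = 1..m with c = 1 / (ln(m + 1) + 0.5772)."""
    c = 1.0 / (math.log(m + 1) + EULER_CONSTANT)
    return [c / x for x in range(1, m + 1)]


def mean_gap(m):
    """Mean of the model: sum of x * Pr[x], which simplifies to m * c."""
    return m / (math.log(m + 1) + EULER_CONSTANT)


def solve_m(target_mean, upper=10**9):
    """Smallest integer m whose model mean is at least target_mean."""
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if mean_gap(mid) >= target_mean:
            hi = mid
        else:
            lo = mid + 1
    return lo


probs = hyperbolic_probs(100)
print(sum(probs))         # close to 1, since the constant is an approximation
m = solve_m(12.5)
print(m, mean_gap(m))
```

The probabilities do not sum to exactly 1 because the constant approximates the harmonic sum. An arithmetic coder would need them renormalised.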

Local “observed frequency” model

• The ultimate in local modeling
• Batched frequency
• Requires more memory space
• Best compression method

Performance of Index Compression Methods

Method                  Bits per pointer
                        Bible    GNUbib   Comact   TREC
Global methods
  Unary                 264      920      490      1719
  Binary                15.00    16.00    18.00    20.00
  Bernoulli             9.67     11.65    10.58    12.61
  γ                     6.55     5.69     4.48     6.43
  δ                     6.26     5.08     4.36     6.19
  Observed frequency    5.92     4.83     4.21     5.83
Local methods
  Bernoulli             6.13     6.17     5.40     5.73
  Hyperbolic            5.77     5.17     4.65     5.74
  Skewed Bernoulli      5.68     4.71     4.24     5.28
  Batched frequency     5.61     4.65     4.03     5.27

Compression of bitmaps

• Bitmaps: compressed with the hierarchical bitvector compression technique

[Figure: (a) original bitvector; (b) hierarchical structure; (c) flattened tree as a string of bits]
