Web Algorithmics Dictionary-based compressors

Upload: dante

Post on 12-Jan-2016


Page 1: Web Algorithmics

Web Algorithmics

Dictionary-based compressors

Page 2: Web Algorithmics

LZ77

Algorithm’s step:
Output <dist, len, next-char>
Advance by len + 1

A buffer “window” has fixed length and moves

[Figure: the text a a c a a c a b c a a a a a a c with the sliding window drawn over it. The dictionary consists of all substrings starting inside the window. Codewords shown in the example: <6,3,a> and <3,4,c>.]
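The encoding step can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `lz77_encode`, the `window` parameter and the brute-force search are assumptions made for this sketch, and ties between equally long matches are broken in favour of the farthest candidate.

```python
def lz77_encode(text, window=6):
    """Greedy LZ77 sketch: emit (dist, len, next_char) triples."""
    out = []
    cursor = 0
    while cursor < len(text):
        best_dist, best_len = 0, 0
        start = max(0, cursor - window)          # left end of the sliding window
        for cand in range(start, cursor):        # every substring starting in the window
            length = 0
            # Keep one char in reserve for next_char; the match may run past the cursor.
            while (cursor + length < len(text) - 1
                   and text[cand + length] == text[cursor + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = cursor - cand, length
        out.append((best_dist, best_len, text[cursor + best_len]))
        cursor += best_len + 1                   # advance by len + 1
    return out
```

The scan costs time proportional to the window size at every step, which is why practical tools index the window (e.g. with a hash table).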

Page 3: Web Algorithmics

LZ77 Decoding

Decoder keeps same dictionary window as encoder. Finds substring and inserts a copy of it

What if len > dist? (The copy overlaps the text still to be decoded.)
E.g. seen = abcd, next codeword is <2,9,e>
Simply copy left to right, starting at the cursor:

for (i = 0; i < len; i++) out[cursor+i] = out[cursor-dist+i];

Output is correct: abcdcdcdcdcdce
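A small Python sketch of the decoder, assuming the codewords arrive as (dist, len, next_char) tuples; the name `lz77_decode` is illustrative. Copying one character at a time is what makes the overlapping case work, because the loop may read characters it appended a moment earlier.

```python
def lz77_decode(triples):
    """Rebuild the text from (dist, len, next_char) codewords."""
    out = []
    for dist, length, next_char in triples:
        start = len(out) - dist
        for i in range(length):
            # When length > dist this reads chars written earlier in this same loop.
            out.append(out[start + i])
        out.append(next_char)
    return "".join(out)


# The example from the slide: seen = abcd, next codeword <2,9,e>.
seen = [(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 0, "d")]
decoded = lz77_decode(seen + [(2, 9, "e")])
```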

Page 4: Web Algorithmics

Lempel-Ziv Algorithms

Keep a “dictionary” of recently-seen strings.

The differences are:
- How the dictionary is stored
- How it is extended
- How it is indexed
- How elements are removed

LZ-algos are asymptotically optimal, i.e. their compression ratio goes to H(S) for n → ∞ !!

No explicit frequency estimation

Page 5: Web Algorithmics

You find this at: www.gzip.org/zlib/

Page 6: Web Algorithmics

LZ77 Optimizations used by gzip

LZSS: Output one of the following formats: (0, position, length) or (1, char)

Typically uses the second format if length < 3.

Special greedy: possibly use shorter match so that next match is better

Hash table to speed up searches on triplets

Triples are coded with Huffman’s code
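To see the zlib library at work, here is a short sketch using Python's standard `zlib` module, which wraps it. The sample input and the variable names are assumptions chosen only for illustration.

```python
import zlib

# Highly repetitive input, so LZ77-style matching has plenty to exploit.
data = b"mississippi" * 100

packed = zlib.compress(data, 9)      # 9 = strongest (and slowest) compression level
restored = zlib.decompress(packed)

print(len(data), "->", len(packed), "bytes")
```

The round trip is lossless, and on input like this the compressed stream is far shorter than the original.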

Page 7: Web Algorithmics

Web Algorithmics

Some special compressors

Spatial vs Temporal Locality

Page 8: Web Algorithmics

γ-code for integer encoding

Codeword layout: 000...0 (Length-1 zeros) followed by x in binary (Length bits)

x > 0 and Length = $\lfloor \log_2 x \rfloor + 1$

e.g., 9 represented as <000,1001>.

γ-code for x takes $2 \lfloor \log_2 x \rfloor + 1$ bits (i.e. a factor of 2 from optimal)

Optimal for $\Pr(x) = 1/(2x^2)$, and i.i.d. integers
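A minimal Python sketch of the encoder, returning the codeword as a string of '0'/'1' characters for readability (a real coder would pack bits); the name `gamma_encode` is an assumption.

```python
def gamma_encode(x):
    """Return the gamma code of x > 0 as a bit string."""
    if x <= 0:
        raise ValueError("gamma code is defined for x > 0 only")
    binary = bin(x)[2:]                        # x in binary, Length bits
    return "0" * (len(binary) - 1) + binary    # Length-1 zeros, then x in binary
```

For instance, `gamma_encode(9)` yields the 7 bits 000 followed by 1001, matching the example on the slide.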

Page 9: Web Algorithmics

It is a prefix-free encoding…

Given the following sequence of coded integers, reconstruct the original sequence:

0001000001100110000011101100111

Answer: 8, 6, 3, 59, 7
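The exercise can be checked with a small decoder. This is a sketch that assumes a well-formed input given as a string of '0'/'1' characters; `gamma_decode` is an illustrative name. Prefix-freeness is what lets it find the codeword boundaries without separators: count the leading zeros, then read that many bits plus one.

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values = []
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":          # unary part: Length-1 zeros
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))   # binary part: Length bits
        pos += zeros + 1
    return values


decoded = gamma_decode("0001000001100110000011101100111")
```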

Page 10: Web Algorithmics

Streaming compression

Still, you need to determine and sort all terms... Can we do everything in one pass?

Move-to-Front (MTF):
- As a freq-sorting approximator
- As a caching strategy
- As a compressor

Run-Length-Encoding (RLE): FAX compression

Page 11: Web Algorithmics

Move to Front Coding

Transforms a char sequence into an integer sequence, that can then be var-length coded

Start with the list of symbols L=[a,b,c,d,…]
For each input symbol s
1) output the position of s in L
2) move s to the front of L

Properties: Exploit temporal locality, and it is dynamic
$X = 1^n 2^n 3^n \cdots n^n$
Huff = $O(n^2 \log n)$, MTF = $O(n \log n) + n^2$

There is a memory
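The two steps translate almost literally into Python. This sketch uses 0-based positions and takes the initial list as a parameter; the function names are assumptions. The decoder mirrors the encoder, since it maintains the same list.

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: map each symbol to its current (0-based) list position."""
    symbols = list(alphabet)
    out = []
    for s in text:
        pos = symbols.index(s)                 # 1) output the position of s in L
        out.append(pos)
        symbols.insert(0, symbols.pop(pos))    # 2) move s to the front of L
    return out


def mtf_decode(positions, alphabet):
    """Inverse transform: replay the same list updates."""
    symbols = list(alphabet)
    out = []
    for pos in positions:
        s = symbols.pop(pos)
        out.append(s)
        symbols.insert(0, s)
    return "".join(out)
```

Runs of equal symbols become runs of zeros, which is where the temporal locality pays off.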

Page 12: Web Algorithmics

MTF: how good is it ?

Encode the integers via γ-coding: $|\gamma(i)| \le 2 \log i + 1$

Put Σ in the front of the list and consider the cost of encoding:

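$$O(|\Sigma| \log |\Sigma|) + \sum_{x=1}^{|\Sigma|} \sum_{i=2}^{n_x} \left| \gamma\left(p_i^x - p_{i-1}^x\right) \right|$$

where $n_x$ is the number of occurrences of symbol $x$ in $X$, $p_i^x$ is the position of its $i$-th occurrence, and $N = |X|$.

By Jensen’s:

$$\le O(|\Sigma| \log |\Sigma|) + \sum_{x=1}^{|\Sigma|} n_x \left[ 2 \log \frac{N}{n_x} + 1 \right]$$

$$= O(|\Sigma| \log |\Sigma|) + N \left[ 2 H_0(X) + 1 \right]$$

Not much worse than Huffman... but it may be far better

$$L_a[\mathrm{mtf}] \le 2 H_0(X) + O(1)$$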

Page 13: Web Algorithmics

Run Length Encoding (RLE)

If spatial locality is very high, then

abbbaacccca => (a,1),(b,3),(a,2),(c,4),(a,1)

In the case of binary strings, the run lengths plus one bit (the first symbol) suffice

Properties:

Exploit spatial locality, and it is a dynamic code

$X = 1^n 2^n 3^n \cdots n^n$

Huff(X) = $O(n^2 \log n)$ > Rle(X) = $O(n (1 + \log n))$

There is a memory
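A compact Python sketch of RLE on character strings, built on `itertools.groupby` from the standard library; the function names are illustrative.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each maximal run of equal chars into a (char, run_length) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Expand (char, run_length) pairs back into the text."""
    return "".join(ch * count for ch, count in pairs)


pairs = rle_encode("abbbaacccca")
```

Applied to the string from the slide, `pairs` holds the five (char, length) pairs listed there.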

Page 14: Web Algorithmics

Web Algorithmics

Burrows-Wheeler Transform

Page 15: Web Algorithmics

The big (unconscious) step...

Page 16: Web Algorithmics

The Burrows-Wheeler Transform (1994)

We are given a text T = mississippi#

All the rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows:

F              L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i
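The definition can be run directly in Python. This sketch builds all rotations and sorts them, which is fine for a toy string but quadratic in space; it relies on '#' sorting before the letters, which holds in ASCII. The name `bwt_by_rotations` is an assumption.

```python
def bwt_by_rotations(text):
    """Return the columns (F, L) of the sorted rotation matrix of text."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    first = "".join(row[0] for row in rotations)    # column F
    last = "".join(row[-1] for row in rotations)    # column L, the BWT
    return first, last


F, L = bwt_by_rotations("mississippi#")
```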

Page 17: Web Algorithmics

A famous example

Much longer...

Page 18: Web Algorithmics

Compressing L seems promising...

Key observation: L is locally homogeneous, so L is highly compressible

Algorithm Bzip:
1. Move-to-Front coding of L
2. Run-Length coding
3. Statistical coder

Bzip vs. Gzip: 20% vs. 33%, but Bzip is slower in (de)compression!

Page 19: Web Algorithmics

How to compute the BWT ?

SA   BWT matrix    L
12   #mississipp   i
11   i#mississip   p
8    ippi#missis   s
5    issippi#mis   s
2    ississippi#   m
1    mississippi   #
10   pi#mississi   p
9    ppi#mississ   i
7    sippi#missi   s
4    sissippi#mi   s
6    ssippi#miss   i
3    ssissippi#m   i

Example: SA[3] = 8, so L[3] = T[ 7 ] = s

We said that: L[i] precedes F[i] in T

Given SA and T, we have L[i] = T[SA[i]-1] (when SA[i] = 1, take the last char of T, i.e. L[i] = T[n])
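The formula in Python, as a sketch that keeps the slide's 1-based suffix array while Python strings are 0-based, so the 1-based position SA[i]-1 becomes index SA[i]-2. The name `bwt_from_sa` is an assumption. A convenient side effect: for SA[i] = 1 the index is -1, which Python reads as the last character, exactly the wrap-around that case needs.

```python
def bwt_from_sa(text, sa):
    """Compute L from T and its 1-based suffix array: L[i] = T[SA[i] - 1]."""
    # 1-based position SA[i]-1 is 0-based index SA[i]-2; index -1 wraps to the last char.
    return "".join(text[start - 2] for start in sa)


SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
L = bwt_from_sa("mississippi#", SA)
```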

Page 20: Web Algorithmics

How to construct SA from T ?

Input: T = mississippi#

SA   sorted suffixes
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Elegant but inefficient

Obvious inefficiencies:
• $\Theta(n^2 \log n)$ time in the worst-case
• $\Theta(n \log n)$ cache misses or I/O faults
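The "elegant but inefficient" construction is just a sort of all suffixes. A Python sketch, with 1-based starting positions as on the slide; `naive_suffix_array` is an illustrative name.

```python
def naive_suffix_array(text):
    """Sort all suffixes explicitly; each comparison may scan O(n) chars."""
    suffixes = [(text[i:], i + 1) for i in range(len(text))]   # (suffix, 1-based start)
    suffixes.sort()
    return [start for _, start in suffixes]


SA = naive_suffix_array("mississippi#")
```

Materializing every suffix also costs quadratic space here; sorting indices with a comparison key avoids the copies but not the worst-case time.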

Page 21: Web Algorithmics

A useful tool: L → F mapping

F  unknown     L
#  mississipp  i
i  #mississip  p
i  ppi#missis  s
i  ssippi#mis  s
i  ssissippi#  m
m  ississippi  #
p  i#mississi  p
p  pi#mississ  i
s  ippi#missi  s
s  issippi#mi  s
s  sippi#miss  i
s  sissippi#m  i

How do we map L’s onto F’s chars ?

... Need to distinguish equal chars in F...

Take two equal L’s chars

Rotate rightward their rows

Same relative order !!

Page 22: Web Algorithmics

The BWT is invertible

T = .... #

[Figure: the sorted rotation matrix of T, with first column F = #iiiimppssss and last column L = ipssm#pissii; the middle columns are unknown.]

Two key properties:

1. LF-array maps L’s to F’s chars

2. L[ i ] precedes F[ i ] in T

Reconstruct T backward: T = ....ippi#

InvertBWT(L)

  Compute LF[0,n-1];
  T[n] = '#'; r = 0; i = n-1;   // row 0 starts with #, so L[0] is the char preceding # in T
  while (i > 0) {
    T[i] = L[r];                // L[r] precedes F[r] in T
    r = LF[r];                  // jump to the row that starts with that char
    i--;
  }
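A runnable Python sketch of the inversion. It assumes the end marker is the unique smallest character of the text, so that row 0 starts with it, and it obtains LF through a stable sort, which encodes the "same relative order" property: the k-th occurrence of a char in L maps to its k-th occurrence in F. Function names are illustrative.

```python
def lf_mapping(last):
    """LF[r] = row of F holding the same char occurrence as L[r]."""
    # Python's sort is stable, so equal chars keep their relative order.
    order = sorted(range(len(last)), key=lambda r: last[r])
    lf = [0] * len(last)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    return lf


def invert_bwt(last):
    """Rebuild T backward from L (assumption: end marker is the smallest char)."""
    n = len(last)
    lf = lf_mapping(last)
    text = [""] * n
    text[n - 1] = min(last)          # the end marker closes T
    r = 0                            # row 0 starts with the end marker
    for i in range(n - 2, -1, -1):
        text[i] = last[r]            # L[r] precedes F[r] in T
        r = lf[r]
    return "".join(text)


T = invert_bwt("ipssm#pissii")
```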

Page 23: Web Algorithmics

An encoding example

T = mississippimississippimississippi

L = ipppssssssmmmii#pppiiissssssiiiiii

# at 16

Mtf-list = [i,m,p,s]

Mtf = 020030000030030300100300000100000

Mtf = 030040000040040400200400000200000 (nonzero values shifted by one: alphabet of |Σ|+1 symbols)

RLE0 = 03141041404121410210 (Wheeler’s code: a run of 5 zeros is coded as Bin(6) = 110 without its leading 1, i.e. 10)

Bzip2-output = Arithmetic/Huffman on |Σ|+1 symbols...

... plus γ(16), plus the original Mtf-list (i,m,p,s)
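The steps of this example can be replayed with a short Python sketch. It assumes that '#' is removed from L before Move-to-Front (its position is stored separately), that nonzero MTF values are shifted by one, and that each run of zeros of length k is written as Bin(k+1) without its leading 1. All function names are illustrative, and single-digit output symbols work only for small alphabets like this one.

```python
def mtf_positions(text, symbols):
    """0-based Move-to-Front positions, starting from the given list."""
    symbols = list(symbols)
    out = []
    for ch in text:
        pos = symbols.index(ch)
        out.append(pos)
        symbols.insert(0, symbols.pop(pos))
    return out


def wheeler_run(length):
    """Wheeler's code for a run of zeros: Bin(length + 1) minus its leading 1."""
    return bin(length + 1)[3:]       # bin() yields '0b1...', so skip 3 chars


def rle0(mtf):
    """Shift nonzero values by one and code the runs of zeros with Wheeler's code."""
    out = []
    run = 0
    for value in mtf:
        if value == 0:
            run += 1
            continue
        if run:
            out.append(wheeler_run(run))
            run = 0
        out.append(str(value + 1))
    if run:
        out.append(wheeler_run(run))
    return "".join(out)


L = "ipppssssssmmmii#pppiiissssssiiiiii"
marker_position = L.index("#") + 1               # 1-based position of the end marker
mtf = mtf_positions(L.replace("#", ""), "imps")
encoded = rle0(mtf)
```

The string `encoded` is what the final statistical coder would receive, together with the marker position and the initial Mtf-list.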

Page 24: Web Algorithmics

You find this in your Linux distribution