Lempel-Ziv methods
Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006 2
Dictionary models - I
Dictionary-based compression methods use the principle of replacing substrings in a message with a codeword that identifies each substring in a dictionary, or codebook
The dictionary contains a list of substrings and their associated codewords
Unlike symbolwise methods, dictionary methods often use fixed codewords rather than explicit probability distributions
Dictionary models - II
For example, we can insert into the dictionary the full set of 8-bit ASCII characters (how many?) and the 256 most common pairs of characters
If we use fixed-length codewords, how many bits do we need to index the dictionary entries?
SOL. 9 bits (256 + 256 = 512 entries)
What about the performance in bits/character in the best and in the worst case?
SOL. best: 4.5 bits/char (every codeword covers a pair), worst: 9 bits/char (every codeword covers a single character)!!
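The arithmetic behind the solution can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are illustrative and do not come from the slides.

```python
import math

single_chars = 256   # all 8-bit ASCII characters
char_pairs = 256     # the 256 most common pairs of characters
entries = single_chars + char_pairs

# Fixed-length codewords need ceil(log2(entries)) bits each.
bits_per_codeword = math.ceil(math.log2(entries))

best = bits_per_codeword / 2    # every codeword replaces a pair of characters
worst = bits_per_codeword / 1   # every codeword replaces a single character

print(bits_per_codeword, best, worst)
```

The worst case costs more than the 8 bits/char of the uncompressed text, which is why the choice of dictionary entries matters.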
Dictionary models - III
Another possibility is to use longer words in the dictionary, perhaps common words like "the" or "and", or common components of words like "tion". These strings are the phrases of the dictionary
A dictionary with a predefined set of phrases does not achieve good compression
Performance is better if we tune the dictionary to the input source, i.e. if we give up input independence
Dictionary models - IV
For instance, common phrases in an Italian sports newspaper are very rare in a business management book
To avoid the problem of the dictionary being unsuitable for the text at hand, we can build a new dictionary for each message to be compressed...
... but there is a significant overhead for transmitting and storing it
Deciding the size of the dictionary in order to maximize compression is a very difficult problem
The Lempel-Ziv methods
The only efficient solution to the problem is to use an adaptive dictionary scheme
Practically all adaptive dictionary compression methods are based on one of the two methods developed by two Israeli researchers, Abraham Lempel and Jacob Ziv, in 1977 and 1978, and called LZ77 and LZ78
LZ77 was published as "A Universal Algorithm for Sequential Data Compression" in the IEEE Transactions on Information Theory, May 1977
The key idea - I
The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed
The prior text makes a very good dictionary, since it usually has the same style and language as the upcoming text
The key idea - II
The dictionary does not have to be transmitted with the compressed text, since the decompressor can build it the same way the compressor does
The many variants of Lempel-Ziv methods differ in how pointers are represented and in the limitations on what the pointers are able to refer to
The presence of so many variants is also caused by some patents, and by the disputes over patenting
The LZ77 family
+ Quite easy to implement
+ Fast decoding with little use of memory
The output of the encoding consists of a series of triples
the first component indicates how far back to look in the previously decoded text
the second component is the length of the phrase
the third is the next character from the input
An example - encoding
alphabet {a,b}
text:    a        b        aa       bab      aabb
output:  <0,0,a>  <0,0,b>  <2,1,a>  <3,2,b>  <5,3,b>
An example - decoding
<0,0,x><0,0,y><2,1,z><2,1,x><5,3,z> <6,3,z><5,2,z>
SOL. x y xz xx yxzz xxyz zxz
A recursive example
<0,0,a><0,0,c><2,1,a><4,2,b><1,10,a>
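a c aa acb bbbbbbbbbba
Despite the recursive references, each character is available when needed: the last triple copies 10 characters starting 1 position back, so every copied b has already been written by the time it is read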
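A minimal Python sketch of a triple decoder shows why the recursive reference works. The function name `lz77_decode` and the tuple representation of triples are assumptions made for illustration. The copy is done one character at a time, which is what makes overlapping references safe.

```python
def lz77_decode(triples):
    """Decode LZ77 triples (offset, length, next_char) into a string."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        # Copy one character at a time: when length > offset the source
        # range overlaps the text being written, but each character has
        # already been appended by the time it is read.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)

recursive = [(0, 0, "a"), (0, 0, "c"), (2, 1, "a"), (4, 2, "b"), (1, 10, "a")]
print(lz77_decode(recursive))
```

The same function decodes the earlier example with the alphabet {x, y, z} into the phrases x y xz xx yxzz xxyz zxz.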
Further details on LZ77
The LZ77 algorithm places limitations on how far back a pointer can refer (i.e. on the size of the first component of the triple) and on the maximum length of the string referred to (i.e. on the size of the second component)
For example, in English text there is no gain in using a sliding window of more than a few thousand characters
We can use a window of 8,192 characters, i.e. 13-bit pointers
Further details on LZ77
At the same time, the length of a match is rarely over 16 characters, so the extra cost of allowing longer matches is usually not justified
Exercise: encode the sequence
010020$0110$$0111
with a sliding window of 7 symbols and a maximal match length of 3. Calculate the compression ratio
SOL. <0,0,0><0,0,1><2,1,0><0,0,2><2,1,$><7,2,1><5,2,$><6,3,1>
0000000 0000001 0100100 0000010 0100111 1111001 1011011 1101101
(each triple takes 3 + 2 + 2 = 7 bits; the symbols are coded as 0=00, 1=01, 2=10, $=11)
C = (17*2)/(7*8) = 34/56 = 0.607 < 1!!! The encoded output is longer than the input
LZ77 - encoding
Encode the text S[1..N] using LZ77, with a sliding window of W characters
p = 1
WHILE p <= N {
    search for the longest match for S[p...] in S[p-W ... p-1].
    Suppose the match occurs at position p-m, with length l
    output the triple < m, l, S[p+l] >
    p = p + l + 1
}
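The encoding loop can be sketched in Python as follows. This is a brute-force illustration, not an efficient implementation: it uses 0-based indexing, emits the distance back to the match as the first component, caps the match so that a next character always remains, and breaks ties in favour of the nearest match. The function name, the parameter names and the tie-breaking rule are assumptions of this sketch.

```python
def lz77_encode(text, window=7, max_len=3):
    """Greedy LZ77 encoder returning (offset, length, next_char) triples."""
    triples = []
    p = 0
    n = len(text)
    while p < n:
        best_off, best_len = 0, 0
        # Leave at least one character to serve as the third component.
        limit = min(max_len, n - p - 1)
        # Try nearer matches first; a farther match wins only if longer.
        for off in range(1, min(window, p) + 1):
            k = 0
            while k < limit and text[p - off + k] == text[p + k]:
                k += 1
            if k > best_len:
                best_off, best_len = off, k
        triples.append((best_off, best_len, text[p + best_len]))
        p += best_len + 1
    return triples

triples = lz77_encode("010020$0110$$0111", window=7, max_len=3)
print(triples)

# 3 bits for the offset, 2 for the length, 2 for the symbol (4-symbol alphabet).
compressed_bits = len(triples) * (3 + 2 + 2)
original_bits = 17 * 2
print(original_bits / compressed_bits)
```

With these choices the sketch reproduces the triples of the exercise on the sequence 010020$0110$$0111.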
LZ77 - decoding
Decode the text S[1..N] using LZ77, with a sliding window of W characters
p = 1
FOREACH triple < f, l, c > {
    S[p ... p+l-1] = S[p-f ... p-f+l-1]
    S[p+l] = c
    p = p + l + 1
}
LZ77 - improvements
LZ77 has been gradually refined
first component of the triple: it is useful to use variable-length codes, assigning shorter codewords to recent matches (which are more common)
second component of the triple: variable-length codes that use fewer bits to represent smaller numbers
third component of the triple: in some variants it is added only when needed (when?), with a 1-bit flag to indicate the presence or the absence of this third component
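The slides do not name a specific variable-length code. As one possible illustration, the Elias gamma code gives shorter codewords to smaller positive integers; choosing it here is an assumption of this sketch, not necessarily what any LZ77 variant uses.

```python
def elias_gamma(n):
    """Return the Elias gamma codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]   # binary digits, most significant bit first
    # Prefix with (number of digits - 1) zeros so the decoder knows the length.
    return "0" * (len(binary) - 1) + binary

for n in (1, 2, 3, 4, 16):
    print(n, elias_gamma(n))
```

Since a match length can be 0, a coder built on this sketch would have to encode length + 1 rather than the length itself.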
gzip algorithm - I
gzip is one of the more effective variants of LZ77
It is distributed by the GNU Project of the Free Software Foundation
home page of gzip project: www.gzip.org
gzip algorithm - II
gzip uses a hash table: the next 3 characters to be coded are hashed, and the resulting value is used as an index to look up a table entry
This entry is the head of a list that contains the positions where those 3 characters occur in the window
The list is searched for the longest match
If there is no match, the string is coded as raw characters
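The lookup described above can be sketched in Python. This is a simplified illustration and not gzip's actual code: a Python dict keyed by the 3-character string stands in for the hash function and table, the 32KB window limit is omitted, and the names `index_position` and `longest_match` are hypothetical.

```python
from collections import defaultdict

def index_position(data, pos, chains):
    """Record that the 3 characters starting at pos occur at pos."""
    if pos + 3 <= len(data):
        chains[data[pos:pos + 3]].append(pos)

def longest_match(data, pos, chains, max_len=258):
    """Return (length, distance) of the longest earlier match for data[pos:]."""
    best_len, best_dist = 0, 0
    key = data[pos:pos + 3]
    # Walk the list of earlier occurrences, most recent first.
    for cand in reversed(chains.get(key, [])):
        length = 0
        while (length < max_len and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist

data = "abcabcabcd"
chains = defaultdict(list)
for p in range(3):                 # index the text already coded
    index_position(data, p, chains)
print(longest_match(data, 3, chains))
```

A result of length 0 means that no earlier occurrence of the 3 characters exists, so the next character would be coded as a raw character.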
gzip algorithm - III
If a match exists, we have a length and a distance, otherwise we have a zero length and a raw character
The sliding window has size W=32KB, and match lengths are limited to 258 bytes
List lengths are limited to avoid time-consuming searches
The trade-off between accuracy and time is the user's choice
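The user's choice can be tried out from Python through the standard `zlib` module, which implements the same DEFLATE method that gzip uses. The sample data below is made up for illustration, and the sizes obtained depend on the input.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 200

fast = zlib.compress(data, 1)   # level 1: favours speed
best = zlib.compress(data, 9)   # level 9: favours compression

print(len(data), len(fast), len(best))

# Both levels produce streams that decompress to the same original data.
restored_fast = zlib.decompress(fast)
restored_best = zlib.decompress(best)
```

Only the encoder works harder at higher levels; the decoder is the same whatever level was chosen.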
gzip algorithm - IV
Lengths, distances and raw characters are coded with two Huffman trees, one for distances and the other for lengths and raw characters
Huffman codes are generated by processing blocks of up to 64KB (with canonical Huffman)
so gzip is not really one-pass. From a practical point of view it is one-pass, because blocks are small, so they are read only once and kept in main memory
gzip - example
abacbcaab
<length, distance/character>
<0,a><0,b><1,2><0,c><1,3><1,2><1,4><2,7>
gzip – best compression
...... abcaab
two solutions: ab + caab, or a + bcaab
The first is greedy: it uses the longest possible match. But sometimes the second is better. If best compression is selected, gzip takes longer but chooses the better of the two, possibly coding raw characters even when matches are possible, if that gives better compression in the long run
LZ78 family
- it has restrictions on which substrings can be referenced (but this avoids some inefficiency)
- decoding is slower than in LZ77 and requires more memory
+ it does not have a window to limit how far back substrings can be referenced
+ one of its variants, LZW, is widely used in many popular compression systems
Referenceable strings
The text prior to the current coding position is parsed into substrings, and only parsed phrases can be referenced
Previous phrases are numbered in sequence, and the output is a list of pairs <previous phrase, next character>
This unseen combination is stored as a new phrase
An example
text:    a      b      aa     ba     baa
output:  <0,a>  <0,b>  <1,a>  <2,a>  <4,a>
Phrases
0  <null>
1  a
2  b
3  aa
4  ba
5  baa
Only these phrases can be referenced
This avoids the inefficiency of having more than one coded representation for the same string, as is usual in LZ77
How to store the phrases
It is crucial for the efficiency of the algorithm to store the phrases in a clever way
This can be obtained using a trie
Phrases
0  <null>
1  a
2  b
3  aa
4  ba
5  baa
[Figure: trie with nodes 0-6, rooted at node 0; each edge is labelled with a character (a or b)]
How to store the phrases
The characters of each phrase specify a path from the root to a node
The characters to be encoded are used to traverse the trie until the path is blocked
The last node reached contains the phrase number to output
A new node is added with the next input character, to form a new phrase
[Figure: the same trie; the input baab follows the path b, a, a down to node 5, where the path is blocked, so the output is <5,b> and a new node 7 is added below node 5 on an edge labelled b]
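The traversal described on this slide can be sketched in Python. The class `TrieNode`, the function `lz78_encode` and the handling of an input that ends inside a known phrase (an empty next character) are assumptions of this sketch.

```python
class TrieNode:
    def __init__(self, number):
        self.number = number    # phrase number stored in this node
        self.children = {}      # next character -> child TrieNode

def lz78_encode(text):
    """Return the LZ78 output as a list of (phrase number, next char) pairs."""
    root = TrieNode(0)
    next_number = 1
    output = []
    node = root
    for ch in text:
        child = node.children.get(ch)
        if child is not None:
            node = child                       # keep following the path
        else:
            output.append((node.number, ch))   # path blocked: emit the pair
            node.children[ch] = TrieNode(next_number)
            next_number += 1
            node = root                        # start the next phrase
    if node is not root:
        # Input ended inside a known phrase: emit it with no next character.
        output.append((node.number, ""))
    return output

print(lz78_encode("abaababaa"))
```

Each input character costs one dictionary lookup in the current node, so the encoder never rescans earlier text.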
A problem
The trie data structure continues to grow during coding, and eventually growth must be stopped to avoid excessive use of memory
There are various strategies
the trie can be reinitialized from scratch
the trie can be used as is, without further updates
the trie can be partially rebuilt using the last part of the text (this avoids the penalties of starting from scratch)
LZ78 vs. LZ77
LZ78 encoding can be faster
LZ78 decoding is slower because the decoder must also store the parsed phrases
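A minimal Python sketch of an LZ78 decoder makes the extra storage visible: the list `phrases` grows by one entry per decoded pair. The function name and the list-based phrase table are assumptions made for illustration.

```python
def lz78_decode(pairs):
    """Decode a list of (phrase number, next char) pairs into a string."""
    phrases = [""]               # phrase 0 is the empty (null) phrase
    out = []
    for number, ch in pairs:
        phrase = phrases[number] + ch
        phrases.append(phrase)   # the decoder rebuilds the same dictionary
        out.append(phrase)
    return "".join(out)

pairs = [(0, "a"), (0, "b"), (1, "a"), (2, "a"), (4, "a")]
print(lz78_decode(pairs))
```

An LZ77 decoder, by contrast, needs nothing beyond the text it has already produced.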
LZ78 - exercise
Code the sequence with LZ78 and show the trie that stores the phrases
0100101110101001011
SOL. <0,0><0,1><1,0><2,0><2,1><4,1><1,1><3,1><7,1>
PHRASES
0  <null>
1  0
2  1
3  00
4  10
5  11
6  101
7  01
8  001
9  011
TRIE (parent node, edge character, child node)
0 -0-> 1
0 -1-> 2
1 -0-> 3
1 -1-> 7
2 -0-> 4
2 -1-> 5
3 -1-> 8
4 -1-> 6
7 -1-> 9
LZW variant - I
One of the most popular variants of Lempel-Ziv coding (Welch 1984)
It forms the basis for the Unix utility compress and many other popular compressors
The main difference between LZW and LZ78 is that LZW encodes only phrase numbers without any ending characters
This scheme works fine because we initialize the dictionary with a phrase for each character of the source alphabet (e.g. the 256 characters of the 8-bit ASCII)
LZW variant - II
A new phrase is constructed from a coded one by appending the first character of the next phrase
Suppose we use 7-bit ASCII: dictionary is initialized with 128 phrases (0-127)
LZW - encoding
input:   a   b   a   ab   ab   ba   aba  abaa
output:  97  98  97  128  128  129  131  134
new phrases added
128 ab
129 ba
130 aa
131 aba
132 abb
133 baa
134 abaa
LZW - decoding
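input:   97  98  97  128  128  129  131  134
output:  a   b   a   ab   ab   ba   aba  aba?
new phrases added (each phrase is created with its last character still unknown, marked ?, and is completed by the first character of the next phrase)
128 a? -> ab
129 b? -> ba
130 a? -> aa
131 ab? -> aba
132 ab? -> abb
133 ba? -> baa
134 aba? -> it is not ready!!
The delay in phrase creation is not a problem unless the encoder uses a phrase immediately after creating it. In this case a decoder that inserts phrases only when they are complete cannot decode, because it does not have phrase 134 yet
The decoder can still recover it: phrase 134 starts with aba, and its missing last character is its own first character, so it must be abaa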
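Both directions of LZW, including the case where a code is used immediately after its creation, can be sketched in Python. The function names and the `alphabet_size` parameter (128 for the 7-bit ASCII example) are assumptions of this sketch.

```python
def lzw_encode(text, alphabet_size=128):
    """Return the list of LZW phrase numbers for text."""
    codes = {chr(i): i for i in range(alphabet_size)}
    out = []
    w = ""
    for ch in text:
        if w + ch in codes:
            w += ch                      # extend the current phrase
        else:
            out.append(codes[w])
            codes[w + ch] = len(codes)   # new phrase: w plus next character
            w = ch
    if w:
        out.append(codes[w])
    return out

def lzw_decode(codes_in, alphabet_size=128):
    """Rebuild the text from LZW phrase numbers."""
    phrases = [chr(i) for i in range(alphabet_size)]
    out = []
    prev = None
    for code in codes_in:
        if code < len(phrases):
            entry = phrases[code]
        elif code == len(phrases) and prev is not None:
            # The phrase is not ready yet: it must be the previous phrase
            # followed by the previous phrase's own first character.
            entry = prev + prev[0]
        else:
            raise ValueError("invalid LZW code")
        if prev is not None:
            phrases.append(prev + entry[0])   # complete the pending phrase
        out.append(entry)
        prev = entry
    return "".join(out)

codes = lzw_encode("abaababbaabaabaa")
print(codes)
print(lzw_decode(codes))
```

On the example text the encoder emits 134 right after creating it, and the decoder takes the special branch exactly once.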
LZW - exercise
Code with LZW the sequence
0102$00$10111$02$
using 8-bit ASCII codes
Hint. 0 = 48, 1 = 49, 2 = 50, $ = 36
SOL. 48 49 48 50 36 48 48 36 257 49 265 260 259
PHRASES
256 01
257 10
258 02
259 2$
260 $0
261 00
262 0$
263 $1
264 101
265 11
266 11$
267 $02
268 2$?
Lempel-Ziv methods: summary
LZ77: <pointer, length, character>
LZ78: <phrase, character>
gzip: <length, distance/character>
LZW: <phrase>
Lempel-Ziv methods: exercise
Code the message using all studied methods
abbb010cac0bb0abb10111b1a
using a sliding window of 7 bits and a max match length of 7. No limit is given with respect to the dictionary dimension
You can use the following 8-bit ASCII codes
a = 97, b = 98, c = 99, 0 = 48, 1 = 49