ee-575 information theory - sem 092ee575.pbworks.com/f/lempel+ziv++proj+report.pdf · ee-575...

EE-575 INFORMATION THEORY - SEM 092

Project Report on Lempel Ziv compression technique.

Department of Electrical Engineering

Prepared By:

Mohammed Akber Ali

Student ID # g200806120.

------------------------------------------------------------------------------------------------------------------------------------------

King Fahd University Of Petroleum & Minerals

Dhahran, Saudi Arabia.

Contents

1. Introduction
2. Dictionary coding
3. Lempel-Ziv coding
4. The coding process
5. The decoding process
6. Flowchart for coding process
7. Flowchart for decoding process
8. Example 1.5.1 solved theoretically
9. Example 1.5.2 solved theoretically
10. Exercise 1.5.1 solved theoretically
11. Exercise 1.5.2 solved theoretically
12. Advantages, Disadvantages & Applications
13. Results
14. Conclusion
15. References

INTRODUCTION:

Data compression seeks to reduce the number of bits used to store or transmit information. It encompasses a wide variety of software and hardware compression techniques. Data compression consists of taking a stream of symbols and transforming it into codes. For effective compression, the resulting stream of codes must be smaller than the original stream of symbols. Huffman coding, for example, is a type of coding in which the actual output of the encoder is determined by a set of symbol probabilities.

The problem with Huffman coding is that each codeword uses an integral number of bits and that the symbol probabilities must be known beforehand. Well-known lossless compression techniques include:

• Run-length coding: Replace strings of repeated symbols with a count and only one symbol. Example: aaaaabbbbbbccccc -> 5a6b5c

• Statistical techniques:

– Huffman coding: Replace fixed-length codes (such as ASCII) by variable-length codes, assigning shorter codewords to the more frequently occurring symbols and thus decreasing the overall length of the data. When using variable-length codewords, it is desirable to create a (uniquely decipherable) prefix code, avoiding the need for a separator to determine codeword boundaries. Huffman coding creates such a code.

– Arithmetic coding: Encode the message as a whole as a single number in the interval from zero to one.

– PPM (prediction by partial matching): Analyze the data and predict the probability of a character in a given context. Usually, arithmetic coding is used for encoding the data. PPM techniques yield the best results among the statistical compression techniques.
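As a small illustration of the run-length coding idea listed above, the following Python sketch reproduces the example aaaaabbbbbbccccc -> 5a6b5c. Python is used here only for illustration (the project code itself was written in MATLAB), and the function name run_length_encode is an assumption of this sketch, not something defined in the report.

```python
from itertools import groupby


def run_length_encode(text: str) -> str:
    """Replace each run of a repeated symbol with '<count><symbol>'."""
    pieces = []
    for symbol, run in groupby(text):
        # groupby yields one group per maximal run of equal symbols
        pieces.append(f"{len(list(run))}{symbol}")
    return "".join(pieces)


# Five a's, six b's and five c's, as in the example above.
message = "a" * 5 + "b" * 6 + "c" * 5
print(run_length_encode(message))  # prints 5a6b5c
```

A practical run-length coder would also need a rule for input symbols that are themselves digits, which this sketch does not address.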

The Lempel-Ziv algorithms belong to yet another category of lossless compression techniques, known as dictionary coders. They avoid the need for a statistical model by using an adaptive dictionary, which is discussed below.

DICTIONARY CODING

Dictionary codes are compression codes that dynamically construct their own coding and decoding tables “on the fly” by looking at the data stream itself. Because of this, the symbol probabilities need not be known beforehand. These codes take advantage of the fact that certain strings of symbols occur quite often, so each such string can be assigned a single code word that represents the entire string.

Dictionary coding techniques rely upon the observation that there are correlations between parts of data (recurring patterns). The basic idea is to replace those repetitions by (shorter) references to a "dictionary" containing the original.

(i) Static Dictionary

The simplest forms of dictionary coding use a static dictionary. Such a dictionary may contain

frequently occurring phrases of arbitrary length, digrams (two-letter combinations) or n-grams.

This kind of dictionary can easily be built upon an existing coding such as ASCII by using

previously unused codewords or extending the length of the codewords to accommodate the

dictionary entries. A static dictionary achieves little compression for most data sources. The

dictionary can be completely unsuitable for compressing particular data, thus resulting in an

increased message size (caused by the longer codewords needed for the dictionary).
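A static dictionary built on top of ASCII can be sketched as follows. The sketch assumes 7-bit ASCII input, so that code values 128 and above are unused and can be assigned to digrams. The three digrams chosen and the name digram_encode are hypothetical and serve only as an illustration.

```python
def digram_encode(text, digrams):
    """Encode text with a static digram dictionary layered over ASCII.

    Assumption: every character of text is 7-bit ASCII, so the values
    128, 129, ... are free to stand for the listed two-letter strings.
    """
    codes = {pair: 128 + i for i, pair in enumerate(digrams)}
    output = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in codes:
            output.append(codes[pair])    # one code word for two letters
            i += 2
        else:
            output.append(ord(text[i]))   # plain ASCII code
            i += 1
    return output


print(digram_encode("the then", ["th", "he", "en"]))
# prints [128, 101, 32, 128, 130]
```

Here eight characters are represented by five code words. On text in which the listed digrams never occur, nothing is gained, which is the weakness of a static dictionary noted above.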

(ii) Semi-Adaptive Dictionary

The aforementioned problems can be avoided by using a semi-adaptive encoder. This class of

encoders creates a dictionary custom-tailored for the message to be compressed. Unfortunately,

this makes it necessary to transmit/store the dictionary together with the data. Also, this method

usually requires two passes over the data, one to build the dictionary and another one to

compress the data. A question arising with the use of this technique is how to create an optimal

dictionary for a given message. It has been shown that this problem is NP-complete (vertex cover

problem). Fortunately, there exist heuristic algorithms for finding near-optimal dictionaries.

(iii) Adaptive Dictionary

The Lempel-Ziv algorithms belong to this third category of dictionary coders. The dictionary is built in a single pass, while the data is being encoded. As we will see, it is not necessary to explicitly transmit or store the dictionary, because the decoder can build up the dictionary in the same way as the encoder while decompressing the data.

LEMPEL-ZIV CODING:

History: In 1983 Sperry filed a patent for an algorithm developed by Terry Welch, an employee at the Sperry Research Center. This algorithm, known as LZW, is Welch's variation on a data compression technique first proposed by Jacob Ziv and Abraham Lempel in 1978. Welch's technique is both simpler and faster. He published an article describing the technique in the June 1984 issue of IEEE Computer magazine. The technique became very popular and was widely adopted.

LZ compression is a form of substitution compression. In this form of compression, a string of characters that recurs in the data is replaced with a reference to that phrase, which is maintained in a dictionary. The resulting data compresses because the reference is much smaller than the repeated phrase it stands for.

While LZ compression is very fast, it is best suited for files that contain repetitive data. Text files and monochrome graphic images are ideal for LZW compression. Files that do not contain repetitive data will actually grow in size when compressed, because of the overhead of the LZW dictionary.

LZW compression today is in the public domain and freely available for use by anyone. The U.S. patent expired in 2003, and the European, Canadian and Japanese patents expired in 2004.

A Linked List LZ algorithm:

Following the book by Richard B. Wells, we use the algorithm given in the text, which is a mild modification of the actual LZW algorithm. The algorithm begins by defining the structure of the dictionary. Each entry in the dictionary is given an address m. Each entry consists of an ordered pair <n, a_i>, where n is a pointer to another location in the dictionary and a_i is a symbol drawn from the source alphabet A. The ordered pairs in the dictionary are said to make up a linked list. The pointer variables n also serve as the transmitted code words.

The total number of dictionary entries exceeds the number of symbols, M, in the source alphabet, so each transmitted code word actually contains more bits than it would take to represent a single symbol of the alphabet A. However, most of the code words represent strings of source symbols, and in a long message it is more economical to encode these strings than it is to encode the individual symbols.
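The linked-list structure can be illustrated with a short Python sketch. The list below holds the first seven entries of the dictionary from the binary worked example later in this report. Representing the dictionary as a Python list of (pointer, symbol) tuples and the helper name expand are choices made for this sketch only.

```python
# Index in the list = dictionary address m; value = (pointer n, symbol a).
dictionary = [
    (0, None),  # address 0: null entry
    (0, "0"),   # address 1: root entry for symbol 0
    (0, "1"),   # address 2: root entry for symbol 1
    (2, "1"),   # address 3
    (2, "0"),   # address 4
    (1, "0"),   # address 5
    (5, "1"),   # address 6
]


def expand(address, dictionary):
    """Follow the pointers from an address back to the null entry and
    return the string of source symbols that the address stands for."""
    symbols = []
    while address != 0:
        pointer, symbol = dictionary[address]
        symbols.append(symbol)      # collected from last symbol to first
        address = pointer
    return "".join(reversed(symbols))


print(expand(6, dictionary))  # prints 001
```

Address 6 is traced as <5,1>, then <1,0>, then <0,0>, so the single code word 6 stands for the three-symbol string 001.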

The Coding Process:

A dictionary is initialized to contain the single-character strings corresponding to all the possible

input characters (and nothing else except the clear and stop codes if they're being used). The

algorithm works by scanning through the input string for successively longer substrings until it

finds one that is not in the dictionary. When such a string is found, the index for the string less

the last character (i.e., the longest substring that is in the dictionary) is retrieved from the

dictionary and sent to output, and the new string (including the last character) is added to the

dictionary with the next available code. The last input character is then used as the next starting

point to scan for substrings.

In this way, successively longer strings are registered in the dictionary and made available for

subsequent encoding as single output values. The algorithm works best on data with repeated

patterns, so the initial parts of a message will see little compression. As the message grows,

however, the compression ratio tends asymptotically to the maximum.
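The scan-and-extend procedure described above can be sketched in Python as follows. This is a minimal illustration of string-table LZW under stated assumptions, not the MATLAB code of this project: the initial dictionary is passed in as the string alphabet, every input character is assumed to belong to it, clear and stop codes are not used, and the name lzw_encode is hypothetical.

```python
def lzw_encode(data, alphabet):
    """Return the list of LZW output codes for data."""
    # Initial dictionary: one code per single-character string.
    table = {ch: code for code, ch in enumerate(alphabet)}
    current = ""          # longest match found so far
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in table:
            current = candidate               # keep extending the match
        else:
            output.append(table[current])     # emit longest known string
            table[candidate] = len(table)     # register the new string
            current = ch                      # restart from the last char
    if current:
        output.append(table[current])         # flush the final match
    return output


print(lzw_encode("ABABABA", "AB"))  # prints [0, 1, 2, 4]
```

In this run the dictionary grows to AB = 2, BA = 3 and ABA = 4, so the last three input characters are sent as the single code 4.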

The LZ algorithm uses the above principle with a vengeance, with the added twist that the strings can be of variable length. The algorithm is initialized by constructing the first M+1 entries in the dictionary as follows:

Address    Dictionary entry
0          <0, null>
1          <0, a_0>
...        ...
m          <0, a_{m-1}>
...        ...
M          <0, a_{M-1}>

The 0-address entry in the dictionary is a null symbol, helpful to let the decoder know where

strings end. The pointers n in these first M+1 entries are zero. They “point” to the null entry at

the address 0. The initialization also initializes pointer variable n=0 and address pointer m=M+1.

The address pointer m points to the next “blank” location in the dictionary. After the

initialization, the encoder iteratively executes the following steps:

1. Fetch next source symbol a;

2. If the ordered pair <n,a> is already in the dictionary then
       n = dictionary address of entry <n,a>;
   else
       transmit n;
       create new dictionary entry <n,a> at the dictionary address m;
       m = m+1;
       n = dictionary address of entry <0,a>;

3. Return to step 1.
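The three steps above can be sketched in Python as follows. This is an illustrative sketch rather than the MATLAB program of this project: the dictionary is held as a Python dict keyed by the pair (n, a) instead of being searched entry by entry, and the function name lz_encode and its return values are assumptions of the sketch. The pointer n left over at the end of the input is returned separately, since the steps above do not say how it is flushed.

```python
def lz_encode(source, alphabet):
    """Linked-list LZ encoder following steps 1 to 3.

    Returns (transmitted code words, final pointer n, dictionary).
    """
    # Address 0 is the null entry; addresses 1..M hold the roots <0, a>.
    entries = {(0, a): address for address, a in enumerate(alphabet, start=1)}
    n = 0
    m = len(alphabet) + 1            # next blank dictionary address
    transmitted = []
    for a in source:                 # step 1: fetch next source symbol
        if (n, a) in entries:        # step 2: known string, follow it
            n = entries[(n, a)]
        else:                        # new string: transmit and record it
            transmitted.append(n)
            entries[(n, a)] = m
            m += 1
            n = entries[(0, a)]      # restart from the root entry of a
    return transmitted, n, entries


source = "110 001 011 001 011 100 011 11".replace(" ", "")
codes, pending, _ = lz_encode(source, "01")
print(codes, pending)  # prints [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6] 11
```

The input used here is the binary sequence of Example 1.5.1, and the printed code words agree with the Transmit column of that example.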

If <n,a> is already in the dictionary in step 2, the encoder is processing a string of symbols that has occurred at least once previously. Setting the next value of n to this address constructs a linked list that allows the string of symbols to be traced.

If <n,a> is not already in the dictionary in step 2, the encoder has encountered a new string that was not processed previously. It transmits the code symbol n, which lets the receiver know the dictionary address of the last source symbol in the previous string. Whenever the encoder transmits a code symbol, it also creates a new dictionary entry. The encoder's dictionary building and code symbol transmission process can be implemented as a MATLAB program.

The Decoding Process:

The decoding algorithm works by reading a value from the encoded input and outputting the

corresponding string from the initialized dictionary. At the same time it obtains the next value

from the input, and adds to the dictionary the concatenation of the string just output and the first

character of the string obtained by decoding the next input value. The decoder then proceeds to

the next input value (which was already read in as the "next value" in the previous pass) and

repeats the process until there is no more input, at which point the final input value is decoded

without any more additions to the dictionary.

In this way the decoder builds up a dictionary which is identical to that used by the encoder, and uses it to decode subsequent input values. Thus the full dictionary need not be sent with the encoded data; just the initial dictionary containing the single-character strings is sufficient (and is typically defined beforehand within the encoder and decoder rather than being explicitly sent with the encoded data).
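A matching decoder for string-table LZW can be sketched in Python as follows. It is a minimal sketch under the same assumptions as before (the initial dictionary is given as the string alphabet, no clear or stop codes, hypothetical name lzw_decode). Instead of reading one value ahead, it adds the new entry when the next value arrives, which builds the same dictionary. It also handles one case that the description above does not mention: a received code may refer to the entry that is still being built, in which case the string is the previous string followed by its own first character.

```python
def lzw_decode(codes, alphabet):
    """Rebuild the original string from a list of LZW codes."""
    table = {code: ch for code, ch in enumerate(alphabet)}
    if not codes:
        return ""
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the entry being created right now.
            entry = previous + previous[0]
        pieces.append(entry)
        # New entry: previous string plus first character of this one.
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(pieces)


print(lzw_decode([0, 1, 2, 4], "AB"))  # prints ABABABA
```

In this run the last code, 4, is received before the decoder has completed entry 4, so the special case is exercised.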

The decoder at the receiver must also be able to construct an identical dictionary based on the code words received. Each decoding iteration rests on the following observations:

1. Reception of any code word means that a new dictionary entry must be constructed.

2. The pointer n for this new dictionary entry is the same as the received code word n.

3. The source symbol a for this entry is not yet known, since it is the root symbol of the next string (which has not yet been transmitted by the encoder).

If the address of this next dictionary entry is m, we see that the decoder can only construct a partial entry <n,?>, since it must await the next received code word to find the root symbol a for this entry. It can, however, fill in the missing symbol in its previous dictionary entry at address m-1. It can also decode the source symbol string associated with the received code word n.

This decoding process can also be implemented in MATLAB.
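The decoding iteration described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the MATLAB program of this project: the dictionary is a list of [pointer, symbol] pairs indexed by address, every received code word is assumed to be at least 1, and the name lz_decode is hypothetical. The previous partial entry is completed before any symbols are read out, so a code word that points at that partial entry is decoded correctly as well.

```python
def lz_decode(codes, alphabet):
    """Linked-list LZ decoder. Returns (decoded string, dictionary)."""
    # Address 0 is the null entry; addresses 1..M hold the roots <0, a>.
    dictionary = [[0, None]] + [[0, a] for a in alphabet]
    partial = None                   # address of the entry <n, ?>
    decoded = []
    for n in codes:
        # Trace the linked list from address n back to a root entry.
        chain = []
        address = n
        while address != 0:
            chain.append(address)
            address = dictionary[address][0]
        root_symbol = dictionary[chain[-1]][1]
        # The root symbol of this string completes the previous entry.
        if partial is not None:
            dictionary[partial][1] = root_symbol
        # Symbols were traced last to first, so read them in reverse.
        decoded.extend(dictionary[addr][1] for addr in reversed(chain))
        # Reception of a code word creates a new partial entry <n, ?>.
        dictionary.append([n, None])
        partial = len(dictionary) - 1
    return "".join(decoded), dictionary


received = [2, 2, 1, 5, 4, 3, 6, 1, 3, 4, 6, 11]
text, _ = lz_decode(received, "01")
print(text)  # prints 11000101100101110001111
```

The code words used here are those transmitted in Example 1.5.1, followed by the encoder's final pointer value 11, and the output is the original 23-symbol binary sequence.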

Flow chart for LEMPEL ZIV Encoder:

1. Input = sequence to be coded.

2. S = size of input sequence; initialize the dictionary, the address pointer (Pm), the pointer variable (Pn = 0) and other variables.

3. Start a while loop to consider each symbol of the input sequence one by one.

4. Set flag ak = 1.

5. Start a for loop (i = 0 : length of dictionary) to match the present symbol with the dictionary elements.

6. If the present input symbol equals the symbol of the dictionary entry and Pn equals the pointer of that entry: update Pn = dictionary address of the entry, clear the flag (ak = 0) and break the for loop. Else: continue with the next dictionary element.

7. If ak == 1 (no match was found): record a new dictionary entry, transmit Pn (the pointer variable) and record it in the output array. Then, using a for loop, find the root entry for the present symbol and update Pn = its address.

8. Increment the while loop variable to receive the next symbol.

9. Output: display the dictionary and the transmitted sequence.

Flow chart for LEMPEL ZIV Decoder:

1. Input = received sequence to be decoded.

2. S = size of input sequence; initialize the dictionary, the address pointer (Pm), the pointer variable (Pn = 0) and other variables.

3. Start a for loop to consider each symbol of the input sequence one by one.

4. Set flag ak = 1.

5. Start a for loop (i = 0 : length of dictionary) to match the present received symbol with the dictionary elements.

6. If the present received symbol matches a dictionary entry whose pointer is Pn = 0, i.e. it is a root entry: update the symbol pointer and record the dictionary element as the decoded symbol and as the new entry. Update the partial dictionary entry for the previous symbol and create a new partial dictionary entry for the current symbol. Then fetch the next symbol.

7. Else: record the pointer variable and treat it as an address pointer each time (using a while loop) until the root element is reached. Keep track of all elements encountered in this process and update the list of decoded symbols in reverse order. Also record the root element in order to update the previous partial dictionary entry.

8. If there is no next symbol, display the decoded sequence.

In Example 1.5.1 a binary information source emits the sequence of symbols 110 001 011 001 011 100 011 11 etc. The sequential encoding procedure is shown in the following tables, along with the encoder's dictionary as it is constructed.

Given that A = {0,1}

We initialize the dictionary with addresses 0 to 2, as shown. The initial values for n and m are n = 0 and m = 3. The encoder's operation for this source is as follows:

Dictionary address    Dictionary entry
0                     0,null
1                     0,0
2                     0,1
3                     2,1
4                     2,0
5                     1,0
6                     5,1
7                     4,1
8                     3,0
9                     6,0
10                    1,1
11                    3,1
12                    4,0
13                    6,1
14                    No entry yet

Source symbol   Present n   Present m   Transmit   Next n   Dictionary entry
1               0           3           -          2        -
1               2           3           2          2        2,1
0               2           4           2          1        2,0
0               1           5           1          1        1,0
0               1           6           -          5        -
1               5           6           5          2        5,1
0               2           7           -          4        -
1               4           7           4          2        4,1
1               2           8           -          3        -
0               3           8           3          1        3,0
0               1           9           -          5        -
1               5           9           -          6        -
0               6           9           6          1        6,0
1               1           10          1          2        1,1
1               2           11          -          3        -
1               3           11          3          2        3,1
0               2           12          -          4        -
0               4           12          4          1        4,0
0               1           13          -          5        -
1               5           13          -          6        -
1               6           13          6          2        6,1
1               2           14          -          3        -
1               3           14          -          11       -

The decoding process in Example 1.5.2 can be seen explicitly with the help of the table below.

The decoder begins by constructing the same first three entries as the encoder. It can do this because the source alphabet is known a priori by the decoder. The decoder is initialized so that the address of the next dictionary entry is 3.

Received code   Dictionary address   Dictionary entry   Tracing back             Symbols decoded
-               0                    0,null             -                        -
-               1                    0,0                -                        -
-               2                    0,1                -                        -
2               3                    2,1                <0,1>                    1
2               4                    2,0                <0,1>                    1
1               5                    1,0                <0,0>                    0
5               6                    5,1                <1,0>--<0,0>             0,0
4               7                    4,1                <2,0>--<0,1>             1,0
3               8                    3,0                <2,1>--<0,1>             1,1
6               9                    6,0                <5,1>--<1,0>--<0,0>      0,0,1
1               10                   1,1                <0,0>                    0
3               11                   3,1                <2,1>--<0,1>             1,1
4               12                   4,0                <2,0>--<0,1>             1,0
6               13                   6,1                <5,1>--<1,0>--<0,0>      0,0,1
-               14                   No entry yet       -                        -

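Therefore the sequence decoded is 110 001 011 001 011 100 011 11, and the dictionary constructed from the received code words is shown above. The eleven code words in the table account for 110 001 011 001 011 100 01; the remaining symbols 1 11 are recovered once the encoder's final pointer value n = 11 (entry <3,1>, i.e. the string 111) is also received.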

In exercise problem 1.5.1, a discrete memoryless source with A = {a, b, c} emits the string “bccacbcccccccccccaccca”. The sequential encoding procedure is shown in the following tables, along with the encoder's dictionary as it is constructed.

Given that A = {a, b, c}

We initialize the dictionary with addresses 0 to 3, as shown. The initial values for n and m are n = 0 and m = 4. The encoder's operation for this source is as follows:

Source symbol   Present n   Present m   Transmit   Next n   Dictionary entry
b               0           4           -          2        -
c               2           4           2          3        <2,c>
c               3           5           3          3        <3,c>
a               3           6           3          1        <3,a>
c               1           7           1          3        <1,c>
b               3           8           3          2        <3,b>
c               2           9           -          4        -
c               4           9           4          3        <4,c>
c               3           10          -          5        -
c               5           10          5          3        <5,c>
c               3           11          -          5        -
c               5           11          -          10       -
c               10          11          10         3        <10,c>
c               3           12          -          5        -
c               5           12          -          10       -
c               10          12          -          11       -
c               11          12          11         3        <11,c>
a               3           13          -          6        -
c               6           13          6          3        <6,c>
c               3           14          -          5        -
c               5           14          -          10       -
a               10          14          10         1        <10,a>
(end of input)  1           15          -          -        -

Dictionary address    Dictionary entry
0                     0,null
1                     0,a
2                     0,b
3                     0,c
4                     2,c
5                     3,c
6                     3,a
7                     1,c
8                     3,b
9                     4,c
10                    5,c
11                    10,c
12                    11,c
13                    6,c
14                    10,a
15                    No entry yet

The decoding process in exercise problem 1.5.2 can be seen explicitly with the help of the table below.

The decoder begins by constructing the same first four entries as the encoder. It can do this because the source alphabet is known a priori by the decoder. The decoder is initialized so that the address of the next dictionary entry is 4.

Received code   Dictionary address   Dictionary entry   Tracing back                      Symbols decoded
-               0                    0,null             -                                 -
-               1                    0,a                -                                 -
-               2                    0,b                -                                 -
-               3                    0,c                -                                 -
2               4                    2,c                <0,b>                             b
3               5                    3,c                <0,c>                             c
3               6                    3,a                <0,c>                             c
1               7                    1,c                <0,a>                             a
3               8                    3,b                <0,c>                             c
4               9                    4,c                <2,c>--<0,b>                      b, c
5               10                   5,c                <3,c>--<0,c>                      c, c
10              11                   10,c               <5,c>--<3,c>--<0,c>               c, c, c
11              12                   11,c               <10,c>--<5,c>--<3,c>--<0,c>       c, c, c, c
6               13                   6,c                <3,a>--<0,c>                      c, a
10              14                   10,a               <5,c>--<3,c>--<0,c>               c, c, c
1               15                   No entry yet       <0,a>                             a

Therefore the sequence decoded is bccacbcccccccccccaccca, and the dictionary constructed from the received code words is shown above.

Advantages of LZ compression technique:

• An LZ algorithm uses an adaptive approach with a universal coding scheme and a single pass over the data, without any need to transmit or store the dictionary (the dictionary is created “on the fly”, i.e. decompression recreates the codeword dictionary, so it does not need to be passed).

• LZ compression works best for files containing lots of repetitive data. This is often the case with text and monochrome images. Files that do not contain any repetitive information at all can even grow bigger when compressed!

• LZ compression is simple and fast, and gives good compression.

Disadvantages of LZ compression technique:

The LZ compression technique substitutes the detected repeated patterns with references to a dictionary. Unfortunately, the larger the dictionary, the greater the number of bits that are necessary for the references. The optimal size of the dictionary also varies for different types of data; the more variable the data, the smaller the optimal size of the dictionary. A single fixed dictionary size therefore does not give an optimum compression ratio for all data. Also, LZ is a fairly old compression technique; all recent computer systems have the horsepower to use more efficient algorithms.
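The growth in reference size can be made concrete under one assumption that the report does not state: that every reference is sent as a fixed-length binary address large enough to index the whole dictionary. The Python sketch below computes that length; the function name bits_per_code_word is hypothetical.

```python
def bits_per_code_word(dictionary_size):
    """Bits needed for a fixed-length address into a dictionary with
    dictionary_size entries (addresses 0 .. dictionary_size - 1)."""
    # The largest address is dictionary_size - 1; bit_length gives the
    # number of binary digits needed to write it.
    return max(1, (dictionary_size - 1).bit_length())


for size in (16, 256, 4096, 65536):
    print(size, bits_per_code_word(size))
# prints:
# 16 4
# 256 8
# 4096 12
# 65536 16
```

Under this assumption, the binary example earlier in this report, whose dictionary ends with 15 addresses (0 to 14), would need 4 bits per code word.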

Applications of LZ compression technique:

When it was introduced, LZW compression provided the best compression ratio among all well-known methods available at that time. It became the first widely used universal data compression method on computers. A large English text file can typically be compressed via LZW to about half its original size.

LZW was used in the program compress, which became a more or less standard utility in Unix systems circa 1986. It has since disappeared from many distributions, for both legal and technical reasons, but as of 2008 at least FreeBSD includes both compress and uncompress as a part of the distribution. Several other popular compression utilities also used LZW, or closely related methods.

LZW became very widely used when it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF and PDF files. (Although LZW is available in Adobe Acrobat software, Acrobat by default uses the DEFLATE algorithm for most text and color-table-based image data in PDF files.)

RESULTS:

LZ encoder Outputs from GUI:

LZ decoder Outputs from GUI:

Conclusion:

It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZ compression excels when confronted with data streams that contain repeated strings of any type. Because of this, it does extremely well when compressing English text, where compression levels of 50% or better should be expected. In the results section the code is tested on Examples 1.5.1 and 1.5.2 and on exercise problems 1.5.1 and 1.5.2.

The code attached to this report was written and tested in MATLAB, and was successfully compiled and executed. It consists of coding and decoding routines for binary sources (0, 1) as well as other discrete memoryless sources (e.g. a, b, c). The code can be extended to a discrete source that transmits more than three symbols by assigning proper ASCII values to each symbol and extending the dictionary accordingly. The code provides a graphical user interface (GUI) through which the user can supply any input and obtain the respective output.

References:

[1] Richard B. Wells, Applied Coding and Information Theory for Engineers (textbook).

[2] http://en.wikipedia.org/wiki/LZW

[3] http://marknelson.us/1989/10/01/lzw-data-compression/

[4] http://www.answers.com/topic/data-compression

[5] http://www.prepressure.com/library/compression_algorithms/lzw

[6] Christina Zeeh, "The Lempel Ziv Algorithm", Seminar "Famous Algorithms", January 16, 2003.
