Lecture on Computer Networks – Historical Development
©Thomas Haenselmann, Department 6.2 – Saarland University

Copyright (c) 2008 Dr. Thomas Haenselmann (Saarland University, Germany). Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.




Motivation for source coding in data networks

This short introduction addresses the problem of how to code payload optimally for the network.

Theoretically, for an error-free communication channel, no special precautions would be necessary; we could simply forward the bits. More realistically, we know that the channel introduces errors, so it can make sense to send more bits than just the data in order to cope with the erroneous channel.


Error control: Detecting vs. Correcting

Basically two variants are distinguished:

Error detecting codes
Error correcting codes (Forward Error Correction)

Both variants need a certain amount of redundancy. Which variant makes sense in which situation?

Error detection: Detection might be sufficient if the sink can ask the source for a repeated transmission. This requires a feedback channel and a defined protocol. In some cases, erroneous data can simply be dropped without retransmission (e.g., IP telephony).


Error control: Detecting vs. Correcting

Error correcting codes: A larger amount of redundancy is necessary. This makes sense if no feedback channel is available or if a retransmission would delay the packet considerably (e.g., telephony, video conferences).

What is better: Forward error correction or retransmission?


Error control: The Hamming-Distance


10001001      0011    Operand 1
10110001      0101    Operand 2
--------      ----
00111000      0110    XOR

The Hamming-Distance of two codewords corresponds exactly to the number of bits the codewords differ in. This is calculated using the XOR-operation. In other words: If two codewords have the Hamming-Distance d, then it is necessary to „toggle“ d bits to transform one codeword into the other.
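As a small illustration (not part of the original slides), the distance can be computed by XOR-ing two codewords and counting the set bits of the result; a minimal C++ sketch:

#include <bitset>
#include <cstdint>
#include <iostream>

// Hamming distance of two codewords: XOR them and count the 1-bits.
unsigned hammingDistance(uint32_t a, uint32_t b) {
    return std::bitset<32>(a ^ b).count();
}

int main() {
    // The two 8-bit operands from the slide: 10001001 and 10110001.
    std::cout << hammingDistance(0b10001001, 0b10110001) << std::endl;   // prints 3
}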


Error control: The Hamming-Distance

The Hamming-Distance of an entire code is defined as the smallest possible distance two arbitrary (but different) codewords of a code can have.

To detect errors we need a code with valid and invalid words. To be able to detect d bit errors in a codeword the code must have a distance of d+1. Why does less distance not suffice? With a smaller distance, d bit errors could transform one valid codeword into another valid one, and the error would go unnoticed.


Error control: Error correcting codes

Error correcting codes need a distance of 2d+1 if d errors need to be detected and corrected. Why does less distance not suffice?

With a distance of 2d+1, at least 2d+1 bits have to be toggled to get from one valid codeword to another. If at most d bits are toggled by errors, the received word is still within distance d of the original codeword, but at least d+1 away from every other valid codeword. The nearest valid codeword is therefore always the original one, so up to d errors can be corrected.

Example: 2 bits are encoded by 10 bits

Orig. code:      00          01          10          11
E.corr. code:    0000000000  0000011111  1111100000  1111111111

How many bit errors can be detected and corrected here?


Error control: Redundancy estimation of error correcting codes

How many bits do we need at least for the correction of a bit error?

We want to have 2^m valid codewords, and r correction bits are needed. The error correcting code will have n = (m + r) bits in total. By toggling one of the n bits (redundant bits may be toggled, too) we get an (illegal) codeword with a distance of 1. That means, for each of the 2^m valid words we can create n invalid words. Hence follows

(n + 1) · 2^m ≤ 2^n


Error control: Redundancy estimation of error correcting codes

How many bits do we need at least for the correction of a bit error?

(n + 1) is composed of the n invalid words (each created by one bit error) and the one valid codeword. The total number of possible words of length n stands on the right side of the inequality.

n can be written as m data bits plus r correction bits.

With a given m we can estimate the number of correction bits a code needs in any case.

(m + r + 1) · 2^m ≤ 2^(m + r)

⇒  (m + r + 1) ≤ 2^r
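For illustration (not part of the original slides), the estimation can be evaluated with a small loop that searches, for a given number m of data bits, the smallest r satisfying (m + r + 1) ≤ 2^r:

#include <iostream>

// Smallest number r of check bits such that (m + r + 1) <= 2^r holds.
int minCheckBits(int m) {
    int r = 1;
    while ((m + r + 1) > (1 << r))
        ++r;
    return r;
}

int main() {
    int dataBits[] = {4, 7, 16, 32, 64};
    for (int m : dataBits)
        std::cout << "m = " << m << "  ->  r >= " << minCheckBits(m) << std::endl;
    // e.g. m = 7 data bits need at least r = 4 check bits (the 11-bit code below).
}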


Error control: The error correcting Hamming code

The Hamming code achieves the minimum of the previous estimation.

Algorithm: Number the bits from the LSB to the MSB. All bits at positions that are powers of two (1, 2, 4, 8, ...) are check bits, the rest are data bits. The data bits are filled up from left to right with the actual data; the check bits only depend on the data bits. Which data bit influences which check bit?

Digit:   1  2  3  4  5  6  7  8  9  10  11
         C  C  D  C  D  D  D  C  D  D   D      (C = check bit, D = data bit)


Error control: The error correcting Hamming code

To see this, convert the (order) number of a bit into its binary representation:

11 = 8 + 2 + 1

Data bit 11 hence influences check bits 8, 2 and 1. Check bit 1 is of course also influenced by the data bits 3, 5, 7 and 9. By definition all check bits combined with their data bits have to exhibit an even (or odd, depending what was agreed upon) parity.


Error control: The error correcting Hamming code

Example: Data: 1101101

Digit:                  1    2    3    4    5    6    7      8    9    10   11
Data w/out check bits:  ?    ?    1    ?    1    0    1      ?    1    0    1
Conversion:             1    2    1+2  4    4+1  4+2  4+2+1  8    8+1  8+2  8+2+1
Data and check bits:    1    1    1    0    1    0    1      0    1    0    1

Correction procedure: First, a counter is set to zero. Then the check bits are checked one by one. If one of them produces the wrong parity, the number of the corresponding check bit is added to the counter. In the end, the counter points to the toggled bit.
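A minimal C++ sketch of this scheme (not part of the original slides; even parity, bit positions 1 to 11 with check bits at 1, 2, 4 and 8) that reproduces the codeword above and locates a simulated single-bit error:

#include <iostream>

// Encode 7 data bits into an 11-bit Hamming codeword (even parity).
// bit[1..11]: positions 1, 2, 4 and 8 are check bits, the rest hold the data.
void hammingEncode(const int data[7], int bit[12]) {
    int d = 0;
    for (int pos = 1; pos <= 11; ++pos)
        if (pos != 1 && pos != 2 && pos != 4 && pos != 8)
            bit[pos] = data[d++];
    int checks[] = {1, 2, 4, 8};
    for (int c : checks) {                    // each check bit covers all positions whose
        int parity = 0;                       // binary representation contains its value
        for (int pos = 1; pos <= 11; ++pos)
            if ((pos & c) && pos != c) parity ^= bit[pos];
        bit[c] = parity;                      // even parity over the covered positions
    }
}

// Returns 0 if all parities are fine, otherwise the position of the toggled bit.
int hammingSyndrome(const int bit[12]) {
    int counter = 0;
    int checks[] = {1, 2, 4, 8};
    for (int c : checks) {
        int parity = 0;
        for (int pos = 1; pos <= 11; ++pos)
            if (pos & c) parity ^= bit[pos];
        if (parity) counter += c;             // wrong parity: add the check bit's number
    }
    return counter;
}

int main() {
    int data[7] = {1, 1, 0, 1, 1, 0, 1};      // the data word 1101101 from the slide
    int bit[12];
    hammingEncode(data, bit);
    for (int pos = 1; pos <= 11; ++pos) std::cout << bit[pos];
    std::cout << std::endl;                   // prints 11101010101
    bit[5] ^= 1;                              // simulate a single bit error at position 5
    std::cout << "defective bit: " << hammingSyndrome(bit) << std::endl;   // prints 5
}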


Error control: The error correcting Hamming code

Another interpretation of the error recovery scheme:

Why does the counter value reference the defective bit?

Bit 1 defective? yes: bits 3, 5, 7, 9 and 11 may be defective
Bit 2 defective? no:  bits 5 and 9 are left
Bit 4 defective? yes: only bit 5 is left


Error control: Cyclic Redundancy Check (CRC)

CRC is based on the idea of polynomial division. Remember:

(x^5 + x^3 + x + 1) : (x + 1) = x^4 - x^3 + 2x^2 - 2x + 3 - [2/(x+1)]
-(x^5 + x^4)
------------
      -x^4 + x^3
    -(-x^4 - x^3)
    -------------
             2x^3 + x
           -(2x^3 + 2x^2)
           --------------
                  -2x^2 + x
                -(-2x^2 - 2x)
                -------------
                         3x + 1
                       -(3x + 3)
                       ---------
                             -2   = remainder or modulus

What's the difference between polynomial division and normal division?

Check: [x^4 - x^3 + 2x^2 - 2x + 3 - 2/(x+1)] · (x + 1) = ...


Error control: Cyclic Redundancy Check

A bit string is interpreted as a polynomial by numbering the bits consecutively and, if a bit is set, by adding the corresponding term to the polynomial. In other words: Use the bits as coefficients.

Example:

Position:    7 6 5 4 3 2 1 0
Data bits:   1 1 0 1 0 1 0 1

1·x^7 + 1·x^6 + 0·x^5 + 1·x^4 + 0·x^3 + 1·x^2 + 0·x^1 + 1·x^0 is the polynomial corresponding to the given data bits.


Error control: Cyclic Redundancy Check

The principle of CRC:

Sender and recipient agree upon a "divisor polynomial", also called generator polynomial. Then g zeros are appended to the message, g being the degree of the generator polynomial. In the next step, the sender divides the message (extended by the g zeros) by the generator, similar to the polynomial division above. In most cases there will be a remainder; the result of the division is of no interest. The remainder is then subtracted from the message (extended by the zeros). The resulting bit string is transferred to the recipient. If the message was transmitted correctly, no remainder should emerge on the recipient's side.

Why?

Because the sender intentionally subtracted the remainder before sending the message. The g zeros which emerge after the division are interpreted by the recipient as an indication of an error free transmission.

Note: The sender can safely subtract the remainder without harming the message. Why?


Error control: Cyclic Redundancy Check

The only difference to normal polynomial division:

Calculations are binary and after the calculation of each digit a modulo 2 operation is performed! In other words: Always ignore the carry-over. This simplifies the addition and subtraction significantly.

  10101101          01001110
+ 01011100        – 11101010
----------        ----------
  11110001          10100100

Discovery: Both operations, plus and minus, are equivalent to the XOR operation.

 


Error control: Cyclic Redundancy Check (CRC)

Example:

Message: 1101011011
Generator polynomial: 10011 (x^4 + x + 1) = 4th degree (5th order)
Message extended by 4 zeros: 11010110110000

Division: 11010110110000 : 10011 = (the result does not matter)
          10011                         XOR
          --------------
          01001110110000
           10011                        XOR
          --------------
          00000010110000
                10011                   XOR
          --------------
          00000000101000
                  10011                 XOR
          --------------
          00000000001110   →   1110 = remainder

  11010110110000
– (XOR)      1110
-----------------
  11010110111110  =  transmitted message, which should generate no remainder when divided


Special note: The division continues, if the MSB (most significant bit) of the divisor and of the bit string currently being divided is set. Sometimes the bit string has to be extended by new bits (from the message) until the generator polynomial “fits under it”.
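A small C++ sketch (not part of the original slides) that performs this bitwise mod-2 division on a bit string and reproduces the example above:

#include <iostream>
#include <string>

// Mod-2 (XOR) division of a bit string by a generator bit string.
// Returns the remainder, i.e. the last (generator length - 1) bits.
std::string crcRemainder(std::string bits, const std::string& gen) {
    for (std::size_t i = 0; i + gen.size() <= bits.size(); ++i)
        if (bits[i] == '1')                              // generator "fits under" a leading 1
            for (std::size_t j = 0; j < gen.size(); ++j)
                bits[i + j] = (bits[i + j] == gen[j]) ? '0' : '1';   // XOR digit by digit
    return bits.substr(bits.size() - (gen.size() - 1));
}

int main() {
    std::string message = "1101011011", gen = "10011";
    std::string extended = message + std::string(gen.size() - 1, '0');   // append g zeros
    std::string rem = crcRemainder(extended, gen);
    std::cout << "remainder:   " << rem << std::endl;                    // 1110

    // Subtract (XOR) the remainder from the extended message to obtain the frame.
    std::string frame = extended;
    for (std::size_t j = 0; j < rem.size(); ++j) {
        std::size_t k = frame.size() - rem.size() + j;
        frame[k] = (frame[k] == rem[j]) ? '0' : '1';
    }
    std::cout << "transmitted: " << frame << std::endl;                  // 11010110111110
    std::cout << "check:       " << crcRemainder(frame, gen) << std::endl;   // 0000
}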


Error control: (CRC) – Recognized errors

Which errors are recognized?

For the following analysis separate the error from the message:

Transmitted message (resp. the corresponding polynomial) including an error: M(x)
Original message (error free): T(x)

Split the transmitted message into: M(x) = T(x) + E(x)

with E(x) being the isolated error. Every bit which is set in E stands for a toggled bit in M. A sequence from the first 1 bit to the last 1 bit is called a burst-error. A burst-error can occur anywhere in E.

Question: Does the following division by the generator polynomial G(x) produce a remainder? If not, we cannot detect the error.

[T(x) + E(x)] : G(x) = "remainder-less"?

T(x) : G(x) is divisible without any remainder, because we constructed the message exactly for this property. The analysis is therefore reduced to the question whether E(x) : G(x) erroneously results in no remainder, thus passing undetected.


Error control: (CRC) – Recognized errors

Which errors are recognized?

1-bit error:

The burst consists of only one error. If the generator polynomial has more than one term, E(x) with a leading 1 followed by zeros cannot be divided without a remainder. So we are on the safe side with regard to 1-bit errors. Our generator polynomial is at least as good as a parity bit.

Example:

1000(...)0 : 101 = 1…
– 101
  ---
  001 ... continued as above ...


Error control: (CRC) – Recognized errors

Which errors are recognized?

2-bit error:

A 2-bit error must look like this: x^i + (...) + x^j; therefore x^j can be factored out, which results in x^j · (x^(i-j) + 1).

It has already been shown that a generator polynomial with more than one term cannot divide the factor x^j. When is a term (x^k + 1) divided? (with k = i - j)

For a given generator polynomial this has to be tested for 2-bit bursts of different lengths. Here, the error (inevitably) has the form 10(...)01.

What follows is an example program to test whether the generator polynomial

x^15 + x^14 + 1

is useful for detecting 2-bit errors.


Error control: (CRC) – Recognized errors

Which errors are recognized?

#include <cstring>
#include <iostream>
using namespace std;

bool Divisible(const char* bits, int len, const char* gen, int gen_len);  // helper, see below

int main() {
  const int MAX_LENGTH = 60000;
  const char* generator = "1100000000000001";        // x^15 + x^14 + 1
  char* bit_string = new char[MAX_LENGTH + 1];

  for (int length = 2; length < MAX_LENGTH; length++) {
    if ((length % 100) == 0) cout << length << endl;

    for (int j = 1; j < length - 1; j++)             // clear bit string
      bit_string[j] = '0';
    bit_string[0] = '1'; bit_string[length - 1] = '1'; bit_string[length] = 0;

    // test if divisible by the generator polynomial
    if (Divisible(bit_string, length, generator, strlen(generator)) == true) {
      cout << "Division successful with length " << length << endl;
      break;
    } // if
  } // for

  if (bit_string) delete[] bit_string;
} // main
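The program relies on a helper Divisible() that is not shown on the slide; a possible implementation (an assumption, performing the same mod-2 division as in the CRC example) matching the forward declaration above:

// Not on the original slides: one possible Divisible(), assumed to perform the
// mod-2 division and to report whether the remainder is all zeros.
bool Divisible(const char* bits, int len, const char* gen, int gen_len) {
  char* work = new char[len + 1];
  for (int i = 0; i <= len; i++) work[i] = bits[i];      // local working copy
  for (int i = 0; i + gen_len <= len; i++)
    if (work[i] == '1')
      for (int j = 0; j < gen_len; j++)                  // XOR the generator in
        work[i + j] = (work[i + j] == gen[j]) ? '0' : '1';
  bool zero = true;
  for (int i = 0; i < len; i++)
    if (work[i] == '1') zero = false;
  delete[] work;
  return zero;
}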


Error control: (CRC) – Recognized errors

Which errors are recognized?

Error polynomials with an odd number of terms:

Speculation: If the generator polynomial contains the factor (x^1 + x^0), an error string with an odd number of bits cannot be divided.

Proof by contradiction: Assume E(x) is divisible by (x^1 + x^0); then the factor can be extracted:

E(x) = (x^1 + x^0) · Q(x)

So far we only divided polynomials. Now, for the first time, we use them as functions and evaluate them for x = 1.

(x^1 + x^0) then equals (1 + 1) = 0, since additions are still done modulo 2. Hence E(1) = (1 + 1) · Q(1) = 0 · Q(1) = 0.

But in the beginning we assumed that E(x) contains an odd number of terms, so evaluating it at x = 1 should have given 1, not 0. It follows that the factor (x^1 + x^0) cannot be extracted; as a consequence, E(x) is not divisible by (x^1 + x^0) if it contains an odd number of terms (or error bits). Result: The generator polynomial should contain the factor (x^1 + x^0) to catch all errors with an odd number of bits.


Error control: (CRC) – Recognized errors

Which errors are recognized?

Recognition of burst errors of length r:

The burst error in E(x) could look like this: 0001 anything 100000

To move the last bit to the very right, a factor can be factored out:

E(x) = x^i · (x^(r-1) + ... + 1), with i being the number of zeros on the right side of the last 1.

If the degree of the generator polynomial itself is r (hence its bit string has r + 1 positions), an error of the form (x^(r-1) + ... + 1) cannot be divided either, because the generator polynomial is larger than the error to be divided.

Example (decimal system): 99:1234567 is not divisible without a remainder (result 0, remainder 99). Detecting burst errors with length r is trivial in the way that the error itself simply occurs at the end of the division.

Even if the burst is just as large as the generator polynomial (which means r+1 bits), the division yields no remainder only if by chance the error coincides exactly with the generator polynomial. This is possible, but not very likely.

Example (decimal system): 1234567 : 1234567 = 1 (remainder 0)


Principle of Data Fountains

A number of sources generate a theoretically endless stream of packets which are derived from a file. All packets are pairwise different.

Given that the file consists of p packets, the sink needs to receive only any p+k packets to regenerate the file – no matter from which source, no matter which packets. Any choice of p+k packets will do (k < 5% of the file size).

(mystical, right?)

(Figure: source 1, source 2, …, source n, each streaming file A; a single sink downloads file A by collecting packets from any of them.)


Principle of Data Fountains

Mathematical foundation:

Let F be a file split into chunks to form an n × m bit-matrix, and let T be an invertible n × n bit-matrix (containing only 0s and 1s).

T × F = M

The bit-matrix M is transmitted line-by-line over the net. At the receiver side, the original file is obtained by

T⁻¹ × M = F

If T is the unit matrix, then the sender simply sends the file chunk-by-chunk, e.g. as known from FTP. Advantage: nothing to compute. Disadvantage: any missing chunk will corrupt the file.


Principle of Data Fountains

Data fountains rather make use of a matrix T which is more densely populated with 1-bits. The gain is that we do not rely on each individual data chunk: any n received chunks (or slightly more) will do.

Algorithm

Sender: The sender splits a file into n data chunks (parts). Then, a random bit-vector of size n is created. The chunks with a corresponding 1 are XORed into a new chunk which is broadcast over the network. This process can be repeated endlessly to produce more and more packets from the same file.
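A compact C++ sketch of the sender side (not part of the original slides; chunk sizes, contents and the fixed PRNG seed are arbitrary choices): each outgoing packet is the XOR of all chunks selected by a random bit-vector:

#include <cstdint>
#include <random>
#include <vector>

using Chunk = std::vector<uint8_t>;

// XOR all chunks whose bit in 'select' is set into one outgoing packet.
Chunk makePacket(const std::vector<Chunk>& chunks, const std::vector<int>& select) {
    Chunk packet(chunks[0].size(), 0);
    for (std::size_t i = 0; i < chunks.size(); ++i)
        if (select[i])
            for (std::size_t b = 0; b < packet.size(); ++b)
                packet[b] ^= chunks[i][b];
    return packet;
}

int main() {
    // Toy file: n = 3 chunks of 4 bytes each (contents chosen arbitrarily).
    std::vector<Chunk> chunks = {{0x11, 0x22, 0x33, 0x44},
                                 {0x55, 0x66, 0x77, 0x88},
                                 {0x99, 0xaa, 0xbb, 0xcc}};

    std::mt19937 rng(42);                        // sender and receiver use the same seed,
    std::bernoulli_distribution coin(0.5);       // so the bit-vectors need not be transmitted

    std::vector<Chunk> outgoing;                 // in reality: broadcast over the network
    for (int packetNo = 0; packetNo < 5; ++packetNo) {   // could be repeated endlessly
        std::vector<int> select(chunks.size());
        for (std::size_t i = 0; i < select.size(); ++i) select[i] = coin(rng) ? 1 : 0;
        outgoing.push_back(makePacket(chunks, select));
    }
}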


Principle of Data Fountains

Receiver: It collects n packets with the corresponding bit-vectors. The receiver knows the bit-vectors used by the sender because it uses exactly the same random number generator. After having obtained n packets with linearly independent bit-vectors (namely the matrix T), it inverts T and transforms the message M back into the original file F. (The random number sources have to be synchronized.)


Principle of Data Fountains

Example

Data to be transmitted: 1100 0001 1101
Random bit-stream: 10101110001

Sender (T × F = M):

    T            F                M
  1 0 1       1 1 0 0          0 0 0 1
  0 1 1   ×   0 0 0 1    =     1 1 0 0
  1 0 0       1 1 0 1          1 1 0 0

Receiver (T⁻¹ × M = F):

    T⁻¹          M                F
  0 0 1       0 0 0 1          1 1 0 0
  1 1 1   ×   1 1 0 0    =     0 0 0 1
  1 0 1       1 1 0 0          1 1 0 1

The receiver starts decoding the incoming packets no sooner than it has got an invertible matrix. If this is not (yet) the case, the receiver gathers more packets (and more random data) until the inverse matrix can be calculated.

The random bits are not sent but generated by two synchronized (e.g., by an index) random number generators.


Principle of Data Fountains

Pro

A missing packet means a missing random bit-vector; however, the next packet will likely fill the gap.
In a traditional transmission, a missing part of a file can be replaced by no other packet than the missing one.
Many sources can contribute without mutual cooperation.

Con

Inverting the large matrix might be time- and memory-consuming.
To generate a single new packet, the sender has to XOR 50% of the entire file.

Improvement: Don't generate random bits for the transform but a matrix which is mostly populated along the diagonal.


Motivation for compression in data networks

The bandwidth provided by the physical layer is usually not fully available to an application because of the protocol overhead introduced by each layer of the network stack. By optimizing protocols we can try to exploit the resources more efficiently.

However, not transmitting redundant data in the first place bears a much larger potential for increased efficiency, often several orders of magnitude larger than the wasted network bandwidth.

Question: Every once in a while we can read in the press that a compression algorithm has been invented which can (successfully) compress its own output again (repeatedly). In particular, it is usually claimed that random data can be compressed. Why is this unlikely?

There is a long history of inventors and companies who claim to have achieved the above. For details see

http://www.faqs.org/faqs/compression-faq/part1/section-8.html


Handling huge data volumes

Text:         1 page with 80 characters/line, 64 lines/page and 1 byte/char results in
              80 × 64 × 1 × 8 = 41 kbit/page

Still image:  24 bits/pixel, 512 × 512 pixels/image results in
              512 × 512 × 24 ≈ 6 Mbit/image

Audio:        CD quality, sampling rate 44.1 kHz, 16 bits per sample results in
              44.1 × 16 = 706 kbit/s; stereo: 1.412 Mbit/s

Video:        full-size frame 1024 × 768 pixels/frame, 24 bits/pixel, 30 frames/s results in
              1024 × 768 × 24 × 30 = 566 Mbit/s
              more realistic: 360 × 240 pixels/frame, 360 × 240 × 24 × 30 = 60 Mbit/s

=> Storage and transmission of multimedia streams require compression!


Example 1: ABC -> 1; EE -> 2


Example 2:

Note that in this example both algorithms lead to the same compression rate.


Run Length Coding

Principle: Replace all repetitions of the same symbol in the text ("runs") by a repetition counter and the symbol.

Example
Text:     AAAABBBAABBBBBCCCCCCCCDABCBAABBBBCCD
Encoding: 4A3B2A5B8C1D1A1B1C1B2A4B2C1D

As we can see, we can only expect a good compression rate when long runs occur frequently. Examples are long runs of blanks in text documents or “leading” white pixels in gray-scale images.
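A minimal run-length encoder in C++ (not part of the original slides) that reproduces the encoding above:

#include <iostream>
#include <string>

// Replace every run of identical characters by <repetition counter><character>.
std::string runLengthEncode(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ) {
        std::size_t run = 1;
        while (i + run < text.size() && text[i + run] == text[i]) ++run;
        out += std::to_string(run) + text[i];
        i += run;
    }
    return out;
}

int main() {
    std::cout << runLengthEncode("AAAABBBAABBBBBCCCCCCCCDABCBAABBBBCCD") << std::endl;
    // prints 4A3B2A5B8C1D1A1B1C1B2A4B2C1D
}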


When dealing with binary files we are sure that a run of “1“s is always followed by a run of “0“s and vice versa. It is thus sufficient to store the repetition counters only!

Example

000000000000000000000000000011111111111111000000000   28 14 9
000000000000000000000000001111111111111111110000000   26 18 7
000000000000000000000001111111111111111111111110000   23 24 4
000000000000000000000011111111111111111111111111000   22 26 3
000000000000000000001111111111111111111111111111110   20 30 1
000000000000000000011111110000000000000000001111111   19 7 18 7
000000000000000000011111000000000000000000000011111   19 5 22 5
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000011100000000000000000000000000111   19 3 26 3
000000000000000000001111000000000000000000000001110   20 4 23 3 1
000000000000000000000011100000000000000000000111000   22 3 20 3 3
011111111111111111111111111111111111111111111111111   1 50
011111111111111111111111111111111111111111111111111   1 50
011111111111111111111111111111111111111111111111111   1 50
011111111111111111111111111111111111111111111111111   1 50
011111111111111111111111111111111111111111111111111   1 50
011000000000000000000000000000000000000000000000011   1 2 46 2


Run Length Coding, Legal Issues - Beware of the patent trap!

Run-length encoding of the type (length, character)
  US Patent No.: 4,586,027
  Title:         Method and system for data compression and restoration
  Filed:         07-Aug-1984
  Granted:       29-Apr-1986
  Inventor:      Tsukimaya et al.
  Assignee:      Hitachi

Run-length encoding of the type (length [<= 16], character)
  US Patent No.: 4,872,009
  Title:         Method and apparatus for data compression and restoration
  Filed:         07-Dec-1987
  Granted:       03-Oct-1989
  Inventor:      Tsukimaya et al.
  Assignee:      Hitachi


Variable Length Coding

Classical character codes use the same number of bits for each character. When the frequency of occurrence is different for different characters, we can use fewer bits for frequent characters and more bits for rare characters.

Example
Code 1:  A B C D E ...
         1 2 3 4 5 (binary)

Encoding of ABRACADABRA with constant bit length (= 5 bits):
00001 00010 10010 00001 00011 00001 00100 00001 00010 10010 00001

Code 2:  A B R  C  D
         0 1 01 10 11

Encoding: 0 1 01 0 10 0 11 0 1 01 0


Delimiters

Code 2 can only be decoded unambiguously when delimiters are stored with the codewords. This can increase the size of the encoded string considerably.

Idea: No codeword should be the prefix of another codeword! We will then no longer need delimiters.

Code 3:  A = 11, B = 00, R = 011, C = 010, D = 10

Encoded string: 1100011110101110110001111


Representation as a TRIE (or prefix tree)

An obvious method is to represent such a code as a TRIE. In fact, any TRIE with M leaf nodes can be used to represent a code for a string containing M different characters.

The figure on the next page shows two codes which can be used for ABRACADABRA. The code for each character is represented by the path from the root of the TRIE to that character where “0“ goes to the left, “1“ goes to the right, as is the convention for TRIEs.

The TRIE on the left corresponds to the encoding of ABRACADABRA on the previous page, the TRIE on the right generates the following encoding:

01101001111011100110100

which is two bits shorter.


Two Tries for our Example

The TRIE representation guarantees indeed that no codeword is the prefix of another codeword. Thus the encoded bit string can be uniquely decoded.


Huffman Code

Now the question arises how we can find the best variable-length code for given character frequencies (or probabilities). The algorithm that solves this problem was found by David Huffman in 1952.

Algorithm Generate-Huffman-Code
1. Determine the frequencies of the characters and mark the leaf nodes of a binary tree (to be built) with them.
2. Out of the tree nodes not yet marked as DONE, take the two with the smallest frequencies and compute their sum.
3. Create a parent node for them and mark it with the sum. Mark the branch to the left son with 0, the one to the right son with 1.
4. Mark the two son nodes as DONE. When there is only one node not yet marked as DONE, stop (the tree is complete). Otherwise, continue with step 2.
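The construction can be written compactly with a priority queue; a C++ sketch (not part of the original slides; tie-breaking may yield a different but equally optimal set of codewords than the tree on the next slide):

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node { double freq; char symbol; Node* left; Node* right; };

struct ByFreq {   // smallest frequency on top of the priority queue
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the finished tree: left branch = 0, right branch = 1.
void collectCodes(const Node* n, const std::string& prefix, std::map<char, std::string>& codes) {
    if (!n->left && !n->right) { codes[n->symbol] = prefix; return; }
    collectCodes(n->left,  prefix + "0", codes);
    collectCodes(n->right, prefix + "1", codes);
}

int main() {
    std::vector<std::pair<char, double>> freq =
        {{'A', 0.3}, {'B', 0.3}, {'C', 0.1}, {'D', 0.15}, {'E', 0.15}};

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> open;
    for (auto& f : freq) open.push(new Node{f.second, f.first, nullptr, nullptr});

    while (open.size() > 1) {                 // combine the two smallest nodes
        Node* a = open.top(); open.pop();
        Node* b = open.top(); open.pop();
        open.push(new Node{a->freq + b->freq, '\0', a, b});
    }

    std::map<char, std::string> codes;
    collectCodes(open.top(), "", codes);
    for (auto& c : codes) std::cout << c.first << " = " << c.second << std::endl;
}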


Huffman Code, Example

Probabilities of the characters:
p(A) = 0.3; p(B) = 0.3; p(C) = 0.1; p(D) = 0.15; p(E) = 0.15

(Figure: Huffman tree with intermediate nodes 25% (C+D), 40% (C+D+E), 60% (A+B) and the 100% root. Resulting codes: A = 11, B = 10, C = 011, D = 010, E = 00.)


Huffman Code, why is it optimal?

Characters with higher probabilities are closer to the root of the tree and thus have shorter codewords, so it is a good code. It is even the best possible code!

Reason: The length of an encoded string equals the weighted outer path length of the Huffman tree. To compute the "weighted outer path length" we first compute the product of the weight (frequency counter) of a leaf node with its distance from the root. We then compute the sum of these values over all leaf nodes. This is obviously the same as summing up the products of each character's codeword length with its frequency of occurrence. No other tree with the same frequencies attached to the leaf nodes has a smaller weighted path length than the Huffman tree.


Sketch of the Proof

With a similar construction process, another tree could be built but without always combining the two nodes with the minimal frequencies. We can show by induction that no other such strategy will lead to a smaller weighted outer path length than the one that combines the minimal values in each step.


Decoding Huffman Codes (1)

An obvious possibility is to use the TRIE: Read the input stream sequentially and traverse the TRIE until a leaf node is reached.

When a leaf node is reached, output the character attached to it.

To decode the next bit, start again at the root of the TRIE.

Observation: The input bit rate is constant, the output character rate is variable.


Decoding Huffman Codes (2)

As an alternative we can use a decoding table.

Creation of the decoding table: If the longest codeword has L bits, the table has 2^L entries.

Let c_i be the codeword for character s_i and let c_i have l_i bits. We then create 2^(L-l_i) entries in the table. In each of these entries the first l_i bits are equal to c_i, and the remaining L-l_i bits take on all possible binary combinations.

At all these addresses of the table we enter s_i as the character recognized, and we remember l_i as the length of the codeword.


Decoding with the Table

Algorithm Table-Based Huffman Decoder
1. Read L bits from the input stream into a buffer.
2. Use the buffer as the address into the table and output the recognized character s_i.
3. Remove the first l_i bits from the buffer and pull in the next l_i bits from the input bit stream.
4. Continue with step 2.

Observation: Table-based Huffman decoding is fast. The output character rate is constant, the input bit rate is variable.
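A C++ sketch of the table construction and the decoding loop (not part of the original slides), using Code 3 of the earlier ABRACADABRA example (A=11, B=00, R=011, C=010, D=10, hence L = 3):

#include <iostream>
#include <string>
#include <vector>

struct Entry { char symbol; int length; };

// For a codeword c_i of length l_i create 2^(L-l_i) table entries whose first
// l_i bits equal c_i; store the recognized symbol and the codeword length.
std::vector<Entry> buildTable(const std::vector<std::pair<char, std::string>>& code, int L) {
    std::vector<Entry> table(1 << L);
    for (auto& c : code) {
        int li = c.second.size();
        int prefix = std::stoi(c.second, nullptr, 2) << (L - li);
        for (int rest = 0; rest < (1 << (L - li)); ++rest)
            table[prefix | rest] = {c.first, li};
    }
    return table;
}

int main() {
    std::vector<std::pair<char, std::string>> code =
        {{'A', "11"}, {'B', "00"}, {'R', "011"}, {'C', "010"}, {'D', "10"}};
    const int L = 3;
    std::vector<Entry> table = buildTable(code, L);

    std::string bits = "1100011110101110110001111", out;     // encoded ABRACADABRA
    std::size_t pos = 0;
    while (pos < bits.size()) {
        std::string window = bits.substr(pos, L);            // read L bits into the buffer
        while ((int)window.size() < L) window += "0";        // pad at the very end
        Entry e = table[std::stoi(window, nullptr, 2)];
        out += e.symbol;                                     // output the recognized character
        pos += e.length;                                     // remove the first l_i bits
    }
    std::cout << out << std::endl;                           // prints ABRACADABRA
}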


Huffman Code, Comments

A very good code for many practical purposes.

Can only be used when the frequencies (or probabilities) of the characters are known in advance. Variation: Determine the character frequencies separately for each new document and store/transmit the code tree/table together with the data. Note that a loss in "optimality" comes from the fact that each character must be encoded with an integer number of bits, so the codeword lengths do not match the frequencies exactly (consider a code for three characters A, B and C, each occurring with a frequency of 33%).


Critical Review of 0-Order Codes (1)

ABCD ABCD ABCD ABCD … is quite redundant but cannot be compressed by Huffman.

(Figure: transition table – current character (A–D) vs. next character (A–D).)

Huffman is sometimes referred to as a 0-order entropy model, which means that each character is generated independently of its predecessor. No character makes a particular successor more likely.


Critical Review of 0-Order Codes (2)

ABCD ABCD ABCD ABCD …

(Figure: transition table – current character (A–D) vs. next character (A–D).)

A string of the above type could be better described by a 1-order entropy model: the current character gives a hint on what character is expected next.

The example above is trivial since each character uniquely determines the next one.


Critical Review of 0-Order Codes (3)

Most sources produce characters with varying relative occurrences which can be encoded with Huffman, but which also exhibit inter-character correlations.

(Figure: transition table – current character (A–D) vs. next character (A–D); assumed relative occurrence of each character: 25%, used in the next slide.)

In the English language a 'c' is often followed by an 'h', but not very often by a 'z'. The example on the right consists of only four characters. Once an 'A' has occurred, only another 'A' or a 'B' will follow. As an exception, a 'D' is always followed by another 'D'. Let us see how many bits we must spend if this particular correlation is known:


Critical Review of 0-Order Codes (4)

We assume that the occurrence of every character is equal (25%). Once it has occurred, the following two characters are equally likely (e.g., 50% for A->A and 50% for A->B)*

X = current character;  Y|X = Y occurred and X was its predecessor
P(X) = probability for character X;  P(Y|X) = probability for Y if X was the predecessor

X =                      A            B            C            D
P(X) =                   0.25         0.25         0.25         0.25
Y|X =                    A     B      B     C      C     D      D
P(Y|X) =                 0.5   0.5    0.5   0.5    0.5   0.5    1
-log2 P(Y|X) =           1     1      1     1      1     1      0
-P(Y|X)·log2 P(Y|X) =    0.5   0.5    0.5   0.5    0.5   0.5    0
Sum per X =              1            1            1            0

H(Y|X) = 0.75 bits for coding a character (the first character must be given)

*Note: The example is not realistic. It was chosen for easy calculation.
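The same number can be computed mechanically; a small C++ sketch (not part of the original slides), with the transition probabilities of the toy example hard-coded:

#include <cmath>
#include <iostream>

int main() {
    // P(Y|X) of the toy example: rows = predecessor X (A..D), columns = successor Y (A..D).
    double P[4][4] = {
        {0.5, 0.5, 0.0, 0.0},    // after A: A or B, each 50%
        {0.0, 0.5, 0.5, 0.0},    // after B: B or C, each 50%
        {0.0, 0.0, 0.5, 0.5},    // after C: C or D, each 50%
        {0.0, 0.0, 0.0, 1.0}     // after D: always another D
    };
    double Px = 0.25;            // every character occurs with probability 25%

    double H = 0.0;
    for (int x = 0; x < 4; ++x)
        for (int y = 0; y < 4; ++y)
            if (P[x][y] > 0.0)
                H += Px * P[x][y] * std::log2(1.0 / P[x][y]);
    std::cout << "H(Y|X) = " << H << " bits per character" << std::endl;   // 0.75
}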


Critical Review of 0-Order Codes (5)

Number of bits needed in 1-order entropy models in general:

H(Y|X) = Σ_x P(x) · Σ_y P(y|x) · log2( 1 / P(y|x) )

This determines the number of bits we need in order to code a letter Y (y is a particular instance) if an X occurred already

In this sum, all possible occurrences of X are considered. x can be considered to be the first character which determines the next character y. Each x occurs with a probability of P(x).

This is the probability of getting a character y next if we have currently seen an x. Note that in real-world examples this can very often be zero because many combinations of characters simply never occur. This is why we can save on bits!

This gives us the number of bits we need in order to code a y if we have seen an x before. 


Lempel-Ziv Code

Lempel-Ziv codes are an example of the large group of dictionary-based codes.

Dictionary: A table of character strings which is used in the encoding process.

Example: The word "lecture" is found on page x4, line y4 of the dictionary. It can thus be encoded as (x4, y4).

A sentence such as "this is a lecture" could perhaps be encoded as a sequence of tuples (x1,y1) (x2,y2) (x3,y3) (x4,y4).


Dictionary-Based Coding Techniques

Static techniques: The dictionary exists before a string is encoded. It is not changed, neither in the encoding nor in the decoding process.

Dynamic techniques: The dictionary is created "on the fly" during the encoding process, at the sending (and sometimes also at the receiving) side.

Lempel and Ziv have proposed an especially brilliant dynamic, dictionary-based technique (1977). Variations of this technique are used very widely today for lossless compression. An example is LZW (Lempel/Ziv/Welch), which is invoked with the Unix compress command.

The well-known TIFF format (Tag Image File Format) is also based on Lempel-Ziv coding.


Ziv-Lempel Coding, the Principle

The current piece of the message can be encoded as a reference to an earlier (identical) piece of the message. This reference will usually be shorter than the piece itself. As the message is processed, the dictionary is created dynamically.

InitializeStringTable();
WriteCode(ClearCode);
w = the empty string;
for each character in string {
  K = GetNextCharacter();
  if w + K is in the string table {
    w = w + K;   /* string concatenation */
  } else {
    WriteCode(CodeFromString(w));
    AddTableEntry(w + K);
    w = K;
  }
}
WriteCode(CodeFromString(w));


LZW, Example 1, Encoding

Alphabet: {A, B, C}
Message: A B A B C B A B A B
Encoded message: 1 2 4 3 5 8

Encoding trace (w, K → action):
  w = A,   K = B:  AB not in table   →  output 1 (A),   add 4 = AB,   w = B
  w = B,   K = A:  BA not in table   →  output 2 (B),   add 5 = BA,   w = A
  w = A,   K = B:  AB in table       →  w = AB
  w = AB,  K = C:  ABC not in table  →  output 4 (AB),  add 6 = ABC,  w = C
  w = C,   K = B:  CB not in table   →  output 3 (C),   add 7 = CB,   w = B
  w = B,   K = A:  BA in table       →  w = BA
  w = BA,  K = B:  BAB not in table  →  output 5 (BA),  add 8 = BAB,  w = B
  w = B,   K = A:  BA in table       →  w = BA
  w = BA,  K = B:  BAB in table      →  w = BAB
  end of message:                       output 8 (BAB)

DICTIONARY (index = code):
  1 A   2 B   3 C   4 AB   5 BA   6 ABC   7 CB   8 BAB


LZW Algorithm: Decoding (1)

Note that the decoding algorithm also creates the dictionary dynamically, the dictionary is not transmitted!

while ((Code = GetNextCode()) != EofCode) {
  if (Code == ClearCode) {
    InitializeTable();
    Code = GetNextCode();
    if (Code == EofCode) break;
    WriteString(StringFromCode(Code));
    OldCode = Code;
  } /* end of ClearCode case */
  else {
    if (IsInTable(Code)) {
      WriteString(StringFromCode(Code));
      AddStringToTable(StringFromCode(OldCode) +
                       FirstChar(StringFromCode(Code)));
      OldCode = Code;
    }


LZW Algorithm: Decoding (2)

    else { /* code is not in table */
      OutString = StringFromCode(OldCode) +
                  FirstChar(StringFromCode(OldCode));
      WriteString(OutString);
      AddStringToTable(OutString);
      OldCode = Code;
    }
  }
}


LZW, Example 2, Decoding

Alphabet: {A, B, C, D}

Transmitted code: 1 2 1 3 5 1

Decoding trace (code → output, new dictionary entry):
  code 1  →  output A                      OldCode = 1
  code 2  →  output B,   add 5 = AB        OldCode = 2
  code 1  →  output A,   add 6 = BA        OldCode = 1
  code 3  →  output C,   add 7 = AC        OldCode = 3
  code 5  →  output AB,  add 8 = CA        OldCode = 5
  code 1  →  output A,   add 9 = ABA       OldCode = 1

DICTIONARY (index = code):
  1 A   2 B   3 C   4 D   5 AB   6 BA   7 AC   8 CA   9 ABA

Decoded message: A B A C AB A


LZW, Properties

The dictionary is created dynamically during the encoding and decoding process. It is neither stored nor transmitted!
The dictionary adapts dynamically to the properties of the character string.
With length N of the original message, the encoding process is of complexity O(N). With length M of the encoded message, the decoding process is of complexity O(M). These are thus very efficient processes. Since several characters of the input alphabet are combined into one character of the code, M <= N.


Typical Compression Rates

Typical examples of file sizes in % of the original size

Type of file      Encoded with Huffman    Encoded with Lempel-Ziv
text              50 %                    30 %
machine code      80 %                    55 %
C source code     65 %                    45 %


Arithmetic Coding

From an information theory point of view, the Huffman code is not quite optimal since a codeword must always consist of an integer number of bits even if this does not correspond exactly to the frequency of occurrence of the character. Arithmetic coding solves this problem.

Idea: An entire message is represented by a floating point number out of the interval [0,1). For this purpose the interval [0,1) is repeatedly subdivided according to the frequency of the next symbol. Each new sub-interval represents one symbol. When the process is completed, the shortest floating point number contained in the target interval is chosen as the representative for the message.


Arithmetic Coding, the Algorithm

0. Begin in front of the first character of the input stream, with the current interval set to [0,1).

1. Read the next character from the input stream. Subdivide the current interval according to the frequencies of all characters of the alphabet. Select the subinterval corresponding to the current character as the next current interval.

2. If you reach the end of the input stream or the end symbol, go to step 3. Otherwise go to step 1.

3. From the current (final) interval, select the floating point number that you can represent in the computer with the smallest number of bits. This number is the encoding of the string.


Arithmetic Coding, the Decoding Algorithm

Algorithm Arithmetic Decoding

Subdivide the interval [0,1) according to the character frequencies, as described in the encoding algorithm, up to the maximum size of a message.

The encoded floating point number uniquely identifies one particular subinterval.

This subinterval uniquely identifies one particular message. Output the message.


Arithmetic Coding, Example

Alphabet = {A, B, C}
Frequencies (probabilities): p(A) = 0.2; p(B) = 0.3; p(C) = 0.5

Messages: ACB AAB (maximum size of a message is 3).

Encoding of the first block ACB:

Subdivision of the intervals:
  [0, 1):       A = [0, 0.2)      B = [0.2, 0.5)     C = [0.5, 1)       →  'A' selects [0, 0.2)
  [0, 0.2):     A = [0, 0.04)     B = [0.04, 0.1)    C = [0.1, 0.2)     →  'C' selects [0.1, 0.2)
  [0.1, 0.2):   A = [0.1, 0.12)   B = [0.12, 0.15)   C = [0.15, 0.2)    →  'B' selects [0.12, 0.15)

Final interval: [0.12; 0.15) choose e.g. 0.125
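A small C++ sketch (not part of the original slides) that performs this interval narrowing for the message ACB with plain doubles (the fixed-point register version is discussed on the following slides):

#include <iostream>
#include <string>

int main() {
    // Alphabet and probabilities from the example: p(A)=0.2, p(B)=0.3, p(C)=0.5.
    const std::string alphabet = "ABC";
    const double p[3] = {0.2, 0.3, 0.5};

    std::string message = "ACB";
    double low = 0.0, high = 1.0;                  // current interval [low, high)

    for (char ch : message) {
        double width = high - low, offset = 0.0;
        for (int i = 0; i < 3; ++i) {
            if (alphabet[i] == ch) {               // select this symbol's sub-interval
                high = low + (offset + p[i]) * width;
                low  = low + offset * width;
                break;
            }
            offset += p[i];
        }
        std::cout << ch << ": [" << low << ", " << high << ")" << std::endl;
    }
    // Final interval [0.12, 0.15); any number inside it, e.g. 0.125, encodes "ACB".
}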


Arithmetic Coding, Implementation

So far we dealt with real numbers of (theoretically) infinite precision. How do we actually encode a message with many characters (like several megabytes)? If character 'A' occurs with a probability of 20%, the number of digits for coding consecutive 'A's grows very fast:

First A < 0.2, second A < 0.04, third A < 0.008, then 0.0016, 0.00032, 0.000064 and so on. Let us assume that our processor has 8-bit wide registers.

Note: We skip storing or transmitting the leading "0." since it is redundant.
Advantage: We have a binary fixed-point representation of fractions and we can compute with them in the processor's registers.

The interval [0, 1) in an 8-bit register, subdivided at 0.2 and 0.5:
  0.00000000   (decimal   0)           lower bound
  0.00110011   (0.2 × 255 = 51)        boundary A | B
  0.01111111   (0.5 × 255 = 127)       boundary B | C
  0.11111111   (decimal 255)           upper bound


Arithmetic Coding, Implementation

Disadvantage: Even when using 32-bit or 64-bit CPU registers we can only code a couple of characters.

Solution: Once our interval gets smaller and smaller, we will obtain a growing number of leading bits which have settled (they will never change). So we will transmit them to the receiver and shift them out of our register to gain new bits.

Example: Interval of first A = [ 0.00000000 , 0.00110011 ]

No matter which characters follow, the most significant two zeros of the lower and the upper bound will never change. So we store or transmit them and shift the rest two digits to the left. As a consequence we gain two new least significant digits:
A = [ 0.000000?? , 0.110011?? ]


Arithmetic Coding, Implementation

A=[ 0.000000?? , 0.110011?? ]

How do we initialize the new digits prior to the ongoing encoding?

Obviously, we want to keep the interval as large as possible. This is achieved by filling the lower bound with 0-bits and the upper bound with 1-bits.

Anew=[ 0.00000000 , 0.11001111 ]

Note that always adding 0-bits to the lower bound and 1-bits to the upper bound introduces a small error, because the size of the interval then no longer corresponds exactly to the probability of character 'A'. However, we will not get into trouble as long as the encoding and the decoding side make the same error.


Arithmetic Coding, Properties

The encoding depends on the probabilities (relative occurrences) of the characters: the higher the frequency, the larger the subinterval and the smaller the number of bits needed to represent it.
The code length reaches the theoretical optimum: the number of bits used for each character need not be an integer. It can approach the real probability better than the Huffman code.
We always need a terminal symbol to stop the encoding process.

Problem: One bit error destroys the entire message.