data compression - text compression - run length encoding

Data Compression Manish T I

• If a data item d occurs n consecutive times in the input stream, replace the n occurrences with the single pair nd.

• The n consecutive occurrences of a data item are called a run length of n, and this approach to data compression is called run-length encoding or RLE.
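The replacement of each run by a pair nd can be sketched in a few lines of Python. This is only an illustration: the function names `run_lengths` and `expand` are hypothetical, and every run, even a run of length 1, is turned into a pair here.

```python
from itertools import groupby

def run_lengths(data):
    """Return a list of (n, d) pairs, one pair per run of identical items."""
    return [(len(list(group)), item) for item, group in groupby(data)]

def expand(pairs):
    """Inverse of run_lengths: rebuild the original string from the pairs."""
    return "".join(item * n for n, item in pairs)

pairs = run_lengths("aaaabbbcd")
print(pairs)          # [(4, 'a'), (3, 'b'), (1, 'c'), (1, 'd')]
print(expand(pairs))  # aaaabbbcd
```

Note that the runs of length 1 each cost a full pair, so input with few repetitions grows instead of shrinking.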

Run Length Encoding

We have to adopt the convention that only three or more repetitions of the same character will be replaced with a repetition factor.

The main problems with this method are the following:

• In plain English text there are not many repetitions.

• There are many “doubles” but a “triple” is rare.

• The most repetitive character is the space.

• Dashes or asterisks may sometimes also repeat.

• In mathematical texts, digits may repeat.

• Example Paragraph

The abbott from Abruzzi accedes to the demands of all abbesses from Narragansett and Abbevilles from Abyssinia. He will accommodate them, abbreviate his sabbatical, and be an accomplished accessory.

• The escape character that marks a repetition (“@” in these examples) may itself be part of the text in the input stream, in which case a different escape character must be chosen.

• Sometimes the input stream may contain every possible character in the alphabet.

• Example

An example is an object file, the result of compiling a program. Such a file contains machine instructions and can be considered a string of bytes that may have any values.

• Since the repetition count is written on the output stream as a byte, it is limited to counts of up to 255.

• This limitation can be softened somewhat when we realize that the existence of a repetition count means that there is a repetition (at least three identical consecutive characters).

• We may adopt the convention that a repeat count of 0 means three repeat characters, which implies that a repeat count of 255 means a run of 258 identical characters.
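The conventions above (an escape character, a one-byte count, and a count of 0 standing for three repetitions) can be sketched as follows. This is a minimal illustration, not the method of any particular product: the names `rle_encode` and `rle_decode` are hypothetical, “@” is assumed as the escape byte, and the encoder simply refuses input that already contains the escape byte.

```python
ESCAPE = ord("@")          # assumed escape byte
MIN_RUN = 3                # shorter runs are copied unchanged
MAX_RUN = 255 + MIN_RUN    # a count byte of 255 means a run of 258

def rle_encode(data: bytes, escape: int = ESCAPE) -> bytes:
    if escape in data:
        raise ValueError("escape byte occurs in the input; choose another one")
    out = bytearray()
    i = 0
    while i < len(data):
        # Find the end of the run starting at i, capped at MAX_RUN.
        j = i
        while j < len(data) and data[j] == data[i] and j - i < MAX_RUN:
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out += bytes([escape, run - MIN_RUN, data[i]])  # escape, count, data
        else:
            out += data[i:j]                                # copy short runs as-is
        i = j
    return bytes(out)

def rle_decode(packed: bytes, escape: int = ESCAPE) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == escape:
            count, value = packed[i + 1], packed[i + 2]
            out += bytes([value]) * (count + MIN_RUN)
            i += 3
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)

print(rle_encode(b"ab-----cc"))  # b'ab@\x02-cc'
```

The five dashes become the three bytes escape, 2, and “-”, while the double “cc” is left alone because replacing it would not save anything.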

• The MNP class 5 method was used for data compression in old modems.

• It was developed by Microcom, Inc., a maker of modems (MNP stands for Microcom Networking Protocol), and it uses a combination of run-length and adaptive frequency encoding.

Performance

We assume that the string of N characters contains M repetitions of average length L each. Each of the M repetitions is replaced by 3 characters (escape, count, and data).

Size of the compressed string is N − M × L + M × 3 = N − M(L − 3)

Compression factor = N / [N − M(L − 3)]
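As a quick numeric check of the formula, the sketch below evaluates it for a string of N characters with M repetitions of average length L. The function names and the sample numbers are illustrative assumptions, not values from the slides.

```python
def compressed_size(n, m, avg_len):
    """N - M(L - 3): each run of avg_len characters shrinks to 3 characters."""
    return n - m * (avg_len - 3)

def compression_factor(n, m, avg_len):
    """Original size divided by compressed size."""
    return n / compressed_size(n, m, avg_len)

print(compressed_size(1000, 10, 8))     # 950
print(compression_factor(1000, 10, 3))  # 1.0
```

With L = 3 nothing is gained, which matches the convention of replacing only runs of three or more; the factor exceeds 1 only when the average run is longer than 3.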

Digram Encoding

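• A variant of run length encoding for text is digram encoding.

• This method is suitable for cases where the data to be compressed consists only of certain characters, e.g., just letters, digits, and punctuation. A commonly occurring pair of characters (a digram) is then replaced with one of the characters that cannot occur in the data.

• Good results can be obtained if the data can be analyzed beforehand.

• In plain English text, pairs such as “E ” (E followed by a space), “ T”, “TH”, and “ A” occur often.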
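A rough sketch of digram encoding is shown below. The table is hypothetical: it maps four digrams that are frequent in English (“TH”, “HE”, “IN”, “ER”) to control characters that are assumed never to occur in the text. The function names are likewise made up for illustration.

```python
# Hypothetical code table: each digram is replaced by a control character
# that is assumed never to appear in the text being compressed.
DIGRAMS = {"TH": "\x01", "HE": "\x02", "IN": "\x03", "ER": "\x04"}

def digram_encode(text, table=DIGRAMS):
    if any(code in text for code in table.values()):
        raise ValueError("a code character already occurs in the text")
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:          # replace the digram with its one-character code
            out.append(table[pair])
            i += 2
        else:                      # copy a single character unchanged
            out.append(text[i])
            i += 1
    return "".join(out)

def digram_decode(packed, table=DIGRAMS):
    reverse = {code: pair for pair, code in table.items()}
    return "".join(reverse.get(ch, ch) for ch in packed)

packed = digram_encode("THE INNER")
print(len("THE INNER"), len(packed))  # 9 6
```

The scan is greedy from left to right, so in “THE” the pair “TH” is taken first and “HE” is never considered. A table built by analyzing the data beforehand would normally do better than a fixed one.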

Pattern Substitution

This method suits compressing computer programs, where certain words, such as for, repeat, and print, occur often. Each such word is replaced with a control character or, if there are many such words, with an escape character followed by a code character.

Assuming that code “a” is assigned to the word print, the text “m:print,b,a;” will be compressed to “m:@a,b,a;”.
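The example above can be reproduced with a short sketch. The code table holds only the one assignment given in the text (print maps to “a”); the function names and the use of a regular expression for matching are assumptions made for illustration.

```python
import re

ESCAPE = "@"
CODES = {"print": "a"}   # word -> code character, as in the example above

def substitute(text, codes=CODES, escape=ESCAPE):
    if escape in text:
        raise ValueError("escape character occurs in the text; choose another one")
    # Match any listed word in a single pass, trying longer words first.
    words = sorted(codes, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in words)
    return re.sub(pattern, lambda m: escape + codes[m.group(0)], text)

def restore(packed, codes=CODES, escape=ESCAPE):
    words = {code: word for word, code in codes.items()}
    out = []
    i = 0
    while i < len(packed):
        if packed[i] == escape:        # escape + code character -> original word
            out.append(words[packed[i + 1]])
            i += 2
        else:
            out.append(packed[i])
            i += 1
    return "".join(out)

print(substitute("m:print,b,a;"))  # m:@a,b,a;
```

The plain variable a in the input is untouched because only an “a” that follows the escape character is treated as a code. This sketch matches substrings, so it would also rewrite the print inside a longer identifier such as printer.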

Relative Encoding [Differencing]

• Successive temperatures normally do not differ by much, so the sensor needs to send only the first temperature, followed by differences.

The sequence of temperatures 70, 71, 72.5, 73.1, ... can be compressed to 70, 1, 1.5, 0.6, .... The differences are small and can be expressed in fewer bits.
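The temperature example can be checked with a small sketch. `Decimal` is used here only so that the differences come out exactly as written (binary floating point would give 0.5999... instead of 0.6); the function names are hypothetical.

```python
from decimal import Decimal
from itertools import accumulate

def to_differences(values):
    """Keep the first value, then send each value minus its predecessor."""
    values = list(values)
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

def from_differences(deltas):
    """Running sum of the deltas restores the original values."""
    return list(accumulate(deltas))

temps = [Decimal("70"), Decimal("71"), Decimal("72.5"), Decimal("73.1")]
deltas = to_differences(temps)
print([str(d) for d in deltas])  # ['70', '1', '1.5', '0.6']
```

Decoding is a running sum, so a single corrupted difference shifts every value after it.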

The sequence 110, 115, 121, 119, 200, 202, ... can be compressed to 110, 5, 6, −2, 200, 2, .... Here 200 is sent as an actual value because the jump from 119 is too large to be sent as a small difference.

The decompressor now needs to distinguish between a difference and an actual value.

One way is for the compressor to create an extra bit (a flag) for each number sent, accumulate those bits, and send them to the decompressor from time to time as part of the transmission.

Assuming that each difference is sent as a byte, the compressor should follow (or precede) a group of 8 bytes with a byte consisting of their 8 flags.
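The flag idea can be sketched as follows. Several details are assumptions, since the text leaves them open: a flag of 1 marks a difference and 0 an actual value, the first flag of each group goes into the most significant bit, and `max_diff=15` is an arbitrary threshold chosen so that the jump to 200 is sent as an actual value.

```python
def encode_with_flags(values, max_diff):
    """Return (numbers, flags); flags[i] is 1 if numbers[i] is a difference."""
    numbers, flags = [], []
    previous = None
    for value in values:
        if previous is not None and abs(value - previous) <= max_diff:
            numbers.append(value - previous)   # small change: send the difference
            flags.append(1)
        else:
            numbers.append(value)              # first value or large jump
            flags.append(0)
        previous = value
    return numbers, flags

def pack_flags(flags):
    """Pack flags 8 to a byte, padding the last group with zeros."""
    out = bytearray()
    for start in range(0, len(flags), 8):
        group = flags[start:start + 8]
        group += [0] * (8 - len(group))
        byte = 0
        for bit in group:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def decode_with_flags(numbers, flags):
    values = []
    for number, flag in zip(numbers, flags):
        values.append(values[-1] + number if flag else number)
    return values

numbers, flags = encode_with_flags([110, 115, 121, 119, 200, 202], max_diff=15)
print(numbers)                  # [110, 5, 6, -2, 200, 2]
print(flags)                    # [0, 1, 1, 1, 0, 1]
print(list(pack_flags(flags)))  # [116]
```

The six flags fit in one byte (01110100 in binary), so the overhead is one extra byte per eight numbers sent.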

Another practical way to send differences mixed with actual values is to send pairs of bytes. Each pair is either an actual 16-bit measurement (positive or negative) or two 8-bit signed differences.

Thus actual measurements can be between 0 and ±32K, and differences are limited to the range of a signed byte, −128 to +127.

For each pair, the compressor creates a flag: 0 if the pair is an actual value, 1 if it is a pair of differences.

After 16 pairs are sent, the compressor sends the 16 flags.

• The sequence of measurements 110, 115, 121, 119, 200, 202, ... is sent as (110), (5, 6), (−2, −1), (200), (2, ...), where each pair of parentheses indicates a pair of bytes.

• The −1 has value 11111111 (binary), which is ignored by the decompressor (it indicates that there is only one difference in this pair).
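A sketch of the byte-pair scheme is given below. It rests on several assumptions that the text does not fix: big-endian byte order, a threshold `max_diff=15` that decides when a value is sent in full, and the rule that a genuine difference of −1 is only ever placed in the first slot of a pair so that it cannot be confused with the filler. The flags are returned as a list; grouping them into blocks of 16 is left out.

```python
import struct

FILLER = -1   # 11111111 in binary: "no second difference in this pair"

def encode_pairs(values, max_diff):
    """Return (payload, flags): two bytes per pair and one flag per pair."""
    payload, flags = bytearray(), []
    previous = None
    i = 0
    while i < len(values):
        diff = None if previous is None else values[i] - previous
        if diff is None or abs(diff) > max_diff:
            payload += struct.pack(">h", values[i])   # actual 16-bit value
            flags.append(0)
            previous = values[i]
            i += 1
            continue
        first, second = diff, FILLER
        previous = values[i]
        i += 1
        if i < len(values):
            nxt = values[i] - previous
            # A real difference of -1 may not sit in the second slot.
            if abs(nxt) <= max_diff and nxt != FILLER:
                second = nxt
                previous = values[i]
                i += 1
        payload += struct.pack(">bb", first, second)  # two signed bytes
        flags.append(1)
    return bytes(payload), flags

def decode_pairs(payload, flags):
    values = []
    for k, flag in enumerate(flags):
        pair = payload[2 * k:2 * k + 2]
        if flag == 0:
            values.append(struct.unpack(">h", pair)[0])
        else:
            first, second = struct.unpack(">bb", pair)
            values.append(values[-1] + first)
            if second != FILLER:
                values.append(values[-1] + second)
    return values

payload, flags = encode_pairs([110, 115, 121, 119, 200, 202], max_diff=15)
print(payload.hex())  # 006e0506feff00c802ff
print(flags)          # [0, 1, 1, 0, 1]
```

Read two bytes at a time, the payload is (110), (5, 6), (−2, −1), (200), (2, −1), which matches the pairs listed above when the sequence ends at 202. For this sketch `max_diff` must not exceed 127 and the measurements must fit in 16 signed bits.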

Reference:

Data Compression: The Complete Reference, David Salomon, Springer Science & Business Media, 2004

For any queries contact: Web: www.iprg.co.in E-mail: [email protected]: @ImageProcessingResearchGroup
