296.3 Algorithms in the Real World: Data Compression (Lectures 1 and 2)


15-853: Algorithms in the Real World (2004)

296.3 Page 1: Algorithms in the Real World
Data Compression: Lectures 1 and 2

296.3 Page 2: Compression in the Real World
Generic file compression
- Files: gzip (LZ77), bzip (Burrows-Wheeler), BOA (PPM)
- Archivers: ARC (LZW), PKZip (LZW+)
- File systems: NTFS
Communication
- Fax: ITU-T Group 3 (run-length + Huffman)
- Modems: V.42bis protocol (LZW), MNP5 (run-length + Huffman)
- Virtual connections

296.3 Page 3: Compression in the Real World
Multimedia
- Images: gif (LZW), jbig (context), jpeg-ls (residual), jpeg (transform + RL + arithmetic)
- TV: HDTV (mpeg-4)
- Sound: mp3 (an example)
Other structures
- Indexes: google, lycos
- Meshes (for graphics): edgebreaker
- Graphs
- Databases

296.3 Page 4: Compression Outline
- Introduction: lossless vs. lossy, model and coder, benchmarks
- Information Theory: entropy, etc.
- Probability Coding: Huffman + arithmetic coding
- Applications of Probability Coding: PPM + others
- Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
- Other Lossless Algorithms: Burrows-Wheeler
- Lossy algorithms for images: JPEG, MPEG, ...
- Compressing graphs and meshes: BBK

296.3 Page 5: Encoding/Decoding
We will use "message" in a generic sense to mean the data to be compressed. The encoder turns the input message into a compressed message, and the decoder turns the compressed message back into the output message. The encoder and decoder need to understand a common compressed format.

296.3 Page 6: Lossless vs. Lossy
- Lossless: input message = output message
- Lossy: input message ≠ output message

Lossy does not necessarily mean loss of quality; in fact the output could be better than the input:
- drop random noise in images (dust on the lens)
- drop background noise in music
- fix spelling errors in text, or put it into better form
Writing is the art of lossy text compression.

296.3 Page 7: How Much Can We Compress?
For lossless compression, assuming all input messages are valid, if even one string is compressed, some other must expand.

296.3 Page 9: Quality of Compression
Runtime vs. compression vs. generality. Several standard corpora are used to compare algorithms, e.g. the Calgary Corpus: 2 books, 5 papers, 1 bibliography, 1 collection of news articles, 3 programs, 1 terminal session, 2 object files, 1 set of geophysical data, and 1 black-and-white bitmap image. The Archive Comparison Test maintains a comparison of just about all publicly available algorithms.

296.3 Page 10: Comparison of Algorithms
(The comparison table for this slide appears at the end of the transcript.)

296.3 Page 11: Compression Outline
- Introduction: lossy vs. lossless, benchmarks, ...
- Information Theory: entropy, conditional entropy, entropy of the English language
- Probability Coding: Huffman + arithmetic coding
- Applications of Probability Coding: PPM + others
- Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
- Other Lossless Algorithms: Burrows-Wheeler
- Lossy algorithms for images: JPEG, MPEG, ...
- Compressing graphs and meshes: BBK

296.3 Page 12: Information Theory
An interface between modeling and coding.
- Entropy: a measure of information content
- Conditional entropy: information content based on a context
- Entropy of the English language: how much information does each character in typical English text contain?

296.3 Page 13: Entropy (Shannon 1948)
For a set of messages S with probability p(s), s ∈ S, the self information of s is

$i(s) = \log_2 \frac{1}{p(s)}$

Self information is measured in bits if the log is base 2. The lower the probability, the higher the information. Entropy is the weighted average of the self information:

$H(S) = \sum_{s \in S} p(s) \log_2 \frac{1}{p(s)}$

296.3 Page 14: Entropy Example
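The worked examples on this slide were figures in the original; purely as an illustration, here is a minimal Python sketch of the definitions above, applied to two distributions of my own choosing (not necessarily the slide's).

```python
from math import log2

def self_information(p):
    # i(s) = log2(1/p(s)), measured in bits
    return log2(1.0 / p)

def entropy(dist):
    # H(S) = sum over s of p(s) * log2(1/p(s))
    return sum(p * self_information(p) for p in dist if p > 0)

# Two illustrative distributions (my own picks):
print(entropy([0.25, 0.25, 0.25, 0.125, 0.125]))        # 2.25 bits
print(entropy([0.75, 0.0625, 0.0625, 0.0625, 0.0625]))  # about 1.31 bits
```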

296.3 Page 15: Conditional Entropy
The conditional probability p(s|c) is the probability of s in a context c. The conditional self information is

$i(s|c) = \log_2 \frac{1}{p(s|c)}$

The conditional information can be either more or less than the unconditional information. The conditional entropy is the weighted average of the conditional self information:

$H(S|C) = \sum_{c \in C} p(c) \sum_{s \in S} p(s|c) \log_2 \frac{1}{p(s|c)}$

296.3 Page 16: Example of a Markov Chain
Two states, w and b, with transition probabilities p(w|w) = .9, p(b|w) = .1, p(w|b) = .2, p(b|b) = .8.

296.3 Page 17: Entropy of the English Language
How can we measure the information per character?
- ASCII code = 7
- Entropy = 4.5 (based on character probabilities)
- Huffman codes (average) = 4.7
- Unix compress = 3.5
- gzip = 2.6
- bzip = 1.9
- Entropy = 1.3 (for text compression test)
Must be less than 1.3 for the English language.

296.3 Page 18: Shannon's Experiment
Shannon asked humans to predict the next character given the whole previous text, and used these predictions as conditional probabilities to estimate the entropy of the English language. The number of guesses required for the right answer is tabulated at the end of the transcript.

From the experiment he predicted H(English) = 0.6 to 1.3 bits per character.
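To make Pages 15-16 concrete, here is a minimal Python sketch (function names are mine) that computes the conditional entropy H(S|C) of the two-state Markov chain above, weighting each state's conditional entropy by its stationary probability.

```python
from math import log2

# Transition probabilities from Page 16: P[current][next]
P = {"w": {"w": 0.9, "b": 0.1},
     "b": {"w": 0.2, "b": 0.8}}

def entropy(dist):
    return sum(p * log2(1.0 / p) for p in dist.values() if p > 0)

# Stationary distribution of the chain: pi_w * p(b|w) = pi_b * p(w|b)
pi_w = P["b"]["w"] / (P["b"]["w"] + P["w"]["b"])   # = .2 / (.2 + .1) = 2/3
pi = {"w": pi_w, "b": 1.0 - pi_w}

# Conditional entropy: H(S|C) = sum over contexts c of p(c) * H(S|c)
h_cond = sum(pi[c] * entropy(P[c]) for c in P)
print(h_cond)        # about 0.55 bits per character for this chain
print(entropy(pi))   # about 0.92 bits: the unconditioned entropy, for comparison
```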

296.3 Page 19: Compression Outline
- Introduction: lossy vs. lossless, benchmarks, ...
- Information Theory: entropy, etc.
- Probability Coding: prefix codes and their relationship to entropy, Huffman codes, arithmetic codes
- Applications of Probability Coding: PPM + others
- Lempel-Ziv Algorithms: LZ77, gzip, compress, ...
- Other Lossless Algorithms: Burrows-Wheeler
- Lossy algorithms for images: JPEG, MPEG, ...
- Compressing graphs and meshes: BBK

296.3 Page 20: Assumptions and Definitions
- Communication (or a file) is broken up into pieces called messages.
- Each message comes from a message set S = {s1, ..., sn} with a probability distribution p(s). Probabilities must sum to 1; the set can be infinite.
- Code C(s): a mapping from a message set to codewords, each of which is a string of bits.
- Message sequence: a sequence of messages.
- Note: adjacent messages might be of different types and come from different probability distributions.

296.3 Page 21: Discrete or Blended
We will consider two types of coding:
- Discrete: each message is coded as a fixed set of bits (Huffman coding, Shannon-Fano coding)

- Blended: bits can be shared among messages (arithmetic coding)

(Figure: in the discrete case the bit string splits into separate codewords, one per message 1, 2, 3, 4; in the blended case a single bit string encodes messages 1, 2, 3, and 4 together.)
Notes (Page 21): Warning: different books and authors use different terminology; here is what I will be using.

296.3 Page 22: Uniquely Decodable Codes
A variable length code assigns a bit string (codeword) of variable length to every message value, e.g. a = 1, b = 01, c = 101, d = 011. What if you get the sequence of bits 1011? Is it aba, ca, or ad? A uniquely decodable code is a variable length code in which bit strings can always be uniquely decomposed into codewords.

296.3 Page 23: Prefix Codes
A prefix code is a variable length code in which no codeword is a prefix of another codeword, e.g. a = 0, b = 110, c = 111, d = 10. All prefix codes are uniquely decodable.

296.3 Page 24: Prefix Codes as a Tree
A prefix code can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges:

a = 0, b = 110, c = 111, d = 10

(Tree for the code above: 0/1 edge labels, with leaf a at depth 1, leaf d at depth 2, and leaves b and c at depth 3.)
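As a quick illustration of Pages 22-24, here is a minimal Python sketch (helper names are mine) that encodes and decodes with the prefix code a = 0, b = 110, c = 111, d = 10; because no codeword is a prefix of another, the first codeword matched while scanning left to right is always the right one.

```python
# Prefix code from Pages 23-24
code = {"a": "0", "b": "110", "c": "111", "d": "10"}
decode_table = {w: m for m, w in code.items()}

def encode(msg):
    return "".join(code[m] for m in msg)

def decode(bits):
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in decode_table:   # prefix property: the first full match is the codeword
            out.append(decode_table[cur])
            cur = ""
    assert cur == "", "bit string ended in the middle of a codeword"
    return "".join(out)

print(encode("badc"))        # 110010111
print(decode("110010111"))   # badc
```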

296.3 Page 25: Some Prefix Codes for Integers
(The binary/unary/gamma code table for this slide appears at the end of the transcript.) Many other fixed prefix codes exist: Golomb, phased-binary, subexponential, ...
Notes (Page 25): Later it is going to be very important to come up with coding schemes for the integers which give shorter codes to smaller integers. Imagine, for example, wanting to specify small errors from expectation.

296.3 Page 26: Average Length
For a code C with associated probabilities p(c) the average length is defined as

$l_a(C) = \sum_{c \in C} p(c)\, l(c)$

We say that a prefix code C is optimal if for all prefix codes C', $l_a(C) \le l_a(C')$.

l(c) = length of the codeword c (a positive integer)

Notes (Page 26): What does average length mean? If we send a very large number of messages from the distribution, it is the average length (in bits) of those messages.

296.3 Page 27: Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,

$H(S) \le l_a(C).$

Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,

$l_a(C) \le H(S) + 1.$

Notes (Page 27): The upper bound can actually be made tighter if the largest probability is small. We are going to prove the upper bound.

296.3 Page 28: Kraft-McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable binary code C,

$\sum_{c \in C} 2^{-l(c)} \le 1.$

Also, for any set of lengths L such that

$\sum_{l \in L} 2^{-l} \le 1,$

there is a prefix code C such that

$l(c_i) = l_i \quad (i = 1, \ldots, |L|).$

296.3 Page 29: Proof of the Upper Bound (Part 1)
Assign each message a length

$l(s) = \lceil \log_2(1/p(s)) \rceil.$

We then have

$\sum_{s \in S} 2^{-l(s)} = \sum_{s \in S} 2^{-\lceil \log_2(1/p(s)) \rceil} \le \sum_{s \in S} 2^{-\log_2(1/p(s))} = \sum_{s \in S} p(s) = 1.$

So by the Kraft-McMillan inequality there is a prefix code with lengths l(s).
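As a quick numeric sanity check of Part 1 (a sketch with my own names, not part of the lecture): for any distribution, the lengths $l(s) = \lceil \log_2(1/p(s)) \rceil$ satisfy the Kraft-McMillan inequality, and the resulting average length stays within one bit of the entropy, which is what Part 2 proves.

```python
from math import ceil, log2

def check_upper_bound(dist):
    # lengths l(s) = ceil(log2(1/p(s))), as assigned in Part 1
    lengths = {s: ceil(log2(1.0 / p)) for s, p in dist.items()}
    kraft = sum(2.0 ** -l for l in lengths.values())
    avg_len = sum(p * lengths[s] for s, p in dist.items())
    H = sum(p * log2(1.0 / p) for p in dist.values())
    assert kraft <= 1.0 + 1e-12          # Kraft-McMillan holds, so a prefix code exists
    assert avg_len <= H + 1.0 + 1e-12    # average length is within one bit of entropy
    return kraft, avg_len, H

print(check_upper_bound({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))
# roughly (0.8125, 2.1, 1.76)
```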

296.3 Page 30: Proof of the Upper Bound (Part 2)

Now we can calculate the average length given l(s):

$l_a(C) = \sum_{s \in S} p(s)\, l(s) = \sum_{s \in S} p(s) \lceil \log_2(1/p(s)) \rceil \le \sum_{s \in S} p(s) \left(1 + \log_2 \frac{1}{p(s)}\right) = 1 + H(S),$

and we are done.

296.3 Page 31: Another Property of Optimal Codes
Theorem: If C is an optimal prefix code for the probabilities {p1, ..., pn}, then pi > pj implies l(ci) ≤ l(cj).
Proof (by contradiction): Assume l(ci) > l(cj). Consider switching codewords ci and cj. If la is the average length of the original code, the length of the new code is

$l_a' = l_a + p_j\,(l(c_i) - l(c_j)) + p_i\,(l(c_j) - l(c_i)) = l_a - (p_i - p_j)(l(c_i) - l(c_j)) < l_a.$

This is a contradiction, since C (with average length la) was assumed to be optimal.

Notes (Page 31): This is important in the proof of optimality of Huffman codes.

296.3 Page 32: Huffman Codes
Invented by Huffman as a class assignment in 1950. Used in many, if not most, compression algorithms: gzip, bzip, jpeg (as an option), fax compression, ...
Properties:
- Generates optimal prefix codes
- Cheap to generate codes
- Cheap to encode and decode
- la = H if probabilities are powers of 2

Notes (Page 32): Really good example of how great theory meets great practice. Also an example of how it takes many years to make it into practice, and how the basic method is improved over the years. Should give you hope that you can invent something great as a student. There are other codes: Shannon-Fano (in fact, Fano was the instructor of the course). Shannon-Fano codes guarantee that $l(s) \le \lceil \log_2(1/p(s)) \rceil$ for every message; Huffman codes do not.

296.3 Page 33: Huffman Codes
Huffman Algorithm:
- Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s).
- Repeat until one tree is left: select the two trees with minimum weight roots p1 and p2, and join them into a single tree by adding a root with weight p1 + p2.

296.3 Page 34: Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5
Step 1: join a(.1) and b(.2) into a tree of weight .3.
Step 2: join (.3) and c(.2) into a tree of weight .5.
Step 3: join (.5) and d(.5) into the final tree of weight 1.0.
Resulting code: a = 000, b = 001, c = 01, d = 1.

296.3 Page 35: Encoding and Decoding
Encoding: start at the leaf of the Huffman tree and follow the path to the root; reverse the order of the bits and send.
Decoding: start at the root of the Huffman tree and take a branch for each bit received; when at a leaf, output the message and return to the root.
There are even faster methods that can process 8 or 32 bits at a time.
Notes (Page 35): Any ideas how the faster methods work? It was these faster methods that eventually made Huffman codes really practical.

296.3 Page 36: Huffman Codes are Optimal
Theorem: The Huffman algorithm generates an optimal prefix code.
Proof outline: induction on the number of messages n (base case n = 2, or n = 1?).
- Consider a message set S with n+1 messages.
- The two least probable messages m1 and m2 of S are neighbors in the Huffman tree. Replace them with a single message of probability p(m1) + p(m2), making a new message set S'.
- Any optimal prefix code for S can be transformed so that m1 and m2 are neighbors, so the cost of an optimal code for S is the cost of an optimal code for S' plus p(m1) + p(m2).
- The Huffman code for S' is optimal by induction.
- The cost of the Huffman code for S is the cost of the optimal code for S' plus p(m1) + p(m2), which matches the cost of an optimal code for S.
Notes (Page 36): The proof is in the handout. This is a neat proof and you should definitely go through it.
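Returning to the construction on Pages 33-34, here is a minimal Python sketch of the Huffman algorithm using a binary heap (naming is mine; tie-breaking may assign different codewords than the slide, but the codeword lengths, and hence the average length, come out the same).

```python
import heapq

def huffman_codes(probs):
    """Build a Huffman code for a dict {message: probability}."""
    # Forest of single-vertex trees; entries are (weight, tiebreak, tree).
    heap = [(p, i, m) for i, (m, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        # Select the two trees with minimum-weight roots and join them.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, tick, (t1, t2)))
        tick += 1
    # Read codewords off the tree: 0 on one branch, 1 on the other.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # degenerate single-message case
    walk(heap[0][2], "")
    return codes

print(huffman_codes({"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))
# codeword lengths: a and b get 3 bits, c gets 2, d gets 1, matching Page 34
```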

296.3 Page 37: Problem with Huffman Coding
Consider a message with probability .999. The self information of this message is

$\log_2 \frac{1}{0.999} \approx 0.00144$ bits.

If we were to send 1000 such messages we might hope to use about 1000 × 0.00144 ≈ 1.44 bits. Using Huffman codes we require at least one bit per message, so we would require 1000 bits.

Notes (Page 37): It might seem like we cannot do better than this. Assuming Huffman codes, how could we improve this? Assuming there is only one other possible message (with probability .001), what would the expected length be for sending 1000 messages picked from this distribution? (About 10 bits + 1.44 bits = 11.44 bits.)

296.3 Page 38: Arithmetic Coding: Introduction
Allows blending of bits in a message sequence. Only requires 3 bits for the example above. Can bound the total bits required based on the sum of the self information:

$l \le 2 + \sum_{i=1}^{n} s_i$

Used in DMM, JPEG/MPEG (as an option), zip (PPMd). More expensive than Huffman coding, but the integer implementation is not too bad.

296.3 Page 39: Arithmetic Coding: Message Intervals
Assign each probability distribution to an interval range from 0 (inclusive) to 1 (exclusive). E.g., with p(a) = .2, p(b) = .5, p(c) = .3 the unit interval is split at 0.0, 0.2, 0.7, 1.0, giving

$f(a) = 0.0, \quad f(b) = 0.2, \quad f(c) = 0.7,$

where f(m) is the cumulative probability of the messages preceding m.

The interval for a particular message will be called the message interval (e.g. for b the interval is [.2,.7)).
Notes (Page 39): We are going to make a distinction between message, sequence, and code intervals. It is important to keep them straight. Also, I will give different meanings to p(i) and pi, and it is important to keep these straight.

296.3 Page 40: Arithmetic Coding: Sequence Intervals
Code a message sequence by composing intervals. For example, bac:

Coding b narrows [0,1) to [.2,.7); coding a then narrows [.2,.7) to [.2,.3); coding c narrows [.2,.3) to [.27,.3). The final interval is [.27,.3); we call this the sequence interval.
Notes (Page 40): If the notation gets confusing, the intuition is still clear, as shown by this example.

296.3 Page 41: Arithmetic Coding: Sequence Intervals
To code a sequence of messages with probabilities $p_i$ (i = 1..n), use the following recurrences:

$l_1 = f(m_1), \quad s_1 = p(m_1);$
$l_i = l_{i-1} + s_{i-1} f(m_i), \quad s_i = s_{i-1} p(m_i) \quad (i > 1),$

where $l_i$ is the bottom of the interval and $s_i$ is the size of the interval.

Each message narrows the interval by a factor of pi.

Final interval size:

$s_n = \prod_{i=1}^{n} p(m_i)$
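A minimal Python sketch of the recurrences above (helper names are mine), encoding the sequence bac from Page 40 into its sequence interval, plus a decoder matching the example on Page 44 below.

```python
# Distribution from Page 39: message -> (p, f), with f the cumulative probability
dist = {"a": (0.2, 0.0), "b": (0.5, 0.2), "c": (0.3, 0.7)}

def sequence_interval(msg):
    l, s = 0.0, 1.0                  # start from the whole unit interval
    for m in msg:
        p, f = dist[m]
        l = l + s * f                # l_i = l_{i-1} + s_{i-1} * f(m_i)
        s = s * p                    # s_i = s_{i-1} * p(m_i)
    return l, s                      # bottom and size of the sequence interval

def decode(x, n):
    out = []
    for _ in range(n):
        # find the message interval containing x, then rescale x into it
        for m, (p, f) in dist.items():
            if f <= x < f + p:
                out.append(m)
                x = (x - f) / p
                break
    return "".join(out)

print(sequence_interval("bac"))   # about (0.27, 0.03): the interval [.27, .30)
print(decode(0.49, 3))            # bbc, as in the decoding example on Page 44
```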

Notes (Page 41): How does the interval size relate to the probability of that message sequence?

296.3 Page 42: Warning
Three types of interval:
- message interval: the interval for a single message
- sequence interval: the composition of message intervals
- code interval: the interval for a specific code used to represent a sequence interval (discussed later)

296.3 Page 43: Uniquely Defining an Interval
Important property: the sequence intervals for distinct message sequences of length n will never overlap. Therefore, specifying any number in the final interval uniquely determines the sequence. Decoding is similar to encoding, but on each step we need to determine what the message value is and then reduce the interval.
Notes (Page 43): Why will they never overlap? How should we pick a number in the interval, and how should we represent the number?

296.3 Page 44: Arithmetic Coding: Decoding Example
Decoding the number .49, knowing the message is of length 3:

.49 lies in b's message interval [.2,.7); within [.2,.7), .49 lies in b's sub-interval [.3,.55); within [.3,.55), .49 lies in c's sub-interval [.475,.55). The message is bbc.
Notes (Page 44): Basically we are running the coding example, but instead of paying attention to the symbol names, we pay attention to the number.

296.3 Page 45: Representing Fractions
Binary fractional representation:

$.b_1 b_2 b_3 \ldots = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + \cdots$ (e.g. $.101 = 1/2 + 1/8 = .625$)

So how about just using the smallest binary fractional representation in the sequence interval? E.g. [0,.33) = .0, [.33,.66) = .1, [.66,1) = .11. But what if you receive a 1? Should we wait for another 1?

Notes (Page 45): It is not a prefix code.

296.3 Page 46: Representing an Interval
We can view a binary fractional number as an interval by considering all of its completions, e.g. .11 represents [.11000..., .11111...] = [.75, 1.0), and .101 represents [.625, .75).

We will call this the code interval.

Notes (Page 46): We have now mentioned all three kinds of intervals: message, sequence, and code. Can you give an intuition for the lemma?

296.3 Page 47: Code Intervals: Example
Try the code [0,.33) = .0, [.33,.66) = .1, [.66,1) = .11.

(Figure: on the unit interval, .0 has code interval [0,.5), .1 has [.5,1), and .11 has [.75,1), so the intervals for .1 and .11 overlap.) Note that if code intervals overlap then one code is a prefix of the other.
Lemma: If a set of code intervals do not overlap then the corresponding codes form a prefix code.
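A small sketch (helpers mine) that computes code intervals and checks for overlap; for the Page 47 example the intervals for .1 and .11 overlap, which is exactly the prefix violation the lemma is about.

```python
def code_interval(bits):
    # The completions of a k-bit fraction .b1...bk cover [b, b + 2^-k)
    lo = int(bits, 2) / 2 ** len(bits)
    return lo, lo + 2 ** -len(bits)

def intervals_overlap(codes):
    ivs = sorted(code_interval(b) for b in codes)
    return any(hi > next_lo for (_, hi), (next_lo, _) in zip(ivs, ivs[1:]))

print(code_interval("11"))                  # (0.75, 1.0)
print(intervals_overlap(["0", "1", "11"]))  # True: not a prefix code (1 is a prefix of 11)
print(intervals_overlap(["0", "10", "11"])) # False: a prefix code
```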

296.3 Page 48: Selecting the Code Interval
To find a prefix code, find a binary fractional number whose code interval is contained in the sequence interval.

We can use the fraction l + s/2 truncated to $1 + \lceil -\log_2 s \rceil$ bits.
(Figure: sequence interval [.61,.79); the code .101 has code interval [.625,.75), which is contained in it.)

Notes (Page 48): Recall that s is the size of the interval. Truncating to $\lceil -\log_2(s/2) \rceil$ bits gives a code interval of size at most s/2, which fits.

296.3 Page 49: Selecting a Code Interval: Example
[0,.33) = .001, [.33,.66) = .100, [.66,1) = .110

E.g., for [.33,.66): l = .33, s = .33, so l + s/2 = .5 = .1000..., which truncated to $1 + \lceil -\log_2 s \rceil = 3$ bits is .100. Is this the best we can do for [0,.33)?

(Figure: the code interval of .100, i.e. [.5,.625), sits inside [.33,.66).)
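To make Pages 48-49 concrete, here is a minimal sketch (function name mine) that selects the code for a sequence interval by truncating l + s/2; using exact fractions, the three intervals from Page 49 come out as .001, .100 and .110, matching the slide.

```python
from fractions import Fraction
from math import ceil, log2

def select_code(l, s):
    """Code bits for [l, l+s): l + s/2 truncated to 1 + ceil(-log2 s) bits."""
    k = 1 + ceil(-log2(s))
    mid = Fraction(l) + Fraction(s) / 2
    return format(int(mid * 2 ** k), "0{}b".format(k))   # truncate to k bits

print(select_code(Fraction(0, 1), Fraction(1, 3)))   # 001
print(select_code(Fraction(1, 3), Fraction(1, 3)))   # 100
print(select_code(Fraction(2, 3), Fraction(1, 3)))   # 110
```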

296.3 Page 50: RealArith Encoding and Decoding
RealArithEncode:
- Determine l and s using the original recurrences.
- Code using l + s/2 truncated to $1 + \lceil -\log_2 s \rceil$ bits.
RealArithDecode:
- Read bits as needed so the code interval falls within a message interval, and then narrow the sequence interval.
- Repeat until n messages have been decoded.
Notes (Page 50): Note that this requires that you know the number of messages n in the sequence. This needs to be predetermined or sent as a header.

296.3 Page 51: Bound on Length
Theorem: For n messages with self information {s1, ..., sn}, RealArithEncode will generate at most $2 + \sum_{i=1}^{n} s_i$ bits.
Proof:

$1 + \lceil -\log_2 s \rceil = 1 + \left\lceil -\log_2 \prod_{i=1}^{n} p(m_i) \right\rceil = 1 + \left\lceil \sum_{i=1}^{n} s_i \right\rceil \le 2 + \sum_{i=1}^{n} s_i$

Notes (Page 51): The $1 + \lceil -\log_2 s \rceil$ is from the previous slide, based on truncating to $\lceil -\log_2(s/2) \rceil$ bits. Note that s is overloaded (size of the sequence interval, and self information of a message); I apologize.

296.3 Page 52: Integer Arithmetic Coding
The problem with RealArithCode is that operations on arbitrary precision real numbers are expensive.

Key ideas of the integer version:
- Keep integers in the range [0..R) where R = 2^k.
- Use rounding to generate an integer sequence interval.
- Whenever the sequence interval falls into the top, bottom, or middle half, expand the interval by a factor of 2.
This integer algorithm is an approximation of the real algorithm.

Notes (Page 52): s gets very small and requires many bits of precision, as many as the size of the final code. If the input probabilities are binary floats, then we can use extended floats; if the input probabilities are rationals, then we can use arbitrary precision rationals.

296.3 Page 53: Integer Arithmetic Coding: The Probability Distribution as Integers
- Probabilities as counts, e.g. c(a) = 11, c(b) = 7, c(c) = 30.
- T is the sum of the counts, e.g. T = 48 (11 + 7 + 30).
- Partial sums f as before, e.g. f(a) = 0, f(b) = 11, f(c) = 18.
- Require that R > 4T so that probabilities do not get rounded to zero (here T = 48, R = 256).

296.3 Page 54: Integer Arithmetic (Contracting)
$l_0 = 0, \quad u_0 = R - 1$

(Figure: the integer range [0, R = 256) with the current l and u marked.)
Notes (Page 54): The u_i and l_i do not overlap.

296.3 Page 55: Integer Arithmetic (Scaling)
- If l ≥ R/2 (in top half): output 1 followed by m 0s; set m = 0; scale the message interval by expanding by 2.
- If u < R/2 (in bottom half): output 0 followed by m 1s; set m = 0; scale the message interval by expanding by 2.
- If l ≥ R/4 and u < 3R/4 (in middle half): increment m; scale the message interval by expanding by 2.
Notes (Page 55): The hard part is if you keep narrowing down on the middle.
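Below is a Python sketch of the integer encoder on Pages 52-55, using the Page 53 counts (function and variable names are mine). The exact integer contraction formula for Page 54 is not in the transcript, so the rounding used here, $l_i = l_{i-1} + \lfloor (u_{i-1}-l_{i-1}+1)\,f(m_i)/T \rfloor$ with the matching upper end, and the final flush step are standard choices and should be read as assumptions rather than the lecture's exact algorithm.

```python
# Integer arithmetic encoder sketch for Pages 52-55.
R = 256                               # R = 2^k, with R > 4T
counts = {"a": 11, "b": 7, "c": 30}   # counts from Page 53
T = sum(counts.values())              # 48
f, run = {}, 0                        # partial sums: f(a)=0, f(b)=11, f(c)=18
for m, c in counts.items():
    f[m], run = run, run + c

def encode(msg):
    l, u = 0, R - 1                   # l0 = 0, u0 = R - 1 (Page 54)
    out, pending = [], 0              # pending plays the role of "m" on Page 55

    def emit(bit):
        nonlocal pending
        out.append(bit)
        out.extend([1 - bit] * pending)   # the bit, followed by m opposite bits
        pending = 0

    for m in msg:
        size = u - l + 1
        # contract to the message's sub-interval (integer rounding: an assumption)
        u = l + size * (f[m] + counts[m]) // T - 1
        l = l + size * f[m] // T
        # scaling rules from Page 55
        while True:
            if u < R // 2:                          # bottom half
                emit(0); l, u = 2 * l, 2 * u + 1
            elif l >= R // 2:                       # top half
                emit(1); l, u = 2 * l - R, 2 * u - R + 1
            elif l >= R // 4 and u < 3 * R // 4:    # middle half: defer the bit
                pending += 1
                l, u = 2 * l - R // 2, 2 * u - R // 2 + 1
            else:
                break
    # flush enough bits to pin down the final interval (one common convention)
    pending += 1
    emit(0 if l < R // 4 else 1)
    return "".join(map(str, out))

print(encode("cab"))   # prints a short bit string, e.g. 0110101
```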

Table (Page 10: Comparison of Algorithms)
Program | Algorithm | Time (compress + decompress) | BPC  | Score
RK      | LZ + PPM  | 111 + 115                    | 1.79 | 430
BOA     | PPM Var.  | 94 + 97                      | 1.91 | 407
PPMD    | PPM       | 11 + 20                      | 2.07 | 265
IMP     | BW        | 10 + 3                       | 2.14 | 254
BZIP    | BW        | 20 + 6                       | 2.19 | 273
GZIP    | LZ77 Var. | 19 + 5                       | 2.59 | 318
LZ77    | LZ77      | ?                            | 3.94 | ?
(BPC = bits per character.)

Table (Page 18: Shannon's experiment, number of guesses required for the right answer)
# of guesses | 1   | 2   | 3   | 4   | 5   | > 5
Probability  | .79 | .08 | .03 | .02 | .02 | .05

Table (Page 25: Some Prefix Codes for Integers)
n | Binary | Unary  | Gamma
1 | ..001  | 0      | 0|
2 | ..010  | 10     | 10|0
3 | ..011  | 110    | 10|1
4 | ..100  | 1110   | 110|00
5 | ..101  | 11110  | 110|01
6 | ..110  | 111110 | 110|10
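As an illustration of the table above, here is a minimal Python sketch (helper names mine) of the unary and gamma codes in the layout the table uses: gamma sends the number of binary digits of n in unary, followed by the low-order bits of n (the '|' is just the table's visual separator).

```python
def unary(n):
    # n >= 1, coded as (n - 1) ones followed by a zero, as in the Unary column
    return "1" * (n - 1) + "0"

def gamma(n):
    # unary code for (floor(log2 n) + 1), then the low-order bits of n
    k = n.bit_length() - 1        # floor(log2 n)
    low = bin(n)[3:]              # n with its leading 1 bit removed (k bits)
    return unary(k + 1) + "|" + low

for n in range(1, 7):
    print(n, unary(n), gamma(n))
# reproduces the Unary and Gamma columns: 1 -> 0, 0| ... 6 -> 111110, 110|10
```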