Data Compression
lecture1
References
• Khalid Sayood, Introduction to Data Compression, 2nd ed., Morgan Kaufmann, 2000.
• Mark Nelson and Jean-Loup Gailly, The Data Compression Book, 2nd ed., Hungry Minds Inc., 1995.
• Michael Barnsley, Fractals Everywhere, 2nd ed., Academic Press, 1994.
• T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression, Advanced Reference Series, Englewood Cliffs, NJ: Prentice Hall.
Acknowledgments: In preparing these notes I have also used course notes from lectures on data compression given by
• John Kieffer, University of Minnesota,
• Amar Mukherjee, University of Central Florida,
• Peter Bro Miltersen, University of Aarhus.
What is data compression?
• In data compression, one wishes to give a compact representation of data generated by a data source.
• Depending upon the source of the data, the data could be of various types, such as
  • text data,
  • image data,
  • speech data,
  • audio data,
  • video data,
  • etc.
• Data compression is performed in order to more easily store the data or to more easily transmit the data.
Block diagram illustrating a general data compression system:
The source data is the data generated by the data source, and the compressed data is the compact representation of the source data. The data compression system consists of an encoder and a decoder. The encoder converts the source data into the compressed data, and the decoder attempts to reconstruct the source data from the compressed data. The reconstructed data generated by the decoder either coincides with the source data or is perceptually indistinguishable from it.
source → source data → source encoder → compressed data → source decoder → reconstructed data
Approximate Bit Rates for Uncompressed Sources (I)
• Telephony (bandwidth = 3.4 kHz): 8,000 samples/sec × 12 bits/sample = 96 kb/sec
• Wideband speech (tele. audio, bandwidth ≈ 7 kHz): 16,000 samples/sec × 14 bits/sample = 224 kb/sec
• Compact disc audio (bandwidth ≈ 20 kHz): 44,100 samples/sec × 16 bits/sample × 2 channels = 1.41 Mb/sec
• Images: 512 × 512 pixels × 24 bits/pixel = 6.3 Mb/image
Approximate Bit Rates for Uncompressed Sources (II)
• Video: 640 × 480 color pixels × 24 bits/pixel × 30 images/sec = 221 Mb/sec
• CIF (videoconferencing, 360 × 288): 75 Mb/sec
• CCIR (TV, 720 × 576): 300 Mb/sec
• HDTV (1280 × 720 pixels × 24 bits/pixel × 60 images/sec): 1.3 Gb/sec
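The figures in these tables are simple products of sampling rate, sample size, and channel count. As a minimal sketch (the helper name bit_rate and the dictionary keys are illustrative, not part of the lecture), the entries whose factors are fully spelled out can be recomputed like this:

```python
def bit_rate(samples_per_sec, bits_per_sample, channels=1):
    """Uncompressed bit rate in bits per second."""
    return samples_per_sec * bits_per_sample * channels


# Only rows whose factors are given explicitly in the tables are used here.
rates = {
    "telephony": bit_rate(8_000, 12),
    "wideband speech": bit_rate(16_000, 14),
    "compact disc audio": bit_rate(44_100, 16, channels=2),
    # For video, a "sample" is one pixel: pixels per frame times frames per second.
    "video 640x480": bit_rate(640 * 480 * 30, 24),
}

# A still image has a size rather than a rate.
image_bits = 512 * 512 * 24

for name, bps in rates.items():
    print(f"{name:20s} {bps / 1e6:10.3f} Mb/sec")
print(f"{'image 512x512':20s} {image_bits / 1e6:10.3f} Mb/image")
```

Dividing by 10^6 reproduces the rounded values quoted in the tables.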
An early example of data compression is Morse code, developed by Samuel Morse in the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred more often than others. In order to reduce the average time required to send a message, he assigned shorter sequences to letters that occur more frequently, such as e (.) and a (.-), and longer sequences to letters that occur less frequently, such as q (--.-) and j (.---). This idea of using shorter codes for more frequently occurring characters is used in Huffman coding.
Morse Code (1835)
[Figure: Morse code chart for the letters a to z, with the words "and for the" encoded as an example. Timing legend: 1 time unit, 3 time units, 7 time units.]
Morse Code, 2
[Figure: Morse code chart for the digits 0 to 9.]
Per character (on average):
• 8.5 time units for English text,
• 11 time units if characters are equally likely,
• 17 time units for numbers.
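The two averages that do not depend on letter frequencies can be checked with a short script. This is a sketch under stated assumptions: it uses the International Morse table and the usual timing convention (dot = 1 unit, dash = 3 units, 1 unit between the elements of a character, 3 units between characters), none of which is spelled out in the slides.

```python
# Assumption: International Morse code table (not given in the slides).
MORSE_LETTERS = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}
MORSE_DIGITS = {
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}

# Assumption: standard timing, measured in time units.
DOT, DASH, ELEMENT_GAP, CHAR_GAP = 1, 3, 1, 3


def char_time(code):
    """Time to send one character, including the gap that follows it."""
    marks = sum(DOT if element == "." else DASH for element in code)
    gaps = ELEMENT_GAP * (len(code) - 1)
    return marks + gaps + CHAR_GAP


def average_time(table):
    """Average time per character when all characters are equally likely."""
    return sum(char_time(code) for code in table.values()) / len(table)


avg_letters = average_time(MORSE_LETTERS)
avg_digits = average_time(MORSE_DIGITS)
print(f"letters, equally likely: {avg_letters:.2f} time units")
print(f"digits:                  {avg_digits:.2f} time units")
```

Under these assumptions the averages come out at about 11.2 time units for equally likely letters and 17 time units for digits. The 8.5 figure for English text would additionally need a table of letter frequencies, which is not included here.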
Where Morse code uses the frequency of occurrence of single characters, a widely used form of Braille code, which was also developed in the mid-19th century, uses the frequency of occurrence of words to provide compression. In Braille coding, 2 × 3 arrays of dots are used to represent text. Different letters can be represented depending on whether the dots are raised or flat. In Grade 1 Braille each array of six dots represents a single character. However, given six dots with two positions for each dot, we can obtain 2^6 = 64 different combinations. If we use 26 of these for the different letters, we have 38 combinations left. In Grade 2 Braille, some of these combinations are used to represent words that occur frequently, such as “and” and “for”. One of these combinations is used as a special symbol indicating that the symbol that follows is a word and not a character, thus allowing a large number of words to be represented by two arrays of dots. These modifications result in an average reduction (or compression) of about 20%.
Grade 1 Braille (1820’s)
[Figure: Grade 1 Braille cells for the letters a to z.]
Example: "and for the mother and father" (24 letters; 24 symbols in encoding)
Grade 2 Braille Example
Example: "and for the mother and father" (24 letters; 8 symbols in encoding)
[Figure: Grade 2 Braille encoding of the phrase.]
67% compression (20% is more typical)
Measure of performance
We need to evaluate how good a potential data compression system is. This can be done with a compression ratio. By a compression ratio of r to 1, we mean

r = (size of source data in bits) / (size of compressed data in bits)
Thus a compression ratio of 2 to 1 means that the compressed data is half the size of the source data. The higher the compression ratio, the better the compression system is.
EXAMPLE 1. This example illustrates how data compression can assist one in storing data or in transmitting data. Suppose the data source generates an arbitrary 512 × 512 digital image consisting of 256 colors. Each color is represented by an intensity from the set {0, 1, 2, …, 255}. Mathematically, this image is a 512 × 512 matrix, each of whose elements comes from the set {0, 1, 2, …, 255}. (These elements are called pixel elements.) The intensity of each pixel element can be represented using 8 bits. Thus, the size of the source data in bits is 8 × 512 × 512 = 2^21, which is about 2.1 megabits (256 kilobytes). A one-gigabyte (10^9-byte) hard disk could thus store only about 3,800 of these images without compression. Suppose, however, that one can compress each such image at a compression ratio of 8 to 1. Then one can store about 30,500 images on this hard disk! Suppose now that one wants to transmit an uncompressed image over a telephone channel that can transmit 30,000 bits per second. Computing 2^21/30000, one sees that it would take about 70 seconds to do this. With the 8 to 1 compression ratio, the compressed image could be transmitted in under 9 seconds!
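The arithmetic of this example can be traced step by step. The sketch below is only an illustration; the constant names are made up, and it assumes that one gigabyte means 10^9 bytes, i.e. 8 × 10^9 bits.

```python
WIDTH, HEIGHT = 512, 512
BITS_PER_PIXEL = 8          # 256 intensity levels
COMPRESSION_RATIO = 8       # "8 to 1"
DISK_BITS = 8 * 10**9       # assumption: 1 gigabyte = 10^9 bytes
CHANNEL_BPS = 30_000        # telephone channel, bits per second

# Size of one image before and after compression.
source_bits = WIDTH * HEIGHT * BITS_PER_PIXEL
compressed_bits = source_bits // COMPRESSION_RATIO

# Storage: how many whole images fit on the disk.
images_uncompressed = DISK_BITS // source_bits
images_compressed = DISK_BITS // compressed_bits

# Transmission: seconds needed to send one image.
seconds_uncompressed = source_bits / CHANNEL_BPS
seconds_compressed = compressed_bits / CHANNEL_BPS

print(f"source size:        {source_bits} bits (2^21 = {2**21})")
print(f"images on disk:     {images_uncompressed} uncompressed, "
      f"{images_compressed} compressed")
print(f"transmission time:  {seconds_uncompressed:.1f} s uncompressed, "
      f"{seconds_compressed:.1f} s compressed")
```

Changing COMPRESSION_RATIO shows directly how storage capacity and transmission time scale with r.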
Statistical structure is being used to provide compression in Morse code and Braille code. Different types of data contain many other kinds of structure that can be exploited for compression.
FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS
Original: This is a dog that belongs to my friend
Reduced: This dog belongs my friend

measure of redundancy = (number of words removed) / (total number of words) = 4/9
Modeling and Encoding
[Figure: Source → encoder → channel → decoder → Result; the encoder and the decoder each use a Model.]
The development of data compression algorithms for a variety of data can be divided into two phases. The first phase is usually referred to as modeling. In this phase we try to extract information about any redundancy that exists in the data and describe the redundancy in the form of a model. The second phase is called coding. A description of the model and a description of how the data differ from the model are encoded, generally using a binary alphabet. The difference between the data and the model is often referred to as the residual.
Example 1
9 11 11 11 14 13 15 17 16 17 20 21
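We could represent each number using 5 bits, since the largest value is 21 (4 bits would do if the minimum is subtracted first, because the values span only 13 levels). We need 12 × 5 = 60 bits (or 12 × 4 = 48 bits) for the entire message.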
[Figure: plot of the sequence against its index, with the fitted trend line y = 1.0175x + 7.9697.]
Example 1, cont.
Model: xn = n + 8
Source: 9 11 11 11 14 13 15 17 16 17 20 21
Model: 9 10 11 12 13 14 15 16 17 18 19 20
Residual: 0 1 0 -1 1 -1 0 1 -1 -1 1 1
Only the model parameters and the residuals need to be transmitted.
Residuals can be encoded using 2 bits each: 12 × 2 = 24 bits.
There are savings as long as the model parameters are encoded in fewer than 60 − 24 = 36 bits (or 48 − 24 = 24 bits).
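The model-plus-residual idea of this example can be sketched in a few lines. The function and variable names are illustrative; the model x_n = n + 8 and the data are taken from the slide, with n counted from 1.

```python
import math

source = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]


def model(n):
    """Model from the slide: x_n = n + 8, with n starting at 1."""
    return n + 8


# Encoder side: residual = source value minus model prediction.
residuals = [x - model(n) for n, x in enumerate(source, start=1)]

# A fixed-length code needs enough bits for every level in the residual range.
levels = max(residuals) - min(residuals) + 1
bits_per_residual = math.ceil(math.log2(levels))
residual_bits = bits_per_residual * len(residuals)

# Decoder side: model prediction plus residual gives back the source exactly.
reconstructed = [model(n) + r for n, r in enumerate(residuals, start=1)]

print("residuals:", residuals)
print("bits per residual:", bits_per_residual, "total:", residual_bits)
print("lossless:", reconstructed == source)
```

The decoder can only invert the process because it evaluates the same model as the encoder, which is why the model parameters count toward the transmitted bits.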
Example 2
27 28 29 28 26 27 29 28 30 32 34 36 38
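We could represent each number using 6 bits, since the largest value is 38 (4 bits would do if the minimum is subtracted first, because the values span only 13 levels). We need 13 × 6 = 78 bits (or 13 × 4 = 52 bits) for the entire message.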
[Figure: plot of the sequence against its index.]
Example 2, cont.
Transmit first value, then successive differences.
Source: 27 28 29 28 26 27 29 28 30 32 34 36 38
Transmit: 27 1 1 -1 -2 1 2 -1 2 2 2 2 2
6 bits for the first number, then 3 bits for each difference value: 6 + 12 × 3 = 6 + 36 = 42 bits (as compared to 78 bits).
The encoder and decoder must also know the model being used!
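Transmitting the first value followed by successive differences can be sketched as follows. The function names diff_encode and diff_decode are illustrative; the bit widths (6 and 3) are the ones quoted on the slide.

```python
from itertools import accumulate

source = [27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38]


def diff_encode(values):
    """First value unchanged, then each value minus its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def diff_decode(transmitted):
    """Running sum of the transmitted values restores the source."""
    return list(accumulate(transmitted))


transmitted = diff_encode(source)

# Bit budget from the slide: 6 bits for the first value, 3 per difference.
FIRST_VALUE_BITS, DIFFERENCE_BITS = 6, 3
total_bits = FIRST_VALUE_BITS + DIFFERENCE_BITS * (len(transmitted) - 1)

print("transmit:", transmitted)
print("total bits:", total_bits)
print("lossless:", diff_decode(transmitted) == source)
```

Here the model is "each value equals the previous one", and the differences play the role of the residual.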
Example 3
a_barayaran_array_ran_far_faar_faaar_away
(Note: _ is a single space character)

The sequence uses 8 symbols (a, _, b, r, y, n, f, w). 3 bits per symbol are needed: 41 × 3 = 123 bits in total for the sequence.

Symbol   Frequency   Encoding
a        16          1
r        8           000
_        7           001
f        3           0100
y        3           0101
n        2           0111
b        1           01100
w        1           01101
Now require 16×1 + 8×3 + 7×3 + 3×4 + 3×4 + 2×4 + 1×5 + 1×5 = 103 bits, or 2.51 bits per symbol.
Compression ratio: 123/103, or 1.19:1
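The variable-length code in the table is a prefix code, so a bit string can be decoded symbol by symbol without separators. The sketch below (function names are illustrative) encodes the example sequence with the table's codewords, counts the bits directly, and decodes again.

```python
# Codewords copied from the table; "_" stands for the space character.
CODE = {
    "a": "1", "r": "000", "_": "001", "f": "0100",
    "y": "0101", "n": "0111", "b": "01100", "w": "01101",
}

text = "a_barayaran_array_ran_far_faar_faaar_away"


def encode(symbols):
    """Concatenate the codeword of every symbol."""
    return "".join(CODE[s] for s in symbols)


def decode(bits):
    """Read bits until they match a codeword, emit the symbol, repeat."""
    inverse = {codeword: symbol for symbol, codeword in CODE.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(decoded)


bits = encode(text)
fixed_length_bits = 3 * len(text)   # 3 bits per symbol for 8 symbols

print("symbols:", len(text))
print("fixed-length bits:", fixed_length_bits)
print("variable-length bits:", len(bits))
print(f"bits per symbol: {len(bits) / len(text):.2f}")
print(f"compression ratio: {fixed_length_bits / len(bits):.2f}:1")
print("lossless:", decode(bits) == text)
```

The greedy decoder works only because no codeword is a prefix of another; with a non-prefix code the first match could be wrong.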
Standards
With the increasing use of compression there has also been an increasing need for standards. Standards allow products developed by different vendors to communicate. Thus we can compress something with a product from one vendor and reconstruct it using a product from a different vendor. The international standards organizations have responded to this need, and a number of standards for various compression applications have been approved.
Standards
• Standards bodies: ITU, US ANSI Committee, TIA, ETSI, TIC, IEEE, CCITT
• Fax: Group 3 and Group 4, JBIG
• Still images: JPEG (0.25-2 bits/pixel), JPEG-2000
• Video: MPEG1 (1.5 Mb/s), MPEG2 (6-10 Mb/s), MPEG-2000
• High-quality audio for MPEG: 84/128/192 kb/s per channel
• There are at least 12 audio standards, including a few new standards used for cellular phones (GSM, IS-95, UMTS).
• Practically, there is no standard for text data. The .pdf format is image based; HTML is used in most web-based text interchanges. An emerging format that is gaining popularity is XML. Various compression algorithms are used: Huffman, arithmetic, compress, gzip. An emerging new algorithm is bzip2.
Acronyms
• ITU (formerly CCITT): International Telecommunication Union
• TIA: Telecommunications Industry Association
• ETSI: European Telecommunications Standards Institute
• NIST: National Institute of Standards and Technology
• TIC: Japanese Telecommunications Technology Committee
• ISO: International Organization for Standardization
• IEEE: Institute of Electrical and Electronics Engineers
• ACM: Association for Computing Machinery
• JBIG: Joint Bi-level Image Experts Group
• JPEG: Joint Photographic Experts Group
• MPEG: Moving Picture Experts Group
There are two varieties of data compression systems:
• lossless compression systems,
• lossy compression systems.
Compression and Reconstruction
[Figure: Source → Compression → Compressed → Reconstruction → Reconstructed.]
Lossless: Reconstructed is identical to Source.
Lossy: Reconstructed is different from Source.
Lossy compression usually achieves more compression, but gives up exactness.
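The difference between the two varieties can be seen in a small experiment. This is a sketch, not a method from the lecture: it uses Python's standard zlib module as the lossless compressor, and it makes the lossy path by discarding the low 4 bits of every byte before compressing, a crude quantization chosen purely for illustration. The test data is made up.

```python
import zlib

# Hypothetical source data: 1024 bytes with a simple repeating pattern.
source = bytes((i * 7) % 256 for i in range(1024))

# Lossless: compress, then reconstruct exactly.
compressed = zlib.compress(source)
reconstructed = zlib.decompress(compressed)

# Lossy (illustrative): quantize each byte to a multiple of 16 first.
# The discarded low bits cannot be recovered by the decoder.
quantized = bytes(b & 0xF0 for b in source)
lossy_compressed = zlib.compress(quantized)
lossy_reconstructed = zlib.decompress(lossy_compressed)

max_error = max(abs(a - b) for a, b in zip(source, lossy_reconstructed))

print("lossless identical to source:", reconstructed == source)
print("lossy identical to source:   ", lossy_reconstructed == source)
print("largest per-byte error in lossy reconstruction:", max_error)
print("compressed sizes (lossless, lossy):",
      len(compressed), len(lossy_compressed))
```

The lossy reconstruction differs from the source, but every byte stays within 15 of its original value, which is the kind of bounded inexactness a lossy system trades for a smaller representation.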