

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 39, NO. 2, MARCH 1993

A New Model of Numerical Computer Data and Its Application for Construction of Minimum-Redundancy Codes

Boris Y. Kavalerchik

Abstract - Numerical computer data in many applications contain numerous leading and trailing zeros, and this data feature has been neglected in common source models and related data compression methods. A new approach to source modeling, consisting in the description of this data feature by probabilities $p_{ij}$ of data values with $i$ leading and $j$ trailing zeros, is proposed. The entropy of the new source model is proved to be less than the entropy of the common ones. Based on this model a few codes of different computational complexity are proposed. Experiments on typical computer files show that the coding algorithms based on the new source and data model considerably outperform existing data compression algorithms.

Index Terms - Numerical data compression, source modeling, data model, minimum-redundancy codes.

I. INTRODUCTION

THE CURRENT DATA representations go back to the time of the appearance of the first computers. The choice of them was influenced by many factors, among which economy of expression played only a minor role. Therefore, computer data are considerably redundant, and they may be compressed. Many data processing applications require storage of large volumes of data. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. When the amount of data to be transmitted or stored is reduced, the effect is that of increasing the capacity of the communication channel or storage medium.

Data compression is traditionally considered from the point of view of data transmission, although it is of value in data storage as well. This paper is framed in the terminology of data transmission, even though compression and decompression of data files are in the main the same tasks as required in sending and receiving data over a communications channel.

Any data compression scheme may be viewed conveniently as consisting of two steps, viz., source modeling and coding. Excellent discussion of this two-step approach is due to J. Rissanen and G. G. Langdon, Jr. [1].

Many effective coding techniques are currently in use. Huffman variable-length coding [2] has the attractive feature of

Manuscript received April 22, 1991; revised June 15, 1992. This work was presented in part at the IEEE International Symposium on Information Theory, Budapest, Hungary, June 24-28, 1991.

The author was with the Byelorussian Research Institute for Management Information Systems, 8612 Karinets Street, Minsk, Belarus 220108. He is now at 33-07 Hamilton Road, Fair Lawn, NJ 07410.

IEEE Log Number 9203873.

having near-optimum compression for sequences of messages emitted by a discrete stationary memoryless source. The use of source extensions can make the average Huffman codeword length as close to the lower bound, given by the source entropy, as desired. The arithmetic coding technique [1], [3] achieves the same result without the need to block the source alphabet. Huffman and arithmetic codes as well as Shannon-Fano, Lempel-Ziv [4], [5] and others are widely used in computer systems, especially for personal computers and computer networks. Detailed surveys of data compression methods used in computer systems are given in [6]-[10].

Given the availability of effective coding methods, the crucial problem in computer data compression today is that of building adequate source models. Source modeling involves building a statistical source such that the message sequence to be compressed is its possible outcome. Source modeling consists of determination of certain structure and necessary statistical parameters of the source. Hence, the entropy, the average amount of information represented by a message (or a source letter in a message) and consequently the lower bound on the average message length after compression, is a function of the model used to produce that message.

Source models traditionally studied in communication theory are commonly used for computer data compression. Data compression problems arising in digital processing differ in one important respect from those studied in communication theory: the common source models have been created for other application areas and therefore they do not coincide completely with the features of computer data. In fact, an adequate model may be constructed only by taking into account typical redundancy in numerical computer data.

The main contribution of this paper is to construct a source model for numerical data, aimed at business and commercial applications of computer systems.

II. TYPES OF REDUNDANCY

Business and commercial data files are typically composed of formatted data records. Each data record is subdivided into data items (fields). The most popular formats of numerical data items are fixed-length, fixed-point binary, and binary-coded decimal representations. Let $f(n, m)$ denote a format of an $n$-digit number with $m$ digits in the fractional part. The decimal point is assumed, i.e., it does not actually exist in the real data representation.



As a rule, numerical computer data redundancy exists in one of four forms.

1) The range of the data values is much smaller than can be represented with their storage format. This kind of redundancy is typical for business and commercial databases. For example, many numerical data items of customer order files, inventory files, etc., contain zeros or small integers preponderantly.
2) One or a few data values occur with exceptionally high frequency. Mostly this value is zero.
3) Significant correlation exists between adjacent data values. This redundancy form is typical for sorted key values, telemetry applications, etc.
4) Integer values and values with a few significant digits in the fractional part occur considerably more frequently than values with all the significant digits in the fractional part. This kind of redundancy is typical for business and commercial applications. It occurs in data rounded off to a given relative precision.

The four forms of redundancy overlap to some extent. For example, if the most frequent data value equals zero, then the second redundancy form is reduced to the first one. Storing the differences between successive data values reduces the third form to the first one, too.

Therefore, the most conspicuous feature of numerical com- puter data is that they contain numerous leading and trailing zeros, the latter being usually in the fractional part. High frequency of the digit 0, as noted by many researchers, is a consequence of this feature. For example, in character frequency distribution for a typical commercial file given by J. Martin [11], the frequency of digit 0 is 16.3 times higher than the average frequency of any other digit. To construct an efficient code, however, one must clearly understand that, as compared with other digits, zero not only occurs more often in number representations but its high frequency is associated with leading and trailing zeros. Let us see whether the aforenamed feature of numerical data is utilized in source models and coding methods commonly used in computer systems.

III. SOURCE AND DATA MODELS

Numerical data in computer systems can be considered as arrays of $n$-digit numbers $d_n d_{n-1}\cdots d_2 d_1$, $d_i$ being a base-$t$ digit, $i = 1, 2, \ldots, n$. If the numbers describe a sequence of independent events, the most adequate source model is that of a discrete stationary memoryless source. The output of this common source is a sequence of letters (or messages), each being a selection from some fixed finite alphabet $\{a_i\}$, $i = 1, 2, \ldots, N$, which we refer to as a "primary" alphabet. The letters are taken to be random, statistically independent selections from the primary alphabet, the selections being made according to some fixed probability assignment $\{p_i\}$, $p_i = p(a_i)$, $i = 1, 2, \ldots, N$. Without loss of generality, the code (secondary) alphabet is assumed to be $\{0, 1\}$ throughout this paper. The source entropy is defined as $H = -\sum_{i=1}^{N} p_i \log p_i$. All logarithms throughout the paper are to the base 2. Given $N$, the entropy has its maximum value when $p_1 = p_2 = \cdots = p_N = 1/N$. A fundamental result [12] of information theory states that no decodable code exists for which the average codeword length $\bar{l} = \sum_{i=1}^{N} p_i l_i$ satisfies $\bar{l} < H$, and there exists at least one for which $\bar{l} < H + 1$, where $l_i$ denotes the length of the codeword representing message $a_i$. In our case the primary alphabet includes $N = t^n$ letters. But in any practical implementation complexity considerations place a very low limit on $N$. Therefore coding is usually applied to the digit sequences in the number designation, i.e., separate digits or some short blocks of digits are considered as source letters. In this case, it is useful to introduce a data model in order to statistically describe the manner in which the number is composed of digits.

The simplest data model is the one corresponding to a discrete stationary memoryless source: a number consists of $n$ independent digits (source letters), each taking value $i$ with assigned probability $p_i = p(i)$, $i = 0, 1, \ldots, t-1$. In what follows this source and the corresponding data model will be referred to as source A and model A, and the subindex A will indicate that the indexed variable relates to the source (model) A. For example, $H_A$ denotes the entropy calculated on the basis of the probabilities $\{p_i\}$.

The aforenamed data feature being taken into account, the probability of digit 0 is much greater than that of the other digits, $p_0 \gg p_i$, which are approximately equal. If only the probabilities (frequencies of occurrence) $p(i)$ are known (measured), no code exists with average codeword length per primary letter (digit) less than the entropy

$$H_A = -\sum_{i=0}^{t-1} p_i \log p_i. \tag{1}$$

The use of source extensions (letter blocking), arithmetic coding or any other code can only make the average codelength closer to the lower bound given by (1).
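As a small numerical illustration of the bound (1) (not part of the paper; function and variable names are illustrative), the following Python sketch estimates the digit probabilities $p_i$ of an array of fixed-length decimal items and evaluates $H_A$, the minimum average codelength per digit attainable by any scheme based on model A.

```python
from collections import Counter
from math import log2

def model_a_entropy(numbers, n_digits=10, base=10):
    """Estimate H_A = -sum p_i log2 p_i over the digits of fixed-length numbers."""
    counts = Counter()
    for x in numbers:
        # represent every value with n_digits base-`base` digits (leading zeros kept)
        digits = []
        for _ in range(n_digits):
            digits.append(x % base)
            x //= base
        counts.update(digits)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

if __name__ == "__main__":
    data = [0, 5, 120, 4070, 13, 0, 900, 2, 0, 77]      # many leading zeros
    h_a = model_a_entropy(data)
    print(f"H_A = {h_a:.3f} bits/digit, i.e. >= {10 * h_a:.2f} bits/number under model A")
```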

The main defect of the model A is that the source outcomes do not coincide with the data feature. For example, all $n$-digit binary integers with a single digit 1 are equiprobable as outcomes of this source: $p(100\cdots0) = p(00\cdots010\cdots0) = p(00\cdots01)$. But according to the redundancy forms 1) and 2), the probability of zero values for high-order digits must be higher than for low-order ones. There are two causes for this defect. To outline these we consider the number sequence $n_1, n_2, \ldots, n_s, \ldots$, each number being represented as a sequence of digits $n_s = d_{s,n} d_{s,n-1} \cdots d_{s,1}$. First, while the number sequence is stationary, the digit sequence is not. Only the sequences $d_{1,j}, d_{2,j}, \ldots, d_{s,j}, \ldots$ are of this kind. Second, the successive digits are not independent.

Models based on Markov chains are usually exploited to take advantage of interdependence between source letters (digits). In the case of a first-order Markov source, for example, the description of the source consists of a matrix of conditional probabilities $p_{ij} = p(i\,|\,j)$, $i = 0, 1, \ldots, t-1$, and $j = 0, 1, \ldots, t-1$. The source entropy is then given by $H = -\sum_{j=0}^{t-1} p_j \sum_{i=0}^{t-1} p_{ij} \log p_{ij}$. The entropy tends to be overestimated when the digit interdependence is not considered. Models that exploit the relationship between source letters may achieve better compression than predicted by the entropy


calculation (1) based only on digit probabilities. The use of Markov models is dealt with in papers [13]-[16]. These models have the same defect as the previous one. The reason for this defect is nonstationarity. For example, the conditional probability $p(0\,|\,0)$ essentially depends on what kind of a digit is the first (conditioning) zero: high-order insignificant, significant, or the last in the integer part of the number.

The Lempel-Ziv compression algorithm consists of a rule for parsing strings of letters from a finite primary alphabet into substrings whose length does not exceed a prescribed integer $L_s$, and a coding scheme that maps these substrings sequentially into uniquely decipherable codewords of fixed length $L_c$ [4]. This coding represents a departure from the classical view of a code as a mapping from a fixed set of source messages (block-to-variable code) to a fixed set of codewords (variable-to-block code). The compression ratio achieved by the Lempel-Ziv code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes [4].

There are many other, often empirical compression methods, which are being used in computer systems. Let us consider the most popular of them.

Bit mapping is suitable for the second form of redundancy. A mapping bit is appended to the front of each data item, 1 indicating that data is present and 0 indicating the absence of the data item. The zero value items are dropped during compression and reinserted during decompression. Bit mapping corresponds to the following data model (model B). The source B emits an $n$-digit zero or nonzero number with probabilities $p$ and $q = 1 - p$, correspondingly. For a binary source, assuming that all nonzero number values are equiprobable, the probability of digit one equals $(q/2)\bigl(2^n/(2^n - 1)\bigr)$, and if $n$ is large enough it may be estimated as $q/2$. The average code length equals $\bar{l}_B = p + (1 + n)(1 - p) = 1 + nq$. The entropy for a binary source A with probability of one of the digits equaling $q/2$ may be estimated as $H_A = -(q/2)\log(q/2) - (1 - q/2)\log(1 - q/2)$.

Statement 1: For any $q$, $0 < q < 1$, there exists $n^*$ such that $\bar{l}_B < nH_A$ for any $n > n^*$.

To see this, consider the function $f(q) = \bar{l}_B/n - H_A - 1/n = q - H_A$. This function for $0 < q < 1$ is negative, since $f(0) = f(1) = 0$ and $f''(q) = \log e/\bigl(q(2-q)\bigr) > 0$, $0 < q < 1$. Since $f(q)$ does not depend on $n$, the statement follows from the continuity of the function.

Consequently, if $n$ is large enough, bit mapping achieves better compression than any method based on model A. The main defect of bit mapping is that it incorporates a very primitive source and data model. Bit mapping is one of the null suppression methods.

Another very popular method of this kind is runlength encoding. A special character is inserted in the digit string to indicate the presence of a run of zeros, and the run is replaced by the count indicating its length.

Consider an infinite "interim" alphabet $\{b_i\}$, $i = 0, 1, 2, \ldots$, where $b_0 = 1$, $b_1 = 01$, $b_2 = 001$, and so on. It is evident that, after an additional bit 1 is appended, any binary digit sequence may be uniquely represented by this alphabet, and conversely, any sequence of interim alphabet letters has a binary digit representation. Runlength encoding corresponds to a source C, which has as its outcomes sequences of letters of the alphabet $\{b_i\}$, i.e., zero runs of length $i$ followed by digit 1.

If the source C is stationary and memoryless, the entropy per letter of the alphabet $\{b_i\}$ equals $H_C = -\sum_{i=0}^{\infty} q_i \log q_i$, where $q_i$ denotes the probability of the occurrence of letter $b_i$. The entropy per binary alphabet digit equals $H_C/\bar{l}_C$, where $\bar{l}_C = \sum_{i=0}^{\infty} q_i(i+1) = 1 + \sum_{i=0}^{\infty} i q_i$ is the average length (in bits) of an alphabet $\{b_i\}$ letter. Every letter of the alphabet $\{b_i\}$ includes the single digit 1. Therefore the probability of digit 1 in the binary sequence equals $q = 1/\bar{l}_C$ and the probability of digit 0 equals $p = 1 - q$.

Theorem 1: For an arbitrary binary source suppose we know the probability $p$, $0 < p < 1$, and probabilities $\{q_i\}$, $0 \le q_i < 1$, such that $\sum_{i=0}^{\infty} q_i = 1$. Then $H_C/\bar{l}_C \le H_A$, and the equality holds only if the source is a source A.

Proof: Given the probabilities $\{q_i\}$,

$$p = 1 - q = 1 - 1/\bar{l}_C = 1 - \frac{1}{1 + \sum_{i=0}^{\infty} i q_i}.$$

Let $p$ and, consequently, $H_A$ and $\bar{l}_C$ be given. We determine the maximum value the ratio $H_C/\bar{l}_C$ can take under the constraints

$$\sum_{i=0}^{\infty} q_i = 1, \tag{2}$$

$$1 + \sum_{i=0}^{\infty} i q_i = \bar{l}_C = 1/q = 1/(1 - p) = \mathrm{const}. \tag{3}$$

Using the Lagrange multipliers method, define

$$L = -\sum_{i=0}^{\infty} q_i \log q_i - \lambda_1 \sum_{i=0}^{\infty} q_i - \lambda_2 \sum_{i=0}^{\infty} i q_i.$$

Equations $\partial L/\partial q_i = 0$, $i = 0, 1, \ldots$, and, consequently, $\partial L/\partial q_{i+1} - \partial L/\partial q_i = 0$ imply $q_{i+1}/q_i = C = \mathrm{const}$, or $q_i = q_0 C^i$. From (2), it follows that $q_0 = 1 - C$, and from (3), taking into account that

$$\sum_{i=0}^{\infty} x^i = \frac{1}{1-x}\quad\text{and}\quad\sum_{i=0}^{\infty} i x^i = \frac{x}{(1-x)^2},\qquad 0 < x < 1, \tag{4}$$

we get $C = p$ and $q_i = p^i(1-p)$; i.e., $q_i$ represents the probability of a source A outcome, consisting of $i$ zeros followed by digit 1. And finally it is easily verified that this extremum is a maximum. $\square$

Theorem 1 ensures that potentially runlength encoding is better than methods based on model A, e.g., the Huffman method.


It exploits the high probability of zero-runs but it does not take into account the fact that these runs consist mainly of leading and trailing zeros. According to model C, a run of $l$ zeros, $l < n$, may be located in every part of the number equiprobably.

All the methods considered, regardless of the model used, are suitable for compression of digit sequences because of the relatively high probabilities of digit 0 or zero-runs. But none of these source and data models completely takes into account the numerical computer data feature.
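As an aside (not part of the paper), the inequality of Theorem 1 is easy to check numerically. The sketch below, under the assumption that the run-length statistics $q_i$ are simply measured from a bit string, compares the per-bit run-length entropy $H_C/\bar{l}_C$ of model C with the digit entropy $H_A$ of model A.

```python
from math import log2

def h_binary(p):
    """Binary entropy -p log p - (1-p) log(1-p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def run_length_stats(bits):
    """Split a 0/1 string into letters b_i = 0^i 1 (a final bit 1 is appended)."""
    runs, count = [], 0
    for b in bits + "1":              # appended delimiter, as in the text
        if b == "0":
            count += 1
        else:
            runs.append(count)
            count = 0
    q = {i: runs.count(i) / len(runs) for i in set(runs)}
    return q, runs

def compare(bits):
    q, runs = run_length_stats(bits)
    h_c = -sum(qi * log2(qi) for qi in q.values())
    l_c = sum((i + 1) * qi for i, qi in q.items())     # average letter length in bits
    p_zero = 1 - len(runs) / (len(bits) + 1)           # probability of 0 implied by the runs
    h_a = h_binary(p_zero)
    print(f"H_A = {h_a:.3f}   H_C/l_C = {h_c / l_c:.3f}  (Theorem 1: H_C/l_C <= H_A)")

compare("0000000100110000000000010000100000000001")
```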

IV. A NEW DATA MODEL

The present study proposes a new approach to source and data modeling, which consists of the description of the aforenamed feature by probabilities $p_{ij}$, $i = 0, 1, \ldots, n$, $j = 0, 1, \ldots, n-i$, of data values with $i$ leading and $j$ trailing zeros. The triangular matrix $\|p_{ij}\|$ determines a source, which emits numbers (sequences of $n$ digits) in the form of

$$\underbrace{00\cdots0}_{i}\;y\,x\cdots x\,y\;\underbrace{00\cdots0}_{j}$$

with probabilities $p_{ij}$, where $x$ and $y$ are base-$t$ digits, $x \in \{0, 1, \ldots, t-1\}$, $y \in \{1, 2, \ldots, t-1\}$. The left and the right $y$'s coincide if $i + j = n - 1$, and they disappear if $i + j = n$. As $(n+1)$ pairs $(i, j)$, $i + j = n$, correspond to an all-zero item, so that the probability of this item would not be broken up into $(n+1)$ separate probabilities, for the numbers of format $f(n, m)$ we assume $p_{n-m,m} \ne 0$ and $p_{n-i,i} = 0$, $i = 0, 1, \ldots, m-1, m+1, \ldots, n$. In what follows, this source and the corresponding data model will be referred to as source D and model D. We note that the probabilities (frequencies of occurrence) $p_{ij}$ can be easily estimated.

Every pair $(i, j)$ determines a set $M_{ij}$ of $(n-i-j)$-digit numbers, each number having neither leading nor trailing zeros. The set $M_{ij}$ consists of the $N_{ij}$ elements,

$$N_{ij} = \begin{cases}(t-1)^2\,t^{\,n-i-j-2}, & \text{if } n-i-j > 1,\\ t-1, & \text{if } n-i-j = 1,\\ 1, & \text{if } n-i-j = 0.\end{cases}$$

For simplicity, without loss of generality, all numbers are assumed to be binary, i.e., $y = 1$ and $x \in \{0, 1\}$. Let the matrix $\|p_{ij}\|$ be known, and let us assume that in the numbers of the set $M_{ij}$, digit $x$ equals zero with probability $q_{ij}$. The leading $00\cdots01$ and the trailing $100\cdots0$ parts of the number designation are uniquely determined by the pair $(i, j)$. Therefore, the entropy of these parts of the number equals $-p_{ij}\log p_{ij}$. Then we can estimate the upper bound on the entropy of the $n$-digit numbers as

$$H_D = -\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij}\log p_{ij} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-i-2} p_{ij}(n-i-j-2)\,H(q_{ij}), \tag{6}$$

where $H(q) = -q\log q - (1-q)\log(1-q)$.
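The remark that the probabilities $p_{ij}$ are easy to estimate can be made concrete with a short sketch (mine, not the author's; names and the treatment of the all-zero item are illustrative): it counts leading and trailing zeros of fixed-length binary items and accumulates the triangular matrix of frequencies.

```python
from collections import Counter

def leading_trailing_zeros(value, n):
    """Return (i, j): numbers of leading and trailing zeros of an n-digit binary item."""
    if value == 0:
        return n, 0          # simplified convention; the paper assigns the all-zero item to (n-m, m)
    digits = format(value, f"0{n}b")
    i = len(digits) - len(digits.lstrip("0"))
    j = len(digits) - len(digits.rstrip("0"))
    return i, j

def estimate_pij(values, n):
    """Estimate the triangular matrix ||p_ij|| of model D from a sample of items."""
    counts = Counter(leading_trailing_zeros(v, n) for v in values)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

if __name__ == "__main__":
    sample = [0b0000010000, 0b0000001100, 0b0000001000, 0, 0b1010101010]
    for (i, j), p in sorted(estimate_pij(sample, 10).items()):
        print(f"p[{i},{j}] = {p:.2f}")
```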

Theorem 2: Consider a binary source which emits messages consisting of $n$-digit sequences (numbers), $n > 1$. For this source let the probability $p$ of digit 0, $0 < p < 1$, and the matrix $\|p_{ij}\|$, $0 \le p_{ij} < 1$, $i = 0, 1, \ldots, n$, $j = 0, 1, \ldots, n-i$, $\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij} = 1$, be known (measured), $p_{ij}$ being the probability of a number with $i$ leading and $j$ trailing zeros. Then $H_D \le nH_A$, and the equality holds only if the source is a source A.

Proof: Let the probability of digit $x$ in the numbers of the set $M_{ij}$ being 0 be denoted $q_{ij}$, the matrix $\|q_{ij}\|$ being unknown. The frequency of occurrence of digit 0 in the source outcomes equals

$$p = \frac{1}{n}\left[\sum_{i=0}^{n}\sum_{j=0}^{n-i}(i+j)\,p_{ij} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-i-2}(n-i-j-2)\,p_{ij}\,q_{ij}\right].$$

We shall determine the maximum value the entropy $H_D$ can take under the constraints

$$\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij} = 1 \tag{7}$$

and

$$\sum_{i=0}^{n}\sum_{j=0}^{n-i}(i+j)\,p_{ij} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-i-2}(n-i-j-2)\,p_{ij}\,q_{ij} = np = \mathrm{const}. \tag{8}$$

Using the Lagrange multiplier method, we define

$$L = H_D - \lambda_1\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij} - \lambda_2\left[\sum_{i=0}^{n}\sum_{j=0}^{n-i}(i+j)\,p_{ij} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-i-2}(n-i-j-2)\,p_{ij}\,q_{ij}\right].$$

From the equations $\partial L/\partial q_{ij} = 0$, we have $p_{ij}(n-i-j-2)\bigl(-\log q_{ij} + \log(1-q_{ij}) - \lambda_2\bigr) = 0$, $i = 0, 1, \ldots, n-2$, $j = 0, 1, \ldots, n-i-2$. Hence, $q_{ij} = q = \mathrm{const}$,

$$\lambda_2 = \log\bigl((1-q)/q\bigr), \tag{9}$$

i.e., for given $H_A$ the entropy $H_D$ is maximum when all the $q_{ij}$ are equal. Denote any $p_{ij}$, $i + j = k$, as $p_k$. From the equations


$\partial L/\partial p_{ij} = 0$, we have

$$-\log p_k - 1 + (n-k-2)H(q) - \lambda_1 - \lambda_2 k - \lambda_2(n-k-2)q = 0, \tag{10}$$

$k = 0, 1, \ldots, n-2$, and

$$-\log p_k - 1 - \lambda_1 - \lambda_2 k = 0,\qquad k = n-1,\ n. \tag{11}$$

Taking into account (9), we have from (10), (11)

$$p_k = p_0\,q^k,\quad k = 0, 1, \ldots, n-2,\qquad p_{n-1} = p_0\frac{q^{n-1}}{1-q},\qquad p_n = p_0\frac{q^n}{(1-q)^2}. \tag{12}$$

It is easily verified by considering the second derivatives that this extremum is a maximum. Each $p_k$, $k = 0, 1, \ldots, n-1$, represents a subset of $(k+1)$ different $p_{ij}$. Substituting (12) into (7), we have

$$\sum_{k=0}^{n-2}(k+1)\,p_0 q^k + n\,p_0\frac{q^{n-1}}{1-q} + p_0\frac{q^n}{(1-q)^2} = 1,$$

and using (4), we obtain

$$p_0 = (1-q)^2. \tag{13}$$

Substituting (12), (13) into (8) and using (4) we obtain $q = p$. We calculate the probability of digit $d_k$ being equal to 1: $p(d_1 = 1) = p(d_n = 1) = 1 - p$ and $p(d_k = 1) = 1 - p$, $k = 2, 3, \ldots, n-1$. Thus every digit equals 1 with the same probability $1 - p$, i.e., the maximizing source is a source A, and in this case $H_D = nH_A$. $\square$

Theorem 2 states that irrespective of the nature of the real source, the matrix $\|p_{ij}\|$ is more informative than the probability $p$, and only if the source is a discrete stationary memoryless one are these characteristics equivalent.

Let us note that model B corresponding to bit mapping is a simple particular case of the new model, namely, when only the probability $p_B = \sum_{i+j=n} p_{ij}$ is measured. We give the following statement without proof.

Statement 2: For $n > 1$ and any values $\{p_{ij}\}$, $0 < p_{ij} < 1$, $\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij} = 1$,

$$\bar{l}_B > H_B \ge H_D,$$

and the equality holds only if $p_{ij}/N_{ij} = (1 - p_B)/(2^n - 1) = \mathrm{const}$ for all $i, j$, $i + j < n$.

The most likely assumption for real-life numerical data is that all possible values of $x$ are equiprobable, i.e., all $q_{ij}$ in (6) are equal to $1/2$ (assumption E). Let us denote this variant of source D by index E. For binary $n$-digit numbers the entropy (6) equals

$$H_E = -\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij}\log p_{ij} + \sum_{i=0}^{n-2}\sum_{j=0}^{n-i-2} p_{ij}(n-i-j-2) \tag{14}$$

or

$$H_E = -\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij}\log p_{ij} + \sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij}\log N_{ij} \tag{15}$$

$$= -\sum_{i=0}^{n}\sum_{j=0}^{n-i} p_{ij}\log\frac{p_{ij}}{N_{ij}}. \tag{16}$$

In (14), the first double sum may be treated as the entropy of the leading $00\cdots01$ and trailing $100\cdots0$ parts of the numbers, and the second double sum corresponds to the entropy of the $(n-i-j-2)$-digit binary numbers $xx\cdots x$. The entropy in (16) means that in every set $M_{ij}$ all the $N_{ij}$ number values are equiprobable. If the assumption E is valid, then by Theorem 2, for any given $p$, $0 < p < 1$, $\max H_E \le \max H_D = nH_A$, and the equality holds only if all the number values are equiprobable. On the other hand, for any given matrix $\|p_{ij}\|$, $\max H_D = H_E$.

V. CODING OF BINARY CODED DECIMAL NUMBERS

In computer systems, binary-coded decimals (packed decimals) are also often used, each decimal digit being represented by four bits.

Let us consider $n$-digit binary coded $t$-ary numbers. Each digit is represented by $m$ bits, $t < r = 2^m$, $m > 1$. For these numbers consider the values of the form

$$\underbrace{00\cdots0}_{k}\,y\,xx\cdots x, \tag{17}$$

where $k = 0, 1, \ldots, n$, $y = 1, 2, \ldots, t-1$, $x = 0, 1, \ldots, t-1$, and, without loss of generality, $t = r - 1$, and let $p_k$ denote the probability of number values with $k$ leading zeros. Various data compression methods using the digit representation redundancy for $t < r$ are possible.

A number representation with delimiters is one of these methods (representation F). The number has the form of

$$z\,y\,xx\cdots x, \tag{18}$$

where $z$ is a delimiter, $z = t = r - 1$. The average number length for representation F is equal to

$$\bar{l}_F = \sum_{k=0}^{n}(n-k+1)\,p_k \tag{19}$$

symbols.


Generalizing (15) for $t$-ary numbers ($t$-letter source alphabet), the model E entropy equals

$$H'_E = -\sum_{k=0}^{n} p_k\log p_k + \sum_{k=0}^{n-1}(n-k)\,p_k\log t.$$

Statement 3: $m\bar{l}_F > H_E$.

Proof: Let us denote by $m\bar{l}_F$ the average length of representation F in bits. It is easily verified that $H'_E \ge H_E$. Let us consider the conditional extremum

$$\min F(P),\qquad \sum_{k=0}^{n} p_k - 1 = 0,$$

where $F(P) = m\bar{l}_F - H'_E$, $P = (p_0, p_1, \ldots, p_n)$. The extremum point $P^*$ is defined by

$$p_k^* = \frac{(r/t)^k\bigl((r/t) - 1\bigr)}{(r/t)^{n+1} - 1},\qquad k = 0, 1, \ldots, n,$$

and

$$F(P^*) = m + (m - \log t)\,n + \log\frac{(r/t) - 1}{(r/t)^{n+1} - 1} = m + (m - \log t)\,n + \log\frac{r-t}{t} - \log\bigl((r/t)^{n+1} - 1\bigr)$$
$$> m + (m - \log t)\,n + \log(r-t) - \log t - (n+1)m + (n+1)\log t = \log(r-t) = 0.$$

It is easily verified that this extremum is a minimum and, hence,

$$m\bar{l}_F - H_E \ge F(P) \ge F(P^*) > 0.\qquad\square$$

Runlength encoding used separately or in combination with any other coding method is in fact the standard data compression technique for computer systems. An array of $u$ $n$-digit numbers is considered as a sequence of $un$ digits, and every run of $s$, $2 < s < r + 2$, zeros is replaced by a delimiter $z$, $z = r - 1$, followed by a binary $m$-bit code of the number $(s - 3)$ (representation G).

If the assumption E is valid, the entropy $H_E$ determines the greatest lower bound on the average code length, and the evaluation given by (14) cannot be improved.

As very efficient techniques J. Martin [11] recommends the combinations of two compression methods: at first runlength encoding or a representation with delimiters is to be used, and then Huffman coding is applied to representation F or G, the source alphabet including $t$ symbols $\{0, 1, \ldots, t-1\}$ and an additional symbol $z$. Denote such two-step coding by index H or I depending on whether the first-step representation is F or G.

Every number in the representation F consists of one symbol $z$ and on average $(1 - p_n)$ and $\sum_{k=0}^{n-1}(n-k-1)\,p_k$ symbols $y$ and $x$, correspondingly. The average number length equals $\bar{l}_F = \sum_{k=0}^{n} p_k\bigl(1 + (n-k)\bigr)$ symbols of the $(t+1)$-letter alphabet $\{0, 1, \ldots, t-1, z\}$. The frequency of occurrence of $z$ is $p(z) = 1/\bar{l}_F$. Assuming that $x$ and $y$ take all possible values equiprobably, the frequency of occurrence of digit 0 equals

$$p(0) = \frac{1}{t\,\bar{l}_F}\sum_{k=0}^{n-1}(n-k-1)\,p_k,$$

and the frequency of occurrence of every nonzero digit equals

$$p(1) = \frac{1}{\bar{l}_F}\left[\frac{1}{t}\sum_{k=0}^{n-1}(n-k-1)\,p_k + \frac{1 - p_n}{t-1}\right].$$

The entropy $H_H$ equals $H_H = -p(z)\log p(z) - p(0)\log p(0) - (t-1)\,p(1)\log p(1)$.

At first sight the coding technique H seems to be optimal. As the entropy evaluation (14) cannot be improved, it means that $H_H = H_E$. But in reality $H_H > H_E$, because the symbols in the number representation F are not independent. Model H is suitable for coding any sequence of independent symbols $\{0, 1, \ldots, t-1, z\}$. Model E is suitable for coding only a certain subset of such sequences. For example, $z$ cannot be followed by 0, between two symbols $z$ there cannot be more than $n$ other symbols, etc. The analogous conclusion ($H_I > H_E$) can be drawn about runlength encoding followed by Huffman coding.
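To make representation F concrete, here is a small sketch (not from the paper; the packing of each symbol into $m$ bits is left out, and names are illustrative) that rewrites an array of $n$-digit base-$t$ numbers using the spare digit value $z = t$ as a delimiter and drops the leading zeros of each number.

```python
def encode_f(numbers, n, t):
    """Representation F: each n-digit base-t number becomes the delimiter z
    followed by its significant digits (leading zeros dropped); z = t is the
    spare value of the m-bit digit code (t = r - 1 in the text)."""
    z = t
    out = []
    for x in numbers:
        digits = []
        for _ in range(n):
            digits.append(x % t)
            x //= t
        digits.reverse()
        first = next((idx for idx, d in enumerate(digits) if d != 0), n)
        out.append(z)
        out.extend(digits[first:])       # the all-zero item becomes just the delimiter
    return out

def decode_f(symbols, n, t):
    """Inverse mapping: split on the delimiter and restore the leading zeros."""
    z = t
    numbers, current = [], None
    for s in symbols + [z]:
        if s == z:
            if current is not None:
                value = 0
                for d in current:
                    value = value * t + d
                numbers.append(value)
            current = []
        else:
            current.append(s)
    return numbers

data = [7, 0, 1203, 45]
coded = encode_f(data, n=9, t=10)
assert decode_f(coded, n=9, t=10) == data
print(len(coded), "symbols instead of", 9 * len(data))
```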


VI. CODE STRUCTURE

Let the pairs $(i, j)$ be numbered. The entropy of the numbers of the pairs $(i, j)$, $i = 0, 1, \ldots, n$, $j = 0, 1, \ldots, n-i$, equals $-\sum_{i,j} p_{ij}\log p_{ij}$, which coincides with the first term in the right side of (15). The set $M_{ij}$ consists of $N_{ij}$ elements, and the length of a fixed-length binary code for coding the numbers of the elements equals $\lceil\log N_{ij}\rceil$, where $\lceil a\rceil$ is the least integer not smaller than $a$.

Now, the expression (15) prompts the code structure corresponding to the data model E. For coding the numerical data a concatenated code can be used, $C_E = C_1 + C_2$, where $C_1$ indicates the pair $(i, j)$, and a fixed-length code $C_2$ identifies the particular number in the set $M_{ij}$.

To construct an efficient code $C_E$, i.e., to choose codes $C_1$ and $C_2$, both the compression ratio and the coding/decoding complexity (number of computer operations per byte of original data) have to be taken into account. The compression ratio is mainly determined by the choice of code $C_1$. The coding/decoding complexity is determined by the chosen code $C_1$ and by the algorithms for mapping the original number value into its ordinal number in the corresponding set $M_{ij}$, including the inverse mapping.

Processing and storage costs are dropping much faster than communication costs. For this reason a tradeoff between communication efficiency and processing complexity makes minimum-redundancy codes preferable for applications in the communication area. For example, a Huffman code or its adaptive modification [17]-[19] can be used as code $C_1$.
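A minimal sketch of the code $C_E$ for binary items (mine, not the paper's implementation; a plain dict stands in for the Huffman codebook of $C_1$) shows the two parts: $C_1$ names the pair $(i, j)$ and $C_2$ is the $\lceil\log N_{ij}\rceil$-bit ordinal of the value inside $M_{ij}$, which for binary numbers is simply the digits strictly between the first and last significant 1's.

```python
def split_item(value, n):
    """Return the pair (i, j) and the middle digits of an n-digit binary item."""
    if value == 0:
        return (n, 0), ""                        # all-zero item: only the pair is coded
    bits = format(value, f"0{n}b")
    i = len(bits) - len(bits.lstrip("0"))
    j = len(bits) - len(bits.rstrip("0"))
    return (i, j), bits[i + 1 : n - j - 1]       # digits strictly between the two 1's

def encode_ce(value, n, c1_codebook):
    """C_E = C_1(pair) + C_2(ordinal inside M_ij), both as bit strings."""
    pair, middle = split_item(value, n)
    return c1_codebook[pair] + middle            # middle is already the fixed-length ordinal

# toy C1 codebook (in practice a Huffman code built from the measured p_ij)
c1 = {(6, 3): "0", (6, 2): "10", (2, 5): "110", (10, 0): "111"}
print(encode_ce(0b0000001000, 10, c1))           # pair (6, 3), empty middle -> "0"
print(encode_ce(0b0010100000, 10, c1))           # pair (2, 5), middle "0"   -> "1100"
```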

In some other applications data compression/decompression benefits only if few operations per original number are required. In this case, the following modification of code E can be used (code J). Let $K$ pairs $(i_p, j_p)$ be given,

$$0 = i_0 \le i_1 \le i_2 \le \cdots \le i_{K-1},$$
$$0 = j_0 \le j_1 \le j_2 \le \cdots \le j_{K-1},$$
$$i_p + j_p < i_{p+1} + j_{p+1} \le n,$$
$$p = 0, 1, \ldots, K-1,\qquad K = 2^k.$$

Each pair $(i_p, j_p)$ determines a set $S_p$ of numbers having at least $i_p$ leading and at least $j_p$ trailing zeros and not belonging to $S_{p+1}$, where $S_K = \emptyset$.

Without loss of generality, only binary numbers may be considered. A fixed-length $k$-bit binary code of the number $p$ is taken as code $C_1$. Instead of the ordinal number of an element in the set $S_p$, the corresponding $W_p$-digit part of the original number is used, $W_p = n - i_p - j_p$:

$$C_J = C_1 + d_{n-i_p}d_{n-i_p-1}\cdots d_{j_p+1}.$$

The following example will serve to illustrate the code $C_J$. An array of binary numbers of format $f(10, 3)$ is to be coded. Let $k = 2$, and $(i_0, j_0) = (0, 0)$, $(i_1, j_1) = (5, 2)$, $(i_2, j_2) = (6, 3)$,

Fig. 1. Matrix $\|p_{ij}\|$ subdivided into four sets $S_0$, $S_1$, $S_2$, and $S_3$.

TABLE I
EXAMPLES OF CODE $C_J$

Number          p    C_1   C_2           C_J
0000000000      3    11    --            11
0000001000      2    10    1             101
0000001100      1    01    011           01011
0000010000      1    01    100           01100
1010101010      0    00    1010101010    001010101010
0000000001      0    00    0000000001    000000000001

TABLE II
PROBABILITIES $p_i$

 i    p_i        i    p_i        i    p_i
20    0.4780    13    0.0268     6    0.0005
19    0.1080    12    0.0173     5    0.0002
18    0.0990    11    0.0112     4    0.0001
17    0.0848    10    0.0071     3    0.0001
16    0.0710     9    0.0039     2   <0.0001
15    0.0514     8    0.0020     1   <0.0001
14    0.0365     7    0.0012     0   <0.0001

$(i_3, j_3) = (7, 3)$, as is shown in Fig. 1. Notice that the sets $S_3$ and $S_2$ consist of a single number, and the corresponding codes $C_1$ uniquely determine it. The codes of some numbers are given in Table I. The average codelength equals

$$\bar{l}_J = k + \sum_{p=0}^{K-1} W_p\bigl(Q(i_p, j_p) - Q(i_{p+1}, j_{p+1})\bigr),$$

where $Q(i, j) = \sum_{i'\ge i}\sum_{j'\ge j} p_{i'j'}$ is the probability that a number has at least $i$ leading and at least $j$ trailing zeros, and $Q(i_K, j_K) = 0$. The best parameters of the code minimize this length:

$$\min_{k}\left(k + \min_{\{(i_p, j_p)\}}\sum_{p=0}^{K-1} W_p\bigl(Q(i_p, j_p) - Q(i_{p+1}, j_{p+1})\bigr)\right).$$
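The code J encoder is only a table lookup plus a bit-field copy. The sketch below (not the author's code; it hard-wires the example pairs of Fig. 1) reproduces the codes of Table I.

```python
PAIRS = [(0, 0), (5, 2), (6, 3), (7, 3)]         # (i_p, j_p) for f(10, 3), k = 2
N = 10
K = len(PAIRS)

def encode_j(value):
    """Code J: 2-bit index of the set S_p, then the W_p middle digits of the number."""
    bits = format(value, f"0{N}b")
    if value == 0:
        i, j = 7, 3                              # the all-zero item of f(10, 3) is the pair (7, 3)
    else:
        i = len(bits) - len(bits.lstrip("0"))
        j = len(bits) - len(bits.rstrip("0"))
    # largest p whose pair is dominated by (i, j)
    p = max(q for q in range(K) if PAIRS[q][0] <= i and PAIRS[q][1] <= j)
    ip, jp = PAIRS[p]
    c1 = format(p, "02b")
    c2 = bits[ip : N - jp]                       # the W_p = N - i_p - j_p middle digits
    return c1 + c2

for x in [0b0000000000, 0b0000001000, 0b0000001100, 0b0000010000,
          0b1010101010, 0b0000000001]:
    print(format(x, f"0{N}b"), "->", encode_j(x))
```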


TABLE III
COMPRESSION PERFORMANCE
(entropy $H$ or average code length $\bar{l}$ in bits/number; compression ratio $R$ in bits/bit)

Source Model or Coding Algorithm             Array 1             Array 2
                                             H or l     R        H or l     R
Original computer representation             32         1.0      40         1.0
Fixed-length binary numbers                  20         1.6      30         1.3
Stationary memoryless source (model A)       6.15       5.2      9.61       4.16
Bit mapping (model B)                        11.44      2.8      --         --
First-order Markov source                    5.64       5.7      9.38       4.26
Second-order Markov source                   5.39       5.9      9.20       4.35
Runlength encoding (model C)                 4.38       7.3      8.23       4.9
Two-step coding H                            4.59       7.0      --         --
Two-step coding I                            --         --       9.78       4.1
Model E                                      3.99       8.0      6.74       5.9
Coding E (C_1 is Huffman code)               4.01       8.0      7.37       5.4
Coding J                                     5.03^1     6.4      8.74^2     4.6
Archival programs (the best result)          5.30       6.0      10.60      3.8

^1 Results for k = 2, i_0 = 0, i_1 = 12, i_2 = 17, i_3 = 20.
^2 Results for k = 2, (i_0, j_0) = (0, 0), (i_1, j_1) = (3, 3), (i_2, j_2) = (4, 3), (i_3, j_3) = (5, 3).

TABLE IV
THE MATRIX $\|p_{ij}\|$ FOR ARRAY 2 (rows: number of leading zeros $i = 0, 1, \ldots, 9$; columns: number of trailing zeros $j = 0, 1, \ldots, 9-i$; individual entries are not reliably recoverable from this transcript)

Given $k$, the internal minimum can be found by means of dynamic programming [20]. The external minimum can be easily found by setting $k$ to a few values $k = 0, 1, 2, \ldots$.
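One way to see what the internal minimum computes is the exhaustive sketch below (mine, not the paper's dynamic program; names and the toy matrix are illustrative): it evaluates $\bar{l}_J$ for every admissible chain of $K$ pairs starting from $(0, 0)$ and keeps the best one. The dynamic-programming version of [20] replaces the enumeration, not the cost.

```python
from itertools import combinations

def average_length_j(pairs, pij, n, k):
    """l_J = k + sum_p W_p (Q(i_p, j_p) - Q(i_{p+1}, j_{p+1})), with Q(i_K, j_K) = 0."""
    def Q(i, j):
        return sum(p for (a, b), p in pij.items() if a >= i and b >= j)
    total = k
    for p, (i, j) in enumerate(pairs):
        q_next = Q(*pairs[p + 1]) if p + 1 < len(pairs) else 0.0
        total += (n - i - j) * (Q(i, j) - q_next)
    return total

def best_pairs(pij, n, k):
    """Exhaustive stand-in for the internal minimum; the first pair is always (0, 0)."""
    K = 2 ** k
    cands = sorted(set(pij) - {(0, 0)}, key=lambda ij: (sum(ij), ij))
    feasible = []
    for combo in combinations(cands, K - 1):
        pairs = [(0, 0)] + list(combo)
        ok = all(a1 <= a2 and b1 <= b2 and a1 + b1 < a2 + b2 <= n
                 for (a1, b1), (a2, b2) in zip(pairs, pairs[1:]))
        if ok:
            feasible.append((average_length_j(pairs, pij, n, k), pairs))
    return min(feasible)

pij = {(0, 0): 0.05, (2, 1): 0.15, (4, 3): 0.30, (6, 3): 0.35, (7, 3): 0.15}
print(best_pairs(pij, n=10, k=1))
```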

VII. EXPERIMENTAL RESULTS

Some results of experimental studies of the performance of the new as well as common data compression algorithms on typical computer files are presented in this section. All compressed numbers were data items of inventory files.

Array 1: The number array consists of more than 166 thousand nonnegative integers, represented as fixed-length binary numbers, in the binary format $f(20, 0)$. The numbers in the original computer representation are aligned and the number length equals 4 bytes or 32 bits. Being represented as a binary string, a data item length reduces to 20 bits. The probability of 0 in this representation is $p = 0.945$. The probabilities $p_i$ of numbers with $i$ leading zeros are given in Table II.

The results of the comparison of the compression performance attainable using various source and data models (the entropy $H$), or attained by various compression algorithms (the average code length $\bar{l}$), are shown in Table III. In this table, $H$ and $\bar{l}$ are given in bits per number. The last line in Table III represents the best result attained by available archival programs for personal computers (PKZIP, ARC, PKPAK, LHICE).

Array 2: This array consists of more than one million positive binary coded decimals, number format $f(9, 3)$. The matrix $\|p_{ij}\|$ is given in Table IV, and the results of the comparison are presented in Table III.

Codings H and I are, in essence, implementations of runlength encoding. As may be expected, the entropy of


these codings is slightly larger than that of model C, which is a lower bound for all algorithms based on runlength encoding.

The entropy of model E is 9 percent (Array 1) and 18 percent (Array 2) less than the least entropy of the common models studied. Notice that the average code length for coding E is less than the entropies of all the common methods considered, i.e., the new coding in fact achieves a better compression ratio than the theoretical bound on the common coding techniques. Moreover, the average codelength attained by coding E is 24 percent (Array 1) and 30 percent (Array 2) less than the best result of available archival programs, which for these as well as for many other arrays was achieved by modifications of the Lempel-Ziv [4] technique. The computational complexity of coding E is not higher than that of Lempel-Ziv.

Even the simplest coding J, being incomparably easier computationally than the available programs, outperforms their best result by 9 and 18 percent for Arrays 1 and 2, respectively.

REFERENCES

[1] J. Rissanen and G. G. Langdon, Jr., "Universal modeling and coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, Jan. 1981.
[2] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, 1952.
[3] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, no. 6, pp. 520-540, June 1987.
[4] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343, May 1977.
[5] T. A. Welch, "A technique for high-performance data compression," Computer, vol. 17, no. 6, pp. 8-19, June 1984.
[6] D. A. Lelewer and D. S. Hirschberg, "Data compression," ACM Computing Surveys, vol. 19, no. 3, pp. 261-296, Sept. 1987.
[7] D. G. Severance, "A practitioner's guide to data base compression," Inform. Syst., vol. 8, no. 1, pp. 51-62, 1983.
[8] J. A. Storer, Data Compression: Methods and Theory. Rockville, MD: Comput. Sci. Press, 1988.
[9] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[10] R. N. Williams, Adaptive Data Compression. Boston, MA: Kluwer, 1990.
[11] J. Martin, Computer Data-base Organization. Englewood Cliffs, NJ: Prentice-Hall, 1977, ch. 32, pp. 572-587.
[12] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[13] M. Guazzo, "A general minimum-redundancy source-coding algorithm," IEEE Trans. Inform. Theory, vol. IT-26, pp. 15-25, Jan. 1980.
[14] G. G. Langdon, Jr., and J. Rissanen, "A double-adaptive file compression algorithm," IEEE Trans. Commun., vol. COM-31, pp. 1253-1255, Nov. 1983.
[15] J. A. Llewellyn, "Data compression for a source with Markov characteristics," Comput. J., vol. 30, no. 2, pp. 149-156, 1987.
[16] T. V. Ramabadran and D. L. Cohn, "An adaptive algorithm for the compression of computer data," IEEE Trans. Commun., vol. 37, no. 4, pp. 317-324, Apr. 1989.
[17] R. G. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. IT-24, pp. 668-674, Nov. 1978.
[18] D. E. Knuth, "Dynamic Huffman coding," J. Algorithms, vol. 6, no. 2, pp. 163-180, June 1985.
[19] J. S. Vitter, "Design and analysis of dynamic Huffman codes," J. ACM, vol. 34, no. 4, pp. 825-845, Oct. 1987.
[20] R. Bellman and S. E. Dreyfus, Applied Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1962.