
Overhead Storage Considerations and a Multilinear Method for Data File Compression

TZAY Y. YOUNG, MEMBER, IEEE, AND PHILIP S. LIU, MEMBER, IEEE

Abstract—The paper is concerned with the reduction of overhead storage, i.e., the stored compression/decompression (C/D) table, in field-level data file compression. A large C/D table can occupy a large fraction of main memory space during compression and decompression, and may cause excessive page swapping in virtual memory systems. A two-stage approach is studied, including the required additional C/D table decompression time. It appears that the approach has limitations and is not completely satisfactory.

A multilinear compression method is proposed which is capable of reducing the overhead storage by a significant factor. Multilinear compression groups data items into several clusters and then compresses each cluster by a binary-field linear transformation. Algorithms for clustering and transformation are developed, and data compression examples are presented.

Index Terms—Cluster analysis, data file compression, data transformation, multilinear approach, overhead storage, performance analysis, piecewise linear transformation, storage reduction techniques.

Manuscript received March 9, 1979; revised December 6, 1979. This work was supported in part by the National Science Foundation under Grant MCS 77-01483. Preliminary results were presented at COMPSAC 79. The authors are with the Department of Electrical Engineering, University of Miami, Coral Gables, FL 33124.

I. DATA FILE COMPRESSION AND OVERHEAD STORAGE

CONSIDER a data file consisting of fixed-length records. The contents of a record are divided into several fields, with each field representing an attribute or a key. To reduce the size of a file, data compression may be used at record level or field level. Record-level compression treats each record as a data vector during compression and decompression. In many files different types of data items are stored in the data fields, and some fields may be compressed more effectively than others. It may be more successful to compress the data fields separately, using different compression techniques. This is called field-level compression: data items in a data field, or possibly items formed by merging two or three fields together, are regarded as data vectors to be compressed, apart from the other data fields of the records. It is noted that data items within the same data field are not stored contiguously in the secondary memory.

A very simple and commonly used method for field-level compression is the fixed-length minimum-bit (FLMB) encoding scheme. Consider the set of data items in a data field. With N_r records in the file, there are N_r data items in the field, one for each record. But some of the data items may be identical, and there may be only N distinct data items, N < N_r. Let

$$n = \lceil \log_2 N \rceil \qquad (1)$$

where ⌈·⌉ denotes the smallest integer greater than or equal to its argument. Clearly each data item can be represented uniquely by an n-bit binary number. The compression ratio is l/n, where l is the bit length of the original data item. FLMB compression can be very effective when N is substantially smaller than N_r. For example, in a file of student records kept by an academic department, the field of course numbers (e.g., EEN 201) is of major concern, and it can be compressed effectively by the FLMB scheme since a course is taken by many students.

The FLMB scheme requires the construction and storage of a compression/decompression (C/D) table.
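To make the scheme concrete, here is a minimal Python sketch of FLMB encoding under the definitions above; the function name and the sample field are ours, not the paper's.

```python
import math

def flmb_table(field_items):
    """Build the FLMB compression/decompression (C/D) table: each of the
    N distinct data items gets a unique n-bit code, n = ceil(log2 N).
    The decode list (code -> original item) is the overhead storage."""
    distinct = sorted(set(field_items))            # the N distinct items
    n = max(1, math.ceil(math.log2(len(distinct))))
    encode = {item: code for code, item in enumerate(distinct)}
    return encode, distinct, n

# A field of course numbers from a file of student records.
field = ["EEN 201", "EEN 201", "MTH 101", "EEN 201", "MTH 101", "EEN 520"]
encode, decode, n = flmb_table(field)
compressed = [encode[x] for x in field]            # each item now n bits
assert [decode[z] for z in compressed] == field
print(f"N = {len(decode)} distinct items, n = {n} bits per item")
```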


The table size will be small and insignificant if the number of distinct data items is small. In many instances, however, the size of the C/D table can be substantial if N × l is large, and in some cases the table size, together with the total compressed data size, may approach the total original data size. During compression and decompression, a large C/D table will occupy a large fraction of the main memory space and may cause time-consuming page swapping in virtual memory systems. This paper addresses the overhead (C/D table) problem in field-level compression, and proposes methods for reducing overhead storage.

A new compression method called multilinear compression is proposed in the following section. It is capable of reducing the overhead storage significantly (say by a factor of 5 or more) with a somewhat smaller compression ratio. It is a flexible approach that may be used to achieve a balanced performance between overhead storage and compression ratio. Multilinear compression divides the data items into K clusters, and then compresses each cluster into m bits by an m × l modulo-2 linear transformation. In addition to the m bits, a cluster identification number of ⌈log₂ K⌉ bits is attached to each compressed data item. The overhead storage is the K binary matrices, which occupy substantially smaller memory space than the C/D table of the FLMB scheme. Multilinear compression is applicable when the distinct data items are clustered and/or similar.

An apparent alternative is a two-stage approach that compresses a field by the FLMB scheme at the first stage and reduces its C/D table size at the second stage. This approach has limitations depending on the method chosen for C/D table compression, and it is discussed in Section III. The additional decompression time incurred at the second stage is analyzed in terms of numbers of computer instructions. Experimental results on multilinear compression and two-stage compression are presented in Section IV.

In recent years there has been substantial interest in data file compression for storage reduction purposes [1]-[6], and data transformation techniques based on linear transformations in the real number field have been utilized in multikey searching [7]. Commonly used compression techniques include the difference method [1], the Huffman code [8], and the run length code [9]. With these approaches, the compressed data items are of variable lengths. The difference method usually requires the data items to be arranged or accessed in lexical order; otherwise, pointers must be attached to link the difference information to the previous data items [2]. Run length coding is effective only when the data items consist of long runs of zeros and/or ones. The Huffman code is a statistical approach which is optimal in terms of compression ratio. The statistics of the data items must be known, and the code also requires a C/D table. Elias [10] developed a new code which, as in the difference method, requires that the data items be sorted into lexical order. We shall discuss in some detail the application of the Elias code and the difference method in two-stage compression.

It is noted that the FLMB scheme is, among existing methods, the most appropriate method for data field compression. The scheme does not require that the data items be stored contiguously or in any order. It uses no pointers and hence does not incur the extra disk access cost associated with tracing pointers. It also has the desirable feature that the compressed data items are of the same length. The FLMB scheme is optimal in compression ratio when the data items are uniformly distributed. The proposed multilinear compression has many of the same desirable properties as the FLMB scheme, and it yields a more balanced performance in overhead storage and compression ratio.

II. MULTILINEAR DATA COMPRESSION

Consider a set of N distinct binary data items, with each item being of length l bits. Suppose, due to favorable data characteristics, the set lies completely in an m-dimensional subspace of the l-dimensional linear vector space defined on binary vectors and binary-field (modulo-2) operations. Then the data items may be compressed from l bits to m bits by a linear transformation, and the required overhead storage is the ml bits of an m × l binary matrix.

In the FLMB scheme the data items are compressed into n = ⌈log₂ N⌉ bits. A C/D table is retained, with each entry of the table being an original uncompressed data item, and its size is Nl = 2ⁿl bits, which grows exponentially with n. Hence, by comparison, the savings in overhead storage with linear compression could be substantial even though m > n. Unfortunately, linear compression is unrealistic since, in most cases, m is very close to l.

A more realistic approach is multilinear compression, which divides the data set into K clusters and uses an m × l transformation for each cluster. A ⌈log₂ K⌉-bit cluster identification (ID) number is attached to the m-bit transformed data item, and the total length is usually greater than n. Multilinear compression is applicable when the distinct data items are clustered and/or similar and when it is tolerable to have a compression ratio smaller than l/n under certain situations.

The value of m may range from 0 to the compressed data length of the linear method. A small m usually requires a large K, and vice versa. It is obvious that the multilinear method reduces to linear compression when K = 1. At the other extreme, with m = 0 and K = N, each data item forms a separate cluster, and the cluster ID numbers require ⌈log₂ N⌉ bits and become the compressed binary data items of the FLMB scheme. Hence, both linear compression and the FLMB scheme may be regarded as special cases of multilinear compression.

A disadvantage of the multilinear method is that the values of m and K must be selected for each data compression problem. The selection should be based on the desired tradeoff between overhead storage and compression ratio. Also, there are situations where the multilinear method will not work satisfactorily. If the data items are, on the whole, scattered almost randomly in the l-dimensional space, it is unlikely that a reasonable number of clusters can be formed with each one being compressible into m bits. The data items have to be somewhat clustered for the method to be successful. Another situation is that all 2^l data items are present; we note that in this case the data items cannot be compressed by any fixed-length scheme, including the linear, multilinear, and FLMB compression methods.


A. Linear Compression

Let x = (x₁, x₂, ..., x_l) be a binary data item in row vector form, and let the set of l-bit data items be denoted by S. The set is linearly compressible if there exists an m × l binary-valued matrix G of rank m, m < l, and an l-vector c such that

$$x = zG + c, \quad \text{for every } x \in S. \qquad (2)$$

This is a decompression equation, with the m-vector z representing the compressed data item, and all operations are modulo-2 operations. A linear compression is minimal if m is the smallest integer satisfying (2).

Linear compression forms the basis of multilinear compression. A detailed theory for linear compression is developed in the Appendix, and the results are summarized here. It is shown in the Appendix that the algorithm developed yields a minimal linear compression. A set of binary data items is linearly compressible if, and only if, there exists a nonsingular matrix A = [F, F'] such that for every x in the set

$$(x + c)F = z \qquad (3)$$

and

$$(x + c)F' = 0. \qquad (4)$$

The matrices F and G are compression and decompression matrices, respectively, and F' and (4) may be used to check whether a data item x is recoverable from the decompression equation (2). The vector c represents a linear translation. We note from the Appendix that A = [F, F'] = B⁻¹, where B consists of G and an (l − m) × l matrix G' chosen arbitrarily subject to the constraint of nonsingularity of B. Thus, F and F' can be calculated when needed for compression, and it is only necessary to store G and c, with a resulting overhead storage

$$S_L = (m + 1)l. \qquad (5)$$

The compression ratio is l/m.

The following algorithm calculates G and c from a given set of data items, and it yields a minimal linear compression; a code sketch follows the listing.

Algorithm 1:
1) Given a set of l-bit data items S, select arbitrarily an x ∈ S and let c = x.
2) Construct a set S_c = {x + c : x ∈ S}.
3) Select m linearly independent vectors from S_c to form the row vectors of G, where m is the maximal number of such vectors that can be selected from S_c. (The data set is not linearly compressible if m = l.)
4) Calculate F and F' from G as discussed in Theorem 1 of the Appendix.
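A minimal Python sketch of steps 1)-3), modeling l-bit items as integers and modulo-2 vector addition as XOR (the names and toy data are ours; the matrices of step 4 are constructed in the Appendix):

```python
def algorithm_1(S):
    """Algorithm 1 (sketch): pick c, form S_c = {x + c}, and extract a
    maximal set of linearly independent vectors over GF(2) to serve as
    the rows of G.  Returns (c, rows_of_G); m = len(rows_of_G)."""
    S = list(S)
    c = S[0]                          # step 1: arbitrary x in S as c
    rows = []                         # step 3: independent vectors of S_c
    for x in S:
        v = x ^ c                     # step 2: element of S_c
        for b in rows:                # Gaussian reduction against chosen rows
            v = min(v, v ^ b)         # clears v's bit at b's leading position
        if v:                         # v is independent: a new row of G
            rows.append(v)
            rows.sort(reverse=True)   # keep leading bits in descending order
    return c, rows

# Toy set with l = 6: the set lies in the coset c + span(G), so m = 2.
c, G = algorithm_1([0b110100, 0b110001, 0b000101, 0b000000])
print(f"c = {c:06b}, m = {len(G)}")
```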

B. Multilinear Compression

In a multilinear compression scheme, the distinct data items are grouped into K clusters C_i, i = 1, 2, ..., K, each with a cluster center c_i, so that the l-bit data items in C_i are linearly compressible to m bits. In other words, there exist m × l matrices G_i, and

$$x = zG_i + c_i, \quad x \in C_i. \qquad (6)$$

To compress a data item, we must first determine the cluster to which it belongs. As in linear compression, two matrices F_i and F_i' can be calculated from G_i when needed. Given a data item x, the equation

$$(x + c_i)F_i' = 0, \quad \text{if } x \in C_i \qquad (7)$$

is used to identify the cluster, and with C_i thus identified,

$$z = (x + c_i)F_i, \quad x \in C_i \qquad (8)$$

represents the compressed data item. It is conceivable that a data item will belong to more than one cluster; in this case, we arbitrarily choose one of these clusters for compression/decompression purposes. A cluster ID number must be attached to the compressed data item; hence the compressed item has a total length of m + ⌈log₂ K⌉ bits.

Decompression depends on the ID number and (6). The overhead storage requirement consists of G_i and c_i, i = 1, 2, ..., K, and hence

$$S_{ML} = Kl(m + 1). \qquad (9)$$

Clustering techniques have been studied and applied in other areas [11], [12]. For our purposes, we group data items using a sequential approach which generates clusters sequentially and examines the data items one at a time. The criterion is that each cluster must be linearly compressible by an l × m matrix F_i, and decompression matrices are obtained from the following algorithm using the linear independence concept. The original data items are arranged in some order, say in lexical order. Since two similar data items will have a small Hamming distance between them, another possibility is to arrange the data items according to the order in which a minimal spanning tree is constructed.

Algorithm 2:
1) Construct a data array by arranging the distinct data items in some order.
2) Let c₁ be the first data item in the array, and then delete the item from the array. Set j = 1 and i = 0.
3) With c_j and i row vectors of G_j determined, 0 ≤ i ≤ m, take the next data item x from the array and do one of the following.
   a) If i < m, let (x + c_j) be the (i + 1)st row vector of G_j. Increment i by one and go to step 4).
   b) If i = m, let c_{j+1} = x and delete x from the data array. Increment j by one, set i = 0, and go to step 5).
4) Delete a data item x from the data array if (x + c_j) is linearly dependent on the i row vectors of G_j. Repeat this for every x in the array.
5) Exit if there is no data item left in the array; otherwise go to step 3).

In the preceding algorithm, we assume that m > 0 and its value is known. The number of clusters K depends on m and the data items being compressed. The algorithm may be applied repeatedly on the same data set, using different values of m. We then select an m and its associated matrices G_j that yield the most desirable compromise between compression ratio and overhead storage. Of course, with a criterion defined on these two factors, the procedure can be easily automated. A code sketch of the procedure is given below.
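The following Python sketch mirrors Algorithm 2, again with l-bit items as integers and XOR as the modulo-2 addition (helper names are ours). Each sweep grows one cluster: items whose difference from the center is linearly dependent on the rows chosen so far are absorbed, up to m independent rows are collected, and everything left over seeds the next cluster.

```python
def algorithm_2(items, m):
    """Algorithm 2 (sketch): sequential clustering for multilinear
    compression.  Returns a list of (c_j, rows_of_G_j) pairs such that
    every input item lies in some coset c_j + span(G_j), len(rows) <= m."""
    array = list(dict.fromkeys(items))     # step 1: distinct, in given order
    clusters = []
    while array:
        c = array.pop(0)                   # step 2: next cluster center c_j
        rows, leftover = [], []
        for x in array:                    # steps 3)-5): one sweep of the array
            v = x ^ c
            for b in rows:                 # reduce (x + c_j) against G_j rows
                v = min(v, v ^ b)
            if v == 0:
                continue                   # step 4: dependent item, absorbed
            elif len(rows) < m:            # step 3a: new row of G_j
                rows.append(v)
                rows.sort(reverse=True)
            else:
                leftover.append(x)         # step 3b: belongs to a later cluster
        clusters.append((c, rows))
        array = leftover
    return clusters

clusters = algorithm_2([52, 49, 5, 0, 21, 23, 20], m=2)
print(f"K = {len(clusters)} clusters for m = 2")
```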


C. Analysis of Decompression Time

The multilinear method requires binary matrix multiplications during compression/decompression. Compression time is not a major factor and will be neglected, assuming fairly static files. Decompression time is more important, since decompression may be needed each time a data item is accessed.

It is difficult to compare decompression times based on high-level language decompression algorithms, because data compression techniques are developed from fundamentally different principles. Hence, for purposes of comparison, decompression time will be expressed in terms of the number of computer machine instructions that must execute in the decompression process for each method [13]. The instruction set consists of LOAD, ADD, AND, XOR, JUMP, JUMPP (jump if the data in register Ri is positive), ROTL (rotate register Ri k bits to the left), and DROTL (rotate registers Ri and Ri+1 together k bits to the left).

In the multilinear scheme, the cluster ID number attached to the compressed data items can be used as an address or index to locate the decompression matrix directly and quickly in an array or table. The original data item is recovered by modulo-2 addition (XOR) of the cluster center vector and the appropriate row vectors of the decompression matrix; in other words, the ith row vector will be included in the addition if, and only if, the ith bit of the compressed data item is 1 (a code sketch of this recovery step follows the instruction counts below). Assuming that, on the average, half of the elements of the compressed data are 1's, the average number of XOR operations is m/2. The number of memory words accessed is

$$M = \lceil (m + 1)l/w \rceil \qquad (10)$$

where w is the memory wordlength.

A decompression algorithm has been developed, assuming the decompression matrix is already in main memory. The average numbers of instructions required are

ADD:    2m + 4M times
JUMPP:  3m times
AND:    1 time
XOR:    m/2 times
ROTL:   m times
DROTL:  m + M times
JUMP:   m times
LOAD:   5 + M times.
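A Python sketch of the recovery step; the table layout is our assumption (entry j holds (c_j, rows of G_j), and bit i of the compressed item selects row i):

```python
def multilinear_decompress(cluster_id, z, tables):
    """Recover x = zG_j + c_j: start from the cluster center and XOR in
    row i of the decompression matrix iff bit i of z is 1 (on average
    m/2 XORs, as in the analysis above)."""
    c, rows = tables[cluster_id]       # ID number used as a direct index
    x = c
    for i, row in enumerate(rows):
        if (z >> i) & 1:
            x ^= row                   # modulo-2 addition of row vector i
    return x

# One cluster with l = 6, m = 2: decode all four 2-bit codewords.
tables = [(0b110100, [0b000101, 0b110001])]
for z in range(4):
    print(f"z = {z:02b}  ->  x = {multilinear_decompress(0, z, tables):06b}")
```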

III. TWO-STAGE DATA COMPRESSION

The first stage of our two-stage compression scheme is simply FLMB compression of the data items in the field of interest. At the second stage, the C/D table is compressed by the Elias code or by the difference method.

A. Overhead Storage Reduction by the Elias Method

Assume that the original uncompressed data items in the C/D table are arranged in lexical order. Let the ith data item x_i be divided into two parts

$$x_i = (f_i, f_i') \qquad (11)$$

where f_i is the first n bits of x_i and f_i' is the remaining l − n bits. The Elias method [10] compresses only the first n bits.

Let ‖f_i‖ be the integer of which f_i is the n-bit binary representation. Clearly ‖f_i‖ ≤ 2ⁿ − 1, and it is assumed for convenience that N = 2ⁿ − 1. Let u_i be the encoding of the difference ‖f_i‖ − ‖f_{i−1}‖ into a string of 0's terminated by a 1,

$$u_1 = 0^{\|f_1\|}1; \quad u_i = 0^{\|f_i\| - \|f_{i-1}\|}1, \quad i = 2, 3, \ldots, N; \quad u_{N+1} = 0^{N - \|f_N\|}. \qquad (12)$$

The string f_c = (f₁, f₂, ..., f_N) is then encoded by the concatenation of the strings in (12),

$$u_c = (u_1, u_2, \ldots, u_{N+1}). \qquad (13)$$

It is easy to see from (12) and (13) that u_c consists of N 0's and N 1's, and the nN-bit f_c is compressed into 2N bits. We note that since the f_i' vectors are not in lexical order, they cannot be compressed by a similar technique. Thus, with the Elias code, the Nl bits of the C/D table are reduced to an overhead storage of

$$S_E = N(l - n + 2) \text{ bits}. \qquad (14)$$

In the case where l is significantly greater than n, the Elias method is not very effective for reducing overhead storage. Elias was interested in general results on the storage and retrieval of files. Lower bounds were derived for storage and for the number of bits accessed during retrieval, maximized over all conceivable files having a length of Nl bits. Equation (14) is close to the storage bound, and the method may be used for any Nl-bit file. For special types of data, however, other compression methods may yield a total storage less than the bound.
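A Python sketch of (12) and (13), with hypothetical helper names; `prefixes` is the nondecreasing list ‖f₁‖, ..., ‖f_N‖ and `maxval` plays the role of 2ⁿ − 1:

```python
def elias_encode(prefixes, maxval):
    """Eq. (12)-(13): encode each gap ||f_i|| - ||f_{i-1}|| as a run of
    0's terminated by a 1, then pad with trailing 0's.  The result has
    len(prefixes) 1's and maxval 0's (2N bits when N = 2**n - 1)."""
    parts, prev = [], 0
    for v in prefixes:                     # u_1 .. u_N
        parts.append("0" * (v - prev) + "1")
        prev = v
    parts.append("0" * (maxval - prev))    # u_{N+1}: no terminating 1
    return "".join(parts)

def elias_decode(u_c, i):
    """Recover ||f_i||: the number of 0's seen before the ith 1."""
    ones = zeros = 0
    for bit in u_c:
        zeros += bit == "0"
        ones += bit == "1"
        if ones == i:
            return zeros

u_c = elias_encode([1, 1, 4, 6, 6, 7, 7], maxval=7)    # n = 3, N = 7
assert len(u_c) == 14 and elias_decode(u_c, 3) == 4
```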

The retrieval of information in the Elias method becomes, in our case, decompression. With the FLMB scheme, the compressed data items are simply the binary representations of the integers i, 0 ≤ i ≤ N − 1. It is seen from (11)-(13) that decompression is mainly the recovery of f_i from i, and ‖f_i‖ is the number of 0's accessed in u_c before encountering the ith 1. The average number of bits (0's and 1's) accessed for recovering an f_i is, therefore,

$$b = (2N + 1)/2 = N + 1/2. \qquad (15)$$

Assuming the reduced C/D table is available in main memory, the required average numbers of instructions for C/D table decompression are

ADD:    2b + ⌈b/w⌉ times
JUMPP:  2b + (N − 1)/2 times
JUMP:   b times
ROTL:   b − ⌈b/w⌉ times
LOAD:   ⌈b/w⌉ times

where w is the memory wordlength. The total number of instructions is approximately proportional to b, the average number of bits accessed.

The value of b can be reduced significantly, approaching the bit-access bound, by using directories. Directories occupy additional storage space. Since our primary interest is reducing the overhead storage, we do not discuss the details of this case here.


B. Overhead Storage Reduction by the Difference Method

If the N distinct data items x_i, i = 0, 1, ..., N − 1, in the C/D table of the FLMB scheme are similar, application of the difference method at the second stage can yield a smaller overhead storage than the Elias method. Let the N data items be divided into K clusters. This can be achieved by a minimal spanning tree technique [2], or by arranging the N data items in lexical order and selecting sequentially every N/K data items as a cluster.

Suppose that x_i and x_{i−1} belong to the same cluster. Let us define a difference vector

$$v_i = x_{i-1} + x_i \qquad (16)$$

where the addition is modulo-2 addition, or equivalently, a logical XOR operation. Since the two data items are similar, v_i consists of very few 1's and a large number of 0's. Thus, instead of v_i we may simply store the locations of the 1's in v_i, with each location represented by ⌈log₂ l⌉ bits. In addition, ⌈log₂ l⌉ bits are required to denote the number of 1's in v_i. The first data item in a cluster is stored in its original form. Let α be the average number of 1's in v_i. The overhead storage for two-stage compression with the FLMB scheme and the difference method is

$$S_D = (N - K)(\alpha + 1)\lceil \log_2 l \rceil + Kl. \qquad (17)$$

The reduction factor of overhead storage is therefore limited by l/((α + 1)⌈log₂ l⌉). This encoding is sketched in code below.
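A Python sketch of the encoding for one cluster (eq. (16)); the names are ours, and items are l-bit integers:

```python
def diff_encode(cluster, l):
    """Difference method (sketch): keep the first item verbatim; store
    each later item as the positions of the 1's in v_i = x_{i-1} + x_i.
    Each position, and the count of positions, costs ceil(log2 l) bits."""
    head, diffs = cluster[0], []
    prev = head
    for x in cluster[1:]:
        v = prev ^ x                                   # eq. (16), XOR
        diffs.append([i for i in range(l) if (v >> i) & 1])
        prev = x
    return head, diffs

def diff_decode(head, diffs):
    """Recover the cluster by re-applying each difference in turn."""
    out, x = [head], head
    for ones in diffs:
        for i in ones:
            x ^= 1 << i
        out.append(x)
    return out

cluster = [0b110100, 0b110110, 0b100110]               # similar l = 6 items
head, diffs = diff_encode(cluster, l=6)                # alpha here is 1
assert diff_decode(head, diffs) == cluster
```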

To recover x_i by decompression, all previous differences of data items belonging to the same cluster as x_i must be recovered first. Assume that each of the K clusters consists of N/K data items. The number of difference vectors that must be recovered in order to recover x_i is, on the average, (N/K − 1)/2. We also assume for convenience that the length l of the data items is not greater than the memory wordlength w. Let M be the average number of memory words accessed for decompression purposes. With α being the average number of 1's in a difference vector, we have

$$M = \lceil (\alpha + 1)(N/K - 1)\lceil \log_2 l \rceil / (2w) \rceil. \qquad (18)$$

The average number of each instruction executed in the decompression of a data item is listed below.

ADD:    (α + 2)(N/K − 1) + M times
JUMPP:  (α + 3/2)(N/K − 1) times
JUMP:   M times
DROTL:  (α + 1)(N/K − 1)/2 times
LOAD:   M times
AND:    α(N/K − 1)/2 times
XOR:    α(N/K − 1)/2 times

It is clear from (18) and the above list that decompression time can be reduced by increasing K. Increasing the number of clusters, however, causes a slight increase in overhead storage, as shown in (17).

IV. DATA COMPRESSION EXPERIMENTS

In data compression experiments the resulting compression ratio depends not only on the compression method chosen, but also on the original representation of the data items. In the following examples alphanumerical characters are represented originally by the very compact 6-bit ASCII code. The first example shows the results of applying multilinear compression. In Example 2 we compare the performances of two-stage compression techniques and multilinear compression, including a comparison of decompression time.

Example 1: Consider a set of 1000 alphanumerical course numbers from 38 academic departments. There are six letters in each course number, e.g., EEN 201, with each letter represented originally by 6 bits, and hence l = 36 bits. The results of multilinear compression are summarized in Table I. We note that in the student record files course numbers occupy a major portion of the storage space.

It should be noted that when m = 0, each of the 1000 distinct data items becomes a cluster, and we have an FLMB scheme with a compressed length of 10 bits and a C/D table size of 36 kbits. At the other extreme, m = 26 corresponds to minimal linear compression. The compressed data lengths in the table include the length of the cluster ID number. For example, with m = 15, an 18-bit compressed data item consists of a 3-bit number representing one of the 8 clusters and a 15-bit number obtained from the linear transformation. The tradeoff in performance between overhead storage and compression ratio is evident from the table.

Example 2: This example uses the same data set as Example 1, i.e., 1000 course numbers. The performances of the data compression techniques are shown in Table II, including compression ratio, overhead storage, and decompression time. The reduction factors of overhead storage are shown in parentheses.

The decompression time depends on the execution times of the instructions of the particular computer used. For a UNIVAC 1108 computer, the execution time for each instruction is 0.75 μs, except for JUMPP and DROTL, which require 1.125 and 0.875 μs, respectively [14].

All decompression times are calculated assuming that a compressed data item has already been accessed and the reduced C/D table (or decompression matrices) is available in main memory. It is noted that page swapping time is not included in the decompression time, and a small overhead storage will have less chance of page swapping.

The UNIVAC 1108 machine uses memory words of 36 bits. Thus, for our example, w = l = 36 bits. For the difference method, α = 2.886, which was determined experimentally. The difference method results in less overhead storage than the Elias method. The decompression time is small when the data set is divided into 10 or 100 clusters. It is interesting to note that the multilinear method may also be used at the second stage.

The decompression time of the Elias method is, roughly speaking, proportional to the number of bits accessed, and b ≈ 1000 according to (15). The value of b can be reduced significantly if a directory is used. With a directory for our example, b can be reduced to 81 bits; hence we estimate roughly that the decompression time can be reduced to about 0.47 ms. The directory size, however, causes an increase of overhead storage to approximately 30.5 kbits.


TABLE I
MULTILINEAR COMPRESSION OF COURSE NUMBERS, l = 36 BITS

  m   Clusters K   Compressed Length     Compression   Overhead
                   m + ⌈log₂K⌉ (bits)    Ratio         Storage
  0      1000             10                3.600      36.00 kbits
  3       228             11                3.273      32.83 kbits
  6       101             13                2.769      25.45 kbits
  9        46             15                2.400      16.56 kbits
 12        19             17                2.118       8.89 kbits
 15         8             18                2.000       4.61 kbits
 18         5             21                1.714       3.42 kbits
 21         4             23                1.565       3.17 kbits
 24         3             26                1.385       2.70 kbits
 26         1             26                1.385       0.97 kbits

TABLE II
COMPARISON OF PERFORMANCES OF DATA COMPRESSION TECHNIQUES

First Stage               Second Stage             Compression   Overhead Storage      Decompression
                                                   Ratio         (Reduction Factor)    Time*
FLMB                      none                     3.60          36.00 kbits (1.00)    negligible
FLMB                      Elias                    3.60          28.00 kbits (1.29)    5.84 ms
FLMB                      Difference, K = 1        3.60          23.33 kbits (1.54)    10.72 ms
FLMB                      Difference, K = 10       3.60          23.44 kbits (1.54)    1.06 ms
FLMB                      Difference, K = 100      3.60          24.58 kbits (1.46)    0.10 ms
Multilinear (m=6, K=101)  none                     2.79          25.45 kbits (1.41)    0.08 ms
Multilinear (m=12, K=19)  none                     2.12           8.89 kbits (4.05)    0.16 ms
Multilinear (m=15, K=8)   none                     2.00           4.61 kbits (7.81)    0.19 ms
Linear (m=26, K=1)        none                     1.38           0.97 kbits (37.11)   0.33 ms
FLMB                      Multilinear (m=15, K=8)  3.60          19.61 kbits (1.84)    0.19 ms

*Decompression times are calculated assuming the instruction set of a UNIVAC 1108 computer.

Based on the figures in Table II, if overhead storage is of primary concern and we wish to reduce it by a significant factor, the multilinear method may be used, which requires little overhead storage space at the cost of a smaller compression ratio. The decompression time for this method is short, and its values in the table are in general agreement with our experiments on decompression carried out on a UNIVAC 1108 computer.

V. CONCLUSIONS

We have investigated methods for overhead storage reduction in field-level data file compression. For N distinct data items of length l bits, the reduction of the C/D table size of the FLMB scheme by the Elias method may not be effective when l is substantially greater than ⌈log₂ N⌉. In our example, the difference method is also not very effective. With this method, the reduction factor of overhead storage is limited by l/((α + 1)⌈log₂ l⌉), where α is the average number of bits that differ in successive data items. Hence, it can be effective only when l is very large and the data items are very similar.

A multilinear compression method is proposed which can reduce the overhead storage by a significant factor. It is applicable when the data items are clustered and/or similar. The method allows a performance tradeoff between overhead storage requirement and compression ratio. Algorithms are developed, and decompression time is analyzed in some detail.


APPENDIX

A set S of l-bit data items is linearly compressible if there exists an m × l binary-valued matrix G of rank m, m < l, and an l-vector c such that

$$x = zG + c, \quad \text{for every } x \in S. \qquad (2)$$

In this Appendix, all operations are binary-field (modulo-2) operations. Theorem 1 in the following has been discussed briefly in the authors' recent paper [15], which deals primarily with approximating discrete probability distributions. The theorem plays an important role in linear and multilinear compression.

Theorem 1: A set of binary data items S is linearly compressible if, and only if, there exists a nonsingular matrix A = [F, F'] such that for every x in the set

$$(x + c)F = z \qquad (3)$$

and

$$(x + c)F' = 0. \qquad (4)$$

Proof: Suppose S is linearly compressible. A nonsingular matrix B may be constructed from G,

$$B = \begin{bmatrix} G \\ G' \end{bmatrix} \qquad (19)$$

where G' is chosen somewhat arbitrarily, subject to the constraint of nonsingularity of B. Let A = B⁻¹, and with A = [F, F'],

$$GF = I, \quad GF' = 0 \qquad (20)$$

where I is an identity matrix. Then

$$(x + c)A = [(x + c)F, (x + c)F'] = [zGF, zGF'] = [z, 0] \qquad (21)$$

where we have used (2) and (20).

Conversely, assume that there is a nonsingular matrix A = [F, F'] satisfying (3) and (4). Let B = A⁻¹ and divide B into G and G' as in (19). Using (3), (4), and the fact that AB = FG + F'G' = I, we have

$$zG = (x + c)FG = (x + c)FG + (x + c)F'G' = (x + c)(FG + F'G') = x + c$$

which completes the proof. (The middle step adds the term (x + c)F'G', which is 0 by (4).)

The matrices F and G are compression and decompression matrices, respectively, and F' and (4) may be used to check whether a data item x is recoverable from the decompression equation (2). It is noted that, given G, the matrices F and F' are not unique, due to the arbitrariness in selecting G'. Nevertheless, (3) and (4) will be satisfied, and the nonuniqueness will have no adverse effect on linear compression.
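The construction in the proof can be carried out mechanically. The Python sketch below (names ours) completes G to B = [G; G'] with unit vectors, inverts B over GF(2) by Gauss-Jordan elimination to obtain A = [F, F'], and then evaluates (3) and (4): with rows stored as integers, (x + c)A is the XOR of the rows of A selected by the bits of x + c, and equals [z, 0] exactly when x is recoverable.

```python
def gf2_invert(rows, l):
    """Invert a nonsingular l x l GF(2) matrix (row r is an int whose
    bit j is entry (r, j)) by Gauss-Jordan elimination."""
    rows, inv = rows[:], [1 << r for r in range(l)]   # start from identity
    for col in range(l):
        piv = next(r for r in range(col, l) if (rows[r] >> col) & 1)
        rows[col], rows[piv] = rows[piv], rows[col]
        inv[col], inv[piv] = inv[piv], inv[col]
        for r in range(l):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
                inv[r] ^= inv[col]
    return inv

def make_compressor(G_rows, c, l):
    """Theorem 1 (sketch): build B = [G; G'], set A = B^(-1) = [F, F'],
    and return a function x -> (z, recoverable), per eqs. (3) and (4).
    G is assumed to have rank m."""
    m, B, ech = len(G_rows), list(G_rows), []
    def reduce(v):                       # GF(2) reduction for independence
        for b in sorted(ech, reverse=True):
            v = min(v, v ^ b)
        return v
    for g in G_rows:
        ech.append(reduce(g))
    for i in range(l):                   # complete G with unit vectors (G')
        if len(B) == l:
            break
        if reduce(1 << i):
            B.append(1 << i)
            ech.append(reduce(1 << i))
    A = gf2_invert(B, l)                 # columns 0..m-1 of A form F
    def compress(x):
        y, v = 0, x ^ c
        for i, row in enumerate(A):      # y = (x + c)A = [z, 0] if x in S
            if (v >> i) & 1:
                y ^= row
        return y & ((1 << m) - 1), (y >> m) == 0
    return compress

# Round trip against the decompression equation (2), x = zG + c:
c, G = 0b110100, [0b000101, 0b110001]
compress = make_compressor(G, c, l=6)
for x in (0b110100, 0b110001, 0b000101, 0b000000):
    z, ok = compress(x)
    x_back = c
    for i, g in enumerate(G):
        if (z >> i) & 1:
            x_back ^= g
    assert ok and x_back == x
```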

A linear compression is minimal if m is the smallest integer satisfying (2). In the following, we show that Algorithm 1 in Section II yields a minimal linear compression, using group theory. We note that the 2^l x-vectors constitute an Abelian group under modulo-2 addition, and we denote the group by X. A subgroup S_x is a subset of X that satisfies the conditions of an Abelian group (i.e., closure, associativity, commutativity, an identity element, and an additive inverse for every element in S_x). Given a subgroup S_x of a group X, the coset associated with an element c ∈ X is the set of all elements c + s, where s ranges over S_x. The coset may be written as c + S_x, and c is the coset leader.

Lemma: A set of linearly compressible data items S is a subset of a coset.

Proof: Consider the set of l-vectors defined by S_x = {zG : z ranges over all distinct m-vectors}. We wish to show that S_x defined in this way is indeed a subgroup of X. Then it follows from (2) that S ⊆ c + S_x.

We first note that zG is an l-vector, and the associative and commutative properties follow directly from the definition of modulo-2 vector addition. The vector 0 ∈ S_x is the additive identity element, and the additive inverse of s ∈ S_x is s itself, since s + s = 0. To show the closure property, we note that if s₁ ∈ S_x and s₂ ∈ S_x, there exist z₁ and z₂ such that s₁ = z₁G and s₂ = z₂G. Hence, (s₁ + s₂) = (z₁ + z₂)G and (s₁ + s₂) ∈ S_x. Thus, S_x is a subgroup, and this completes the proof.

It now becomes clear that finding a linear compression scheme is equivalent to seeking a coset. The key property of cosets is that two cosets, c₁ + S_x and c₂ + S_x, are either disjoint or identical. Indeed, any member of a coset may be chosen as the coset leader. This property is utilized in the development of Algorithm 1 and Algorithm 2, for linear and multilinear compression, respectively.

Theorem 2: For a set of data items S, Algorithm 1 in Section II yields a minimal linear compression.

Proof: It is clear that Algorithm 1 yields a linear compression. Since S must be a subset of the coset of interest according to the lemma, choosing an arbitrary x ∈ S as the coset leader will have no effect on the result.

Suppose there exists a linear compression with dimension m' < m. Then there will be an m' × l matrix G* and a subgroup S* consisting of all 2^{m'} vectors in the m'-dimensional subspace spanned by the m' row vectors of G*. Now it can be seen from (2) and the definition of S_c in the algorithm that S_c ⊆ S*. With m' < m, S* cannot span an m-dimensional subspace, which contradicts the fact established in the algorithm that there are m linearly independent vectors in S_c.


REFERENCES

[1] D. Gottlieb, S. A. Hagerth, P. G. H. Lehot, and H. S. Rabinowitz, "A classification of compression methods and their usefulness for a large data processing center," in Proc. Nat. Comput. Conf., 1975, pp. 453-458.

[2] A. N. C. Kang, R. C. T. Lee, C. L. Chang, and S. K. Chang, "Storage reduction through minimal spanning trees and spanning forests," IEEE Trans. Comput., vol. C-26, pp. 425-434, May 1977.

[3] F. Rubin, "Experiments in text file compression," Commun. Ass. Comput. Mach., vol. 19, pp. 617-623, 1976.

[4] S. S. Ruth and P. J. Kreutzer, "Data compression for large business files," Datamation, pp. 62-66, 1972.

[5] B. Hahn, "A new technique for compression and storage of data," Commun. Ass. Comput. Mach., vol. 17, pp. 434-436, 1974.

[6] P. A. Alsberg, "Space and time savings through large data base compression and dynamic structuring," Proc. IEEE, vol. 63, pp. 1114-1122, Aug. 1975.

[7] R. C. T. Lee, Y. H. Chin, and S. C. Chang, "Application of principal component analysis to multikey searching," IEEE Trans. Software Eng., vol. SE-2, pp. 185-193, Sept. 1976.

[8] D. A. Huffman, "A method for the construction of minimum redundancy codes," Proc. IRE, vol. 40, pp. 1098-1101, 1952.

[9] T. S. Huang, "An upper bound on the entropy of run-length coding," IEEE Trans. Inform. Theory, vol. IT-20, pp. 675-676, Sept. 1974.

[10] P. Elias, "Efficient storage and retrieval by content and address of static files," J. Ass. Comput. Mach., vol. 21, pp. 246-260, 1974.

[11] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972.

[12] T. Y. Young and T. W. Calvert, Classification, Estimation and Pattern Recognition. New York: American Elsevier, 1974.

[13] P. S. Liu and F. J. Mowle, "Techniques of program execution with a writable control memory," IEEE Trans. Comput., vol. C-27, pp. 816-827, Sept. 1978.

[14] UNIVAC 1108 Processor and Storage, Manual.

[15] T. Y. Young and P. S. Liu, "Linear transformation of binary random vectors and its application to approximating probability distributions," IEEE Trans. Inform. Theory, vol. IT-24, pp. 152-156, Mar. 1978.

Tzay Y. Young (S'58-M'63) received the B.S. degree from National Taiwan University, Taipei, Taiwan, China, in 1955, the M.S. degree from the University of Vermont, Burlington, in 1959, and the Dr. Eng. degree from Johns Hopkins University, Baltimore, MD, in 1962, all in electrical engineering.

From 1962 to 1963 he was a Research Associate at Carlyle Barton Laboratory, Johns Hopkins University, and from 1963 to 1964 he was a member of the Technical Staff of Bell Telephone Laboratories, Murray Hill, NJ. He was on the faculty of Carnegie-Mellon University, Pittsburgh, PA, from 1964 to 1974, and was on leave at NASA Goddard Space Flight Center from 1972 to 1973. Since 1974 he has been a Professor of Electrical Engineering at the University of Miami, Coral Gables, FL. He is coauthor (with T. W. Calvert) of Classification, Estimation, and Pattern Recognition (New York: American Elsevier).

Dr. Young was an Associate Editor of the IEEE TRANSACTIONS ON COMPUTERS from 1974 to 1976, and is currently a member of the Editorial Committee of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE.

Philip S. Liu (S'70-M'75) was born in Wai Chow, China, on November 19, 1945. He received the B.S. degree in electrical engineering from the University of Wisconsin, Madison, in 1970, and the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1972 and 1975, respectively.

He joined the faculty of the University of Miami, Coral Gables, FL, in 1975, and he is currently Associate Professor of Electrical Engineering. His current research interests include database systems and computer architecture.

Dr. Liu is a member of the Association for Computing Machinery and Eta Kappa Nu.
