
Information Processing Letters 22 (1986) 201-205 North-Holland

17 April 1986

DATA COMPRESSION AND GRAY-CODE SORTING

Dana RICHARDS

Department of Computer Science, University of Virginia, Thornton Hall, Charlottesville, VA 22903, U.S.A.

Communicated by L. Kott Received 2 May 1985

Keywords: Comparison-based sorting, radix sorting, data compression, Gray code

1. Introduction

Data compression techniques have been studied for many years. Here, we are concerned with file compression when all the records of the file have the same field structure. In particular, we assume each record has $m$ fields, $f_1, f_2, \ldots, f_m$. The field $f_j$ has integer values in the range 0 to $N_j - 1$, so that for a record $X = (x_1, x_2, \ldots, x_m)$ we have $0 \le x_j \le N_j - 1$, for $1 \le j \le m$.

The compression scheme we analyze here, known as differencing, was discussed by Ernvall [3]. The idea is to arrange the records in some order and output the first record. For each successive record we output only those fields which differ from the previous record. If most successive records have many identical fields, then the output file should be much shorter than the original file. Note that there is some overhead for specifying which field is being output and for end-of-record delimiters, so that the output file could be longer. (If we were allowed to reference any record, not just the previous one, then a nonlinear structure develops and leads to the use of minimum spanning trees [2,5].)
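The differencing idea can be sketched as follows. This is a minimal illustration, not Ernvall's file format: the representation of each output record as a list of (field index, value) pairs, and the names `difference_encode` and `difference_decode`, are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Record = Tuple[int, ...]
Delta = List[Tuple[int, int]]  # (field index, new value) pairs; hypothetical format


def difference_encode(records: Sequence[Record]) -> List[Delta]:
    """Emit the first record in full, then only the fields that changed."""
    out: List[Delta] = []
    prev = None
    for rec in records:
        if prev is None:
            out.append(list(enumerate(rec)))
        else:
            out.append([(j, v) for j, (u, v) in enumerate(zip(prev, rec)) if u != v])
        prev = rec
    return out


def difference_decode(encoded: Sequence[Delta]) -> List[Record]:
    """Rebuild the records by applying each list of changes to the previous record."""
    records: List[Record] = []
    cur: List[int] = []
    for k, changes in enumerate(encoded):
        if k == 0:
            cur = [v for _, v in changes]
        else:
            for j, v in changes:
                cur[j] = v
        records.append(tuple(cur))
    return records


recs = [(0, 0, 1), (0, 1, 1), (0, 1, 0)]
enc = difference_encode(recs)
print(enc)  # [[(0, 0), (1, 0), (2, 1)], [(1, 1)], [(2, 0)]]
```

The field index stored with every value is the overhead mentioned above; with few identical fields between neighbors it can make the output longer than the input.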

The initial ordering of the records is important and Ernvall chooses a generalized Gray-code order discussed below. To motivate this, consider the case when all $N_i$ are equal to 2 and the file consists of the $2^m$ possible records, i.e., every $m$-bit sequence. If these were arranged in Gray-code order, by definition, each successive record would differ in exactly one field, giving $m + 2^m - 1$ output fields, as opposed to the $m 2^m$ of the naive method. If we had simply put the records in lexicographical order, then the number of output fields would be

$$m + 2^m \sum_{i=0}^{m-1} \frac{i}{2^i} = 2^{m+1} - m - 2,$$

which asymptotically is just twice as bad. However, in the worst case we could have an ordering such that each successive record differs in $m - 1$ fields [7], giving an unfavorable number of $2^m (m - 1) + 1$ output fields.
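The effect of the ordering can be checked numerically for small $m$. The sketch below counts only the fields that change between successive records, i.e., what is emitted after the first record; the helper names `changed_fields` and `binary_gray` are assumptions of this example.

```python
from itertools import product


def changed_fields(records):
    """Total number of fields in which each record differs from its predecessor."""
    return sum(
        sum(u != v for u, v in zip(prev, cur))
        for prev, cur in zip(records, records[1:])
    )


def binary_gray(m):
    """Binary reflected Gray code on m bits, as a list of tuples."""
    seq = [()]
    for _ in range(m):
        seq = [(0,) + s for s in seq] + [(1,) + s for s in reversed(seq)]
    return seq


m = 4
gray = binary_gray(m)
lex = list(product((0, 1), repeat=m))
print(changed_fields(gray))  # 15, i.e. 2**m - 1
print(changed_fields(lex))   # 26, i.e. 2**(m + 1) - m - 2
```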

The problem Ernvall explored is how to sort the records initially so that they are in a Gray-code order. This does not mean that each successive record differs in just one field. Instead, it means the records that do appear in the file are in the same order as they appear in a Gray-code ordering of all potential records, over the ranges specified. We present an improvement to Ernvall's algorithm, extend it to the full mixed-radix case and present another algorithm. In the next section we give definitions and some results which are interesting by themselves.

2. Preliminaries

0020-0190/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

Let $S_1$ be an ordered sequence of $n_1$ elements, denoted

$$[S_1(0), S_1(1), \ldots, S_1(n_1 - 1)].$$

Let $S_1(i)S_2(j)$ represent an ordered pair formed by concatenating the respective elements of two sequences; and $S_1(i)S_2$ is the sequence

$$[S_1(i)S_2(0), S_1(i)S_2(1), \ldots, S_1(i)S_2(n_2 - 1)].$$

Let

$$S_1 \times S_2 = [S_1(0)S_2, S_1(1)S_2, S_1(2)S_2, \ldots, S_1(n_1 - 1)S_2]$$

be a sequence of $n_1 n_2$ ordered pairs, where all the elements of one subsequence precede the next subsequence. Finally,

$$S_1 \otimes S_2 = [S_1(0)S_2, S_1(1)S_2^R, S_1(2)S_2, S_1(3)S_2^R, \ldots, S_1(n_1 - 1)S_2^{(R)}],$$

where $S_2^{(R)}$ denotes either $S_2$ or $S_2^R$, i.e., the reversal of $S_2$, depending on whether $n_1 - 1$ is even or odd, respectively. As an example, let

$$S_1 = [0, 1, 2] \quad\text{and}\quad S_2 = [A, B].$$

Then,

$$S_1 \otimes S_2 = [0A, 0B, 1B, 1A, 2A, 2B].$$

Further, $S_1 \otimes (S_2 \otimes S_1)$ is, arranged in two columns (read down the first column, then down the second):

    0A0    1A2
    0A1    1A1
    0A2    1A0
    0B2    2A0
    0B1    2A1
    0B0    2A2
    1B0    2B2
    1B1    2B1
    1B2    2B0
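The two products can be reproduced with a few lines of code. This is a sketch under the assumption that sequence elements are strings, so that concatenation of elements is string concatenation; the function names `cross` and `reflect` are introduced here for illustration only.

```python
def cross(s1, s2):
    """S1 x S2: each element of s1 followed by all of s2, always in forward order."""
    return [a + b for a in s1 for b in s2]


def reflect(s1, s2):
    """S1 (x) S2: like cross, but s2 is reversed for odd-indexed elements of s1."""
    out = []
    for k, a in enumerate(s1):
        block = s2 if k % 2 == 0 else s2[::-1]
        out.extend(a + b for b in block)
    return out


S1 = ["0", "1", "2"]
S2 = ["A", "B"]
pairs = reflect(S1, S2)
triples = reflect(S1, reflect(S2, S1))
print(pairs)    # ['0A', '0B', '1B', '1A', '2A', '2B']
print(triples)  # the 18 triples of the two-column table, column by column
```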

It is easy to derive rules for indexing into these compound sequences. We see

$$(S_1 \times S_2)(i) = S_1(i_1) S_2(i_2),$$

where $i_1 = \lfloor i/n_2 \rfloor$ and $i_2 = i \bmod n_2$. Similarly,

$$(S_1 \otimes S_2)(i) = S_1(i_1) S_2(i_2),$$

where $i_1 = \lfloor i/n_2 \rfloor$ and $i_2 = i \bmod n_2$ or $i_2 = n_2 - 1 - (i \bmod n_2)$, when $i_1$ is even or odd, respectively. For example, with $n_1 = 3$, $n_2 = 2$, and $S_1$ and $S_2$ as above,

$$(S_1 \otimes (S_2 \otimes S_1))(8) = S_1(1)\,(S_2 \otimes S_1)(2 \times 3 - 1 - 2) = S_1(1) S_2(1) S_1(3 - 1 - 0) = 1B2.$$
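The indexing rule can be traced in code for the worked example. The helper name `reflect_index` is an assumption of this sketch; it takes the index $i$ and the length $n_2$ of the right-hand sequence.

```python
def reflect_index(i, n2):
    """Split index i of S1 (x) S2 into (i1, i2), where n2 is the length of S2."""
    i1, r = divmod(i, n2)
    i2 = r if i1 % 2 == 0 else n2 - 1 - r
    return i1, i2


S1 = ["0", "1", "2"]
S2 = ["A", "B"]

# Element 8 of S1 (x) (S2 (x) S1): the inner sequence has 2 * 3 = 6 elements.
i1, inner = reflect_index(8, len(S2) * len(S1))
j1, j2 = reflect_index(inner, len(S1))
element = S1[i1] + S2[j1] + S1[j2]
print(element)  # 1B2
```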

The example of $S_1 \otimes (S_2 \otimes S_1)$ produced a sequence of ordered triples that is a Gray-code sequence, i.e., each differs from the preceding in exactly one position. We can use this same approach to generate a Gray-code sequence of all possible records with $m$ fields, as described in the preceding section. Recall $N_a$ is the number of values in field $f_a$ and let $N_a^b = N_a N_{a+1} \cdots N_b$, the number of possible subrecords with the contiguous fields $f_a, \ldots, f_b$. Let $G_a^b$ be a Gray-code sequence for those $N_a^b$ subrecords, where $G_a$ (that is, $G_a^a$) is just the ordered sequence $[0, 1, \ldots, N_a - 1]$. We want $G_1^m$. The following recursive definition is a simple generalization of the well-known binary reflected Gray code (e.g. [1,4]). Let

$$G_a^b = G_a \otimes G_{a+1}^b.$$

A simple induction argument establishes that each of the $N_a^b$ possible records appears exactly once and it is, in fact, a Gray-code sequence. To generate the lexicographically ordered sequence of the same records we use the straightforward definition $L_a^b = L_a \times L_{a+1}^b$, where $L_a = G_a$.
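The recursive definition translates directly into a generator of the full mixed-radix Gray-code sequence. In this sketch records are tuples of integers and the radices are passed as a list; the names `gray_sequence` and `lex_sequence` are assumptions of the example.

```python
from itertools import product


def gray_sequence(radices):
    """G_1^m for radices N_1..N_m, built as G_a (x) G_{a+1}^b."""
    if not radices:
        return [()]
    tail = gray_sequence(radices[1:])
    out = []
    for x in range(radices[0]):
        block = tail if x % 2 == 0 else tail[::-1]  # reverse for odd x
        out.extend((x,) + t for t in block)
    return out


def lex_sequence(radices):
    """L_1^m: the same records in lexicographical order."""
    return list(product(*(range(n) for n in radices)))


radices = [3, 2, 3]
G = gray_sequence(radices)
print(G[8])  # (1, 1, 2), the triple 1B2 of the earlier example with A = 0, B = 1
```

Every record appears once, and successive records differ in a single field, which can be confirmed by comparing `G` with `lex_sequence(radices)`.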

We wish to establish an algebraic property of these sequences. We begin with this result for arbitrary sequences $S_1$, $S_2$, $S_3$ of lengths $n_1$, $n_2$, $n_3$, respectively.

Lemma 2.1. $(S_1 \otimes S_2) \otimes S_3 = S_1 \otimes (S_2 \otimes S_3)$.

Proof. Let $i$ be the mixed-radix [6] number $(i_1 i_2 i_3)$ with radices $n_1, n_2, n_3$, i.e., $i = i_1 n_2 n_3 + i_2 n_3 + i_3$, where $0 \le i_j < n_j$ for $j = 1, 2, 3$. Using the indexing scheme above we find

$$((S_1 \otimes S_2) \otimes S_3)(i) = (S_1 \otimes S_2)(a')\,S_3(a_3) = S_1(a_1) S_2(a_2) S_3(a_3),$$

where

$$a' = \lfloor i/n_3 \rfloor = i_1 n_2 + i_2,$$

$$a_3 = \begin{cases} i \bmod n_3 = i_3, & a' \text{ even}, \\ n_3 - 1 - (i \bmod n_3) = n_3 - 1 - i_3, & a' \text{ odd}, \end{cases}$$

so

$$a_1 = \lfloor a'/n_2 \rfloor = i_1,$$

$$a_2 = \begin{cases} a' \bmod n_2 = i_2, & i_1 \text{ even}, \\ n_2 - 1 - (a' \bmod n_2) = n_2 - 1 - i_2, & i_1 \text{ odd}. \end{cases}$$

Similarly,

$$(S_1 \otimes (S_2 \otimes S_3))(i) = S_1(b_1)\,(S_2 \otimes S_3)(b') = S_1(b_1) S_2(b_2) S_3(b_3),$$

where

$$b_1 = \lfloor i/(n_2 n_3) \rfloor = i_1,$$

$$b' = \begin{cases} i \bmod n_2 n_3 = i_2 n_3 + i_3, & b_1 \text{ even}, \\ n_2 n_3 - 1 - (i \bmod n_2 n_3), & b_1 \text{ odd}, \end{cases}$$

so

$$b_2 = \lfloor b'/n_3 \rfloor = \begin{cases} i_2, & b_1 \text{ even}, \\ n_2 - 1 - i_2, & b_1 \text{ odd}, \end{cases}$$

$$b_3 = \begin{cases} b' \bmod n_3 = i_3, & b_1 \text{ even},\ b_2 \text{ even}, \\ b' \bmod n_3 = n_3 - 1 - i_3, & b_1 \text{ odd},\ b_2 \text{ even}, \\ n_3 - 1 - (b' \bmod n_3) = n_3 - 1 - i_3, & b_1 \text{ even},\ b_2 \text{ odd}, \\ n_3 - 1 - (b' \bmod n_3) = i_3, & b_1 \text{ odd},\ b_2 \text{ odd}. \end{cases}$$

By a case analysis we find, in all cases, $a_1 = b_1$, $a_2 = b_2$, and $a_3 = b_3$. □

Theorem 2.2. $G_a^c = G_a^b \otimes G_{b+1}^c$, $a \le b < c$.

Proof. A simple induction proof, on $k = c - a + 1$, can be made using the observation

$$G_a^b \otimes G_{b+1}^c = G_a^b \otimes G_{b+1} \otimes G_{b+2}^c = G_a^{b+1} \otimes G_{b+2}^c,$$

where $b + 1 < c$. □
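Theorem 2.2 can be spot-checked by building the Gray code left to right and comparing it with every two-way split. The radices below are an arbitrary choice for this sketch, and `reflect` and `gray` are illustrative names; a passing check is evidence for one instance, not a proof.

```python
def reflect(s1, s2):
    """S1 (x) S2 on sequences of tuples (concatenation is tuple concatenation)."""
    out = []
    for k, a in enumerate(s1):
        out.extend(a + b for b in (s2 if k % 2 == 0 else s2[::-1]))
    return out


def gray(radices):
    """Left-to-right definition: G_a^b = G_a (x) G_{a+1}^b."""
    seq = [()]
    for n in reversed(radices):
        seq = reflect([(x,) for x in range(n)], seq)
    return seq


radices = [3, 2, 4, 5]
full = gray(radices)
# Compare G_1^4 with G_1^b (x) G_{b+1}^4 for every split point b.
splits_agree = all(
    reflect(gray(radices[:k]), gray(radices[k:])) == full
    for k in range(1, len(radices))
)
print(splits_agree)  # True
```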

This theorem allows us to approach the Gray code by decomposing it in other ways than the typical left-to-right fashion. In a similar but simpler manner we find

$$(S_1 \times S_2) \times S_3 = S_1 \times (S_2 \times S_3)$$

and

$$L_a^c = L_a^b \times L_{b+1}^c, \quad a \le b < c.$$

In the next section we will need to know the parity of the rank of a record $X$ in the Gray-code order. The following lemma generalizes a result in [3].

Lemma 2.3. If $X = G_a^b(i) = (x_a, \ldots, x_b)$, then

$$i \equiv \sum_{j=a}^{b} x_j \pmod 2.$$

Proof. We prove the lemma by induction on $k = b - a + 1$. Recall

$$X = (G_a^{b-1} \otimes G_b)(i) = G_a^{b-1}(i_1)\,G_b(i_2),$$

where $i_1 = \lfloor i/N_b \rfloor$, and $i_2 = i \bmod N_b$ or $i_2 = N_b - 1 - (i \bmod N_b)$ as $i_1$ is even or odd. Note that $x_b = G_b(i_2) = i_2$ and

$$(i \bmod N_b) \equiv (i_1 - 1) i_2 + (i_2 - N_b + 1) i_1 \pmod 2.$$

Hence,

$$i \equiv i_1 N_b + (i \bmod N_b) \equiv i_1 - i_2 \equiv \sum_{j=a}^{b-1} x_j + x_b \pmod 2. \qquad \square$$
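Lemma 2.3 is easy to confirm exhaustively for a small instance, and the same check shows that the property fails for lexicographical order. The radices are arbitrary and the helper `gray_sequence` is an assumption of this sketch.

```python
from itertools import product


def gray_sequence(radices):
    """Mixed-radix reflected Gray code G_1^m as a list of tuples."""
    seq = [()]
    for n in reversed(radices):
        seq = [(x,) + t
               for x in range(n)
               for t in (seq if x % 2 == 0 else seq[::-1])]
    return seq


# Rank parity equals the parity of the field sum for every Gray-code record.
parity_holds = all(
    i % 2 == sum(x) % 2 for i, x in enumerate(gray_sequence([3, 4, 2, 5]))
)

# The same test in lexicographical order fails, e.g. for radices 3 and 2.
lex = list(product(range(3), range(2)))
lex_parity_holds = all(i % 2 == sum(y) % 2 for i, y in enumerate(lex))

print(parity_holds, lex_parity_holds)  # True False
```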

It is interesting that this result does not extend to the lexicographical order, i.e., for the record $Y = L_a^b(i) = (y_a, \ldots, y_b)$. It is easy to show that, in this case, $i$ is equal to the mixed-radix number $(y_a \cdots y_b)$ with the radices $N_a, N_{a+1}, \ldots, N_b$. If desired, the multiplications required to calculate $i$ can be bypassed by using a series of parity checks, in the Horner's rule fashion, to calculate the parity of $i$. Further, the parity calculation is trivial when all radices are even.

3. Sorting with generalized comparisons

Recall that our goal is to sort the file in Gray-code order, i.e., record $X$ precedes record $Y$ iff $X$ precedes $Y$ in $G_1^m$. Ernvall [3] used the approach of simply employing some known good sorting algorithm and augmenting the notion of a comparison. In particular, he proposes using a boolean function grayorder(X, Y) for every comparison step, which returns whether $X$ precedes $Y$. When $X \ne Y$, let $d$ be the least integer such that $x_{d+1} \ne y_{d+1}$ and let $\mathrm{total}_d = \sum_{j=1}^{d} x_j$. Ernvall gave the following procedure, which we have generalized to the mixed-radix case:

    grayorder(X, Y) =
        if total_d is even then
            return(x_{d+1} < y_{d+1})
        else
            return(y_{d+1} < x_{d+1})

This follows because both $X$ and $Y$ are in the same subsequence $G_1^d(i_1)\,(G_{d+1}^m)^{(R)}$, where the $(R)$ indicates reversal iff $i_1$ is odd. Recall $i_1 \equiv \mathrm{total}_d \bmod 2$. Since $G_{d+1}^m = G_{d+1}^{d+1} \otimes G_{d+2}^m$, it follows that ranking within $G_{d+1}^{d+1}$ determines the ranking in $G_{d+1}^m$ and the result follows.
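A comparison-based Gray-code sort along these lines might look as follows. This is a sketch, not Ernvall's code: the names `gray_precedes` and `gray_sort` are assumptions, records are tuples of integers, and Python's `sorted` with `functools.cmp_to_key` stands in for "some known good sorting algorithm".

```python
from functools import cmp_to_key


def gray_precedes(x, y):
    """True iff record x comes strictly before record y in Gray-code order."""
    total = 0  # sum of the leading fields on which x and y agree (total_d)
    for xj, yj in zip(x, y):
        if xj != yj:
            return xj < yj if total % 2 == 0 else yj < xj
        total += xj
    return False  # identical records


def gray_sort(records):
    """Sort with the generalized comparison; O(m) work per comparison."""
    def cmp(x, y):
        if gray_precedes(x, y):
            return -1
        return 1 if gray_precedes(y, x) else 0
    return sorted(records, key=cmp_to_key(cmp))


records = [(2, 0, 1), (0, 1, 2), (1, 1, 0), (0, 0, 2), (1, 0, 2)]
print(gray_sort(records))
# [(0, 0, 2), (0, 1, 2), (1, 1, 0), (1, 0, 2), (2, 0, 1)]
```

Note that the comparison never consults the radices; only the parity of the common prefix sum matters.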

Notice that every comparison requires $O(m)$ time. Hence, if an optimal $O(n \log n)$ comparison-based sorting algorithm is used, as Ernvall recommends, we use $O(mn \log n)$ time. However, if, during preprocessing, we calculate the rank of each record in Gray-code order, we could then simply sort according to rank. If each rank can be compared in $O(1)$ time and can be computed in $O(m)$ time, then the new algorithm would run in $O(nm + n \log n)$ time.

To compute the rank of subrecord $X$, i.e., $i$ where $X = G_a^c(i)$, we note that

$$X = G_a^b(i_1)\,G_{b+1}^c(i_2) \quad\text{and}\quad i = i_1 N_{b+1}^c + i_2',$$

where $i_2' = i_2$ or $i_2' = N_{b+1}^c - 1 - i_2$ as $i_1$ is even or odd. So, if we, perhaps recursively, calculate $i_1$ and $i_2$, we can return $i$. The simplest $O(m)$ implementation of this scheme would involve a left-to-right scan, in the typical Horner's rule fashion:

    i ← x_1
    for j ← 2 to m do
        i_2 ← if i even then x_j else N_j - 1 - x_j
        i ← i N_j + i_2
    endfor
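The left-to-right scan can be written directly, and sorting by the resulting rank then needs only integer comparisons. The name `gray_rank` and the sample records are assumptions of this sketch; starting from $i = 0$ is equivalent to initializing with $x_1$.

```python
def gray_rank(x, radices):
    """Rank i of record x in G_1^m, by a left-to-right Horner-style scan."""
    i = 0
    for xj, nj in zip(x, radices):
        digit = xj if i % 2 == 0 else nj - 1 - xj  # reflect when the prefix rank is odd
        i = i * nj + digit
    return i


radices = (3, 2, 3)
records = [(2, 0, 1), (0, 1, 2), (1, 1, 0), (0, 0, 2), (1, 0, 2)]
ranks = [gray_rank(x, radices) for x in records]
print(ranks)  # [13, 3, 6, 2, 9]

by_rank = sorted(records, key=lambda x: gray_rank(x, radices))
print(by_rank)  # [(0, 0, 2), (0, 1, 2), (1, 1, 0), (1, 0, 2), (2, 0, 1)]
```

The $O(1)$ comparison assumption holds only while ranks fit in a machine word; Python integers are unbounded, so for very large $N_1^m$ the comparisons are no longer constant time.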

However, our general decomposition approach could easily be implemented by a recursive $O(m)$ time approach, with depth $O(\log m)$ when the division is by half at each step. Further, if file compression became a bottleneck in a large file system, then it may be economical to design special-purpose hardware to compute ranks. The observations above naturally lead to an $O(\log m)$ time parallel implementation.

If some system is already designed to sort in lexicographical order, then we can achieve the same end by filtering the file before and after sorting. In particular, record $X = G_1^m(i)$ is transformed into $Y = L_1^m(i)$ before the sorting and the reverse mapping is done after sorting. Clearly, the result is in Gray-code order. Using the techniques above, all the transformations can be done in $O(nm)$ time. (Such transformations with the binary reflected Gray code are well known (see, e.g., [8]).)
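One way to realize the two filters is sketched below, assuming records are tuples and the radices are known. The names `to_lex` and `from_lex` are illustrative; both track only the parity of the running rank, in the spirit of the parity checks mentioned at the end of Section 2, rather than the rank itself.

```python
def to_lex(x, radices):
    """Map X = G_1^m(i) to Y = L_1^m(i), the record of equal rank in lexicographical order."""
    y, parity = [], 0  # parity of the rank of the prefix processed so far
    for xj, nj in zip(x, radices):
        yj = xj if parity == 0 else nj - 1 - xj
        y.append(yj)
        parity = (parity * nj + yj) % 2
    return tuple(y)


def from_lex(y, radices):
    """Inverse filter: map Y = L_1^m(i) back to X = G_1^m(i)."""
    x, parity = [], 0
    for yj, nj in zip(y, radices):
        x.append(yj if parity == 0 else nj - 1 - yj)
        parity = (parity * nj + yj) % 2
    return tuple(x)


radices = (3, 2, 3)
records = [(2, 0, 1), (0, 1, 2), (1, 1, 0), (0, 0, 2), (1, 0, 2)]
filtered = sorted(to_lex(x, radices) for x in records)   # ordinary lexicographic sort
result = [from_lex(y, radices) for y in filtered]
print(result)  # [(0, 0, 2), (0, 1, 2), (1, 1, 0), (1, 0, 2), (2, 0, 1)]
```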

4. Radix sorting

Due to the mixed-radix representation of the records, the radix-sort algorithm (e.g., [6]) naturally suggests itself. If we wished to sort the records in lexicographical order, i.e., according to $L_1^m$, then we would use the following algorithm:

    form a list L from the set of records
    for j ← m downto 1 do
        form empty lists L_0, ..., L_{N_j - 1}
        for each record X from list L do
            append X to L_{x_j}
        endfor
        form a new L by concatenating L_0, ..., L_{N_j - 1}
    endfor

Briefly, its correctness is due to the fact that, after $i$ iterations, $L$ is sorted by its final $i$ positions, i.e., with respect to $L_{m-i+1}^m$.

To produce the records in Gray-code order we note that

$$G_a^b = G_a \otimes G_{a+1}^b = [G_a(0)G_{a+1}^b, (G_a(1)G_{a+1}^b)^R, \ldots].$$

Hence, if we have sorted the records with respect to $G_{a+1}^b$ and have divided them into $N_a$ lists, we find that the odd lists need to be reversed. There are several ways to do this; perhaps the simplest is to replace the 'append' statement above with the following:

    if x_j even then
        append X to L_{x_j}
    else
        prepend X to L_{x_j}
    endif

The running times of these radix-sort algorithms are both $O(nm + \sum_{i=1}^{m} N_i)$. When the second term is small, this approach should be quite fast.
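The Gray-code variant of the radix sort can be sketched as follows. It assumes field values in the range 0 to $N_j - 1$ and uses `collections.deque` so that prepending is a constant-time operation; the function name `gray_radix_sort` is an assumption of this example.

```python
from collections import deque


def gray_radix_sort(records, radices):
    """Least-significant-field-first radix sort producing Gray-code order."""
    current = list(records)
    for j in reversed(range(len(radices))):
        buckets = [deque() for _ in range(radices[j])]
        for x in current:
            if x[j] % 2 == 0:
                buckets[x[j]].append(x)       # even lists keep the current order
            else:
                buckets[x[j]].appendleft(x)   # odd lists end up reversed
        current = [x for bucket in buckets for x in bucket]
    return current


radices = (3, 2, 3)
records = [(2, 0, 1), (0, 1, 2), (1, 1, 0), (0, 0, 2), (1, 0, 2)]
print(gray_radix_sort(records, radices))
# [(0, 0, 2), (0, 1, 2), (1, 1, 0), (1, 0, 2), (2, 0, 1)]
```

Each pass touches every record once and creates $N_j$ buckets, which matches the $O(nm + \sum_i N_i)$ bound stated above.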

References

[1] J.R. Bitner, G. Ehrlich and E.M. Reingold, Efficient generation of the binary reflected Gray code and its applications, Comm. ACM 19 (9) (1976) 517-521.

[2] J. Ernvall and O. Nevalainen, Compact storage schemes for formatted files by spanning trees, BIT 19 (1979) 463-475.

[3] J. Ernvall, On the construction of spanning paths by Gray-code in compression of files, Technique et Science Informatiques 3 (6) (1984) 411-414.

[4] E.N. Gilbert, Gray codes and paths on the n-cube, Bell System Tech. J. 37 (9) (1958) 815-826.

[5] A.N.C. Kang, R.C.T. Lee, C. Chang and S. Chang, Storage reduction through minimal spanning trees and spanning forests, IEEE Trans. Comput. C-26 (5) (1977) 425-434.

[6] E.M. Reingold, J. Nievergelt and N. Deo, Combinatorial Algorithms: Theory and Practice (Prentice-Hall, Englewood Cliffs, NJ, 1977).

[7] J. Robinson and M. Cohn, Counting sequences, IEEE Trans. Comput. C-30 (1) (1981) 17-23.

[8] H. Salzer, Gray code and the + sign sequence, Comm. ACM 16 (3) (1973) 180.
