
Hierarchical Memory with Block Transfer

Alok Aggarwal
Ashok K. Chandra
Marc Snir

IBM Thomas J. Watson Research Center, P. O. Box 218, Yorktown Heights, New York 10598.

Abstract

In this paper we introduce a model of Hierarchical Memory with Block Transfer (BT for short). It is like a random access machine, except that access to location x takes time f(x), and a block of consecutive locations can be copied from memory to memory, taking one unit of time per element after the initial access time.

We first study the model with f(x) = x^α for 0 < α < 1. A tight bound of Θ(n log log n) is shown for many simple problems: reading each input, dot product, shuffle exchange, and merging two sorted lists. The same bound holds for transposing a √n × √n matrix; we use this to compute an FFT graph in optimal Θ(n log n) time. An optimal Θ(n log n) sorting algorithm is also shown. Some additional issues considered are: maintaining data structures such as dictionaries, DAG simulation, and connections with PRAMs.

Next we study the model f(x) = x. Using techniques similar to those developed for the previous model, we show tight bounds of Θ(n log n) for the simple problems mentioned above, and provide a new technique that yields optimal lower bounds of Ω(n log²n) for sorting, computing an FFT graph, and for matrix transposition. We also obtain optimal bounds for the model f(x) = x^α with α > 1.

Finally, we study the model f(x) = log x and obtain optimal bounds of Θ(n log* n) for the simple problems mentioned above and of Θ(n log n) for sorting, computing an FFT graph, and for some permutations.

1. INTRODUCTION

1.1 Background

Large computers usually have a complex memory hierarchy consisting of a small amount of fast memory (registers) followed by increasingly larger amounts of slower memory, which may include one or two levels of cache, main memory, extended store, drums, disks, and mass store. Efficient execution of algorithms in such an environment requires some care in making sure the data are available in fast memory most of the time when they are needed. Compilers, machine architectures, and operating systems attempt to help by doing register allocation, cache management, or demand paging, but ultimately the algorithm designer can influence the performance in major ways. For example, in multiplying two large matrices A × B by the standard (row times column) dot product algorithm, one normally gets many page faults and cache misses unless one first transposes one of the matrices so that the elements in the rows of A are contiguous, as are those in the columns of B.

In general, it is important to utilize the locality of reference in a problem. This comes in two flavors: temporal and spatial. Temporal locality refers to the use of the same data several times once brought into fast memory. Spatial locality refers to using some data followed by use of neighboring ones. This is important because even in slow memory, the time to access the first word may be long, but then several words can be transferred into fast memory rapidly. Most systems use this to transfer blocks of data, e.g., lines of cache, or pages of memory.

There has been a great deal of research devoted to studying pragmatic issues in memory hierarchies [De70, Ba80, Sm86, Si83, MGS70, AC86, G74], but relatively little aimed at a basic understanding of algorithms or data structures. Examples include [K73, MC69, FP79, F72, HK81, W83, AV87], where I/O complexity is considered in a 2-level memory hierarchy. Part of the reason for the paucity of theoretical work on general hierarchies may be that no clean models have existed for this purpose. In [AACS87] the Hierarchical Memory Model (HMM) was proposed. This is like a Random Access Machine (RAM) model [AHU74], except that access to location x takes time f(x). For the standard RAM model, f(x) = 1, but in memory hierarchies f(x) = log x and f(x) = x^α are more natural, and were examined. In HMM, however, there was no concept of block transfer to utilize spatial locality of reference in algorithms. (The log-cost RAM [AHU74] is a related model where access time is logarithmic in the value stored, not in the location -- see also [S84].) As we shall see, the capability of block transfer results in a major change in the style and performance of algorithms. Indeed, fairly efficient algorithms are possible even when most of the memory is rather slow.

1.2 The Model

In this paper we propose a model, BT_f, of Hierarchical Memory with Block Transfer. It is like a RAM with memory locations 1, 2, 3, .... Access to location x takes time f(x). In addition, a contiguous block can be copied in unit time per word after the startup time. Specifically, a block copy operation [x−l, x] → [y−l, y] is allowed. It copies the contents of location x−i into location y−i, for 0 ≤ i ≤ l (and is valid if the two intervals [x−l, x] and [y−l, y] are disjoint). Its running time is f(x) + l if x > y, and is f(y) + l otherwise. Standard RAM operations (including indirect addressing) are also allowed -- such operations take unit time in addition to the memory reference times. For example, adding the contents of locations x, y and storing the result in z takes time 1 + f(x) + f(y) + f(z). The precise operations allowed on data will be specified in the problems considered, and will always be finite.
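To make the charging rule concrete, the following sketch (ours, not the paper's; the class and function names, and the choice of f, are illustrative) simulates the BT_f cost accounting for single accesses and block copies:

import math

def f_alpha(x, alpha=0.5):
    # access cost f(x) = ceil(x**alpha); memory locations are numbered from 1
    return math.ceil(x ** alpha)

class BTMachine:
    """Charges time as in the BT_f model: an access to location x costs
    f(x); a block copy [x-l, x] -> [y-l, y] costs f(max(x, y)) + l."""
    def __init__(self, size, f=f_alpha):
        self.mem = [None] * (size + 1)   # index 0 unused; locations 1..size
        self.f = f
        self.time = 0

    def read(self, x):
        self.time += self.f(x)
        return self.mem[x]

    def block_copy(self, x, y, l):
        # copy locations x-l..x onto y-l..y; the intervals must be disjoint
        assert x - l >= 1 and y - l >= 1
        assert x < y - l or y < x - l, "source and target must be disjoint"
        self.time += self.f(max(x, y)) + l
        self.mem[y - l : y + 1] = self.mem[x - l : x + 1]

Copying a block of n^α consecutive words to the front of memory then costs about f(n) + n^α = O(n^α), i.e., constant time per element -- the fact that the algorithms below exploit repeatedly.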

BT_f without block copy is the Hierarchical Memory Model HMM_f [AACS87], and with f(x) ≡ 1 it is the unit-cost RAM. In the following, we will use "block move" or "block transfer" as synonyms for "block copy." The functions f that are of particular interest include f(x) = ⌈log x⌉ and f(x) = ⌈x^α⌉ (all logarithms in this paper are base 2). We will write BT_{log x}, BT_{x^α} for these specific cases, and abbreviate these as BT_log and BT_α, respectively (to avoid confusion with the function f(x) ≡ 1, we will not use the abbreviation for α = 1). The case f(x) = ⌈log x⌉ is suggested for semiconductor memory [MC80] in view of the number of levels in a hierarchical layout and in the decode logic. Also, current computers have access times for semiconductor memories which may vary from about 10 ns to a few hundred nanoseconds for access up to, say, 2^26 words of memory. The case f(x) = x^α seems more applicable when extending this model to drums, disks, and mass store, where access times may be tens of milliseconds for 2^34 words, and seconds for 2^38 words.

The BT model is fairly clean and robust. For example, one could change the time for a block copy operation [x−l, x] → [y−l, y] to f(x) + f(y) + l or to (f(x) + l) + (f(y) + l), and the running time would change by at most a constant factor. Also, there is no concept of predetermined block lengths or boundaries. It may be noted that the program itself could be stored at the top of the memory (low numbered locations) without increasing the running time appreciably for programs of constant size. We will therefore not concern ourselves with this issue in the paper. Finally, it can also be shown that one can double the space available to an algorithm, and run two algorithms in the same memory, while multiplying the total running time by at most a constant.

In general, if T and S denote the time and space taken by an algorithm on a RAM, then this algorithm can be executed in O(T × f(S)) time and O(S) space. Time on the BT_f model is also bounded from above by time on the HMM_f model, and from below by time on a RAM.

1.3 The Rest of the Paper

In section 2, we study the BT_α model, for 0 < α < 1. It is shown that even to read an input of length n takes time Θ(n log log n), and this also suffices for various simple problems such as computing the dot product, performing a shuffle-exchange permutation, and merging two lists. As such, n log log n seems to correspond to "linear time" in this model; from this follow obvious O(n log n log log n) upper bounds for computing an FFT graph and for sorting. In section 3, we improve both these bounds to O(n log n).

In section 4, we study some data structures in the BT_α model, and also consider the problems of simulating straight-line RAM algorithms, and PRAMs.

In section 5, we consider the BT model with other access time functions. For the BT_{x^α} model with α ≥ 1, the analog of "linear time" becomes Θ(n log n) for α = 1 and Θ(n^α) for α > 1. We also obtain optimal bounds for matrix transposition, computing an FFT graph, and for sorting in the BT_{x^α} model with α ≥ 1, and we provide an interesting technique that provides lower bounds for these problems when α = 1. Finally, in section 5, we also study BT_log and obtain optimal bounds for the above mentioned problems.

Section 6 is devoted to conclusions and possibilities for future research.

2. BOUNDS FOR SIMPLE PROBLEMS IN BT_α (0 < α < 1)

Consider the following problem of reading the input, which we call the Touch Problem: n inputs a_1, ..., a_n are stored in memory locations 1, ..., n. An algorithm touches the input a_i if, at some time during the execution of this algorithm, a_i is moved to location 1 in memory. The problem is to touch all inputs. Remark: Without loss of generality, for various algorithms, we may assume that access to any location for an operation other than block copy is preceded by copying that value into location 1 and hence "touching" it. This increases the running time by at most a constant factor.

Theorem 2.1: Let 0 < c ≤ 1 be a constant. Any algorithm that touches cn inputs requires Ω(n log log n) time on the BT_α model, where 0 < α < 1.

Proof: From the remark above, it suffices to prove this theorem when only block copy operations are permitted. We say that input a_i is k-touched at step t if there is some memory location j such that a_i is stored in j and j ≤ k. Let b_i(t) be the least k such that a_i has been k-touched in one of the first t steps of the computation; define the potential at step t to be

φ(t) = Σ_{i=1}^{n} log^{(2)} b_i(t),

where log^{(2)} i = max(0, log log i). The initial potential is φ(0) = Σ_{i=1}^{n} log^{(2)} i. When the algorithm terminates at step T, at least cn inputs have been 1-touched; hence the final potential is bounded by

φ(T) ≤ Σ_{i=cn}^{n} log^{(2)} i.

Hence

φ(0) − φ(T) ≥ Σ_{i=1}^{cn} log^{(2)} i = Ω(n log log n).

We shall prove the theorem by showing that the decrease in potential due to a block copy is bounded by a constant times the block copy time.

Consider an operation that copies a block from locations p−l, ..., p to locations q−l, ..., q. This move may decrease the value of b_i if q < p, and then by at most log^{(2)} j − log^{(2)}(j − p + q), where a_i is stored in location j. Since l < q, it follows that the decrease in potential due to this move is at most

Σ_{j=p−l}^{p} (log^{(2)} j − log^{(2)}(j − p + l)) ≤ (l+1) log^{(2)} p − Σ_{j=0}^{l} log^{(2)} j.

We conclude the proof by showing that this decrease is no more than a constant times the block copy time, i.e., that it is O(l + p^α). Without loss of generality, we may assume that log^{(2)} p > 0:

(l+1) log^{(2)} p − Σ_{i=0}^{l} log^{(2)} i
  = Σ_{i=0}^{(p^α / log^{(2)} p) − 1} (log^{(2)} p − log^{(2)} i) + Σ_{i = p^α / log^{(2)} p}^{l} (log^{(2)} p − log^{(2)} i)
  ≤ p^α + l(log(1/α) + 1).

•

From Theorem 2.1, a similar Ω(n log log n) bound follows if an algorithm separates cn inputs from their predecessors. (An input a_i is separated from its predecessor a_{i−1} if at some point in the algorithm, a_i is stored in location j and a_{i−1} is not in location j−1.) This is because any such algorithm can be modified by making it use only locations that are numbered at least 2, and then preceding any block copy operation [x−l, x] → [y−l, y] by [x, x] → [1, 1] and [y+1, y+1] → [1, 1]. This causes the algorithm to take no more running time (except for a constant factor) and ensures that any input that is separated from its predecessor is also "touched."

Theorem 2.2: Any of the following computations can be performed in time O(n log log n) in the BT_α model for fixed 0 < α < 1.

(i) Touch problem.
(ii) Computing the dot product of two n-vectors.
(iii) Deterministic context-free language recognition.
(iv) Merging two sorted lists of size n each.
(v) Performing a shuffle permutation on n elements.
(vi) Odd-even exchange permutation on n elements.

Furthermore, problems (i), (ii), (iv), (v), and (vi) require Ω(n log log n) time, and for problem (iii), there are even some regular sets whose recognition takes Ω(n log log n) time.

Proof Outline: The lower bounds follow directly when Theorem 2.1 is applied to problems (i), (ii), (iv), (v), and (vi). Also, recognizing whether a string of length n is in the regular language 0*, i.e., whether a {0,1} string contains only zeroes, requires touching all inputs, so that the lower bound corresponding to problem (iii) follows. Below, we outline sketches of the upper bound proofs.

(i) The n inputs are stored in locations 1, 2, ..., n. Treat it as a sequence of roughly n^{1−α} blocks of length b = ⌈n^α⌉ each. Move a block into locations 1, 2, ..., b (unless it is already there) and solve it recursively. Continue until all blocks have been processed. The running time fulfills the recursion T(n) = n^{1−α} T(n^α) + O(n), which solves to T(n) = O(n log log n) for any constant α < 1. Note that the algorithm processes the items in the order they occur in memory. The same idea applies to (ii) and (vi).

(iii, iv, v) It can be shown that a stack (or even a deque) can be maintained in a BT_α machine at amortized cost O(log log n) per operation, for n (push, pop, top, empty) operations, and that a constant number of such structures can be maintained simultaneously. Hence, n steps of a deterministic pushdown automaton can be simulated in time O(n log log n), which implies (iii). Two lists can be merged using three stacks in O(n) stack operations; a shuffle can be performed by executing a merge of the bottom half and the top half of the list of items, with a suitable definition of comparison order.

•
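The recursive scan in the proof of (i) is easy to state operationally. The sketch below (our illustration; the cost bookkeeping follows the block copy rule of section 1.2) computes the time of the block-recursive touch and numerically exhibits the Θ(n log log n) growth:

import math

def touch_time(n, alpha=0.5):
    # T(n) = n**(1-alpha) * T(n**alpha) + O(n): split the n items into
    # blocks of length b = ceil(n**alpha), move each block to the front
    # (cost f(n) + b <= 2*ceil(n**alpha) per block), and recurse on it.
    if n <= 2:
        return n
    b = max(2, math.ceil(n ** alpha))   # b < n for all n >= 3 when alpha < 1
    num_blocks = math.ceil(n / b)
    move_cost = num_blocks * (math.ceil(n ** alpha) + b)
    return move_cost + num_blocks * touch_time(b, alpha)

for k in (10, 16, 24):
    n = 2 ** k
    print(n, touch_time(n) / (n * math.log2(math.log2(n))))  # roughly constant

There are Θ(log log n) levels of recursion, and each level contributes O(n) time in total, which is where the n log log n bound comes from.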

3. MATRIX TRANSPOSITION, COMPUTING FFT GRAPHS, AND SORTING FOR BT_α, 0 < α < 1

Computational problems can often be represented as directed acyclic graphs whose nodes correspond to the computed values and whose arcs correspond to the dependency relations. Here, the input nodes (i.e., the nodes with in-degree zero) are provided to the algorithm with given values, and a node can only be computed if all its predecessor nodes have been computed. The time taken to compute a dag is the time taken by the algorithm to compute all its output nodes (i.e., the nodes with out-degree zero).

The FFT (Fast Fourier Transform) graph is one such directed acyclic graph that is quite useful, and several problems can be solved by using an algorithm for computing the FFT graph. An FFT graph consists of log n stages where all items are accessed in each stage. This might seem to imply that Ω(n log n log log n) time is required to compute it. However, we obtain a tight bound of Θ(n log n). The lower bound follows directly since this graph contains Θ(n log n) nodes, and the upper bound is obtained by providing fast algorithms for transposing a √n × √n matrix in time O(n log log n). Because of Theorem 2.2, the algorithm for transposing a matrix is optimal within a constant factor.

For 0 < α < 1/2, a √n × √n matrix can be transposed in optimal O(n log log n) time by recursively transposing submatrices of size n^α × n^α, first moving each submatrix into faster memory.


We show that O(n log log n) time can be achieved even for 1/2 ≤ α < 1. In fact, we define a class of permutations, which we call rational permutations, and show that any rational permutation can be achieved in O(n log log n) time. A permutation is rational if it is defined by a permutation on the bits of the binary address of the items. Let [a_{m−1}, a_{m−2}, ..., a_0] be the number represented in binary notation by a_{m−1}, a_{m−2}, ..., a_0, i.e., let

[a_{m−1}, a_{m−2}, ..., a_0] = Σ_{i=0}^{m−1} a_i 2^i.

Let σ be a permutation on {m−1, ..., 0}. The rational permutation τ_σ on n = 2^m elements associated with σ permutes the element [a_{m−1}, ..., a_0] to [a_{σ(m−1)}, ..., a_{σ(0)}]. It is easy to see that a shuffle-exchange permutation, matrix transposition, and partitioning a matrix into smaller submatrices are some examples of rational permutations for appropriate values of n.
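To see what a rational permutation does operationally, the following sketch (ours) applies the bit permutation σ to the binary address of every element; with a cyclic shift of the bits it yields the shuffle, and with the σ below, which swaps the high and low halves of the address, it transposes a 4 × 4 matrix stored in row-major order:

def rational_permutation(data, sigma):
    # The element at address [a_{m-1}, ..., a_0] moves to the address
    # whose bit i is a_{sigma(i)}, exactly as in the definition above.
    m = len(sigma)
    n = 1 << m
    assert len(data) == n
    out = [None] * n
    for addr in range(n):
        bits = [(addr >> i) & 1 for i in range(m)]
        new_addr = sum(bits[sigma[i]] << i for i in range(m))
        out[new_addr] = data[addr]
    return out

# transpose a 4 x 4 row-major matrix: swap the row bits (a3, a2) and the
# column bits (a1, a0), i.e. sigma = [2, 3, 0, 1]
print(rational_permutation(list(range(16)), [2, 3, 0, 1]))

The point of Theorem 3.1 below is that any such permutation can be realized with block moves at cost O(n log log n), whereas a naive element-by-element realization would cost Θ(n^{1+α}).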

Theorem 3.1: Any rational permutation on a set of elements that are stored in the first O(n) locations in the memory of a BT_α machine can be achieved in O(n log log n) time for fixed 0 < α < 1.

Proof Outline: Let m = log n. We define a k-input-block to consist of 2^k inputs, initially stored in locations r2^k + 1, ..., (r+1)2^k for some r; these are the items with the same m − k most significant bits. Similarly, a k-output-block consists of 2^k items that occupy locations r2^k + 1, ..., (r+1)2^k when the permutation has been computed. The proof of Theorem 3.1 relies on the following observations:

• A rational permutation τ_σ where σ(i) = i for i < αm (i.e., the corresponding bit permutation permutes only the most significant m − ⌈αm⌉ bits) can be achieved in O(n) time. This is because such a permutation preserves ⌈αm⌉-input-blocks; it can be achieved by moving each ⌈αm⌉-input-block (of size approximately n^α). The cost of each move is O(n^α), i.e., constant time per element.

• Any rational permutation τ_σ where σ(i) = i unless i ∈ {k−1, k−2, ..., k − ⌊0.25(1−α)m⌋}, for any k ≥ 0.25m, can be achieved in O(n) time. This follows because we can move each αm-input-block from the first O(n) memory locations to the first O(n^α) memory locations at cost O(n^α) per block move; we can move each α²m-input-block from the first O(n^α) memory locations to the first O(n^{α²}) memory locations at cost n^{α²} per block move, and so on. Each move takes constant time per element. In 2/log(1/α) iterations, k-input-blocks are moved to the first O(2^k) memory locations. We can then permute the data within these blocks by applying the previous observation.

• Any rational permutation τ_σ where σ(i) = i unless i ≥ m/4 can be achieved in O(n) time; this is accomplished as follows. Partition the most significant ⌈0.75m⌉ bits into intervals of size ⌊0.125(1−α)m⌋ each, and denote these intervals by a_p, a_{p−1}, ..., a_1, where p = 6/(1−α). It is not hard to see that any permutation on the set {m−1, m−2, ..., ⌈0.25m⌉} can be expressed as the product of O(p²) permutations, where each of these permutations affects at most two consecutive intervals (a_i, a_{i−1}). Since each pair of intervals has at most 0.25(1−α)m bits, the corresponding rational permutation on n elements can be achieved in O(n) time by using the previous two observations. Consequently, the rational permutation in which the corresponding bit permutation permutes the most significant 0.75 log n bits can be achieved in O(p²n) time, i.e., in O(n) time, since p is a constant that depends only upon α.

Below, we use the divide-and-conquer technique and the previous observations to achieve a rational permutation on n elements. We assume that the input is in locations 2n+1, 2n+2, ..., 3n and that the output will be in 3n+1, 3n+2, ..., 4n.

Let σ be a permutation on {m−1, ..., 0}. It is not hard to see that σ can be expressed as the product of three permutations σ_1 σ_2 σ_3, where σ_1 and σ_3 are permutations on the most significant ⌈0.75m⌉ bits and σ_2 is a permutation on the least significant ⌊0.5m⌋ bits. Now, τ_σ is also the product τ_{σ_1} τ_{σ_2} τ_{σ_3}. Furthermore, τ_{σ_1} and τ_{σ_3} can be computed in O(n) time using the above observations, and τ_{σ_2} is computed recursively by first moving appropriate k = 2^{⌊m/2⌋} contiguous inputs (i.e., ⌊m/2⌋-input-blocks) into locations 2k+1, ..., 3k and then applying τ_{σ_2} to them.

The running time of this algorithm fulfills the recursion T(n) = √n × T(√n) + O(n), where T(1) = constant. Thus, it can be verified that T(n) = O(n log log n).

•

If n is any power of 4, then a √n × √n matrix can be transposed using a rational permutation. In fact, if p and q are both powers of two, then a p × q matrix can be transposed using a suitable rational permutation; below, we consider the general case of transposing any p × q matrix.

Corollary 3.2: A p × q matrix can be transposed on a BT_α machine with 0 < α < 1, in time O(pq log log pq).

Proof: Let r = 2^{⌈log p⌉}, s = 2^{⌈log q⌉}, and let A be the given p × q matrix. Then, perform the following steps:

(a) Expand A to an r × s matrix B such that B(i,j) = A(i,j) for 1 ≤ i ≤ p, 1 ≤ j ≤ q, and B(i,j) is arbitrary otherwise.
(b) Transpose B to obtain B^T.
(c) Extract A^T from B^T.

Clearly, step (b) is a rational permutation and can be computed in O(rs log log rs) time using Theorem 3.1. Since the computation of step (c) is similar to that of step (a), we only show below how to compute step (a).

Initially, the p × q matrix A is in locations 2rq + 1, ..., 2rq + pq. For any q_1 ≥ 1, to recursively expand a p × q_1 matrix A_1 that is stored in locations 2rq_1 + 1, ..., 2rq_1 + pq_1 in row-major order, let q_2 = max(1, ⌊(rq_1)^α / r⌋) and partition A_1 into ⌈q_1/q_2⌉ submatrices, each (except possibly the last submatrix) of size p × q_2; move every submatrix into locations 2rq_2 + 1, ..., 2rq_2 + pq_2; recursively expand that submatrix, and copy the result into appropriate locations in 3rq_1 + 1, ..., 4rq_1 to obtain the r × q_1 expansion of A_1 (the case q_1 = 1 and the last submatrix of A_1 are special cases that are easily handled). The running time is O(rs log log rs) since there are at most ⌈log log q / log(1/α)⌉ recursive levels, each of which takes a cumulative running time of O(rs).

•

Theorem 3.3: An n-point FFT graph can be computed in O(n log n) time on a BT_α machine for fixed 0 < α < 1.

Proof Outline: An n-point FFT graph can be computed by recursively computing 2√n FFT graphs on √n points each and two transpositions of matrices of size √n × √n. Now, we can use the transposition algorithm given in Corollary 3.2, which takes O(n log log n) time in the worst case; so, if T(n) denotes the time to compute this FFT graph, we have T(n) ≤ 2√n × T(√n) + O(n log log n) and T(1) = constant. This yields T(n) = O(n log n).

•
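A quick numeric check of this recurrence (our illustration; the constant on the O(n log log n) term is arbitrary) confirms the Θ(n log n) solution:

import math

def fft_time(n, c=1.0):
    # T(n) = 2*sqrt(n)*T(sqrt(n)) + c*n*log log n, T(small) = constant
    if n <= 4:
        return 1.0
    r = math.isqrt(n)
    return 2 * r * fft_time(r, c) + c * n * math.log2(math.log2(n))

for k in (8, 16, 32):
    n = 2 ** k
    print(n, fft_time(n) / (n * k))   # the ratio approaches a constant

Unrolling the recurrence shows why: writing S(n) = T(n)/n gives S(n) = 2S(√n) + log log n, and summing the geometric growth over the log log n levels yields S(n) = Θ(log n).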

We can sort a set S of n words in O(n log n log log n) time and O(n) memory using simple merge sort and the fact that merging can be done in O(n log log n) time on a BT_α machine (cf. Theorem 2.2 (iv)). In the following, we demonstrate an optimal O(n log n) time algorithm, Approx-Median-Sort(S), that takes O(n log log n) space.

Approx-Median-Sort(S) assumes that the elements of set S = {s_1, ..., s_n} reside in locations 2n+1, ..., 3n, and it consists of the following five steps:

Step 1: Partition S into n^{1−α} subsets, S_1, ..., S_{n^{1−α}}, of size n^α each. For 1 ≤ j ≤ n^{1−α}, execute the following substeps:

Substep 1.1: Bring the elements into locations 2n^α + 1, ..., 3n^α and sort S_j recursively by calling Approx-Median-Sort(S_j).

Substep 1.2: Form a set A containing the (i × log n)-th element of S_j for 1 ≤ i ≤ n^α / log n.

Step 2: After the completion of substep (1.2), A has n / log n elements. Furthermore, a simple analysis shows that the p-th smallest element in A is greater than at least p × log n − 1 elements in S, and this element is also greater than at most p × log n + (n^{1−α} − 1) × log n − 1 elements in S. Now, sort A by using merge sort.

Step 3: Form a set B = {b_1, ..., b_r} of r = n^α / log n approximate-partitioning elements, such that for 1 ≤ l ≤ r, b_l equals the (l × n^{1−α})-th smallest element of A. At this point, note that because of the remark made in step (2), there are at least l × n^{1−α} × log n − 1 elements of S that are less than b_l, and also at most (l+1) × n^{1−α} × log n − 1 elements of S that are less than b_l.

Step 4: Now, for all j, 1 ≤ j ≤ n^{1−α}, partition the elements of S_j into r+1 subsets, S_{j,0}, S_{j,1}, ..., S_{j,r}, such that for every x ∈ S_{j,l}, b_l ≤ x < b_{l+1} (treating b_0 as −∞ and b_{r+1} as +∞). Compute C_l = ∪_{j=1}^{n^{1−α}} S_{j,l} and let |C_l| = n_l. Then, from the remarks in step (3), note that 1 ≤ n_l ≤ 2 × n^{1−α} × log n.

Step 5: For 0 ≤ l ≤ n^α / log n, bring C_l to the faster memory and sort C_l by calling Approx-Median-Sort(C_l) recursively.
end
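Stripped of the memory-layout bookkeeping, the control flow of Approx-Median-Sort is short. The sketch below (our functional rendering on Python lists, with block size and sample stride chosen as in the steps above) sorts correctly, though it does not, of course, model the BT_α cost of the data movement:

import bisect, math, random

def approx_median_sort(s, alpha=0.5):
    n = len(s)
    if n <= 16:
        return sorted(s)
    block = max(2, round(n ** alpha))            # |S_j| ~ n**alpha
    stride = max(1, round(math.log2(n)))         # sample every (log n)-th
    # Step 1: sort each block recursively and sample it into A
    blocks = [approx_median_sort(s[i:i + block], alpha)
              for i in range(0, n, block)]
    a = sorted(x for blk in blocks
                 for x in blk[stride - 1::stride])          # Step 2
    # Step 3: ~n**alpha / log n approximate partitioning elements
    step = max(1, round(n ** (1 - alpha)))
    b = a[step - 1::step]
    # Step 4: partition every element among the buckets C_0, ..., C_r
    buckets = [[] for _ in range(len(b) + 1)]
    for blk in blocks:
        for x in blk:
            buckets[bisect.bisect_right(b, x)].append(x)
    # Step 5: sort each bucket recursively (with a guard against the
    # degenerate case where rounding leaves one bucket of size n)
    out = []
    for c in buckets:
        out.extend(sorted(c) if len(c) == n else approx_median_sort(c, alpha))
    return out

data = random.sample(range(10000), 1000)
print(approx_median_sort(data) == sorted(data))   # True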

Theorem 3.4: Algorithm Approx-Median-Sort(S) sorts n words in O(n log n) time on a BT_α machine for 0 < α < 1.

Proof Outline: The correctness of Approx-Median-Sort(S) is easy to establish and hence omitted. To analyze the running time, let T(n) denote the time taken by Approx-Median-Sort(S) to sort n elements. Then, substep (1.1) can be executed in total time n^{1−α} T(n^α) + O(n) over all S_j. Substep (1.2) can be performed by transposing an (n^α / log n) × n^{1−α} log n matrix in time O(n log log n). Merge sort for m = n / log n items in step (2) takes time O(m log m log log m) = O(n log log n). The set B is computed in step (3) from set A in time O((n log log n) / log n). Each set S_j can be brought into fast memory and merged with the set B in total time O(n^{1−α} × (n^α log log n^α)) = O(n log log n). At that point the sets S_{j,l} have been computed; they are stored in "row-major order," i.e., the sets S_{j,0}, ..., S_{j,r} are stored in contiguous memory locations, for each 1 ≤ j ≤ n^{1−α}. We need to bring together the sets S_{1,l}, ..., S_{n^{1−α},l} for each 0 ≤ l ≤ r, i.e., store the matrix of sets S_{j,l} in "column-major order." From Theorem 3.5, this permutation can be computed in O(n(log log n)^4) time and O(n log log n) space. Finally, step (5) can be executed in O(n log log n) + Σ_l T(n_l) time, where Σ_l n_l = n and n_l ≤ 2n^{1−α} log n. This implies that the running time of this algorithm obeys the following recurrence relation:

T(n) ≤ n^{1−α} T(n^α) + O(n(log log n)^4) + Σ_{l=0}^{n^α / log n} T(n_l),

where T(2) = constant, Σ_l n_l = n, and n_l ≤ 2n^{1−α} log n. It can then be verified that T(n) = O(n log n).

•

Suppose the (i,j)-th entry of a p × q matrix M consists of a set R(i,j) of data elements such that |R(i,j)| ≥ 1, Σ_{i=1}^{p} Σ_{j=1}^{q} |R(i,j)| = n, the entries of the matrix are stored in row-major order, and the data elements of each R(i,j) are stored in contiguous memory locations. Then, the problem of Generalized Matrix Transposition requires the entries R(i,j) to be stored in column-major order such that the data elements of each individual R(i,j) are still located in contiguous memory locations.

The sets C_l, for all values of l between 1 and r, in step (4) of Approx-Median-Sort(S) can be computed using a generalized matrix transposition for a matrix of size (n^α / log n) × n^{1−α}.

Theorem 3.5: The generalized matrix transposition problem can be solved in O(n(log log n)^4) time and O(n log log n) space on a BT_α machine, for any fixed α where 0 < α < 1.

Proof Outline: The proof of Theorem 3.5 is an extension of a simple O(n(log log n)^2) time algorithm for matrix transposition -- this will be provided in the final version of the paper. Roughly, one extra factor of log log n arises in the algorithm for generalized matrix transposition because we need to determine the length of the blocks to be moved, and the other factor arises because of the more complicated storage allocation needed.

•

4. MORE RESULTS FOR THE BT_α MODEL (0 < α < 1)

4.1 Maintaining Dictionaries

As mentioned in the proof of Theorem 2.2, stacks and deques can be implemented in the BT_α model in amortized time O(log log n) per (insertion, deletion) operation. It is not hard to show that searching can be performed in optimal time O(n^α) using a perfectly balanced binary search tree that is stored in memory in breadth-first order. We now consider data structures that support all three operations.

A dictionary is a data structure that supports the following three operations: search(key), insert(key, value), and delete(key); we call the last two operations update operations. For simplicity, we assume that an insert with the key of an entry that is already present in the dictionary replaces that entry; a delete for a nonexisting key has no effect. We present an efficient dictionary structure for the BT_{x^α} model, where 0 < α < 1. This structure supports searching in time O(n^α), which is optimal; updates are done in amortized time O(log n log log n).

We may assume that a dictionary is built by a sequence of update operations, starting from an empty structure. The dictionary can be considered to consist of a sequence of entries, where each entry represents one update request. An update operation adds an entry to the head of this sequence; a search operation returns the most recent entry with a matching key value. The search fails if there is no such entry, or if the most recent such entry is a delete. The sequence of entries can be reordered and compressed -- updates on different keys commute with one another, and older updates to a key can be deleted if there is a more recent update to the same key.

Theorem 4.1: A dictionary can be maintained in the BT_α model (0 < α < 1) so that the worst-case time to search for an element in the dictionary is O(n^α) after n operations, and it takes O(log n log log n) amortized time to insert or delete an element in the dictionary.

Proof: Our construction is similar to the static-to-dynamic transformation method for data structures of Bentley and Saxe [BS80]. We store the sequence as a list T_1, ..., T_k of trees, with T_1 containing updates that are more recent than those in T_2, and so on. T_i is a complete binary search tree, containing at most 2^i − 1 entries. There are no duplicate entries for the same key in a tree; however, there may be duplicate entries in distinct trees. After n operations, k = O(log n).

Tree T_i is stored between locations c2^i and (c+1)2^i − 1 for an appropriate c ≥ 2, with interspersed "scratch" space from (c+1)2^i to c2^{i+1} − 1. We will think of T_i as having 2^i − 1 nodes (by padding, if necessary). T_i is an i-level complete binary tree where the key for any node is larger than the keys for those in its left subtree and smaller than those in its right subtree. We partition T_i into subtrees, each of which is an αi-level complete binary tree T_{i,u} rooted at node u in T_i (for this exposition, we ignore the case where αi is not an integer). Each T_{i,u} is stored in contiguous locations in inorder (i.e., each T_{i,u} is sorted by keys), and T_{i,u} is stored before T_{i,v} if u is nearer to the root of T_i than v, or if they are at the same level and u is to the left of v. Thus, for a fixed i, the nodes in all subtrees T_{i,u} at the same level are contiguous and sorted -- call this set of nodes a slice. The three operations are performed as follows:

Search -- Search successively each T_i, until an entry with a matching key value is found.

Update -- Let T_0 be a one-node tree, representing the update request. Starting with i = 0, recursively perform the following: Merge T_i with T_{i+1}, and store the result in T_{i+1} (T_i remains empty); if T_{i+1} overflows, then recursively merge it with T_{i+2}.

Tree T_i can be searched in time O(2^{αi}) as follows. First make space in fast memory by copying the contents of locations 1, 2, ..., 2^{αi} − 1 in time O(2^{αi}) into scratch storage starting at location (c+1) × 2^i. Then, the first 2^{αi} − 1 locations in T_i (i.e., the root subtree of T_i) are copied into locations 1, ..., 2^{αi} in time O(2^{αi}) and searched in O(αi × 2^{α²i}) time, using αi probes each taking O(2^{α²i}) time. This determines which subtree T_{i,u} is to be searched next (unless the key has already been found), and the process is repeated at most ⌈1/α⌉ times, completing the search of T_i. Finally, we restore the original contents of locations 1, ..., 2^{αi} − 1. Thus, the total time for searching all of T_0, ..., T_{log n} is O(n^α).

Now, consider a sequence of n updates being executed on a dictionary that contains n valid entries. We claim that two trees of size O(m) can be merged in time O(m log log m); the cost per entry of a merge is O(log log m). Every time T_i is merged with T_{i+1}, each entry in T_i is moved to T_{i+1}, and this may cause one entry to disappear (if it has the same key). It follows that an entry is merged at most k = O(log n) times. The total cost of all the merges per entry is O(log n log log n).

We conclude the analysis by proving our claim on merge time. To merge T_i and T_{i+1}, first sort T_i, T_{i+1}; next, merge the sorted lists; and finally, reconstitute T_{i+1} by essentially reversing the operations in the first step. We claim that each step takes O(m log log m) time for m = 2^i. Since this bound has been shown for merging, to prove this claim, it suffices to show how to sort T_i in this time bound. But this follows easily since the ⌈1/α⌉ slices that partition T_i are already stored as sorted lists, and hence can be merged pairwise in time O(m log log m).

•
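The skeleton of this structure -- a list of levels of doubling size, merged on overflow, with newer levels shadowing older ones -- is the familiar static-to-dynamic transformation. A minimal sketch (ours, with sorted Python lists standing in for the subtree-partitioned trees T_i, and with none of the BT_α layout machinery):

from bisect import bisect_left

class BTDictionary:
    # levels[i] is a sorted list of (key, value) pairs; lower-numbered
    # levels are more recent. A pair with value None encodes a deletion.
    def __init__(self):
        self.levels = []

    def search(self, key):
        for level in self.levels:            # most recent level first
            i = bisect_left(level, (key,))
            if i < len(level) and level[i][0] == key:
                return level[i][1]           # None means "deleted"
        return None

    def _merge(self, new, old):
        # on duplicate keys keep the entry from the more recent list
        recent = {k for k, _ in new}
        return sorted(new + [(k, v) for k, v in old if k not in recent])

    def update(self, key, value):
        carry = [(key, value)]
        i = 0
        while i < len(self.levels) and self.levels[i]:
            carry = self._merge(carry, self.levels[i])   # cascade of merges
            self.levels[i] = []
            i += 1
        if i == len(self.levels):
            self.levels.append([])
        self.levels[i] = carry

d = BTDictionary()
d.update("x", 1); d.update("y", 2); d.update("x", 3); d.update("y", None)
print(d.search("x"), d.search("y"))   # 3 None

Each entry participates in O(log n) merges, mirroring the O(log n log log n) amortized update bound of the theorem; a search probes O(log n) levels, each of which the real structure probes root-subtree by root-subtree.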

It is interesting to note the similarity between the trees T_i given above and B-trees -- both can be thought of as having a large fanout and keeping siblings contiguous. However, our structure is usually more efficient for updates in BT_α, but its efficiency drops when the number of valid dictionary entries becomes much less than the number of operations n.

4.2 Computing DAGs and Simulation of Straightline Algorithms

Let G = ⟨V, E⟩ be a directed acyclic graph. We denote by V_c the set of computation nodes of G, i.e., the set of nodes with nonzero indegree. The nodes with indegree zero are the input nodes; those with outdegree zero are the output nodes (we denote the latter set by V_out). If U is a subset of nodes, let In(U) denote the set of nodes in V − U with an edge leading into U:

In(U) = {v ∈ V − U | ({v} × U) ∩ E ≠ ∅},

and let Out(U) denote the output nodes of U:

Out(U) = {v ∈ U | v ∈ V_out, or ({v} × (V − U)) ∩ E ≠ ∅}.

Define U_1, U_2 to be a cut of U if

• U_1 ∪ U_2 = U and U_1 ∩ U_2 = ∅;
• (U_2 × U_1) ∩ E = ∅.

If U_1, U_2 is a cut of U, then In(U_1) ⊆ In(U) and Out(U_2) ⊆ Out(U).

Let G be a DAG, and let U be a set of computation nodes in G. U is f(n)-separable if |U| = 1, or U has a cut U_1, U_2 such that

1. |U_1| ≤ (2/3)|U| and |U_2| ≤ (2/3)|U|;
2. |In(U_1) ∪ Out(U_1)| ≤ f(|U|) and |In(U_2) ∪ Out(U_2)| ≤ f(|U|);
3. U_1 and U_2 are recursively f(n)-separable.

G is f(n)-separable if V_c is f(n)-separable. Separation, as defined here, is closely related to separators as used for VLSI layout [L82]; we have the added restriction that all edges cut by a separator are oriented in the same direction. Let G be a dag that is f(n)-separable. We define a partition tree T for G as follows.

• The root of T is associated with V_c, the set of computation nodes of G.

• Let u be a node of T that is associated with a set V_u of nodes of G. If |V_u| = 1, then u is a leaf. Otherwise, let V_1, V_2 be a cut of V_u that fulfills conditions (1)-(3) above. Then the left child of u is associated with V_1 and the right child of u is associated with V_2.

The partition tree T has logarithmic depth; the computation nodes of G occur as leaves of T in an order that is a topological sort of G. We use the partition tree T of G to evaluate the dag. The evaluation proceeds by traversing the tree recursively. When a leaf is traversed, the associated node is evaluated; when an internal node is traversed, data are permuted in memory. For a tree node u, let In(u) (resp. Out(u)) denote the set of values associated with the nodes in In(V_u) (resp. Out(V_u)). To simplify the presentation, we assume that for each set u in the partition, |In(u)| = |Out(u)| (this can be achieved by "padding" these sets). The algorithm fulfills the following invariants:

• When the evaluation of u begins, all values in In(u) are available in the first |In(u)| locations in memory.

• When the evaluation of u terminates, all values in Out(u) are available in the first |Out(u)| locations in memory.

The algorithm is described below.

eval(u);
{ if u is a leaf
    then evaluate u, and replace In(u) by Out(u)
  else {
    let u_l and u_r be the left and right children of u;
    using In(u), compute (In(u_l), In(u_r) ∩ In(u));        --(*)
    eval(u_l);
    using (Out(u_l), In(u_r) ∩ In(u)),
      compute (In(u_r), Out(u_l) ∩ Out(u));                 --(*)
    eval(u_r)
  }
}

It is easy to see that the algorithm preserves the two invariants mentioned above. Each of the two lines labeled (*) consists of unmerging a sequence into two sequences (the input sequence is a merge of the two output sequences with duplicate keys eliminated), possibly followed by a block exchange. Such unmerging can be done in time O(n log log n). The total running time of this algorithm fulfills the following recursion:

t(n) = t(n_1) + t(n_2) + O(f(n) log log f(n)), where t(1) = 1, n_1 + n_2 = n, and n/3 ≤ n_1, n_2 ≤ 2n/3.

Note that this algorithm does not determine the partition tree T from the graph G, nor the sets In(u), Out(u) -- only the times for the data movement and computation in G are counted. We call such an algorithm a nonuniform algorithm. Below, we provide two applications of the above algorithm that computes general directed acyclic graphs.

Corollary 4.2: If a straight-line algorithm takes T(n) time on a RAM, then it can be simulated on a BT_α machine (0 < α < 1) in O(T log T log log T) time by a nonuniform algorithm.

Proof: A straight-line RAM algorithm of length T is represented by a bounded-degree dag with T nodes. Every n-node bounded-degree dag is O(n)-separable. This implies that a dag with T nodes can be evaluated in the BT_α model (0 < α < 1) in time t(T), where t(T) fulfills the recursion t(T) = t(T_1) + t(T_2) + O(T log log T), where T_1 + T_2 = T and T/3 ≤ T_1, T_2 ≤ 2T/3. This recursion yields t(T) = O(T log T log log T).

•

The best known lower bound for DAG simulation seems to be Ω(T log T), which can be derived from Corollary 5.9.

Corollary 4.3: Two n × n matrices can be multiplied by a nonuniform algorithm in O(n³) time on a BT_α machine (0 < α < 1) using (+, ×) only.

Proof: Hong and Kung [HK81] have shown that the dag representing the classical matrix multiplication algorithm (that uses + and × only) is O(n^{2/3})-separable. Consequently, we can use the above algorithm to show that two matrices can be multiplied in the BT model in time t(n³), where t(n) fulfills the recursion t(n) = t(n_1) + t(n_2) + O(n^{2/3} log log n), where n_1 + n_2 = n and n/3 ≤ n_1, n_2 ≤ 2n/3. Since this recursion yields t(n) = O(n), it follows that two matrices can be multiplied in O(n³) time, which is optimal within a constant factor.

•
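The recursion in Corollary 4.3 can be checked numerically (our illustration; the constant on the separator term is arbitrary):

import math

def t(n, c=1.0):
    # t(n) = t(n/3) + t(2n/3) + c * n**(2/3) * log log n, t(small) = 1
    if n <= 4:
        return 1.0
    extra = c * n ** (2 / 3) * math.log2(max(2.0, math.log2(n)))
    return t(n // 3, c) + t(n - n // 3, c) + extra

for n in (10**3, 10**4, 10**5):
    print(n, t(n) / n)    # the ratio settles toward a constant: t(n) = O(n)

The leaves of the recursion dominate: the separator work near the top grows only like n^{2/3} per node, so the total is linear in the number of dag nodes, which gives the O(n³) bound for the n³-node multiplication dag.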

The restriction of nonuniformity can be eliminated for matrix multiplication. Use a simple divide-and-conquer method: an n × n matrix product can be computed from 8 products of n/2 × n/2 matrices. Furthermore, a matrix with m elements can be broken into 4 submatrices by "unmerging" a list into four sublists in time O(m log log m). It follows that the running time of this algorithm fulfills the recursion t(n) = 8t(n/2) + O(n² log log n), so that t(n) = O(n³). The algorithm derived in Corollary 4.3 is essentially similar.

4.3 Simulation of PRAM Algorithms by a BT_α Machine

Theorem 4.4: Let t, s, and p denote, respectively, the time, memory space, and the number of processors required by a Concurrent Read Concurrent Write PRAM algorithm. Then, this algorithm can be simulated in time O(t × (s log log s + p log p)) on a BT_α machine (0 < α < 1).

Proof: We use a simulation method due to Awerbuch, Israeli, and Shiloach [AIS83]. The first O(p) locations in memory are used to store processor records that represent the state of each simulated processor; the next O(s) memory locations contain a list of memory records representing the content of the PRAM memory. A PRAM step is simulated as follows.

• The next instruction of each processor is simulated; if this instruction accesses memory, then an access record containing the memory address, the processor id, and the stored value for a write, is created. The time for this phase is O(p log log p).

• The access records are sorted according to their address in time O(p log p), and write conflicts are eliminated.

• The list of access records is "merged" with the list of memory records in time O((p + s) log log(p + s)) = O(s log log s + p log p). The merge is actually an update operation that modifies each of the two lists: if an access record represents a write, then the corresponding memory record is updated; if it is a read, then the value of the corresponding memory record is copied to the access record -- concurrent reads are also handled here.

• The access records are sorted by processor id in time O(p log p).

• The list of access records is "merged" with the list of processor records; the state of each processor that executed a read instruction is updated to contain the returned value.

The total time for the simulation of one PRAM step is O(s log log s + p log p).

•
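One simulated CRCW step is, in essence, two sorts and two merges over explicit records. A compact sketch (ours; Python dicts stand in for the sorted record lists, and "lowest processor id wins" is the arbitrary write rule):

def simulate_crcw_step(memory, requests):
    # memory: dict addr -> value.  requests: list of (pid, op, addr, value)
    # with op in {'read', 'write'} (value is ignored for reads).
    # Returns {pid: value} for the readers; reads see pre-step memory.
    records = sorted((addr, pid, op, val) for pid, op, addr, val in requests)
    reads, writes = {}, {}
    for addr, pid, op, val in records:       # one pass over sorted records
        if op == 'read':
            reads[pid] = memory.get(addr)    # concurrent reads all served
        elif addr not in writes:
            writes[addr] = val               # conflict elimination: first pid
    memory.update(writes)                    # the "merge" with memory records
    return reads

mem = {0: 10}
print(simulate_crcw_step(mem, [(1, 'read', 0, None), (2, 'write', 0, 99),
                               (3, 'write', 0, 77)]), mem)   # {1: 10} {0: 99}

On the BT_α machine the two dict passes become sorts of the access records (the O(p log p) terms) and list merges against the memory and processor records (the O((p+s) log log(p+s)) term).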

Corollary 4.5: The connected components of an undirected graph G with n vertices and m edges can be obtained in O(m log²n) time on a BT_α machine (0 < α < 1).

Proof: Shiloach and Vishkin [SV82] have provided a parallel algorithm that computes the connected components of a graph with n vertices and m edges in O(log n) time and O(m) memory space using n + m processors.

•

5. BOUNDS FOR OTHER MODELS

5.1 Bounds for the BT_x Model

Time bounds for various problems in the BT_x model (i.e., BT_{x^α} with α = 1) are listed in the table given below. The Θ(n log n) bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are proven using the same techniques as in Theorems 2.1 and 2.2; the O(n log²n) upper bounds for matrix transposition and computing FFT graphs can be obtained by using log n stages of the shuffle-exchange permutation, and that for sorting is obtained by using a simple merge sort algorithm. The dictionary algorithm is similar to that given in Theorem 4.1. However, unlike the upper bounds, the lower bounds for computing FFT graphs, matrix transposition, and sorting require a new technique, which is described below.

A computation is conservative if the only operation used is block moves. We have the following lower bound for conservative permutation algorithms.


Theorem 5.1: The average number of steps required to perform a randomly chosen permutation on n items using a conservative computation on a BT_x machine is Ω(n log²n).

Proof Outline: It is useful to define models of a two-level memory system (denoted by L(m)) and a specially-blocked BT_x machine, and prove Lemmas 5.2 and 5.3 for these models.

We define a two-level memory system, L(m), as one in which there are two memories -- a primary memory that consists of m words (e.g., a cache) and a secondary memory that is potentially infinite in size. The processor is connected to the primary memory. Any set of b ≤ m/2 data items that are present in contiguous locations of the secondary memory can be transferred to any b locations in the primary memory in one transfer step. Conversely, the contents of any b locations in primary memory can be copied into contiguous locations in the secondary memory. In this model, it is assumed that the processor can perform any simple operation, like reading from (or writing into) primary memory, or comparing two words in the primary memory. Here, we are only interested in the number of transfer steps that are required for solving a given problem in this model, and since Aggarwal and Vitter [AV87] have considered this model in some detail, we quote the following result from their paper:

Lemma 5.2 [AV87]: The average number of transfer steps needed to perform a random permutation on n elements using a conservative computation on an L(m) machine is Ω(min(n, (n/m) × log(n/m))).

Proof: The proof of Lemma 5.2 can be found in [AV87].

•

A specially-blocked BT_x machine is one in which the memory is partitioned into sets of contiguous locations such that the i-th set, S_i, contains the 2^i memory locations [2^i, 2^{i+1} − 1], and any algorithm for this machine transfers a block of contiguous data items only from set S_j to set S_{j−1} or from S_{j−1} to S_j, for j ≥ 1.

Lemma 5.3: If a conservative algorithm runs on a BT_x machine in time T, then this algorithm can be modified into a conservative algorithm for a specially-blocked BT_x machine that runs in time O(T).

Proof: Partition the memory of a BT_x machine M' into sets of contiguous locations like the specially-blocked machine, i.e., the i-th set, S'_i, has size 2^i. The simulating specially-blocked machine M stores the content of S'_i in the lower part of S_{i+2}; the upper part is used as a buffer. Consider a block move [x−l, x] → [y−l, y] in M', where x > y (the case x < y is treated symmetrically). The cost of this move is at least x. This move is simulated by M as follows: Let j = ⌊log(x − l)⌋. Then, the block is contained in S'_j ∪ S'_{j+1}, and hence in S_{j+2} ∪ S_{j+3}. If part of it is in S_{j+3}, then move it into the buffer of S_{j+2} to obtain a contiguous block B in M to be moved. If part of B has to go to S_{j+2} (i.e., to S'_j), copy this piece directly. Copy the rest of B into the buffer of S_{j+1}. If part of this has to go to S_{j+1}, copy it, and move the rest to the buffer of S_j, and so on. The total time for all these moves is bounded by

Σ_{i=0}^{j+3} 2^{i+1} < 2^{j+5} = O(x).

•

Proof Outline (of Theorem 5.1, continued): Now, for 0 ≤ i ≤ log n − 2, consider a two-level memory system L(m_i) such that m_i = 2^{i+1} − 1, and let T_i(m_i) denote the expected number of transfer steps that are required to achieve a permutation of n words on L(m_i). Using Lemmas 5.2 and 5.3, it follows that the expected time taken by any conservative algorithm on a BT_x machine is at least

Ω(Σ_{i=0}^{log n − 2} 2^i × T_i(m_i)).

Consequently, the expected time taken to permute n words on a BT_x machine is at least

c × Σ_{i=0}^{log n − 2} 2^i × min(n, (n/2^i) log(n/2^i)),

which is Ω(n log²n).

•

Theorem 5.4: Any algorithm that sorts n words (using only comparisons and block moves), computes an FFT graph, or transposes a √n × √n matrix takes Ω(n log²n) time on a BT_x machine.

Proof Outline: A sorting algorithm can be used to perform an arbitrary permutation by suitably fixing the outcomes of the comparisons; three FFT graphs can be cascaded to obtain a permutation network. Thus, the lower bound for these two problems follows from Theorem 5.1. The lower bound for transposing a √n × √n matrix is obtained by suitably modifying Lemma 5.2, and the modified lemma can be obtained from [AV87].

•

5.2 Bounds for the BT_α Model with α > 1

For α > 1, the time bounds for various problems in the BT_α model are listed in the table given below. The time required for simple problems such as merging, the shuffle-exchange permutation, and the touch problem is dominated by the time needed to access the farthest element (Θ(n^α)). The upper bounds for matrix transposition, computing FFT graphs, and sorting can be obtained by simple divide-and-conquer algorithms. The upper bound for matrix multiplication can be obtained by using the algorithm following Corollary 4.3. The only nontrivial bound is the lower bound for matrix multiplication (using semiring operations only), and this is given below:

Theorem 5.5: For α = 1.5, multiplying two n × n matrices using only (×, +) requires Ω(n³ log n) time on a BT_α machine.


Proof Outline: Suppose two n × n matrices are stored in the secondary memory of a two-level memory L(m) machine. Then, it can be shown that any algorithm that uses only (×, +) takes Ω(n³/m^{3/2}) transfer steps to multiply these matrices. Now, for 0 ≤ i ≤ log n − 1, consider a two-level memory system L(m_i) such that m_i = 2^{i+1} − 1, and let T_i(m_i) denote the minimum number of transfer steps that are required to multiply two matrices on L(m_i). Using Lemmas 5.2 and 5.3 (which can be modified to apply to BT_{x^α}, for α > 1, as well), it follows that the time taken by any algorithm that multiplies two matrices in the BT_{x^{1.5}} model and that uses (×, +) only is at least

Ω(Σ_{i=0}^{log n − 1} 2^{3i/2} × T_i(m_i)).

Now, the result follows by noting that T_i(m_i) = Ω(n³/2^{3i/2}) for multiplying two matrices on L(m_i).

•

5.3 Optimal Bounds for the BT_log Model

The time bounds for various problems in the BT_log model are listed in the table given below. The upper bounds for simple problems such as merging, the shuffle-exchange permutation, and the touch problem are similar to those given in Theorem 2.2, except that the data are now partitioned into blocks of size log n instead of n^α. The lower bound proof for these problems is also similar to the proof of Theorem 2.1, and the modifications are explained below. The upper bound of O(n log* n) for matrix transposition is obtained by a divide-and-conquer algorithm, and the lower bound follows because matrix transposition requires all inputs to be separated from their predecessors. The upper bounds of O(n log n) for computing FFT graphs, computing arbitrary permutations, and for sorting follow from Theorems 3.3 and 3.4. T steps of a straight-line RAM algorithm can be simulated in time O(T log T log* T) by adapting the algorithm given in section 4.2. A dictionary can be maintained with search time O(log³n / log log n) and with amortized update time O(log n log* n) using a structure similar to that in section 4.1.

Another dictionary structure is obtained using a B-tree [AHU83] where buckets have size Θ(log n). Each bucket is organized as a balanced search tree stored in contiguous memory locations. A dictionary operation requires O(log n / log log n) accesses to buckets. Each bucket can be moved to fast memory in time O(log n). The operations performed on these buckets are search, insert, delete, merge, and split. Each of these operations can be performed (for merge and split this is time amortized over n updates) in time O(log² m) = O((log log n)²), where m = Θ(log n) is the bucket size. Thus, each dictionary operation can be executed in time O(log²n / log log n). The bucket size has to be updated when the number of entries grows beyond n² or decreases beneath √n; this updating can be done in amortized time O(log²n / log log n) per entry. Therefore, the structure supports search in time O(log²n / log log n) and inserts and deletes in amortized time O(log²n / log log n). The searching time can be proved to be optimal within a multiplicative constant.

Theorem 5.6: Let 0 < c ≤ 1 be a constant. Any algorithm that touches cn inputs has time complexity Ω(n log* n) in the BT_log model.

Proof: As in Theorem 2.1, define the potential at step t as

φ(t) = Σ_{i=1}^{n} log* b_i(t),

which implies that

φ(0) − φ(T) ≥ Σ_{i=1}^{cn} log* i = Ω(n log* n).

Now, if an operation copies a block from locations p−l, ..., p to locations q−l, ..., q, then the potential may decrease if q < p. The time for this move is log p + l, and following Theorem 2.1, we have that

Δφ ≤ (l+1) log* p − Σ_{i=0}^{l} log* i.

Consequently, we conclude the proof by showing that

(l+1) log* p − Σ_{i=0}^{l} log* i ≤ 2(l + log p).

But

(l+1) log* p − Σ_{i=0}^{l} log* i
  = Σ_{i=0}^{log log p} (log* p − log* i) + Σ_{i = log log p + 1}^{l} (log* p − log* i)
  ≤ log log p × log* p + 2(l − log log p).

The first term is bounded by 2 log p, and the second term is bounded by 2l.

•

Note that there are many permutations, even permutations involving all inputs, that can be computed in linear time. For example, it is possible to exchange the first n/2 inputs with the last n/2 inputs in linear time; this permutation moves all elements and has Θ(n²) inversions. On the other hand, there are permutations that require Ω(n log n) time to compute. We prove their existence using a nonconstructive counting argument; we have no explicit example of such a "hard" permutation.
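For instance, the half-exchange can be realized with three block copies through scratch space just above location n. A sketch of the cost arithmetic (copy_cost renders the model's charge of ⌈log₂ max(x, y)⌉ + length for a copy between blocks ending at x and y):

    import math

    def copy_cost(src_end, dst_end, length):
        return math.ceil(math.log2(max(src_end, dst_end))) + length

    def swap_halves_cost(n):
        # Exchange locations 1..n/2 with n/2+1..n via scratch at n+1..n+n/2.
        h = n // 2
        return (copy_cost(h, n + h, h)      # first half -> scratch
                + copy_cost(n, h, h)        # second half -> first half
                + copy_cost(n + h, n, h))   # scratch -> second half

    for n in [2**10, 2**20]:
        print(n, swap_halves_cost(n))  # about 1.5*n + O(log n): linear time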

Theorem 5.7: In a BT_log machine, the number of distinct sequences of block copy operations with a total time t is at most 9^t.

Proof: Let C = C_1, C_2, …, C_k be a sequence of block copy operations C_i = ([x_i − l_i, x_i] → [y_i − l_i, y_i]), with 0 ≤ l_i < x_i, y_i and [x_i − l_i, x_i] ∩ [y_i − l_i, y_i] = ∅; its running time is t(C) = Σ_{i=1}^{k} t(C_i), where t(C_i) = ⌈log max(x_i, y_i)⌉ + l_i. We will give a one-to-one map from such sequences into sequences r = r_1, r_2, …, r_{3k} of integers r_i ≥ 2 with a cost function cost(r) = Σ_{i=1}^{3k} cost(r_i), cost(r_i) = ⌈log r_i⌉, and having 2t(C) ≥ cost(r). The result then follows from the claim that |{r : cost(r) = t}| = 3^{t−1}, since the number of sequences of total time t is then at most Σ_{s=1}^{2t} 3^{s−1} < 9^t.

The one-to-one map from C to r is induced by a one-to-one function h that maps C_i into (r_{3i−2}, r_{3i−1}, r_{3i}). We abbreviate h([x − l, x] → [y − l, y]) by h(l, x, y), and define h as follows:

(a) h(l, x, y) = (l, x, y) if l ≥ 2.
(b) h(0, x, y) = (x, y, y) if x, y ≥ 2.
(c) h(1, x, y) = (x, y, x).
(d) h(0, 1, y) = (y, y + 1, y).
(e) h(0, x, 1) = (x + 1, x, x + 1).

It can be seen that h is one-to-one, since l < x, y; x ≠ y; and if l = 1 then x is neither y − 1 nor y nor y + 1. It can also be checked that, for all i, 2t(C_i) ≥ cost(h(C_i)).

Now, to establish the claim that there are 3^{t−1} sequences r with cost(r) = t, let A_t = |{r : cost(r) = t}|, and observe that a sequence with cost t is obtained either from a sequence with cost t − 1 by appending a 2, or from a sequence with cost t − 2 by appending a 3 or a 4, and so on. This yields the recurrence A_t = A_{t−1} + 2A_{t−2} + ⋯ + 2^{t−1}A_0, where A_0 = 1. It solves to A_t = 3^{t−1} for t ≥ 1. ■
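Both the injectivity of h and the closed form for A_t can be checked mechanically. A brute-force sketch (disjointness of the source and destination blocks is rendered as |x − y| > l):

    def h(l, x, y):
        # The proof's encoding of a block copy ([x-l, x] -> [y-l, y]) as a triple.
        if l >= 2:
            return (l, x, y)
        if l == 1:
            return (x, y, x)
        if x == 1:
            return (y, y + 1, y)
        if y == 1:
            return (x + 1, x, x + 1)
        return (x, y, y)  # l == 0 and x, y >= 2

    seen = {}
    for l in range(0, 6):
        for x in range(l + 1, 20):
            for y in range(l + 1, 20):
                if abs(x - y) > l:  # disjoint blocks
                    t = h(l, x, y)
                    assert seen.setdefault(t, (l, x, y)) == (l, x, y)  # injective

    # Check the recurrence A_t = sum_{j=1}^{t} 2^{j-1} A_{t-j}, A_0 = 1,
    # against the closed form A_t = 3^{t-1}.
    A = [1]
    for t in range(1, 12):
        A.append(sum(2**(j - 1) * A[t - j] for j in range(1, t + 1)))
    assert all(A[t] == 3**(t - 1) for t in range(1, 12))
    print(A[:6])  # [1, 1, 3, 9, 27, 81]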

TABLE. Summary of the time bounds for the various models (n is the input size; for matrix multiplication, the matrices are n × n).

| Problem | RAM (= BT_1) | BT_log | BT_{x^α} (0 < α < 1) | BT_x (α = 1) | BT_{x^α} (α > 1) |
| --- | --- | --- | --- | --- | --- |
| Simple problems (e.g. touch) | Θ(n) | Θ(n log*n) | Θ(n log log n) | Θ(n log n) | Θ(n^α) |
| Matrix multiplication (×, +) | Θ(n³) | Θ(n³) | Θ(n³) | Θ(n³) | Θ(n³) if α < 1.5; Θ(n³ log n) if α = 1.5; Θ(n^{2α}) if α > 1.5 |
| Arbitrary permutations | Θ(n) | Θ(n log n) | Θ(n log n) | Θ(n (log n)²) | Θ(n^α) |
| Matrix transpose, rational permutations | Θ(n) | Θ(n log*n) | Θ(n log log n) | Θ(n (log n)²) | Θ(n^α) |
| FFT graph | Θ(n log n) | Θ(n log n) | Θ(n log n) | Θ(n (log n)²) | Θ(n^α) |
| Sorting | Θ(n log n) | Θ(n log n) | Θ(n log n) | Θ(n (log n)²) | Θ(n^α) |
| Searching | Θ(log n) | Θ(log²n / log log n) | Θ(n^α) | Θ(n) | Θ(n^α) |
| Insert & delete (amortized in the BT models) | Θ(log n) | Θ(log²n / log log n) | O(log n log log n) | O((log n)²) | Θ(n^{α−1}) |



Theorem 5.8: In a BT_log machine, there is a constant c such that the number of distinct sequences of operations (block copy or otherwise) with a total time ≤ t is at most 2^{ct}.

Proof: The proof follows as a direct extension of the above proof. The constant c depends on the number of different kinds of operations allowed in the BT_log machine. ■

Corollary 5.9: The expected time for achieving a random permutation of n elements on a BT_log machine is Ω(n log n).

Proof: There are n! = 2^{n log n (1 − o(1))} permutations of n elements, and by Theorem 5.8 at most 2^{ct} distinct computations of total time ≤ t; a simple counting argument thus yields a lower bound of (1/c) · n log n · (1 − o(1)), where c is the constant in Theorem 5.8. ■
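The arithmetic behind this counting argument, as a sketch (the value c = 4 is an arbitrary illustrative constant; math.lgamma supplies log n!):

    import math

    def min_time(n, c):
        # 2^{c t} computations must cover all n! permutations, so
        # t >= log2(n!) / c = (n log2 n)(1 - o(1)) / c.
        log2_fact = math.lgamma(n + 1) / math.log(2)  # log2(n!)
        return log2_fact / c

    for n in [2**10, 2**20]:
        print(n, min_time(n, c=4) / (n * math.log2(n)))  # ratio tends to 1/c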

The Corollary applies even to non-uniform algorithms that perform block transfers or a finite set of other arbitrarily powerful operations, and in particular to conservative algorithms. It requires, however, that the permutation to be achieved is not given as an additional input. It may be noted that the lower bound results of Theorem 5.8 and Corollary 5.9 also apply to any BT_f machine where f(x) = Ω(log x) (provided f(x) = 0 for at most one value of x) and, in particular, they apply to BT_{x^α} for any α > 0.

6. CONCLUSIONS

In this paper we have introduced a model for hierarchical memory with block transfer. It is relatively clean and robust, but nevertheless appears to mimic the behavior of real machines. Good algorithms in this model utilize both temporal and spatial locality of reference. The theory for this model appears to be rich and deep.

There are a large number of possibilities for future research; some are specific technical problems, others are more general issues.

In the BT_{x^α} model, there is a sorting algorithm that takes O(n log n log log n) time and O(n) space, and another algorithm that takes O(n log n) time and O(n log log n) space. Is it possible to obtain an O(n log n) algorithm that takes only O(n) space? And are there problems for which there are space-time tradeoffs?

Permuting data is at the heart of many BT algorithms, and it would be nice to understand better the complexity of permutations. For example, even in the BT_log model, we showed that most permutations require at least Ω(n log n) time. Can one demonstrate such a permutation explicitly, even for conservative algorithms?

There are numerous other areas, such as data structures and graph problems, that need to be analyzed. As an example, is it possible to maintain directories in BT_log with O(log²n / log log n) search time and O(log n log*n) amortized update time? Can the directory algorithms for BT_{x^α} be improved? And it seems important to study good memory management algorithms.


One can also consider several extensions of the BT model. For example, one could study other access functions, such as step functions that may better reflect the physical situation of the memory hierarchy in real machines. Another possibility is that the transfer time for blocks could be changed from one unit per word to some function g(x) per word for copying from (or to) location x. Yet another aspect that is significant for real memory hierarchies is parallelism in block transfer (several reads from different disks may take place simultaneously; transfers at different levels may proceed concurrently and can be overlapped with processing). Traditionally, this represented one of the early uses of parallelism in computers, and from a theoretical point of view it would seem necessary if we are to have data structures with good worst-case performance. Finally, in view of the importance of the memory organization in multiprocessor machines, some appropriate model for this would be nice. In any case, it is desirable that model extensions remain clean and robust.

This paper can be thought of as a step in developing a theory of computation that is aimed at data movement as opposed to data modification. It is too early to tell whether such a memory/communication-oriented (versus CPU-oriented) theory of computation will have any influence on pragmatic algorithms, machine architecture, memory management, or language design.

Some generalities are beginning to emerge. It appears that some problems (FFT, sorting, matrix multiplication) are very well behaved, in that their running time (on BT or HMM using different access functions) usually equals the maximum of the RAM time and the time to read (i.e., touch) the inputs in the hierarchical memory (there seems to be a slight penalty when the RAM time and the touch time are roughly equal). This is about the best that good locality of reference could provide. Other problems (e.g., DAG simulation) appear not to behave in this manner. We do not understand what characterizes such behaviour.





Acknowledgements: The authors wish to thank Michael Fischer for several useful suggestions.

References

[AC86] R. C. Agarwal and J. W. Cooley, "Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility," IBM J. of Research and Development, March 1986, 145-162.

[AV87] A. Aggarwal and J. Vitter, "The I/O Complexity of Sorting and Related Problems," Tech. Report, Dept. of Computer Science, Brown University, August 1987. Some results of this report can be found in Proc. of the 14th Int. Coll. on Automata, Languages and Programming, Karlsruhe, West Germany, July 1987, 467-478.

[AACS87] A. Aggarwal, B. Alpern, A. K. Chandra and M. Snir, "A Model for Hierarchical Memory," Proc. of the 19th Annual ACM Symposium on the Theory of Computing, New York, 1987, 305-314.

[AHU74] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.

[AHU83] A. V. Aho, J. E. Hopcroft and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, Mass., 1983.

[AIS83] B. Awerbuch, A. Israeli and Y. Shiloach, "Efficient Simulation of PRAM by Ultracomputer," Technical Report 120, IBM Scientific Center, Haifa, Israel, May 1983.

[Ba80] J. L. Baer, Computer Systems Architecture, Computer Science Press, Potomac, MD, 1980.

[BS80] J. L. Bentley and J. B. Saxe, "Decomposable Searching Problems I: Static-to-Dynamic Transformations," J. of Algorithms, Dec. 1980, 301-358.

[De70] P. J. Denning, "Virtual Memory," ACM Computing Surveys, Sept. 1970, 153-189.

[F72] R. W. Floyd, "Permuting Information in Idealized Two-Level Storage," in R. E. Miller and J. W. Thatcher (editors), Complexity of Computer Computations, Plenum Press, New York, 1972, 105-109.

[FP79] P. C. Fischer and R. L. Probert, "Storage Reorganization Techniques for Matrix Computation in a Paging Environment," Comm. ACM, Vol. 22, No. 7, July 1979, 405-415.

[G74] J. Gecsei, "Determining Hit Ratios in Multilevel Hierarchies," IBM J. of Research and Development, July 1974, 316-327.

[HK81] J. W. Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. of the 13th Annual ACM Symposium on Theory of Computing, Oct. 1981, 326-333.

[K73] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, § 5.4.9, Addison-Wesley, Reading, Mass., 1973.

[L82] F. T. Leighton, "A Layout Strategy for VLSI Which Is Provably Good," Proc. of the 14th Annual ACM Symposium on Theory of Computing, Oct. 1982, 85-98.

[MGS70] R. L. Mattson, J. Gecsei, D. R. Slutz and I. L. Traiger, "Evaluation Techniques for Storage Hierarchies," IBM Systems Journal, 1970, 78-117.

[MC69] A. C. McKellar and E. G. Coffman, Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Comm. ACM, Vol. 12, No. 3, March 1969, 153-165.

[MC80] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980, p. 316.

[S84] A. Schönhage, "A Nonlinear Lower Bound for Random Access Machines Under Logarithmic Cost," Technical Report RJ 4527, IBM Almaden Research Laboratory, May 1984.

[Si83] G. M. Silberman, "Delayed-Staging Hierarchy Optimization," IEEE Trans. on Computers, TC-32, Nov. 1983, 1029-1037.

[SV82] Y. Shiloach and U. Vishkin, "An O(log n) Parallel Connectivity Algorithm," J. of Algorithms, 1982, 57-67.

[Sm86] A. J. Smith, "Bibliography and Readings on CPU Cache Memories and Related Topics," Comp. Arch. News, Jan. 1986, 22-42.

[T83] R. E. Tarjan, Data Structures and Network Algorithms, SIAM, Philadelphia, Penn., 1983.

[W83] C. K. Wong, Algorithmic Studies in Mass Storage Systems, Computer Science Press, Rockville, MD, 1983.