[ieee 1994 international conference on parallel processing-vol 1 (icpp'94) - north carolina...

1994 International Conference on Parallel Processing

PSIM: PERIODICALLY SHIFTED INTERLEAVED MEMORY SYSTEM *

Jae Young Lee and Hee Yong Youn

Department of Computer Science Engineering

The University of Texas at Arlington

Arlington, Texas 76019-0015


Abstract - Interleaved memory schemes have been used to increase the memory bandwidth. In this paper, we propose a new interleaved memory scheme called PSIM (Periodically Shifted Interleaved Memory), which allows high memory bandwidth at the cost of a very small amount of memory space. It is a stride dependent scheme in which the processor utilization is 100% for most strides using a power-of-two number of memory modules. We show, analytically and then by simulation, that the efficiency of PSIM is consistently higher than that of not only earlier stride dependent schemes but also stride independent schemes for both short and long vector accesses.

1 Introduction

Memory usually cannot match the rate at which a processor issues data requests, and it has been the bottleneck for general computer systems in achieving high performance. One efficient solution to this problem is the interleaved memory scheme. In general, interleaved memory systems can be categorized into three different schemes: stride dependent, stride independent, and dynamic. The stride dependent scheme [1-3,9] allows the full bandwidth for all strides except some strides called black strides. The main idea of the stride independent schemes [4-7,10] is to map addresses in a pseudo-random fashion so that the address sequence of a specific stride is not mapped to the same module. The dynamic storage scheme [8] employs several different mapping schemes, where each scheme is chosen at compile time depending on the access pattern of the program.

The stride dependent scheme is expected to outperform the stride independent scheme if the number of black strides is minimized. The performance can also be analytically modeled and evaluated. Among the stride dependent schemes developed so far, the prime memory system [2] has the smallest number of black strides. However, since the number of modules, N, is a prime number, the waste of address space is inevitable. Moreover, the number of black strides is still large, since every multiple of N is a black stride.

In this paper, we propose a new interleaved memory scheme, called PSIM (Periodically Shifted Interleaved Memory), which allows high performance for any pattern of strides even though it is a stride dependent scheme and N is a power of two. The performance is higher than that of the prime memory system as well as efficient stride independent schemes for both short and long vector accesses. This property is achieved by properly shifting the mapped addresses and sacrificing a very small amount of memory space, on the order of at most a few percent for practical size memory systems. The complexity of the mapping function is comparable to that of the prime memory system.

2 Proposed PSIM Scheme

The memory system model usually employed is shown in Figure 1. The following notational conventions are used in further discussions. An address sequence with a stride $s$ is denoted as $A_s = (a_0, a_1, \ldots, a_i, \ldots)$. A module sequence for $A_s$ is represented by $M_s = (m_0, m_1, \ldots, m_i, \ldots)$, where $m_i = f(a_i)$ under a memory mapping function $f$. Thus, $m_i \in \{0, 1, \ldots, N-1\}$, where $N$ is the number of memory modules. $M_1$ is called the primary module sequence. An example of mapping with PSIM is illustrated in Figure 2. The entry marked with * is an unused memory space, which we will discuss later. The mapping is similar to the standard interleaving scheme except that elements in some periodically distributed rows are shifted right by one position. The period is called the degree of irregularity, and is denoted by $\delta$. In Figure 2, $\delta = 1$.

Proceedings of the International Conference on Parallel Processing (ICPP'94) 0-8493-2493-9/0-6 $20.00 © 1994

The basic address mapping function of PSIM is defined as

$$f(a_i) = (a_i \bmod p) \bmod N,$$

where $p = \delta N - 1$. In addition, the module indices in every $\delta$-th row are incremented by one for the shift right. The mapping can be easily implemented in hardware, as shown later.

Example 2.1: For the address sequence $A_1 = (0, 1, 2, \ldots)$, the primary module sequence $M_1$ is (0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 0, 1, ...) when $\delta = 2$ and $N = 4$. Note that the same sequence is repeated with a period of $p = \delta N - 1 = 7$. The repeated sequence is called the basic sequence.
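The mapping of Example 2.1 can be sketched in a few lines of Python. This is only an illustration, under the assumption that the basic mapping is (a mod p) mod N and that the shifted row is the last row (row δ - 1) of each period of p addresses. The function name `psim_module` and its parameter names are hypothetical and do not come from the paper.

```python
def psim_module(a: int, n_modules: int, delta: int) -> int:
    """Module index of address a (hypothetical helper, not from the paper).

    Assumes the basic mapping (a mod p) mod N with p = delta * N - 1, and
    that the last row of each period (row delta - 1) is shifted right by one.
    """
    p = delta * n_modules - 1
    row, col = divmod(a % p, n_modules)
    return col + 1 if row == delta - 1 else col


# Primary module sequence M_1 for delta = 2, N = 4 (Example 2.1).
m1 = [psim_module(a, 4, 2) for a in range(16)]
print(m1)  # [0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 0, 1]
```

Under these assumptions the output agrees with the primary module sequence listed in Example 2.1.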

Among the many mapping functions obtainable by varying $\delta$, we are interested in the class where $\delta N - 1$ is a prime number. This condition provides PSIM with some interesting properties which eventually allow a high performance. The following theorems identify the basic properties of PSIM, which will be used to evaluate the performance. In what follows, we assume that $s$ satisfies $\gcd(s, \delta N - 1) = 1$ unless otherwise indicated. The proofs are not given here due to the space limitation; interested readers can refer to [11].

Theorem 2.1: The period, $p$, of a module sequence $M_s$ ($s > 1$) is $\delta N - 1$.

Example 2.2: For the same case as Example 2.1, where the period of the sequence is 7, $M_2 = (0, 2, 1, 3, 1, 3, 2, 0, 2, \ldots)$. The period is also 7.
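Theorem 2.1 and Example 2.2 can be checked numerically for small cases. The sketch below assumes that the vector starts at address 0 and that the mapping is (a mod p) mod N with the last row of each period shifted right. The names `psim_module`, `module_sequence`, and `minimal_period` are illustrative, not the paper's.

```python
from math import gcd


def psim_module(a, n_modules, delta):
    """Assumed PSIM mapping: (a mod p) mod N, last row of a period shifted."""
    p = delta * n_modules - 1
    row, col = divmod(a % p, n_modules)
    return col + 1 if row == delta - 1 else col


def module_sequence(stride, length, n_modules, delta):
    """First `length` entries of M_s, assuming the vector starts at address 0."""
    return [psim_module(i * stride, n_modules, delta) for i in range(length)]


def minimal_period(seq):
    """Smallest k such that seq[i] == seq[i + k] wherever both are defined."""
    for k in range(1, len(seq)):
        if all(seq[i] == seq[i + k] for i in range(len(seq) - k)):
            return k
    return len(seq)


N, DELTA = 4, 2
P = DELTA * N - 1  # 7, a prime
m2 = module_sequence(2, 9, N, DELTA)
print(m2)  # [0, 2, 1, 3, 1, 3, 2, 0, 2]

# Every stride that is coprime to p should give a sequence of period p.
periods = {s: minimal_period(module_sequence(s, 4 * P, N, DELTA))
           for s in range(1, 20) if gcd(s, P) == 1}
print(set(periods.values()))  # {7}
```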

When $s$ is a multiple of $\delta N - 1$, $\mathrm{lcm}(\delta N - 1, s) = s$ and thus $p = s/s = 1$. Here, all requests are mapped to one module, and such strides are black strides.

Theorem 2.2: The set of module indices in the basic sequence of $M_s$ ($s > 1$) is identical with that of the basic sequence of $M_1$.

Theorem 2.3: In the basic sequence of a module sequence $M_s$ ($s > 1$), each module appears exactly $\delta$ times except module-0, which appears $\delta - 1$ times (refer to Figure 3).
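The occurrence counts of Theorem 2.3 and the black-stride behavior can be illustrated with a short script. It is a sketch under the same assumed mapping as before ((a mod p) mod N with the last row of each period shifted right, vectors starting at address 0). The helper names and the choice of stride 5 are hypothetical.

```python
from collections import Counter


def psim_module(a, n_modules, delta):
    """Assumed PSIM mapping: (a mod p) mod N, last row of a period shifted."""
    p = delta * n_modules - 1
    row, col = divmod(a % p, n_modules)
    return col + 1 if row == delta - 1 else col


def basic_sequence_counts(stride, n_modules, delta):
    """How often each module appears in one period (basic sequence) of M_s."""
    p = delta * n_modules - 1
    return Counter(psim_module(i * stride, n_modules, delta) for i in range(p))


N, DELTA = 32, 4                              # p = 127 is prime
counts = basic_sequence_counts(5, N, DELTA)   # stride 5 is an arbitrary choice
print(counts[0], counts[1], counts[31])       # 3 4 4

# A black stride (a multiple of p) sends every request to a single module.
black_modules = {psim_module(i * 127, N, DELTA) for i in range(1000)}
print(black_modules)  # {0}
```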

3 Performance Evaluation

In this section, the performance of PSIM is evaluated and compared with other schemes for both long and short vectors.

3.1 Performance for Long Vectors

Let $b$ denote the size of the input buffer. In PSIM, we set $b$ to be one less than the maximum number of occurrences of identical module indices in the basic sequence. Therefore, according to Theorem 2.3, $b = \delta - 1$. The memory cycle time, $c$, is normalized to the processor cycle time.

First, it is shown that there is no processor stall for all $c$ ($< N$) and $s$ except for the black strides, which are the multiples of $p$. Let us construct a linear array called $MT_s$, where $MT_s[i]$ ($i = 0, 1, \ldots$) is the module index requested at time $i$ for $A_s$. Figure 3 shows $MT_1$ and $MT_2$ with $\delta = 2$ and $N = 4$. Another linear array, $X_s^m$, is also constructed by successively collecting the indices of $MT_s$ for memory module-$m$. Figure 4 shows $X_2^m$ obtained from $MT_2$ of Figure 3. Note that $X_s^m$ keeps the request times to module-$m$ when $A_s$ is presented.
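The two arrays can be built directly from the mapping. The following sketch assumes one request per processor cycle, a vector starting at address 0, and the assumed mapping (a mod p) mod N with the last row of each period shifted right. The function `request_times` is a hypothetical name.

```python
def psim_module(a, n_modules, delta):
    """Assumed PSIM mapping: (a mod p) mod N, last row of a period shifted."""
    p = delta * n_modules - 1
    row, col = divmod(a % p, n_modules)
    return col + 1 if row == delta - 1 else col


def request_times(stride, length, n_modules, delta):
    """Return (MT_s, X_s), where X_s[m] lists the times module m is requested."""
    mt = [psim_module(t * stride, n_modules, delta) for t in range(length)]
    x = {m: [] for m in range(n_modules)}
    for t, m in enumerate(mt):
        x[m].append(t)
    return mt, x


# Two periods of MT_2 for delta = 2, N = 4.
mt2, x2 = request_times(2, 14, 4, 2)
print(mt2)    # [0, 2, 1, 3, 1, 3, 2, 0, 2, 1, 3, 1, 3, 2]
print(x2[1])  # [2, 4, 9, 11]
```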

Now a necessary condition for no processor stall is identified. Let $x_i = X_s^m[i]$ and $a(x_i)$ be the address of the memory request corresponding to $x_i$.

Lemma 3.1: Let $\alpha_i$ denote the access time to a memory module for $a(x_i)$. Then it is computed by $\alpha_i = \max(x_i, \alpha_{i-1} + c)$.

Assume that the input buffer of a module is full when a new memory request arrives. If the request which has been accessing the module is completed at that moment, then the request at the head of the buffer will access the module. Accordingly, the new request can be put in the buffer without causing a processor stall. Therefore, the necessary condition for no processor stall is $x_{i+b+1} - (\alpha_i + c) \ge 0$ for all $i \ge 0$.

Lemma 3.2: The latest access time, $\alpha_{i,max}$, of $a(x_i)$ is $x_i + bc$, if $a(x_i)$ is put in an input buffer or directed to a memory module without causing a processor stall.

Lemma 3.3: If there was no processor stall for $a(x_i)$, then there is no processor stall for $a(x_{i+b+1})$ either.

Theorem 3.1: There is no processor stall for all $s > 1$ if $c < N$.

Due to Theorem 3.1, the processor utilization of PSIM for long vector accesses is 1 for all strides except for the black strides, when $b = \delta - 1$, $c < N$, and $\delta N - 1$ is prime. Note that there is no restriction on $\delta$ except the condition that $\delta N - 1$ be prime. However, as $\delta$ gets larger, the performance gets lower for short vectors, and the size of the input buffers needs to be increased to avoid the processor stall. Therefore, it is desirable to select the smallest $\delta$ for which $\delta N - 1$ is prime.
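The recurrence of Lemma 3.1 and the no-stall condition can be exercised with a small check. This is a sketch of one reading of the model, not the authors' simulator. It assumes one request per cycle starting at address 0, the assumed mapping (a mod p) mod N with the last row of each period shifted right, and the condition in the non-strict form x[i+b+1] >= alpha[i] + c. It only detects whether the condition is ever violated and does not model how a stall would delay later requests. All names are hypothetical.

```python
def psim_module(a, n_modules, delta):
    """Assumed PSIM mapping: (a mod p) mod N, last row of a period shifted."""
    p = delta * n_modules - 1
    row, col = divmod(a % p, n_modules)
    return col + 1 if row == delta - 1 else col


def violates_no_stall(stride, length, n_modules, delta, c, b):
    """True if some request breaks x[i+b+1] >= alpha[i] + c on any module."""
    times = {m: [] for m in range(n_modules)}
    for t in range(length):
        times[psim_module(t * stride, n_modules, delta)].append(t)
    for xs in times.values():
        alpha = []
        for i, x in enumerate(xs):
            # Lemma 3.1: alpha_i = max(x_i, alpha_{i-1} + c).
            alpha.append(x if i == 0 else max(x, alpha[i - 1] + c))
        for i in range(len(xs) - b - 1):
            if xs[i + b + 1] < alpha[i] + c:
                return True
    return False


N, DELTA, C = 32, 4, 24
B = DELTA - 1
P = DELTA * N - 1  # 127
stalled = [s for s in range(1, 2 * P + 1)
           if violates_no_stall(s, 3 * P, N, DELTA, C, B)]
print(stalled)  # [127, 254]
```

In this model only the multiples of p = 127 violate the condition, which is consistent with Theorem 3.1 for the parameters used in the simulations.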

The efficiency of PSIM is computed as follows. The efficiency for a black stride, $e_b$, is computed by $e_b = T_{min}/T_a$. Here, $T_a$ is the total time actually required for all the requests in the address sequence to be released from the input bus, while $T_{min}$ is the time assuming no blocking. $T_{min}$ is clearly the length of the address sequence. Hence, when an address sequence of length $l$ is presented to PSIM, $e_b = l/T_a$. If each stride is assumed to occur equally likely as in [6], the overall efficiency, $e$, of PSIM is obtained by averaging the efficiency over all strides.

Table I lists the efficiency of PSIM for various $N$. The efficiency of a prime memory system, $e_p$, is also shown for comparison. In Table II, PSIM is also compared with the RANDOM-H scheme [6], which is one of the most efficient stride independent schemes. PSIM displays the best performance when $b = \delta - 1$, and the efficiency can be computed using the equation above. However, when the buffer size is less than $\delta - 1$, the efficiency cannot be computed analytically, and thus simulations are performed using the same approach employed in [6]. Here, it is assumed that the size of the output buffers is sufficiently large. For each stride, an address sequence of length 10000 was presented to PSIM, and the total number of completed memory requests was counted during the time window (1000, 10000). The overall efficiency was then computed from $R(s)$ and $R_{max}(s)$, where $R(s)$ is the total number of requests actually completed and $R_{max}(s)$ is the maximum possible number of completed requests within the time window for stride-$s$ in the steady state. Here $N = 32$ and $c = 24$. Note from the tables that PSIM consistently outperforms both schemes.

Table I. Efficiency of PSIM for various N.

Table II. Comparison of PSIM and RANDOM-H.

3.2 Performance for Short Vectors

To estimate the short vector performance, two sets of simulations are performed for a memory system with $N = 32$, $c = 24$, and $b = 3$ while the length of the vectors is varied. First, for each of the three different vector lengths $v = 16$, 32, and 64, simulation is performed for each stride from 1 to 127. Efficiency is computed by $e_s = T_{min}/T_s$, where $T_s$ is the measured actual service time for the whole sequence and $T_{min}$ is the minimal time needed to service the sequence. It is easy to see that $T_{min} = l + c - 1$. Then, the overall efficiency is computed as the harmonic mean of $e_1, e_2, \ldots, e_{127}$. The simulation result is shown in Table III.

The second set of simulations is for a mixture of short vectors of different strides. An address sequence of length $l$, which consists of a certain number of short vectors with strides of 1 to 127, is presented to a memory system. Each vector in the sequence is a short sequence of addresses, and the lengths of the vectors are identical. Here, the relative frequency of each stride is varied such that the stride of 1 is assumed to occur at the rate of $f$, while the remaining 126 strides occur equally likely at the rate of $(1-f)/126$. We perform the simulation for three different values of $f$ and two different vector lengths, $v = 32$ and 64. The results are summarized in Table IV.

Table III. Comparison of short vector performances.

Table IV. Comparison for mixed vectors.

As shown in the tables, PSIM is always better than RANDOM-H for various operational conditions.
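The bookkeeping used in the short-vector experiments of Section 3.2 is simple to express in code. The sketch below shows the minimal service time l + c - 1, the per-stride efficiency ratio, the harmonic mean, and a stride mix in which stride 1 has rate f and the other 126 strides share the remainder equally. The function names, the value f = 0.5, and the two sample efficiencies are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
from statistics import harmonic_mean


def t_min(length, c):
    """Minimal time to service a vector of `length` requests: l + c - 1."""
    return length + c - 1


def efficiency(t_min_value, t_actual):
    """Per-stride efficiency: minimal time over measured service time."""
    return t_min_value / t_actual


def stride_rates(f, n_strides=127):
    """Stride 1 occurs at rate f; the others share (1 - f) equally."""
    rest = (1.0 - f) / (n_strides - 1)
    return {s: (f if s == 1 else rest) for s in range(1, n_strides + 1)}


print(t_min(16, 24))                         # 39
rates = stride_rates(0.5)                    # f = 0.5 is an illustrative value
print(abs(sum(rates.values()) - 1.0) < 1e-12)  # True
overall = harmonic_mean([1.0, 0.5])          # two made-up stride efficiencies
```

The harmonic mean weights low-efficiency strides more heavily than the arithmetic mean does, so a few poorly served strides pull the overall figure down noticeably.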

3.3 Implementation Issues

Memory space utilization of PSIM, which is the fraction of memory space utilized, can be computed by $u_m = 1 - \frac{1}{\delta N}$. For example, when $N = 32$ and $\delta = 4$, $u_m = 0.992$. Note that the waste of memory space for practical size memory systems will be negligible.
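The choice of the degree of irregularity and the resulting space utilization can be tabulated with a short script. It is a sketch under two assumptions. First, the search starts at δ = 2, because δ = 1 would leave module-0 unused according to Theorem 2.3. Second, exactly one location out of every δN is unused, which gives a utilization of 1 - 1/(δN) and matches the value 0.992 quoted for N = 32. The function names are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def smallest_delta(n_modules, min_delta=2):
    """Smallest delta >= min_delta such that delta * N - 1 is prime."""
    delta = min_delta
    while not is_prime(delta * n_modules - 1):
        delta += 1
    return delta


def utilization(n_modules, delta):
    """Fraction of memory used, assuming one unused slot per delta * N."""
    return 1.0 - 1.0 / (delta * n_modules)


for n in (4, 8, 16, 32, 64):
    d = smallest_delta(n)
    print(n, d, d * n - 1, round(utilization(n, d), 3))
```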

In PSIM, a word address inside a memory module is computed by $w(a_i) = \delta\,(a_i \;\mathrm{div}\; p) + (a_i \bmod p) \;\mathrm{div}\; N$. The hardware required to implement the entire mapping operation of PSIM is shown in Figure 5. Observe that the first step dominates the complexity of the entire operation since it involves prime number manipulation. In the second step, division and modulo by $N$ are trivial because $N$ is a power of 2. The multiplication by $\delta$ is also simple since $\delta$ is usually a very small value, e.g., 2, 3, or 4 when $N < 128$. In the third step, $q$ is always smaller than $\delta$, and the output of the AND gate is set to one only when $q = \delta - 1$. This implements the shift right operation. In the figure, $\delta$ is assumed to be 4. Thus, the complexity of the third step is trivial too. As a whole, the complexity of the overall mapping operation is comparable to those of the prime memory and RANDOM-H schemes. As pointed out in [6], if the memory mapping operation is pipelined as expected in practice and/or the speed of processors is much higher than that of memories, the delay incurred by the memory mapping will be relatively insignificant.
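The three steps described above can be mirrored in software to see which locations are used. The sketch below is one reading of the text rather than a description of the hardware in Figure 5. It assumes that q denotes (a mod p) div N, that the module index before the shift is (a mod p) mod N, and that the shift is applied when q = δ - 1. The name `psim_map` is hypothetical.

```python
def psim_map(a, n_modules, delta):
    """Return (module, word) for address a under the assumed PSIM mapping."""
    p = delta * n_modules - 1
    block, r = divmod(a, p)              # step 1: div and mod by the prime p
    q, col = divmod(r, n_modules)        # step 2: div and mod by N
    word = delta * block + q             # word address inside the module
    shift = 1 if q == delta - 1 else 0   # step 3: shift right in the last row
    return col + shift, word


N, DELTA = 4, 2
P = DELTA * N - 1
used = [psim_map(a, N, DELTA) for a in range(2 * P)]
all_slots = {(m, w) for m in range(N) for w in range(2 * DELTA)}
unused = sorted(all_slots - set(used))
print(len(set(used)) == len(used))  # True: no two addresses share a location
print(unused)                       # [(0, 1), (0, 3)]
```

Under these assumptions each group of δ rows leaves exactly one location of module-0 unused, which corresponds to the entries marked with * in the mapping example.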


4 Conclusion

We have presented a new interleaved memory scheme, called PSIM, based on the periodic shift of mapped addresses. It has the following characteristics: i) it is stride dependent, ii) it uses a power-of-two number of memory modules, iii) it has the smallest number of black strides among the stride dependent schemes reported, iv) it allows high efficiency for both long and short vectors with small input buffers, and thus v) it exhibits higher performance than earlier schemes. More comprehensive simulation and comparison will be done to fully assess the effectiveness of the proposed scheme for various operational conditions.

References

[1] P. Budnik and D.J. Kuck, "The organization and use of parallel memories," IEEE Trans. Comput., vol. C-20, no. 12, pp. 1566-1569, Dec. 1971.

[2] D.H. Lawrie and C.R. Vora, "The prime memory system for array access," IEEE Trans. Comput., vol. C-31, no. 5, pp. 435-442, May 1982.

[3] Q.S. Gao, "The Chinese Remainder Theorem and the prime memory system," Proc. Int'l. Symp. on Computer Architecture, pp. 337-340, May 1993.

[4] A. Norton and E. Melton, "A class of Boolean linear transformations for conflict-free power-of-two stride access," Proc. Int'l. Conf. on Parallel Processing, pp. 247-254, Aug. 1987.

[5] B.R. Rau, M.S. Schlansker, and D.W.L. Yen, "The Cydra 5 stride-insensitive memory system," Proc. Int'l. Conf. on Parallel Processing, pp. 242-246, Aug. 1989.

[6] R. Raghavan and J.P. Hayes, "On randomly interleaved memories," Proc. Supercomputing '90, pp. 49-58, Nov. 1990.

[7] G.D. Sohi, "High-bandwidth interleaved memories for vector processors - A simulation study," IEEE Trans. Comput., vol. 42, no. 1, pp. 34-44, Jan. 1993.

[8] D.T. Harper III and D.A. Linebarger, "A dynamic storage scheme for conflict-free vector access," Proc. Int'l. Symp. on Computer Architecture, pp. 72-77, May 1989.

[9] K. Kim and V.K.P. Kumar, "Perfect Latin squares and parallel array access," Proc. 16th Int'l. Symp. on Computer Architecture, pp. 372-379, May 1989.

[10] J.M. Frailong and J. Lenfant, "XOR-schemes: A flexible data organization in parallel memories," Proc. Int'l. Conf. on Parallel Processing, pp. 276-283, Aug. 1985.

[11] J.Y. Lee and H.Y. Youn, "PSIM: Periodically Shifted Interleaved Memory System," Technical Report, Computer Science Engineering Dept., Univ. of Texas at Arlington, No. CSETR-9410.


Figure 5. Implementation of PSIM mapping function.

Figure 4. Arrays formed by request times to identical modules.

Figure 3. Arrays of module indices in terms of request time.

Figure 2. Address mapping of PSIM.

Figure 1. Memory system model.
