
North-Holland Publishing Company Microprocessing and Microprogramming 10 (1982) 11-18

M-Users B-Servers Arbiter for Multiple-Busses Multiprocessors

Tomás Lang and Mateo Valero, Facultat d'Informàtica, Universitat Politècnica de Barcelona, Jordi Girona Salgado 31, Barcelona 34, Spain

The multiple-busses network is attractive for interconnecting processors to memory modules in a multiprocessor system. The use of this network requires an M-users B-servers arbiter. In this paper we show several designs for such an arbiter with a round-robin policy. The iterative design is simple but relatively slow, and it does not fully exploit the available level of integration. A network with one level of lookahead is faster and more highly integrated. To increase the speed of arbitration further, a design with two levels of lookahead is proposed. This network can be used effectively for up to 16 processors, memory modules and busses.

Keywords: Parallel processing, multimicroprocessor systems, multiple-busses network, synchronous machines, resource allocation, arbiter design, lookahead.

1. Introduction

The availability of microprocessors and other VLSI modules has increased the importance of multiprocessor systems. In such systems there are several resources (memories, busses, I/O devices) that have to be shared by all the processors. Most of these shared resources do not allow simultaneous use. This makes it necessary to have mechanisms, called arbiters and in many instances implemented in hardware, that execute an algorithm to decide the assignment of the shared resource.

Several arbiter designs have been proposed in the literature [1-8]. In all of them there are n users for a single shared resource. They differ in the assignment algorithm (fixed priority, round-robin, random or adaptive), in their synchronous or asynchronous behavior, and in the specific implementation (hardwired, microprogrammed, modular, etc.).

In some situations it is necessary to assign a pool of k resources to n users. An example of this occurs in a multiprocessor system with a multimodule memory in which the connection between processors and memory is implemented by a set of busses. In this case the arbiter has to be hardwired so that the arbitration time does not degrade the speed of the system. In this paper we present the design of such an arbiter. The proposed system is fast and modular, which makes it suitable for economical implementation with LSI modules.

2. Multibus Organization

An important aspect in the design of a multiprocessor is the organization of the shared memory and of the communication mechanism between the processors and the memory. Several organizations have been proposed (see [9-11]), differing in speed, cost, application range, modularity and reliability.

We concentrate on the organization that divides the memory into several independent modules and connects the processors and the memory by an interconnection network. In a system with N processors and N memory modules the maximum memory bandwidth is N accesses per memory cycle. This maximum is attained only if the processors generate addresses to different modules and the interconnection network can sustain this bandwidth. A network that has this capability is the crossbar, which is very costly for large N.

On the other hand, if in a cycle several processors generate addresses to the same module, only one of them can be served, resulting in a lower memory bandwidth. Several analytic and simulation studies have been made to determine the effective bandwidth under different assumptions [12-15].

The most studied case is that in which the module numbers of the addresses generated by the processors are independent and uniformly distributed random variables. In this case module contention is very significant, so approximately the same bandwidth can be obtained with less expensive interconnection networks; one of these is the multibus [16].

Fig. 1. Multiple Busses Organization (processors, busses and memory modules).

In the multibus organization (Fig. 1) the N processors and M memory modules are connected by B busses such that $B \le \min(N, M)$. Exact and approximate methods of analysis are presented in [17].
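To illustrate why a multibus with fewer busses than modules can approach the crossbar bandwidth, the following Python sketch computes the expected number of accesses served per cycle under a deliberately simplified model: each processor issues one request per cycle to a uniformly chosen module, and unserved requests are discarded rather than resubmitted. This model and the name `expected_bandwidth` are our own assumptions for illustration; they are not the analysis of [12-17].

```python
def expected_bandwidth(N, M, B=None):
    """Expected memory accesses served per cycle (simplified model).

    Assumptions (ours): each of the N processors requests one of M modules
    uniformly at random, requests are independent, and unserved requests are
    discarded.  B=None models a crossbar (every addressed module is served);
    otherwise at most B addressed modules are served, as with B busses.
    """
    # p[k] = probability that exactly k distinct modules are addressed
    p = [1.0] + [0.0] * M
    for _ in range(N):
        q = [0.0] * (M + 1)
        for k, pk in enumerate(p):
            if pk == 0.0:
                continue
            q[k] += pk * k / M                # request hits an already-addressed module
            if k < M:
                q[k + 1] += pk * (M - k) / M  # request hits a new module
        p = q
    served = M if B is None else B
    return sum(min(k, served) * pk for k, pk in enumerate(p))


if __name__ == "__main__":
    for B in (2, 4, 8, 16):
        print(B, round(expected_bandwidth(16, 16, B), 2))
    print("crossbar", round(expected_bandwidth(16, 16), 2))
```

With B=None the result equals the closed form M(1 - (1 - 1/M)^N); comparing it with increasing values of B shows the multibus bandwidth approaching the crossbar value, and matching it at B = M.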

For maximum bandwidth the assignment of the busses has to satisfy the following constraints:
(a) Only one bus should be assigned to the set of processors that request access to the same module.
(b) Only one bus should be assigned to each processor.
(c) The assignment policy should be fair.

In the next section we describe the design of an arbiter that satisfies these constraints.

3. Description of an Arbiter

We now give a high-level description of an arbiter that satisfies the restrictions given in the last section. The specific characteristics of this arbiter are:
(a) It can be used in a system where the processors are synchronized.
(b) Each cycle of the system is composed of an arbitration subcycle and a memory access subcycle.
(c) In the arbitration subcycle the busses are assigned in a round-robin fashion to those memory modules that have at least one request. In the next arbitration subcycle the assignment continues where it stopped in the previous one.
(d) The assignment of a memory-assigned bus to a processor is done by a conventional n-user, 1-server arbiter. As we mentioned in the introduction, these arbiters have been extensively studied and will not be discussed further here.

A high-level description of an arbiter for the assignment of the busses to the memory modules follows. The busses are labeled BUS(0) to BUS(B - 1) and the memory modules MEM(0) to MEM(M - 1). The assignment is as follows:

For $k = 0$ until $\min\left(B-1,\ \sum_{j=0}^{M-1} D_j - 1\right)$, assign BUS(k) to MEM($r_k$) so that

$$\sum_{j=E}^{r_k} D_j = k + 1,$$

that is, MEM($r_k$) is the (k + 1)-th module with a request, counting cyclically from module E. Here E is the state variable that indicates the first module to be assigned in a cycle, and $D_j$ indicates that module j has at least one outstanding request. The notation $\sum_{j=a}^{b}$ denotes a cyclic sum: it begins at j = a and, when b < a, continues up to M - 1 and then from 0 until b.

The new state E is:

$$E = \begin{cases} r_B & \text{if } \sum_{j=0}^{M-1} D_j > B - 1, \\ E & \text{otherwise.} \end{cases}$$

In the next section we describe several designs for this arbiter.
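The round-robin rule of Section 3 (busses given in order to the requesting modules, scanning cyclically from module E) can be sketched in Python as follows. This is an illustrative model, not the authors' hardware: `assign_busses` is a hypothetical name, and $r_B$ is taken cyclically (when exactly B modules request, the scan wraps around to $r_0$), which is our reading of the cyclic-sum notation.

```python
def assign_busses(D, E, B):
    """One arbitration subcycle of the high-level description.

    D: 0/1 flags, D[j] = 1 if memory module j has a request.
    E: module where this subcycle's round-robin scan begins.
    B: number of busses.
    Returns (grants, new_E), where grants[k] = r_k is the module given BUS(k).
    """
    M = len(D)
    # Requesting modules in round-robin order starting at E: r_0, r_1, ...
    order = [(E + s) % M for s in range(M) if D[(E + s) % M]]
    grants = order[:B]                    # BUS(k) goes to r_k
    if len(order) > B - 1:                # at least B requesting modules
        new_E = order[B % len(order)]     # r_B, taken cyclically
    else:
        new_E = E                         # state left unchanged
    return grants, new_E


if __name__ == "__main__":
    print(assign_busses([1, 0, 1, 1, 0, 1], E=2, B=2))
```

For the example in the `__main__` block, modules 2 and 3 receive BUS(0) and BUS(1), and module 5, the first requesting module left without a bus, becomes the starting point of the next subcycle.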

Fig. 2. Arbiter (for each processor i: inputs $R_i$ and $MP_i$, outputs $G_i$ and $B_i$).


4. Implementation of the Arbiter

Fig. 2 presents a block diagram of the arbiter. There are (m + 1) inputs and (b + 1) outputs corresponding to processor i (i = 0, 1, ..., N - 1), where m and b are the numbers of bits of a memory-module number and of a bus number, respectively. The input $R_i$ indicates that processor i requires memory access, and the input $MP_i$ represents, in binary code, the number of the corresponding memory module. The output $G_i$ indicates that a bus has been granted to processor i, and $B_i$ gives the number of the corresponding bus.

Fig. 3. Arbiter Decomposition (decoders producing $R_{i,j}$, per-module arbiters $AM_0$ to $AM_{M-1}$, and the M-users B-servers arbiter with inputs $D_j$ and outputs $GM_j$, $BM_j$).

Fig. 3 presents a first decomposition of the arbiter. The first level is formed by a set of decoders, so that $R_{i,j}$ indicates that processor i requires access to memory module j. The blocks $AM_j$ are conventional arbiters that select one of the processors that require access to module j. The signals $D_j$ indicate that there is at least one request for memory module j; they are the inputs of the M-users B-servers arbiter, which is the topic of this paper.

The outputs of this arbiter indicate that memory module j has been granted a bus ($GM_j$) and give the number of that bus ($BM_j$).
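As a sketch of the first level of Fig. 3, the hypothetical function `decompose` below decodes each processor's module number into the signals $R_{i,j}$ and forms each $D_j$ as the OR over all processors. It deliberately leaves out the per-module arbiters $AM_j$, whose policy the text does not specify.

```python
def decompose(R, MP, M):
    """Decoder level of the arbiter decomposition.

    R[i]:  1 if processor i requests a memory access.
    MP[i]: number of the memory module addressed by processor i.
    M:     number of memory modules.
    Returns (Rij, D): Rij[i][j] = 1 iff processor i requests module j,
    and D[j] = 1 iff module j has at least one request.
    """
    N = len(R)
    Rij = [[1 if R[i] and MP[i] == j else 0 for j in range(M)] for i in range(N)]
    D = [1 if any(Rij[i][j] for i in range(N)) else 0 for j in range(M)]
    return Rij, D
```

Only the vector D is passed on to the M-users B-servers arbiter; the matrix $R_{i,j}$ feeds the conventional arbiters $AM_j$.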

4.1. Iterative design

In Fig. 4 we present a first implementation of the arbiter. It consists of M combinational modules MB(i) that perform the assignment of the B busses and a state-register to store the state of the arbiter after each assignment subcycle.

Fig. 4. Iterative Design (a ring of cells MB(0) to MB(M - 1) with signals $e_i$, $C_i$, $BM_i$, $S_i$, $GM_i$, and the state storage).

In addition to the arbiter inputs and outputs already described in this section, there are the following internal variables:
- $e_i$ is a binary variable that indicates the memory module where the arbitration begins in each subcycle. The vector e is the 1-out-of-M encoding of variable E of Section 3. The arbitration continues from this memory module in a round-robin fashion.
- $C_i$ indicates that bus B - 1 has not been assigned by MB(i) or previous modules and, consequently, that the arbitration should continue.
- $BM_i$, which serves the dual purpose of being an external output and an internal variable, indicates the number of the last bus assigned by the previous modules. To indicate that no bus has been assigned by previous modules, we specify $BM_i = B - 1$ and $C_i = 1$, so that the next bus to be assigned, $(BM_i + 1) \bmod B$, is BUS(0).
- $S_i$ indicates, at the end of the arbitration subcycle, that module MB(i) assigned bus B - 1. If fewer than B memory modules have requests, then all $S_i$ are zero.

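The operation of the module MB(i) is described by the following expressions:

$$C_i = \begin{cases} 0 & \text{if } \bar e_i \land \big[\bar C_{i-1} \lor [D_i \land (BM_{i-1} = B-2)]\big], \\ 1 & \text{otherwise;} \end{cases}$$

$$S_i = \begin{cases} 1 & \text{if } \bar e_i \land C_{i-1} \land D_i \land (BM_{i-1} = B-2), \\ 0 & \text{otherwise;} \end{cases}$$

$$BM_i = \big[(B-1)\,e_i + BM_{i-1}\,\bar e_i + D_i\big] \bmod B;$$

$$GM_i = \begin{cases} 1 & \text{if } D_i \land (e_i \lor C_{i-1}), \\ 0 & \text{otherwise.} \end{cases}$$

Because the arbitration is cyclic, $C_{-1} = C_{M-1}$ and $BM_{-1} = BM_{M-1}$. The ring is broken at the module with $e_i = 1$, whose outputs do not depend on $C_{i-1}$ or $BM_{i-1}$. In the expressions we also assume that B > 1.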
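To check the expressions above, the following Python sketch evaluates the cells MB(i) for one subcycle. The function name `mb_chain` is hypothetical, and the start module is passed as an index E rather than as the one-hot vector e. Because the start cell ignores $C_{i-1}$ and $BM_{i-1}$, evaluating the ring in order from E gives the values the combinational network would settle to.

```python
def mb_chain(D, E, B):
    """Evaluate the cells MB(0)..MB(M-1) of the iterative arbiter.

    D: 0/1 request flags, D[i] = 1 if memory module i has a request.
    E: module where this subcycle starts (e_E = 1, all other e_i = 0).
    B: number of busses, B > 1 as assumed in the text.
    Returns the lists C, S, BM, GM indexed by module number.
    """
    M = len(D)
    assert B > 1
    C, S, BM, GM = [0] * M, [0] * M, [0] * M, [0] * M
    # The cells form a ring, but the cell with e_i = 1 ignores its C and BM
    # inputs, so evaluating the cells in order starting at E is exact.
    c_prev, bm_prev = 1, B - 1            # don't-care values for the start cell
    for step in range(M):
        i = (E + step) % M
        e = 1 if i == E else 0
        d = D[i]
        next_is_last = d == 1 and bm_prev == B - 2   # D_i and BM_{i-1} = B-2
        C[i] = 0 if (not e) and (not c_prev or next_is_last) else 1
        S[i] = 1 if (not e) and c_prev and next_is_last else 0
        BM[i] = ((B - 1) * e + bm_prev * (1 - e) + d) % B
        GM[i] = 1 if d and (e or c_prev) else 0
        c_prev, bm_prev = C[i], BM[i]
    return C, S, BM, GM
```

For D = [1, 0, 1, 1, 0, 1], E = 2 and B = 2, modules 2 and 3 are granted busses 0 and 1, $S_3 = 1$, and the carry $C$ drops to 0 after module 3, so module 5 receives no bus.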

A possible implementation of the state register is shown in Fig. 5. It consists of two latches, L1 and L2, additional logic to obtain $S_0$ (= 1 if $\sum_{i=1}^{M-1} S_i = 0$), and the reset signal. Latch L1 stores the state at the end of the arbitration subcycle and latch L2 at the beginning of the next arbitration subcycle.

Fig. 5. Storage of the State (latches L1 and L2 between the signals $S_i$ and $e_i$, with a reset input).

This implementation of the arbiter is simple, but it might be slow because the internal variables must propagate through all the modules. Also, the level of integration obtained is low for present-day technologies. In the next subsection a faster and more integrated version is presented.

4.2. Implementation with one level of lookahead

To reduce the arbitration time, it is possible to implement several MB(i) cells in one module. This implementation should be aimed at reducing the propagation delay of the signals BM and C from one of these macromodules to the next. The idea is of course not new; a well-known application of it is the carry-lookahead adder.

In the design of these macromodules it is necessary to take into account at least two restrictions. One is the cost of implementing the aforementioned functions with a small number of logic levels, and the other is the limit on the number of pins of the resulting chip. If we integrate r modules MB(i) in one macromodule, the number of inputs and outputs is r(2 + b) + b + 4. For b = 4 (16 busses) and r = 4 we obtain 32 inputs and outputs, which is a reasonable number. To simplify the design we assume that in a subcycle the arbitration begins at the first memory module of the macromodule that obtained the last bus in the previous arbitration subcycle.
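The pin count r(2 + b) + b + 4 can be tabulated with a short Python function. The split into per-cell and shared pins is our reading of Fig. 6 (per cell: $D_i$, $GM_i$ and the b bits of $BM_i$; shared: the b-bit carry-in $BM_I$, $e_I$, $C_I$, $C_O$ and $S_O$), and `macromodule_pins` is a hypothetical name.

```python
def macromodule_pins(r, b):
    """Inputs plus outputs of a macromodule holding r cells MB(i).

    b is the number of bits of a bus number (B = 2**b busses).
    The per-cell / shared split is an assumed reading of Fig. 6.
    """
    per_cell = 1 + 1 + b    # D_i in, GM_i out, BM_i (b bits) out
    shared = b + 4          # BM_I carry-in (b bits), e_I, C_I, C_O, S_O
    return r * per_cell + shared


if __name__ == "__main__":
    for r in (2, 4, 8):
        print(r, macromodule_pins(r, 4))
```

With b = 4 the count grows by 6 pins per integrated cell, which is why r = 4 (32 pins) is a comfortable choice while r = 8 would need 56.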

In Fig. 6 we present a block diagram of the macromodule. The expressions for the carry-out signals are:

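$$BM_{r-1} = \Big[(B-1)\,e_I + BM_I\,\bar e_I + \sum_{j=0}^{r-1} D_j\Big] \bmod B;$$

$$C_O = \begin{cases} 0 & \text{if } \Big[e_I \land \Big(\sum_{j=0}^{r-1} D_j \ge B\Big)\Big] \lor \Big[\bar e_I \land \Big(\bar C_I \lor \Big[(BM_I \neq B-1)(BM_I+1) + \sum_{j=0}^{r-1} D_j \ge B\Big]\Big)\Big], \\ 1 & \text{otherwise.} \end{cases}$$

Here $e_I$ indicates that the arbitration begins in this macromodule, $C_I$ and $BM_I$ are the carry-in signals from the preceding macromodule, and the predicate $(BM_I \neq B-1)$ is used as a 0/1 factor, so that a carry-in $BM_I = B-1$ (no bus assigned yet) makes the next bus BUS(0).

Fig. 6. Macromodule (inputs: carry-in, $e_I$, $D_0$ to $D_{r-1}$; outputs: $GM_0$ to $GM_{r-1}$, $BM_0$ to $BM_{r-1}$, $C_O$, $S_O$, carry-out).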

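For r = 4 and b = 4 these functions can be implemented with a PLA. The other functions required can be implemented as in Section 4.1, or they could also be realized with PLAs; in this case the expressions would be:

$$BM_i = \Big[(B-1)\,e_I + BM_I\,\bar e_I + \sum_{j=0}^{i} D_j\Big] \bmod B;$$

$$GM_i = \begin{cases} 1 & \text{if } D_i \land (C_I \lor e_I) \land \Big[\Big[e_I \land \Big(\sum_{j=0}^{i-1} D_j < B\Big)\Big] \lor \Big[\bar e_I \land \Big((BM_I \neq B-1)(BM_I+1) + \sum_{j=0}^{i-1} D_j < B\Big)\Big]\Big], \\ 0 & \text{otherwise;} \end{cases}$$

$$S_O = \begin{cases} 1 & \text{if } (C_I \lor e_I) \land \Big[\Big[e_I \land \Big(\sum_{j=0}^{r-1} D_j \ge B\Big)\Big] \lor \Big[\bar e_I \land \Big((BM_I \neq B-1)(BM_I+1) + \sum_{j=0}^{r-1} D_j \ge B\Big)\Big]\Big], \\ 0 & \text{otherwise.} \end{cases}$$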
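The following Python sketch evaluates one macromodule directly from the closed-form expressions above, as a PLA would, instead of rippling through the cells. The function name and argument order are hypothetical, and the 0/1 factor $(BM_I \neq B-1)$ is modeled with a Python comparison.

```python
def macromodule(D, e_I, C_I, BM_I, B):
    """Evaluate one macromodule of r = len(D) cells from closed-form expressions.

    D:        0/1 request flags of the r memory modules in this macromodule.
    e_I:      1 if the arbitration begins at the first module of this macromodule.
    C_I, BM_I: carry-in signals from the preceding macromodule.
    Returns (GM, BM, C_O, S_O).
    """
    r = len(D)
    # First bus this macromodule can grant: BUS(0) if it starts the scan or if
    # BM_I == B - 1 ("no bus assigned yet"); otherwise the bus after BM_I.
    first = (1 - e_I) * (BM_I != B - 1) * (BM_I + 1)
    enabled = bool(C_I or e_I)              # (C_I v e_I)
    GM, BM = [], []
    for i in range(r):
        before = sum(D[:i])                 # sum of D_j for j = 0 .. i-1
        GM.append(1 if D[i] and enabled and first + before < B else 0)
        BM.append(((B - 1) * e_I + BM_I * (1 - e_I) + before + D[i]) % B)
    reaches_last = first + sum(D) >= B      # bus B-1 falls inside this macromodule
    C_O = 0 if (not enabled) or reaches_last else 1
    S_O = 1 if enabled and reaches_last else 0
    return GM, BM, C_O, S_O
```

Chaining such calls, feeding $C_O$ and the last BM value of one macromodule into the next, reproduces the ripple between macromodules that remains in this design.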

This solution is faster than that of Section 4.1. Nevertheless, it could still be too slow because of the propagation delay across all macromodules. An implementation with two levels of lookahead is discussed in the next subsection.

4.3. Implementation with two levels of lookahead

This implementation consists of two types of modules: the macromodules defined in Section 4.2, without the generation of the propagation functions, and modules that generate these functions.

A block diagram of this new module is presented in Fig. 7, for the case in which it generates the propagation signals for four macromodules. The variables N(i) encode the number of requests to memory modules of the ith macromodule; since N(i) ranges from 0 to r, $n = \lceil \log_2 (r+1) \rceil$ variables are required. All other variables have the same meaning as before, the index in parentheses being the number of the macromodule.

The expressions for BM(i) and C(i) are:

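$$BM(0) = \big[(B-1)\,e(0) + BM_I\,\bar e(0) + N(0)\big] \bmod B,$$

$$BM(1) = \big[(B-1)\,[e(0) \lor e(1)] + BM_I\,[\bar e(0) \land \bar e(1)] + \bar e(1)\,N(0) + N(1)\big] \bmod B,$$

$$BM(2) = \big[(B-1)\,[e(0) \lor e(1) \lor e(2)] + BM_I\,[\bar e(0) \land \bar e(1) \land \bar e(2)] + [\bar e(1) \land \bar e(2)]\,N(0) + \bar e(2)\,N(1) + N(2)\big] \bmod B,$$

$$BM(3) = \big[(B-1)\,[e(0) \lor e(1) \lor e(2) \lor e(3)] + BM_I\,[\bar e(0) \land \bar e(1) \land \bar e(2) \land \bar e(3)] + [\bar e(1) \land \bar e(2) \land \bar e(3)]\,N(0) + [\bar e(2) \land \bar e(3)]\,N(1) + \bar e(3)\,N(2) + N(3)\big] \bmod B,$$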

$$C(i) = \begin{cases} 0 & \text{if } \displaystyle\bigvee_{k=0}^{i} \Big[e(k) \land \Big(\sum_{j=k}^{i} N(j) \ge B\Big)\Big] \lor \Big[\bar e(0) \land \cdots \land \bar e(i) \land \Big(\bar C_I \lor \Big[(BM_I \neq B-1)(BM_I+1) + \sum_{j=0}^{i} N(j) \ge B\Big]\Big)\Big], \\ 1 & \text{otherwise,} \end{cases} \qquad i = 0, 1, 2, 3.$$

For i = 0 this reads $C(0) = 0$ if $[e(0) \land (N(0) \ge B)] \lor [\bar e(0) \land (\bar C_I \lor [(BM_I \neq B-1)(BM_I+1) + N(0) \ge B])]$.

Fig. 7. Lookahead Generator (inputs $C_I$, $BM_I$, e(0) to e(3), N(0) to N(3); outputs C(i) and BM(i)).
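A Python sketch of the lookahead generator follows; it computes every BM(i) and C(i) directly from e, N, $C_I$ and $BM_I$ using the expressions above, with no ripple through the macromodules. The name `lookahead_generator` is hypothetical, and we assume, as in the design, that exactly one e(k) is set.

```python
def lookahead_generator(e, N, C_I, BM_I, B):
    """Closed-form carries BM(i) and C(i) of the lookahead generator.

    e:  one-hot flags, e[k] = 1 if arbitration begins at macromodule k.
    N:  N[k] = number of requesting memory modules in macromodule k.
    C_I, BM_I: carry-in signals.
    B:  number of busses.
    Returns (BM, C) with one entry per macromodule.
    """
    BM, C = [], []
    for i in range(len(e)):
        started = any(e[: i + 1])                  # e(0) v ... v e(i)
        bm = (B - 1) * started + BM_I * (not started)
        for j in range(i + 1):
            # N(j) counts only if no macromodule j+1 .. i restarts the scan
            if not any(e[j + 1 : i + 1]):
                bm += N[j]
        BM.append(bm % B)
        exhausted = any(e[k] and sum(N[k : i + 1]) >= B for k in range(i + 1)) or (
            not started
            and (not C_I or (BM_I != B - 1) * (BM_I + 1) + sum(N[: i + 1]) >= B)
        )
        C.append(0 if exhausted else 1)
    return BM, C
```

Under the one-hot assumption, C(3) and BM(3) depend on $C_I$ and $BM_I$ only when no e(k) is set, so they can be fed back as carry-in without creating a combinational loop. For example, with e = [0, 0, 1, 0] and B = 8, C(3) and BM(3) are fixed by N(2) and N(3) alone and can then be supplied as $C_I$ and $BM_I$ for macromodules 0 and 1.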

The M-users B-servers arbiter for 16 memory modules and 16 busses is given in Fig. 8. In this case the module that generates the propagation signals has 36 inputs and outputs. The most complex function realized by this module is C(3); Fig. 9 illustrates a possible implementation with standard MSI modules. It seems that all the functions BM and C can be integrated in one chip with a reasonable propagation delay.

5. Conclusions

In this paper we present several implementations of an arbiter that assigns busses in a multibus, shared-memory multiprocessor. The iterative network of Section 4.1 is simple, but it does not effectively use the level of integration that is possible with today's technology. The design with one level of lookahead produces a faster network with good integration. A still faster solution is obtained in the implementation with two levels of lookahead. The resulting network is fast, modular and highly integrated.

In the design with two levels of lookahead we have assumed a fixed number of busses. For more flexibility it would be necessary to introduce this value as an additional variable and modify the implementation accordingly.

The pin limitations of the modules make the last design suitable for up to 16 processors, memory modules and busses.

The arbiter presented is cyclic in order to obtain a fair policy. Nevertheless, because the arbiter assigns busses to memory modules and the processor requests are uniformly distributed, a fixed-priority arbiter would also be fair, provided that the number of busses is large enough that no request remains without service for a long period. The fixed-priority arbiter is somewhat simpler because it does not require storage of the state, and it allows the connections between busses and memory to be simplified [18].

As mentioned in Section 2, the multibus interconnection is an attractive alternative to the crossbar because it is less expensive and more fault-tolerant, and with an adequate number of busses it gives approximately the same effective bandwidth [16]. The arbiter required in the multibus case is more complex, but the design presented shows that it can be cost-effective and fast.

Fig. 8. A 16-busses Arbiter With two Levels of Lookahead (macromodules M(0) to M(3), the lookahead generator, and the storage of the state).

Fig. 9. Implementation of C(3) With Standard MSI Modules (adders and comparators).

References

[1] W. Plummer, Asynchronous arbiters, IEEE Trans. Comput. C-21 (1) (1972) 37-42.

[2] R.C. Pearce et al., Asynchronous arbiter module, IEEE Trans. Comput. (September 1975) 931-932.


[3] P. Corsini, N-User asynchronous arbiter, Electron. Lett. 11 (1) (1975) 1-2.

[4] K. Søe Højberg, An asynchronous arbiter resolves resource allocation conflicts on a random priority basis, Comput. Des. (August 1977) 120-123.

[5] K. Søe Højberg, One-step programmable arbiters for multiprocessors, Comput. Des. (April 1978) 154-158.

[6] M. Courvoisier, One-step N-user programmable arbiter, Electron. Lett. 15 (4) (1979) 430-432.

[7] K. Søe Højberg, Queue handling arbiter solves shared resource conflicts, Comput. Des. (November 1979) 129-135.

[8] E. Petriu, N-channel asynchronous arbiter resolves resource allocation conflicts, Comput. Des. (August 1980) 126-132.

[9] W.A. Wulf and C.G. Bell, C.mmp: A multi-mini-processor, Proc. Fall Joint Comp. Conf., AFIPS (1972) 765-777.

[10] R.J. Swan et al., Cm*: A modular multi-microprocessor, AFIPS (1977) 637-644.

[11] G. Mazare, MCS, A symmetric multi-micro-processor systems, Euromicro Workshop, Venice, October 1976.

[12] D.P. Bhandarkar, Markov chain models for analyzing memory interference in multiprocessor computer systems, Proc. 1st Annual Symp. Comp. Arch. (1973) 1-6.

[13] F. Baskett and A.J. Smith, Interference in multiprocessor computer systems with interleaved memory, Comm. ACM 19 (6) (1976) 327-334.

[14] C.H. Hoogendoorn, A general model for memory interference in multiprocessors, IEEE Trans. Comput. C-26 (10) (1977) 998-1005.

[15] B. Ramakrishna Rau, Interleaved memory bandwidth in a model of a multiprocessor computer system, IEEE Trans. Comput. C-28 (9) (1979) 678-681.

[16] T. Lang et al., Bandwidth of crossbar and multibus connections for multiprocessors, IEEE Trans. Comput. To appear.

[17] M. Ajmone and M. Gerla, Markov models for multiple bus multiprocessor systems, Report. No. CSD 810304, Univ. of California, Berkeley, CA (February 1981).

[18] T. Lang and M. Valero, Reducción de conexiones en organización multibus y arbitraje asociado, RR 81/07, Facultat d'Informàtica de Barcelona, Barcelona.

Tomás Lang received an Engineering Degree from the Universidad de Chile in 1964, an M.S. from UC Berkeley in 1966 and the Ph.D. from Stanford University in 1974. He was professor of Engineering at the Universidad de Chile from 1965 to 1973, belonged to the Faculty of UCLA from 1974 to 1976, then taught Computer Science at the Universitat Politècnica de Barcelona from 1978 to 1981, and is currently a Visiting Associate Professor of Computer Science at UCLA. His teaching and research interests are in Computer Architecture, with current emphasis on multiprocessors and architectural support for Operating Systems functions.

Mateo Valero, born in 1952 in Alfamén (Spain), received a Telecommunication Engineering Degree from the Universidad Politécnica de Madrid in 1974 and a Ph.D. from the Universitat de Barcelona in 1980. He was professor of Engineering at the Universitat Politècnica de Barcelona (Escuela de Telecomunicación) from 1974 to 1980. Since 1980 he has been professor of Computer Architecture at the Facultat d'Informàtica de Barcelona (UPB). His teaching and research interests are in Computer Architecture, with emphasis on the design and evaluation of interconnection networks for multiprocessor systems and local computer networks.
