
The Journal of Supercomputing, 6, 5-48 (1992) © 1992 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

The El'brus-3 and MARS-M: Recent Advances in Russian High-Performance Computing

MIKHAIL N. DOROZHEVETS Institute of Informatics Systems, Novosibirsk, 630090, Russia

PETER WOLCOTT* University of Arizona, Tucson, AZ 85721

(Received September 1991; final version accepted April 1992.)

Abstract. The El'brus-3 and MARS-M represent two recent efforts to address the Soviet Union's high-performance computing needs through original, indigenous development. The El'brus-3 extends very long instruction word (VLIW) concepts to a multiprocessor environment and offers features that increase performance and efficiency and decrease code size for both scientific and general-purpose applications. It incorporates procedure static and globally dynamic instruction scheduling, multiple, simultaneous branch path execution, and iteration frames for executing loops with recurrences and conditional branches. The MARS-M integrates VLIW, data flow, decoupled heterogeneous processors, and hierarchical systems into a unified framework. It also offers a combination of static and dynamic VLIW scheduling. While the viability of these machines has been demonstrated, significant barriers to their production and use remain.

Keywords. El'brus-3, MARS-M, VLIW, high-performance computing, parallel processing.

1. Introduction

Soviet researchers have developed digital computers since the late 1940s. By the middle of the 1980s, however, the gap between Soviet systems and Western models in performance, quality, and numbers produced had become extreme, particularly in the field of supercomputing [Wolcott and Goodman 1988; Wolcott and Goodman 1990]. The CRAY-1, with a theoretical peak performance of 160 megaflops, had entered production a decade earlier. The most powerful Soviet machine currently in volume production, the ten-processor El'brus-2 with a theoretical peak performance of 94 megaflops on 64-bit operands,¹ entered series production in 1985. At the same time the Soviet economy increasingly felt the pressures of over two decades of stagnation and the intensified Cold War. In 1985 Mikhail Gorbachev initiated a program of economic restructuring in an attempt to revitalize the Soviet economy and maintain the Soviet Union's status as a world superpower. Issues of national security, economic reconstruction, and national prestige made the development of indigenous supercomputers a high national priority.

The El'brus-3 and MARS-M computers represent two leading efforts to address the high- performance computing needs of the Soviet Union. Although quite different from each

*This paper was written nearly entirely by means of e-mail between Tucson and Novosibirsk. It is one of the first examples of this type of collaboration between Russian and American colleagues.


other in goals, design requirements, architecture, and levels of support, they both incorporate innovative very long instruction word (VLIW) approaches to achieve high performance. These machines continue the tradition of indigenous designs of the Soviet high-performance computing sector. At the same time the projects were strongly influenced by previous research, from the West as well as from the East. When the MARS-M project was initiated in 1981, the AP-120B attached-array processor from Floating Point Systems (FPS) was the only commercial VLIW processor in existence. Its horizontal architecture was a starting point for the MARS-M. Initiated four years later, the El'brus-3 design was influenced by both the FPS attached-array processors and Joseph Fisher's work on compilers for VLIW processors. Although Russian designers in some cases made design decisions similar to those of their American colleagues working on the Trace and Cydra 5 computers, their machines are unique. In the sections below we examine in detail the El'brus-3 and MARS-M architectures and the design decisions underlying them.

2. El'brus-3

In 1985 a state order for a machine with a theoretical peak performance of 10 gigaflops was given to the Academy of Sciences' Institute of Precision Mechanics and Computer Technology (ITMVT) in Moscow. This was a logical choice. ITMVT has been a leader in Soviet high-performance computing since the 1960s. The El'brus-1 and -2 multiprocessors were the only mass-produced 64-bit machines in the Soviet Union. Since by 1989 an estimated one million lines of code had been written for these machines, it was important that the future machine be compatible with them.

The El'brus-2 is a multiprocessor with up to ten processors. The processors have a stack-based architecture and employ ten functional units, none of which are pipelined. They have a clock period of 47 ns and share up to 144 megabytes of main memory. The computing paradigm includes the following elements:

• multiprocessing;
• hardware support for high-level language and operating system constructs;
• multiple functional units;
• complex instructions (CISC) and zero-address instructions for stack processing;
• dynamic instruction scheduling.

The strong orientation towards run-time mechanisms for instruction scheduling and hardware implementation of many operations of high-level languages and the operating system resulted in an overly complex design. There were several consequences. First, designers were unable to fully simulate and verify design decisions. As a result, some design faults were only detected at the debugging stage. Some mechanisms were canceled or redesigned at great cost. Second, hardware complexity was a serious challenge to service engineers outside the ITMVT, increasing installation and repair times. Third, dynamic hardware mechanisms to support procedure calls, prefetching of instructions and data, issuing of instructions, and so forth caused many diagnostic and debugging problems.


While it had a number of merits, dynamic instruction scheduling that enabled only one or two El'brus-2 instructions to be issued per clock period could not deliver the two-orders-of-magnitude increase in theoretical peak performance required for the El'brus-3.

Thanks in large part to the efforts of the ITMVT, ECL 1500-gates/chip technology with minimum and average gate delays of 400 and 800 to 900 picoseconds became available to designers in the Soviet Union around 1984-85. This made it possible to design a processor with a clock period of 10 ns. The results of gate-level simulation performed after wire routing showed that a 12.5-ns clock period was needed to allow the necessary data transfer at each stage, forcing designers in 1990 to increase the clock period.

Thus, the gate array technology was only four to five times faster than that used on the El'brus-2. Since the system was to be used for both general-purpose and scientific applications, a vector-pipeline approach was rejected. At the same time designers had to preserve the portability of software written for the El'brus-2 in the El'-76 language. These constraints formed the context within which the El'brus-3 was developed. Designers were forced to use a VLIW approach for the El'brus-3 central processor and simultaneously increase the maximum number of processors from 10 to 16, all the while preserving the tightly coupled multiprocessor configuration of the El'brus-2.

2.1. El'brus-3 System Organization

The key features of the El'brus-3 architecture are

• multiprocessing;
• hardware support for high-level language and operating system;
• multiple pipelined functional units;
• RISC approach with long instruction words;
• static instruction scheduling.

The system, shown in Figure 1, consists of up to 16 central processors with their own local memories, 8 shared main memory sections, 8 I/O processors, 16 telecommunication processors, magnetic disks, and other I/O devices. Each central processor is a VLIW processor with a peak performance of 560 megaflops or 1360 MIPS, giving a peak performance of 8.96 gigaflops or 21.76 GIPS.²

To transfer data between the central processor's buffers, local and main memories, and the I/O processors, each central processor features an I/O buffer and input and output crossbar switches. The I/O buffer stores requests and data that are passed from or to eight I/O processors along eight pairs of unidirectional input and output data paths. Up to eight central processor or I/O processor requests can be passed through the input crossbar switch to the local and main memories per cycle. A special management unit services I/O processor requests without any central processor involvement. The output crossbar switch transfers up to eight operands from the local and main memories to a central processor's buffers or to the I/O buffer. Eight additional pairs of load and store paths (one for each main memory section) connect the input and output crossbar switches and main memory.


Figure 1. El'brus-3 logical configuration. (MMS - Main Memory Section; CPU - Central Processing Unit; I/OP - Input/Output Processor; TCP - Telecommunications Processor; MD - Magnetic Disk Controller; I/OD - Other I/O Device)

Each of the eight I/O processors has a 200-megabyte/s data transfer rate to and from the central processors for an aggregate bandwidth of 1600 megabytes/s. Each processor has two types of channels: (1) 32 slow channels, IBM-compatible, with 1.2-3.0 megabytes/s bandwidth for interfacing with industry-standard I/O devices, and (2) 32 fast fiber-optic channels with 15 megabytes/s bandwidth for interfacing with telecommunications processors and other devices.

Each central processor is housed in two water-cooled racks, each of which measures about 1500 x 650 x 1900 mm. A main memory section and an I/O processor each occupy one such rack. A minimal configuration of one central processor, one main memory section, and one I/O processor dissipates about 110 kilowatts of power.

2.2. El'brus-3 Central Processor

Using hardware tags to specify data types, the El'brus-3 central processor (Figure 2) can operate both on integer and floating point data of 32- and 64-bit formats. All of the functional units are pipelined.

There are nine functional units, but to reduce the complexity of the interconnect structure, the divider and one of the logical units (LUs) are each combined with a load/store unit to form two compound units. There are three reasons for this. First, when loading data into the array element buffer during loop processing, the load/store units are not generally used, except in the case of sparse arrays. Second, in most scalar computations


Figure 2. El'brus-3 CPU. (The figure shows the functional units (2 adders, 2 multipliers, 1 divider, 2 logic units, 2 load/store units) with their seven 128-word history result buffers and a crossbar switch; the buffer memory with a 1024-word stack buffer, a 512-word array element buffer, an operand record memory, 2 data conversion units, and a crossbar switch; the Indexation unit with 8 address generators, a generation control buffer, and an address parameter buffer; the Instruction unit with a sequencer, a subprogram unit, 32 display registers, a 256-word control information stack, and a 2048-word instruction buffer; the cache/memory management unit with 8 table look-aside buffers, a global data cache, a request buffer, a process page table, and an input/output buffer with crossbar switches; the 8-ported local memory of 2M 86-bit words in 32 banks; and the shared main memory of 8 sections, each with 16 ports, 32M 86-bit words, and 128 banks.)

one LU is sufficient and a data load is performed by the load/store unit that is combined with another LU. Third, for non-numerical computations, a data transfer can be accomplished by the load/store unit, which is combined with the divider.

Thus there are seven functional units interconnected by a full 15 × 16 crossbar switch. The crossbar provides sixteen 72-bit (plus one parity bit) data paths, consisting of an 8-bit tag field and a 64-bit data field.

The functional unit operands can come from

• seven functional unit outputs;
• seven synchronous history result buffers (HRBs);
• a multiported stack buffer;
• a multiported array element buffer;
• literals from the Instruction unit.

All data paths are unidirectional with a bandwidth of 720 megabytes/s. The aggregate throughput of all internal data paths is 15 * 720 megabytes/s = 10.8 gigabytes/s because


up to 15 different operands can be placed on the crossbar input paths. Besides passing its operation results to the crossbar switch, each functional unit writes them into its history result buffer, which provides a temporary storage for the results. When stored in the buffers, these results can be read and used as input operands. As will be explained below, the presence of this additional register memory helps to reduce the contention between the units in accessing the stack and array element buffers and solve other scheduling problems.

The shared buffer memory is made up of two multiported register files: stack and array element buffers. The 1024-word stack buffer holds the history of procedure frames, eliminating external memory accesses for the variables within the frames. The maximum frame size is 64 words. There is no stack pointer. Data are addressed by specifying their displacements from display registers or by means of indirect reference words. (The display registers point out the locations of activation frames for procedures or blocks in a program stack.) The 512-word array element buffer is used for storing array elements being processed in loops.

The buffer memory can provide up to eight input operands for the functional units per cycle, of which up to six come from the stack buffer and up to eight from the array element buffer. Allowable combinations are 0 + 8, 1 + 7, ..., 6 + 2. Up to four functional unit results can be stored in the stack buffer and up to two in the array element buffer in one cycle. In addition, up to two and six operands from the local or main memories can be loaded in one clock period into the stack and array element buffers, respectively. However, this maximum throughput is achieved only for a nonconflicting combination of access addresses for the buffers.
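
As a rough illustration, the following C sketch checks whether a candidate per-cycle access pattern respects the port limits just described. The structure and function names are illustrative only and are not part of any El'brus software.

#include <stdbool.h>
#include <stdio.h>

/* Per-cycle access pattern to the shared buffer memory (hypothetical model). */
struct BufferAccess {
    int stack_reads;   /* operands read from the stack buffer              */
    int aeb_reads;     /* operands read from the array element buffer      */
    int stack_writes;  /* functional-unit results written to the stack     */
    int aeb_writes;    /* functional-unit results written to the AEB       */
    int stack_loads;   /* words loaded from local/main memory into stack   */
    int aeb_loads;     /* words loaded from local/main memory into AEB     */
};

/* Limits as stated in the text: at most 8 read operands per cycle, of which
 * no more than 6 come from the stack buffer; at most 4 result writes to the
 * stack buffer and 2 to the array element buffer; at most 2 and 6 memory
 * loads into the stack and array element buffers, respectively. */
static bool access_pattern_ok(const struct BufferAccess *a)
{
    if (a->stack_reads + a->aeb_reads > 8) return false;
    if (a->stack_reads > 6 || a->aeb_reads > 8) return false;
    if (a->stack_writes > 4 || a->aeb_writes > 2) return false;
    if (a->stack_loads > 2 || a->aeb_loads > 6) return false;
    return true;
}

int main(void)
{
    struct BufferAccess a = { 6, 2, 4, 2, 2, 6 };  /* the "6 + 2" combination */
    printf("pattern allowed: %s\n", access_pattern_ok(&a) ? "yes" : "no");
    return 0;
}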

To provide fast access to the processor's buffer memory, two types of ECL SRAM multiport chips were specifically produced for the El'brus-3 project: a 128 * 16-bit chip with one port for reading and one port for writing and a 256 * 2-bit chip with four ports for reading and four ports for writing. They can perform read/write operations during one clock period.

The generation of virtual 32-bit addresses is performed by two scalar load/store units and eight vector address generators from the Indexation unit. Six of the latter are used for loading and two for storing in the array element buffer.

The array element buffer must be large enough to accommodate the continuous generation of addresses by six address generators. The generation of a read address is accompanied by the reservation of one word of the array element buffer in which an operand from the local or main memory is to be placed. Therefore, the size of the array element buffer must be greater than the number of address generators times the maximum memory latency for loading data into the array element buffer. The maximum memory latency occurs when data are loaded from main memory (70 clock cycles). Therefore, the array element buffer must be larger than 6 * 70 = 420 words.
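
The sizing argument can be restated in a few lines of C; the constants come from the text, and the program is purely illustrative.

#include <stdio.h>

int main(void)
{
    const int read_generators  = 6;   /* address generators used for loads  */
    const int main_mem_latency = 70;  /* worst-case load latency, in cycles */

    /* Every read address issued reserves one AEB word until the datum
     * arrives, so the buffer must cover all loads that can be in flight. */
    int min_words = read_generators * main_mem_latency;   /* 420 */

    printf("minimum AEB size: %d words (implemented size: 512)\n", min_words);
    return 0;
}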

A cache/memory management unit (CMMU) maps virtual addresses to physical addresses and manages the transfer of data between the buffers, local and main memories, and I/O processors. Before translating virtual addresses, the CMMU checks for hits within the buffers and a global data cache. Up to eight addresses can be translated by table look-aside buffers (TLBs) in one cycle. Then, translated central processor requests (load addresses and both store-addresses and data) are placed into a request buffer. The I/O buffer contains I/O


processor requests. The latter are processed in cycles when the central processor's scalar requests are absent. Up to eight requests can be passed to local and main memories and I/O processors in one cycle. There are eight paths between the CMMU and the local memory: six read and two read/write. The write paths transfer both addresses and data. There are also eight 72-bit (plus one parity bit) output data paths from the local memory.

The 2-megaword local memory with single error correction/double error detection is implemented using fast ECL 16-Kbit SRAMs. It is interleaved on 32 banks with a cycle time of four clock periods. With eight ports it is possible to achieve a bandwidth of 5.76 gigabytes/s transferring 72-bit data. The local memory's latency in loading data to processor buffers is 17 clock periods.

Each main memory section has a 32-megaword storage capacity. It is implemented using slow 256-Kbit DRAMs with a cycle time of 400 ns. The 16-ported main memory section is interleaved on 128 banks with a cycle time of 35 clock periods. It has an average latency of 70 clock periods in loading data to the central processors.

2.3. El'brus-3 VLIW Approach

2.3.1. Procedure-oriented static scheduling model of computation. Like the Trace [Colwell et al. 1987] and the Cydra 5 [Rau et al. 1982; Rau et al. 1989], the El'brus-3 performs a global analysis of a source program. Like the Trace, it analyzes one procedure or module at a time, looking for the critical path of computation. The compiler builds data dependency, control dependency, and procedure call graphs and tries to discover hidden dependencies between addresses to be computed at run time.

In spite of its advantages in achieving greater optimization, however, a global scheduling approach could not be used by the El'brus designers. Because the El'brus compiler must use the same code for all calls to a given procedure, it uses a procedure-oriented static scheduling (PSS) model in which code motion is restricted to within procedure boundaries. Consequently, the El'brus-3 designers used methods of reducing control dependencies in conditional branches and procedure calls (described below) that differ from those used in the Trace and Cydra 5.

The fundamental reason for the PSS model lies in the El'brus design goals and development environment, which differ from those of the Trace and Cydra 5. In contrast to the American computers, the El'brus-3, like the El'brus-1 and -2, was developed almost exclusively for military applications. The two primary requirements were high performance and portability of software. Cost, either in rubles or volume of hardware, was not a strong constraint.

Under these circumstances designers decided to duplicate a set of functional units, to implement an expensive full crossbar switch and distributed register memory for intermediate results, to use multiported register files, and so on. The availability of a sufficient number of functional units makes it possible to execute multiple operations from alternative basic blocks in parallel. In contrast to the Trace, such multitracing with subsequent choice of the correct path does not make use of predictions or compensation code.

To preserve portability of software, the El'brus-3 had to support El'-76, the base language of all models of the El'brus computers. A procedure is a key object in the language. It


is both a building block of the program and a basic computation unit. Designers were obliged to preserve the ability to build a program from separately scheduled procedures without any code duplication. Code motion is therefore limited by procedure boundaries and the same procedure code is used in all calls to a procedure. This contrasts with the in-line procedure substitution used in the Trace and Cydra 5.

Like its predecessors, the El'brus-3 has hardware support for the procedure mechanism, such as mapping the top of the program stack onto a processor's registers, allocating register frames for procedure activations within the stack buffer, and providing display registers for multilevel lexical addressing.³ But in the El'brus-2 there was a significant delay (30 to 40 clock periods) between the issuing of a call operation and the beginning of execution of the called procedure. This delay was necessary to guarantee the correct handling of interrupts and page faults since all issued operations of the calling procedure had to be executed before any instructions of the called procedure were issued.

The designers were faced with a serious challenge: How could they enable a statically scheduled procedure to begin its execution while simultaneously completing all operations issued earlier in any part of the program when there was only one copy of the procedure code? Their solution was to implement an overlapped execution mode in which it was possible for the compiler to detect simultaneous access by the calling and called procedures to functional units, buffers, and the same variables in memory. To resolve conflicts, it forms special time pointers for the procedure's parameters and adds them to any call operation and to the descriptors of the procedure. Specialized hardware uses the timing hints provided by the compiler to ensure correct overlapped execution of two procedures.

Global program analysis enables the compiler to identify when calling and called procedures will have conflicting access to system resources. Such analysis also reveals dependencies between procedures. On the basis of this information, the compiler can try to schedule instructions within procedures such that the overlapping execution of called and calling procedures is optimized. In this manner some interprocedure optimization can be done, in spite of the constraints of the PSS model.

The following are the four stages of procedure execution and the primary operations to be supported:

1. Call preparation: forming a new context for a procedure to be called and saving the current procedure context.

2. Procedure call: setting the new context and synchronizing the passing and using of parameters during the overlapped execution of calling and called procedures.

3. Execution: performing the operations of 1 and 2 for other procedure calls, and restoring the old context of the calling procedure.

4. Return: setting the old context and synchronizing the passing and using of procedure output values.

This process division provides the ability to overlap procedure execution within a central processor using dynamic, instead of static, instruction scheduling. As a result, for a period of time after performing calls/returns the parallel execution of operations of two procedures takes place on one processor. The first instructions of the called procedure are executed simultaneously with the instructions of the calling procedure, if the former are


not dependent on results that have yet to be computed by the latter. The static approaches used in other machines are based on an in-line procedure substitution with long instructions containing operations from different procedures. In the El'brus-3, overlapped procedure execution can exist in two forms. First, the creation of a new context (i.e., forming address and control information and preloading code), initiated by context operations, is performed simultaneously with the execution of operations of the current procedure. Second, the execution of a new procedure can start without waiting for the completion of operations issued by the calling procedure. The following actions must be accomplished for Prepare_Call/Return operations:

• prepare jump;
• form new address values for display registers, using an additional set of display registers;
• save current display-register values in the Control stack (for Call operations);
• load information about procedure parameters in dedicated procedure control registers;
• write other address and control information in the Control stack.

The preparation for any jump (to a procedure or not) is similar and will be discussed later, as will parameter access synchronization.

2.3.2. Branching. Branching in the El'brus has both static and dynamic aspects. A large number of functional units are able to simultaneously perform operations from different basic blocks without waiting for the computation of a branch condition. Although it resembles that of the Trace, the El'brus approach has an important difference. The El'brus-3 compiler does not select the most likely path and does not use compensation code to restore program correctness if some predictions were wrong. Instead, multiple possible paths are executed simultaneously. In general, to branch means to take the results produced by the true path.

Suppose it is necessary to schedule the execution of the following fragment of Fortran code:

100  T = A + B
     IF (T .GT. K*L) GOTO 300
200  X = D + T
     GOTO 400
300  X = T - C
400  K = X + K
     Y = X*K

A, B, C, D, K, L, T, Y are local variables; X is a global variable. The data flow graph of the code fragment is shown in Figure 3a. There are two possible

execution paths: BB0&BB1&BB3 and BB0&BB2&BB3. First, the compiler checks the possibility of simultaneous execution of BB0, BB1, and BB2. The condition path BB0 is a critical path for this code fragment since the condition will not be calculated by the time BB1 and BB2 have completed.

Another problem is that both BB1 and BB2 contain store operations (Stl and St2), but only one of them can be enabled to be performed in the process of parallel execution of


Figure 3a. Data flow graph for branch. (The graph shows basic blocks BB0 through BB3: BB0 computes T and the compare operation C1; BB1 and BB2 contain the alternative store operations St1 and St2 to the global variable X; BB3 computes K and Y. The arcs distinguish data dependencies from control dependencies.)

these alternative blocks. To solve this type of problem, the El'brus-3 instruction set includes both conditional and unconditional versions of operations (load and store) to access memory. Besides standard arguments, the conditional load and store operations require one additional operand, a predicate, to determine whether the operation is to be performed. (Besides these conditional operations for the load/store units, there are also similar operations for the Indexation unit to control the transfer of data between memory and the array element buffer.)

It is interesting to note that while they were developed independently, the El'brus-3 and Cydra 5 approaches to the execution of memory access operations in alternative execution paths--called a conditional mode 2 and a directed dataflow, respectively--are equivalent. It should be pointed out, however, that in contrast to the El'brus-3, all Cydra 5 operations, not just memory access, have predicates as input operands.

In the scheduled code fragment shown in Figure 3b the issue of conditional store operations St1 and St2 is delayed until the compare operation C1 completes. The result of the C1 operation is used as a predicate-operand of both the St1 and St2 operations, guaranteeing that only one of them will be performed.


Figure 3b. Scheduled code fragment. (The figure shows the operations of Figure 3a rescheduled with conditional stores and join operations, with the history result buffer storing the operations' results. Notation: RA is the read operation for the A variable placed in the stack buffer; RHA2 is the read operation for the A2 operation's result placed in the history buffer; [C1] takes the C1 operation's result from the unit's output; Wr(DelJ)(K) is the write operation for the join operation's result to be assigned as the value of the local variable K.)

To avoid a delay in computing BB3, part of the BB3 code has to be inserted into the BB1 and BB2 paths; that is, the compiler expands both the alternative paths. Since the X and K variables, now being computed in both paths, are used in BB3, it is necessary to choose their true values. To make this possible, a special join operation is included in the LU's instruction set. This operation has three arguments: source 1, source 2, and a predicate. The predicate's value determines which of the sources will be the result of the join operation.

Two join operations are used to schedule this code fragment: one for X, another for K. Note that the final code does not contain any branches at all; that is, both alternative blocks will be executed fully. In general, the expansion code size is determined by the time needed to complete the computation of the condition value. After this, execution of only one of the alternative paths continues. The cost of this solution is a small increase in code size.
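
A behavioral C sketch of how the conditional stores and join operations resolve the expanded paths of the Fortran fragment is given below; cond_store and join are hypothetical models of the hardware operations, not actual El'brus mnemonics, and the input values are invented.

#include <stdbool.h>
#include <stdio.h>

static double X;   /* the global variable of the Fortran fragment */

/* Conditional store: performed only when its predicate is true. */
static void cond_store(bool pred, double value, double *dest)
{
    if (pred)
        *dest = value;
}

/* Join: selects one of two speculatively computed values by a predicate. */
static double join(double src1, double src2, bool pred)
{
    return pred ? src1 : src2;
}

int main(void)
{
    double A = 1, B = 2, C = 3, D = 4, K = 5, L = 0.5, Y;

    double T  = A + B;              /* BB0                                   */
    bool   p  = (T > K * L);        /* C1: the branch condition              */
    double x1 = T - C;              /* BB2, executed without waiting for p   */
    double x2 = D + T;              /* BB1, executed without waiting for p   */
    double k1 = x1 + K;             /* BB3 code expanded into the BB2 path   */
    double k2 = x2 + K;             /* BB3 code expanded into the BB1 path   */

    cond_store(p,  x1, &X);         /* St2: store from the taken path        */
    cond_store(!p, x2, &X);         /* St1: store from the fall-through path */

    double x = join(x1, x2, p);     /* first join: the true value of X       */
    K        = join(k1, k2, p);     /* second join: the true value of K      */
    Y        = x * K;               /* remaining BB3 operation               */

    printf("X=%g K=%g Y=%g\n", X, K, Y);
    return 0;
}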

It is important to point out the difference between compensation and expansion codes. The former is used to undo the logical inconsistencies arising in cases where the single


execution trace was predicted incorrectly. The latter executes all possible paths, making it possible to implement simultaneous multitracing with the subsequent (delayed) choice of one of the traces.

To handle exceptions for such look-ahead computation correctly, a special conditional mode 1 is provided. Each operation to be executed within expanded parts of alternative paths is equipped with a special bit indicating the conditional mode of its execution. Exceptions are taken into account only if they are produced by operations from the true path. The use of the conditional mode 1 is quite similar to the method used in the Trace machines [Colwell et al. 1987].

Some new ideas can be found in dynamic jump scheduling. The main feature is the preparation of some jumps during the execution of the main path of computation. Subsequently, any of the prepared jumps may be chosen for execution. The chosen jump can be taken or not, depending on its condition value.

To support such branch scheduling, three pairs of branch target and trigger function registers with Prepare/Perform_Jump operations on them are provided. The same registers are used for Prepare_Call/Return operations.

When any Prepare operation is performed, the required code page is fetched into an instruction buffer, and then the target instruction is placed in a specified branch target register. A specified value is set in the trigger function register, which is coupled with the branch target register. A trigger function operates on a condition code register. The function determines a branch condition, masking required condition code register bits and executing an OR operation on them. Computation of the trigger function register value takes place every cycle. For Jump operations, executing an instruction from the branch target register occurs only if the trigger function value is true. Otherwise, the instruction following the Perform_Jump instruction is taken.

The specification of the number of the branch target register (and the trigger function register) by operations can be either explicit (by indicating its number) or implicit. In the latter case, two of three branch target registers are specified by a special 3-bit mask. The chosen registers are prioritized from left to right. Checking from left to right, the first register with a trigger function value of true will be used as the target. This branch method is widely used to implement high-level language operations, such as a switch or a case.
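
The following C sketch models the trigger function and the implicit left-to-right priority selection among the branch target registers. The register layout and field widths are assumptions made for illustration, and the mask is modeled as naming any subset of the three registers.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One branch target / trigger function register pair (assumed layout). */
struct BranchReg {
    uint32_t target;       /* address of the prepared target instruction  */
    uint32_t cond_mask;    /* which condition code bits this jump tests   */
};

/* Trigger function: mask the condition code register and OR the selected
 * bits, as described in the text. */
static bool trigger(const struct BranchReg *r, uint32_t cond_code)
{
    return (cond_code & r->cond_mask) != 0;
}

/* Implicit selection: a 3-bit mask names the candidate registers; they are
 * checked left to right and the first one whose trigger is true wins.
 * Returns the chosen register index, or -1 for fall-through. */
static int select_branch(const struct BranchReg regs[3], uint32_t cond_code,
                         unsigned mask3)
{
    for (int i = 0; i < 3; i++)
        if ((mask3 >> (2 - i)) & 1)           /* mask bit for register i   */
            if (trigger(&regs[i], cond_code))
                return i;
    return -1;                                 /* take the next instruction */
}

int main(void)
{
    struct BranchReg regs[3] = {
        { 0x1000, 0x1 }, { 0x2000, 0x2 }, { 0x3000, 0x4 }
    };
    uint32_t cond_code = 0x2;                  /* only condition bit 1 set  */
    int sel = select_branch(regs, cond_code, 0x7);
    printf("selected branch register: %d\n", sel);   /* prints 1 */
    return 0;
}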

A similar capability of multiway branching with a software-controlled priority scheme was implemented in the Trace. Both architectures can pack multiple jumps into a single instruction. This enables the simultaneous execution of multiple jumps, with the priority of possible paths of computation being based on the original ordering of their tests in a program. The difference between them lies in the implementation. The El'brus approach appears more advanced, however, because it provides not only the selection and calculation of potential branch addresses (as for the Trace), but also a look-ahead, software-controlled loading of the branch target code.

2.3.3. Loops. One of the goals of the El'brus-3 design was to achieve vector supercomputer performance while providing the capability of processing loops with recurrences and conditional branches. Designers of the Cydra 5 computer addressed a similar task. But the El'brus plan is more ambitious: to provide high performance by supporting nested loops with recurrences and conditional branches for regular data structure processing.


As in the Cydra 5, the El'brus approach is based on the ability to store and access multiple loop iteration contexts during loop processing. For each loop iteration a set of registers, called an iteration frame, is dynamically allocated within the circular array element buffer. All variables to be processed during a loop iteration are placed within its iteration frame. Such variables can be both array or vector elements and scalars. They are accessed by means of base loop registers, each of which holds an iteration pointer and the iteration size. Any array element buffer address is the sum of an instruction's displacement and the pointer.
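
A minimal C model of iteration frame addressing is sketched below. It assumes word granularity and uses the 128-word circular quadrant described later in the text as the wrap-around limit; the structure and function names are ours.

#include <stdio.h>

#define AEB_QUADRANT_WORDS 128   /* one circular quadrant of the 512-word AEB */

/* Loop base register: iteration pointer plus iteration (frame) size. */
struct LoopBase {
    unsigned ptr;    /* start of the current iteration frame in the quadrant */
    unsigned size;   /* words per iteration frame                            */
};

/* Address of a loop variable: the instruction's displacement added to the
 * iteration pointer, wrapped around the circular quadrant. */
static unsigned aeb_addr(const struct LoopBase *b, unsigned disp)
{
    return (b->ptr + disp) % AEB_QUADRANT_WORDS;
}

/* Advancing to the next iteration moves the pointer by one frame; the
 * previous frame's words remain accessible for recurrences. */
static void next_iteration(struct LoopBase *b)
{
    b->ptr = (b->ptr + b->size) % AEB_QUADRANT_WORDS;
}

int main(void)
{
    struct LoopBase base = { .ptr = 0, .size = 2 };  /* frame: { A(i), B(i) } */
    for (int i = 0; i < 3; i++) {
        printf("iter %d: A at %u, B at %u, A(i-1) at %u\n",
               i, aeb_addr(&base, 0), aeb_addr(&base, 1),
               (aeb_addr(&base, 0) + AEB_QUADRANT_WORDS - base.size)
                   % AEB_QUADRANT_WORDS);
        next_iteration(&base);
    }
    return 0;
}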

The capability of fast random access to vector elements has two primary advantages. On the one hand, the use of such a register file makes it possible to attain vector processing performance comparable to that of a vector supercomputer. In comparison with the latter, the El'brus-3 has additional overhead costs from adding the displacement to the loop base register.

On the other hand, random access to vector elements is much more flexible than in the case of vector registers. First, the size of an iteration frame and the location of any loop variable within the frame are known at compile time. Second, multiple iteration frames can be accessed during one clock period. As a result, any loop variable within the frames placed in the array element buffer can be accessed. The iteration frame approach is superior to vector registers in processing loops with recurrences. There are three frame types: previous, current, and next. A simple case is shown in Figure 4.

This organization of the array element buffer makes it possible to perform operations within the current iteration using, when necessary, data from previous iterations. A look-ahead load of other vector elements from memory for the next iteration takes place simultaneously with the computation of the current iteration. The look-ahead data load is possible due to the regularity of data structures processed in loops. (Other nonregular loop data are placed in the stack buffer.)

Figure 4. Iteration frames. (A loop base register points into the array element buffer, where consecutive frames hold the elements of the previous iteration, A(i-1) and B(i-1), of the current iteration, A(i) and B(i), and of the next iteration, A(i+1) and B(i+1).)


The look-ahead load of data for the next iteration makes it possible to speed up the execution of loops since no time is wasted waiting for data to be fetched from memory. To provide this capability, some hardware support was required.

More complex control is provided by means of the conditional mode 1, implemented in hardware. In this mode, exception handling for the look-ahead data loading process is suspended until a future iteration becomes the current one. The exceptions will be killed if the preload is wasteful (i.e., the current iteration is the last). The compiler provides the code necessary to restore the true values of descriptors and other registers changed by the wasteful data load.

In addition to providing mechanisms to support operations on multiple iterations of one loop, the El'brus designers implemented some hardware mechanisms to help handle nested loops. Specifically, they help support interloop jumps, allocation of separate register areas within the array element buffer for multiple loop contexts, and loop counting. Details will be given after considering the array element buffer organization.

Architectural support for loops is provided for the following actions:

• allocating registers in the array element buffer for each iteration;
• storing an iteration frame history;
• simultaneously accessing multiple iteration frames;
• look-ahead loading of data for the next iteration;
• handling nested loops.

In contrast, no architectural support for loops exists in the Trace, where loop unrolling in software is used. In the Cydra 5, the first three actions are supported, although the third is provided only for multiple iterations of the innermost loop. The Cydra 5 also supports the option of look-ahead load through a compiler switch that disables all memory exceptions.

We now consider the details of the iteration frame load mechanism. The access to a regular data structure is defined by its descriptor and index variable. The descriptor contains a base address and the number of structure elements in local or main memory. The index variable consists of initial and increment index values. Pairs of descriptors and index variables, called address packs, are address generation invariants during a loop execution.

Address generation is accomplished by eight address generators in the Indexation unit, with six for reading and two for writing. The address generators get their operands (address packs) and commands indirectly through two load/store units and place them in two buffers: an address parameter buffer and a generation control buffer.

Any 128-bit command from the generation control buffer controls all address generators, specifying for each of them the address pack and one of the following operations:

• compute an address;
• compute an address and an index;
• compute an index;
• no-op.

A machine instruction, conditionally or unconditionally, initiates an address computation by using a short 4-bit field to specify the address of the command in the generation


control buffer. As the result of the execution of the fetched address command, each generator can pass one 32-bit virtual address to the cache/memory management unit. This two-level control scheme reduces the number of bits required for the main instruction, but increases the start-up time for a loop execution.
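
The address pack mechanism can be sketched roughly as follows in C; word addressing and the field layout are simplifying assumptions, and the enum of generator operations mirrors the list above.

#include <stdint.h>
#include <stdio.h>

/* An address pack: descriptor plus index variable (loop invariants). */
struct AddressPack {
    uint32_t base;        /* base virtual address of the structure      */
    uint32_t length;      /* number of elements                         */
    int32_t  index;       /* current index value                        */
    int32_t  increment;   /* index increment per iteration              */
};

enum GenOp { GEN_NOP, GEN_ADDR, GEN_ADDR_INDEX, GEN_INDEX };

/* One step of one address generator: optionally emit a virtual address
 * for the CMMU and optionally advance the index. */
static int generate(struct AddressPack *p, enum GenOp op, uint32_t *vaddr)
{
    int emitted = 0;
    if (op == GEN_ADDR || op == GEN_ADDR_INDEX) {
        if ((uint32_t)p->index < p->length) {
            *vaddr = p->base + (uint32_t)p->index;
            emitted = 1;
        }
    }
    if (op == GEN_INDEX || op == GEN_ADDR_INDEX)
        p->index += p->increment;
    return emitted;
}

int main(void)
{
    struct AddressPack a = { .base = 0x10000, .length = 100,
                             .index = 0, .increment = 1 };
    for (int i = 0; i < 4; i++) {
        uint32_t va;
        if (generate(&a, GEN_ADDR_INDEX, &va))
            printf("cycle %d: load address 0x%x\n", i, va);
    }
    return 0;
}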

The array element buffer consists of four independent circular 128-word buffers, called quadrants, for storing four different iteration frame histories. After loading the initial value, each loop base register is linked with a specific quadrant. A main instruction has a field to control the advancement of all of the loop base registers.

The segmentation of the array element buffer into quadrants is used so that the working contexts of multiple nested loops can be stored with independent access for each of them. Loop-counting functions for these loops are performed on three dedicated loop counters in the Instruction unit. To provide fast jumps in and out of loops, three branch target registers can be scheduled for preparation of the jumps.

2.3.4. Contention between functional units in accessing shared registers. The stack buffer and the array element buffer are shared register files for the seven functional units. Their aggregate data bandwidth enables up to eight read and four write operations for the units' results to be performed per cycle. Nevertheless, this is not sufficient to feed all of the functional units and eliminate contention between them.

To solve contention problems, a special interconnect structure was implemented. It is a full crossbar switch, expanded by the dedicated history result buffers, located between the output of each functional unit and its switch input.

The buffers are synchronous circular queues. Each of them holds the unit's output values for a fixed number of cycles. The write location is given by a Write pointer, which increments its value in each cycle regardless of whether any functional unit operation finishes or not.

Besides being placed in the history result buffer, a result in normal execution mode always passes to the crossbar. It can be used as an input operand immediately and/or after some cycles. In the latter case, it must be read from the history result buffer by an operation that indicates the number of cycles after its computation. This time value is a negative 3-bit displacement from a Read pointer, which, in general, coincides with the Write pointer. (A special case will be discussed later.)
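
A behavioral C model of one history result buffer is sketched below; it assumes normal mode, in which the Read and Write pointers coincide, and the names are illustrative only.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HRB_WORDS 128   /* size set by the available 128 * 16-bit SRAMs */

/* One functional unit's history result buffer (behavioral model). */
struct HRB {
    uint64_t word[HRB_WORDS];
    unsigned wr;   /* write pointer: advances every cycle, result or not */
};

/* End of a cycle: store the unit's result (if any) and advance the pointer. */
static void hrb_cycle(struct HRB *h, bool has_result, uint64_t result)
{
    if (has_result)
        h->word[h->wr] = result;
    h->wr = (h->wr + 1) % HRB_WORDS;
}

/* Read a result produced 'cycles_ago' cycles earlier (1..7 in the real
 * machine, encoded as a 3-bit displacement from the Read pointer; here the
 * Read pointer is assumed equal to the Write pointer, i.e. normal mode). */
static uint64_t hrb_read(const struct HRB *h, unsigned cycles_ago)
{
    return h->word[(h->wr + HRB_WORDS - cycles_ago) % HRB_WORDS];
}

int main(void)
{
    struct HRB adder = { .wr = 0 };
    hrb_cycle(&adder, true, 42);    /* cycle 0: adder produces 42            */
    hrb_cycle(&adder, false, 0);    /* cycle 1: no result, pointer advances  */
    hrb_cycle(&adder, true, 7);     /* cycle 2: adder produces 7             */
    printf("result from 3 cycles ago: %llu\n",
           (unsigned long long)hrb_read(&adder, 3));   /* prints 42 */
    return 0;
}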

The history result buffers are provided to overcome some scheduling problems. Suppose there were no history result buffers, only the buffer memory shown in Figure 2. A compiler would have only two ways to provide operands for the functional units: from the units' outputs and from buffer memory. Data from a functional unit can be fetched in the same clock period as an operation from the Instruction unit, that is, without any delay from the point of view of the compiler. But to store a data element in the buffer memory and then send it to the functional units as an operand requires six to seven clock cycles. A schedule hole occurs during which this datum cannot be used. The history result buffers are used to close the schedule hole by allowing much faster access to these data without using the stack and array element buffers. Special operations can read data computed within the last seven cycles from the history result buffers.


The ability to store units' results for this time period is sufficient to handle the computation process in a normal mode. But the history result buffers had to be designed to provide a correct execution in the case of exceptions, too. When an exception occurs, the instruction issue is stopped. All previously issued units' operations are executed to completion. This means that none of the buffers' read operations will occur for at least the time needed to drain the longest functional unit pipeline. Thus, the size of any history result buffer has to be greater than the length of the longest pipeline. A storage capacity of 32 words would be sufficient for any history buffer, but a larger size of 128 words was used because engineers had only special 128 * 16-bit memory chips with separate read and write ports available with which to work.

Using synchronous memory with conflict-free access dramatically simplifies the task of scheduling access to the shared register files. At the same time, storing a result history for each unit enables a computation process to be restarted in some exceptional cases. This restart capability is supported by the implementation of other history queues for units' operands, write addresses, control information, and program counter values.

The idea of using synchronous memory is not new. Earlier, similar ideas were implemented in the MARS-M [Vishnevskiy 1982] and the Cydra 5 [Rau et al. 1989] computers. One of the main features of polycyclic architectures [Rau et al. 1982] is delay elements for simplifying scheduling tasks. But the implementations of the idea are quite different; the El'brus-3 uses a history approach, whereas others use a delay approach.

2.3.5. Instructions. Instructions consist of a set of operations and an instruction format. There are six groups of the El'brus operations:

1. control and subroutine;
2. read buffers/registers;
3. write buffers/registers;
4. loop control;
5. arithmetic, logical;
6. load/store.

Operations can be both half-single (32-bit data and 4-bit tag) and single precision (64-bit data and 8-bit tag). There are also operations to work with any word of double-precision (128-bit) data.

There are two types of arithmetic operations: integer and universal. The universal operations can use both integers and floating point data as their operands. According to the hardware-recognized data tags, the functional unit hardware makes a choice of operation type and output data format. Both integer and universal operations are executed by the same functional units.

Communication with the local and main memories is accomplished both explicitly, by load/store operations, and implicitly, by loop-control operations. The latter initiate an address generation by the address generators.

The El'brus-3 approach to instruction encoding is the same as for the Trace: a packed variable-length memory representation of unpacked fixed-length machine instructions. The packed El'brus-3 instruction can consist of one to four 72-bit words. Each of them is divided


into four 16-bit parcels. Tag fields and the first parcel of the packed instruction contain an operation mask. Bits of the mask indicate the presence of operations associated with them within the packed instruction. Locations of some operations within the instruction are fixed and others are variable. The order of the mask bits that are set corresponds to the ordering of their corresponding operations in the packed instruction. The unpacked instruction is 504 bits wide.⁴ It has a fixed location for each operation.
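
The packing scheme can be illustrated with the following C sketch; the number of operation slots and the 16-bit operation encoding are assumptions chosen only to show how the mask drives the expansion into a fixed-length instruction.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLOTS 16   /* number of operation slots, assumed for illustration */

/* Unpack a packed instruction: the mask bit for slot i says whether an
 * operation for that slot is present; present operations are stored
 * consecutively after the mask, in mask-bit order, and are expanded here
 * into a fixed-length instruction with one field per slot (0 = no-op). */
static void unpack(uint16_t mask, const uint16_t *packed_ops,
                   uint16_t unpacked[SLOTS])
{
    memset(unpacked, 0, SLOTS * sizeof(uint16_t));
    int next = 0;
    for (int slot = 0; slot < SLOTS; slot++)
        if ((mask >> slot) & 1)
            unpacked[slot] = packed_ops[next++];
}

int main(void)
{
    /* Operations present for slots 0, 3 and 5 only. */
    uint16_t mask = (1u << 0) | (1u << 3) | (1u << 5);
    uint16_t ops[] = { 0xA001, 0xB003, 0xC005 };
    uint16_t inst[SLOTS];

    unpack(mask, ops, inst);
    for (int i = 0; i < SLOTS; i++)
        if (inst[i])
            printf("slot %d: op 0x%04X\n", i, inst[i]);
    return 0;
}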

A program code divided into 128-word code pages is placed in physical memory (i.e., no run-time address translation is needed to access it). The code pages are loaded into a 16-page instruction buffer during a jump preparation. In fact, some form of a software-controlled cache fill by Prepare_Jump operations is used here. This approach is quite different from those used by the Trace and Cydra 5. For example, in the Trace a special cache refill engine performs run-time scheduling of instructions fetched from memory and passes them to the caches.

2.3.6. Synchronization and exception handling. There are some cases in which the static schedule of an El'brus computation process can be altered at run time: overlapped execution for two independently scheduled procedures, communication with asynchronous memory, and exception handling.

The first case is specific to the El'brus; the others are traditional challenges for any VLIW architecture. Synchronization in the first case is necessary because of the possibility of access conflicts. Two procedures may wish to write results simultaneously into the same location of the history result buffer. Also, two procedures may interfere with the correct sequence of read/write operations in the Stack buffer (for example, reading parameters before writing them).

To solve the first problem, specific issue delays are inserted into instructions. Each instruction can contain a field that specifies in how many cycles the instruction following the current one is to be issued.

In addition, there are limitations on the possible positions within a procedure for operations that compute parameters. When the adder is used, for example, conflict between two procedures can arise due to the different execution times of its integer and universal operations (four and six cycles, respectively). To resolve this, the adder's universal operations have to be issued no later than two cycles before any Perform_Call/Return operation. The adder's integer operations cannot be issued in the first two cycles after entering or returning from any procedure. Similar solutions were used to resolve the conflicts of writing parameters in the stack buffer by the units with different execution times. The divider's operation, for example, has to be issued no later than 15 cycles before any procedure call/return. In other cases a nonlocal static analysis for parameter passing is used.

The El'brus-3 provides hardware support for resolving the access parameter conflicts for overlapped procedure execution. This is done by inserting a run-time delay when performing procedure calls/returns.

To compute the delay value, on the one hand, a compiler forms two time scales for each procedure to show the chronologies of first reading its input parameters from and writing its output parameters to the stack buffer. Time values within the scales are given in cycles after performing a call/return.


On the other hand, Prepare_Call operations contain other time scales. These specify how many cycles after the procedure call and return its input parameters will be written to, and its output will be read from, the stack buffer. The problem is to fit information about an arbitrary number of procedure parameters into the fixed-length time field for any scale within an instruction. The solution is to divide any time field into four parameter fields. Each of the first three contains time information about each of the first three procedure input or output parameters. If the number of parameters exceeds four, the fourth field contains the maximum time among the remaining parameters, that is, it specifies the worst case.

When performing Prepare_Call/Return operations, the Instruction unit computes the necessary time delays, subtracting the read times from the write times of the scales. The computed time values, if they are positive, delay the issue of instructions following Call/Return operations.
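
The delay computation can be sketched in C as follows; the four-field scales and the subtraction rule come from the text, while the example times are invented.

#include <stdio.h>

#define FIELDS 4   /* three individual parameter fields plus one worst case */

/* Compute the issue delay after a call: for each parameter field, the caller
 * writes the parameter write_time[i] cycles after the call and the callee
 * first reads it read_time[i] cycles after the call; the hardware must delay
 * instruction issue by the largest positive difference. */
static int call_issue_delay(const int write_time[FIELDS],
                            const int read_time[FIELDS])
{
    int delay = 0;
    for (int i = 0; i < FIELDS; i++) {
        int d = write_time[i] - read_time[i];
        if (d > delay)
            delay = d;
    }
    return delay;
}

int main(void)
{
    /* Hypothetical scales, in cycles after the call operation. */
    int writes[FIELDS] = { 2, 5, 9, 12 };  /* caller writes parameters       */
    int reads[FIELDS]  = { 4, 4, 6,  8 };  /* callee first reads parameters  */

    printf("issue delay after call: %d cycles\n",
           call_issue_delay(writes, reads));   /* prints 4 */
    return 0;
}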

Synchronization is also needed to avoid accessing data that have not yet been loaded from the local and main memories into buffer memories (the stack buffer and array element buffer), or branch target registers that are not ready when performing Jump operations. To detect these situations, a synchronization bit is provided for every branch target register and for each buffer memory register. Any access to registers that are not ready stops an instruction issue, allowing operations already issued to complete.

In contrast to the Trace and Cydra 5, the El'brus-3 features the use of common, shared memory that is scheduled nonstatically. Clearly, the El'brus designers had no choice because static scheduling of memory is not possible for multiprocessors with shared memory.

To solve the tasks of synchronization and exception handling, functional unit pipelines with self-draining are used. All addresses needed to complete any operation are packed together with the operation code in the same instruction. Such a solution was used in the Trace, but the presence of synchronous history result buffers results in a problem specific to the El'brus-3. The problem is that a Read history result buffer operation specifies how many cycles ago the desired result was computed. Stopping an instruction issue, while allowing issued instructions to finish, destroys the static schedule. A solution is to provide two different time pointers for every history result buffer: a Read pointer (common for all units) and a Write pointer (private for every unit).

As has been mentioned, in normal execution mode both pointers coincide and increment their values in every cycle. Stopping an instruction issue freezes all Read pointers. If there are operations in progress for some units, their Write pointers continue to advance their values until the operations have completed. When an instruction issue is resumed, all Read pointers are unfrozen, but the Write pointers are incremented only if or when both pointers coincide. In the case when the Read and Write pointers are not equal, data from the history result buffer, rather than data from the unit's output, should be placed on the crossbar input. In this case, every functional unit provides in hardware the transfer of the data specified by the Read pointer.
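
The pointer management rules just described can be modeled with the short C sketch below; the cycle-by-cycle schedule in main is an invented example used only to show the freeze-and-resynchronize behavior.

#include <stdbool.h>
#include <stdio.h>

#define HRB_WORDS 128

/* The common Read pointer and one unit's Write pointer. */
struct Pointers {
    unsigned read;
    unsigned write;
};

/* One clock period of pointer management: the Read pointer advances only
 * while instructions are being issued; the Write pointer advances while the
 * unit drains operations during a stall, and after issue resumes it advances
 * only once the two pointers coincide again. */
static void pointer_cycle(struct Pointers *p, bool issue_stopped,
                          bool op_in_flight)
{
    bool coincided = (p->write == p->read);

    if (!issue_stopped)
        p->read = (p->read + 1) % HRB_WORDS;

    bool advance_write = issue_stopped ? op_in_flight : coincided;
    if (advance_write)
        p->write = (p->write + 1) % HRB_WORDS;
}

int main(void)
{
    struct Pointers p = { 0, 0 };
    /* two normal cycles, a three-cycle stall with one result still draining,
     * then issue resumes */
    bool stopped[8]   = { false, false, true, true,  true,  false, false, false };
    bool in_flight[8] = { false, false, true, false, false, false, false, false };

    for (int c = 0; c < 8; c++) {
        pointer_cycle(&p, stopped[c], in_flight[c]);
        printf("cycle %d: read=%u write=%u operands from %s\n", c, p.read,
               p.write, p.read == p.write ? "unit outputs" : "the HRB");
    }
    return 0;
}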

Currently, several El'brus-3 prototype units are being assembled.

3. MARS-M

The MARS project originated in work on parallel programming languages carried out at the Computing Center of the Siberian Department of the USSR Academy of Sciences in


Novosibirsk during the 1970s. The researchers concluded that an asynchronous model of computation with shared memory would best accommodate their new parallel programming language with hierarchical structure. Each of the language levels features specific mechanisms for object management, process control, and communication. By 1978 basic concepts of a high-level system called Modular, Asynchronous, Expandable Systems (MARS) based on this model had been formulated [Marchuk and Kotov 1978].

At that time MARS was just one of many purely academic research projects. To progress beyond the paper stage, research projects in the Soviet Academy of Sciences had to obtain support from industry and policy makers. MARS was implemented thanks to the efforts of G.I. Marchuk, the president of the State Committee on Science and Technology, and V.S. Burtsev and G.G. Ryabov, directors of the Institute of Precision Mechanics and Computer Technology (ITMVT). The alliance of the Computing Center and ITMVT was vitally necessary for MARS designers, enabling them to get access to the El'brus technology and CAD systems. (No high-technology materials and components were accessible for a project not involved in a state order.) By 1981 the architecture of a prototype MARS system, the MARS-M computer, had been developed at the Computing Center [Vishnevskiy 1982]. The architecture combined a variety of design concepts into one system: data flow, very long instruction words, decoupled heterogeneous processors, and a hierarchical structure.

The tight coupling with ITMVT significantly influenced the implementation of the project. The MARS-M computer had to be incorporated into the El'brus system as a dedicated processor. This requirement meant that the MARS-M lacked its own input/output processor and secondary storage; customers would therefore have to buy both computers even if they needed only the MARS-M. Another constraint was the requirement that all MARS-M hardware fit into three water-cooled racks, like a regular El'brus-2 CPU. The tri-spoke configuration of the racks and the need to use the El'brus cooling and power supply systems designed for it made it impossible to add a fourth rack. For this reason the designers were forced to implement the prototype as a 48-bit machine rather than the 64-bit machine that had been planned.

A single unit of the MARS-M computer was manufactured and installed in the Novosibirsk Division of ITMVT in 1988. In 1990, after two more years of work, the project was stopped due to the lack of state funding.

3.1. MARS-M Architecture

In this section we discuss key features of the MARS approach to building an adaptive, high-performance computing system. To satisfy the often conflicting goals of adaptability and high performance, a hierarchical architecture with several levels and distributed control was developed. This architecture is characterized by a combination of hard skeleton and soft tuning elements at each level. The MARS-M is a shared-memory heterogeneous multiprocessor having a control processor, a central processing unit, a peripheral subsystem, a memory management unit, and multiported main memory. The control processor consists of eight (virtual) system processors, and the central processing unit consists of a control subsystem and four address and four execution processors. These elements constitute the skeleton. Figure 5 shows the logical structure of the MARS-M.


Figure 5. Logical structure of the MARS-M. (The diagram shows the control processor with its system processors, the main memory subsystem, the central processing unit with its memory access, control, and execution subsystems, and the peripheral subsystem, connected by channels.)

The tuning elements give the user the ability to specify resource management policies for virtual and physical memory allocation, system process scheduling, allocation of the central processing unit, and so forth, and to choose a set of operations typical for an application domain.

3.1.1. Hierarchical parallel processing. The MARS-M has three architectural levels: system, application, and functional. Each level has a corresponding model of computation, language objects processed at that level, and hardware elements to execute those objects. Table 1 shows the level features of the MARS-M.

Table 1. Levels and their features in the MARS-M architecture.

    Level          Model of Computation       Language Objects              Execution of the Objects

    System         Virtual heterogeneous      Modules and I/O operators     By the control processor and
                   multiprocessor                                           the peripheral subsystem

    Application    Decoupled architecture     Application operators         By the central processing unit

    Functional     VLIW architecture          Address and execution         By functional units of the
                                              fragments                     address and execution processors

THE EL'BRUS-3 AND MARS-M 25

Because the MARS-M is oriented towards a multilevel language, we begin by discussing the native MARS-M language, called KOKOS (Constructor of Comfortable Operation Scope). KOKOS provides constructs for defining and manipulating objects at multiple levels and combining them into a unified program. The operations of the application level are not predetermined, but can be selected to suit the application field. A KOKOS program is built of three levels of compound computation objects (Figure 6):

• system modules;
• application and input/output operators;
• address and execution fragments.

System and application processes are created by calling modules and application operators, respectively. Different programming computation models are used to handle parallel processing within these language entities.

System level. Computation within modules is defined by so-called control expressions. These contain instructions that call modules, operators, other control expressions, and so forth. The execution of control expressions is called a system thread; the execution of input/output operators is called an external communication thread. The external communication threads perform data exchange between the MARS-M and El'brus memories. New system and application processes can only be created through the execution of system threads.

Figure 6. Executable objects of the KOKOS language.


A new system thread can be generated both explicitly, by a Parallel_call operation invoking a control expression, and implicitly, with the help of a token control mechanism. Every control expression can be linked to a specific control port having a number of entries. There are special operations to send tokens to indicated port entries. After at least one token is received at each entry of a port, execution of the first instruction of the expression attached to that port will be initiated. This token mechanism is also used to synchronize execution of multiple threads at this level.

In general, this control mechanism is similar to macro (procedure-level) data flow mechanisms. Its specifics lie in the objects whose execution is to be dynamically scheduled: the MARS-M's modules, operators, control expressions, and system operations.

Application level. Like the central processing unit that supports it, the computation model for the application operators is based on a decoupled multiprocessor architecture. For brevity, we will use the term operator to refer to an application operator. Operators are built of three types of operations: control, address, and execution. The latter two are called address and execution fragments. They are functions for performing complex computation and memory access operations that are typical for an application field. Different applications can have different sets of elementary address and execution operations. The execution of an address fragment by an address processor, an execution fragment by an execution processor, and control instructions by the control subsystem are called address, execution, and control threads, respectively. Up to four address, four execution, and one control thread can run simultaneously at this level. These executing threads communicate by means of queues of different types.

There are two asynchronous mechanisms to schedule parallel processing within an operator: data flow and control token. The data flow mechanism is hidden from the programmer. It is built into hardware to synchronize communication between fragments through any queues. The control token mechanism is used by the programmer to impose additional synchronization control on decoupled multithread processing. Here, thread exchange is enforced by means of special control messages sent through branch queues.
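
The hidden data flow rule can be illustrated with a small Python sketch, assuming invented names (ChannelQueue, step_address_thread, step_execution_thread): an operation fires only when its input queue holds an operand and its output queue has free space, which is what decouples the address and execution threads.

    # Illustrative sketch (invented names) of the hidden data-flow firing rule.
    from collections import deque

    QUEUE_DEPTH = 8                      # MARS-M channel queues are eight words deep

    class ChannelQueue:
        def __init__(self):
            self.q = deque()
        def can_read(self):
            return len(self.q) > 0
        def can_write(self):
            return len(self.q) < QUEUE_DEPTH
        def read(self):
            return self.q.popleft()
        def write(self, x):
            self.q.append(x)

    def step_address_thread(memory, out_q, state):
        # Address thread: fetch the next array element into the data queue.
        if state["i"] < len(memory) and out_q.can_write():
            out_q.write(memory[state["i"]])
            state["i"] += 1

    def step_execution_thread(in_q, results):
        # Execution thread: fire only when an operand is available.
        if in_q.can_read():
            results.append(2.0 * in_q.read())    # stand-in for a real fragment

    memory, q, results, state = list(range(20)), ChannelQueue(), [], {"i": 0}
    while len(results) < len(memory):            # the two threads run decoupled
        step_address_thread(memory, q, state)
        step_execution_thread(q, results)
    assert results[:3] == [0.0, 2.0, 4.0]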

Functional level. Execution of each fragment is statically scheduled by a so-called fragment compiler. In every cycle, one long address instruction can issue operations to all address functional units, and one execution instruction to all execution functional units.

3.1.2. The structure of the MARS-M. The distinctive design features of the MARS-M implementation are

• the heterogeneous multiprocessor organization with multiported shared memory and control, central, and input/output processors;

• the virtual multiprocessor organization of the control VLIW processor;
• the multiprocessor decoupled architecture of the central processing unit;
• static instruction scheduling with multiple sequencers to issue instructions from distributed instruction memory to multiple functional units within the address and execution VLIW processors.

The MARS-M block diagram is shown in Figure 7.


Figure 7. Structure of the MARS-M. (The diagram shows the execution, address, and control subsystems of the central processing unit with their functional units and control modules, the control processor, the peripheral subsystem, the memory management unit, and the four-port, 16-bank main memory of 2M 57-bit words, interconnected by channels and crossbar switches.)

The MARS-M consists of a control processor for executing system functions, a central processing unit for numerical computation, a peripheral subsystem, a memory management unit (MMU), and main memory. The central processing unit is made up of three subsystems: control, execution, and address. Communication between these subsystems and memory takes place through multiple eight-word queues (sets of registers with a FIFO discipline for reads and writes) that are combined in groups called channels. Each channel consists of from two to eight circular FIFO queues that share one input and one output channel path. One read and one write operation can be performed on any two queues of a channel in one clock period. All queues are divided into two groups: the central processing unit contains dynamically scheduled queues, whereas static queues are used in the control processor and peripheral subsystem. Of the dynamically scheduled queues,

• 48 are within six 48-bit data channels;
• 32 are within four 24-bit data channels;
• 24 are within three 25-bit address channels;
• 32 are within four 1-bit branch channels.

Of the static queues,

• 8 are within four 48-bit data channels;
• 8 are within three 25-bit address channels.

The execution and address subsystems both consist of four processors sharing a set of functional units. The execution subsystem contains a control unit and 9 functional units with their control modules; the address subsystem contains a control unit and 11 functional units. In each cycle, operations can be issued to all of the units. The address and execution processors execute address and execution fragments. A fragment dispatcher schedules the processors at run time. The execution of a given fragment is scheduled statically by the compiler. The multiple control modules located in the execution and address subsystems issue fragment instructions from the distributed instruction memory to the functional units. Up to 10 and 12 operations can be issued simultaneously by one instruction in the execution and address subsystems, respectively.

The control processor is implemented as a virtual multiprocessor consisting of eight virtual system processors. They share seven system functional units. A special hardware manager provides all necessary support for system process control. Up to three operations can be issued to the units in one clock period.

The peripheral subsystem communicates with the MARS-M main memory and the El'brus processor-memory crossbar switch, performing conversion between 48- and 64-bit data. Its bandwidth is 55.5 megabytes/s.

The memory management unit performs two tasks. First, it links address and data channels at run time, using a channel linkage table for each of six address channels. Four of the latter are load channels and two are store channels. Second, it maps virtual 25-bit addresses to physical 21-bit addresses and transfers data between queues and memory. Only four of the six translated addresses can be passed to the main memory in one clock period; physical space constraints forced the designers to limit the number of channels and ports.

The 2-megaword main memory with single error correction/double error detection is implemented using slow 16-Kbit DRAMs. It has 4 ports and 16 memory banks with a cycle time of 4 clock periods. The latency in loading data to processor queues is 15 clock periods. The memory bandwidth is 222 megabytes/s.


SSI and MSI ECL elements with a minimum delay of 2.5 ns are used to implement the processor logic. Initially, the clock period was 100 ns, but it was increased to 108 ns at the debugging stage. The MARS-M aggregate peak performance of 18.5 megaflops, or 240.5 MIPS, is the sum of the following components:

• Execution subsystem: 18.5 megaflops or 92.5 MIPS.
• Address subsystem: 111 MIPS.
• Control subsystem: 9.25 MIPS.
• Control processor: 27.75 MIPS.

The main memory with its management unit and channels occupies one of the three water-cooled racks, which measures 2230 × 1070 × 425 mm. The rest of the hardware is located in the two other racks, which measure 2020 × 1080 × 425 mm. The aggregate power dissipation is about 24 kilowatts.

In the following sections we discuss the most interesting features of the MARS-M computer: virtual multiprocessing and the hardware system process manager in the control processor; the multiprocessor decoupled architecture of the central processing unit; address/execution processor scheduling; the distributed instruction issue mechanism of the address/execution VLIW subsystem; and hardware support for executing array subscript expressions with nested conditional branches.

3.2. MARS-M System-Level Architecture

3.2.1. Virtual system-level architecture. In the design of the system-level architecture the main goal was to be able to exploit the concurrence of system activities. The following system activities can be executed in parallel:

• process and thread control;
• context management;
• physical memory allocation;
• central processing unit scheduling;
• communication with the El'brus operating system;
• communication with the service microcomputer;
• interrupt handling;
• diagnostics.

These activities are handled by eight virtual processors. Each processor has its own table-driven instruction set, and every task has its own set of the processors' instruction set tables. A message-passing mechanism is used for communication between the system processors. A message contains pointers to the system processor, the user's task, and the processor's instruction set table, together with data parameters.

A word within an instruction set table is an instruction memory address of the program implementing the corresponding system operation. Static tuning of resource allocation policies is done by placing the corresponding links (addresses) into the instruction set tables for each task independently.


Once initiated, a system operation is always executed to completion by the system processor. This makes it possible to avoid saving and restoring the processors' contexts during execution of a system operation, although it does result in some interrupt problems. The goal was to develop a dynamic system pipeline having the following features: dynamic structure without any centralized dispatcher, and self-scheduling of the system processors.

After completing an operation, a system processor itself decides to which processor the message is to be sent. It then fetches a new message from its message mailbox in the control processor's local memory. When there are no messages in the mailbox, the processor suspends (no idle looping) until it receives a new message. In general, messages are placed into the mailboxes according to the FIFO discipline. However, a user's scheduling hints specifying urgent processes within tasks can be used to place urgent messages at the head of the FIFO queues.

3.2.2. Dynamic virtual multiprocessing. The design of the control processor combines the following approaches: virtual multiprocessing and dedicated processors, multiple functional units, medium-long instruction words, and static scheduling.

The eight virtual system processors share seven functional units of the control processor. Only one of the virtual processors can run on the control processor hardware at a time, and only one of them can issue an instruction in a given clock period.

Two factors determined the choice of virtual rather than real system multiprocessing: the short, 1-clock period pipelines in the functional units and the memory load latency of 15 clock periods, which is long compared with the pipelines.

The short pipelines make it impossible to form a long instruction word at run time from multiple system processor instructions. An attempt to provide run-time sharing of functional units among the relatively large number of virtual processors would result in continuous conflicts, because the units could compute results in every clock period without any pauses. Only during a memory load/store operation is the delay long enough to give other system processors a chance to work.

In this implementation of virtual system multiprocessing, a system process loses control of the system hardware after it issues a load/store operation or completes. In the former case its execution is suspended and another process can be started within the control processor. After the load/store operation completes, the suspended process becomes ready to execute again.
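
The benefit of switching on memory operations can be illustrated with an idealized Python sketch. The 15-clock-period load latency is taken from the text; the burst lengths, load counts, and resulting utilization figures are invented for the illustration and should not be read as measured MARS-M behavior (the measured speedup is given in Section 3.2.5).

    # Idealized sketch: hiding the 15-clock-period load latency by switching
    # the control processor between virtual system processors on each load.
    import heapq

    MEM_LATENCY = 15

    def utilization(burst_lengths, loads_per_process=10):
        ready = list(range(len(burst_lengths)))   # virtual processors ready to run
        waiting = []                              # (cycle when the load completes, pid)
        done = [0] * len(burst_lengths)
        cycle = busy = 0
        while ready or waiting:
            while waiting and waiting[0][0] <= cycle:
                ready.append(heapq.heappop(waiting)[1])   # load finished: ready again
            if not ready:
                cycle += 1                        # nothing runnable: an idle cycle
                continue
            pid = ready.pop(0)
            cycle += burst_lengths[pid]           # compute until the next load is issued
            busy += burst_lengths[pid]
            done[pid] += 1
            if done[pid] < loads_per_process:
                heapq.heappush(waiting, (cycle + MEM_LATENCY, pid))
        return busy / cycle

    print(round(utilization([4]), 2))             # 0.23: one process, latency exposed
    print(round(utilization([4, 4, 4, 4]), 2))    # 0.86: four processes hide most of it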

The control processor's instruction set includes a number of process generation/termination operations. They specify the first instruction's address and the virtual processor to execute the process. (In the subsequent discussion of the control processor we will use the terms process and system process to refer to processes executing system operations rather than such KOKOS entities as modules and operators.)

Interrupt processes are created by hardware after the receipt of messages or signals from the control and peripheral subsystems, service microcomputer, timer, and other sources. They are executed by the interrupt virtual processor and have the highest priority.

The idea of virtual multiprocessing is not new. For example, the CDC 6600 used virtual multiprocessing for its peripheral processors in 1964. The approaches of the MARS-M and the CDC 6600 differ in key respects, however. Only two of the CDC 6600 virtual processors executed operating system functions; the others were used for communication with input/output devices. Also, the CDC 6600 peripheral processors were switched in equal time periods, in contrast to those of the MARS-M, which are switched dynamically.

Ideas similar to the MARS-M virtual multiprocessing scheme were implemented in the HEP computer. Both systems use dynamic sharing of common hardware between multiple processes, performing processor switching when memory access operations are issued.

3.2.3. Hardware implementation of the control processor. The seven functional units of the control processor (Figure 8) are interconnected by a 10 × 7 crossbar switch. All internal data paths are 32 bits wide.

The load/store unit, which performs conversion between 32- and 48-bit formats, provides communication with main memory. Read and write channels (two 25-bit address and two 48-bit data) connect the load/store unit and memory. When storing, the unit forms one 48-bit memory word from the specified fields of two 32-bit data elements. When loading, it selects the specified 16/32-bit fields out of 48-bit memory words. A third address channel carries addresses computed by an input/output address generator; this channel is used by the peripheral subsystem to transfer data between the MARS-M and El'brus memories.
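
One possible packing scheme is sketched below in Python. The actual field-selection rules of the load/store unit are not given in the text, so the particular layout (low 32 bits of one element plus low 16 bits of the other) is purely an assumption.

    # Illustrative only: one possible packing of two 32-bit elements into a
    # 48-bit word; the real field-selection rules are not documented here.
    MASK16 = (1 << 16) - 1
    MASK32 = (1 << 32) - 1
    MASK48 = (1 << 48) - 1

    def pack48(a32, b32):
        """Form one 48-bit memory word from fields of two 32-bit elements."""
        return ((b32 & MASK16) << 32 | (a32 & MASK32)) & MASK48

    def unpack48(word48, field="low32"):
        """Select a 16- or 32-bit field out of a 48-bit memory word."""
        if field == "low32":
            return word48 & MASK32
        if field == "high16":
            return (word48 >> 32) & MASK16
        raise ValueError(field)

    w = pack48(0x12345678, 0x0000ABCD)
    assert unpack48(w, "low32") == 0x12345678
    assert unpack48(w, "high16") == 0xABCD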

Sixteen communication registers pass data between the processor, the CPU's subsystems, and the service microcomputer.

Figure 8. Structure of the control processor. (The diagram shows the seven functional units, the 16 communication registers, the instruction unit with its 8K-word instruction memory, the process manager with its process queues, and the crossbar switch, together with the address and data paths to the MMU, memory, and the other subsystems.)


To increase data bandwidth, the general-purpose 32-bit registers are separated into two independent files. Each file is combined with a 24-bit integer adder performing address and loop-counter calculations. A 32-bit ALU performs arithmetic, logic, and shift operations. An 8K-word local memory is capable of doing one read and one write operation per clock period.

Before any system process is started, its statically scheduled code is loaded into the instruction unit's 8K-word rewritable memory. As is the case for fragments in the address and execution processors, no predictions are made of process execution paths in the control processor. A 48-bit instruction word contains operations for up to three units, giving a peak performance of 27.75 MIPS. Simulation showed that a larger instruction size would be required only in a very few cases. Only the instruction unit's operations have a fixed placement in an instruction word.

The instruction processing pipeline consists of three 1-clock stages: fetching from the instruction memory; decoding and issuing operations to the functional units and/or executing the instruction unit's own operations; and executing operations in the functional units.

In the case of branching, the instruction memory is capable of fetching two instructions per clock period, one from each possible execution path. In this implementation any condition is always computed at the moment (the end of the clock period) when the branch operation has to be executed. Because branching is implemented as a selection between two instructions that have been fetched in advance, the execution of a branch does not result in any break or delay in issuing instructions by the instruction unit. The choice of the target instruction can, however, be made before the condition has been calculated. If the selected path turns out to be correct, the instruction determined by the result of the condition is ready to be executed (not just issued) in the clock period following the calculation of the branch condition. If the path is wrong, no compensation code is required; the results of the executed instruction can be nullified by not storing them in registers or memory. For example, consider the branch in Figure 9a and its code representation in memory in Figure 9b. In executing the branch operation at address i+1, two other instructions, at addresses i+2 and X, are fetched in advance. The first corresponds to the beginning of the BB2 block and the second to the BB1 block. To speed the computation process, the compiler can place the first operation from one of the basic blocks and the branch operation together in one instruction.
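
The selection-and-nullification step can be summarized in a few lines of Python; the function and its arguments are illustrative assumptions rather than the actual control logic.

    # Illustrative sketch of the dual-fetch branch scheme: both candidates are
    # fetched in advance, one may be chosen and even executed before the
    # condition resolves, and a wrong guess is repaired simply by not
    # committing the speculative results.
    def resolve_branch(guessed_target, fallthrough, taken, alu_result):
        """Return (next_address, commit_speculative_results)."""
        correct_target = taken if alu_result > 0 else fallthrough
        return correct_target, guessed_target == correct_target

    # Branch at i+1 with target X = 40; the fall-through path starts at i+2 = 12.
    next_addr, commit = resolve_branch(guessed_target=40, fallthrough=12,
                                       taken=40, alu_result=+1)
    assert (next_addr, commit) == (40, True)     # guess was right: results stand

    next_addr, commit = resolve_branch(guessed_target=40, fallthrough=12,
                                       taken=40, alu_result=-1)
    assert (next_addr, commit) == (12, False)    # wrong path: results are nullified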

3.2.4. System process management. The use of processes at the hardware level makes sense only if process control costs are far less than the gains that can be achieved. To perform processor switching as fast as possible, it was necessary, first of all, to solve the problem of fast saving and restoring of the processors' contexts. Having a fixed number of system processors, the designers decided to implement each 64-word register file as a set of eight independent register sections. For any access to these register files, hardware forms at run time a 6-bit pointer consisting of a 3-bit active_system_processor field and a 3-bit local_register field. Similar independent register sections were implemented in the address and condition stacks within the instruction unit. Thus, in performing a processor switch, a processor's local data do not have to be saved or restored.
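
The pointer arithmetic is simple enough to show directly; the following Python fragment assumes an array-backed register file purely for illustration.

    # Illustrative sketch: the 6-bit register pointer is the 3-bit
    # active_system_processor field concatenated with the 3-bit local_register
    # field, so each of the eight virtual processors sees its own section
    # of the 64-word file.
    REGISTER_FILE = [0] * 64            # 8 sections x 8 registers

    def reg_index(active_system_processor, local_register):
        assert 0 <= active_system_processor < 8 and 0 <= local_register < 8
        return (active_system_processor << 3) | local_register

    # Processor 5 writes its local register 2; no save/restore is needed when
    # the hardware later switches to another virtual processor.
    REGISTER_FILE[reg_index(5, 2)] = 0xBEEF
    assert REGISTER_FILE[0o52] == 0xBEEF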

The remaining problem to be solved was the fast implementation of the process control operations performed by the system process manager:


(a) A two-way conditional branch with target address X (flowchart not reproduced).

(b) Its representation in memory:

    i:    ALU operation: A - B
    i+1:  Branch: if ALU > 0 goto X  (possibly combined with the first operation of BB1/BB2)
    i+2:  BB2 code (possibly without its first operation)
    X:    BB1 code (possibly without its first operation)

Figure 9. A branch and its representation in memory.

• generate new processes in executing system threads;
• generate interrupt processes in response to hardware signals;
• terminate the execution of a process by performing the corresponding system operation;
• suspend the execution of a process accessing main memory;
• initiate or resume the execution of another process that is ready to execute.

To provide parallel operations on different types of processes, dedicated queues are implemented in the manager for interrupt processes, processes suspended by load or by store, and new software-created processes. A pipelined process handler performs four actions, each one clock period in duration:

1. create a process identifier consisting of the processor pointer and program counter value;
2. write the identifier into the corresponding process queue;
3. determine the highest priority process and fetch its identifier in advance into an available process register;
4. start the process from the register.

Actions 1 and 4 are executed simultaneously with a load/store operation, resulting in no loss of cycles during process switching (if there is a process ready to be executed, of course).
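
A rough Python sketch of the manager's bookkeeping follows. The text states only that interrupt processes have the highest priority, so the relative order of the remaining queues, and all names, are assumptions made for the example.

    # Illustrative sketch (queue priorities partly assumed) of the dedicated
    # process queues and the prefetch/start steps of the process handler.
    from collections import deque

    class ProcessManager:
        def __init__(self):
            self.queues = {"interrupt": deque(),
                           "suspended_by_load": deque(),
                           "suspended_by_store": deque(),
                           "new": deque()}
            self.prefetched = None        # the available process register

        def enqueue(self, kind, processor_ptr, program_counter):
            # Actions 1-2: create the identifier and place it in its queue.
            self.queues[kind].append((processor_ptr, program_counter))

        def prefetch(self):
            # Action 3: pick the highest-priority ready process in advance.
            for kind in ("interrupt", "suspended_by_load",
                         "suspended_by_store", "new"):
                if self.queues[kind]:
                    self.prefetched = self.queues[kind].popleft()
                    return

        def start(self):
            # Action 4: start the prefetched process (overlaps with a load/store).
            pid, self.prefetched = self.prefetched, None
            return pid

    mgr = ProcessManager()
    mgr.enqueue("new", processor_ptr=3, program_counter=0x100)
    mgr.enqueue("interrupt", processor_ptr=7, program_counter=0x040)
    mgr.prefetch()
    print(mgr.start())    # -> (7, 64): the interrupt process wins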


3.2.5. Real performance of the control processor. Measurements of performance on real kernel programs showed that the control processor has a real system performance of 8 MIPS and an average of 1.9 operations per instruction [Dorozhevets 1988]. Typical times for some system operations follow:

• Placing a message into a processor's mailbox: 4.8 µs.
• Fetching a message and starting the process: 4.3-6.2 µs.
• Creating a process object to perform a module/operator/control expression: 9.7-10.8 µs.
• Allocating and linking one page of virtual and physical memory to a process's context: 31.1-35.2 µs.

To measure the true effectiveness of the process switch mechanism, a sample KOKOS program consisting only of instructions making parallel calls of operators was executed. An operator call is the most complex KOKOS instruction: four system processors sequentially take part in its execution, forming a specific execution pipeline. In executing a program consisting of multiple parallel operator calls, up to four calls can be processed at different stages in the pipeline; in fact, this is the simplest way to involve half of the virtual processors in the processing. In this case, time spent on communication and synchronization of the system processors was 7% and 1%, respectively. The results showed that 83% of the memory access delay time was put to useful work. The virtual multiprocessor approach resulted in a speedup factor of 1.17 to 1.24, depending on variations in the access time due to memory bank conflicts.

3.3. MARS-M Decoupled Architecture

According to [Smith et al. 1986], there are two main elements of a decoupled architecture: two separate instruction sets with concurrent processing of at least two instruction streams, one for accessing memory and one for performing function execution; and communication between the memory access and execution processes by means of architectural queues.

The ZS-1 and PIPE computers are examples of recently developed computers with decoupled architectures [Farrens and Pleszkun 1991; Smith 1989]. The MARS-M differs from these and earlier efforts in its use of multiprocessing to perform both memory access and execution tasks, and in its dynamic hardware allocation of both address/execution processors and communication queues.

3.3.1. Instruction issue. In executing modules, the control processor selects operators to be executed by the central processing unit and passes their parameters and code descriptors to the control subsystem. This action is similar to a common procedure call, with the loading of parameters (memory references) and a code descriptor of each operator into special registers. The control processor can perform up to four operator calls without waiting for the completion of previously called operators. The control subsystem uses the descriptors to fetch the operators' codes into its instruction buffer. Execution of any operator can be overlapped with the loading of other operators' code. The buffer can contain the code of four operators.


From the control subsystem's point of view, the central processing unit consists of the following objects:

• four execution processors;
• four address processors;
• multiple operand queues within communication channels;
• 3 × 32 descriptor registers in the address subsystem;
• a 24-bit ALU in the control subsystem;
• 4 × 16 32-bit operator parameter registers;
• special control and resource stacks and registers.

The control subsystem splits the instruction stream from the instruction buffer so that some of the instructions are passed to free address and execution processors, while others are executed by the control subsystem itself. In general, in executing operations, the processors use channel queues as their operands. The address processors also operate on descriptor registers. After receiving an instruction, an address or execution processor cannot receive another until the current instruction has finished executing.

Each instruction sent to the processors contains a fragment call operation. Address fragments compute addresses of elements within structured objects, such as arrays or vectors, which either have to be fetched from memory into queues or have to be stored into memory from queues. Execution fragments perform their operations on data that is fetched from memory, or computed and passed by address fragments through queues external to the subsystem, or computed and passed by other fragments through the subsystem's internal queues.

The control subsystem is pipelined to issue one instruction per clock period. Instructions are issued strictly in program order, but can begin and complete their execution out of order. The reason for this out-of-order execution is the asynchronous communication between processors and memory by means of channels. A data flow mechanism is used to initiate or suspend the execution of a fragment, depending on the availability of operands in its channel queues. This run-time scheduling is carried out by hardware within the address and execution subsystems without any control subsystem involvement, so instructions are issued to the execution processors with no regard for data dependencies.

In address instructions special attention has to be given to the case of reading operands from main memory or from descriptor registers. A fragment completion word is used to guarantee a correct sequence of read and write operations. Any address write-fragment that writes data into memory or the descriptor registers always sends a final 4-bit message that is stored in the control subsystem's fragment completion stack. Address read-fragments check the availability of the corresponding words in the stack, and instruction issue is stopped when a read encounters a completion word that is still off (not yet set). To reduce data dependencies in these cases, the compiler reorders operator instructions.

Completion words sent by address and execution fragments contain values of flags such as sign and overflow. The control subsystem uses them and its own flags to perform macrobranch operations. They are called macro because branches can be performed within address and execution fragments, too.


After issuing all the instructions of an operator, the control subsystem begins executing another operator, if any, without waiting for the completion of the address/execution operations (fragments) issued earlier. With such overlapping, up to four different operators (i.e., the address/execution fragments of four operators) can run in parallel within the central processing unit. The availability of several sets of data and control registers within the control subsystem provides for the correct completion of operations from different operators.

Dynamic chaining of both fragments and operators within the central processing unit therefore makes it possible to minimize the impact of the rather large control delay of eight clock periods between the fetch of instructions from the instruction buffer and the start of their execution within the address and execution subsystems. Although this control path is pipelined, its length in fact indicates the minimum size of the data stream needed to achieve half of the peak performance. Taking into account the execution times of the floating point functional units (three clock cycles for the adder and four for the multiplier) and the channels (one clock cycle for read/write operations), this size can be estimated to be at least 12 to 13 elements.

3.3.2. Dynamic resource scheduling. The address and execution processors and the queues within the central processing unit are allocated at run time. Any address (execution) instruction can be executed using any of the free address (execution) processors and free queues within the specified channels. To provide run-time allocation of processors and queues, a number of tag pools are implemented in the control subsystem. For each channel, its queue tag pool consists of a set of tags that have a one-to-one correspondence with the free queues within the channel. The address and execution processor tag pools contain the tags (numbers) of free processors.

After an instruction is issued, its output queue tag is placed on the stack of queue tags. Address and execution instructions use tags from this stack as their input operands. The queue tags are returned to the corresponding tag pools after the instruction that used them as its input operands finishes. The stack thus models the interaction between operations in the address and execution processors. It is the compiler's task to order instructions within the instruction stream so as to guarantee the correct ordering of input operands for the address/execution instructions.
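
The tag bookkeeping can be sketched as follows in Python; the class and variable names are assumptions, and the real allocation is of course done by hardware within the control subsystem.

    # Illustrative sketch of queue-tag allocation: tags of free queues live in
    # per-channel pools, a producer's output tag is pushed on a stack, and a
    # consumer pops it as its input; the tag returns to the pool afterwards.
    class ChannelTagPool:
        def __init__(self, n_queues):
            self.free = list(range(n_queues))
        def allocate(self):
            return self.free.pop()
        def release(self, tag):
            self.free.append(tag)

    pool = ChannelTagPool(n_queues=8)          # one data channel
    tag_stack = []

    # Issue a producer instruction: allocate its output queue.
    out_tag = pool.allocate()
    tag_stack.append(out_tag)

    # Issue a consumer instruction: its input is whatever the stack supplies.
    in_tag = tag_stack.pop()
    # ... the consumer fragment runs, reading from queue `in_tag` ...
    pool.release(in_tag)                        # returned after the consumer finishes

    assert in_tag == out_tag and len(pool.free) == 8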

Besides providing the correct ordering of instructions, it is also necessary at run time to indicate the correspondence between logical queue numbers and physical queue tags. The former are assigned to operations within fragments when the fragments are compiled. Special queue linkage tables were implemented for each processor in the address and execution subsystems to provide run-time mapping from logical to physical queue numbers. When a fragment is executed, any access to channel queues is carried out through the queue linkage tables of the corresponding channel management units within the subsystems. The control subsystem loads the tables and issues an address or execution instruction simultaneously. There are similar tables in the memory management unit for the channels linking memory and the subsystems.

3.3.3. Dynamic processor chaining. In executing any fragment, a processor can send its results to another processor rather than storing them in memory. This direct processor interaction takes place by means of address-to-address, address-to-execution, execution-to-execution, and execution-to-address data and branch channels.


Like vector registers, all of the queues can be used to chain processors at run time. In contrast to the synchronous chaining of vector registers, however, queue chaining is asynchronous in nature. The same data flow mechanism is used for communication between processors as between processors and memory.

The MARS-M architecture can be made to look as though it has both memory-to-memory and register-to-register features. For example, in array/vector processing an execution processor can operate on data streams fetched from memory by an address fragment executing on an address processor. An output data stream is stored in memory. Here the processing has all the memory-to-memory features, since the communication queues work as intermediate FIFO buffers smoothing the transfer of data between the processors and memory.

The approach is weak when it is necessary to perform other operations (execution fragments) on the same data streams. To avoid repeatedly fetching data streams from memory, processor chaining with a special quantification mechanism is used. The main goal is to make queues look like eight-word vector registers. Like any registers, the contents of vector registers can be read any number of times; in general, this is impossible for the communication FIFO queues. To provide a similar ability for the queues, data are vectorized or, in MARS terminology, quantified; that is, processing is done on data blocks up to eight words in length. After a queue has been filled, the data flow control mechanism of the queue is turned off by special instructions executed in the control subsystem. As a result, a data block fetched from memory or computed by a processor can be used as an input operand as many times as required without accessing memory. When an instruction executed in this mode completes, the queue tag is not placed into the queue tag pool; a special control instruction must be used to place the tag into the pool after the multiple accesses to the queue have been completed.
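
The following Python sketch illustrates the idea of switching off a queue's flow control so that a filled block can be reread; the class, flag, and method names are assumptions for the example.

    # Illustrative sketch of 'quantification': after a queue is filled with a
    # block of up to eight words, its FIFO flow control is switched off so the
    # block can be read repeatedly, like the contents of a vector register.
    class ChannelQueue:
        DEPTH = 8
        def __init__(self):
            self.words = []
            self.flow_control = True      # normal decoupled FIFO behaviour

        def write(self, x):
            assert len(self.words) < self.DEPTH
            self.words.append(x)

        def freeze(self):                 # issued by the control subsystem
            self.flow_control = False

        def read_block(self):
            if self.flow_control:
                data, self.words = self.words, []      # consumed once, FIFO style
            else:
                data = list(self.words)                # reusable, register style
            return data

    q = ChannelQueue()
    for x in (1.0, 2.0, 3.0, 4.0):
        q.write(x)
    q.freeze()
    s1 = sum(q.read_block())                  # first use of the block
    s2 = sum(x * x for x in q.read_block())   # reused without refetching from memory
    print(s1, s2)                             # 10.0 30.0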

3.4. The MARS-M VLIW Approach

Both the address and execution processors have a horizontal architecture with multiple pipelined functional units interconnected by a crossbar switch. The approach taken to scheduling instructions in these processors differs from the global and procedure static approaches of the Trace and El'brus-3. To form long instruction words, both of the latter computers rely solely upon compilers, which pack operations from multiple basic blocks into one instruction. The MARS-M uses both static and dynamic packaging.

3.4.1. Static and dynamic code scheduling. As has been stated above, the application level is tuned by creating the set of address and execution operations called fragments. A library-oriented approach similar to that of attached array processors was used for the address and execution processors. Fragments are independent program blocks that can consist of multiple basic blocks and a control part containing branches, loop counters, literals, and other operations. A basic block is a sequence of code that can only be entered at the top and exited at the bottom; conditional branches and conditional branch targets typically delineate basic blocks [Smith 1989]. The fragment compiler schedules the execution of each fragment separately. Four address and four execution processors execute up to eight fragments in parallel.


Hardware provides some support for the static scheduling of data streams within fragments. To reduce contention between functional units in accessing local memory, special data-delay units were incorporated into the address and execution subsystems. Each of the units consists of four 8-word circular FIFO queues (register files with a FIFO discipline for reads/writes), one per processor. Data are stored in the queues in accordance with a compiler specification of how many cycles they are to remain there before being transferred to the crossbar switch.

In this scheduling strategy, operations from different basic blocks in a fragment are not packaged together. While this results in nondense fragment code (i.e., not very long instructions), no additional or compensation code is required. Very long instruction words are formed dynamically by hardware from such variable-length instructions during the concurrent execution of multiple fragments within each of the subsystems.

Besides the dynamic composition of very long instruction words, designers had another target: to achieve vector supercomputer performance on vector/array processing (one/two results per cycle) while providing some powerful processing capabilities lacking in vector supercomputers.

Loop unrolling could not be used to achieve this goal, since a complex operation approach had been taken and only a limited amount of memory was available to store the fragments' (operations') code. Instead of "code-expensive" static methods, a method combining both static and dynamic aspects was developed: the compiler generates a scenario with dynamic overlapping of several iterations of simple DO loops with no code expansion. More details of the MARS-M's approach to loop processing will be given later, after discussing the instruction issue mechanism.

During the project the developers realized that they would need to use conventional sequential languages. They planned to develop Fortran and C compilers at a later stage and move from library-oriented to conventional high-level language programming. These compilers would have to build a data flow dependency graph and find independent fragments of a program that could be executed in parallel by address and execution processors communicating with each other. Fortran and C compilers were not implemented before the project was ended, however.

3.4.2. Distributed instruction issue control. To explain the basic principles of instruction issue in the MARS-M, we will consider each address/execution processor as a union of a control section and a functional section. The control section is responsible for issuing instructions to the functional section while receiving status signals from the latter. In the MARS-M all four processors (address or execution) within each of the subsystems have their own control sections and one shared functional section. Each independent control section includes instruction issue logic, queue linkage tables, and queues within the data-delay units. The rest of the hardware (functional units, instruction memory, and data and instruction paths), that is, the address or execution functional section, is shared by the four address or four execution processors.

To implement the MARS-M's VLIW model of computation, it was necessary to solve two major tasks:


1. to enable four processors within each subsystem to issue their statically scheduled, variable-length instruction streams simultaneously while providing dynamic sharing of common functional hardware;

2. for each processor, to provide the ability to overlap execution of several iterations of simple DO loops not containing internal branches with no code expansion.

To solve these tasks, a VLIW instruction issue mechanism with multiple sequencers (one per functional unit) and distributed instruction memory was developed. Each sequencer issues a statically scheduled instruction stream from its part of the instruction memory, which contains all fragment operations for the corresponding functional unit. A unit's sequencer and its part of the instruction memory constitute a control module. The control issue logic of the four address/execution processors is distributed among the control modules of the corresponding subsystem. Figure 10 shows the organization of the address and execution subsystems.

Figure 10. Control and data paths in the address and execution subsystems. (The diagram shows the crossbar switch; the functional units with their output registers and result channels; the control modules, each with four instruction registers, four program counters, four instruction buffers, instruction issue control, a 1K-word instruction memory, and a 1K-word basic block table; the fragment dispatcher; and the control unit with its 2K-word instruction memory, address adder, and 4 × 16 general-purpose registers.)


The library-oriented approach for fragments makes it possible to load the code of all fragments into the distributed instruction memory before a program is initiated and to have it fetched locally and independently by the multiple sequencers at run time. The compiler specifies a timetable of all instruction streams to be issued by the control modules in executing any fragment. A similar extended VLIW approach, called XIMD, with multiple sequencers and homogeneous functional units was recently proposed [Wolfe and Shen 1991].

A fragment is separated into control and basic block portions. All fragment instructions that initiate the execution of basic blocks and execute branches and some other operations are located in the control portion. The control portions of all fragments are placed in the instruction memory of the control unit, whereas the basic block portions remain in the control modules. Some examples of the control operations follow:

• Load <value> into a loop counter.
• If <flag> = 0 then goto <address within the instruction memory of the control unit>.
• While <specified branch queue or loop counter> = 1, execute the basic block at <specified address>.

A fragment can have zero instructions in the control unit memory; in this case its only control instruction is the one passed from the control subsystem to initiate the fragment. The operations of basic blocks are distributed among the instruction memories within the control modules of the functional units used to execute them.

A basic block table is used to associate the control and basic block portions for each functional unit. A word of the table contains a pointer specifying two parameters: (1) the initial address of the basic block's operations in the functional unit's instruction memory, or a flag indicating that there are no operations for this unit in the basic block, and (2) the time delay (in clock periods) before issuing the unit's first operation from the basic block.

A functional unit's instruction contains an operation, sources of input operands, and a time issue delay for the instruction following it. After being fetched from memory, instructions are placed into the delay FIFO queue of the instruction buffer together with the next instruction's address. The instruction will be issued from the queue to an instruction register in accordance with its delay, and then a new instruction/address pair is placed into the queue. All control modules contain a separate program counter, instruction queue, and instruction register for each processor. In general, for a given processor the program counters in different control modules have different values at run time. Since a variable number of functional units are used by a fragment in each cycle, a processor's instructions can have variable lengths in different cycles. The address and execution instructions can be up to 175 and 135 bits wide and contain up to 12 and 10 operations, respectively.

Thus, a new operation of a basic block will be fetched from the unit's instruction memory and placed into the instruction queue only after the current one has been issued from the queue. Consider a basic block consisting of the body of a simple Fortran DO loop without branches. To initiate one iteration per cycle, each control module was designed to trace up to eight iterations of such a loop. The compiler determines the maximum rate of initiating the loop's basic block, using the control unit's instruction, by examining the timetable of operations within the block. The compiler must resolve instances in which a loop's body contains several operations for one unit.


For example, the computational basic block of the expression

A(I) = B(I) + C(I) * D(I)

can be initiated in each clock period since it does not contain multiple operations for any one of the functional units. This is not true for the expression

A(I) = B(I) + C(I) - D(I).

In this case the basic block contains two operations executed by the adder. If the instruction issue rate were always one iteration per cycle, then in the fourth cycle (the adder's execution delay is three cycles) the add operation from the fourth iteration would have to be issued simultaneously with the subtract operation of the first. The control module cannot issue more than one operation per cycle, however. To enable such loops to be performed, a time issue delay generator was implemented in hardware in each of the control units. Its task is to supply the control instruction with the first compiler-specified issue delay the indicated number of times, then with the second, and to repeat these actions. In our example the control instruction has to use the issue delay generator with a one-clock delay as the first parameter during two cycles and a four-clock delay during the following cycle:

    Iteration    Delay before initiation    Adder operations issued in cycles
    i            --                         1 and 4
    i+1          1 cycle                    2 and 5
    i+2          1 cycle                    3 and 6
    i+3          4 cycles                   7 and 10
    i+4          1 cycle                    8 and 11
    i+5          1 cycle                    9 and 12
    i+6          4 cycles                   13 and 16
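
The schedule can be checked mechanically. The short Python sketch below assumes only the 3-cycle adder latency and the 1, 1, 4 delay pattern stated above, and verifies that no two adder operations ever fall in the same cycle.

    # Mechanical check of the 1, 1, 4 issue-delay pattern for the loop body
    # A(I) = B(I) + C(I) - D(I): each iteration uses the adder twice, three
    # cycles apart (the adder latency), and the pattern never puts two adder
    # operations into the same cycle.
    ADDER_LATENCY = 3
    delays = [1, 1, 4] * 4            # pattern produced by the issue delay generator

    starts, start = [], 1             # iteration i is initiated in cycle 1
    for d in [0] + delays[:-1]:
        start += d
        starts.append(start)

    adder_cycles = [c for s in starts for c in (s, s + ADDER_LATENCY)]
    assert len(adder_cycles) == len(set(adder_cycles))   # at most one adder op per cycle
    print(starts[:7])                 # [1, 2, 3, 7, 8, 9, 13]
    print(sorted(adder_cycles)[:12])  # [1, 2, ..., 12]: the adder is busy every cycle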

3.4.3. Processor scheduling within address and execution subsystems. As has been stated above, four processors within each of the subsystems share common hardware, such as functional units, data paths, and communication ports. The task of processor scheduling is to form dynamically one long subsystem word from multiple processors' variable-length instructions for each machine cycle.

To determine which of the instructions can be issued simultaneously, each processor specifies in each cycle the functional units, the functional units' output paths to the crossbar switch, and the channel queues that it needs. This information, called a processor request vector, is passed to a fragment dispatcher. The dispatcher performs two operations: (1) it finds conflicts in the use of functional units and data paths, and (2) when queue operations are present, it checks the availability of operands and of free space within the specified input and output queues, respectively.


Processors that have no space available in their queues are not considered by the dispatcher until space appears. To resolve conflicts between processors, a simple round-robin priority scheme is used in each machine cycle. All processors with nonconflicting requests issue their instructions simultaneously.
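
A simplified Python model of the dispatch step follows; the resource names and data structures are assumptions, and the real dispatcher evaluates all request vectors within a single machine cycle.

    # Illustrative sketch of the dispatcher's conflict check: each processor
    # presents a request vector (the functional units, output paths, and queues
    # it needs this cycle); requests are granted in round-robin order, and a
    # request conflicting with an already granted one waits for a later cycle.
    def dispatch(request_vectors, first):
        """request_vectors: one set of resource names per processor.
        first: the processor given top round-robin priority this cycle."""
        granted, busy = [], set()
        n = len(request_vectors)
        for k in range(n):
            p = (first + k) % n
            req = request_vectors[p]
            if req and not (req & busy):
                granted.append(p)
                busy |= req
        return granted

    cycle_requests = [{"fp_adder", "out_path_1"},       # processor 0
                      {"fp_multiplier", "out_path_2"},  # processor 1
                      {"fp_adder", "out_path_3"},       # processor 2 (conflicts with 0)
                      set()]                            # processor 3 (blocked on a queue)
    print(dispatch(cycle_requests, first=0))   # -> [0, 1]: processor 2 retries next cycle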

3.4.4. Array and vector processing. Since the MARS-M computer is oriented towards scientific computation, the effective processing of arrays and vectors was a primary goal. A number of address and execution fragments implementing typical array/vector operations were developed.

Address fragments can perform the following operations:

• form descriptors specifying address generation algorithms for objects in memory;
• load/store elements of structured objects, using descriptors;
• load/store scalar data without using descriptors.

Execution fragments perform both simple and complex operations on data coming from the input queues and write results into output queues. The length of the input data stream to an execution fragment can be specified in a parameter of the fragment call. It can also be specified dynamically by sending a control bit stream through a branch queue from another address or execution processor.

When static scheduling is used only within basic block boundaries, it is very difficult for one scalar fragment to use all functional units in each clock period. In scalar processing, maximum throughput is achieved by executing independent fragments concurrently. In array/vector computations the lack of conditional branches within execution fragments makes it possible to generate results in every clock cycle. But this is possible only if array/vector address fragments are capable of generating addresses in every clock cycle for any memory access.

In many cases, array-subscript expressions can be rather complex and contain conditional branches. To provide continuous address generation for such expressions, special array address generators were incorporated into the address subsystem. Each of them can compute, in each clock period, an array-subscript expression containing up to three nested conditional branches. To make this possible, however, the types of expressions and conditional branches were limited.

The following algorithm was implemented in each of the address generators:

    A(i, j, k) = A(0) + I-step*i([+/-], [l/r], I-limit)
                      + J-step*j([+/-], [l/r], J-limit)
                      + K-step*k([+/-], [l/r], K-limit),

where A(0) is a displacement from the beginning of the array; I-, J-, and K-step are increment/decrement index values; i, j, and k are index generation functions; I-, J-, and K-limit are limit index values; l/r is a dependency pointer to the index to the left or right of the current one; and +/- is an increment/decrement pointer for the limit index values. Parameters in square brackets are optional.


A use of the dependency pointer means that a new value of the current index is computed only when the left or right index becomes equal to the left or right limit index value, respectively. There are three different generation modes for each index: (1) the index depends only on its limit value, (2) it depends on the left/right index and its limit value, and (3) either 1 or 2 with incrementing/decrementing of the index's limit value.

Figure 11 shows an example of the index generation flowchart with the first mode for i, and the third for j and k. The first index is changed in every clock cycle; the second, only when i = IL; and the third, only when i = IL and j = JL.

In another example we consider the exotic case of accessing an upper triangular matrix stored columnwise in memory. The numbers specify the required access sequence.

    1   5   8  10
        2   6   9
            3   7
                4

Since the accessing process starts from the first element of the matrix, A(0) = 0. The access function is expressed as A = i(-, 3) + 4*j(-, 3) + 4*k(l, 3). The per-clock diagram for this case is the following:

    Cycle     1   2   3   4   5   6   7   8   9   10
    i-value   0   1   2   3   0   1   2   0   1   0
    j-value   0   1   2   3   0   1   2   0   1   0
    k-value   0   0   0   0   1   1   1   2   2   3
    I-limit   3   3   3   3   2   2   2   1   1   0
    J-limit   3   3   3   3   2   2   2   1   1   0
    K-limit   3   3   3   3   3   3   3   3   3   3
    Address   0   5   10  15  4   9   14  8   13  12

There are two array generators, each of which was designed to implement the three-index generation. A third generator, used for vector addressing, performs common, single-index address calculation.
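
As a cross-check of the example above, the following simplified Python model (not the hardware algorithm itself, whose parameterization is richer) reproduces the address sequence of the per-clock diagram.

    # Simplified software model, sufficient only for the triangular-matrix
    # example A = i(-, 3) + 4*j(-, 3) + 4*k(l, 3): i and j run from 0 to a
    # limit that is decremented on each wrap, and k advances whenever j wraps.
    def triangular_addresses(n=4):
        i = j = k = 0
        i_limit = j_limit = n - 1
        out = []
        while i_limit >= 0:
            out.append(i + 4 * j + 4 * k)         # A = i + 4*j + 4*k
            if i == i_limit:                      # one diagonal finished
                i = j = 0
                i_limit -= 1                      # '-': the limits are decremented
                j_limit -= 1
                k += 1                            # 'l': k follows its left neighbour j
            else:
                i += 1
                j += 1
        return out

    print(triangular_addresses())   # [0, 5, 10, 15, 4, 9, 14, 8, 13, 12]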

3.4.5. Scalar processing. Investigations of MARS-M performance showed the weakness of the pure queue-based architecture of the address and execution subsystems in scalar processing. This was due to the lack of general-purpose registers for storing local data of the operators being executed within these subsystems, which forced the complex dynamic resource allocation mechanism to be used even for simple scalar operations. To resolve this problem, the local data memory of each subsystem was divided into two parts: one for storing the constants of all the fragments loaded into the subsystem, and the other (64-128 words) for storing operators' local data.


Figure 11. Index generation flowchart for the expression A = A(0) + I-step*i(IL) + J-step*j([+/-], l, JL) + K-step*k([+/-], l, KL).

After some of these mechanisms were redesigned, it became possible for fragments to take or pass their input/output parameters through the general-purpose part of the local data memories. This change significantly improved the scalar processing performance of the MARS-M.
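The fragment below is a purely conceptual illustration, not MARS-M code; all names and the Python interface are invented. It only contrasts the two ways of handing a scalar operator its parameters that are described above: through dynamically allocated queue resources (the original, purely queue-based scheme) versus through fixed slots in the general-purpose part of a subsystem's local data memory (the redesigned scheme).

# Conceptual contrast of queue-based vs. register-like parameter passing (names invented).
from collections import deque
from typing import List, Sequence


class LocalDataMemory:
    """Local data memory of one address/execution subsystem after the redesign:
    a constants region for all loaded fragments plus a small general-purpose
    region (64-128 words) for operators' local data and parameters."""

    def __init__(self, constants: Sequence[float], gp_words: int = 64):
        self.constants = list(constants)
        self.gp: List[float] = [0.0] * gp_words


def scalar_op_via_queue(args: Sequence[float]) -> float:
    # Original scheme: even a trivial scalar operator consumes its operands through
    # dynamically allocated queue resources.
    q = deque(args)                    # per-call dynamic resource allocation
    return q.popleft() + q.popleft()


def scalar_op_via_gp(mem: LocalDataMemory, slot: int, args: Sequence[float]) -> float:
    # Redesigned scheme: the caller deposits parameters in agreed general-purpose
    # slots and the operator reads them back like registers -- no queue management.
    mem.gp[slot:slot + len(args)] = list(args)
    return mem.gp[slot] + mem.gp[slot + 1]


mem = LocalDataMemory(constants=[1.0], gp_words=64)
assert scalar_op_via_queue([2.0, 5.0]) == scalar_op_via_gp(mem, 0, [2.0, 5.0]) == 7.0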

4. Conclusion

The El'brus-3 and MARS-M are two leading examples of recent Russian achievements in high-performance computing. Both machines offer nontraditional solutions to the problems of achieving high performance. Their development has been the result of a complex interplay between design goals and requirements; technological, economic, and political factors; and indigenous and Western ideas. Many architectural solutions that appear similar to Western developments were arrived at independently.

Both the El'brus-3 and the MARS-M seek to employ VLIW approaches, but do so in very different ways. We summarize their different solutions to the principal architectural issues below.

At present it is difficult to estimate how these innovations affect the real performance of either computer. The MARS-M did not have Fortran or C compilers, making it impossible to measure real performance using standard benchmarks. The El'brus-3 is now at the stage of prototype assembly, but does not have a Fortran compiler either.


1. Memory load delay
   El'brus-3: Statically scheduled prefetching of data into multiported random-access buffers.
   MARS-M: Dynamic decoupled memory access and execution operations, with prefetching of data into multiple asynchronous FIFO queues.

2. Scheduling of
   a) instruction streams
      El'brus-3: Statically scheduled multitracing with one sequencer; static packaging of operations from alternative basic blocks into one very long instruction word.
      MARS-M: Dynamic multitracing of multiple statically scheduled instruction streams with multiple sequencers; dynamic packaging of operations from nonconflicting instruction streams into long address and execution instruction words.
   b) data streams
      El'brus-3: Synchronous multiport buffers; a full crossbar with distributed synchronous register memory for intermediate results; asynchronous shared memory for processors' communication.
      MARS-M: One-port local memory per address/execution subsystem; a full crossbar with synchronous data-delay units for intermediate results in each subsystem; dynamic chaining of execution/address processors by means of FIFO queues; asynchronous shared memory for communication with the CPU, control processor, and peripheral subsystem.

3. Vector/array processing in loops
   El'brus-3: A multiport iteration-frame buffer; eight 1-index address generators; loop unrolling with the ability to access several iterations of nested loops.
   MARS-M: Multiple FIFO queues; two 3-index and one 1-index address generators; complex vector/array operations with statically scheduled overlapped execution of several iterations of simple DO loops with no code expansion.

4. Procedures
   El'brus-3: Hardware support of the procedure-oriented architecture with multilevel lexical addressing; one copy of procedure code with dynamic overlapped execution of calling and called procedures on one central processor.
   MARS-M: Hardware support of the multiprocessor decoupled model of computation, with operators as procedures with no internal procedure calls; no code duplication for fragments, and the ability to execute operators in parallel on one central processing unit.


The lack of sufficiently powerful Soviet computers means that software simulation of these computers is possible only for some portions of the machines, or with very simplified models.

While having a good deal in common with Multiflow Computer's Trace computers and the Cydra 5 departmental supercomputer, the El'brus-3 extends the VLIW approach to a multiprocessor environment, applying procedure-static/global-dynamic scheduling. It offers a number of features to increase efficiency and performance of execution, decrease code size, and make the system suited for both scientific and general-purpose applications.

The MARS-M, although quite complex, demonstrates the viability of integrating multiple architectural approaches such as data flow, decoupled heterogeneous processors, hierarchical systems, and VLIW scheduling. It also explores possibilities for combining static and dynamic scheduling in a VLIW implementation.

In spite of the near state of collapse of the post-Soviet computer hardware industry, users of microcomputers, minicomputers, low-end workstations, and mainframes can, given enough money, acquire Western machines. Users who demand high-performance computers, however, have few alternatives to domestic machines; although Western export control policies are softening, supercomputers remain strictly controlled. But in spite of their merits, the MARS-M and El'brus-3 are not likely to fill the supercomputing void.

The MARS-M project was terminated in 1990 due to the lack of government and industry support. State funding had been allocated through 1988. No funding was available through ITMVT because of its research policy decision to end funding for projects to build add-on processors for the El'brus machines. It was exceedingly difficult to successfully complete large computer projects in the USSR Academy of Sciences, and it will continue to be so (if it is possible at all) in the Russian Academy of Sciences.

An El'brus-3 prototype is scheduled to be assembled during 1992, but it faces many hurdles before it becomes available to the scientific community. Production of such complex machines in the post-Soviet environment is extraordinarily problematic. Computers depend on a host of upstream technologies and processes, from materials research to chip manufacturing to subsystem assembly. Conducting research is difficult in an environment in which everything from paper to complex ICs is in perennially short supply. Western chip manufacturing has completely outstripped Eastern capabilities, and the emergence of a globally competitive, indigenous post-Soviet microelectronics industry in the near future is next to unimaginable. Even if series production of the El'brus-3 is attained, the quality and performance of the machine will be considerably lower than if it had been manufactured in Western plants. Researchers are eagerly seeking Western participation in manufacturing their machines, but face the paradoxical situation that if their machines were manufactured abroad, export controls would prevent their being imported back into the former Soviet Union.

Acknowledgments

We are grateful to the many individuals who have played a role in developing the El'brus-3 and the MARS-M. We especially thank L.N. Nazarov, Yu. Kh. Sakhin, and B.A. Babayan


from ITMVT; Yu. L. Vishnevskiy from ISI; and S.E. Goodman from the University of Arizona for their help in making this paper possible. We also thank the referees for their helpful comments.

Notes

1. The often-published performance figure for this machine, 125 MIPS, refers to the execution rate on a Gibson-3 mix.

2. In [Babayan et al. 1989] the authors claim a peak performance of 700 megaflops per processor (11 gigaflops in the full configuration). This is no longer true, since the original target of a 10-ns clock was not met. The figure of 8.96 gigaflops is based on 16 VLIW processors, each with 2 adders, 2 multipliers, 1 divider, and 2 logic units, each unit generating one result per 12.5-ns clock period. The logic units are able to perform a compare operation on floating-point operands.
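As a quick check of the arithmetic, using only the figures stated in this note (16 processors, 7 result-producing units per processor, one result per 12.5-ns clock period):

\[
  16 \times (2_{\mathrm{add}} + 2_{\mathrm{mul}} + 1_{\mathrm{div}} + 2_{\mathrm{logic}})
     \times \frac{1}{12.5\ \mathrm{ns}}
  = 112 \times 80 \times 10^{6}\ \mathrm{s}^{-1}
  = 8.96 \times 10^{9}\ \text{results per second}.
\]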

3. This eliminates main memory access to local variables of procedures whose activation records are within the stack buffer. This means that operations on local variables look like operations on registers, making it possible to package a functional unit's operation in the same very long instruction word as operations to read source operands and write results. Write operations are supplied with a compiler-specified delay that corresponds to the length of the pipeline in the functional unit. In contrast, in the Cydra 5 read operations are issued in advance to compensate for the memory access delay. Only operations on registers are issued simultaneously with read operations. The El'brus-3 approach has the advantage that no register use optimization is needed.

4. But only 456 bits are used. [Babayan 1989] states that the instruction word was 320 bits long. The size of the word was changed in 1989, after the article was written.

References

Babayan, B.A. 1989. Basic results and prospects for development of the "El'brus" architecture. Prikladnaya Informatika, 15: 100-113.

Babayan, B.A., Ryabov, G.G., and Chinin, G.D. 1989. El'brus software methodology: Instrumentation-experience. In Information Processing 89, Proc., 11th World Computer Congress (San Francisco, Aug. 28-Sept. 1), North-Holland, Amsterdam, pp. 879-882.

Colwell, R.P., O'Donnell, J.J., Papworth, R.D., and Rodman, P.K. 1987. A VLIW architecture for a Trace scheduling compiler. In Proc., Second Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-II) (Palo Alto, Calif., Oct. 5-8), pp. 180-192.

Dorozhevets, M.N. 1988. The use of virtual processes in hardware and microprogrammed implementation of an operating system kernel. Ph.D. thesis, Computing Center of the Siberian Dept. of the USSR Academy of Sciences, Novosibirsk.

Farrens, M.K., and Pleszkun, A.R. 1991. Implementation of the PIPE processor. IEEE Comp. (Jan.): 65-70.

Marchuk, G.I., and Kotov, V.Ye. 1978. Modular, asynchronous, expandable system (conception), parts 1 and 2. Computing Center of the Siberian Dept. of the USSR Academy of Sciences, Novosibirsk.

Rau, B.R., Glaeser, C.D., and Greenwalt, E.M. 1982. Architectural support for the effective generation of code for horizontal architectures. In Proc., Symp. on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Mar. 1-3), pp. 96-99.

Rau, B.R., Yen, D.W.L., Yen, W., and Towle, R.A. 1989. The Cydra 5 departmental supercomputer. IEEE Comp., 22, 1 (Jan.): 12-35.

Smith, J.E. 1989. Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Comp. (July): 21-35.

Smith, J.E., Weiss, S., and Pang, N.Y. 1986. A simulation study of decoupled architecture computers. IEEE Trans. Comps., C-35, 8 (Aug.): 692-702.


Wolcott, P., and Goodman, S.E. 1988. High-speed computers of the Soviet Union. IEEE Comp. (Sept.): 32-41.

Wolcott, P., and Goodman, S.E. 1990. Soviet high-speed computers: The new generation. In Proc., Supercomputing '90 (New York, Nov. 12-16), IEEE Comp. Soc. Press, pp. 930-939.

Wolfe, A., and Shen, J.P. 1991. A variable instruction stream extension to the VLIW architecture. In ASPLOS-IV Proc.: Fourth Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr. 8-11), pp. 2-14.

Vishnevskiy, Yu.L. 1982. Architectural features of the Mini-MARS processor. Computing Center of the Siberian Dept. of the USSR Academy of Sciences, Novosibirsk, 5-33.