fisher - 1983 - very long instruction word architectures and the eli-512


  • 8/13/2019 Fisher - 1983 - Very Long Instruction Word Architectures and the Eli-512


VERY LONG INSTRUCTION WORD ARCHITECTURES AND THE ELI-512

JOSEPH A. FISHER
YALE UNIVERSITY
NEW HAVEN, CONNECTICUT 06520

ABSTRACT

By compiling ordinary scientific applications programs with a radical technique called trace scheduling, we are generating code for a parallel machine that will run these programs faster than an equivalent sequential machine -- we expect 10 to 30 times faster.

Trace scheduling generates code for machines called Very Long Instruction Word architectures. In Very Long Instruction Word machines, many statically scheduled, tightly coupled, fine-grained operations execute in parallel within a single instruction stream. VLIWs are more parallel extensions of several current architectures.

These current architectures have never cracked a fundamental barrier. The speedup they get from parallelism is never more than a factor of 2 to 3. Not that we couldn't build more parallel machines of this type; but until trace scheduling we didn't know how to generate code for them. Trace scheduling finds sufficient parallelism in ordinary code to justify thinking about a highly parallel VLIW.

At Yale we are actually building one. Our machine, the ELI-512, has a horizontal instruction word of over 500 bits and will do 10 to 30 RISC-level operations per cycle [Patterson 82]. ELI stands for Enormously Longword Instructions; 512 is the size of the instruction word we hope to achieve. (The current design has a 1200-bit instruction word.)

Once it became clear that we could actually compile code for a VLIW machine, some new questions appeared, and answers are presented in this paper. How do we put enough tests in each cycle without making the machine too big? How do we put enough memory references in each cycle without making the machine too slow?

(This research is sponsored in part by the National Science Foundation under grants MCS-81-06181 and MCS-81-07040, in part by the Office of Naval Research under grant number N00014-82-K-0184, and in part by the Army Research Office under grant number DAAG29-81-K-0171.)

WHAT IS A VLIW?

Everyone wants to use cheap hardware in parallel to speed up computation. One obvious approach would be to take your favorite Reduced Instruction Set Computer and let it be capable of executing 10 to 30 RISC-level operations per cycle, controlled by a very long instruction word. (In fact, call it a VLIW.) A VLIW looks like very parallel horizontal microcode.

More formally, VLIW architectures have the following properties:

There is one central control unit issuing a single long instruction per cycle.
Each long instruction consists of many tightly coupled independent operations.
Each operation requires a small, statically predictable number of cycles to execute.
Operations can be pipelined.

These properties distinguish VLIWs from multiprocessors (with large asynchronous tasks) and dataflow machines (without a single flow of control, and without the tight coupling). VLIWs have none of the required regularity of a vector processor or true array processor.

Many machines approximately like this have been built, but they have all hit a very low ceiling in the degree of parallelism they provide. Besides horizontal microcode engines, these machines include the CDC 6600 and its many successors, such as the scalar portion of the CRAY-1; the IBM Stretch and 360/91; and the Stanford MIPS [Hennessy 82]. It's not surprising that they didn't offer very much parallelism. Experiments and experience indicated that only a factor of 2 to 3 speedup from parallelism was available within basic blocks. (A basic block of code has no jumps in except at the beginning

1983 ACM 0149-7111/83/0600/0140 $01.00


and no jumps out except at the end.) No one knew how to find parallelism beyond conditional jumps, and evidently no one was even looking. It seemed obvious that you couldn't put operations from different basic blocks into the same instruction. There was no way to tell beforehand about the flow of control. How would you know whether you wanted them to be executed together?

Occasionally people have built much more parallel VLIW machines for special purposes. But these have been hand-coded. Hand-coding long-instruction-word machines is a horrible task, as anyone who's written horizontal microcode will tell you. The code arrangements are unintuitive and nearly impossible to follow. Special-purpose processors can get away with hand coding because they need only a very few lines of code. The Floating Point Systems AP-120b can offer speedup by a factor of 5 or 6 in a few special-purpose applications for which code has been handwritten at enormous cost. But this code does not generalize, and most users get only the standard 2 or 3 -- and then only after great labor and on small programs.

We're talking about an order of magnitude more parallelism; obviously we can forget about hand coding. But where does the parallelism come from?

Not from basic blocks. Experiments showed that the parallelism within basic blocks is very limited [Tjaden 70, Foster 72]. But a radically new global compaction technique called trace scheduling can find large degrees of parallelism beyond basic-block boundaries. Trace scheduling doesn't work on some code, but it will work on most general scientific code. And it works in a way that makes it possible to build a compiler that generates highly parallel code. Experiments done with trace scheduling in mind verify the existence of huge amounts of parallelism beyond basic blocks [Nicolau 81]. [Nicolau 81] repeats an earlier experiment done in a different context that found the same parallelism but dismissed it; trace scheduling was then unknown and immense amounts of hardware would have been needed to take advantage of the parallelism [Riseman 72].

WHY NOT VECTOR MACHINES?

Vector machines seem to offer much more parallelism than the factor of 2 or 3 that current VLIWs offer. Although vector machines have their place, we don't believe they have much chance of success on general-purpose scientific code. They are crucifyingly difficult to program, and they speed up only inner loops, not the rest of the code.

To program a vector machine, the compiler or hand coder must make the data structures in the code fit nearly exactly the regular structure built into the hardware. That's hard to do in the first place, and just as hard to change. One tweak, and the low-level code has to be rewritten by a very smart and dedicated programmer who knows the hardware and often the subtleties of the application area. Often the rewriting is unsuccessful; it's back to the drawing boards again. Many people hope that highly vectorized code can be produced from ordinary scalar code by a very intelligent compiler [Padua 80]. We believe that vectorizing will produce sufficient parallelism in only a small percentage of programs.

And vectorizing works only on inner loops; the rest of the code gets no speedup whatsoever. Even if 90% of the code were in inner loops, the other 10% would run at the same speed as on a sequential machine. Even if you could get the 90% to run in zero time, the other 10% would limit the speedup to a factor of 10.
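The arithmetic behind this limit is Amdahl's law; a quick sketch (not from the paper, just the standard formula):

```python
def overall_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Amdahl's-law speedup when only vector_fraction of the runtime
    is accelerated by a factor of vector_speedup."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# 90% of the code vectorized, inner loops sped up 10x:
print(round(overall_speedup(0.9, 10.0), 2))   # 5.26
# Even infinitely fast inner loops cap the speedup at 1/0.1 = 10:
print(round(overall_speedup(0.9, 1e12), 2))   # 10.0
```

The unaccelerated 10% dominates no matter how fast the vector hardware is.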

TRACE SCHEDULING

The VLIW compiler we have built uses a recent global compaction technique called trace scheduling [Fisher 81]. This technique was originally developed for microcode compaction, compaction being the process of generating very long instructions from some sequential source.

Horizontal microcode is like VLIW architectures in its style of parallelism. It differs in having idiosyncratic operations and less parallel hardware. Other techniques besides trace scheduling have been developed for microcode compaction [Tokoro 78, Dasgupta 79, Jacobs 82]. They differ from trace scheduling in taking already compacted basic blocks and searching for parallelism in individual code motions between blocks. That might work for horizontal microcode but it probably won't work for VLIWs. VLIWs have much more parallelism than horizontal microcode, and these techniques require too expensive a search to exploit it.

Trace scheduling replaces block-by-block compaction of code with the compaction of long streams of code, possibly thousands of instructions long. Here's the trick: You do a little bit of preprocessing. Then you schedule the long streams of code as if they were basic blocks. Then you undo the bad effects of pretending that they were basic blocks. What you get out of this is the ability to use well-known, very efficient scheduling techniques on the whole stream. These techniques previously seemed confined to basic blocks.
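The pick-a-trace/compact/repeat cycle can be sketched in a few lines (a toy rendering, not the actual compiler; the block names, frequency table, and greedy successor rule are invented for illustration, and compensation-code generation is omitted):

```python
def pick_trace(blocks, freq, succ):
    """Greedily follow the most frequently executed successor,
    starting from the hottest not-yet-scheduled block."""
    trace = [max(blocks, key=lambda b: freq[b])]
    while succ.get(trace[-1]):
        nxt = max(succ[trace[-1]], key=lambda b: freq[b])
        if nxt in trace or nxt not in blocks:
            break
        trace.append(nxt)
    return trace

def trace_schedule(blocks, freq, succ):
    """Repeatedly pick a trace, compact it as one long block,
    and move on to the remaining (colder) code."""
    remaining, traces = set(blocks), []
    while remaining:
        trace = pick_trace(remaining, freq, succ)
        traces.append(trace)      # compact as if one basic block
        remaining -= set(trace)   # (split/rejoin recovery code omitted)
    return traces

freq = {"A": 100, "B": 90, "C": 10, "D": 95}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
print(trace_schedule(set(freq), freq, succ))   # [['A', 'B', 'D'], ['C']]
```

The hot path A-B-D is compacted as one stream; the cold block C is handled on a later pass.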

To sketch briefly, we start with loop-free code that has no back edges. Given a reducible flow graph, we can find loop-free innermost code [Aho 77]. Part (a) of the figure shows a small flow graph without back edges. Dynamic information -- jump predictions -- is used at compile time to select streams with the highest probability of execution. Those streams we call "traces." We pick our first trace from the most frequently executed code. In part (b) of the figure, a trace has been selected from the flow graph.

TRACE SCHEDULING LOOP-FREE CODE
(a) A flow graph, with each block representing a basic block of code. (b) A trace picked from the flow graph. (c) The trace has been scheduled but it hasn't been relinked to the rest of the code. (d) The sections of unscheduled code that allow re-linking.

Preprocessing prevents the scheduler from making absolutely illegal code motions between blocks, ones that would clobber the values of live variables off the trace. This is done by adding new, special edges to the data precedence graph built for the trace. The new edges are drawn between the test operations that conditionally jump to where the variable is live and the operations that might clobber the variable. The edges are added to the data precedence graph and look just like all the other edges. The scheduler, none the wiser, is then permitted to behave just as if it were scheduling a single basic block. It pays no attention whatsoever to block boundaries.
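A toy version of that preprocessing step (the operation tuples and the liveness table are invented representations, not the compiler's actual data structures):

```python
def add_trace_edges(ops, live_off_trace):
    """Build data-precedence edges for a trace, including the special
    edges that keep a variable's writers below any conditional jump
    at which that variable is live off the trace.
    ops: list of (name, kind, reads, writes); kind is 'op' or 'test'.
    live_off_trace: {test_name: variables live off-trace at that test}."""
    edges = set()
    for i, (ni, ki, ri, wi) in enumerate(ops):
        for j in range(i + 1, len(ops)):
            nj, kj, rj, wj = ops[j]
            # ordinary flow/anti/output dependences
            if (wi & rj) or (ri & wj) or (wi & wj):
                edges.add((ni, nj))
            # special edge: j would clobber a value live off-trace at test i
            if ki == "test" and wj & live_off_trace.get(ni, set()):
                edges.add((ni, nj))
    return edges

ops = [("t1", "test", {"x"}, set()),   # conditional jump; 'y' live off-trace
       ("o2", "op",   {"z"}, {"y"})]   # would clobber y if moved above t1
print(add_trace_edges(ops, {"t1": {"y"}}))   # {('t1', 'o2')}
```

Because the special edge looks like any other precedence edge, an ordinary list scheduler respects it automatically.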

After scheduling is complete, the scheduler has made many code motions that will not correctly preserve jumps from the stream to the outside world (or rejoins back). So a postprocessor inserts new code at the stream exits and entrances to recover the correct machine state outside the stream. Without this ability, available parallelism would be unduly constrained by the need to preserve jump boundaries.

In part (c) of the figure, the trace has been isolated and in part (d) the new, uncompacted code appears at the code splits and rejoins.

Then we look for our second trace. Again we look at the most frequently executed code, which by now includes not only the source code beyond the first trace but also any new code that we generated to recover splits and rejoins. We compact the second trace the same way, possibly producing recovery code. (In our actual implementation so far, we have been pleasantly surprised at the small amount of recovery code that gets generated.) Eventually, this process works its way out to code with little probability of execution, and if need be more mundane compaction methods are used so as not to produce new code.

Trace scheduling provides a natural solution for loops. Hand coders use software pipelining to increase parallelism, rewriting a loop so as to do pieces of several consecutive iterations simultaneously. Trace scheduling can be trivially extended to do software pipelining on any loop. We simply unroll the loop for many iterations. The unrolled loop is a stream, all the intermediate loop tests are now conditional jumps, and the stream gets compacted as above.
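The unrolling step above can be sketched on a toy IR (operation names and the string-tagged representation are invented for illustration):

```python
def unroll(body, test, k):
    """Unroll a loop body k times. Every intermediate copy of the loop
    test becomes a conditional jump off the stream; only the last copy
    loops back. body/test are lists of strings naming operations."""
    stream = []
    for i in range(k):
        stream += [f"{op}@{i}" for op in body]              # iteration i's copy
        if i < k - 1:
            stream += [f"{t}@{i} (cond-jump off)" for t in test]
    stream += [f"{t}@{k-1} (loop back)" for t in test]
    return stream

print(unroll(["load", "add", "store"], ["cmp"], 2))
```

The resulting stream is exactly the kind of long, mostly-straight-line trace the scheduler can compact.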

    While this method of handling loops may be somewhat lessspace efficient than is theoretically necessary, it can handle



    implies the same speedup, or less, or more, is subject to debate.These incidental operations may slow down the sequential codemore than the parallel, making the speedup due to parallelismall the greater. Only time will tell.

The front end we are currently using generates our RISC-level intermediate code, N-address code or NADDR. The input is a local Lisp-sugared FORTRAN, C, or Pascal level language called Tiny-Lisp. It was something we built quickly to give us maximal flexibility. We have an easy time writing sample code for it, we didn't have to write a parser, and we can fiddle with the compiler easily, which has proved to be quite useful. A FORTRAN '77 subset compiler into NADDR is written and being debugged, and we will consider other languages after that. Our RISC-level NADDR is very easy to generate code for and to apply standard compiler optimizations to.

We have two more code generators being written right now. A full ELI-512 generator is quite far along -- a subset of it is now being interfaced to the trace picker and fixup code. We are also writing an FPS-164 code generator. The FPS-164 is the successor to the Floating Point Systems AP-120b, probably the largest-selling machine ever to have horizontal microcode as its only language. There is a FORTRAN compiler for the FPS-164, but our experience has been that it finds little of even the small amount of parallelism available on that machine. A compiler that competes with hand code would really change the potential usability of that machine (it's very difficult to hand code) and would demonstrate the versatility of trace scheduling.

MEMORY ANTI-ALIASING ON BULLDOG

Trace scheduling makes it necessary to do massive numbers of code motions in order to fill instructions with operations that come from widely separated places in the program. Code motions are restricted by data precedence. For example, suppose our program has the steps:

(1) Z := A + X
(2) A := Y + W

Our code motions must not cause (2) to be scheduled earlier than (1). So the trace scheduler builds a data-precedence edge before scheduling.
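For scalars the edge comes from a simple read/write-set check; a minimal sketch (the Stmt record and its sets are an invented representation, not the compiler's IR):

```python
from collections import namedtuple

Stmt = namedtuple("Stmt", "reads writes")

def must_precede(s1, s2):
    """True if s2 may not be scheduled above s1: s2 writes something
    s1 reads or writes, or s2 reads something s1 writes."""
    return bool(((s1.reads | s1.writes) & s2.writes) or (s1.writes & s2.reads))

s1 = Stmt(reads={"A", "X"}, writes={"Z"})   # (1) Z := A + X
s2 = Stmt(reads={"Y"}, writes={"A"})        # (2) A := Y + ...
print(must_precede(s1, s2))                 # True: (2) clobbers A, read by (1)
```

When the check is True, the scheduler adds a precedence edge and the motion is forbidden.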

    But what happens when A is an array reference?

(1) Z := A[expr1] + X
(2) A[expr2] := Y + W

Whether (2) may be done earlier than (1) is ambiguous. If expr1 can be guaranteed to be different from expr2, then the code motion is legal; otherwise not. Answering this question is the problem of anti-aliasing memory references. With other forms of indirection, such as chasing down pointers, anti-aliasing has little hope of success. But when indirect references are to array elements, we can usually tell they are different at compile time. Indirect references in inner loops of scientific code are almost always to array elements.

The system implemented in the Bulldog compiler attempts to solve the equation expr1 = expr2. It uses reaching definitions [Aho 77] to narrow the range of each variable in the expressions. We can assume that the variables are integers and use a diophantine equation solver to determine whether they could be the same. Range analysis can be quite sophisticated. In the implemented system, definitions are propagated as far as possible, and equations are solved in terms of the simplest variables possible. We do not yet use branch conditions to narrow the range of values a variable could take, but we will.

Anti-aliasing has been implemented and works correctly (if not quickly). Unfortunately, it is missing a few of its abilities -- very few, but enough to slow it down badly. In this case the truism really holds: The chain is only as strong as its weakest link. So far we get speedups in the range of 5 to 10 for the practical code we've looked at. Good, but not what we want. Examining the results by hand makes it clear that when the missing pieces are supplied the speedup will be considerable.

A MACHINE TO RUN TRACE-SCHEDULED CODE

The ELI-512 has 16 clusters, each containing an ALU and some storage. The clusters are arranged circularly, with each communicating with its nearest neighbors and some communicating with farther removed clusters. (Rough sketches of the ELI and its clusters are on the next page.)

The ELI uses its 500+ bit instruction word to initiate all of the following in each instruction cycle:

16 ALU operations. 8 will be 32-bit integer operations, and 8 will be done using 64-bit ALUs with a varied repertoire, including pipelined floating-point calculations.
8 pipelined memory references -- more about these later.
32 register accesses.
Very many data movements, including operand selects for the above operations.
A multiway conditional jump based on several independent tests -- more about these later too.

(With this much happening at once, only a maniac would want to code the ELI by hand.)

    To carry out these operations, the ELI has 8 M-clusters and 8



    GLOBAL INTERCONNECTION SCHEME OF THE ELI-512


    TYPICAL M AND F CLUSTER BLOCK DIAGRAMS

F-clusters. Each M-cluster has within it:

A local memory module (of so far undetermined size).
An integer ALU, which is likely to spend most of its time doing address calculations. The exact repertoires may vary from cluster to cluster, and won't be fixed until we tune the architecture using actual code.
A multiport integer register bank.
A limited cluster crossbar, with 8 or fewer participants. Some of the participants will be off-cluster busses. Some of the crossbar connections will not be made.

And each F-cluster has within it:

A floating point ALU. The repertoires of the ALUs will vary from cluster to cluster and won't be fixed until we tune the architecture.
A multiport floating register bank.
A limited cluster crossbar, with 8 or fewer participants. Some of the participants will be off-cluster busses. Some of the crossbar connections will not be made.

Do not be deceived by occasional regularities in the structure. They are there to make the hardware easier to build. The compiler doesn't know about them, and it doesn't attempt to make any use of them. When we start running scientific code through the compiler, we will undoubtedly further tune the architecture. We will want to remove as many busses as we can, and many of the regularities may disappear.

Current plans are to construct the prototype ELI from 100K ECL logic, though we may opt for Schottky TTL.

PROBLEMS

Nobody's ever wanted to build a 512-bit-wide instruction word machine before. As soon as we started considering it, we discovered that there are two big problems. How do you put enough tests in each instruction without making the machine too big? How do you put enough memory references in each instruction without making the machine too slow?

Comparing VLIWs with vector machines illustrates the problems to be solved. VLIWs put fine-grained, tightly coupled, but logically unrelated operations in single instructions. Vector machines do many fine-grained, tightly coupled, logically related operations at once to the elements of a vector. Vector machines can do many parallel operations between tests; VLIWs cannot. Vector machines can structure memory references to entire arrays or slices of arrays; VLIWs cannot. We've argued, of course, that vector machines fail on



general scientific code for other reasons. How do we get their virtues without their vices?

VLIWs NEED CLEVER JUMP MECHANISMS

Short basic blocks implied a lack of local parallelism. They also imply a low ratio of operations to tests. If we are going to pack a great many operations into each cycle, we had better be prepared to make more than one test per cycle. Note that this is not a problem for today's statically scheduled operation machines, which don't pack enough operations in each instruction to hit this ratio.

Clearly we need a mechanism for jumping to one of several places indicated by the results of several tests. But not just any multiway jump mechanism will do. Many horizontally microcodable machines allow several tests to be specified in each microinstruction. But the mechanisms for doing this are too inflexible to be of significant use here. They do not allow for multiple independent tests, but rather offer a hardwired selection of tests that may be done at the same time. Some machines allow some specific set of bits to alter the next address calculation, allowing a 2^n-way jump. This is used, for example, to implement an opcode decode, or some other hardware case statement.

Another approach that won't suffice for us is found in VAX 11/780 microcode [Patterson 79]. There, any one of several fixed sets of tests can be specified in a given instruction. A mask can be used to select any subset of those tests, which are logically ANDed into the jump address. Unfortunately, the probability that two given conditional tests appear in the same set in the repertoire is very low. In compacting it is extremely unlikely that one can place exactly the tests one wants to in a single instruction, or even a large subset of them. Instead, combinations are hardwired in advance. One would guess that the combinations represent a convenient grouping for some given application program, in the case of the VAX, presumably the VAX instruction set emulator.

The most convenient support the architecture could possibly provide would be a 2^n-way jump based on the results of testing n independent conditions. This is not as unrealistic as it sounds; such a mechanism was considered in the course of developing trace scheduling [Fisher 80], and seemed quite practical. It turned out, however, to be more general than we needed.

After we had actually implemented trace scheduling, a surprising fact emerged: What trace scheduling requires is a mechanism for jumping to any of n+1 locations as the result of n independent tests. The tests should be any combination of n from the repertoire of tests available on the machine.

The jump works just the way a COND statement works in LISP. For simplicity, we will pretend that a test's FAILing means failing and wanting to stay on the trace and that SUCCEEDing means succeeding and wanting to jump off the trace. A statement to express the multiway jump might appear as:

(COND (test1 label1)
      (test2 label2)
      ...
      (testk labelk)
      ...
      (testn labeln)
      (SUCCEED label-fall-through))

If the first test, test1, FAILs, it wants to stay on the trace and the second test, test2, is made. If that FAILs, it too wants to stay on the trace. If a test, testk, SUCCEEDs, then it wants to jump off the trace to labelk; no on-trace tests after testk will be made. If all the tests FAIL, we finally get to the last address in the instruction and fall through.
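The semantics are easy to simulate (a sketch; the test predicates and label names are placeholders, not machine tests):

```python
def multiway_jump(tests, fall_through):
    """n+1-way jump: tests is a list of (predicate, label) in trace
    order. The first predicate to SUCCEED selects its off-trace label;
    if all FAIL, control falls through to the next on-trace word."""
    for succeed, label in tests:
        if succeed():
            return label          # first success wins; later tests ignored
    return fall_through

x = 5
tests = [(lambda: x < 0, "label1"),   # FAILs: stay on trace
         (lambda: x > 3, "label2"),   # SUCCEEDs: jump off trace
         (lambda: x == 5, "label3")]  # never examined
print(multiway_jump(tests, "label-fall-through"))   # label2
```

This is exactly COND's first-match rule, which is why n+1 targets are enough.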

We find that n+1 target labels suffice; 2^n aren't needed. We sort all the tests for one instruction in the order in which they appeared in the trace. Then when a test SUCCEEDs we are glad that we already executed the tests that came earlier in the source order, but we don't much care about the ones that came later, since we're now off the trace.

It is not hard to build an n+1-way jump mechanism. The figure shows that all we need is a priority encoder and n test multiplexers. The wide instruction word selects each of the n tests with n j-bit fields (where j is the log of the number of tests). The first of the n tests (in sequence order) that wants to jump off the trace in effect selects the next instruction.

But how do we actually produce the next address? We could place n+1 candidates for the post of next address in full in each instruction. But even on the ELI, using so many instruction bits for that would seem like overkill. The idea of using the select bits as part of the next address (as in the restricted multiway jumps referred to above) seems right, if we can overcome a small packing problem.

For example, if n=3, and if the current instruction has a next-instruction address field of, say, 00000011, then we have the following situation:

TEST   CONDITION FIELD   ADDRESS IF TEST IS FIRST TO SUCCEED
0      test1             00 00000011
1      test2             01 00000011
2      test3             10 00000011
3      SUCCEED           11 00000011
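Under this scheme the hardware just prefixes the priority encoder's select bits onto the shared next-address field; a sketch (bit strings stand in for real address wires):

```python
def next_address(select_bits, next_addr_field):
    """Form the jump target by prefixing the priority encoder's
    select bits onto the instruction's next-address field."""
    return select_bits + next_addr_field

# n = 3: four possible targets share one 8-bit next-address field.
for sel, test in zip(["00", "01", "10", "11"],
                     ["test1", "test2", "test3", "SUCCEED"]):
    print(test, "->", next_address(sel, "00000011"))
```

The four targets thus occupy one aligned "slice" of program memory, which is what creates the packing problem discussed next.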



IMPLEMENTING N+1-WAY INDEPENDENT JUMPS

What happens when we pack fewer than n tests in a cycle? From the example above, with n=3, it might seem that we need to spend four program memory locations for each set of target addresses. But what if we have straight-line code, or want to do only one or two tests in some cycles (as we surely will)? Do we have to waste the unused slots? We can avoid wasting slots if we include a test that always SUCCEEDs and another that always FAILs. With these tests, we can cause two instructions which each want to pack one test to share an address slice, or we can pack an instruction that does two tests with an instruction that wants to do no test. For example, take two instructions, INSTR1, which wants to do TEST1, and INSTR2, which wants to do TEST2. We can arrange to have them both jump to a label in the address slice 00000011 (and thus both have 00000011 as their next-address field) as follows:

After INSTR1, we jump to either 00 00000011 or to 01 00000011. As a result, the next address field of INSTR1 is 00000011, and the test fields are filled in as below.

TEST   CONDITION FIELD   ADDRESS IF TEST IS FIRST TO SUCCEED
0      TEST1             00 00000011
1      SUCCEED           01 00000011
2      don't care        won't happen
3      SUCCEED           won't happen

After INSTR2, we jump to either 10 00000011 or to 11 00000011. So it looks like:

TEST   CONDITION FIELD   ADDRESS IF TEST IS FIRST TO SUCCEED
0      FAIL              won't happen
1      FAIL              won't happen
2      TEST2             10 00000011
3      SUCCEED           11 00000011

Since we don't have an incremented program counter and can rearrange addresses at will, these allocations can be done in a straightforward manner in a postpass program. A little space may be wasted at code rejoins, but not much.

In this scheme we do not increment a program counter to get the next address, though that could be fitted in if there were some advantage in speed for a particular hardware implementation.

Our previous work on 2^n-way jumps applies also to n+1-way jumps and contains a more complete explanation of these ideas [Fisher 80].

THE JUMP MECHANISM ON THE ELI-512

The ELI-512 will have an n+1-way jump mechanism like the one described above. During the time when we are tuning the machine design and the compiler, we will determine how many tests are appropriate; it seems likely that n will be 3, 4, or 5. We have two instruction-fetch mechanisms under consideration.

Delayed branches are an old microcode instruction-fetch trick that works particularly well here [Gross 82, Patterson 82]. In a delayed-branch mechanism the address of instruction M+k is determined by the result of a test in instruction M. The k instructions between M and M+k are done whether the test succeeds or fails. Using trace scheduling, we know which way most jumps go; so we can fill the gap with instructions that will probably be in the execution stream. The current compiler handles delayed jumps as a matter of course, but we've taken no measurements on whether or how much they slow down the code.

    The alternative is to fetch an entire slice at once. We would have n+1 banks of instruction memory and would fetch all the next candidate words, using the next-instruction address of an instruction as soon as it is selected. Then, when the tests have settled down, the bits coming out of the priority encoder can multiplex from among the n+1 choices. The large words on the ELI may make this technique difficult.

    VLIW COMPILERS MUST PREDICT MEMORY BANKS
    With so many operations packed in each cycle, many of them will have to be memory references. But (as in any parallel processing system) we cannot simply issue many references each cycle; two things will go wrong. Getting the addresses through some kind of global arbitration system will take a long time. And the probability of bank conflict will approach 1, requiring us to freeze the entire machine most cycles.
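The bank-conflict claim is easy to quantify under a simple (assumed) model of m independent, uniformly distributed references into b banks; the real reference streams are of course not random, which is exactly why the compiler-based approach below works.

```python
def p_no_conflict(m, b):
    # Probability that m independent uniform references hit m distinct
    # banks: (b/b) * ((b-1)/b) * ... * ((b-m+1)/b).
    p = 1.0
    for i in range(m):
        p *= (b - i) / b
    return p

for m in (2, 4, 8, 16):
    print(m, round(1 - p_no_conflict(m, 16), 4))
# With 16 banks, 16 naive references per cycle conflict almost surely.
```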

    But here we can rely (as usual) on a combination of smart compiler and static code. We ask our compiler to look at the code and try to predict what bank the reference is in. When we can predict the bank, we can use a dedicated address register to refer to it directly. To make several references each cycle, we access each of the memory banks' address registers individually. No arbitration is necessary, since the paths of the addresses will never cross.

    When we ask the compiler which bank a reference is going to be in, what are the chances of getting an answer? In the static code we expect to run on a VLIW, very good. Scalars always have known locations. What about arrays? The same system that does anti-aliasing can attempt to see which bank a reference is in. As you'll recall, loops get unwound to increase parallelism. In fact, it's the anti-aliasing system that does the unwinding, since it knows which are induction variables and which aren't. (It renames non-induction variables that appear in successive unwound iterations to avoid unnecessary data-precedence.) By unrolling so that the array subscripts increase by a multiple of the number of banks every iteration, the anti-aliasing system often makes it possible to predict banks.
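A tiny sketch of why this works. B = 4 banks and simple modulo (word-interleaved) placement are our assumptions here, not a statement of the ELI's actual memory map:

```python
B = 4                      # number of interleaved banks (assumed)

def bank(subscript):
    return subscript % B   # simple word-interleaved placement

# Once the subscript i is known to be 0 mod B, the unrolled body's
# references A[i], A[i+1], ..., A[i+B-1] each have a fixed bank:
for i in (0, 4, 8):
    print(i, [bank(i + d) for d in range(B)])   # always [0, 1, 2, 3]
# The subscript advances by B per unrolled iteration, so each
# reference keeps the same bank on every trip around the loop.
```

Each reference can therefore be wired to its own bank's dedicated address register at compile time, with no arbitration.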

    What about unpredictable references? Two kinds of unpredictable references muddy up this scheme.

    First of all, we might simply have no chance to predict the bank address of a reference. For example, we might be chasing down a pointer and need access to the entire memory address space. But such accesses are assumed to be in the tiny minority; all we have to do is be sure they don't excessively slow down the local, predictable accesses. Our solution is to build a shadow memory-access system that takes addresses for any bank whatsoever and returns the values at those addresses. This requires our memory banks to be dual-ported and have lockout capabilities. The allocation of hardware resources should favor predictable access; unpredictable access can be made slower. And the predictable accesses should have priority in case of bank conflict; we can make the machine freeze when an unpredictable reference doesn't finish in time. If there are too many of these references, the machine will perform badly. But in that case conservative data-precedence would have destroyed any chance at large amounts of parallelism anyway.

    The second problem is that even when we have arrays and have unrolled the loops properly, we might not be able to predict the bank location of a subscript. For example, the subscript value might depend on a loop index variable with a data-dependent starting point. Unknown starting values don't ruin our chances of doing predictable references inside the loop. All we have to do is ask the compiler to set up a kind of pre-loop. The pre-loop looks like the original loop, but it exits when the unknown variable reaches some known value modulo the number of banks. Although it may itself be unwound and compacted, the pre-loop has to use the slow unpredictable addressing system on the unpredictable references. But it will execute some short number of cycles compared to the very long unwound loop. The situation given B banks is illustrated below.

    MEMORY ACCESSING IN THE ELI-512
    The current design of the ELI counts on the system outlined above: bank prediction, precedence for local searches, and pre-looping. Each of the 8 M-clusters has one memory access port used for times when the bank is known. We will start one pipelined access per cycle per M-cluster (which may require us to be able to distinguish among at least 16 physical banks, 2 per module, depending upon the design of the memory). This will give us a potential data memory access bandwidth of about 400 Mbytes/sec if our cycle time is in the neighborhood of 150 ns. To implement pre-looping, we will have tests for addresses modulo the number of banks.
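The bandwidth figure checks out as rough arithmetic; the 8-byte access width below is our assumption, since the paper doesn't state the data word size here.

```python
clusters = 8              # one pipelined access per cycle per M-cluster
cycle_s = 150e-9          # 150 ns cycle time
bytes_per_access = 8      # assumed data word width
bw = clusters * bytes_per_access / cycle_s
print(round(bw / 1e6))    # ~427 Mbytes/sec, i.e. "about 400"
```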

    In addition, the ELI will have two access ports that address memory globally. When they are ready, the results of a global fetch are put in a local register bank. When a reference is made to the data, the system freezes if the data isn't in the registers, and all the pending global references sneak in while they have a chance. We expect this state of affairs to be quite infrequent.

    WHAT WE ARE DOING AND NOT DOING
    This paper offers solutions to the problems standing in the way of using Very Long Instruction Word architectures to speed up scientific code. These problems include highly parallel code generation, multiple tests in each cycle, and multiple memory references in each cycle.

    (a)
                 for I = k to n
                   do {loop body}

    (b)
                 I := k
    LOOP:        {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 goto LOOP
    FALLTHROUGH:

    (c)
                 I := k
    PRELOOP:     if I = 0 mod(B) goto LOOP
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 goto PRELOOP
    LOOP:        {loop body}        * here we deduce I = 0 mod(B)
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 {loop body}
                 if I >= n goto FALLTHROUGH
                 I := I+1
                 goto LOOP
    FALLTHROUGH:

    ADDING A PRE-LOOP
    Adding a pre-loop to cause unknown bank references to start the loop with known values, modulo the number of banks. (a) contains a source loop. Note that we are using FORTRAN loop style with the test at the end. In (b) the loop is unwound. In (c) we have added a pre-loop that executes until I is a known value modulo the number of banks.
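A runnable rendition of the transformation may make the control flow clearer. This Python sketch follows the figure's shape; the loop body is a stand-in, and the "slow"/"fast" tags merely annotate which references would use unpredictable versus bank-predictable addressing. B = 4 is an arbitrary assumed bank count.

```python
def fortran_loop(k, n):
    # (a): FORTRAN-style loop with the test at the end; the body
    # always executes at least once, for I = k up to n.
    trace, I = [], k
    while True:
        trace.append(("slow", I))
        if I >= n:
            return trace
        I += 1

def with_preloop(k, n, B=4):
    # (c): the pre-loop steps one iteration at a time, using slow
    # unpredictable addressing, until I = 0 mod B; then the unwound
    # main loop runs with compile-time-predictable banks.
    trace, I = [], k
    while I % B != 0:                    # PRELOOP
        trace.append(("slow", I))
        if I >= n:
            return trace
        I += 1
    while True:                          # LOOP, unwound B times
        for _ in range(B):
            trace.append(("fast", I))    # bank is predictable here
            if I >= n:
                return trace
            I += 1

# Both versions visit exactly the same indices:
print([i for _, i in with_preloop(3, 11)])
```

The pre-loop runs at most B-1 iterations, a negligible cost next to the long unwound main loop.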

    The Bulldog compiler and experiments done on real code have demonstrated that a large degree of parallelism exists in typical scientific code. Given that the existence of this parallelism makes VLIW machines desirable, we are building one: the ELI-512, a very parallel attached processor with a 500+ bit instruction word. We expect the ELI to speed up code by a factor of 10-30 over an equivalent sequential machine. We will be generating good code for the ELI before we build it. We are also writing a compiler for the FPS-164, a much less parallel but otherwise similar architecture.

    Our code generators use trace scheduling for locating and specifying parallelism originating in far removed places in the code. The n+1-way jump mechanism makes it possible to schedule enough tests in each cycle without making the machine too big. Bank prediction, precedence for local searches, and pre-looping make it possible to schedule enough memory references in each cycle without making the machine too slow.

    Partly to reduce the scope of the project and partly because of the nature of VLIWs, we are limiting ourselves in various ways. ELI will be an attached processor -- no I/O, no compilers, no ELI simulators, no user amenities. Rather we will choose a sane host. ELI will not be optimized for efficient context switch or procedure call. The ELI will be running compute-bound scientific code. It is difficult to extend VLIW parallelism beyond procedure calls; when we want to, we can expand such calls in line. Any VLIW architecture is likely to perform badly on dynamic code, including most systems and general-purpose code and some scientific code. We will be content to have ELI perform very well on most scientific code.

    ACKNOWLEDGEMENTS
    Invaluable contributions to the ELI design and the Bulldog compiler have been made by John Ruttenberg, Alexandru Nicolau, John Ellis, Mark Sideli, John O'Donnell, and Charles Marshall. Trish Johnson turned scribbling into nice pictures. Mary-Claire van Leunen edited the paper.

    References

    [Aho 77]       A. V. Aho and J. D. Ullman. Principles of Compiler Design. Addison-Wesley, 1977.

    [Dasgupta 79]  S. Dasgupta. The Organization of Microprogram Stores. ACM Comp. Surv. 11(1):39-65, March 1979.

    [Fisher 80]    J. A. Fisher. An effective packing method for use with 2^n-way jump instruction hardware. In 13th annual microprogramming workshop, pages 64-75. ACM Special Interest Group on Microprogramming, November 1980.

    [Fisher 81]    J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers C-30(7):478-490, July 1981.

    [Foster 72]    C. C. Foster and E. M. Riseman. Percolation of code to enhance parallel dispatching and execution. IEEE Transactions on Computers 21(12):1411-1415, December 1972.

    [Gross 82]     T. R. Gross and J. L. Hennessy. Optimizing Delayed Branches. In 15th annual workshop on microprogramming, pages 114-120. ACM Special Interest Group on Microprogramming, October 1982.

    [Hennessy 82]  J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Baskett, and J. Gill. MIPS: A Microprocessor Architecture. In 15th annual workshop on microprogramming, pages 17-22. ACM Special Interest Group on Microprogramming, October 1982.

    [Jacobs 82]    D. Jacobs, J. Prins, P. Siegel, and K. Wilson. Monte carlo techniques in code optimization. In 15th annual workshop on microprogramming, pages 143-148. ACM Special Interest Group on Microprogramming, October 1982.

    [Nicolau 81]   Alexandru Nicolau and Joseph A. Fisher. Using an oracle to measure parallelism in single instruction stream programs. In 14th annual microprogramming workshop, pages 171-182. ACM Special Interest Group on Microprogramming, October 1981.

    [Padua 80]     D. A. Padua, D. J. Kuck, and D. H. Lawrie. High speed multiprocessors and compilation techniques. IEEE Transactions on Computers 29(9):763-776, September 1980.

    [Patterson 79] D. A. Patterson, K. Lew, and R. Tuck. Towards an efficient machine-independent language for microprogramming. In 12th annual microprogramming workshop, pages 22-35. ACM Special Interest Group on Microprogramming, 1979.

    [Patterson 82] D. A. Patterson and C. H. Sequin. A VLSI RISC. Computer 15(9):8-21, September 1982.

    [Riseman 72]   E. M. Riseman and C. C. Foster. The inhibition of potential parallelism by conditional jumps. IEEE Transactions on Computers 21(12):1405-1411, December 1972.

    [Tjaden 70]    G. S. Tjaden and M. J. Flynn. Detection and parallel execution of independent instructions. IEEE Transactions on Computers 19(10):889-895, October 1970.

    [Tokoro 78]    M. Tokoro, T. Takizuka, E. Tamura, and I. Yamaura. Towards an Efficient Machine-Independent Language for Microprogramming. In 11th Annual Microprogramming Workshop, pages 41-50. SIGMICRO, 1978.

