Microcode Processing: Positioning and Directions


0272-1732/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society.

Stamatis Vassiliadis, Stephan Wong, and Sorin Cotofana
Delft University of Technology

Microcode is an important innovation in computer engineering. The authors discuss the evolution of microcode from its introduction to its decline and to its likely resurgence in custom computing machines. Furthermore, they present a microcoded machine augmented with field-programmable gate arrays (FPGAs) and provide experimental evidence that it can substantially increase the performance of some media benchmarks.

Microcode, which M.V. Wilkes introduced in 1951, constitutes an important computer engineering innovation [1]. Microcode allowed the emulation of complex instructions through simple hardware (operations) and thus greatly helped drive the development of computer systems. Microcode actually partitioned computer engineering into two distinct conceptual layers: architecture and implementation. Architecture in this article denotes the programmer’s view of a system’s attributes, such as the processor’s conceptual structure and functional behavior. It is distinct from the dataflow’s organization and the processor’s physical implementation [2]. The partitioning is partly because emulation allowed the definition of complex instructions that might have been technologically unimplementable (at the time they were defined), thus projecting an architecture to the future. That is, microcode let computer architects determine a technology-independent functional behavior (such as the instruction set) and select conceptual structures providing the following possibilities:

• Computer architects could define the computer architecture as a programmer’s interface to the hardware rather than to a specific technology-dependent realization of a specific behavior.

• Computer architects could determine a single architecture for a family of implementations, giving rise to the important concept of compatibility. Simply stated, you could write programs for a specific architecture once and run them “ad infinitum” independent of the implementations.

Since its beginnings, microcode, as introduced by Wilkes, has been a sequence of micro-operations, called a microprogram. Such a microprogram comprises pulses for operating the gates associated with arithmetical and control registers. Figure 1 depicts the method of generating this sequence of pulses [1]. First, a timing pulse initiating a micro-operation enters the decoding tree, which in turn generates an output signal depending on setup register R. This output signal passes to matrix A, which in turn generates pulses to control arithmetical and control units and thus perform the required micro-operation. The output signal also passes to matrix B, which in turn generates pulses to control and update setup register R (with a certain delay). The next timing pulse, therefore, generates the next micro-operation in the required sequence because of the updated setup register R.
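The pulse sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not Wilkes’ circuit: the `Line` record, the gate names, and the four-entry `PROGRAM` are hypothetical, and the conditional branch on the accumulator’s sign flip-flop is modeled as an optional second successor in matrix B.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Line:
    """One output line of the decoding tree (hypothetical encoding)."""
    gates: Tuple[str, ...]                    # matrix A: gate pulses to emit
    next_r: int                               # matrix B: next value of register R
    next_r_if_negative: Optional[int] = None  # matrix B: alternative successor


def run_wilkes(lines: Dict[int, Line],
               sign_is_negative: Callable[[], bool],
               r: int = 0, halt: int = -1,
               max_pulses: int = 100) -> List[Tuple[str, ...]]:
    """Return the gate pulses emitted, one entry per timing pulse."""
    pulses = []
    for _ in range(max_pulses):
        if r == halt:
            break
        line = lines[r]                 # decoding tree selects a line from R
        pulses.append(line.gates)       # matrix A drives the gates
        # Matrix B updates R (the delay is implicit: R changes for the next pulse).
        if line.next_r_if_negative is not None and sign_is_negative():
            r = line.next_r_if_negative
        else:
            r = line.next_r
    return pulses


# Hypothetical microprogram: subtract, then complement only if the result is negative.
PROGRAM = {
    0: Line(("load_acc",), 1),
    1: Line(("subtract",), 2, next_r_if_negative=3),
    2: Line(("store_acc",), -1),
    3: Line(("complement_acc",), 2),
}
```

Calling `run_wilkes(PROGRAM, lambda: True)` follows the conditional path through line 3, while `lambda: False` skips it.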

Microcode has become a major component, requiring a large development effort, from mainframes to PC processors. To understand the magnitude and importance of the role that microcode has played, consider some key facts about “real” machine microcode, shown in Table 1 [2]. As you can observe, several assembly and higher-level languages have been developed and used, with a substantial amount of code developed for the implementations [3]. This is indicated by the number of modules used, the number of lines of code, and the number of lines of code plus comments.

Microcode has evolved considerably from its initial structure. In this article, we look at current architectures and briefly discuss some technological advances that diminished microcode’s appeal and other technological advances that could reintroduce its usefulness.

Figure 1. Wilkes’ microprogram control model. [Diagram labels: timing pulse, decoding tree, register R, matrix A, matrix B, delay, “to gates in arithmetical unit, and so on,” “from sign flip-flop of accumulator.”]

Table 1. Some facts from two IBM enterprise servers.

System         No. of assembly   No. of higher   Modules   Lines of code   Lines of code
               languages         languages                                 plus comments
IBM ES/4381    3                 1               1,505     480,692         791,696
IBM ES/9370    6                 2               3,130     796,136         1,512,750
Total          9                 3               4,635     1,276,828       2,304,446

Modern microcode concepts

Here we discuss concepts associated with the microcode structure and some organizational techniques that microcode employs. Figure 2 depicts a high-level microprogrammed computer [4].

The control store in Figure 2 contains the microinstructions (which represent one or more micro-operations) and corresponds to matrices A and B in Figure 1. The machine’s operation is as follows:

1. The control store address register (CSAR) contains the address of the next microinstruction in the control store. The microinstruction there is then forwarded to the microinstruction register (MIR).

2. The MIR decodes the microinstruction and generates smaller micro-operations that the hardwired units or control logic must perform.

3. The sequencer uses status information from the control logic or results from the hardwired units to determine the next microinstruction and stores its control store address in the CSAR. The previous microinstruction could also influence the sequencer’s decision about which microinstruction to select next.
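The three steps above can be sketched as a small interpreter loop. This is a minimal Python sketch under stated assumptions: the `MicroInstruction` record, the `UNITS` table, and the two-entry countdown microprogram are hypothetical and only illustrate how the CSAR, the MIR, and the sequencer interact.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, int]


@dataclass(frozen=True)
class MicroInstruction:
    micro_ops: Tuple[str, ...]                  # micro-operations this word triggers
    next_address: Callable[[State, int], int]   # sequencer rule: (status, CSAR) -> CSAR

    @property
    def is_vertical(self) -> bool:
        # One micro-operation controlling one resource; otherwise horizontal.
        return len(self.micro_ops) == 1


def run_microprogram(control_store: Sequence[MicroInstruction],
                     units: Dict[str, Callable[[State], None]],
                     state: State, csar: int = 0, halt: int = -1,
                     max_steps: int = 1000) -> List[int]:
    """Return the sequence of control store addresses that were executed."""
    trace = []
    for _ in range(max_steps):
        if csar == halt:
            break
        mir = control_store[csar]             # step 1: fetch into the MIR
        for op in mir.micro_ops:              # step 2: issue micro-operations
            units[op](state)
        trace.append(csar)
        csar = mir.next_address(state, csar)  # step 3: sequencer picks the next CSAR
    return trace


def decrement(state: State) -> None:
    state["count"] -= 1


def set_zero_flag(state: State) -> None:
    state["zero"] = int(state["count"] == 0)


# Hypothetical hardwired units and a two-word vertical microprogram (a countdown).
UNITS = {"decrement": decrement, "set_zero_flag": set_zero_flag}
CONTROL_STORE = [
    MicroInstruction(("decrement",), lambda s, a: a + 1),
    MicroInstruction(("set_zero_flag",), lambda s, a: -1 if s["zero"] else 0),
]
```

Running it with `state = {"count": 3, "zero": 0}` loops through addresses 0 and 1 until the status flag tells the sequencer to stop.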

As mentioned earlier, the MIR generates micro-operations depending on the microinstruction. If only one micro-operation is generated that controls a single hardwired resource, the microinstruction is called vertical. In all other cases, the microinstruction is called horizontal. Finally, the control store could provide permanent (fixed) storage to frequently used microprograms (a sequence of microinstructions) and temporary (pageable) storage for less-frequently-used microprograms.

In microcoded machines, not all instructions have or will have access to the control store. In fact, only emulated instructions have to go through the microcode logic; the hardware can directly execute all other instructions that the system has implemented, following path α in Figure 2. That is, a microcoded machine is a hybrid with emulated and hardwired instructions. Contrary to some beliefs, from the moment it was possible to implement instructions, microcoded machines always had a hardwired core that executed reduced-instruction-set computer (RISC) instructions.

Microcode advantages?

Why does an extremely successful mechanism appear to be in decline, and what is happening to the advantages of microcoded machines? To examine this question, we explore a simple example. We consider the requirements imposed by moving one or more characters (represented by bytes) from one memory location to another. Because the generic case is too complex and will not serve any purpose to our discussion, we further assume that at most 256 bytes are to be moved and that the source and destination addresses are always aligned at word boundaries. The generic case requires many more test and boundary conditions that will only increase the code sizes of the examples.

This example is by no means arbitrary, because it relates to the move character family of instructions, which includes the controversial complex-instruction-set computer (CISC) operation, move character long.

Figure 2. High-level microprogrammed computer. [Diagram labels: sequencer, CSAR, status, path α, main memory, control store, MIR, hardwired units and control logic, memory address register, memory data register.]

Figure 3 depicts three potential scenarios. The first shows no overlap between the string to be moved and the moved string. In this scenario, moving characters does not pose any problem.

In the second scenario, however, moving the first (then second, and so on) characters in a straightforward manner will overwrite the string to be moved. The third scenario also has an overlap, but because the starting point B is in front of A, it does not pose any problems when moving the characters in a straightforward manner.

The described move character operation is an application requirement and is what the programmer intended to do. Thus, it has no relation to any architectural paradigm. The intention of the move instruction is to actually perform a “true” move. That is, the programmer does not intend to replicate bytes using move instructions and overlap. There are several ways of performing such an operation with varying involvement of software and hardware.

Case 1. The first approach is to introduce a complex instruction and assume that the hardware somehow will execute it. The instruction has the format MOVE R1, R2, d, where R1 is the source address, R2 is the destination address, and d is the number of bytes to be moved.

Case 2. A second approach is to assume simple instructions and produce a program like the one shown in Table 2 [5].

Figure 3. Three scenarios of moving characters. [Diagram: scenarios I, II, and III, showing the relative positions of strings A and B.]

Table 2. Sample program for case 2.

Label           Instruction
start:          lw   r8, A
                lw   r9, B
                lw   r10, d
                beq  r8, r9, stop
                sub  r11, r9, r8
                bltz r11, no_overlap
                sub  r12, r11, r10
                bltz r12, overlap
no_overlap:     sub  r11, r10, 4
                bltz r11, last_bytes
                lw   r12, 0(r8)
                sw   r12, 0(r9)
                add  r8, r8, 4
                add  r9, r9, 4
                sub  r10, r10, 4
                jmp  no_overlap
last_bytes:     beq  r10, r0, stop
                lb   r12, 0(r8)
                sb   0(r9), r12
                add  r8, r8, 1
                add  r9, r9, 1
                sub  r10, r10, 1
                jmp  last_bytes
overlap:        add  r14, r8, r10
                add  r15, r9, r10
word_aligned?:  sll  r13, r10, 30
                beq  r13, r0, yes_aligned
                sub  r14, r14, 1
                sub  r15, r15, 1
                lb   r12, 0(r14)
                sb   0(r15), r12
                sub  r10, r10, 1
                beq  r10, r0, stop
                jmp  word_aligned?
yes_aligned:    sub  r14, r14, 4
                sub  r15, r15, 4
                lw   r12, 0(r14)
                sw   r12, 0(r15)
                sub  r10, r10, 4
                beq  r10, r0, stop
                jmp  yes_aligned
stop:
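The control flow of Table 2 can be restated at a higher level. The following Python sketch is an illustration only: the function name `move_characters` and the `bytearray` memory model are assumptions, and it copies byte by byte rather than using the word-sized moves of the assembly program. It keeps the same overlap test: copy backward only when the destination B starts inside the source string (scenario II), and copy forward otherwise (scenarios I and III).

```python
def move_characters(mem: bytearray, a: int, b: int, d: int) -> None:
    """Move d bytes from address a to address b inside mem as a 'true' move."""
    if a == b or d == 0:
        return                      # nothing to do (beq r8, r9, stop)
    if a < b < a + d:
        # Scenario II: the destination overlaps the tail of the source,
        # so copy from the last byte down to the first.
        for i in range(d - 1, -1, -1):
            mem[b + i] = mem[a + i]
    else:
        # Scenarios I and III: a forward copy never overwrites unread bytes.
        for i in range(d):
            mem[b + i] = mem[a + i]
```

For example, moving six bytes of `bytearray(b"abcdefghij")` from address 0 to address 2 takes the backward path and leaves the source string intact at its new location.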

Case 3. A third approach is to decide on an architecture that will allow combining multiple instructions and executing them in parallel [6, 7, 8]. You could then rewrite the move character program as in Table 3. The program assumes five instruction slots for different instructions: one slot for load/store instructions, two slots for integer instructions, one slot for floating-point instructions, and one slot for branch instructions. Furthermore, we assume that the arithmetic units produce their results in one cycle.

Case 4. If we find (using simulations) that in most of the cases (say 80 percent) a programmer is moving 8 bytes without overlap, then we can rewrite the move character program as in Table 4.

Case 5. With the same assumptions of the architecture in case 3, the program from case 4 can be rewritten as in Table 5.

The differences among all the cases are the hardware involvement, the software involvement, their complexity, performance, and cost (in terms of area). The trained eye will recognize that case 1 is what has been called CISC architecture. Cases 2 and 4 are RISC architectures with programming tricks to improve performance, and cases 3 and 5 are very long instruction word (VLIW) architectures with the same considerations as for the RISC architectures.

Table 3. Sample program for case 3.*

Label          Load/store slot    Integer slot 1      Integer slot 2      Branch slot
start:         lw r8, A
               lw r9, B
               lw r10, d          sub r11, r9, r8                         beq r8, r9, stop
                                  sub r11, r10, 4     sub r12, r11, r10   bltz r11, no_overlap
                                  add r14, r8, r10    add r15, r9, r10    bltz r12, overlap
no_overlap:    lw r12, 0(r8)                                              bltz r11, last_bytes
               sw r12, 0(r9)      add r8, r8, 4       add r9, r9, 4
                                  sub r10, r10, 4     sub r11, r10, 8     jmp no_overlap
last_bytes:    lb r12, 0(r8)                                              beq r10, r0, stop
               sb 0(r9), r12      add r8, r8, 1       add r9, r9, 1
                                  sub r10, r10, 1                         jmp last_bytes
overlap:                          sll r13, r10, 30
word_aligned?:                                                            beq r13, r0, yes_aligned
                                  sub r14, r14, 1     sub r15, r15, 1
               lb r12, 0(r14)     sub r10, r10, 1
               sb 0(r15), r12                                             beq r10, r0, stop
                                  sll r13, r10, 30                        jmp word_aligned?
yes_aligned:                      sub r14, r14, 4     sub r15, r15, 4
               lw r12, 0(r14)     sub r10, r10, 4
               sw r12, 0(r15)                                             beq r10, r0, stop
                                                                          jmp yes_aligned
stop:

* This program does not execute floating-point instructions; therefore, this table does not show the floating-point instruction slot.

Microcode design principles

What might not be so obvious is that cases 2 through 5 are microcoded sequences for the complex instruction that case 1 describes. To comprehend this, you need to understand how a microcoded machine is actually designed. The design principles of such a machine are as follows:

• Principle 1. We design (and always have designed) hardware that is simple, meaning it is implementable using current technologies and meeting certain constraints (such as cycle, cost, or area).

• Principle 2. We emulate and thus microcode the instructions that are nonimplementable or are not frequently used.

• Principle 3. We permanently cache the microcode (emulation code) of certain, frequently used, nonimplementable instructions (or instances thereof) in the control store to improve performance. We cache all other microcode corresponding to non-frequently-used nonimplementable instructions (or instances thereof) on demand with appropriate replacement policies.

• Principle 4. We write microcode to either control a single facility (vertical microcode) or control multiple facilities (horizontal microcode).

Going back to our cases, if we assume that we can implement all instructions in case 2 (principle 1), then the program in case 2 emulates case 1 and we can store it in the control store; thus, it is its microcode (principle 2). We can store it permanently or not, depending on simulation evaluations (principle 3), and because we assume that the program only executes one instruction at a time, we have a vertical microcode (principle 4). We can apply the same principles to case 3 with the addition that this case is a horizontal microcoded machine, because multiple facilities are controlled in a single cycle. Cases 4 and 5 are just different ways of performing the same emulation. In essence, they do move characters but are geared toward improving the performance of the most frequent cases. So they possibly split the microcode into permanently and nonpermanently stored emulation code to compact the control store memory. To put things in perspective, by eliminating instructions that require emulation and by exposing vertical microcode to the programmer, we have a RISC architecture. When the exposure involves horizontal microcode, we have a VLIW architecture.
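Principle 3 amounts to a two-tier lookup, which the following Python sketch illustrates. Everything here is an assumption made for illustration: the `ControlStore` class, the instruction names, and in particular the least-recently-used eviction, since the text only asks for “appropriate replacement policies” without naming one.

```python
from collections import OrderedDict
from typing import Dict, List


class ControlStore:
    """Fixed storage for frequent microprograms, pageable storage for the rest."""

    def __init__(self, fixed: Dict[str, List[str]],
                 main_memory: Dict[str, List[str]], pageable_slots: int) -> None:
        if pageable_slots < 1:
            raise ValueError("need at least one pageable slot")
        self.fixed = dict(fixed)            # permanently cached microcode
        self.main_memory = main_memory      # backing store for all other microcode
        self.pageable: "OrderedDict[str, List[str]]" = OrderedDict()
        self.slots = pageable_slots
        self.misses = 0

    def fetch(self, instruction: str) -> List[str]:
        """Return the microprogram that emulates the given instruction."""
        if instruction in self.fixed:
            return self.fixed[instruction]
        if instruction in self.pageable:
            self.pageable.move_to_end(instruction)   # mark as recently used
            return self.pageable[instruction]
        self.misses += 1                             # load on demand
        if len(self.pageable) >= self.slots:
            self.pageable.popitem(last=False)        # evict least recently used
        self.pageable[instruction] = self.main_memory[instruction]
        return self.pageable[instruction]
```

A small usage example: with two pageable slots, fetching `sqrt`, `translate`, `sqrt`, and then `edit` evicts `translate`, while fetches of a fixed entry never count as misses.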

Table 4. Sample program for case 4.

Label           Instruction
start:          lw   r8, A
                lw   r9, B
                lw   r10, d
                beq  r8, r9, stop
                sub  r11, r9, r8
                bltz r11, no_overlap
                sub  r12, r11, r10
                bltz r12, overlap
no_overlap:     sub  r11, r10, 8
                beq  r11, r0, 8bytes
no_overlap2:    sub  r11, r10, 4
                bltz r11, last_bytes
                lw   r12, 0(r8)
                sw   r12, 0(r9)
                add  r8, r8, 4
                add  r9, r9, 4
                sub  r10, r10, 4
                jmp  no_overlap2
last_bytes:     beq  r10, r0, stop
                lb   r12, 0(r8)
                sb   0(r9), r12
                add  r8, r8, 1
                add  r9, r9, 1
                sub  r10, r10, 1
                jmp  last_bytes
overlap:        add  r14, r8, r10
                add  r15, r9, r10
word_aligned?:  sll  r13, r10, 30
                beq  r13, r0, yes_aligned
                sub  r14, r14, 1
                sub  r15, r15, 1
                lb   r12, 0(r14)
                sb   0(r15), r12
                sub  r10, r10, 1
                beq  r10, r0, stop
                jmp  word_aligned?
yes_aligned:    sub  r14, r14, 4
                sub  r15, r15, 4
                lw   r12, 0(r14)
                sw   r12, 0(r15)
                sub  r10, r10, 4
                beq  r10, r0, stop
                jmp  yes_aligned
8bytes:         lw   r12, 0(r8)
                sw   r12, 0(r9)
                lw   r12, 4(r8)
                sw   r12, 4(r9)
stop:

We are ready to consider the developments leading to the proposal that eliminates instructions requiring emulation (at least from the processor’s point of view). First of all, technological improvements allow more complex instructions to be implemented in hardware. Examples are multiply, divide, and square root. All these operations were microcoded in one time period or another. Consequently, any instruction that we can implement in hardware does not need emulation; thus, we have no need for microcode. The implication here is that, to a certain extent, improvements in technology have diminished microcode’s effectiveness. Furthermore, programmers did not frequently use instructions that were complex, had side effects, or were difficult to comprehend, making their elimination from the architecture desirable and thus further diminishing microcode’s appeal.

The advent of caching was the second strike against microcode’s effectiveness for general-purpose computers. We have added caches to the processor to improve the performance of memory accesses. The bottom line is that a sequence of instructions accomplishing a complex task (captured by a CISC instruction as in case 1) that resides frequently (almost permanently) in caches requiring few access cycles does not need to reside in the control store. Thus, the CISC instruction needs no emulation by a microcoded machine. Instead, a sequence of simple instructions (no microcode) can emulate the complex task.

Table 5. Sample program for case 5.*

Label          Load/store slot    Integer slot 1      Integer slot 2      Branch slot
start:         lw r8, A
               lw r9, B
               lw r10, d          sub r11, r9, r8                         beq r8, r9, stop
                                  sub r11, r10, 4     sub r12, r11, r10   bltz r11, no_overlap
                                  add r14, r8, r10    add r15, r9, r10    bltz r12, overlap
no_overlap:                       sub r16, r10, 8
               lw r12, 0(r8)                                              beq r16, r0, 8bytes
no_overlap2:   lw r12, 0(r8)                                              bltz r11, last_bytes
               sw r12, 0(r9)      add r8, r8, 4       add r9, r9, 4
                                  sub r10, r10, 4     sub r11, r10, 8     jmp no_overlap2
last_bytes:    lb r12, 0(r8)                                              beq r10, r0, stop
               sb 0(r9), r12      add r8, r8, 1       add r9, r9, 1
                                  sub r10, r10, 1                         jmp last_bytes
overlap:                          sll r13, r10, 30
word_aligned?:                                                            beq r13, r0, yes_aligned
                                  sub r14, r14, 1     sub r15, r15, 1
               lb r12, 0(r14)     sub r10, r10, 1
               sb 0(r15), r12                                             beq r10, r0, stop
                                  sll r13, r10, 30                        jmp word_aligned?
yes_aligned:                      sub r14, r14, 4     sub r15, r15, 4
               lw r12, 0(r14)     sub r10, r10, 4
               sw r12, 0(r15)                                             beq r10, r0, stop
                                                                          jmp yes_aligned
8bytes:        sw r12, 0(r9)
               lw r12, 4(r8)
               sw r12, 4(r9)
stop:

* This program does not execute floating-point instructions; therefore, this table does not show the floating-point instruction slot.

Is microcode a thing of the past?

At first glance, it appears that microcode is on its way out. Maybe so. However, some circumstances may indicate otherwise. First, while technologies allow increasingly more instructions to be implementable, several instructions (such as start I/O instructions) cannot possibly be implemented in a hardwired mode. Designers have implemented I/O instructions, for example, using multiple levels of emulation (that is, using multiple levels of microcode, frequently called picocode). As far as we can see, machines that require I/O processors or processing mostly keep microcode in substance, whether or not they keep the name.

The second reason relates to the rise of custom computing machines. Such machines compute complex tasks and change their hardware via reconfiguration. Given that reconfiguration is a complex function that executes by using a “long” program, special caching is necessary to reduce the reconfiguration times, reintroducing the microcode concept. Because reconfigurations are complex behaviors, covering all of them with one hardware implementation is not possible. Therefore, emulation (and thus microcode) is needed (principle 2). Furthermore, caching as we know it today may not be capable of holding large and frequent reconfigurations until they are reused. The implication here is the reintroduction of special storage and techniques to hold reconfigurations, resembling a control store, possibly with modifications to accommodate special reconfiguration needs.

We could achieve this by pointing to emulation code (microcode) that either resides on chip (frequently used) or resides in main memory and is loaded, when needed, into caches that act like a control store (less frequently used); this is principle 3. Because of the configurable nature of the hardware, we can configure it to contain many subunits (facilities) to optimally perform the functions by exploiting parallelism. In this case, depending on the available parallelism, we can use either vertical or horizontal microcode (principle 4). Such microcode constructions have begun to appear. In this article, we consider an augmented general-purpose paradigm with reconfigurable microcode [9]. Figure 4 depicts the organization of the Molen reconfigurable processor.

The Molen processor fetches instructions from the main memory and stores them in the instruction fetch unit. The arbiter fetches instructions from the instruction fetch unit and performs a partial decoding on the instructions to determine where they should be issued. Instructions implemented in fixed hardware are issued to the core processor. The core processor further decodes the instructions and issues them to their corresponding functional units. The (core) processor fetches the source data from the general-purpose registers (GPRs) and writes the results back to the same GPRs.

Figure 4. Reconfigurable microcoded Molen processor. [Diagram labels: main memory, instruction fetch, data fetch, memory multiplexer, arbiter, core processor, register file, XRegs file, and the reconfigurable unit containing the ρµ-code unit and the CCU.]

The reconfigurable unit consists of a custom configured unit (CCU), possibly implemented by using a field-programmable gate array (FPGA), and the ρµ-code unit that encompasses the functionality of the sequencer and control store as depicted in Figure 2. The reconfigurable unit is able to perform operations that can be as simple as a single instruction or as complex as a piece of application code that describes a certain function. Therefore, the processor divides an operation that the reconfigurable unit executes into two distinct process phases: set and execute. The set phase is responsible for reconfiguring the CCU hardware, enabling the execution of the operation. We may subdivide the set phase into two subphases: partial set (p-set) and complete set (c-set). We envision the p-set phase covering common functions of an application or set of applications. More specifically, in the p-set phase, the CCU is partially configured to perform these common functions. Although the system can possibly perform the p-set subphase during program loading or even at chip fabrication time, it performs the c-set subphase during program execution. Furthermore, the c-set subphase only partially reconfigures remaining blocks in the CCU (not covered in the p-set subphase) to complete the CCU’s functionality by enabling it to perform other less-frequent functions. We use the exchange registers (XRegs) file to provide a general methodology to pass arguments to and results from the reconfigurable unit.
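The set and execute phases can be illustrated with a toy software model. This Python sketch is not the Molen implementation: the `ReconfigurableUnit` class, its method names, the eight-entry XRegs list, and the use of Python callables to stand in for CCU configurations are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Sequence


class ReconfigurableUnit:
    """Toy model: the CCU holds configured operations, XRegs carry arguments."""

    def __init__(self, num_xregs: int = 8) -> None:
        self.ccu: Dict[str, Callable[..., Any]] = {}   # currently configured functions
        self.xregs: List[Any] = [0] * num_xregs        # exchange registers

    def p_set(self, common_ops: Dict[str, Callable[..., Any]]) -> None:
        """Partial set: configure common functions, e.g. at program load time."""
        self.ccu.update(common_ops)

    def c_set(self, name: str, impl: Callable[..., Any]) -> None:
        """Complete set: configure a remaining, less frequent function at run time."""
        self.ccu[name] = impl

    def execute(self, name: str, arg_regs: Sequence[int], result_reg: int) -> None:
        """Execute phase: read arguments from XRegs, write the result back."""
        if name not in self.ccu:
            raise RuntimeError(f"operation {name!r} has not been set")
        args = [self.xregs[i] for i in arg_regs]
        self.xregs[result_reg] = self.ccu[name](*args)


def sad(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute differences of two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

In this model, an operation must be set before it is executed: calling `execute("sad", ...)` before `p_set` or `c_set` has configured it raises an error, mirroring the requirement that the set phase precedes the execute phase.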

Finally, we performed simulations of the Molen processor using two well-known multimedia benchmarks (ijpeg and mpeg2enc) and the SimpleScalar toolset (v2.0) [10]. We could configure the CCU to implementations that perform the discrete cosine transform (DCT), the Huffman coding (a variable length coding technique denoted by VLC), or the calculation of the sum of absolute differences (SAD) [11]. In the simulations, we assume that an FPGA structure (an Altera APEX20K) operating at a considerably lower core frequency (indicated by synthesis software FPGA Express from Synopsys) augments a 1-GHz general-purpose processor. Because of this, we normalized the cycles to perform DCT, VLC, and SAD to reflect this discrepancy [9]. As Table 6 shows, the simulation results have shown an average overall decrease in cycles of about 30 percent.
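The percentages in Table 6 follow directly from the cycle counts. The short Python sketch below recomputes them for the rows in which both accelerated functions are used; it assumes, as an interpretation rather than something the text states, that the quoted figure of about 30 percent is the mean of those three combined-configuration reductions.

```python
# Cycle counts copied from Table 6 (default run and combined configurations).
DEFAULT = {"testimg": 6_512_947, "vigo": 149_363_055, "testframes": 93_245_923}
COMBINED = {"testimg": 4_505_811, "vigo": 106_157_895, "testframes": 62_928_429}


def reduction_percent(default_cycles: int, accelerated_cycles: int) -> float:
    """Relative decrease in cycles, expressed as a percentage."""
    return 100.0 * (default_cycles - accelerated_cycles) / default_cycles


reductions = {name: reduction_percent(DEFAULT[name], COMBINED[name])
              for name in DEFAULT}
average = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.2f}% fewer cycles")
print(f"average: {average:.2f}%")
```

The same function applied to the single-function rows reproduces the other percentages in the table.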

We have shown two scenarios that indicate microcode’s current usefulness. We believe that microcode use will continue in the future, especially with the increasing use and acceptance of reconfigurable computing in many new application areas, such as network processing. In this light, the challenge lies in semiautomatically determining the most adequate design parameters, such as the size of the fixed and pageable storage within the control store and the microinstruction width, to meet the requirements posed by new application areas. Another challenge lies in defining a common reconfigurable microcode that we can use to configure and to execute on various reconfigurable hardware implementations. In conclusion, the introduction of microcode use in custom computing machines is opening up new research and engineering challenges.

Table 6. Media results of the Molen processor from simulations (ML = execution cycles).

                       ijpeg encoder,           ijpeg encoder,           mpeg2enc,
Input data             picture testimg          picture vigo             frames testframes
                       (no. of cycles)          (no. of cycles)          (no. of cycles)
Default cycles         6,512,947                149,363,055              93,245,923
DCT, ML = 282          4,680,431 (–28.14%)      109,815,770 (–26.48%)    81,560,420 (–12.53%)
VLC, ML = 200          6,094,054 (–6.43%)       140,486,670 (–5.94%)
SAD, ML = 234                                                            74,608,673 (–19.99%)
DCT(282) + VLC(200)    4,505,811 (–30.82%)      106,157,895 (–28.93%)
DCT(282) + SAD(234)                                                      62,928,429 (–32.51%)

References

1. M.V. Wilkes, “The Best Way to Design an Automatic Calculating Machine,” Proc. Manchester Univ. Computer Inaugural Conf., Ferranti Ltd., 1951, pp. 16-18.

2. A. Padegs et al., “The IBM System/370 Vector Architecture: Design Considerations,” IEEE Trans. Computers, vol. 37, no. 5, May 1988, pp. 509-520.

3. G. Triantafyllos, S. Vassiliadis, and J. Delgado-Frias, “Software Metrics and Microcode Development: A Case Study,” J. Software Maintenance: Research and Practice, vol. 8, no. 3, May-June 1996, pp. 199-224.

4. G. Tomlinson and P. Adams, “Microprogramming: A Tutorial and Survey of Recent Developments,” IEEE Trans. Computers, vol. 29, no. 1, Jan. 1980, pp. 2-20.

5. G. Kane and J. Heinrich, MIPS RISC Architecture, Prentice Hall, Englewood Cliffs, 1992.

6. S. Vassiliadis, B. Blaner, and R. Eickemeyer, “SCISM: A Scalable Compound Instruction Set Machine,” IBM J. Research and Development, vol. 38, no. 1, Jan. 1994, pp. 59-78.

7. TriMedia32 Architecture (preliminary specification), TriMedia Technologies, 2000.

8. Intel IA64 Architecture: Software Developer’s Manual, Intel Corp., 2000.

9. S. Vassiliadis, S. Wong, and S. Cotofana, “The MOLEN ρµ-coded Processor,” Proc. 11th Int’l Conf. Field-Programmable Logic and Applications (FPL 2001), Lecture Notes in Computer Science, vol. 2147, Springer-Verlag, 2001, pp. 275-285.

10. D.C. Burger and T.M. Austin, The SimpleScalar Tool Set, Version 2.0, tech. report CS-TR-1997-1342, Univ. of Wisconsin-Madison, 1997.

11. S. Vassiliadis et al., “The Sum-Absolute-Difference Motion Estimation Accelerator,” Proc. 24th Euromicro Conf., 1998, pp. 158-163.

Stamatis Vassiliadis is the chair professor of the Computer Engineering Laboratory in the Electrical Engineering Department at Delft University of Technology, Delft, the Netherlands. His research interests include embedded systems, multimedia processors, computer architecture, hardware design of computer systems, custom computing machines, hardware/software co-design, computer arithmetic, low-power design, nanocomputing, and network processing. Vassiliadis has a PhD in electrical engineering from Politecnico di Milano, Italy. He is an IEEE Fellow and a member of the IEEE Computer Society.

Stephan Wong is an assistant professor with the Computer Engineering Laboratory in the Electrical Engineering Department at Delft University of Technology, Delft, the Netherlands. His research interests include embedded systems, multimedia processors, complex instruction set architectures, reconfigurable and parallel processing, microcoded machines, and network processors. Wong has a PhD in computer engineering from Delft University of Technology, Delft, the Netherlands.

Sorin Cotofana is an associate professor with the Computer Engineering Laboratory in the Electrical Engineering Department at Delft University of Technology, Delft, the Netherlands. His research interests include computer arithmetic, parallel architectures, embedded systems, neural networks, fuzzy logic, computational geometry, and computer-aided design. Cotofana has a PhD in electrical engineering from Delft University of Technology, the Netherlands. He is a senior member of the IEEE Computer Society.

Direct questions and comments about this article to Stephan Wong, Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, Delft, the Netherlands; [email protected].
