PipeRench: A Reconfigurable Architecture and Compiler

Seth C. Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R. Reed Taylor
Carnegie Mellon University

Recommended citation: Goldstein, Seth C.; Schmit, Herman; Budiu, Mihai; Cadambi, Srihari; Moe, Matt; and Taylor, R. Reed, "PipeRench: A Reconfigurable Architecture and Compiler" (2000). Carnegie Mellon University Research Showcase, Computer Science Department. Paper 790. http://repository.cmu.edu/compsci/790



0018-9162/00/$10.00 © 2000 IEEE

PipeRench: A Reconfigurable Architecture and Compiler

Highly specialized embedded computer systems abound, and workloads for computing devices are rapidly changing. General-purpose processors are struggling to efficiently meet these applications' disparate needs, and custom hardware is rarely feasible. The time is ripe for reconfigurable computing, which combines the flexibility of general-purpose processors with the efficiency of custom hardware. PipeRench, a new architecture for reconfigurable computing, and its associated compiler do just that. Combined with a traditional digital signal processor, microcontroller, or general-purpose processor, PipeRench can support a system's various computing needs without requiring custom hardware.

PipeRench is a reconfigurable fabric, an interconnected network of configurable logic and storage elements. By virtualizing the hardware, PipeRench overcomes the disadvantages of using field-programmable gate arrays as reconfigurable computing fabrics. Unlike FPGAs, PipeRench is designed to efficiently handle computations. Using a technique called pipeline reconfiguration, PipeRench improves compilation time, reconfiguration time, and forward compatibility. PipeRench's architectural parameters (including logic block granularity) optimize the performance of a suite of kernels, balancing the compiler's needs against the constraints of deep-submicron process technology.

PipeRench is particularly suitable for stream-based media applications or any applications that rely on simple, regular computations on large sets of small data elements. Conventional processors lag in efficiency because of forced serialization of intrinsically parallel operations; wasted space (small data elements do not use the processor's wide data path); and excessive instruction bandwidth for regular, dataflow-dominated computations on large data sets. A system using a reconfigurable fabric such as PipeRench can

• increase flexibility,
• decrease design time,
• extend system life,
• reduce part count,
• reduce development and maintenance costs, and
• reduce chip fabrication costs through defect tolerance.

PROBLEMS WITH COMMERCIAL FPGAS

Initial performance results with FPGAs were impressive. However, commercial FPGAs have inherent shortcomings, which heretofore made reconfigurable computing impractical for mainstream computing:

• Logic granularity. FPGAs are designed for logic replacement. The functional units' granularity is optimized to replace random logic, not to perform multimedia computations.

Reconfigurable computing will change the way computing systems are designed, built, and used. PipeRench, a new reconfigurable fabric, combines the flexibility of general-purpose processors with the efficiency of customized hardware to achieve extreme performance speedup.

Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R. Reed Taylor, Carnegie Mellon University

COVER FEATURE

• Configuration time. The time to load a configuration in the fabric ranges from hundreds of microseconds to hundreds of milliseconds. For FPGAs to improve processing speed over that of a general-purpose processor, they must amortize this start-up latency over huge data sets, limiting their applicability.

• Forward compatibility. FPGAs require redesign or recompilation to benefit from future chip generations.

• Hard constraints. FPGAs can implement only kernels of a fixed and relatively small size. This size restriction makes compilation difficult and causes large, unpredictable discontinuities between kernel size and performance.

• Compilation time. A kernel's synthesis, placement, and routing design phases take hundreds of times longer than the compilation of the same kernel for a general-purpose processor.

PipeRench overcomes these problems and maintains outstanding performance by virtualizing the hardware.

PIPELINED RECONFIGURATION

A configuration's static nature causes two significant problems: A computation may require more hardware than is available, and a single hardware design cannot exploit the additional resources that will inevitably become available in future process generations. A technique called pipelined reconfiguration implements a large logical configuration on a small piece of hardware through rapid reconfiguration of that hardware.1

With this technique, the compiler is no longer responsible for satisfying fixed hardware constraints. In addition, a design's performance improves in proportion to the amount of hardware allocated to that design.

Pipelined reconfiguration involves virtualizing pipelined computations by breaking a single static configuration into pieces that correspond to pipeline stages in the application. Each pipeline stage is loaded, one per cycle, into the fabric. This makes performing the computation possible, even if the entire configuration is never present in the fabric at one time.

Figure 1 illustrates the virtualization process, showing a five-stage pipeline virtualized on a three-stage fabric. Figure 1a shows the five-stage application and each pipeline stage's state in seven consecutive cycles. Figure 1b shows the state of the physical stages in the fabric as it executes this application. In this example, virtual pipe stage 1 is configured in cycle 1 and ready to execute in the next cycle; it executes for two cycles. There is no physical pipe stage 4; therefore, in cycle 4, the fourth virtual pipe stage is configured in physical pipe stage 1, replacing the first virtual stage. Once the pipeline is full, every five cycles generates two results for two consecutive cycles. For example, cycles 2, 3, 7, 8, … consume inputs; and cycles 6, 7, 11, 12, … generate outputs.

In general, when PipeRench virtualizes an application with v virtual stages on a device with a capacity of p physical stages (p < v), the implementation's throughput is proportional to (p − 1)/v. Throughput is a linear function of the device's capacity. Therefore, performance improves with new chip generations not only because of increases in clock frequency but also because of decreases in transistor size, without any redesign, until p = v. Thereafter, an application's performance continues to improve only through increased clock frequency.

Because some stages are configured while others are executed, reconfiguration does not decrease performance. As the pipeline fills with data, the system configures stages of the computation before that data. In fact, even if there is no virtualization, configuration time is equivalent to the application's pipeline fill time and so does not reduce the application's maximum throughput. A successful pipelined reconfiguration should configure a physical pipe stage in one cycle. To achieve this, we connected a wide on-chip configuration buffer to the physical fabric. A small controller manages the configuration process.

Virtualization through pipelined reconfiguration imposes some constraints on the kinds of computations that can be implemented on the fabric. The most restrictive is that the state in any pipeline stage must be a function of only that stage's and the previous stage's current state. In other words, cyclic dependencies must fit within one pipeline stage. Therefore, we allow direct connections only between consecutive stages. However, we do allow virtual connections between distant stages.


Figure 1. Pipeline reconfiguration showing the virtualization of a five-stage pipeline on a three-stage device: (a) the virtual pipeline stage and (b) the physical pipeline stage (the numbers in the ovals refer to the virtual pipeline stage; each stage is shown as configuring or executing across cycles 1 through 7).


THE PIPERENCH ARCHITECTURE

In its current implementation, PipeRench is an attached processor. Figure 2a is an abstract view of the PipeRench architectural class, and Figure 2b is a more detailed view of a processing element (PE). PipeRench contains a set of physical pipeline stages called stripes (see the "How a Reconfigurable System Handles Computations" sidebar). Each stripe has an interconnection network and a set of PEs.

Each PE contains an arithmetic logic unit and a pass register file. Each ALU contains lookup tables (LUTs) and extra circuitry for carry chains, zero detection, and so on. Designers implement combinational logic using a set of N B-bit-wide ALUs. The ALU operation is static while a particular virtual stripe resides in a physical stripe. Designers can cascade the carry lines of PipeRench's ALUs to construct wider ALUs, and chain PEs together via the interconnection network to build complex combinational functions.

Through the interconnection network, PEs can access operands from registered outputs of the previous stripe, as well as registered or unregistered outputs of the other PEs in the same stripe. Because of hardware virtualization constraints, no buses can connect consecutive stripes. However, the PEs access global I/O buses. These buses are necessary because an application's pipeline stages may physically reside in any of the fabric's stripes. Inputs to and outputs from the application must use a global bus to get to their destination.

The pass register file provides efficient pipelined interstripe connections. A program can write the ALU's output to any of the P registers in the pass register file. If the ALU does not write to a particular register, that register's value will come from the value in the previous stripe's corresponding pass register.

The pass register file provides a pipelined interconnection from a PE in one stripe to the corresponding PE in subsequent stripes. For data values to move laterally within a stripe, they must use the interconnection network (see Figure 2a). In each stripe, the interconnection network accepts inputs from each PE in that stripe, plus one of the register values from each register file in the previous stripe. Moreover, a barrel shifter in each PE shifts its inputs B − 1 bits to the left (see Figure 2b). Thus, PipeRench can handle the data alignments necessary for word-based arithmetic.

PIPERENCH COMPILER

Reconfigurable computing's success depends not only on good hardware fabrics but also on compilers that can quickly create efficient configurations. Taking advantage of PipeRench's hardware virtualization, we have developed a compiler that trades off configuration size for compilation speed. Thus, fitting the design into a limited physical area is not a primary concern.

We parameterized the compiler to facilitate experimenting with different architectures. The compiler begins by reading a description of the architecture. This description includes the number of PEs per stripe, each PE's bit width, the number of pass registers per PE, the interconnection topology, PE delay characteristics, and so on.

Figure 2. PipeRench architecture: (a) each stripe contains processing elements (PEs) and an interconnection; (b) a detailed view of a PE and its connections.

The source language is a dataflow intermediate language, DIL. A single-assignment language with C operators, DIL allows, but doesn't require, programmers to specify the bit width of variables rather than defaulting to a standard width for a given data type. DIL can manipulate arbitrary-width integer values and (unless directed explicitly to do otherwise) automatically infers bit widths, thereby preventing any information loss due to overflow or conversions. Aside from this one exception, DIL hides all notions of hardware resources, timing, and physical layout from programmers.

After parsing, the compiler inlines all modules (places each function's definition at that function's call site), unrolls all loops, and generates a straight-line, single-assignment program. Then the bit-value inference pass computes the minimum width required for each wire (and implicitly the amount of logic required for computations). Additionally, when the compiler determines that any of a wire's bits are constant, it uses those constants to simplify the program.2

How a Reconfigurable System Handles Computations

A reconfigurable system partitions computations between the fabric and the system's other execution units. The fabric does reconfigurable computations; the system's other execution units (for example, processors) do system computations. The greatest benefit arises when the fabric implements the main computation's computationally intensive portions as a pipelined data path.

The system performs reconfigurable computations by configuring the fabric to implement a circuit customized for each particular reconfigurable computation. The compiler embeds computations in a single static configuration rather than an instruction sequence, reducing instruction bandwidth and control overhead. Because the circuit is customized for the computation at hand, function units are sized properly, and the system can realize all statically detectable parallelism (up to the fabric's size limit).

Reconfigurable computations

A reconfigurable fabric can outperform a general-purpose processor for computations that

• operate on bit widths that differ from the processor's basic word size,
• have data dependencies that let multiple function units operate in parallel,
• contain basic operations that can combine into a single specialized operation,
• can be pipelined,
• enable constant propagation to reduce operation complexity, or
• reuse the input values many times.

Reconfigurable fabrics give the computational data path more flexibility. However, their utility and applicability depend on the interaction between reconfigurable and system computations, the interface between the fabric and the system, and the way a configuration loads onto the fabric.

We divide reconfigurable computations into two categories: Stream-based functions process large, regular data input streams, potentially produce a large data output stream, and have little control interaction with the rest of the computation. Custom instructions take a few inputs and produce a few outputs, execute intermittently, and have tight control interactions with the other processes. Thus, stream-based functions are suitable for a system where the reconfigurable fabric is not directly coupled to the processor, whereas custom instructions are usually beneficial only when the fabric is closely coupled to the processor.

A highly pipelineable streaming function

A reconfigurable fabric is most effective when implementing entire pipelines from applications. A classic example is a finite-impulse response filter. A streaming function, the FIR filter samples an input stream every cycle to convert it into an output stream. Figure A shows the filter's C code and hardware implementation.

    for (int i = 0; i < maxInput; i++) {
        y[i] = 0;
        for (int j = 0; j < Taps; j++)
            y[i] = y[i] + x[i+j] * w[j];
    }

Figure A. A finite-impulse response filter: (1) the C code; (2) the hardware implementation of a three-tap FIR filter (constant multipliers *W0, *W1, *W2 feeding an adder chain from input Xin to output Yout).

The filter's hardware circuit combines bit-value analysis, software pipelining, and retiming techniques. Inputs to the filter are usually between 8 and 16 bits, underutilizing a typical processor's function units. The reconfigurable fabric, on the other hand, constructs adders and multipliers of the proper width. In fact, because of data accumulation, the first adder is smaller than the last. The filter can also exploit tremendous parallelism, in the best case obtaining a new result every cycle through pipelining. When we map the filter to a reconfigurable fabric, we implement the general-purpose multipliers as constant multipliers, where the constants are the tap weights (Wi values). This system requires less hardware and fewer cycles than a general-purpose multiplier. Furthermore, all the data movement is between short local wires.

After the compiler determines each operator's size, the operator decomposition pass decomposes high-level operators (for example, multiplies become shifts and adds) and decomposes operators that exceed the target cycle time. For example, to avoid carry-chain delays, this pass splits any wide adder (for example, a 45-bit adder) into several smaller adders. This decomposition must also create new operators that handle the routing of the carry bits between the partial sums. Such rather "naïve" decomposition often introduces inefficiencies. Therefore, an operator recomposition pass uses pattern matching to find subgraphs that it can map to parameterized modules. These modules take advantage of architecture-specific routing and PE capabilities to produce a more efficient set of operators.

The place-and-route algorithm is the key to the compiler's speed.3 Unlike many place-and-route algorithms, ours is a deterministic, linear-time, greedy algorithm. It runs between two and three orders of magnitude faster than commercial tools and yields configurations with a comparable number of bit operations. For example, our compiler can handle a 1D discrete cosine transform in 2.4 seconds, but the Synopsys Design Compiler with the Xilinx Design Manager takes 75 minutes. The Xilinx configuration has four times fewer bit operations. However, our focus is compiling data paths, and hardware virtualization lets us sacrifice some efficiency for compilation speed.

EVALUATING PIPERENCH'S PERFORMANCE

Using a compiler and CAD tools, we conducted a study to optimize PipeRench's stripe width N, PE word width B, and number of pass registers per PE P. One of the kernels, the International Data Encryption Algorithm (IDEA), illustrates how dynamic compilation can increase performance. In this example, the reconfigurable system actually outperforms existing custom hardware.

Kernels

To evaluate PipeRench's performance, we chose nine kernels on the basis of their importance, recognition as industry performance benchmarks, and ability to fit into our computational model:4

• Automatic target recognition (ATR) implements the shape-sum kernel of the Sandia algorithm for automatic target recognition.
• Cordic implements the Honeywell timing benchmark for Cordic vector rotations.
• DCT is a 1D, 8-point discrete cosine transform.
• DCT-2D is a 2D discrete cosine transform.
• FIR is a finite-impulse response filter with 20 taps and 8-bit coefficients.
• IDEA implements a complete 8-round International Data Encryption Algorithm.
• Nqueens is an evaluator for the N queens problem on an 8 × 8 board.
• Over implements the Porter-Duff over operator.
• PopCount is a custom instruction implementing a population count instruction.

Methodology

We used CAD tools to synthesize a stripe on the basis of parameters N, B, and P in a 0.25-micron five-metal-layer process. We joined this automatically synthesized layout with a custom layout for the interconnection. Using the final layout, we determined the number of physical stripes that can fit in our silicon budget of 50 mm² (5 mm by 10 mm). We also determined the delay characteristics of the stripe's components, for example, LUTs, carry chains, and interconnections. (We reserve another 50 mm² for memory to store virtual stripes, control, and chip I/O.) The compiler uses the delay characteristics and the number of registers to create configurations for each architectural instance, yielding a design of a certain number of stripes at a particular frequency. We then determined the kernel's overall speed in terms of throughput, for each architectural instance.

We wrote the kernels in DIL. The DIL compiler automatically synthesizes, places, and routes the design for each parameterized architecture.

Fabric

Consider the region of space bounded by PE bit widths B of 2, 4, 8, 16, and 32; stripe widths (N × B) from 64 to 256 bits; and number of registers P equaling 2, 4, 8, and 16. Two main constraints determine which parameters generate realizable fabrics: the stripe width and the number of vertical wires that must pass over each stripe. Stripe width depends on the number and size of PEs, as well as the number of registers allocated to each PE. The number of vertical wires depends on the number and width of global buses, pass registers, and configuration bits.

Figure 3. The harmonic mean of throughput for all kernels (each kernel uses up to 8 registers) as a function of stripe width and PE bit width B. The vertical axis shows harmonic mean of throughput in million inputs per second (0 to 25); the horizontal axis shows stripe width from 64 to 256 bits; curves are plotted for B = 2, 4, 8, 16, and 32.

As the number of registers increases, computational density (the number of bit operations per second executed per unit area) decreases. However, pass registers are an important component of the interstripe interconnection. Reducing their number increases routing pressure, which decreases stripe utilization. The best balance between computational density and stripe utilization typically comes from using eight registers (P equals 8).

Figure 3 shows the harmonic mean of the throughput for all the kernels on a 100-MHz PipeRench for various stripe widths and PE sizes. Although wider PE sizes create fabrics with higher computational density, the kernels' natural data sizes are smaller, causing 32-bit PEs to be underused. On the other end of the spectrum, 2-bit PEs are not competitive, because of the increased time for arithmetic operations, the lack of raw computational density, and the increased number of configuration bits needed per application.

Examining each kernel's individual performance, we find that the kernels' characteristics influence which parameters are best. The 8-bit PEs strike a balance between utilization and carry-chain delay. For many kernels, enough parallelism is available to keep the wider stripes busy, but these stripes have fewer registers. This means more stripes in the implementation and, therefore, lower overall throughput. Therefore, we built PipeRench as a 128-bit-wide fabric having 8-bit PEs with eight registers each.

Performance and forward compatibility

We compared PipeRench's performance with the performance of an Ultrasparc II running at 300 MHz. Figure 4a shows the raw speedup for each kernel. Although achieving this performance can be difficult with the reconfigurable system connected through the I/O bus, most of the raw speedup is obtainable. In fact, the I/O bandwidth required is inversely proportional to the amount of hardware virtualization that occurs. If bandwidth continues to scale as feature sizes shrink, I/O bandwidth is not an issue. For example, IDEA has a virtualization factor of around 12, which means on average it requires only one input every 12 cycles (that is, fewer than 8 bits per cycle).

As process technology improves, configurations will run faster because of both the increased number of stripes that will fit in a given area and increased clock speeds. Figure 4b shows how each kernel's performance should scale (without recompilation) for process generations through 70 nanometers. This scaling is based on the Semiconductor Industry Association road map for both the number of transistors available per unit area and the clock speed. All kernels do not scale in parallel in the same way, because each kernel has a different number of virtual stripes. When a kernel fits completely in the physical fabric, it can no longer benefit from the increased number of stripes, without recompilation. These scaling effects are conservative, as they use the low-volume application-specific integrated circuit (ASIC) transistor counts and clock speeds.

Performance in embedded systems

One challenge for reconfigurable computing in embedded systems is that performance depends on creating custom configurations that can incorporate what is typically considered runtime data. For example, IDEA's performance on PipeRench depends on generating a key-specific configuration (one that can encrypt or decrypt only on the basis of a specific key). The compiler can create a complete, single-key optimized IDEA configuration in less than one minute. However, this is still too long for an embedded system, which must be able to work with any key. We can solve this problem by generating a dynamic configuration.

Figure 4. Speedup for nine kernels (ATR, Cordic, DCT, DCT-2D, FIR, IDEA, Nqueens, Over, and PopCount): (a) raw speedup of a 128-bit fabric with 8-bit PEs and eight registers per PE over a 300-MHz Ultrasparc II (speedups range from 11.3 to 189.7), and (b) PipeRench performance scaled with process technology based on The National Technology Roadmap for Semiconductors (Semiconductor Industry Association, San Jose, Calif., 1997), including the year projected for each process generation: 180 nm (1999), 130 nm (2003), 100 nm (2005), and 70 nm (2008).

The IDEA block cipher includes three fundamental operations: addition modulo 2¹⁶, 16-bit exclusive OR, and 16 × 16 multiplication modulo 2¹⁶ + 1. The 128-bit key generates 36 16-bit subkeys. One operand of every multiplication operation in the algorithm is a subkey. This multiplication is essential for customizing an IDEA configuration.

Compilation has two components: generating optimized multipliers and generating the rest of the pipeline. With an interface that satisfies the multipliers and the rest of the pipeline, the compiler can perform these two tasks independently. The compiler generates the nonmultiplier portion ahead of time, assuming fixed placement of multiplier inputs and outputs.

To generate the multipliers themselves, a small controller rapidly computes configuration bits for the stripes that perform the multiplication. This controller has a lookup table that converts constant multiplicands into the necessary configuration bits. PipeRench's regular, small configuration bit stream makes this generation very easy. Using canonical signed digit (CSD) representations for the subkeys, a multiplier takes 2.06 stripes on average. The result is a complete IDEA pipeline of only 177 stripes.

Table 1 compares PipeRench's implementation with two optimized software implementations running on state-of-the-art processors and a custom VLSI design. PipeRench outperforms the processors by more than 10 times.

Surprisingly, PipeRench also outperforms the 0.25-micron IDEACrypt Kernel from Ascom Systec Ltd.7 This is due to several factors: First, the PipeRench implementation of IDEA does not include the time taken to generate keys. PipeRench targets streaming-media applications, where key generation involves only a small preprocessing step. Second, because of PipeRench's pipelined nature, it effectively pipelines IDEA into 177 stages, rendering cipher block chaining (CBC) and other chained implementations impractical. Nevertheless, PipeRench's raw throughput is 40 percent faster than that of a fully custom chip.

Reconfigurable computing is changing the face of custom hardware. In the past, "custom hardware" used general-purpose elements; a reconfigurable device customizes the elements. For example, reconfigurable computing lets an IDEA implementation use custom, rather than general-purpose, multipliers for a specific key at runtime. This level of customization can yield tremendous performance improvements. Additionally, because reconfigurable devices can serve many purposes, they will likely become common off-the-shelf parts, further reducing system cost.

At some point in the near future, current processor technology will cease reaping the gains of Moore's law. Reconfigurable computing has the potential to become a vital component in the next generation of computation devices. The current implementation of PipeRench is the first step in realizing this potential. Because PipeRench is an attached processor, bandwidth between PipeRench, the main memory, and the processor is limited. This places significant limitations on the types of applications that can realize speedup. Nevertheless, the attached processor is merely the initial phase in the evolution of reconfigurable processors. Just as floating-point computations migrated from software emulation to attached processors, then to coprocessors, and finally to full incorporation in processor instruction set architectures, so too will reconfigurable computing eventually become an integral part of the CPU. ❖

References

1. H. Schmit et al., "Pipeline Reconfigurable FPGAs," J. VLSI Signal Processing, Mar. 2000, pp. 1-18.
2. M. Budiu, "Detecting and Exploiting Narrow Bitwidth Computations," Proc. 2nd Ann. CMU Symp. Computer Systems, Carnegie Mellon Univ., Pittsburgh, 1999.
3. M. Budiu and S.C. Goldstein, "Fast Compilation for Pipelined Reconfigurable Fabrics," Proc. 1999 ACM/SIGDA 7th Int'l Symp. Field Programmable Gate Arrays, ACM Press, New York, 1999.
4. S.C. Goldstein et al., "PipeRench: A Coprocessor for Streaming Multimedia Acceleration," Proc. 26th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 28-39.
5. H. Lipmaa, "IDEA: A Cipher for Multimedia Architectures?" Proc. Selected Areas in Cryptography, Lecture Notes in Computer Science, Vol. 1556, Springer-Verlag, New York, 1998, pp. 248-263.
6. B. Preneel, V. Rijmen, and A. Bosselaers, "Recent Developments in the Design of Conventional Cryptographic Algorithms," Proc. Computer Security and Industrial Cryptography, Lecture Notes in Computer Science, Vol. 1528, Springer-Verlag, New York, 1998, pp. 106-131.
7. Ascom Systec Ltd., "IDEACrypt Kernel," http://www.ascom.ch/infosec/idea/kernel.html.

Table 1. Comparison of IDEA implementations with other designs.

Processor                 Clock speed (MHz)   Clocks per block   Throughput (Mbytes/s)
PipeRench                 100                 6.3                126.6
Pentium II using MMX5     450                 358.0              10.0
Pentium6                  450 (scaled)        590.0              6.1
IDEACrypt Kernel          100                 3.0                90.0

Seth Copen Goldstein is an assistant professor in the School of Computer Science at Carnegie Mellon University. His interests include reconfigurable computing systems, with an emphasis on compilation and system architectures. Goldstein has a BSE in electrical engineering and computer science from Princeton University and a PhD in computer science from the University of California at Berkeley. Contact him at [email protected].

Herman Schmit is an assistant professor at Carnegie Mellon University. His interests include reconfigurable computing systems, field-programmable gate arrays, and IP-based design methodologies. Schmit has a BSE in computer science engineering from the University of Pennsylvania and a PhD in electrical and computer engineering from Carnegie Mellon University. Contact him at [email protected].

Mihai Budiu is a PhD student in the Department of Computer Science at Carnegie Mellon University. His interests include reconfigurable hardware design and high-level language compilation for systems containing reconfigurable hardware devices. Budiu has a BS and an MS in computer science from the "Politehnica" University of Bucharest, Romania. Contact him at [email protected].

Srihari Cadambi is a PhD student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His interests include hardware compilation, especially synthesis, technology mapping, and place-and-route for reconfigurable and other novel architectures. Cadambi has a B.Tech in electronics engineering from the Indian Institute of Technology and an MS in electrical and computer engineering from the University of Massachusetts, Amherst. Contact him at [email protected].

Matt Moe is a PhD student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. He is working on specialized reconfigurable architectures for different application domains to realize faster and smaller implementations. Moe has a BS and an MS in electrical and computer engineering. Contact him at [email protected].

R. Reed Taylor is a graduate student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His interests include reconfigurable hardware, near-Shannon-limit error codes, and cryptography. Taylor has a BS in electrical and computer engineering from Carnegie Mellon University. Contact him at [email protected].
