cathedral-ii: silicon compiler - cs

!

Cathedral-II:A Silicon Compilerfor Digital SignalProcessingH. De Man, J. Rabaey, P. Six, and L. ClaesenIMEC

I//ccan easily define a sili-con compiler in general

terms as a software system support-ing chip layout synthesis startingfrom a behavioral description at thealgorithmic level. However, stan-dardization beyond this simpledefinition can be risky. We believethat "the" silicon compiler simplycannot exist, any more than "the"software compiler can exist. It is thisbelief that led us to develop anapplication-specific silicon compiler-that is, a compiler necessarily tiedto a particular application, in thiscase, digital signal processing.

Our compiler, called Cathedral-lI,is based on what we call a "meet inthe middle" design method. Itsname describes the separation ofsystem design and reusable silicondesign and the gradual move toward

a middle ground during the synthe-sis process.

The specific function of Cathedral-11 is to synthesize synchronous multi-processor chips for digital signalprocessing. It was developed follow-ing a series of five steps:

1. Define a wide, but concise,class of system design appli-cations.

2. Define, based on manual de-sign exercises, a target siliconarchitecture and its associatedlayout style.

3. Define a design strategy basedon available designer skills.

4. Define the behavioral languageand the silicon modules.

5. Then and only then developthe CAD tools, with emphasison the "D" and the "A."

December 1986

Te article describes the status ofwork at IMEC on the Cathedral-lIsilicon compiler. The compiler wasdeveloped to synthesize synchronousmultiprocessor system chips for digi-tal signal processing. It is a continua-tion of work on the Cathedral-Ioperational silicon compiler for bit-serial digital filters.

Cathedral-lI is based on a "meet inthe middle" design method that en-courages a total separation betweensystem design and reusable silicondesign. The CAD system includes arule-based synthesis program, a pro-cedural program, and a controllersynthesis environment. Processors aresynthesized in terms of modulescalled from automated reusablemodule generators. Chip layout isdone on a floor planner. An expertsubsystem verifies correctness duringsilicon design and generates func-tional and timing models for verifica-tion at the module and chip levels.

0740-7475/86/1200-0013$1.00 0 IEEE 13

THEAPPLICATIONCathedral-lI is based on complex

algorithms in digital signal process-ing, which is playing an increasingrole in modern VLSI-based systems.DSP spans data throughput from 1Kbps up to 100M bps. Therefore, eventhis area is too wide for a singledesign style or silicon compiler.The low-throughput end of DSP is

well served by the new generationsof general-purpose DSP chips. How-ever, while the first implementablealgorithms addressed such simplecases as digital filtering, today's ap-plications require throughput fromaudio sample range up to 1M-bpssampling rates. Consequently, weneed very high precision or highlycomplex algorithms involving blockdata processing, matrix manipula-tions, multiple data rates, and a lotof decision making besides numbercrunching. Examples are digitalaudio, speech processing, smartmodems, and robotics. Also, theseapplications would be a lot moreattractive if they included adaptableI/O periphery on the same chip.General-purpose digital signal

processors are not very well-suitedfor the implementation of thesealgorithms. On the other hand, afull custom solution is too costly indesign time and, more important, in

time to enter a highly competitivemarket.

Because of the highly specializednature of such algorithms, weassume that the algorithm designeris able to do the silicon implementa-tion. Therefore Cathedral-lI ad-dresses highly complex, block-oriented DSP algorithms in theaudio to near video frequencyrange.

Figure 1 shows a pitch extractionalgorithm for speech.' This examplehas been used to study the designprocess and to define the tools inCathedral-l1. Real and imaginaryparts of blocks of 64 frequency com-ponents from a discrete Fouriertransform processor are first trans-formed into an amplitude spectrum.By averaging, a threshold is com-puted to eliminate4irrelevant spectralcomponents. From the remainder,the maxima in the spectrum are com-puted and then compared to 49sieves with meshes at octave dis-tances. Finally, the best match iscomputed as the pitch value. As thefigure shows, this problem, typicalfor third-generation DSP algorithms,naturally decomposes into a numberof subprocesses, which are fairly in--dependent of each other.

These algorithms give rise tocommunication bottlenecks, whichoccur from the accumulation of thesequentially generated data neededto create new samples for the nextsubprocess. As Figure 1 shows, wecombat the problem by putting adata storage element between twosubprocesses. Based on a carefulstudy of these effects we have there-fore defined a target architecture inwhich each subprocess is assignedto a dedicated processor, and theinterprocessor communication, in itsmost general form, is taken care ofby switched RAMs.

THEARCHITECTUREWe have selected a flexible,

tailorable multiprocessor architec-ture as the target for our design syn-thesis. The architecture is flexibleenough to evolve into a single, gen-eral processor at the lower end ofthe frequency spectrum and into aset of parallel, hard-wired data pathsat the upper end. As such, it spansmany applications-speech, audio,robotics, telecommunications, andimage processing. The basic idea ofthe proposed architecture is to addenough flexibility to attack the threebasic bottlenecks that normally limit

Alk]

k1 k k3 k4 5 k k612 n3 4 ~.26(a) [x] x[1 x[2] x[3] x44] x[5] x[6]

frequency

Am.spcr BApltd 8 X 8 |Optimal& threshold U maxima +*U _ harm. sleve

I omuitationj[ F coptato dtemnt

Fourier Threshold & Frequency(b) coefficients squared amplitudes values

Fundamental ofopfimal sieve &

frequency values

Figure 1. Typical example of a third-generation digital signal processing algorithm, pitch extraction for speechanalysis (a). The block structure of the algorithm (b) consists of independent processes, each of which is assigned toa dedicated processor.

I j

14 IEEE DESIGN & TEST

data throughput in general-purposesignal processors: arithmeticthroughput, congestion in datatransfer, and controller delays whenperforming conditional operations(frequent in third-generationalgorithms).At the highest level, the proposed

architecture is composed of a set ofconcurrent operating processors.Each executes one subtask of thealgorithm and is optimally tuned toperform just that one task (Figure 2).Each processor operates relativelyindependently of protocols, ex-changing only data that is globalamong processors.Depending on the data exchange

rate and the amount of bufferneeded, different protocols can beselected: synchronously switchedRAM buffers, FIFOs, or request- andacknowledge-based synchroniza-tion. Communication with the out-side world is over an I/O frame thatcan support a large range of I/Oprotocols (parallel to serial, syn-chronous to asynchronous).Each processor consists of a dedi-

cated data path and a controller.The datapath is optimized for thetask(s) it has to perform and is as-sembled from a set of selected exe-cution units (EXUs), interconnectedby a restricted number of cus-tomized buses. Each EXU contains aregister file (of variable size) at itsinput side. This structure makes itpossible to avoid the arithmetic anddata transfer bottlenecks. Studieshave shown that the following EXUsare sufficient to span most of thetarget application space:* a general-purpose (but rather

inefficient) ALU/Shift unit* an address computation unit(ACU) with modulo-countingcapabilities

* a parallel multiplier/accumu-lator

* a parallel-serial divider* a comparator (for max-min

computations)* a scaler/normalizer (for fixed-

point/floating-point conver-sions)

The first two units are general-purpose blocks, while the others areaccelerators. All these units havebeen designed with changeableparameters and have been imple-mented in the module generation

December 1986

Figure 2. A multiprocessor architecture.

Figure 3. A customized processor data path for processor 1 of the. pitchextraction algorithm in Figure 1.

environment. Typical parameters areword length, shift dimension, size ofregister files, and size of multiplearray.An example of a processor data

path built with this strategy is shownin Figure 3. It is used to compute theamplitude spectrum of a signal,given the complex frequencydomain spectrum, and at the sametime to determine the maximumamplitude. It consists of three con-current units: a multiplier/accumu-lator, a comparator, and an ACU.This data path can perform an ampli-tude computation and a maximumupdate in two cycles (average).We chose a multibranch, micro-

code-based controller to controldata flow through the data path.This structure can handle a largespan of algorithms flexibly and effi-ciently. It can support algorithmsthat require heavy decision-makingas well as those that are repetitive.Some EXUs can also have a local con-troller to help reduce the complexityand the size of the central controller.An example is the decoder of theregister files. Another controller isneeded at the highest level to con-trol the flow of data between pro-cessors and from the processors tothe outside world.

THELANGUAGEDifferent applications obviously

require different specification lan-guages; a microprocessor designerwill use conceptual constructs dif-ferent from those a modem designermight consider. Therefore we haveselected Silage72 a language op-timized for high-level description ofsignal processing algorithms, as thedesign language for Cathedral-Il.

The basic object in Silage is thesignal, which is a vector whose com-ponents are samples in the timedomain from infinity to actual time.The basic operation is a functionalapplication of those signals. In thisway, a Silage description of an algo-rithm resembles a signal flow graph,where nodes are instances of func-tions and arcs are the signals. Silagesupports time domain operationssuch as decimation and interpola-tion and allows for the description ofequations, decisions, iterations,hierarchy, and finite word-lengtheffects. The Silage description of thealgorithm implemented on the datapath of Figure 3 is given in Figure 4.

In some cases, the designer maywant to pass some structural hints tothe compiler to guide the synthesis

15

1.tI

I CTRLI

Figure 4. Silage description of ampli-tude and threshold computation inthe pitch extractor.

process. In Silage, a construct called"pragma" is provided to pass this typeof information.A Silage simulator, based on de-

mand-driven simulation techniques,is currently under development.Once the system is successfullysimulated, synthesis can start.

MEETINGINTHEMIDDLE

After choosing the application anddefining the target architecture thenext step is to select a design strat-egy. We are typically addressing thequick-turnaround design of com-plex systems on chips with about300,000 devices. The complexity ofthe algorithms to be implementedfar exceeds that of the circuit oreven the logic gate level. Moreover,since silicon designers are so scarce,we cannot exploit the potential ofsuch systems, unless they are directlydesigned by the system engineers,without detailed knowledge of sili-con implementation.

Therefore, silicon design knowl-edge (at the micron level) must belocalized in reusable modules at theMSI/LSI level familiar to the systemdesigner. Figure 5 shows a designscheme that satisfies these require-ments. We call it meet-in-the-middledesign methodology, and it can becharacterized as follows:

System design is separated from sili-con design. The interface is locatedat the level of arithmetic/logicoperator blocks, data storage, con-trollers, and 1/0 units. These are the

EXUs of our target architecture. Thesilicon primitives used at that levelare called modules. Design at thesystem level consists of translating asystem specification into a structurethat is a netlist of module instances.Placement and routing of layout in-stances of the modules becomes thechip design.Silicon modules are reusable. In thisway the costly investment in high-performance, advanced silicon tech-nology design is limited, and thecost is written off over as many de-signs as possible.Silicon modules are technologyadaptable. Silicon modules are muchmore complex than standard cellsand, to save even more in terms ofsilicon design cost, these modulesmust also be able to survive a num-ber of technology updates.Module design has a powerful de-sign environment. Silicon modules,however, are not enjoyed by a par-ticular foundry or CAD vendor.Since the competitive edge betweensystem houses will not be in thetechnology but in the architecturaltechnique and in its implementa-tion, we expect module design torequire a powerful design environ-ment for a local team of silicon de-signers. This may not yet be thecase, but we expect it to happen inthe future.

In this method, system design istop-down to the usual intermediate

level. Silicon designers compose, inthe usual bottom-up fashion, LSI-level modules from functional build-ing blocks, which in turn are com-posed of logic leaf cells at thetransistor level.Hence, both parties "meet each

other in the middle" of the designabstraction levels. Scarce talent isoptimally used and the design pro-cess corresponds to the usualpatterns.There are some fundamental de-

viations from classical design at thesilicon and system levels, however.These usually stem from the fact thatmodule generators are generallysoftware procedures rather thanfixed geometrical structures. Also,we expect system designers to thinkat the algorithmic rather than thestructural level.

THE CAD SYSTEMFigure 6 shows the CAD toolbox

used in Cathedral-l1. The figureshows that in the middle of thedesign abstraction is a separationbetween the silicon and the systemdesigners. The link between them isa "call" to a limited set of EXUs asdefined in the target architecture.The system designer defines the

system at the behavioral level inSilage. Coupled to this language isthe high-level simulator to verify thebehavioral correctness of the algo-rithm. Using the throughput require-ments and a set of expert design

Figure 5. "Meet in the middle" design methdology.

IEEE DESIGN & TEST

func AmplitudeSpectrum(A[], B[J: num8)Ampl[J, Max: numl6=

beginThresh [0] = 0;Max = Thresh[641;(i = 1 .. 64)beginThresh[i] = if (Ampl[i] - Ampl[i])

Thresh [i-11fi;

Ampifi] = num16 (A[i]*Ali] + B[i]*B[iJ);end

endpragma Processor (1,AmplitudeSpectrum);

Func Modu eModule

*4 migdle-Procedural

Gate module Cell~designf

Layout circuitTR Rect software

t engineers

Layout +technology rules Technology

engineers

16

rules, Silage code is first optimizedwith respect to the functions of thetarget architecture. The optimizedcode is then compiled directly intostructure (a netlist in terms of thelimited set of predefined EXU siliconmodules). Clearly most of this syn-thesis is dictated by the target archi-tecture, and therefore quite a largepart of it is rule-based. In Cathedral-I1, it was implemented in Prolog,which is discussed later.We also, in principle, allow the

designer to specify the design atintermediate levels, all the way downto structure, but at the price of in-creasing the amount of lower levelsimulation, the redesign risk, andthe time to the market. When allcalls to the modules are successful,the chip is ready for floor planning.This can be done either interactivelyor by automatic placement androuting techniques.

In Cathedral-Il, we include thesynthesis of the algorithm directlyinto data paths and control logic.That is, we are aiming at a truesilicon compiler, in contrast to mostcommercial systems today, which arelimited to floor planning and tomodule generation.How does this scheme work? First,

just as in a software language com-piler, success is intimately linked toa clear definition of the set ofmodules. Consequently, we mustgenerally limit the variability of themodules and carefully consider theapplication when choosing EXUs.

Second, in view of the evolvingtechnology, we cannot design in afixed technology but must be ableto adapt to technology changes.

Finally, we need a lot more in-formation from a module generatorthan just the layout view, which initself should consist of a boundingbox view, as well as a full, layoutview. Other views include a func-tional view, a timing view, and a testview.A functional view is an RTL-level

function, in case the design wasdone at the structural level, requir-ing simulation. Even if full synthesisis used, designers will still want tosimulate the actual chip. This simu-lation is possible in our system bythe Hilarics-Logmos program3 whichis a register-transfer simulator thatallows modules to be included and

Figure 6. The CAD toolbox in Cathedral-Il.

symbolic microcode descriptions tobe defined.A timing view is necessary, since

although the synthesis compiler cantake first-order throughput require-ments into account, it is only afterplacement and routing that a fullperformance check can be done.

In Cathedral-Il, we are following abottom-up hierarchical generationof timing models in a knowledge-based program, called Slocop, whichclosely follows the composition pro-cedure of a module. At the floor-plan level, interconnection parasiticsare tat en into account to checkglobal timing. If unsatisfactory,buffer sizes are adjusted. If still notsatisfactory placement is changed,or a pragna is formed at the Silagelevel to call for higher performancemodules or to increase parallelismin the algorithm.

Finally, although it is not in theactual Cathedral-Il version, a testview is needed with each module.We envision that most modules witha high degree of structure will beC-testable. That is, depending onthe particular structure, a small set

of word-length-independent pat-terns will guarantee module test-ability. A test assembly programwould then be implemented at thelevel of the floor planner. For theparticular architecture, the programwould generate total testability forthe whole chip.

In contrast to the traditional Calmatype of fixed layout oreven thetiling of fixed cells, our design styJecalls for modules to be written asprocedures with adjustable param-eters. Parameters range from simpleword length to conditional compo-sition in terms of functional buildingblocks or size of output buffers.These procedures should also gen-erate the other views needed by thesynthesis programs.

This style causes two problems.First, because silicon designers willconfront the software more often,and most are not trained to under-stand the software, we will have tohide the code generation as muchas possible. -Second, we must care-fully restrict the set of modules-amatter of carefully choosing EXUsaccording to the architecture.

December 1986 17

The designofa module isbased on afunctionaldescription atthe register-transferlevel.

It is this careful selection of EXUsfrom functional building blocks thatwe feel is the key to making Cathe-dral-ll feasible. It makes just as muchsense as selecting a language beforeconstructing a compiler.The selection is done in the

module generator environment.EXUs are composed from functionalbuilding blocks like adders, com-parators, and registers, which are, inturn, composed of logic cells.

Figure 7 shows the anatomy of themodule generator in Cathedral-l1.The arrow types indicate clearly the"create" (silicon designer), "gener-ate" (call from silicon compiler), and"adapt" (to technology rules) func-tions needed for such a program-ming design environment.The Cathedral-Il module genera-

tion environment provides not onlythe layout environment but also afirst version of an expert verificationsystem. The verification system isneeded for the following tasks:* to generate the functional/

timing and test models duringthe "generate" phase

* to verify the modules duringthe "create" and "adapt" designphases

The design of a module is based ona functional description at theregister-transfer level. This descrip-tion (and its simulation) can be doneby the Hilarics-Logmos system.3 It isthe documentation link between thesystem and the silicon designer.

In a module generator, the con-nectivity and relative placement of

cells must be procedural. Code mustbe generated to procedurally com-pose a module out of cells. Cathe-dral-lI generates this code fromgraphics definitions using a Lisp in-terpretive programming environ-ment. Only complex mathematicalrelationships are directly pro-grammed in Lisp.The primitive cells themselves are

designed by means of symbolic lay-out followed by automatic compac-tion, which also provides for auto-matic intercell abutment and routingfacilities that can be called directlyfrom Lisp procedures. We also havean efficient circuit extraction pro-gram based on symbolic layout in-formation. The advantage of sym-bolic layout is that we can adapt thesystem when the technology isupdated.

Finally, design experience is ac-cumulated and verified in the verifi-cation box. The box is, to a greatextent, also rule-based and uses theconcepts of expert debugging, rule-driven simulation, and timing veri-fication.

Design synthesisArchitectural synthesis, as envi-

sioned in Cathedral-Il, is based ontwo cornerstones:* the targeting of the synthesistoward a single (but flexible)architecture, allowing for con-siderable pruning of the searchspace and making it possible touse dedicated optimizationtechniques to exploit propertiesof the architecture

* the use of the module genera-tion environment as a moduleknowledge database that can bequeried by the synthesis tools,allowing design decisions to bebased on physical implementa-tion details (speed, area, power)

This strategy avoids the inefficiencyof previously published design sys-tems,4,5 which attacked synthesisproblems in a purely top-downfashion. These cornerstones are re-flected in the outline of our synthe-sis system, as presented in Figure 8.It consists of several subtasks, whichwere initially defined through a

CreateGenerateAdapt

(Si)(Sys)(Tech)

Figure 7. The module generator software environment has three functions:create (silicon), generate (system), and adapt (technology).

IEEE DESIGN & TEST18

number of manual design exerciseson real industrial examples. Theseexercises have shown that synthesisis an iterative and interactive opti-mization process. Therefore, we en-vision an open design system thatallows the experienced designer tointerfere with synthesis at variousintermediate levels. Consequently,we need entry points (descriptionlanguages) and verification means(simulators) at all those levels.However, user interference, shouldbe avoided as much as possible, es-pecially at the low levels, and thesystem should be intelligent enoughto restrict the iteration loops to alocal level (if possible).As Figure 8 shows, input to the

design system is Silage. The correct-ness of this description can be veri-fied using the Silage simulator. Thesimulator outputs can also be usedas a reference to verify the effects offurther design steps.

Algorithm partitioningThe first step in synthesis is to

partition the algorithm. The Silagedescription is split into distinct sec-tions, each of which is mapped ontoa separate processor. Each processoris then synthesized independentlyfrom the others, which is possiblebecause the interprocessor com-munication protocols are transpar-ent. The partitioning itself is basedon a number of often contradictoryobservations, such as load balancing,data transfer bottlenecks, cut-setanalysis, and computational require-ments. In most cases, however, aneducated guess can be made bysimple inspection. Therefore, wehave decided for the first version ofCathedral-Il to leave the partitioningto the user by means of pragmastatements in the Silage description.

Data path synthesisThe applicative description of

each processor is transformed into aprocedural one in two steps. First, acustomized data path is synthesized,and the high-level operations aremapped onto a set of register-trans-fer-level (RTL) operations for thisdata path (Figure 8). Second, theoptimal timing of the RTs is com-

puted. This is referred to as "micro-program scheduling" (discussedlater). The intermediate RT languagehas nonconflicting procedural (fixedarchitecture) and applicative (no ex-plicit timing specification) prop-erties.The synthesis of a processor data

path includes several tasks. A limitednumber of EXUs are stored in thelibrary. Using this knowledge andthe throughput specifications, theEXUs required to realize the Silageoperations are selected. Dedicatedbus connections between differentEXUs within a single processor arealso chosen.

Finally, the chosen EXUs arestripped of unnecessary operators,and internal parameter values areassigned. These decisions are in-fluenced by area and throughputspecifications, which determine thechoice between multiplication al-ternatives (hard-wired or shift-add-based multiplier) or concurrencylevels. During synthesis, Silage vari-ables are assigned to data pathregister files (containing multiplestorage fields). In this way, the Silageoperations are transformed into RToperations. The result is a "blackbox" description of the controller,which serves as input to the micro-program scheduler.An automated synthesis tool sup-

ported by knowledge-based tech-niques is under development. In afirst pass, the Silage code is prepro-cessed. This includes a simplificationof the source code and a number oflocal transformations (loop handling,reordering of singular variables,etc.). The preprocessor also preas-signs operations to EXUs. These pre-assignments are based on a globaloverview of the problem and areused to guide the data path mapper.The mapping tool itself, imple-mented in Prolog, tries to unify thehigh-level Silage equations withinternal low-level equations, eachof which represents a fundamentaloperation of an available EXU. Suc-cessful unification may mean fulfill-ing certain conditions, under theform of new high-level equations.Synthesis is governed by a programcalled inference engine, which ad-vises on the unification rule mostappropriate to a given high-levelequation. The architectural proper-

ties of the EXUs are stored as a set ofrules, which keep the tool flexiblewith respect to architecturalchanges.

Microprogram schedulingRegister transfers are fundamental

operations from a control point ofview, since each is to be realized in asingle machine cycle. In micropro-gram scheduling, the optimal map-ping of RT operations to time(machine cycles) is computed (op-timal meaning with a globally mini-mal cycle count). Conditional RTsand FOR constructs are allowed inthe RT language. A FOR loop can beregarded as a pragma by which thedesigner prescribes a limited order-ing of operations. Loops arestraightforwardly mapped into themicrocode. The algorithms arescheduled statically at compile time,as opposed to dynamically at run-time, thus saving complex data-driven control protocols. Thescheduling tool should take intoaccount all implications of the actualcontroller structure, such as theamount of pipelining and the typeof controller logic, as well as allrestrictions to prevent conflictsbetween RTs. Therefore, a prelimi-nary choice between alternative con-troller types is required (we haveselected the multibranch, micro-code-based controller).From the scheduling constraints

(data precedence constraints, re-source conflict constraints, and con-troller pipelining constraints), ageneral integer linear program (ILP)can be set up, which can be solvedusing standard procedures. Sched-uling is NP-complete; thereforelarge problems require the assist-ance or replacement of the ILPtechnique by heuristics. Afteranalyzing the lifetime of the registerfile variables, the optimal scheduleis converted into a readable sym-bolic microcode.

After scheduling is completed, thecycle count is evaluated, which mayforce backtracking to the data pathsynthesis level. Excessive cyclecounts may lead to faster implemen-tation schemes or changes of theconcurrency level, while low cyclecounts may suggest cheaper datapath structures.

December 1986 19

Figure 8.Overview of

functions,languages,and simu-

lators in theCathedral-lI

synthesisprocess.

Controller synthesisFrom the symbolic microcode, the

contro'ler decoding logic is synthe-sized. To feasibly compare controllerrealizations (microcode ROM-basedcontrollers, multilevel logic imple-mentations, PLA-based sequencers,etc.), we have a unified environmentwith a single input language. Thisenvironment provides a library ofutilities for synthesis, optimization,and layout generation of controllerstructures such as state assignment,logic minimization, partitioning, andassembly of regular arrays usingsymbolic layout techniques. In thisway, new controller architecturesare easily introduced. A prototypeversion of the environment is beingimplemented using the multibranchcontroller as a test case.

This synthesis system has beenused to generate the design of ahigh-quality pitch tracker for audiosystems (Figure 2). The result (Figure9) is a tracker with four processorsthat occupies only 37 mm2 in a 3,uCMOS process.

Figure 9. Overview ofthe datapath synthesisprocess: translation of

Silage into structureand register transfer

representation.

Silagedescription

Processor datapath A \ RT description

IEEE DESIGN & TEST

func amplitudeSpectrum (fr, fi: num[]) ampl: num[]; thresh: num =begin

max[O] = 0;ampl[65] = 0;(i ::1 . 64) max, ampl=begin

ampl[i] = square(fr[i]) + square(fi[i]);max[i] = if (ampl[i] > max[i - 1]) -ampl[i] max[i - 1] fi;end

thresh = max[64];end

PROGRAM amplitudeSpectrum;BEGIN

thresh:ram2 - max[64]:reg6 comp = pass6, bus2 = thresh;ampl[65]:reg6 - 'O':reg6;ampl[O]:reg6 = 'O':reg6;

FOR :=1 64 HOLDSBEGIN

fr[i]:reg2 fr[iJ:raml raml = reaa, busl = fr[i];fr[i]:regl fr[i]:raml raml = read, busl = fr[i];

ENDEND.

a

20

Module generatorTo define a module generator, we

need a system that enables us todesign the leaf cells and providehigh-level commands to express thecomposition of a module as a func-tion of the parameters. Since thecomposition rules may be as com-plex as

"Fit a number of leaf cells so thattheir terminals connect by abut-ment"

or"Route a set of nets"

the algorithms to support thesehigh-level commands must also beprovided in an integrated en-vironment.Many existing module generators

have a delay between definition andexecution, introduced by the com-pilation and linking steps of the tra-ditional programming languagesthey are implemented in. To over-come this problem, we designed aninteractive environment that givesgraphical feedback each time a com-position rule of the module genera-tor is defined, allowing mistakes tobe corrected immediately after theywere made.To describe a module we capture

four types of information.

Structural: The submodulesTopological: The relative place-

ment of the submodulesConnectivity: How the sub-

modules must be interconnectedand where the input and outputterminals of a module are

Construction: How submodulesfit together and which intercon-nections are realized by abutmentor routing.

In the first version of the modulegenerator in Cathedral-Il, calledMGE, structural and topologicalinformation is entered at the sametime. Components are added on gridpoints to represent their relativeordering but not their final coordi-nates. The command (ADD-COMPFA I1 . . N I 1), for example, adds ahorizontal row of length N of FAcells. As Figure 10 shows, the cellsare represented by their outline atthis stage, resulting in a clear pictureand fast display.

Figure 10. Structure and connections of a ripple adder.

Connections are also easily de-fined by specifying a net name, andthe list of terminals of the connectedcomponents. The command (ADD-CON (RIPPLE 12.. NI) (COUTI1 . . (- N 1) 1) (CIN 1 2. . N 1))defines the connections of the ripplesignals in the adder example. Noassumption is made at this stage as

to how the connections will berealized.

Indices of array-like structures canbe expressions; (N-1) is the lastripple signal, for example. MGEallows us to associate an expressionwith each horizontal or vertical gridline on the display (see x-axis inFigure 10). The creator can then, by

Figure 11. A set of symbolic cells before (a) and after (b) automatic fitting.

December 1986

,4 .(NEWDSD.E'R 6I

.5 * . .

.4

3

,.3 . . . * *

.12 1

P 1 2 3 4 N-I N N+1 N+2 N+3 NI

(a)

0(0b)a(a)

21

Figure 12. Add-Shift-Compare module as generated by the module genera-

tor environment.

simply placing cells and pointing atterminals graphically, automaticallygenerate the Lisp procedure. This isthe first step towards graphicaldefinition of module generatorcode.

For the design of the leaf cells weuse Cameleon, our symbolic layoutand compaction program.6 Layoutsare assembled from point elementslike transistors and contacts, frominterconnecting wires, and fromareas that need defining, for exam-

ple, the n-well or p-well in CMOStechnology. The layout is thencompacted, using a constraint graphand a longest path algorithm, tominimize cell area. The program

ensures that the layout is correctwith respect to the design rules, as

specified in a technology file, alsotaking into account extra constraintsthe user may have specified. Tech-nology changes are easily accommo-dated by changing the values ofsymbolic constants, like MIN-POLY-WIDTH or MIN-GATE-OVERHANG,in the technology file and recom-pacting the cell.

Construction rules specify howthe real layout of the module must

be made starting from the topology,the connections, and the basic cells.Either the cells fit on each other andare pushed together, or routing mustbe added to make the connections.Both options are available in MGE.The ROUTER command allows theuser to define a horizontal or verticalchannel and the nets that have to berouted. River and channel routersare available: The COMPACT com-

mand moves components togetherin the x or y direction as specified.The designer can combine theseconstruction rules to first route somechannels and then compact theglobal module.

In a traditional layout environ-ment, cells can be abutted onlywhile designing the layout of a cell.In our system, the designer can ex-

press this requirement by the com-

mand (ABUT XORY (CELL-LIST)).The module generator environmentwill then call the compactor, loadthe cells specified in CELL-LIST andcompact the internals of the cells,taking into account the constraintsimposed by their neighbors. Figureslla and llb show a set of cellsbefore and after applying the ABUTcommand.

The final step of an editing sessionin MGE is to save the commandsinto the module generator pro-

cedure. This is done automaticallyeach time the designer stops work-ing on a module. From this pro-

cedure, modules can be generatedfor any set of parameters in theallowed range.

To avoid inventing a completenew language and to take advantageof the interpretative nature of Lisp,the first version of the module gen-

erator environment was imple-mented as a superset of Lisp. For themore computationally intensivealgorithms like compaction androuting, Pascal and C programs are

called from Lisp. The system hasbeen tested through the design offive EXUs defined for a particulararchitecture. The resulting layout ofthe add-shift-compare is shown inFigure 12.

The use of Lisp as the basis of theMGE provides the creator with allthe flexibility of a high-level pro-

gramming language to definemodule generators that requiremore complex composition pro-

cedures. An example is the com-

parator shown in Figure 13.

Module verificationOne of the problems with module

generators is how to prove their cor-

rectness. The traditional way is bysimulation. However, simulation is asubjective test method that dependsentirely on the patterns defined bythe designer to detect expectedproblems. It is costly in CPU timeand is not guaranteed to capture un-

expected problems.

To overcome these drawbacks wehave developed Dialog,7 a set ofrule-based analysis tools that allowsus to detect violations against basicdesign principles. We have alsodeveloped Slocop for timing analy-sis. These tools are based on theLisp-like Lextoc language,7 whichprovides powerful facilities to ex-

press rules about MOS circuits.Figure 14 gives an overview of theverification system.

We begin with MOS network ex-

traction. To verify the design we firsthave to extract from the layout a

IEEE DESIGN & TEST

t .~~~~~~~V I .F

EsZL lib ^ ~~~~~~~~~~~~~~!. #_" wr Et.-P.IL

;0be a ig g W _,>=~1L a

A= A =

w_~~~ -9 IGI i ,

22

model for the cell on the circuitlevel. This can be done very easily,since the stick diagram contains thedevices and the interconnectionwires. The internode capacitancescan be found by detecting the pointswhere elements cross.

Next from the extracted MOStransistor network the user can de-compile and check clocking rules.The system first determines if thenetwork belongs to the class of validcircuit configurations, such as staticCMOS, using passive multiplexingtrees and prescribed registers. Allparts of the circuit violating theserules are flagged as errors. Thechecking is done by partitioning thenetwork into the basic subnetworksthat are possible in a given designstyle. The description of thetopology of those subnetworks(static CMOS gates, pass transistortrees with or without bleeder tran-sistors, etc.) resides in a knowledgebase. This knowledge base is writtenin Lextoc. Therefore, in contrast toCrystal,8 for example, our system iseasily extended with new types ofsubnetworks, and is applicable toany definable class of digital MOScircuits (Figure 15).A high-level check on the clock-

ing is now done. Starting from theprimary clock information, the de-rived clocks are found, and latchesare detected and separated fromcombinatorial logic. By assigning theappropriate properties to the nodesin the network, illegal combinationsare detected and reported. Forexample,

A derived clock can only beformed with signals latched onthe same clock phase

A latch data input must not be aprimary clock

Figure 16 shows the response ofDialog to a request to verify a clock-ing rule. The system has identifiedand highlighted that a loop existscontaining only a single clock phase,which could cause a race problem.From the primitive circuits found

in the previous step, we now searchfor illegal configurations like opens,shorts, and odd NMOS and PMOScombinations. All acceptable cir-cuits are checked to see if they can

Figure 13. Interpretively generated representation of the structure of asix-bit comparator (a); the layout of the six-bit comparator (b), in whichsome cells are connected via abutment, some with routing; and the layoutof an eight-bit comparator (c), which demonstrates the flexibility of themodule generator.

Figure 14. Overview of the Dialog expert verification system.

guarantee correct logic levels. Thischeck is not restricted to static situa-tions but includes dynamic effectslike charge redistribution. An exam-ple of a possible problem circuitand the rule involved in the check-ing procedure are in Figure 17.

To diagnose an error on node Bcaused by charge sharing we have tofind the combination of input signalsthat creates the worst case con-figuration. This is done by theSETUP-INPUTS procedure. Usingthese inputs, the part of the circuit

December 1986

CutTestCompare

23

Figure 15. A sample of the allowed transistor configura-tion stored in the rule base.

Figure 16. Simple clock phase loop detected by Dialog.

Figure 17. Rule to detect charge sharing (a) and possibleproblem circuit (b).

Figure 18. Hierarchical highlighting of critical delay pathby Slocop.

under investigation is simulated on

the circuit level using the SIMMYmodule.9 The output waveforms ofthe simulation are then checked tosee if the voltage on node N2 dropsbelow 3.0 volts. If it does, a messageis issued telling the designer to ex-

pect charge-sharing problems on

node N2. If the designer does notbelieve this diagnosis, a facility iscalled to disolay the conditionscreating the problem situation andexplain by what successive steps itwas detected.Even if the circuit has passed all

the general rules, we must checkthat it does not violate the timingspecifications for the module. Fromthe primary clock specifications, theedges of the clock inputs for thelatches are computed and the maxi-

mum allowed delay of the combina-torial blocks is calculated.

Slocop, the timing verificationprogram, uses local circuit simula-tion to calculate delays. This methodis more time consuming than RC cal-culations, but it is very reliable,much more accurate, and applicableto any type of subnetwork that can

be described in the program'sknowledge base. For global analysis,graph-based algorithms are used,since the large number of pathsmakes an exhaustive path search im-possible. Figure 18 gives a multiwin-dow output of Slocop, showing thehierarchical annotation of the criti-cal path of an adder module. Slocopuses a critical path algorithm thatprovides the longest signal propaga-tion paths from inputs to outpu s.

CONCLUSIONWe have described the concepts

and status of the development of an

application-specific silicon compilerfor highly complex DSP algorithms.We have shown that development

of a silicon compiler is only possibleafter carefully limiting the targetsilicon architecture to a restrictedapplication area. Only then can a

CAD system be developed. Mostparts of the synthesis and modulegeneration system have been suc-

cessfully prototyped, and the firstlarge chip designs are now beingundertaken. The use of the Al pro-

gramming techniques is becomingmore and more important in thisactivity.

Behavioral silicon compilation as

well as the module generators to

IEEE DESIGN & TEST

(defrule charge-sharing-noral ()*if ((property fibar-ndyn el)

(relation innodes nl el)(relation outnodes nl e2)(property fibar-pdyn e2) *(reli,tion outnodes n2 el) O(relation coupon ecomp el(relation innodes nl ecomp)(relation ionodes n2 ecomp)(relation ionodes n3 ecomp)

'then

(simulate el n2 1) (b)

(at exor.fibar (> 3.0) pl) C

(attribute voltage (< 4.0) n2 pl)

*then((assign-info charge-sharing-error ecomp)(true (format please re-irnsion -a' (instance n3))))))

(a)

24

4. 1--

J-

support it will require new skillsfrom both system and silicon de-signers. It remains to be seen howthey will react to it in the future. Z9

ACKNOWLEDGMENTSWe recognize contributions from

many people, but most of all from all themembers of the IMEC VSDM group, par-

!ew'ticularly K. Croes, L. Rijnders, I. Vande-weerd, M. Pauwels, F. Catthoor, M. Bar-tholomeus, G. Goossens, J. Vanhoof, E.Vanden Meersch, and 1. Bolsens.

Like Cathedral-I, this work was spon-sored by the Economic Community'sEsprit 97 project, which involves con-tributions from IMEC, Philips, SiemensAG, Bell Telephone Manufacturing,Silvar-Lisco, and Ruhr University ofBochum.We recognize contributions from all

our Esprit 97 partners, especially Philips,Siemens, Bell, and Silvar-Lisco, particu-larly J. Van Meerbergen of Philips whogreatly influenced Cathedral-Il's targetarchitecture.

REFERENCES1. R. Sluyter, H. Kotmans, and A. Van Leeu-

waarden, "A Novel Method for PitchExtraction from Speech and a HardwareModel Applicable to Vocoder Systems,"Proc. IEEE ICASSP Conf., Apr. 1980,pp.45-48.

2. P. Hilfinger, "A High-Level Languageand Silicon Compiler for Digital SignalProcessing," Proc. IEEE CICC Conf., May1985, pp.213-216.

3. E. Marien et al., Manual of LOGMOSV4.2: A Simulator Covering RegisterTransfer-Functional-Gate and SwitchedLevel, available from Silvar-Lisco,Kapeldreef 75, B-3030 heverlee, Belgium.

4. D. Thomas et al., "Automatic Data PathSynthesis," Computer, Vol. 16, No. 12,Dec. 1983, pp.59-70.

5. P Marwedel, "The MIMOLA Design Sys-tem: Tools for the Design of Digital Pro-cessors," Proc. Design AutomationConf., 1984, pp.587-593.

6. L. Rijnders et al., CAMELEON Version1.1, Users Guide, report 5-C1-3 of EECproject MR-03-KUL, available fromIMEC.

7. H. De Man et al., "DIALOG: An ExpertDebugging System for MOS VLSI De-sign," IEEE Trans. Computer-Aided De-sign, Vol. CAD-4, No. 3, July 1985, pp.303-311.

8. J. Ousterhout, "Switch Level DelayModels for Digital MOS VLSI," Proc. De-sign Automation Conf., 1984, pp.542-548.

9. J. Cockx, User's Manual for SIMMY, KULeuven, project MR03KUL report 5-B1-1,Oct. 1984.

ADDITIONAL READINGCatthoor, F., et al., "General Datapath,Controller and Inter-CommunicationArchitectures for the Creation of a Dedi-cated Multi-Processor Environment," Proc.ISCAS-86, Vol. 2, May 1986, pp.730-732.Catthoor, F., et al., "Investigation of FiniteWord-Length Effects on Arbitrary DigitalFilters Using Simulated Annealing," Proc.ISCAS-86, Vol. 3, May 1986, pp.1296-1297.De Man, H., et al., "CATHEDRAL-Il: A Syn-thesis and Module Generation System forMultiprocessor Systems on a Chip," NATOsummer course on logic synthesis and sili-con compilation for VLSI design, July 1986(to be published).Fettweiss, A., "Wave Digital Filters: Theoryand Practice," Proc. IEEE, Vol. 74, No. 2,Feb. 1986, pp.270-327.Goossens, G., et al., "A Computer-AidedDesign Methodology for Mapping DSP-Algorithms Onto Custom MultiprocessorArchitectures," Proc. ISCAS-86, Vol. 3, May1986, pp.924-926.Jain, R., et al., "Custom Design of a VLSIPLM-FDM Transmultiplexer from SystemSpecifications to Layout Using a CAD Sys-tem," IEEE J Solid-State Circuits, Vol. SC-21,No. 1, Feb. 1986, pp.73-85.Van Ginderdeuren, J., et al., "A DigitalAudio Filter Using Semi-Automated De-sign," Digest of technical papers of ISSCCconference, Feb. 1986, pp.88-89.

Hugo De Man is vicepresident of the VLSISystems Design groupat IMEC. His researchinterests are IC design

A _ and CAD. Previously,he was an ESRO-NASAresearch fellow work-ing on computer-aideddevice and circuit de-

sign, a research associate of the BelgianNational Science Foundation, a professorat the University of Leuven, and a visitingprofessor at the University of California atBerkeley.De Man holds a PhD in applied sciences

from the Katholieke Universiteit Leuven.He has been an associate editor for IEEEJournal of Solid-State Circuits and IEEETransactions on CAD. He is a fellow of theIEEE.

Jan Rabaey heads theArchitectural and Algo-S ~~~rithmic Strategiesgroup of IMEC's DesignMethodologies for VLSISystems Division. Hisresearch interests are

computer-aided analy-sis and automated de-sign of digital signal

processing circuits and architectural syn-thesis. Previously, he was a visiting researchengineer at the University of California atBerkeley, where he developed an auto-mated synthesis system for multiprocessorDSP architecture. Rabaey holds a PhD inapplied sciences from Katholieke Univer-siteit Leuven.

P. Six heads the SiliconImplementation Tech-niques group at IMEC,where his interests aresymbolic layout andcompaction, modulegenerators for datapaths and controllers,and graphical user in-terfaces. Previously, he

headed the CAD group at Bell TelephoneManufacturing in Antwerp, where he was

responsible for the development and inte-gration of CAD tools for design and verifi-cation of ICs. Six received a PhD in appliedsciences from Katholieke UniversiteitLeuven.

w Luc J.M. Claesen headsresearch in the DesignManagement and Veri-fication group withinIMEC's Design Meth-odologies for VLSIDivision. His interestsare CAD and integrateddigital signal processing

2 circuits. Previously, hewas a research assistant at the ESAT Labora-tory at Katholieke Universiteit Leuven,where he worked in CAD of integratedsystems for digital and analog signal pro-cessing. Claesen holds a PhD in appliedscience from Katholieke Universiteit.

Questions about this article can bedirected to H. De Man, IMEC, Kapeldreef75, B-3030, Belgium.

December 1986 25

cathedral-ii: silicon compiler - cs

Documents