an agile approach to building risc-v microprocessors

.................................................................................................................................................................................................................

AN AGILE APPROACH TO BUILDINGRISC-V MICROPROCESSORS

.................................................................................................................................................................................................................

THE AUTHORS ADOPTED AN AGILE HARDWARE DEVELOPMENT METHODOLOGY FOR 11

RISC-V MICROPROCESSOR TAPE-OUTS ON 28-NM AND 45-NM CMOS PROCESSES. THIS

ENABLED SMALL TEAMS TO QUICKLY BUILD ENERGY-EFFICIENT, COST-EFFECTIVE, AND

COMPETITIVE HIGH-PERFORMANCE MICROPROCESSORS. THE AUTHORS PRESENT A CASE

STUDY OF ONE PROTOTYPE FEATURING A RISC-V VECTOR MICROPROCESSOR INTEGRATED

WITH SWITCHED-CAPACITOR DC–DC CONVERTERS AND AN ADAPTIVE CLOCK GENERATOR

IN A 28-NM, FULLY DEPLETED SILICON-ON-INSULATOR PROCESS.

......The era of rapid transistor per-formance improvement is coming to an end,increasing pressure on architects and circuitdesigners to improve energy efficiency. Mod-ern systems on chip (SoCs) incorporate a largeand growing number of specialized hardwareunits to execute specific tasks efficiently, oftenorganized into multiple dynamic voltage andclock frequency domains to further saveenergy under varying computational loads.This growing complexity has led to a designproductivity crisis, stimulating developmentof new tools and methodologies to enable thecompletion of complex chip designs on sched-ule and within budget.

We propose leveraging lessons learnedfrom the software world by applying aspectsof the software agile development model tohardware. Traditionally, software was devel-oped via the waterfall development model, asequential process that consists of distinctphases that rigidly follow one another, just asis done for hardware design today. Over-budget, late, and abandoned software projectswere commonplace, motivating a revolution

in software development, demarcated by thepublication of The Agile Manifesto in 2001.1

The philosophy of agile software developmentemphasizes individuals and interactions overprocesses and tools, working software overcomprehensive documentation, customer col-laboration over contract negotiation, andresponding to change over following a plan.In practice, the agile approach leads to smallteams iteratively refining a set of working-but-incomplete prototypes until the end result isacceptable.

Inspired by the positive disruption of TheAgile Manifesto on software development, wepropose a set of principles to guide a newagile hardware development methodology.Our proposal is informed by our experiencesas a small group of researchers designing andfabricating 11 processor chips in five years.Lacking the massive resources of industrialdesign teams, we were forced to abandonstandard industry practices and explore differ-ent approaches to hardware design. We detailthis approach’s benefits with a case study ofone of these chips, Raven-3, which achieved

Yunsup Lee

Andrew Waterman

Henry Cook

Brian Zimmer

Ben Keller

Alberto Puggelli

Jaehwa Kwak

Ru�zica Jevtic

Stevo Bailey

Milovan Blagojevic

Pi-Feng Chiu

Rimas Avi�zienis

Brian Richards

Jonathan Bachrach

David Patterson

Elad Alon

Borivoje Nikolic

Krste Asanovic

.......................................................

8 Published by the IEEE Computer Society 0272-1732/16/$33.00�c 2016 IEEE

groundbreaking energy efficiency by integrat-ing a novel RISC-V vector architecture withefficient on-chip DC–DC converters. Takentogether, our experiences developing these aca-demic prototype chips make a powerful argu-ment for the promise of agile hardware designin the post-Moore’s law era.

An Agile Hardware ManifestoBorrowing heavily from the agile softwaredevelopment manifesto, we propose the fol-lowing principles for hardware design:

� Incomplete, fabricatable prototypesover fully featured models.

� Collaborative, flexible teams over rigidsilos.

� Improvement of tools and generatorsover improvement of the instance.

� Response to change over following aplan.

Here, we explain the importance of theseprinciples for building hardware and contrastthem with the original principles of agile soft-ware development.

Incomplete, Fabricatable Prototypes over FullyFeatured ModelsWe believe the primary benefit of adoptingthe agile model for hardware design is that itlets us drastically reduce the cost of verifica-tion and validation by iterating throughmany working prototypes of our designs. Inthis article, we use the standard definition ofverification as testing that determines whethereach component and the assembly of compo-nents correctly meets its specification (“Arewe building the thing right?”). In hardwaredesign, the term validation is usually nar-rowly applied to postsilicon testing to ensurethat manufactured parts operate withindesign parameters. We use the broader and,to our minds, more useful definition of vali-dation from the software world, in which val-idation ensures that the product serves itsintended purposes (“Are we building theright thing?”).

Figure 1 contrasts the agile hardware de-velopment model with the waterfall model.The waterfall model, shown in Figure 1a,relies on Gantt charts, high-level architecturemodels, rigid processes such as register-

transfer level (RTL) freezes, and CPU centu-ries of simulations in an attempt to achieve asingle viable design point. In our agile hard-ware methodology, we first push a trivial pro-totype with a minimal working feature set allthe way through the toolflow to a point whereit could be taped out for fabrication. We referto this tape-out-ready design as a tape-in.Then we begin adding features iteratively,moving from one fabricatable tape-in to thenext as we add features. The agile hardwaremethodology will always have an availabletape-out candidate with some subset of fea-tures for any tape-out deadline, whereas theconventional waterfall approach is in dangerof grossly overshooting any tape-out deadlinebecause of issues at any stage of the process.

Although the agile methodology providessome benefits in terms of verification effort, itparticularly shines at validation; agile designsare more likely to meet energy and perform-ance expectations using the iterative process.Unlike conventional approaches that usehigh-level architectural models of the wholedesign for validation with the intent to laterrefine these models into an implementation,we emphasize validation using a full RTLimplementation of each incomplete prototypewith the intent to add features if there is time.Unlike high-level architectural models, theactual RTL is guaranteed to be cycle accurate.Furthermore, the actual RTL design can bepushed through the entire very large-scaleintegration (VLSI) toolflow to give early feed-back on back-end physical design issues thatwill affect the quality of results (QoR), includ-ing cycle time, area, and power. For example,Figure 1b shows a feature F1 being reworkedinto feature F10 after validation showed that itwould not meet performance or QoR goals,whereas feature F3 was dropped after valida-tion showed that it exceeded physical designbudgets or did not add sufficient value to theend system. These decisions on F1 and F3 canbe informed by the results of the attemptedphysical design before completing the finalfeature F4.

Figure 1c illustrates our iterative approachto validation of a single prototype. Eachcircle’s circumference represents the relativetime and effort it takes to begin validating adesign using a particular methodology.Although the latency and cost of these

.............................................................

MARCH/APRIL 2016 9

prototype-modeling efforts increase towardthe outer circles, our confidence in the valida-tion results produced increases in tandem.Using fabricatable prototypes increases valida-tion bandwidth, because a complete RTLdesign can be mapped to field-programmablegate arrays (FPGAs) to run end-applicationsoftware stacks orders of magnitude fasterthan with software simulators. In agile hard-ware development, the FPGA models of itera-tive prototype RTL designs (together withaccompanying QoR numbers from the VLSItoolflow) fulfill the same function as workingprototypes in agile software development, pro-viding a way for customers to give early andfrequent feedback to validate design choices.This enables customer collaboration as envi-sioned in the original agile software manifesto.

Collaborative, Flexible Teams over Rigid SilosTraditional hardware design teams usuallyare organized around the waterfall model,with engineers assigned to particular func-tions, such as architecture specification,microarchitecture design, RTL design, veri-fication, physical design, and validation.This specialization of skills is more extremethan for pure software development, asreflected in the specificity of designers’ jobtitles (architect, hardware engineer, or veri-fication engineer). Sometimes, entire func-tions, such as back-end physical design, areeven outsourced to different companies. Asthe design progresses through these func-tional stages, extensive effort is requiredto communicate design intent for eachblock, and consequently to understand the

Design

Implementation

Tape-in

Desig

n

Sp

ecifi

cation

Specification

+ design

+ implementation

C++

FPGA

ASIC flow

Tape-in

Small tape-out

Big tape-out

Specification

Imp

lem

enta

tion

Tap

e-o

ut

F1 F2 F3 F4F1

F2

F3

F4

Valid

ation

Tape-out

Validation

F1 F1’F2 F3 F4

F1 F1’F2 F3 F4

F1 F1’F2 F3 F4

F1 F1’F2 F3 F4

F1/F2 F1’/F2/F3 F1’/F2/F4

Verification

Verification

F1/F2 F1’/F2/F3

F1’/F2/F4

F1’/F2/F4

(a) (b)

(c)

Figure 1. Contrasting the agile and waterfall models of hardware design. The labels FN represent various desired features of

the design. (a) The waterfall model steps all features through each activity sequentially, producing a tape-out candidate only

when all features are complete. (b) The agile model adds features incrementally, resulting in incremental tape-out candidates

as individual features are completed, reworked, or abandoned. (c) As validation progresses, designs are subjected to lengthier

and more accurate evaluation methodologies.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

10 IEEE MICRO

documentation received from the precedingstage. Misunderstandings can lead to subtleerrors that require extensive verification toprevent, and QoR suffers because no engi-neer views the whole flow. End-to-enddesign iterations are too expensive to con-template. Considerable effort is requiredto push high-level changes down throughthe implementation hierarchy, particularlyacross company boundaries. This leads toinnovation-stunting practices such as theRTL freeze, in which only show-stoppingbugs are allowed to unfreeze a design. Inaddition, balancing the workload across thedifferent stages is difficult because engineersfocused in one functional area often lackthe training and experience to be effectivein other roles.

In the agile hardware model, engineerswork in collaborative, flexible teams that candrive a feature through all implementationstages, avoiding most of the documentationoverhead required to map a feature to thehardware. Each team can iterate on the high-level architectural design with full visibilityinto low-level physical implementation andcan validate system-level software impacts onan intermediate prototype. Figure 1b showshow multiple teams can work on differentfeatures in a pipelined fashion, each iteratingthrough multiple levels of abstraction (asshown in Figure 1c). A given tape-in willaccumulate some small number of featureupdates and integrate these for whole-systemvalidation. Engineers naturally develop acomplete vertical skill set, and the focus ofproject management is on prioritizing fea-tures for implementation, not communica-tion between silos.

Improvement of Tools and Generators overImprovement of the InstanceModern software systems could not be builteconomically without the extreme reuseenabled by extensive software libraries writ-ten in modern programming languages, orwithout functioning compilers and an auto-mated build process. In contrast, most com-mercial hardware design flows rely on ratherprimitive languages to describe hardware,and frequent manual intervention on tooloutputs to work around tool bugs and longtool runtimes. The poor languages and buggy

tools hinder development of truly reusablehardware components and lead to a focus onbuilding a single instance instead of design-ing components that can be easily reused infuture designs.

An agile hardware methodology relies onbetter hardware description languages (HDLs)to enable reuse. Instead of constructing anoptimized instance, engineers develop highlyparameterized generators that can be used toautomatically explore a design space or tosupport future use in a different design.Better reuse is also the primary way in whichan agile hardware methodology can reduceverification costs, because the verificationinfrastructure of these generators is also port-able across designs.

An agile hardware methodology also strivesto eliminate manual intervention in the chip’sbuild process. Replacing or fixing the manycommercial CAD tools required to completea chip design is not practical, but most ofa chip’s flow can be effectively automatedwith sufficient effort. Perhaps the mostlabor-intensive part of a modern chip flow isdealing with physical design issues for a newtechnology node. However, most new chipdesigns are in older technologies, becauseonly a few products can justify the cost ofusing an advanced node. Also, as technologyscaling continues to slow, back-end flowsand CAD tools will have time to matureand should become more automatable, evenfor the most advanced nodes. The end ofscaling will also coincide with a greaterneed for a variety of specialized designs,where the agile model should show greaterbenefit.

Our emphasis on tools and generators isnot at odds with the agile software develop-ment manifesto, which emphasizes collabora-tion over tools. Our proposed methodologyalso values collaboration more than specifictools, but we feel that modern hardwaredesign is far less automated and exhibits farless reuse than modern software development(or even software development in 2001).Accordingly, there is a need to emphasize theimportance of reuse and automation enabledby new tools in the hardware design sphere,because few modern hardware design teamsemphasize these universally accepted softwareprinciples.

.............................................................

MARCH/APRIL 2016 11

Response to Change over Following a PlanThis principle mirrors the original manifesto.Unlike software, a chip design will ultimatelybe frozen into a mask set and mass-manufac-tured, with no scope for frequent updates.However, even over the timeframe of a singlechip’s development, requirements will changeowing to market shifts, customer featureadditions, or standards-body deliberations.In agile design, change can result from valida-tion testing (as opposed to the waterfallmodel, in which designs will only change inresponse to verification problems).

An agile hardware development processembraces continual change in chip specifica-tions as a means to enhance competitiveness.Every incremental fabricatable prototype, notonly the most recent, becomes a startingpoint for the changed set of design require-ments, with any change treated as a reprioriti-zation of features to implement, drop, orrework. By first implementing features thatare unlikely to change, designers minimizethe effort lost to late changes.

Implementing the Agile HardwareMethodologyHere, we describe the steps we took towardimplementing an agile hardware design proc-ess. All of our processors, both general pur-pose and specialized, were built on the freeand open RISC-V instruction architecture(ISA). We developed our RTL source codeusing Chisel (Constructing Hardware in aScala-Embedded Language), an open-sourcehardware construction language, and ex-pressed the code as libraries of highly parame-terized hardware generators. We developed ahighly automated back-end design flow toreduce manual effort in pushing designsthrough to layout. Finally, we present theresulting chips from our agile hardware designprocess.

RISC-VRISC-V is a free, open, and extensible ISAthat forms the basis of all of our program-mable processor designs.2 Unlike many earlierefforts that designed open processor cores,RISC-V is an ISA specification, intended toallow many different hardware implementa-tions to leverage common software develop-

ment. RISC-V is a modular architecture, withvariants covering 32-, 64-, and 128-bit addressspaces. The base integer instruction set is lean,requiring fewer than 50 user-level hardwareinstructions to support a full modern softwarestack, which enables microprocessor designersto quickly bring up fully functional prototypesand add additional features incrementally. Inaddition to simplifying the implementation ofnew microarchitectures, the RISC-V designprovides an ideal base for custom accelerators.Not only can accelerators reuse common con-trol-processor implementations, they can sharea single software stack, including the compilertoolchain and operating system binaries. Thisdramatically reduces the cost of designing andbringing up custom accelerators.

Further information on RISC-V is avail-able from the nonprofit RISC-V Foundation,which was incorporated in August 2015 tohelp coordinate, standardize, protect, and pro-mote the RISC-Vecosystem.3 A free and openISA is a critical ingredient in an agile hardwaremethodology, because having an active anddiverse community of open-source contribu-tors amortizes the overhead of maintainingand updating the software ecosystem, whichmakes it possible for smaller teams to focus ondeveloping custom hardware.4 A free ISA isalso a prerequisite for shared open-source pro-cessor implementations, which can act as abase for further customization.5

ChiselTo facilitate agile hardware development byimproving designer productivity, we devel-oped Chisel6 as a domain-specific extension tothe Scala programming language. Chisel isnot a high-level synthesis tool in which hard-ware is inferred from Scala code. Rather,Chisel is intended to be a substrate that pro-vides a Scala abstraction of primitive hardwarecomponents, such as registers, muxes, andwires. Any Scala program whose executiongenerates a graph of such hardware compo-nents provides a blueprint to fabricate hard-ware designs; the Chisel compiler translates agraph of hardware components into a fast,cycle-accurate, bit-accurate Cþþ softwaresimulator, or low-level synthesizable Verilogthat maps to standard FPGA or ASIC flows.

Because Chisel is embedded in Scala, hard-ware developers can tap into Scala’s modern

..............................................................................................................................................................................................

HOT CHIPS

............................................................

12 IEEE MICRO

programming language features—such asobject-oriented programming, functional pro-gramming, parameterized types, abstract datatypes, operator overloading, and type infer-ence—to improve designer productivity byraising the abstraction level and increasingcode reuse. Furthermore, metaprogramming,code generation, and hardware design tasks areall expressed in the same source language,which encourages developers to write parame-

terized hardware generators rather than dis-crete instances of individual blocks. This inturn improves code reuse within a given designand across generations of design iterations.

To see how these language features inter-act in real designs, we describe a crossbar thatconnects two AHB-Lite master ports to threeAHB-Lite slave ports in Figure 2a. In the fig-ure, the high-level block diagram is on theleft, the two building blocks (AHB-Lite buses

Arbiter

ins(0,1)

out

AHB-Litecrossbar

masters(0,1)

slaves(0,1,2)

master

slaves(0,1,2)

AHB-Lite bus

masters(0) masters(1)

Arbiter Arbiter Arbiter

slaves(0) slaves(1) slaves(2)

AHB-Liteslave mux

AHB-Lite bus AHB-Lite bus

slave mux slave mux slave mux

AHB-Lite crossbar

A class AHBXbar(nMasters: Int, nSlaves: Int) extends ModuleB {C val io = new Bundle {D val masters = Vec(new AHBMasterIO, nMasters).flipE val slaves = Vec(new AHBSlaveIO, nSlaves).flipF }GH val buses = List.fill(nMasters){Module(new AHBBus(nSlaves))}I val muxes = List.fill(nSlaves){Module(new AHBSlaveMux(nMasters))}JK (buses.map(b => b.io.master) zip io.masters) foreach {

case (b, m) => b <> m }L (muxes.map(m => m.io.out) zip io.slaves) foreach {

case (x, s) => x <> s }M for (m <- 0 until nMasters; s <- 0 until nSlaves) yield {N buses(m).io.slaves(s) <> muxes(s).io.ins(m)O }P }

(a)

(b)

Figure 2. Using functional programming to write a parameterized AHB-Lite crossbar switch in Chisel. (a) Block diagrams. (b)

Chisel source code. The code is short and easy to verify by inspection, but can support arbitrary numbers of masters and slave

ports.

.............................................................

MARCH/APRIL 2016 13

and AHB-Lite slave muxes) are in the mid-dle, and the final design is shown on theright. To help understand the Chisel code,Figure 3 shows some generic functional pro-gramming idioms. Idiom A maps a list ofnumbers to an expression that adds 1 tothem. Idiom B zips two lists together. IdiomC iterates over a list of tuples, uses Scala’s casestatement to provide names of elements, andthen operates on those names. Idiom D iter-ates over a list when the output values are notcollected. Idiom E is a for-comprehension witha yield statement.

Figure 2b shows the Chisel code thatbuilds the crossbar. Line A declares theAHBXbar module with two parameters:number of master ports (nMasters) andslave ports (nSlaves). Lines C through Fdeclare the module’s I/O ports. Lines H and Iinstantiate collections of AHB-Lite buses andAHB-Lite slave muxes. Lines K through Nconnect the building blocks together. Notethat the Chisel operator <> connects mod-ule I/O ports together. Line K extracts themaster I/O port of each bus, zips it to the cor-responding master I/O port of the crossbaritself, and then connects them together. LineL does the same thing for each slave mux out-put and each slave I/O port of the crossbar.Lines M and N use for-comprehension togenerate the cross-product of all port combi-nations, and use each pair to connect a spe-cific slave I/O port of a specific bus to thematching input of the corresponding mux.

This code’s brevity ensures that it is easilyverified by inspection. Although the exampleshows two buses and three slave muxes, these10 lines of code would be identical for anynumber of buses and slave muxes. In contrast,implementations of this crossbar module writ-ten in traditional HDLs, such as Verilog or

VHDL, would be difficult to generalize into aparameterizable crossbar generator. This is justa small example of Chisel’s power to bringmodern programming language constructs tohardware development, increasing code main-tainability, facilitating reuse, and enhancingdesigner productivity.

Rocket Chip GeneratorChisel’s first-class support for object orienta-tion and metaprogramming lets hardwaredesigners write generators, rather than indi-vidual instances of designs, which in turnencourages the development of families ofcustomizable designs. By encoding micro-architectural and domain knowledge in thesegenerators, we can quickly create differentchip instances customized for particulardesign goals and constraints.7 Constructionof hardware generators requires support forreconfiguring individual components basedon the context in which they are deployed;thus, a particular focus with Chisel is toimprove on the limited module parameter-ization facilities of traditional HDLs.

The Rocket chip generator is written inChisel and constructs a RISC-V-based plat-form. The generator consists of a collectionof parameterized chip-building libraries thatwe can use to generate different SoC variants.By standardizing the interfaces used to con-nect different libraries’ generators to oneanother, we have created a plug-and-playenvironment in which it is trivial to swap outsubstantial components of the design simplyby changing configuration files, leaving thehardware source code untouched. We canalso test the output of individual generatorsand perform integration tests on the wholedesign, where the tests are also parameterized,so as to exercise the entire design under test.

idiom result

A (1,2,3) map { n => n + 1 } (2,3,4)

B (1,2,3) zip (a,b,c) ((1,a),(2,b),(3,c))

C ((1,a),(2,b),(3,c)) map { case (left,right) => left } (1,2,3)

D (1,2,3) foreach { n => print(n) } 123

E for (x <- 1 to 3; y <- 1 to 3) yield (x,y) (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)

Figure 3. Generic functional programming idioms to help understand the crossbar example.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

14 IEEE MICRO

Figure 4 presents the collection of librarygenerators and their interfaces within theRocket chip generator. These generators canbe parameterized and composed into manyvarious SoC designs. Their current capabil-ities include the following:

� Core: the Rocket scalar core generatorwith an optional floating-point unit,configurable pipelining of functionalunits, and customizable branch-prediction structures.

� Caches: a family of cache and transla-tion look-aside buffer generators withconfigurable sizes, associativities, andreplacement policies.

� Rocket Custom Coprocessor: the RoCCinterface, a template for application-specific coprocessors that can exposetheir own parameters.

� Tile: a tile-generator template forcache-coherent tiles. The numberand type of cores and accelerators areconfigurable, as is the organization ofprivate caches.

� TileLink: a generator for networks ofcache-coherent agents and the associ-ated cache controllers. Configurationoptions include the number of tiles,coherence policy, presence of sharedbacking storage, and implementationof underlying physical networks.

� Peripherals: generators for AXI4-/AHB-Lite-/APB-compatible buses andvarious converters and controllers.

Physical Design FlowWe frequently translate Chisel’s Verilog outputinto a physical design through an ASIC CADtoolflow.Wehaveautomatedthisflowandopti-mized it to generate industry-standard GDS-format mask files without designer intervention.Automatic nightly regressions provide thedesigner with regular feedback on cycle time,area, and energy consumption, and they alsodetect physical design problems early in thedesign cycle. Automatic verification and report-ing on the quality of results allow more points inthe design space to be evaluated, and reduce thebarriertoimplementingmoresignificantchangesatlaterstagesinthedevelopmentprocess.

One major hindrance to agile methodol-ogy for hardware design is CAD tool run-

time. The deployment process for hardwareis significantly more time-consuming thanfor software. Using an initial hardwaredescription to generate a GDS that is electri-cally and logically correct can require manyiterations. Because each iteration can takemany hours to run to completion, the processcould require weeks or months for largerdesigns. To avoid the productivity loss ofthese long-latency iterations, we initiallyfocus on smaller designs, from which GDScan be generated more rapidly. Iterating onthese smaller-scale tape-ins has proven valua-ble, because major problems are uncoveredmuch earlier in the design process, and evenmore so when we target a new process tech-nology. Additionally, the parameterizablenature of hardware generators reduces the

RocketTile

Rocket L1I$RoCC

accel.

L1D$FPU

RocketTile

Rocket L1I$RoCC

accel.

L1D$FPU

L1 network

L2 network

Rocket

Cache

RoCC

Tile

TileLink

Periph.

A

B

C

D

E

F

TileLink/AXI4

bridge

L2$ bank

AXI4 crossbar

DRAM

controller

High-

speed

I/O device

AHB and APB

peripherals

Figure 4. The Rocket chip generator comprises the following

subcomponents: (a) a core generator, (b) cache generator, (c) Rocket

Custom Coprocessor-compatible coprocessor generator, (d) tile generator,

(e) TileLink generator, and (f) peripherals. These generators can be

parameterized and composed into many various system-on-chip designs.

.............................................................

MARCH/APRIL 2016 15

effort to scale up to larger designs later in thedesign process. Regardless of design size, weensure that each tape-in passes static timinganalysis, is functionally equivalent to theRTL model, and has only a small number ofdesign-rule violations. Together, these prop-erties ensure that the design is indeed tape-out ready.

As a particular tape-out deadline ap-proaches, we evaluate whether to tape out themost recently taped-in design, usually on thebasis of whether the set of new features vali-dated by the tape-in is sufficiently large. If wedecide to tape out, we clean up the remainingdesign-rule violations and run final checks onthe resulting GDS. The full cycle of tapingout the GDS and testing the resulting chip ismore involved than a tape-in, but qualita-tively it is just another, longer, iteration in theagile methodology. Thus, we focus on pro-ducing lineages of increasingly complicatedfunctioning chips, incorporating feedbackfrom the previous chip into the next in theseries. We believe that this rapid iterativeprocess is essential to improve the productiv-ity of hardware designers and enable iterativedesign-space exploration of alternative systemmicroarchitectures.

Silicon ResultsFigure 5 presents a timeline of UC BerkeleyRISC-V microprocessor tape-outs that wecreated using earlier versions of the Rocketchip generator. The 28-nm Raven chips com-

bine a 64-bit RISC-V vector microprocessorwith on-chip switched-capacitor DC–DCconverters and adaptive clocking. The 45-nmchips integrate a 64-bit dual-core RISC-Vvector processor with monolithically inte-grated silicon photonic links.8,9 In total, wetaped out four Raven chips on STMicroelec-tronics’ 28-nm fully depleted silicon-on-insulator (FD-SOI) process, six photonicschips on IBM’s 45-nm SOI process, and oneSwerve chip on TSMC’s 28-nm process. TheRocket chip generator forms the shared codebase for these 11 distinct systems; we incor-porated the best ideas from each design backinto the code base, ensuring maximal reuseeven though the three distinct families ofchips were specialized differently to test outdistinct research ideas.

Case Study: RISC-V Vector Microprocessorwith On-chip DC–DC ConvertersWe applied our agile design methodology toRaven-3, a 28-nm microprocessor thatintegrates switched-capacitor DC–DC con-verters, adaptive clocking circuitry, and low-voltage static RAM macros with a RISC-Vvector processor instantiated by the Rocketchip generator.10 The successful developmentof the Raven-3 prototype demonstrates howarchitecture and circuit research was sup-ported by the combination of the freelyextensible RISC-V ISA, the parameterizableChisel-based hardware implementation, and

Raven-1Raven-2

Raven-3

EOS14 EOS16 EOS18 EOS20

2011 2013 2014 20152012

May Apr. Aug. Feb. July Mar. Nov.

EOS22

Raven-4

Mar. Apr.Sept.

EOS24

Swerve

Figure 5. Recent UC Berkeley tape-outs using RISC-V processors for three lines of research. The Raven series investigated

low-voltage reliable operation with on-chip DC–DC converters, the EOS series developed monolithically integrated silicon

photonic links, and Swerve experimented with dynamic redundancy techniques to handle low-voltage faults.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

16 IEEE MICRO

fast feedback via iteration and automation aspart of the agile methodology.

Figure 6 shows the Raven-3 system-levelblock diagram. The Raven-3 microprocessorinstantiates the left RocketTile in Figure 4with the Hwacha vector unit implemented asa RoCC accelerator. We did not include anL2 cache in this design iteration, leaving thisfeature for implementation in future tape-insand tape-outs. The chip consists of two volt-age domains: the varying core domain andthe fixed uncore domain. The core domainconnects to the output of fully integratednoninterleaved switched-capacitor DC–DCconverters11 and powers the processor and itsL1 caches. The fixed 1-V domain powers theuncore. The DC–DC converters generatefour voltage modes between 0.5 and 1 Vfrom 1 and 1.8 V supplies; they achieve bet-ter than 80 percent conversion efficiency,

while requiring no off-chip passives, whichgreatly simplifies package design.10 The clockgenerator selects clock edges from a tunablereplica circuit powered by the rippling corevoltage that mimics the processor’s criticalpath delay, and it adapts the clock frequencyto the instantaneous voltage.12

The design of Raven-3 benefitted greatlyfrom the agile hardware design methodology.By aspiring to build an incomplete prototype(rather than, for instance, a many-core designimplementing these technologies), we dra-matically improved the odds of project suc-cess. The unique integration of the powerdelivery, clocking, and processor systems wasonly possible because the design teambroadly shared expertise about all aspects ofthe project. The processor RTL could largelybe reused from previous tape-outs, with onlyincremental improvements and configuration

1.0 V

1.8 V

24 switched-capacitor

DC–DC unit cells

DC–DC controller

Adaptive clock

generator

Vout

Vref

...

togg

le

+

–FSM

clk

Asynchronous FIFO

between domains

2-Ghzref

CORE (1.19 mm2)

Uncore

To/from off-chip FPGA FSB and DRAM

Wire-bonded chip-on-board

To off-chip oscilloscope

1.0 V

RISC-V scalar core

Vector accelerator

Vector issue unit

Vector memory unit

...

...

int int int int int

(16-Kbye vector RF uses eightcustom 8T SRAM macros)

Crossbar

ScalarRF FPU

int

Arbiter

Sharedfunctional units

64-bit integer multiplierSingle-precision FMADouble-precision FMA

(Custom 8TSRAM macros)

16-Kbyte scalarI-cache


32-Kbyte sharedD-cache


8-Kbyte vectorI-cache

Figure 6. Raven-3 system-level block diagram. A RISC-V processor core with a vector coprocessor runs with varying voltage

and frequency generated by on-chip switched-capacitor DC–DC converters and an adaptive clock generator.

.............................................................

MARCH/APRIL 2016 17

changes required to validate it for the particu-lar feature set desired. As physical design onpartial feature sets gave incremental results,we were able to adjust design features toimprove feasibility and QoR.

The agile hardware design methodologywas especially important in this design due tothe complexity and risk introduced by thecustom voltage conversion and clocking, asillustrated by the following example. Tape-ins of a simple digital system with a singleDC–DC unit cell were small enough to sim-ulate at the transistor level, and these simula-tions found a critical bug in the DC–DCcontroller. These transistor-level simulationsshowed that a large current event could causethe output voltage to remain below thelower-bound reference voltage even after theconverter switched. A bug in the controllerwould cause the converter to stop switchingand drop the processor voltage to zero forthis case. We believe this bug, which wasfixed early in the design process by the addi-tion of a simple state machine in the lower-bound controller, would have been foundvery late (if at all) with waterfall designapproaches.

Figure 7 shows the floorplan and micro-graph of the resulting chip, the wire-bondedchip on the package board, and the test setup.The resulting 2:37-mm2 Raven chip is fullyfunctional and boots Linux and executescompiled scalar and vector application codewith the adaptive clock generator at all oper-ating modes, while the DC–DC convertercan transition between voltage modes in less

than 20 ns. Raven-3 runs at a maximumclock frequency of 961 MHz at 1 V, consum-ing 173 mW, and at a minimum voltage of0.45 V at 93 MHz, consuming 8 mW.Raven-3 achieves a maximum energy effi-ciency of 27 and 34 Gflops per watt whilerunning a double-precision matrix-multipli-cation kernel on the vector accelerator withand without the switched-capacitor DC–DCconverters, respectively. These results validatethe agile approach to hardware design.

O ur agile hardware development meth-odology played an important role in

successfully taping out 11 RISC-V micro-processors in five years at UC Berkeley. Witha combination of our agile methodology,Chisel, RISC-V, and the Rocket chip genera-tor, a small team of graduate students at UCBerkeley was able to design multiple micro-processors that are competitive with indus-trial designs. Although there is much workleft to do to show that this approach can scaleup to the largest SoCs, we hope these projectsserve as an opportunity to shed some light oninefficiencies in the current design process, torethink how modern SoCs are built, and torevitalize hardware design. MICRO

AcknowledgmentsWe thank Tom Burd, James Dunn, OlivierThomas, and Andrei Vladimirescu for theircontributions and STMicroelectronics forthe fabrication donation. This work wasfunded in part by BWRC, ASPIRE, DARPAPERFECT Award number HR0011-12-2-

Adaptive clockUnit cell

SC-DCDC Controller

D$

BIST

I$

VI$

VectorRF

SC DC-DCScalar core +

vector accelerator

(a) (b) (c) (d)

Adaptive clock Unit cell

SC-DCDC Controller

D$

BIST

I$

VI$

VectorRF

SC DC-DCScalar core +

vector accelerator

Figure 7. Raven-3 implementation: the (a) floorplan, (b) micrograph of the resulting chip, (c) wire-bonded chip on the package

board, and (d) test setup.

..............................................................................................................................................................................................

HOT CHIPS

............................................................

18 IEEE MICRO

0016, STARnet C-FAR, Intel ARO, AMD,SRC/TxACE, Marie Curie FP7, NSFGRFP, and an Nvidia graduate fellowship.

....................................................................References1. K. Beck et al., The Agile Manifesto, tech.

report, Agile Alliance, 2001.

2. A. Waterman et al., The RISC-V Instruction

Set Manual, Volume I: User-Level ISA, Version

2.0, tech. report UCB/EECS-2014-54, EECS

Dept., Univ. California, Berkeley, May 2014.

3. R. Merritt, “Google, HP, Oracle Join RISC-

V,” EE Times, 28 Dec. 2015.

4. V. Patil et al., “Out of Order Floating Point

Coprocessor for RISC-V ISA,” Proc. 19th

Int’l Symp. VLSI Design and Test, 2015;

doi:10.1109/ISVDAT.2015.7208116.

5. K. Asanovic and D. Patterson, “The Case for

Open Instruction Sets,” Microprocessor

Report, Aug. 2014.

6. J. Bachrach et al., “Chisel: Constructing

Hardware in a Scala Embedded Language,”

Proc. Design Automation Conf., 2012, pp.

1216–1225.

7. O. Shacham et al., “Rethinking Digital Design:

Why Design Must Change,” IEEE Micro, vol.

30, no. 6, 2010, pp. 9–24.

8. Y. Lee et al., “A 45nm 1.3GHz 16.7 Double-

Precision GFLOPS/W RISC-V Processor

with Vector Accelerators,” Proc. 40th Euro-

pean Solid-State Circuits Conf., 2014, pp.

199–202.

9. C. Sun et al., “Single-Chip Microprocessor

that Communicates Directly Using Light,”

Nature, vol. 528, 2015, pp. 534–538.

10. B. Zimmer et al., “A RISC-V Vector Pro-

cessor with Tightly-Integrated Switched-

Capacitor DC–DC Converters in 28nm

FDSOI,” Proc. IEEE Symp. VLSI Circuits,

2015, pp. C316–C317.

11. H.-P. Le, S. Sanders, and E. Alon, “Design

Techniques for Fully Integrated Switched-

Capacitor DC–DC Converters,” IEEE J.

Solid-State Circuits, vol. 46, no. 9, 2011, pp.

2120–2131.

12. J. Kwak and B. Nikolic, “A 550-2260MHz Self-

Adjustable Clock Generator in 28nm FDSOI,”

Proc. IEEE Asian Solid-State Circuits Conf.,

2015; doi:10.1109/ASSCC.2015.7387471.

Yunsup Lee is a PhD candidate in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley. He received an MS incomputer science from the University ofCalifornia, Berkeley. Contact him [email protected].

Andrew Waterman is a PhD candidate inthe Department of Electrical Engineeringand Computer Sciences at the University ofCalifornia, Berkeley. He received an MS incomputer science from the University ofCalifornia, Berkeley. Contact him [email protected].

Henry Cook is a PhD candidate in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley. He received an MS incomputer science from the University ofCalifornia, Berkeley. Contact him [email protected].

Brian Zimmer is a research scientist in theCircuits Research Group at Nvidia. Hereceived a PhD in electrical engineering andcomputer science from the University ofCalifornia, Berkeley, where he performedthe research for this article. Contact him [email protected].

Ben Keller is a PhD student in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley. He received an MS inelectrical engineering from the University ofCalifornia, Berkeley. Contact him [email protected].

Alberto Puggelli is the director of technol-ogy at Lion Semiconductor. He received aPhD in electrical engineering from the Uni-versity of California, Berkeley, where heperformed the research for this article.Contact him at [email protected].

Jaehwa Kwak is a PhD candidate in theDepartment of Electrical Engineering andComputer Science at the University ofCalifornia, Berkeley. He received an MS inelectrical engineering and computer sciences

.............................................................

MARCH/APRIL 2016 19

from Seoul National University. Contacthim at [email protected].

Ru�zica Jevtic is an assistant professor at theUniversidad de Antonio de Nebrija inMadrid. She received a PhD in electricalengineering from Technical University ofMadrid and was a postdoctoral fellow at theUniversity of California, Berkeley, whereshe performed the research for this article.Contact her at [email protected].

Stevo Bailey is a PhD student in theDepartment of Electrical Engineering andComputer Science at the University ofCalifornia, Berkeley. He received an MS fromthe University of California, Berkeley. Con-tact him at [email protected].

Milovan Blagojevic is a PhD student in aCIFRE program at the Berkeley WirelessResearch Center, STMicroelectronics, andInstitut Sup�erieur d’Electronique de Paris.He received an MSc in electrical engineeringfrom the University of Belgrade, Serbia.Contact him at [email protected].

Pi-Feng Chiu is a PhD student in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley. She received an MS inelectrical engineering from the NationalTsing Hua University. Contact her [email protected].

Rimas Avi�zienis is a PhD candidate in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley. He received an MS incomputer science from the University ofCalifornia, Berkeley. Contact him [email protected].

Brian Richards is a research staff engineerat the University of California, Berkeley,and a founding member of the BerkeleyWireless Research Center. He received anMS in electrical engineering and computer

science from the University of California,Berkeley. Contact him at [email protected].

Jonathan Bachrach is an adjunct assistantprofessor in the Department of ElectricalEngineering and Computer Sciences at theUniversity of California, Berkeley. Hereceived a PhD in computer science fromthe University of Massachusetts, Amherst.Contact him at [email protected].

David Patterson is the Pardee Professor ofComputer Science at the University of Cali-fornia, Berkeley. He received a PhD in com-puter science from the University of Califor-nia, Los Angeles. Contact him at [email protected].

Elad Alon is an associate professor in theDepartment of Electrical Engineering andComputer Sciences at the University ofCalifornia, Berkeley, and a codirector of theBerkeley Wireless Research Center. Hereceived a PhD in electrical engineeringfrom Stanford University. Contact him [email protected].

Borivoje Nikolic is the National Semicon-ductor Distinguished Professor of Engineer-ing at the University of California, Berkeley.He received a PhD in electrical and com-puter engineering from the University ofCalifornia, Davis. Contact him at [email protected].

Krste Asanovic is a professor in the Depart-ment of Electrical Engineering and Com-puter Sciences at the University of California,Berkeley. He received a PhD in computerscience from the University of California,Berkeley. Contact him at [email protected].

..............................................................................................................................................................................................

HOT CHIPS

............................................................

20 IEEE MICRO

an agile approach to building risc-v microprocessors

Documents