experiences using the risc-v ecosystem to design an …cbatten/pdfs/ajayi-celerity... ·...

35
Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC in TSMC 16nm Tutu Ajayi 2 , Khalid Al-Hawaj 1 , Aporva Amarnath 2 , Steve Dai 1 , Scott Davidson 4 , Paul Gao 4 , Gai Liu 1 , Anuj Rao 4 , Austin Rovinski 2 , Ningxiao Sun 4 , Christopher Torng 1 , Luis Vega 4 , Bandhav Veluri 4 , Shaolin Xie 4 , Chun Zhao 4 Ritchie Zhao 1 , Christopher Batten 1 , Ronald G. Dreslinski 2 , Rajesh K. Gupta 3 , Michael B. Taylor 4 , Zhiru Zhang 1 1 Cornell University 2 University of Michigan 3 University of California, San Diego 4 Bespoke Silicon Group, (U. Washington/ UC San Diego) MICRO-50 October 14, 2017

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

ExperiencesUsingtheRISC-VEcosystemtoDesignanAccelerator-CentricSoCinTSMC16nm

TutuAjayi2, KhalidAl-Hawaj1, Aporva Amarnath2, SteveDai1,ScottDavidson4, PaulGao4, Gai Liu1, AnujRao4,AustinRovinski2, Ningxiao Sun4, ChristopherTorng1, LuisVega4,Bandhav Veluri4, ShaolinXie4, ChunZhao4 RitchieZhao1,

ChristopherBatten1,RonaldG.Dreslinski2,RajeshK.Gupta3,MichaelB.Taylor4,Zhiru Zhang1

1 CornellUniversity2 UniversityofMichigan

3 UniversityofCalifornia,SanDiego4 BespokeSiliconGroup,(U.Washington/UCSanDiego)

MICRO-50 October 14, 2017

Page 2: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

ComputerArchitectureResearchPrototypingPrototypingis importanttocomplementtheresultsofsimulation-basedresearch

Manybenefitstoprototyping:

• Validatingassumptions• Validating designmethodologies• Measuring realsystem-levelperformanceandenergy efficiency

• Creating platformsforsoftwareresearch• Buildingcredibilitywithindustry• Buildingintuitionforphysicaldesign• Pedagogical benefits• Buildingrealthingsisfun!

Celerity::Introduction

Page 3: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

TheContinuingNeedforBuildingPrototypes

Theriseofthe darksiliconera [1],inwhichanincreasingfractionofsiliconmustremainunpowered,ismotivatinganincreasingtrendtowardsaccelerator-centricarchitectures.

Specializationresearchrequires:

• Newsimulation-based evaluationmethodologiesbasedonaccelerators[2]

• Newprototypingmethodologiesforrapidlybuildingaccelerator-centricprototypes

Unfortunately,buildingresearchprototypescanbetremendouslychallenging.

“Shrink” “Dim”

“Specialize” “Magic”

Celerity::Introduction

The FourHorsemenofthe Coming

DarkSilicon Apocalypse

[1]M.Taylor.“IsDarkSiliconUseful?HarnessingtheFourHorsemenoftheComingDarkSiliconApocalypse,”InDesignAutomationConference,2012.[2]Y.Shao,etal.“Aladdin:APre-RTL,Power-PerformanceAcceleratorSimulatorEnablingLargeDesignSpaceExplorationofCustomizedArchitectures”,ISCA2014

Page 4: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Prototyping withtheRISC-VSoftware/Hardware Ecosystem

Celerity::Introduction

SoftwareToolchain

• Acomplete,off-the-shelfsoftwarestack(e.g.,binutils,GCC,newlib/glibc,Linuxkernel&distros)forbothembeddedandgeneral-purpose

Architecture

• RISC-VISAspecificationdesignedtobebothmodularandextensible,withasmallbaseISAandoptionalextensions

Microarchitecture

• On-chipnetworkspecificationsandimplementations(NASTI,TileLink)• RISC-Vprocessorimplementationsforbothin-order(BerkeleyRocket)andout-of-order(BerkeleyBOOM)cores

PhysicalDesign

• PreviousspinsofchipsforreferenceTesting

• Standardcoreverificationtestsuites+Turn-keyFPGAgateware

ApplicationAlgorithm

Operating System

Instruction Set Architecture

Register-Transfer Level

Circuits

Programming Language

Compilers

Microarchitecture

Gate-Level

TechnologyDevices

Page 5: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

TheCeleritySystem-on-Chip

Celerity, anaccelerator-centricSoCwithatieredacceleratorfabric

thattargets highlyperformantandenergy-efficientembeddedsystems

FundedbytheDARPACRAFTprogram,“CircuitRealizationAtFasterTimescales”

Thegoalwastodevelopnewmethodologiestodesignchipsmorequickly

Celerity::Introduction

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

WeleveragedtheRISC-Vsoftware/hardwareecosystem as webuiltCelerity,andwebelieveitwasinstrumentalinenablingateamof20graduatestudentstotapeoutacomplexSoCinonly9months

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Page 6: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Celerity:ChipOverview

• TSMC16nmFFC• 25mm2 diearea(5mmx5mm)• ~385milliontransistors• 511RISC-Vcores

• 5Linux-capableRV64GBerkeleyRocketcores• 496-core RV32IM meshtiledarray“manycore”• 10-coreRV32IMmeshtiledarray(lowvoltage)

• Binarized NeuralNetworkSpecializedAccelerator• On-chipsynthesizablePLLsandDC/DCLDO

• Developedin-house• 3Clockdomains

• 400MHz– DDRI/O• 625MHz– Rocketcore+Specializedaccelerator• 1.05GHz– Manycorearray

• 672-pinflipchipBGApackage• 9-monthsfromPDKaccesstotape-out

Celerity::Introduction

http://www.opencelerity.org

Page 7: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Agenda

• Introduction• ForeachTier:

• Whatdidwebuild?• Howdidwebuildit?• RISC-VEcosystemSuccesses• RISC-VEcosystemChallenges

• Conclusion

Celerity::Introduction

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Page 8: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Celerity:General-Purpose Tier

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V• ChallengeswithRISC-V

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Page 9: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

General-PurposeTierOverview

• 5 BerkeleyRocketCores(RV64G)• Workload

• General-purposecompute• Operatingsystem(e.g.Linux&TCP/IPStack)• InterruptandExceptionhandling• Programdispatchandcontrolflow

• Interface• Interfacetooff-chipI/Oandotherperipherals• 4Coresconnecttothemanycore array• 1CoreinterfaceswiththeBNN

• Memory• Eachcoreexecutesindependentlywithinitsownaddressspace

• Memorymanagementforalltiers

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V• ChallengeswithRISC-V

Man

ycore

BNN

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

BaseJumpF

SB

BaseJumpM

otherboard

Page 10: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Berkeley RocketCores

• 5BerkeleyRocketCores(https://github.com/freechipsproject/rocket-chip)

• GeneratedfromChisel• RV64GISA• 5-stage,in-order,scalarprocessor• Double-precisionfloatingpoint• I-Cache:16KB4-wayassoc.• D-Cache:16KB4-wayassoc.

• PhysicalImplementation• 625MHz(CriticalpathinFSB)• 0.19mm2 percore

http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 11: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

DesignIterations2.Alpaca

3.Bison 4.Coyote

1.Loopback

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B Loopback FIFO

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B

NA

STI RISC-V Rocket Core

I-CacheD-CacheN

AST

I RISC-V Rocket Core

I-CacheD-Cache

RoCC AcceleratorBas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• SuccesseswithRISC-V•ChallengeswithRISC-V

ImplementedNASTIbridgeandconnectedrocketcoreBaselinedesigntovalidateFSBandNorthbridge

ImplementedacceleratorconnectedthroughBlackboxed RoCC ModularizedRoCCinterfacetoaccelerator

Bas

eJum

pM

othe

rboa

rd

Bas

eJum

pFS

B NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

Accelerator

… …

Page 12: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

BaseJump Motherboard Celerity SoC

Off-ChipInterfaceandNorthbridge

• Open-sourceBaseJumpIPLibrary• http://bjump.org

• FrontSidebus• BaseJumpCommunicationLink• HighSpeed(DDR)Source-SynchronousCommunicationInterface

• Packaging• ModifiedBaseJumpBGAPackageandI/ORing

• Validation• BaseJumpSuperTroublePCB (DaughterCard)• BaseJumpMotherboard(ZedBoard)

DRAM Controller

Ethernet

SSD

L2 $

JTAG

Bas

eJum

pFS

B &

FPG

A B

ridge

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

NA

STI RISC-V Rocket Core

I-CacheD-Cache RoC

C

Bas

eJum

pFP

GA

Brid

ge

Clocks

...

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 13: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

RISC-VSuccesses

• BerkeleyRocketCores• Veryquicklygeneratedvalidateddesigns• Vibrantecosystemtoprovidefeedbackandsupport• TestandValidationinfrastructure• SoftwareandToolchainsupport

• FlexiblememorysystemandperipheralI/Osupport• EasyintegrationwithBaseJump IPLibrary

• Balancesextensibilityandsoftwaresupport

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 14: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

RISC-VLessonsLearned

• Componentstability,compatibilityandversioning• Chiseladoption• RTLsimulationissues

• DecipheringChiselgeneratedRTL• RegisterinitializationandX-Pessimism

Celerity::General-PurposeTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 15: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Celerity:MassivelyParallel Tier

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

DevelopedbyTaylor’sBespokeSiliconGroup@UW

Celerity::Massively ParallelTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

http://bjump.org/manycore

Page 16: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Thetiledarchitecture

TheVanillacore: SimplebutefficienttorunCcodewithoutanytoolchainmodification

• ISA:RV32IM• Pipeline:5-stage,fullyforwarded,in-order,singleissue

• Scratchpadmemory:4KBforIMem,4KBforDMem

• SecondTape-outofthistiledarchitecture(10-core)

...

… … …...

...

...

...

… … …

NOCRouter

RISC-V Core

MEM

C

ross

bar

DMEM

IMEM

496RISC-VCores

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 17: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

MeshNetwork

• LinkProtocol:Forward/Reversepaths,parameterizable address/databits

• Credit-Based:Eachpacketwillbeacknowledgedwithresponse

• Flowcontrol:Endpointcontrolsthenumberoftheoutstandingpacket.

• Router:simpleXY-dimensionrouting,buffered

17Celerity::MassivelyParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

...

...

...

Forward packetForward responseReverse packetReverse response

buffered

router

tilelink protocol

Page 18: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Manycore LinkstoGeneral-PurposeandSpecializedTier

CrossClockDomaininterface

• ToGeneral-PurposeTier:ConvertRoCCtolinkprotocol,supportconfiguringDMA,writeandresetmanycore etc.

• ToSpecializedTier:Aggregatelinkinterfacetoincreasethebandwidthandthroughput

Asy

nc F

IFO

Endp

oint

DM

A

L1D Cache

Core

req

resp

cmdrespbusy

link_to_rocc Router

...

… … …...

...

...

...

… … …

Rocket

Rocket

Rocket

Rocket

RoCC

RoCC

RoCC

RoCC

General-Purpose Tier clock domain

Massively Parallel Tier clock domain

Specialized Tier clock domain

Asy

nc

FIFO

Asy

nc

FIFO

Asy

nc

FIFO

Asy

nc

FIFO

32

32

32

32

64

64

64

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Cross ClkDomain

Cross Clk Domain

Page 19: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

ProgrammingModel

Producer-consumerprogrammingmodel:

extendedinstructionsforefficientinter-tilesynchronization

• LoadReserved(lr.w):loadvalueandsetthereservationaddress

• Load-on-broken-reservation(lr.lbr):stallifthereservedaddressdidn’twrittenbyothercores

• Consumer: waiton<address,value>• Benefits:Nopolling,nointerrupt,fastresponse,stalledpipelinecansavepower

InputSplit Join

Feedback

Pipeline

Output

Producer-consumer Programming

DMEM

Core A Core B

NoC

Remote store

Reserved Address

Invoke pipelineStalled Pipeline

waiting for events

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 20: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

ThreadDensity Comparison

[1]J.Balkind,etal.“OpenPiton:AnOpenSourceManycoreResearchFramework,”intheInternationalConferenceonArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASPLOS),2016.[2]R.Balasubramanian,etal."EnablingGPGPULow-LevelHardwareExplorationswithMIAOW:AnOpen-SourceRTLImplementationofaGPGPU,"inACMTransactionsonArchitectureandCodeOptimization(TACO). 12.2(2015):21.

ConfigurationNormalizedArea

(32nm)

Area

Ratio

CelerityTile@16nm

D-MEM=4KBI-MEM=4KB

0.024*(32/16)2=0.096mm2 1x

OpenPitonTile@32nm

L1D-Cache=8KBL1I-Cache=16KB

L1.5/L2Cache=72KB1.17mm2 [1] 12x

RawTile@180nm

L1D-Cache=32KBL1I-SRAM=96KB

16.0*(32/180)2=0.506mm2 5.25x

MIAOWGPUComputeUnitLane

@32nm

VRF=256KBSRF=2KB

15.0/16=0.938mm2[2] 9.75x

Celerity::Massively ParallelTier::Whatisit? • Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

• Timing:1.05GHz@16nm• Area:0.024mm2 @16nm• SiUtilizationRatio:90%

NormalizedPhysicalThreads(ALUops)perArea

Page 21: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Howdidwebuildthemassivelyparalleltier?

BasejumpSTLlibrary

Dataflow

NoC

Arithmetic

RISC-Vtoolchain

AssemblyTestSuite

ModifiedRuntime

CCompiler

…Inhouse

Design Testing

I Mem D MemRF

Hard-macro

One tile

Floorplan

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

Celerity::Massively ParallelTier::Whatisit?• Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

HierarchicalFlow

Page 22: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

RISC-VEcosystemSuccesses

• ModularISA• Flexibleforbothcomplexcores(i.e.Rocket) and simplecores(i.e.Vanilla)

• ExtensibleRoCCinterface• 4 customizableinstructions:weusedone

• Comprehensiveassembly testsuite(434testcases)• Off-the-shelftoolchain

Celerity::Massively ParallelTier::Whatisit?•Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 23: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

BuildinguptheRISC-VEcosystem

We provideanefficientRV-32IM implementation

inSystemVerilog.

We consolidatedInformationaboutRoCCthat

wasscattered acrosstheinternet.

WithCelerity

• Efficientopensourcecore• BasedonSystemverilog• Siliconproven

• PublicRoCC documentV.2[bjump.org/rocc_doc ]

• ExportedRoCCinterfaceontoplevel

Celerity::Massively ParallelTier::Whatisit?•Howdidwebuildit?• Successeswith RISC-V•ChallengeswithRISC-V

Page 24: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Celerity:SpecializationTier

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-CacheN

AST

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

NA

STI

RoC

CRISC-V Rocket Core

I-CacheD-Cache

RISC-VVanilla-5

Core

I Mem

XB

AR

NoC

Router

D Mem

BaseJumpF

SBand

Motherboard

Page 25: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

CaseStudy:MappingFlexibleImageRecognitiontoaTieredAcceleratorFabric

Threestepstomapapplicationstotieredacceleratorfabric:

Step1. Implementthealgorithmusingthegeneral-purposetierStep2. Acceleratethealgorithmusingeitherthemassively

paralleltierOR thespecializationtierStep3. Improveperformancebycooperativelyusingboththe

specializationAND themassivelyparalleltier

Convolution Pooling Convolution Pooling Fully-connected

bird(0.02)boat(0.94)

cat(0.04)dog(0.01)

MassivelyParallel Tier

Specialization Tier

General-PurposeTier

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 26: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Step1:AlgorithmtoApplicationBinarizedNeuralNetworks

• Trainingusuallyusesfloatingpoint,whileinferenceusuallyuseslowerprecisionweightsandactivations(often8-bitorlower)toreduceimplementationcomplexity

• Rastergari etal.[3]andCourbariaux etal.[4]haverecentlyshownsingle-bitprecisionweightsandactivationscanachieveanaccuracyof89.8%onCIFAR-10

• Performancetargetrequiresultra-lowlatency(batchsizeofone)andhighthroughput(60classifications/second)

[3]M.Rastergari,etal.“Xnor-net:Imagenet classificationusingbinaryconvolutionalneuralnetworks,”InEuropeanConferenceonComputerVision,2016.[4]M.Courbariaux,etal.“Binarizedneuralnetworks:Trainingdeepneuralnetworkswithweightsandactivationsconstrainedto+1or-1,”arXiv preprintarXiv:1602.02830(2016).

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 27: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Step1:AlgorithmtoApplicationCharacterizingBNNExecution

• Usingjustthegeneral-purposetierwould be200xslowerthanthe performancetarget(60classifications/sec)• Binarizedconvolutionallayersconsumeover97%ofdynamicinstructioncount• Perfectaccelerationofjustthebinarizedconvolutionallayersisstill5xslowerthanperformancetarget• Perfectaccelerationofalllayersusingthemassivelyparalleltiercouldmeetperformancetargetbutwithsignificantenergyconsumption

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 28: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Step 2:ApplicationtoAcceleratorBNNSpecializedAccelerator

1. AcceleratorisconfiguredtoprocessalayerthroughRoCCcommandmessages

2. MemoryUnitstartsstreamingtheweightsintotheacceleratorandunpackingthebinarizedweightsintoappropriatebuffers

3. Binaryconvolutioncomputeunitprocessesinputactivationsandweightstoproduceoutputactivations

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 29: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Off-Ch

ipI/O

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

Step2:ApplicationtoAcceleratorGeneral-PurposeTierforWeightStorage

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 30: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

Step3:AssistingAcceleratorsMassivelyParallelTierforWeightStorage

Off-Ch

ipI/O

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

AX

I

RoC

CRISC-V Rocket Core

I-CacheD-Cache

• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights

• Each core in the massively parallel tier executes a remote-load-store program to orchestrate sending weights to the specialization tier via a hardware FIFO

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 31: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

PerformanceBenefitsofCooperativelyUsingtheMassivelyParallelandtheSpecializationTiers

General-Purpose Tier Software implementation assuming ideal performance estimated with an optimistic one instruction per cycle

Specialization Tier Full-system RTL simulation of the BNN specialized accelerator running with a frequency of 625 MHz

Specialization + MassivelyParallel Tiers

Full-system RTL simulation of the BNN specialized accelerator with the weights being streamed from the manycore

General-Purpose Tier Specialization Tier Specialization + Massively Parallel Tiers

Runtime per Image (ms) 4,024 20 3.3

Power (Watts) 0.2 – 0.5 0.2 – 0.5 0.5 – 2.0

Improvement in Perf / Power 1x ~200x ~400x

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 32: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

DesignMethodology

SystemCConstraints

StratusHLS

RTL

PyMTL

Wrappers&

Adapters

FinalRTL

void bnn::dma_req() {while( 1 ) {DmaMsg msg = dma_req.get();

for ( int i = 0; i < msg.len; i++ ) {HLS_PIPELINE_LOOP( HARD_STALL, 1 );

int req_type = 0;word_t data = 0;addr_t addr = msg.base + i*8;

if ( type == DMA_TYPE_WRITE ) {data = msg.data;req_type = MemReqMsg::WRITE;} else {req_type = MemReqMsg::READ;}

memreq.put(MemReqMsg(req_type,addr,data));}

dma_resp.put(DMA_REQ_DONE);}}

Including

RoCCInterfaces

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Page 33: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

DesignMethodology

SystemCConstraints

StratusHLS

RTL

PyMTL

Wrappers&

Adapters

FinalRTL HardMacro

ASICFlow

Constraints

Files

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Including

RoCCInterfaces

Page 34: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

RISC-VEcosystemSuccessesandChallenges

Successes

• TheRoCCcommandandmemoryinterfacewerebothsignificantsuccesses.Weconnected the acceleratorwithnochangestoRV64Gcore,justaswedidforthemanycorearrayinthemassivelyparallel tier.

Celerity::SpecializationTier::Whatisit?• Howdidwebuildit?•SuccesseswithRISC-V•ChallengeswithRISC-V

Challenges

• SmallchallengeintheRoCCacceleratorinterfaceatthespecificcommitwechosetouse

• MemorymanagementunitinRV64Gusedonlyphysicaladdresses

• Wedidasmallworkaroundtogiveusvirtualaddressesaswell

• Thischallengehasalreadybeenfixedupstream

Page 35: Experiences Using the RISC-V Ecosystem to Design an …cbatten/pdfs/ajayi-celerity... · 2020-06-24 · Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC

TheCeleritySystem-on-Chip

Celerity, anaccelerator-centricSoCwithatieredacceleratorfabricthat

targetshighlyperformantandenergy-efficientembeddedsystems

Celerity’s goal wastodevelopnewmethodologiestodesignchipsmorequickly

WebelievetheRISC-Vsoftware/hardwareecosystem wasinstrumentalinenablingateamof20graduatestudents

totapeoutacomplexSoC inonly9months

General-Purpose

Tier

MassivelyParallel

Tier

Specialization

Tier

Wethankthemanycontributorstotheopen-sourceRISC-VsoftwareandhardwareecosystemwithspecialthankstoU.C.

BerkeleyforformingtheRISC-Vecosystem

Celerity::Conclusion

Acknowledgements:DARPA,undertheCRAFTprogram

SpecialthankstoDr.LintonSalmonforprogramsupportandcoordination

http://www.opencelerity.org