Experiences Using the RISC-V Ecosystem to Design an Accelerator-Centric SoC in TSMC 16nm
Tutu Ajayi2, Khalid Al-Hawaj1, Aporva Amarnath2, Steve Dai1, Scott Davidson4, Paul Gao4, Gai Liu1, Anuj Rao4, Austin Rovinski2, Ningxiao Sun4, Christopher Torng1, Luis Vega4, Bandhav Veluri4, Shaolin Xie4, Chun Zhao4, Ritchie Zhao1,
Christopher Batten1, Ronald G. Dreslinski2, Rajesh K. Gupta3, Michael B. Taylor4, Zhiru Zhang1
1 Cornell University, 2 University of Michigan,
3 University of California, San Diego, 4 Bespoke Silicon Group (U. Washington / UC San Diego)
MICRO-50 October 14, 2017
Computer Architecture Research Prototyping
Prototyping is important to complement the results of simulation-based research
Many benefits to prototyping:
• Validating assumptions
• Validating design methodologies
• Measuring real system-level performance and energy efficiency
• Creating platforms for software research
• Building credibility with industry
• Building intuition for physical design
• Pedagogical benefits
• Building real things is fun!
Celerity::Introduction
The Continuing Need for Building Prototypes
The rise of the dark silicon era [1], in which an increasing fraction of silicon must remain unpowered, is motivating an increasing trend towards accelerator-centric architectures.
Specialization research requires:
• New simulation-based evaluation methodologies based on accelerators [2]
• New prototyping methodologies for rapidly building accelerator-centric prototypes
Unfortunately, building research prototypes can be tremendously challenging.
[Figure: The Four Horsemen of the Coming Dark Silicon Apocalypse: "Shrink", "Dim", "Specialize", "Magic"]
[1] M. Taylor. "Is Dark Silicon Useful? Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse," In Design Automation Conference, 2012.
[2] Y. Shao, et al. "Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures," ISCA 2014.
Prototyping with the RISC-V Software/Hardware Ecosystem
Software Toolchain
• A complete, off-the-shelf software stack (e.g., binutils, GCC, newlib/glibc, Linux kernel & distros) for both embedded and general-purpose
Architecture
• RISC-V ISA specification designed to be both modular and extensible, with a small base ISA and optional extensions
Microarchitecture
• On-chip network specifications and implementations (NASTI, TileLink)
• RISC-V processor implementations for both in-order (Berkeley Rocket) and out-of-order (Berkeley BOOM) cores
Physical Design
• Previous spins of chips for reference
Testing
• Standard core verification test suites + turn-key FPGA gateware
[Figure: computing stack layers: Application, Algorithm, Programming Language, Operating System, Compilers, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate-Level, Circuits, Devices, Technology]
The Celerity System-on-Chip
Celerity, an accelerator-centric SoC with a tiered accelerator fabric that targets highly performant and energy-efficient embedded systems
Funded by the DARPA CRAFT program, "Circuit Realization At Faster Timescales"
The goal was to develop new methodologies to design chips more quickly
We leveraged the RISC-V software/hardware ecosystem as we built Celerity, and we believe it was instrumental in enabling a team of 20 graduate students to tape out a complex SoC in only 9 months
[Figure: Celerity block diagram showing the General-Purpose Tier (five RISC-V Rocket cores, each with I-Cache, D-Cache, NASTI and RoCC interfaces), the Massively Parallel Tier (tiles with a RISC-V Vanilla-5 core, I Mem, D Mem, XBAR and NoC router), the Specialization Tier, and the BaseJump FSB and Motherboard]
Celerity: Chip Overview
• TSMC 16nm FFC
• 25 mm2 die area (5 mm x 5 mm)
• ~385 million transistors
• 511 RISC-V cores
  • 5 Linux-capable RV64G Berkeley Rocket cores
  • 496-core RV32IM mesh tiled array "manycore"
  • 10-core RV32IM mesh tiled array (low voltage)
• Binarized Neural Network Specialized Accelerator
• On-chip synthesizable PLLs and DC/DC LDO
  • Developed in-house
• 3 clock domains
  • 400 MHz: DDR I/O
  • 625 MHz: Rocket core + Specialized accelerator
  • 1.05 GHz: Manycore array
• 672-pin flip chip BGA package
• 9 months from PDK access to tape-out
http://www.opencelerity.org
Agenda
• Introduction
• For each Tier:
  • What did we build?
  • How did we build it?
  • RISC-V Ecosystem Successes
  • RISC-V Ecosystem Challenges
• Conclusion
Celerity: General-Purpose Tier
Celerity::General-Purpose Tier::What is it? • How did we build it? • Successes with RISC-V • Challenges with RISC-V
General-Purpose Tier Overview
• 5 Berkeley Rocket Cores (RV64G)
• Workload
  • General-purpose compute
  • Operating system (e.g. Linux & TCP/IP stack)
  • Interrupt and exception handling
  • Program dispatch and control flow
• Interface
  • Interface to off-chip I/O and other peripherals
  • 4 cores connect to the manycore array
  • 1 core interfaces with the BNN
• Memory
  • Each core executes independently within its own address space
  • Memory management for all tiers
[Figure: five Rocket cores attached to the BaseJump FSB and Motherboard; four connect to the manycore and one to the BNN]
Berkeley Rocket Cores
• 5 Berkeley Rocket Cores (https://github.com/freechipsproject/rocket-chip)
  • Generated from Chisel
  • RV64G ISA
  • 5-stage, in-order, scalar processor
  • Double-precision floating point
  • I-Cache: 16 KB, 4-way assoc.
  • D-Cache: 16 KB, 4-way assoc.
• Physical Implementation
  • 625 MHz (critical path in FSB)
  • 0.19 mm2 per core
http://www.lowrisc.org/docs/tagged-memory-v0.1/rocket-core/
Design Iterations
1. Loopback: baseline design to validate FSB and Northbridge
2. Alpaca: implemented NASTI bridge and connected Rocket core
3. Bison: implemented accelerator connected through black-boxed RoCC
4. Coyote: modularized RoCC interface to accelerator
[Figure: block diagrams of the four iterations, each attached to the BaseJump Motherboard through the BaseJump FSB: a loopback FIFO; NASTI-connected RISC-V Rocket cores with I-Cache and D-Cache; a Rocket core with a RoCC accelerator; the Celerity SoC]
Off-Chip Interface and Northbridge
• Open-source BaseJump IP Library
  • http://bjump.org
• Front Side Bus
  • BaseJump Communication Link
  • High-speed (DDR) source-synchronous communication interface
• Packaging
  • Modified BaseJump BGA package and I/O ring
• Validation
  • BaseJump Super Trouble PCB (daughter card)
  • BaseJump Motherboard (ZedBoard)
[Figure: the five Rocket cores (NASTI, RoCC, I-Cache, D-Cache) connect through the BaseJump FSB and FPGA bridge to the DRAM controller, Ethernet, SSD, L2 $, JTAG, and clocks]
RISC-V Successes
• Berkeley Rocket Cores
  • Very quickly generated validated designs
  • Vibrant ecosystem to provide feedback and support
  • Test and validation infrastructure
  • Software and toolchain support
• Flexible memory system and peripheral I/O support
  • Easy integration with BaseJump IP Library
• Balances extensibility and software support
RISC-V Lessons Learned
• Component stability, compatibility and versioning
• Chisel adoption
• RTL simulation issues
  • Deciphering Chisel-generated RTL
  • Register initialization and X-pessimism
Celerity: Massively Parallel Tier
Developed by Taylor’s Bespoke Silicon Group @ UW
Celerity::Massively Parallel Tier::What is it? • How did we build it? • Successes with RISC-V • Challenges with RISC-V
http://bjump.org/manycore
The tiled architecture
The Vanilla core: simple but efficient to run C code without any toolchain modification
• ISA: RV32IM
• Pipeline: 5-stage, fully forwarded, in-order, single issue
• Scratchpad memory: 4 KB for IMem, 4 KB for DMem
• Second tape-out of this tiled architecture (10-core)
[Figure: 496 RISC-V cores arranged as a tiled mesh; each tile contains a RISC-V core, IMEM, DMEM, a memory crossbar, and a NoC router]
Mesh Network
• Link protocol: forward/reverse paths, parameterizable address/data bits
• Credit-based: each packet is acknowledged with a response
• Flow control: the endpoint controls the number of outstanding packets
• Router: simple XY-dimension routing, buffered
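The XY-dimension routing named in the last bullet can be sketched in a few lines. This is a minimal C++ model, not the router RTL: the `Coord` struct, the `Port` names, and the convention that x grows toward East and y toward South are assumptions made for illustration. A packet first travels along X until its column matches the destination, then along Y.

```cpp
#include <cstdlib>

// Hypothetical tile coordinate in the mesh (names are illustrative).
struct Coord {
  int x;
  int y;
};

// Hypothetical output ports of one router.
enum class Port { Local, East, West, North, South };

// Dimension-ordered (XY) routing: resolve the X offset first, then Y.
// Assumes x grows toward East and y grows toward South.
Port route_xy(Coord here, Coord dest) {
  if (dest.x > here.x) return Port::East;
  if (dest.x < here.x) return Port::West;
  if (dest.y > here.y) return Port::South;
  if (dest.y < here.y) return Port::North;
  return Port::Local;  // arrived: deliver to the attached core
}

// Number of router-to-router hops an XY-routed packet takes.
int hop_count(Coord src, Coord dest) {
  return std::abs(dest.x - src.x) + std::abs(dest.y - src.y);
}
```

Because every packet resolves X before Y, each router's decision depends only on its own coordinate and the destination, which is one reason this scheme is commonly chosen for simple mesh routers.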
[Figure: mesh links between buffered routers carrying forward packets, forward responses, reverse packets, and reverse responses; tile link protocol]
Manycore Links to General-Purpose and Specialized Tier
Cross-clock-domain interface
• To General-Purpose Tier: converts RoCC to the link protocol; supports configuring DMA, writing to and resetting the manycore, etc.
• To Specialized Tier: aggregates link interfaces to increase bandwidth and throughput
[Figure: four Rocket cores, each with a RoCC interface (cmd, resp, busy; L1 D-cache req/resp), connect through link_to_rocc endpoints with DMA and async FIFOs to routers of the manycore array; async FIFOs separate the General-Purpose, Massively Parallel, and Specialized Tier clock domains; links are labeled 32 and 64]
Programming Model
Producer-consumer programming model: extended instructions for efficient inter-tile synchronization
• Load Reserved (lr.w): load value and set the reservation address
• Load-on-broken-reservation (lr.lbr): stall if the reserved address has not been written by another core
• Consumer: wait on <address, value>
• Benefits: no polling, no interrupt, fast response; a stalled pipeline can save power
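The consumer side of this model can be illustrated with a toy, single-threaded C++ sketch. It only mirrors the behavior described in the bullets (reserve an address, stall until a remote store breaks the reservation); the `TileModel` class and all of its member names are hypothetical, and the real mechanism lives in the Vanilla core's pipeline rather than in software.

```cpp
#include <cstdint>
#include <map>

// Toy single-threaded model of one tile's reservation logic.
// All names here are hypothetical; the real mechanism is in hardware.
class TileModel {
 public:
  // lr.w: load a word from local memory and reserve its address.
  uint32_t lr_w(uint32_t addr) {
    reserved_addr_ = addr;
    reservation_valid_ = true;
    return dmem_[addr];
  }

  // Remote store arriving over the NoC from another core.
  // Writing the reserved address breaks the reservation.
  void remote_store(uint32_t addr, uint32_t value) {
    dmem_[addr] = value;
    if (reservation_valid_ && addr == reserved_addr_) {
      reservation_valid_ = false;
    }
  }

  // lr.lbr: the pipeline stalls while the reservation is intact.
  bool lr_lbr_would_stall() const { return reservation_valid_; }

  // lr.lbr once the reservation is broken: return the new value.
  uint32_t lr_lbr_load() const { return dmem_.at(reserved_addr_); }

 private:
  std::map<uint32_t, uint32_t> dmem_;
  uint32_t reserved_addr_ = 0;
  bool reservation_valid_ = false;
};
```

In this sketch a producer tile would call `remote_store` on the consumer's reserved address, after which `lr_lbr_would_stall` returns false and the consumer can read the new value without having polled for it.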
[Figure: stream patterns (Input, Split, Join, Feedback, Pipeline, Output) and producer-consumer programming: Core A performs a remote store over the NoC to the reserved address in Core B's DMEM, invoking Core B's stalled pipeline, which was waiting for events]
Thread Density Comparison
[1] J. Balkind, et al. "OpenPiton: An Open Source Manycore Research Framework," in the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[2] R. Balasubramanian, et al. "Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU," in ACM Transactions on Architecture and Code Optimization (TACO), 12.2 (2015): 21.
Configuration | Normalized Area (32nm) | Area Ratio
Celerity Tile @ 16nm (D-MEM = 4 KB, I-MEM = 4 KB) | 0.024 * (32/16)^2 = 0.096 mm2 | 1x
OpenPiton Tile @ 32nm (L1 D-Cache = 8 KB, L1 I-Cache = 16 KB, L1.5/L2 Cache = 72 KB) | 1.17 mm2 [1] | 12x
Raw Tile @ 180nm (L1 D-Cache = 32 KB, L1 I-SRAM = 96 KB) | 16.0 * (32/180)^2 = 0.506 mm2 | 5.25x
MIAOW GPU Compute Unit Lane @ 32nm (VRF = 256 KB, SRF = 2 KB) | 15.0/16 = 0.938 mm2 [2] | 9.75x
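The normalization used in the table scales each area by the square of the ratio of process nodes. The C++ sketch below reproduces that arithmetic; the function names are illustrative, and the quadratic scaling rule is simply the one the table's own expressions use, not a claim about real process scaling.

```cpp
// Scale a silicon area from one process node to another, assuming area
// scales with the square of the feature size (the rule the table uses).
double normalize_area(double area_mm2, double from_nm, double to_nm) {
  double scale = to_nm / from_nm;
  return area_mm2 * scale * scale;
}

// Area of a design relative to a baseline, both already normalized.
double area_ratio(double area_mm2, double baseline_mm2) {
  return area_mm2 / baseline_mm2;
}
```

For example, `normalize_area(0.024, 16, 32)` gives the 0.096 mm2 Celerity entry and `normalize_area(16.0, 180, 32)` gives roughly 0.506 mm2 for Raw. Dividing the other entries by 0.096 yields ratios close to the listed 12x, 5.25x, and 9.75x, which appear to be rounded.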
• Timing: 1.05 GHz @ 16nm
• Area: 0.024 mm2 @ 16nm
• Si utilization ratio: 90%
[Figure: normalized physical threads (ALU ops) per area]
How did we build the massively parallel tier?
[Figure: build flow. Design: BaseJump STL library (Dataflow, NoC, Arithmetic, …). Testing: RISC-V toolchain (Assembly Test Suite, Modified Runtime, C Compiler, …). Label: In-house. Hierarchical flow: hard macros (I Mem, D Mem, RF), one tile (RISC-V Vanilla-5 core, I Mem, D Mem, XBAR, NoC router), floorplan.]
RISC-V Ecosystem Successes
• Modular ISA
  • Flexible for both complex cores (i.e. Rocket) and simple cores (i.e. Vanilla)
• Extensible RoCC interface
  • 4 customizable instructions: we used one
• Comprehensive assembly test suite (434 test cases)
• Off-the-shelf toolchain
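To make the RoCC bullet concrete, the sketch below packs a RoCC-style instruction word in C++. It assumes the "4 customizable instructions" correspond to the four major opcodes the RISC-V ISA reserves for custom extensions (custom-0 to custom-3) and uses the R-type-like field layout that Rocket's RoCC interface documents (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode). The type and function names are illustrative, and the slides do not say which opcode or function code Celerity used.

```cpp
#include <cstdint>

// Major opcodes the RISC-V ISA reserves for custom extensions.
enum class CustomOpcode : uint32_t {
  Custom0 = 0x0B,
  Custom1 = 0x2B,
  Custom2 = 0x5B,
  Custom3 = 0x7B,
};

// Fields of a RoCC-style instruction word (names are illustrative).
struct RoccFields {
  uint32_t funct7;  // accelerator-defined function code
  uint32_t rs2;     // second source register index
  uint32_t rs1;     // first source register index
  bool xd;          // accelerator writes a result to rd
  bool xs1;         // value of rs1 is passed to the accelerator
  bool xs2;         // value of rs2 is passed to the accelerator
  uint32_t rd;      // destination register index
};

// Pack the fields into a 32-bit instruction word.
uint32_t encode_rocc(CustomOpcode op, const RoccFields& f) {
  return ((f.funct7 & 0x7F) << 25) | ((f.rs2 & 0x1F) << 20) |
         ((f.rs1 & 0x1F) << 15) | (uint32_t(f.xd) << 14) |
         (uint32_t(f.xs1) << 13) | (uint32_t(f.xs2) << 12) |
         ((f.rd & 0x1F) << 7) | static_cast<uint32_t>(op);
}
```

A word built this way can be emitted from C with an assembler `.word` directive, which is one common way to issue custom instructions without modifying the toolchain.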
Building up the RISC-V Ecosystem
We provide an efficient RV32IM implementation in SystemVerilog.
We consolidated information about RoCC that was scattered across the internet.
With Celerity
• Efficient open-source core
• Based on SystemVerilog
• Silicon proven
• Public RoCC document V.2 [bjump.org/rocc_doc]
• Exported RoCC interface on top level
Celerity: Specialization Tier
Celerity::Specialization Tier::What is it? • How did we build it? • Successes with RISC-V • Challenges with RISC-V
Case Study: Mapping Flexible Image Recognition to a Tiered Accelerator Fabric
Three steps to map applications to tiered accelerator fabric:
Step 1. Implement the algorithm using the general-purpose tier
Step 2. Accelerate the algorithm using either the massively parallel tier OR the specialization tier
Step 3. Improve performance by cooperatively using both the specialization AND the massively parallel tier
[Figure: image recognition pipeline (Convolution, Pooling, Convolution, Pooling, Fully-connected) producing class scores bird (0.02), boat (0.94), cat (0.04), dog (0.01), mapped onto the General-Purpose, Massively Parallel, and Specialization Tiers]
Step 1: Algorithm to Application
Binarized Neural Networks
• Training usually uses floating point, while inference usually uses lower precision weights and activations (often 8-bit or lower) to reduce implementation complexity
• Rastegari et al. [3] and Courbariaux et al. [4] have recently shown single-bit precision weights and activations can achieve an accuracy of 89.8% on CIFAR-10
• Performance target requires ultra-low latency (batch size of one) and high throughput (60 classifications/second)
[3] M. Rastegari, et al. "XNOR-Net: ImageNet classification using binary convolutional neural networks," In European Conference on Computer Vision, 2016.
[4] M. Courbariaux, et al. "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830 (2016).
Step 1: Algorithm to Application
Characterizing BNN Execution
• Using just the general-purpose tier would be 200x slower than the performance target (60 classifications/sec)
• Binarized convolutional layers consume over 97% of dynamic instruction count
• Perfect acceleration of just the binarized convolutional layers is still 5x slower than performance target
• Perfect acceleration of all layers using the massively parallel tier could meet performance target but with significant energy consumption
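The second and third bullets are an application of Amdahl's law, which the following C++ sketch works through. The function names are illustrative, and using dynamic instruction count as a stand-in for execution time is an assumption. With a fraction of exactly 0.97 the bound is about 33x, leaving the design roughly 6x short of the target; the slide's 5x corresponds to a fraction near 0.975, which is consistent with "over 97%".

```cpp
// Amdahl's law: overall speedup when a fraction of the work is
// accelerated by a given factor.
double amdahl_speedup(double accelerated_fraction, double factor) {
  return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor);
}

// Limit of Amdahl's law for a perfect accelerator, where the
// accelerated part takes no time at all.
double perfect_acceleration_speedup(double accelerated_fraction) {
  return 1.0 / (1.0 - accelerated_fraction);
}

// How many times slower than the target we remain after a speedup,
// given how many times slower the baseline was.
double remaining_slowdown(double baseline_slowdown, double speedup) {
  return baseline_slowdown / speedup;
}
```

This is why accelerating only the convolutional layers is not enough: the remaining few percent of the work, still running on the general-purpose tier, caps the achievable speedup.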
Step 2: Application to Accelerator
BNN Specialized Accelerator
1. Accelerator is configured to process a layer through RoCC command messages
2. Memory unit starts streaming the weights into the accelerator and unpacking the binarized weights into appropriate buffers
3. Binary convolution compute unit processes input activations and weights to produce output activations
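The arithmetic at the heart of step 3 is commonly implemented in binarized networks as an XNOR followed by a population count. The C++ sketch below shows that general technique for one dot product; it is not taken from the Celerity accelerator's source, and the bit packing (bit 1 for +1, bit 0 for -1), the 64-element limit, and the function names are assumptions made for illustration.

```cpp
#include <bitset>
#include <cstdint>

// Dot product of two vectors whose elements are +1 or -1, packed one
// element per bit (bit 1 encodes +1, bit 0 encodes -1). Requires n <= 64.
int binary_dot(uint64_t activations, uint64_t weights, int n) {
  uint64_t mask = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
  // XNOR is 1 where the signs agree, i.e. where the product is +1.
  uint64_t agree = ~(activations ^ weights) & mask;
  int matches = static_cast<int>(std::bitset<64>(agree).count());
  // `matches` products are +1 and the remaining n - matches are -1.
  return 2 * matches - n;
}

// Binarize an accumulated sum back to one bit (sign activation).
bool binarize(int sum) { return sum >= 0; }
```

Replacing multiply-accumulate with bitwise logic and a popcount is what lets single-bit weights and activations reduce implementation complexity compared with wider fixed-point datapaths.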
Step 2: Application to Accelerator
General-Purpose Tier for Weight Storage
• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights
Step 3: Assisting Accelerators
Massively Parallel Tier for Weight Storage
• The BNN specialized accelerator can use one of the Rocket cores’ caches to load every layer’s weights
• Each core in the massively parallel tier executes a remote-load-store program to orchestrate sending weights to the specialization tier via a hardware FIFO
Performance Benefits of Cooperatively Using the Massively Parallel and the Specialization Tiers
General-Purpose Tier: software implementation assuming ideal performance, estimated with an optimistic one instruction per cycle
Specialization Tier: full-system RTL simulation of the BNN specialized accelerator running with a frequency of 625 MHz
Specialization + Massively Parallel Tiers: full-system RTL simulation of the BNN specialized accelerator with the weights being streamed from the manycore

                            | General-Purpose Tier | Specialization Tier | Specialization + Massively Parallel Tiers
Runtime per Image (ms)      | 4,024                | 20                  | 3.3
Power (Watts)               | 0.2-0.5              | 0.2-0.5             | 0.5-2.0
Improvement in Perf / Power | 1x                   | ~200x               | ~400x
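The last row of the table can be reproduced from the first two. The C++ sketch below is only a worked calculation with illustrative function names. Pairing the same end of each power range (0.5 W with 2.0 W, or 0.2 W with 0.5 W) is an assumption, since the slides do not say which operating points were combined.

```cpp
// Speedup of a configuration relative to a baseline runtime.
double speedup(double baseline_ms, double runtime_ms) {
  return baseline_ms / runtime_ms;
}

// Improvement in performance per watt relative to the baseline.
double perf_per_watt_gain(double baseline_ms, double baseline_w,
                          double runtime_ms, double power_w) {
  return speedup(baseline_ms, runtime_ms) * (baseline_w / power_w);
}

// Images classified per second at a given runtime per image.
double images_per_second(double runtime_ms) { return 1000.0 / runtime_ms; }
```

With equal power, `speedup(4024, 20)` is about 201, matching the ~200x entry. For the combined tiers, the gain falls between roughly 305x (upper power bounds) and 488x (lower power bounds), which brackets the ~400x entry. At 3.3 ms per image the throughput is about 303 images per second, above the 60 classifications/second target.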
Design Methodology
[Figure: SystemC and constraints feed Stratus HLS to produce RTL; PyMTL wrappers and adapters (including RoCC interfaces) are added to produce the final RTL]
void bnn::dma_req() {
  while ( 1 ) {
    // Block until the next DMA request message arrives
    DmaMsg msg = dma_req.get();

    // Issue one memory request per 8-byte word
    for ( int i = 0; i < msg.len; i++ ) {
      HLS_PIPELINE_LOOP( HARD_STALL, 1 );

      int    req_type = 0;
      word_t data     = 0;
      addr_t addr     = msg.base + i*8;

      if ( type == DMA_TYPE_WRITE ) {
        data     = msg.data;
        req_type = MemReqMsg::WRITE;
      } else {
        req_type = MemReqMsg::READ;
      }

      memreq.put( MemReqMsg( req_type, addr, data ) );
    }

    // Signal completion of the whole transfer
    dma_resp.put( DMA_REQ_DONE );
  }
}
[Figure: the final RTL and constraints files then pass through the ASIC flow to produce a hard macro]
RISC-V Ecosystem Successes and Challenges
Successes
• The RoCC command and memory interfaces were both significant successes. We connected the accelerator with no changes to the RV64G core, just as we did for the manycore array in the massively parallel tier.
Challenges
• Small challenge in the RoCC accelerator interface at the specific commit we chose to use
  • Memory management unit in RV64G used only physical addresses
  • We did a small workaround to give us virtual addresses as well
  • This challenge has already been fixed upstream
The Celerity System-on-Chip
Celerity, an accelerator-centric SoC with a tiered accelerator fabric that targets highly performant and energy-efficient embedded systems
Celerity’s goal was to develop new methodologies to design chips more quickly
We believe the RISC-V software/hardware ecosystem was instrumental in enabling a team of 20 graduate students to tape out a complex SoC in only 9 months
We thank the many contributors to the open-source RISC-V software and hardware ecosystem with special thanks to U.C. Berkeley for forming the RISC-V ecosystem
Celerity::Conclusion
Acknowledgements: DARPA, under the CRAFT program
Special thanks to Dr. Linton Salmon for program support and coordination
http://www.opencelerity.org