![Page 1: Improving the Performance of OpenCL-based FPGA Accelerator ... · Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network Jialiang Zhang and Jing](https://reader033.vdocuments.mx/reader033/viewer/2022042307/5ed3657832989a24575c583f/html5/thumbnails/1.jpg)
Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network
Jialiang Zhang and Jing Li
Outline
• Background
• Motivation
• Balance analysis model
• Proposed design
• Performance
• Conclusion
Convolutional Neural Network
Macro Architecture of VGG
Image from: https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/l
OpenCL Framework
[Figure: OpenCL tool flow for FPGA. Blocks: Application Source Code; OpenCL Kernel; OpenCL Framework; OpenCL C Host API; Offline FPGA OpenCL Compiler (CLANG Frontend, IR Analysis & Optimization, Verilog Backend, Basic Block Library); FPGA (Infrastructure, Kernel); FPGA Runtime; FPGA Driver. Steps are marked (i), (ii), (iii).]
• OpenCL provides a good FPGA abstraction
• Unlike OpenCL on GPU or CPU, OpenCL on FPGA describes both hardware and software
• Use #pragma to guide hardware generation:
– loop unrolling; SIMD factor; compute unit replication
OpenCL FPGA Framework
[Figure: three views of the generated hardware. Panels: (a) Top Level, (b) Compute Unit Level, (c) Processing Element Level. Labels: HOST; PCIe Controller; SoC Bus; Dispatcher; Infrastructure Region; DDR Controller; Global Memory; Kernel Region; CU with Local Memory (four shown); Computation Subsystem; Local Memory Subsystem; PE with two BRAMs (four shown). Annotations: 2x2 PE Replication (Controlled by User); 4x CU Replication (Controlled by User); 16x On-chip Memory Replication (Compiler Inferred). Legend: Control Flow, Data Flow.]
Related works
• Generic CNN accelerators: [1]-[3]
• Balance computation and external memory access: [4]-[7]
• Hardware abstraction: [7][8]
• Contribution:
– Identify on-chip memory bandwidth as the performance bottleneck in large-scale FPGA CNN accelerators
– A CNN accelerator that achieves an optimized balance among computation, on-chip memory access, and external memory access

[1] Farabet, et al. Hardware accelerated convolutional neural networks for synthetic vision systems. ISCAS 2010
[2] M. Peemen, et al. Memory-centric accelerator design for convolutional neural networks. ICCD 2013
[3] V. Gokhale, et al. A 240 G-ops/s mobile coprocessor for deep neural networks. CVPR Workshops, 2014
[4] C. Zhang, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. ISFPGA 2015
[5] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[6] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
[7] C. Zhang, et al. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. ICCAD 2016
[8] H. Sharma, et al. From High-Level Deep Neural Models to FPGAs. MICRO 2016
Motivation
• New FPGA devices have more and faster DSP resources
• We should use all DSP resources every cycle to obtain the GFLOPS number in the datasheet
• Existing designs cannot utilize the DSP resources well
Balance Analysis Model
• Balance is the relationship between storage and computation resources
• Assumptions:
– Only one bottleneck in the system
– Bandwidth bounded
– Computation and memory access overlap perfectly

| | Machine Balance (hardware determined) | Code Balance (algorithm determined) |
|---|---|---|
| On-chip (refers to on-chip memory bandwidth) | $B_m^{On}$ | $B_C^{On}$ |
| Off-chip (refers to off-chip memory bandwidth) | $B_m^{Off}$ | $B_C^{Off}$ |
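Machine Balance
[Figure: PEs (computational resources) fed by on-chip buffers in one panel and by off-chip DDR memory in the other.]

$$B_m^{On} = \frac{N_{BRAM} \times WIDTH_{BRAM} \times f_{BRAM}}{N_{DSP} \times OPS/DSP \times f_{DSP}}$$

$$B_m^{Off} = \frac{N_{DDR} \times BW_{DDR}}{N_{DSP} \times OPS/DSP \times f_{DSP}}$$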
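The quantities on this slide (BRAM count, width and clock; DDR count and bandwidth; DSP count, operations per DSP and clock) can be combined into the two machine balances with a few lines of arithmetic. The sketch below assumes machine balance means memory bandwidth divided by compute throughput. The device parameters are hypothetical placeholders chosen for easy arithmetic, not figures for any device in the talk, and the function names are illustrative.

```python
def on_chip_bandwidth(n_bram, width_bram, f_bram):
    """Aggregate on-chip bandwidth: N_BRAM * WIDTH_BRAM * f_BRAM."""
    return n_bram * width_bram * f_bram


def off_chip_bandwidth(n_ddr, bw_ddr):
    """Aggregate off-chip bandwidth: N_DDR * BW_DDR."""
    return n_ddr * bw_ddr


def compute_throughput(n_dsp, ops_per_dsp, f_dsp):
    """Peak compute throughput: N_DSP * OPS/DSP * f_DSP."""
    return n_dsp * ops_per_dsp * f_dsp


def machine_balance(bandwidth, throughput):
    """Memory bandwidth available per operation (assumed definition)."""
    return bandwidth / throughput


# Hypothetical device, NOT taken from the slides.
N_BRAM, WIDTH_BRAM, F_BRAM = 1000, 4, 300e6   # blocks, bytes per access, Hz
N_DDR, BW_DDR = 2, 10e9                       # channels, bytes/s per channel
N_DSP, OPS_PER_DSP, F_DSP = 500, 2, 300e6     # blocks, ops per cycle, Hz

peak = compute_throughput(N_DSP, OPS_PER_DSP, F_DSP)
b_m_on = machine_balance(on_chip_bandwidth(N_BRAM, WIDTH_BRAM, F_BRAM), peak)
b_m_off = machine_balance(off_chip_bandwidth(N_DDR, BW_DDR), peak)

print(f"B_m^On  = {b_m_on:.3f} bytes/op")
print(f"B_m^Off = {b_m_off:.4f} bytes/op")
```

The unit of the result depends on the unit chosen for WIDTH_BRAM and BW_DDR (bytes here, words elsewhere), so values computed this way are only comparable to a code balance expressed in the same unit.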
Code Balance
• On-chip code balance is determined by the input and output data size of instructions
• For example, a MAC operation has an on-chip code balance of 1
• Off-chip code balance is determined by the input and output data size of the whole program
• Assume unlimited on-chip memory size
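As a rough illustration of off-chip code balance, the sketch below divides the data a layer must move (inputs, weights, outputs, each transferred once because on-chip memory is assumed unlimited) by the operations it performs. Counting one MAC as two operations, measuring data in words, and the layer shapes used are all assumptions for illustration; the slides do not list layer shapes or state the counting convention.

```python
def conv_off_chip_code_balance(h, w, c_in, c_out, k):
    """Words moved per operation for one convolution layer.

    Assumes stride 1 with 'same' padding (output is h x w x c_out),
    every word transferred exactly once, and 1 MAC = 2 operations.
    """
    inputs = h * w * c_in
    weights = k * k * c_in * c_out
    outputs = h * w * c_out
    ops = 2 * h * w * c_out * c_in * k * k
    return (inputs + weights + outputs) / ops


def fc_off_chip_code_balance(n_in, n_out):
    """Words moved per operation for one fully connected layer."""
    inputs = n_in
    weights = n_in * n_out
    outputs = n_out
    ops = 2 * n_in * n_out
    return (inputs + weights + outputs) / ops


# Hypothetical shapes in the style of a mid-network VGG layer and a large FC layer.
conv_balance = conv_off_chip_code_balance(h=56, w=56, c_in=256, c_out=256, k=3)
fc_balance = fc_off_chip_code_balance(n_in=4096, n_out=4096)

print(f"conv: {conv_balance:.5f} words/op")
print(f"fc:   {fc_balance:.5f} words/op")
```

Under these assumptions the convolution layer lands well below 0.001 words per operation while the fully connected layer sits just above 0.5, because every FC weight is used for a single MAC. That is the same pattern as the per-layer off-chip values reported later in the deck.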
Break the Memory Wall
To achieve $P_{max}$, we need to satisfy $B_m^{Off} > B_C^{Off}$ and $B_m^{On} > B_C^{On}$
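Off-chip balance

| Layer | $B_C^{Off}$ |
|---|---|
| CONV1 | 0.0021 |
| CONV2 | 0.0010 |
| CONV3 | 0.00062 |
| CONV4 | 0.00087 |
| CONV5 | 0.00277 |
| FC | 0.5 |

| Technology Node | $B_m^{Off}$ |
|---|---|
| 28nm | 0.168 |
| 20nm | 0.217 |
| 14/16nm | 0.177 |
| 14/16nm w/ HBM | 0.177 |

• $B_C^{Off} < B_m^{Off}$ for convolution layers
• $B_C^{Off} > B_m^{Off}$ for the FC layer, but it contributes only a small portion of the computation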
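On-chip balance

| Technology Node | $B_m^{On}$ |
|---|---|
| 28nm | 0.0033 |
| 20nm | 0.0026 |
| 14/16nm | 0.0010 |

| Layer | $B_C^{On}$ |
|---|---|
| CONV1 | 1 |
| CONV2 | 1 |
| CONV3 | 1 |
| CONV4 | 1 |
| CONV5 | 1 |
| FC | 1 |

• $B_C^{On} > B_m^{On}$ for all layers
• On-chip memory bandwidth becomes the bottleneck
• We need to increase $B_m^{On}$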
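The comparisons on the off-chip and on-chip balance slides can be replayed mechanically. The sketch below copies the table values from those two slides and reports, for each technology node and layer, which memory level has a machine balance below the code balance. The dictionary and function names are illustrative only.

```python
# Values copied from the off-chip and on-chip balance tables in the slides.
B_C_OFF = {"CONV1": 0.0021, "CONV2": 0.0010, "CONV3": 0.00062,
           "CONV4": 0.00087, "CONV5": 0.00277, "FC": 0.5}
B_C_ON = {layer: 1.0 for layer in B_C_OFF}
B_M_OFF = {"28nm": 0.168, "20nm": 0.217, "14/16nm": 0.177}
B_M_ON = {"28nm": 0.0033, "20nm": 0.0026, "14/16nm": 0.0010}


def violated_levels(node, layer):
    """Memory levels where machine balance is below code balance."""
    levels = []
    if B_M_ON[node] < B_C_ON[layer]:
        levels.append("on-chip")
    if B_M_OFF[node] < B_C_OFF[layer]:
        levels.append("off-chip")
    return levels


for node in B_M_ON:
    for layer in B_C_OFF:
        print(f"{node:8s} {layer:6s} limited by: {violated_levels(node, layer)}")
```

With the slide values, every convolution layer is limited only on-chip, and the FC layer is limited at both levels, which matches the bullets on the two slides.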
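Data Reuse Requirement
[Bar chart: data reuse requirement per technology node. Categories: 28nm, 20nm, 14nm w/ HMC, 14nm w/ DDR. Bar labels: 303, 384, 72, 1000. Axis: 0 to 1200.]
• The balance analysis assumes unlimited on-chip memory size (load data once)
• Under limited on-chip memory capacity, we need to load data more than once
• To satisfy the external memory bandwidth requirement, we need to reuse data at least a minimum number of times (the data reuse requirement)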
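The slide does not show how the bar labels were derived. One observation, offered as an assumption rather than the authors' stated method, is that dividing the on-chip code balance of 1 by the $B_m^{On}$ values from the on-chip balance slide gives roughly 303, 385 and 1000, which lines up with three of the four labels. The sketch below does that division; the 14nm w/ HMC bar (72) has no matching table entry and is not reproduced.

```python
# On-chip machine balance per node, copied from the on-chip balance slide.
B_M_ON = {"28nm": 0.0033, "20nm": 0.0026, "14/16nm": 0.0010}
B_C_ON = 1.0  # on-chip code balance reported for every layer

# Assumed reading: required reuse = code balance / machine balance.
required_reuse = {node: B_C_ON / b_m for node, b_m in B_M_ON.items()}

for node, reuse in required_reuse.items():
    print(f"{node}: each word reused about {reuse:.1f} times")
```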
Optimization Direction
• Increase $B_m^{On}$ to utilize all DSP resources
• Satisfy the data reuse requirement with limited on-chip memory size
Matrix Multiplication Kernel
[Figure: the OpenCL FPGA framework diagram ((a) Top Level, (b) Compute Unit Level, (c) Processing Element Level) shown next to a PE array in which four PEs are drawn with four BRAMs. Annotation: Enable Multicasting.]
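A simple way to see why multicasting saves BRAM is to count operand buffers for a rectangular PE array. The counting model below is a hypothetical sketch, not the authors' design: without multicasting each PE keeps a private buffer per operand, and with multicasting one buffer per PE row and one per PE column is read once and delivered to every PE in that row or column.

```python
def buffers_without_multicast(rows, cols, operands_per_pe=2):
    """Every PE keeps a private buffer for each of its operands."""
    return rows * cols * operands_per_pe


def buffers_with_multicast(rows, cols):
    """One buffer per PE row plus one per PE column (assumed sharing scheme).

    Each buffer's read data is multicast to all PEs in its row or column.
    """
    return rows + cols


for rows, cols in [(2, 2), (8, 8), (32, 32)]:
    private = buffers_without_multicast(rows, cols)
    shared = buffers_with_multicast(rows, cols)
    print(f"{rows}x{cols} PEs: {private} buffers -> {shared} buffers "
          f"({private / shared:.0f}x fewer)")
```

For a 2x2 array this gives 8 buffers versus 4. The saving grows with array size because the private scheme scales with the number of PEs while the shared scheme scales with the array's side lengths.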
BRAM usage reduction
By multicasting, we can increase the on-chip machine balance by $N_{DSP}$
Experiment on Arria 10 GX1150 FPGA (w/ 1518 DSPs and 2713 M20K memory blocks):
Optimization Direction
• Increase $B_m^{On}$ to utilize all DSP resources
• Satisfy the data reuse requirement with limited on-chip memory size
Kernel Design
[Figure: kernel block diagram. Labels: 2D Dispatcher; Compute Units (four shown); Buffer; Pointer Parameters; Shift Register; Crossbar; External Memory.]
CU Design
[Figure (a): compute unit datapath. Labels: INPUT A (512x); INPUT B (512x); BRAM 32 bit x 512; array of MAC units; 32x links between BRAMs and MACs; FF; MUX; Output Address; Output (512x); 8x Duplication.]
Work-item Scheduling
2D Scheduling
[Figure: an m × n grid of work-items indexed (0,0) to (m,n), shown twice. (a) One Dimensional Dispatching. (b) Two Dimensional Dispatching.]
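The difference between the two dispatch orders can be sketched with a toy model. It assumes work-item (i, j) computes one output element from row i of one input matrix and column j of the other, which fits the matrix multiplication kernel described earlier but is not spelled out on this slide. The function names and the 16x16 grid are hypothetical.

```python
def dispatch_1d(m, n, batch):
    """Row-major order, cut into consecutive batches of `batch` work-items."""
    ids = [(i, j) for i in range(m) for j in range(n)]
    return [ids[k:k + batch] for k in range(0, len(ids), batch)]


def dispatch_2d(m, n, tile_m, tile_n):
    """Batches are tile_m x tile_n blocks, visited in row-major order."""
    return [[(i, j)
             for i in range(ti, min(ti + tile_m, m))
             for j in range(tj, min(tj + tile_n, n))]
            for ti in range(0, m, tile_m)
            for tj in range(0, n, tile_n)]


def operand_vectors(batch):
    """Distinct input rows plus distinct input columns a batch must load."""
    rows = {i for i, _ in batch}
    cols = {j for _, j in batch}
    return len(rows) + len(cols)


one_d = dispatch_1d(16, 16, batch=16)
two_d = dispatch_2d(16, 16, tile_m=4, tile_n=4)

print("1D batch loads", operand_vectors(one_d[0]), "vectors")
print("2D batch loads", operand_vectors(two_d[0]), "vectors")
```

Both batches contain 16 work-items, but the square batch touches fewer distinct rows and columns, so each loaded vector is reused by more work-items.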
Minimizing external memory bandwidth
• Objective:
– Find the optimized scheduling policy
– To minimize external memory bandwidth
• Constraint:
– On-chip memory size
• Based on:
– CNN layer parameters
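The optimization stated above (minimize external traffic, subject to on-chip memory size, given the layer dimensions) can be sketched as a brute-force search over tile shapes for a matrix multiplication C = A × B. The traffic and capacity formulas below are a simplified model chosen for illustration, and all names and sizes are hypothetical. The slides do not give the authors' actual formulation.

```python
def on_chip_words(k, tm, tn):
    """Words held on chip for one tm x tn output tile with inner size k."""
    return tm * k + tn * k + tm * tn


def external_traffic(m, n, k, tm, tn):
    """Words moved to and from external memory over the whole product.

    Each output tile loads tm rows of A and tn columns of B (k words each)
    and stores tm * tn results. Assumes tm divides m and tn divides n.
    """
    tiles = (m // tm) * (n // tn)
    return tiles * on_chip_words(k, tm, tn)


def best_tile(m, n, k, capacity):
    """Return (traffic, tm, tn) minimizing traffic within `capacity` words."""
    best = None
    for tm in range(1, m + 1):
        if m % tm != 0:
            continue
        for tn in range(1, n + 1):
            if n % tn != 0:
                continue
            if on_chip_words(k, tm, tn) > capacity:
                continue
            cost = external_traffic(m, n, k, tm, tn)
            if best is None or cost < best[0]:
                best = (cost, tm, tn)
    return best


# Hypothetical problem: 64x64x64 product, 4096 words of on-chip memory.
traffic, tm, tn = best_tile(64, 64, 64, capacity=4096)
row_strip = external_traffic(64, 64, 64, tm=1, tn=32)
print(f"best tile {tm}x{tn}: {traffic} words (1x32 strip: {row_strip} words)")
```

In this toy instance the search settles on a near-square tile and moves roughly a tenth of the data that a one-row strip does, which is the effect 2D scheduling is meant to capture.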
Optimization result
The optimization can effectively reduce the off-chip memory bandwidth requirement
Experiment Setup
• FPGA: Arria 10 GX1150
– 1518 DSPs
– 2713 M20K memory blocks
– 1 GB DDR4 with 17 GB/s bandwidth
• Kernel implemented as an OpenCL IP library package using Verilog
[Photo: Arria 10 GX FPGA Development Kit]
Implementation Result

| Resource | Total | Ours | Percentage (%) |
|---|---|---|---|
| DSP | 1518 | 1320 | 86 |
| BRAM | 2713 | 1250 | 46 |
| Logic | 1506k | 437k | 43 |

Frequency: 370 MHz
VGG Performance

| Layers | Number of Ops | Duration (ms) | Performance (GOP/s) |
|---|---|---|---|
| Conv | 30.69G | 11.9 | 2568 |
| FC | 0.073G | 5.5 | 13 |
| Total | 30.76G | 17.4 | 1790 |

Overall VGG Classification Throughput: 57 image/s
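The throughput column and the image rate follow from the operation counts and durations. The sketch below recomputes them from the table, assuming the durations are per image and that the operation counts are in units of 10^9 operations. Recomputed values land within a few percent of the printed ones.

```python
# (operations in GOP, duration in ms), copied from the VGG Performance table.
VGG_ROWS = {
    "Conv": (30.69, 11.9),
    "FC": (0.073, 5.5),
    "Total": (30.76, 17.4),
}


def gop_per_second(gop, duration_ms):
    """Throughput in GOP/s from an operation count and a duration."""
    return gop / (duration_ms / 1000.0)


def images_per_second(duration_ms):
    """Classification rate if one image takes duration_ms end to end."""
    return 1000.0 / duration_ms


for name, (gop, ms) in VGG_ROWS.items():
    print(f"{name}: {gop_per_second(gop, ms):.1f} GOP/s")
print(f"{images_per_second(VGG_ROWS['Total'][1]):.1f} image/s")
```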
Performance Comparison

| | Suda et al. [1] | Qiu et al. [2] | Ours |
|---|---|---|---|
| Platform | Stratix V GSD8 | Zynq XC7Z045 | Arria 10 GX1150 |
| Performance (GOP/s) | 117.8 | 136.9 | 1790 |
| Performance density (OPs/DSP/cycle) | 0.36 | 1.17 | 3.06 |
| Power efficiency (GOP/J) | 1.84 | 14.22 | 47.88 |

[1] N. Suda, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. ISFPGA 2016
[2] J. Qiu, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. ISFPGA 2016
Summary
• Motivation
• Balance analysis model
• Our work:
– Multicasting among PEs
– 2D scheduling
• Performance comparison
Thanks!
Q&A