Code Generation from a Domain-specific Language for C-based HLS of Hardware Accelerators
Oliver Reiche, Moritz Schmid, Richard Membarth, Frank Hannig, and Jürgen Teich
Hardware/Software Co-Design, University of Erlangen-Nürnberg
CODES+ISSS, October 14, 2014, New Delhi
-
Motivation: e.g., Driver Assistance Systems
Mostly based on image feature detection:
(a) Edge detection (b) Corner detection (c) Optical flow
Where to compute features?
ECU [4] µC [5] CPU [2] GPU [3] FPGA [1]
Write once, decide later!
CODES+ISSS’14 | Oliver Reiche | Hardware/Software Co-Design | Code Generation from a DSL for C-based HLS of Hardware Accelerators 0
-
Outline
HIPAcc Framework
FPGA Targets
Results
Conclusion
-
HIPAcc Framework
-
HIPAcc: The Heterogeneous Image Processing Acceleration Framework
• C++ embedded DSL
• Source-to-source compiler (Clang/LLVM) with domain knowledge and architecture knowledge
• Targets: CUDA (GPU), OpenCL (x86/GPU), Renderscript (x86/ARM/GPU), C/C++ (x86)
• CUDA/OpenCL/Renderscript runtime library
-
HIPAcc: The Heterogeneous Image Processing Acceleration Framework
Domain-specific Extensions
• IterationSpace: defines ROI of the output image
• Accessor: input ROI with filtering
• BoundaryCondition: boundary handling modes
• Mask: convolution mask
Figure: output image; crop of output image; crop of output image with offset
-
Figure: image and boundary; image crop; image crop with offset; image offset
-
Figure: the four boundary handling modes, each shown on a 4×4 image (pixels A to P) together with its out-of-bounds neighborhood.
• Repeat: the image continues periodically (left of A B C D comes B C D)
• Clamp: the nearest edge pixel is replicated (left of A B C D comes A A A)
• Mirror: the image is reflected at the border, edge pixel included (left of A B C D comes C B A)
• Constant: every out-of-bounds pixel takes a fixed value (Q)
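The four modes can be expressed as a coordinate mapping. The following is a minimal sketch in plain C++, not HIPAcc's generated code; the names Boundary, resolve_index, and read_pixel are hypothetical, and the overshoot is assumed to be at most one image size.

```cpp
#include <cassert>

// Boundary handling modes named on the slide.
enum class Boundary { Repeat, Clamp, Mirror, Constant };

// Maps a possibly out-of-bounds coordinate to a valid index in [0, size).
// Returns -1 for Boundary::Constant when the coordinate lies outside the
// image, so the caller can substitute the constant value.
inline int resolve_index(int i, int size, Boundary mode) {
    assert(size > 0 && i >= -size && i < 2 * size);
    if (i >= 0 && i < size) return i;
    switch (mode) {
        case Boundary::Repeat:   return (i + size) % size;                  // wrap around
        case Boundary::Clamp:    return i < 0 ? 0 : size - 1;               // nearest edge
        case Boundary::Mirror:   return i < 0 ? -i - 1 : 2 * size - 1 - i;  // reflect, edge included
        case Boundary::Constant: return -1;                                 // caller uses constant
    }
    return -1;
}

// Reads pixel (x, y) from a row-major image using the given mode.
inline char read_pixel(const char* img, int width, int height,
                       int x, int y, Boundary mode, char constant) {
    const int xi = resolve_index(x, width, mode);
    const int yi = resolve_index(y, height, mode);
    if (xi < 0 || yi < 0) return constant;
    return img[yi * width + xi];
}
```

With the 4×4 image A to P stored row by row, reading x = -1 in the first row yields D for Repeat, A for Clamp, A for Mirror, and the constant for Constant, which matches the neighborhoods in the figure.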
-
Figure: Mask. Surface plot of f(x, y) over x, y from −10 to 10 with values from 0 to 1, and the resulting 3×3 convolution mask:
0.0571 0.1248 0.0571
0.1248 0.2725 0.1248
0.0571 0.1248 0.0571
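Coefficients like these can be produced by sampling a Gaussian and normalizing the samples to sum to 1. The sketch below is plain C++ with a hypothetical function name; the slide does not state the standard deviation, so sigma = 0.8 is an assumption that happens to reproduce the values above to about four decimals.

```cpp
#include <cassert>
#include <cmath>

// Fills a normalized (2*radius+1) x (2*radius+1) Gaussian mask in row-major
// order. sigma is the standard deviation in pixels.
inline void gaussian_mask(float* mask, int radius, float sigma) {
    assert(radius >= 0 && sigma > 0.0f);
    const int n = 2 * radius + 1;
    float sum = 0.0f;
    for (int y = -radius; y <= radius; ++y) {
        for (int x = -radius; x <= radius; ++x) {
            const float w = std::exp(-(x * x + y * y) / (2.0f * sigma * sigma));
            mask[(y + radius) * n + (x + radius)] = w;
            sum += w;
        }
    }
    // Normalize so that the coefficients sum to 1.
    for (int i = 0; i < n * n; ++i) mask[i] /= sum;
}
```

Calling gaussian_mask(m, 1, 0.8f) on a nine-element array fills the center, edge, and corner entries with values close to 0.2725, 0.1248, and 0.0571.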
-
Example: Laplacian Operator
    // coefficients for Laplacian operator
    const int coef[3][3] = { { 0,  1, 0 },
                             { 1, -4, 1 },
                             { 0,  1, 0 } };

    Mask<int> mask(coef);
    Image<uchar4> in(width, height);
    Image<uchar4> out(width, height);

    // load image data
    in = image_bits;

    // reading from in with mirroring as boundary condition
    BoundaryCondition<uchar4> bound(in, mask, BOUNDARY_MIRROR);
    Accessor<uchar4> acc(bound);

    // output image
    IterationSpace<uchar4> iter(out);

    // define kernel
    Laplacian filter(iter, acc, mask);

    // execute kernel
    filter.execute();
-
Example: Laplacian Operator Kernel
    class Laplacian : public Kernel<uchar4> {
      private:
        Accessor<uchar4> &input;
        Mask<int> &mask;

      public:
        Laplacian(IterationSpace<uchar4> &iter,
                  Accessor<uchar4> &input, Mask<int> &mask)
            : Kernel(iter), input(input), mask(mask) {
          addAccessor(&input);
        }

        void kernel() {
          // convolution call: the lambda is evaluated for every mask
          // element and the results are summed up (HipaccSUM)
          int4 sum = convolve(mask, HipaccSUM, [&] () -> int4 {
            return mask() * convert_int4(input(mask));
          });
          sum = max(sum, 0);
          sum = min(sum, 255);
          output() = convert_uchar4(sum);
        }
    };
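For comparison, the same computation can be written out by hand. This is a plain C++ sketch of what the kernel describes, not code produced by HIPAcc: it assumes a single-channel 8-bit image instead of uchar4, and the helper names mirror and laplacian_reference are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirror boundary handling with the edge pixel repeated.
static int mirror(int i, int size) {
    if (i < 0) return -i - 1;
    if (i >= size) return 2 * size - 1 - i;
    return i;
}

// 3x3 Laplacian on a row-major, single-channel 8-bit image.
std::vector<uint8_t> laplacian_reference(const std::vector<uint8_t>& in,
                                         int width, int height) {
    assert(width > 0 && height > 0 &&
           in.size() == static_cast<std::size_t>(width) * height);
    static const int coef[3][3] = {{0, 1, 0}, {1, -4, 1}, {0, 1, 0}};
    std::vector<uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Explicit loops over the mask replace the convolve() call.
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += coef[dy + 1][dx + 1] *
                           in[mirror(y + dy, height) * width + mirror(x + dx, width)];
            sum = sum < 0 ? 0 : sum;      // max(sum, 0)
            sum = sum > 255 ? 255 : sum;  // min(sum, 255)
            out[y * width + x] = static_cast<uint8_t>(sum);
        }
    }
    return out;
}
```

The boundary handling, the mask loops, and the index arithmetic all have to be spelled out here, whereas the DSL version delegates them to BoundaryCondition, Mask, and convolve.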
-
FPGA Targets
-
Workflow for FPGA Targets
Figure: workflow diagram with the nodes C++ embedded DSL, domain/architecture knowledge, Vivado HLS, RTL simulation, golden reference, and FPGA; the edges are labeled C++ and HDL.
-
Generating the Streaming Pipeline
Trace host code and translate it to an internal representation:
• model as combination of processes and spaces
• create unique stream objects for each space
• identify memory reuse
• insert copy processes
• build dependency graph
• traverse in depth-first search starting from output spaces
Figure: mapping of Kernel, IterationSpace, Image, and Accessor objects to processes and spaces. An image consumed by a single kernel needs 1× stream; an image consumed by two kernels through two accessors needs 3× streams and 1× copy process.
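The steps above can be sketched on a small graph. This is an illustrative C++ model, not HIPAcc's internal representation: Process, Pipeline, and build_pipeline are hypothetical names, every process is assumed to write exactly one space, and a space with more than one reader is taken as the place where a copy process is needed.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// A process (kernel) reads some spaces (images) and writes exactly one.
struct Process {
    std::string name;
    std::vector<std::string> inputs;  // spaces read through accessors
    std::string output;               // space written via the iteration space
};

struct Pipeline {
    std::vector<std::string> order;          // producers before consumers
    std::map<std::string, int> copy_fanout;  // space -> readers, if more than one
};

// Depth-first traversal starting at the output space.
Pipeline build_pipeline(const std::vector<Process>& procs,
                        const std::string& output_space) {
    std::map<std::string, const Process*> producer;  // space -> writing process
    for (const Process& p : procs) producer[p.output] = &p;

    Pipeline result;
    std::set<std::string> visited;       // processes already scheduled
    std::map<std::string, int> readers;  // space -> number of reading accessors

    std::function<void(const std::string&)> visit = [&](const std::string& space) {
        auto it = producer.find(space);
        if (it == producer.end()) return;             // external input image
        const Process* p = it->second;
        if (!visited.insert(p->name).second) return;  // already scheduled
        for (const std::string& in : p->inputs) {
            ++readers[in];
            visit(in);
        }
        result.order.push_back(p->name);              // post-order: inputs first
    };
    visit(output_space);

    for (const auto& kv : readers)
        if (kv.second > 1) result.copy_fanout[kv.first] = kv.second;
    return result;
}
```

For a graph in which process A writes s1, processes B and C both read s1, and process D combines their results, the traversal from D's output schedules A, B, C, D and reports s1 with two readers, which corresponds to the "3× streams, 1× copy process" case in the figure.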
-
Streaming Pipeline: Example
Transform sequential execution order ...
Figure: HIPAcc's sequential execution for the Harris corner detector (kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc between input and output, executed one after another)
... into streaming pipeline of Vivado kernels.
Figure: Representation of Vivado kernels (the same kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc connected between input and output)
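The idea of a streaming pipeline with a copy process can be illustrated in ordinary C++. The sketch below is not the Harris corner detector and not generated code: Stream is a hypothetical queue-based stand-in for a hardware FIFO, offering write, read, and empty members in the style of Vivado HLS's hls::stream, and the stages scale and add are made-up point operators.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Queue-based stand-in for a hardware FIFO stream.
template <typename T>
class Stream {
public:
    void write(const T& v) { q_.push(v); }
    T read() {
        assert(!q_.empty());
        T v = q_.front();
        q_.pop();
        return v;
    }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

// Copy process: duplicates one element onto two output streams so that
// two consumers can read the same data.
template <typename T>
void copy2(Stream<T>& in, Stream<T>& a, Stream<T>& b) {
    const T v = in.read();
    a.write(v);
    b.write(v);
}

// Hypothetical point-operator stages, each consuming one element per call.
inline void scale(Stream<int>& in, Stream<int>& out, int factor) {
    out.write(in.read() * factor);
}
inline void add(Stream<int>& a, Stream<int>& b, Stream<int>& out) {
    out.write(a.read() + b.read());
}

// Pixels flow through all stages one at a time; no intermediate image
// is ever stored completely.
std::vector<int> run_pipeline(const std::vector<int>& pixels) {
    Stream<int> in, in_a, in_b, sa, sb, out;
    std::vector<int> result;
    for (int p : pixels) {
        in.write(p);
        copy2(in, in_a, in_b);
        scale(in_a, sa, 2);
        scale(in_b, sb, 3);
        add(sa, sb, out);
        result.push_back(out.read());
    }
    return result;
}
```

In hardware the stages would run concurrently, each as its own process; the sequential loop here only emulates the data flow.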
-
Mapping of Local Operators
For each local operator kernel, a separate line buffer and window buffer are allocated.
Figure: Line buffer and group delay. Image data is streamed by row and column into the line buffer and the window; window entries that are not yet filled are marked "?".
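A line buffer and window for a 3×3 local operator can be sketched as follows. This is a software model under stated assumptions, not the generated Vivado HLS code: the class name Window3x3 is hypothetical, pixels are plain int values, and boundary handling is left out.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Streaming 3x3 window: two line buffers hold the two previous rows,
// the window holds the current neighborhood.
class Window3x3 {
public:
    explicit Window3x3(int width)
        : width_(width), line0_(width, 0), line1_(width, 0), col_(0) {
        assert(width > 0);
        for (auto& row : win_) row.fill(0);
    }

    // Pushes the next pixel in raster order. Afterwards the bottom-right
    // window element is that pixel.
    void push(int pixel) {
        // Shift the window one column to the left.
        for (auto& row : win_) {
            row[0] = row[1];
            row[1] = row[2];
        }
        // New right column: two rows from the line buffers, plus the pixel.
        win_[0][2] = line0_[col_];
        win_[1][2] = line1_[col_];
        win_[2][2] = pixel;
        // Age the line buffers for this column.
        line0_[col_] = line1_[col_];
        line1_[col_] = pixel;
        col_ = (col_ + 1) % width_;
    }

    int at(int row, int col) const { return win_[row][col]; }

private:
    int width_;
    std::vector<int> line0_, line1_;  // rows y-2 and y-1
    std::array<std::array<int, 3>, 3> win_;
    int col_;
};
```

The window centered at (x−1, y−1) is complete only after pixel (x, y) has been pushed, so the output lags the input by one row plus one pixel. That lag is the group delay of a 3×3 operator in this model.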
-
Packing Vector Types
Pack multiple vector channels into a wider stream type (e.g., uint32_t)
• use only a single window and line buffer for all channels
• apply vector operations by operator overloading
Figure: RGBA vector channels packed into a single stream. The input RGBA pixel arrives on one Vivado HLS stream, the operator applies f to each of the R, G, B, and A channels, and the output RGBA pixel leaves on one stream.
The result is basically a SIMD unit realized in an FPGA.
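Packing and element-wise operators can be sketched in a few lines of C++. The type Pixel4 and the functions pack, unpack, and the saturating operator+ are hypothetical names chosen for illustration; the channel order within the 32-bit word is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Four 8-bit channels of one RGBA pixel.
struct Pixel4 {
    uint8_t r, g, b, a;
};

// Packs the four channels into one 32-bit stream word (r in the low byte).
inline uint32_t pack(Pixel4 p) {
    return static_cast<uint32_t>(p.r) |
           (static_cast<uint32_t>(p.g) << 8) |
           (static_cast<uint32_t>(p.b) << 16) |
           (static_cast<uint32_t>(p.a) << 24);
}

// Restores the four channels from a packed stream word.
inline Pixel4 unpack(uint32_t v) {
    return Pixel4{static_cast<uint8_t>(v), static_cast<uint8_t>(v >> 8),
                  static_cast<uint8_t>(v >> 16), static_cast<uint8_t>(v >> 24)};
}

// Saturating 8-bit addition for one channel.
inline uint8_t sat_add(uint8_t x, uint8_t y) {
    const int s = x + y;
    return static_cast<uint8_t>(s > 255 ? 255 : s);
}

// Operator overloading applies the same operation to every channel.
inline Pixel4 operator+(Pixel4 x, Pixel4 y) {
    return Pixel4{sat_add(x.r, y.r), sat_add(x.g, y.g),
                  sat_add(x.b, y.b), sat_add(x.a, y.a)};
}
```

A kernel written against Pixel4 reads one packed word per pixel, unpacks it, applies the overloaded operators, and packs the result again, so only one stream, one window, and one line buffer are needed for all four channels.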
-
Results
-
Experimental Setup
Algorithms
Three feature detection algorithms:
(a) Laplacian operator (b) Harris corner detector (c) Census transform (optical flow)
Evaluation Environment
Zynq 7045 Kintex FPGA
Image size: 1024×1024 pixels
Xilinx OpenCV: a Vivado HLS-specific image processing library
-
Results: Performance OpenCV vs. HIPAcc
Figure: bar chart of throughput in MPixel/s (axis from 50 to 350) for LPHV 3×3, LPD 3×3, LP 5×5, and HC, comparing OpenCV and HIPAcc.
-
Results: Resource Usage OpenCV vs. HIPAcc
Figure: hardware resources in % (LUT, FF, BRAM, DSP; axis from 0 to 40) and maximum clock frequency Fmax in MHz (axis from 0 to 400) for LPHV 3×3, LPD 3×3, LP 5×5, and HC, with one bar group each for OpenCV (O) and HIPAcc (H).
-
Results: Performance GPU vs. FPGA
Figure: throughput in MPixel/s on a logarithmic axis (10^0 to 10^4) for LPHV 3×3, LPD 3×3, LP 5×5, HC, and OF on Mali-T604, Zynq 7045, and Tesla K20.
All implementations stem from the exact same DSL code base.
-
Conclusion
-
Conclusion
Advantages of DSL-based Approach
• Productivity: compact algorithm description, less error-prone
• Performance: efficient target-specific code generation
• Portability: flexible target choice; performance portability, not just functional portability
HIPAcc DSL code serves as baseline implementation ⇒ Test bench
Moving to a higher abstraction level than HLS allows design decisions to be postponed further and thereby achieves higher quality.
-
Questions?
Thanks for listening. Any questions?
http://github.com/hipacc/hipacc-vivado
-
References
-
References I
[1] Dake. Fpga xilinx spartan. url: http://en.wikipedia.org/wiki/Field-programmable_gate_array#mediaviewer/File:Fpga_xilinx_spartan.jpg
[2] Henriok. LPC18A1-and-A7. url: http://commons.wikimedia.org/wiki/Category:ARM_Cortex-M#mediaviewer/File:LPC18A1-and-A7.jpg
[3] Poeggi. NVIDIA T20 and T30 chips. url: http://en.wikipedia.org/wiki/Tegra#mediaviewer/File:NVIDIA_T20_and_T30_chips.jpg
[4] Ildar Sagdejev. 2008-04-17 ECU. url: http://en.wikipedia.org/wiki/Electronic_control_unit#mediaviewer/File:2008-04-17_ECU.jpg
[5] Ioan Sameli. File:Intel 8742 153056995.jpg. url: http://en.wikipedia.org/wiki/File:Intel_8742_153056995.jpg
-
Backup Slides
-
Results: Power Consumption
Table: Comparison of throughput (TP) and energy consumption (E) for the ARM Mali-T604, Xilinx Zynq 7045, and Nvidia Tesla K20.
            Mali-T604            Zynq 7045            Tesla K20
            TP [fps]  E [fpW]    TP [fps]  E [fpW]    TP [fps]  E [fpW]
LPHV 3x3    100.4     41.8       333.3     1423.1     10000.0   74.1
LPD 3x3     62.2      25.9       324.7     1387.6     6250.0    46.3
LP 5x5      22.0      9.2        209.2     846.9      2631.6    19.5
HC          11.9      4.9        228.3     458.5      1098.9    8.1
OF          0.4       0.2        192.5     409.6      452.5     3.4
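The frame rates in the table relate to the MPixel/s figures through the image size of 1024×1024 pixels. The helpers below are a small arithmetic sketch with hypothetical names; reading "fpW" as frames per second per watt is an assumption, and under it the ratio TP/E gives an implied average power.

```cpp
#include <cassert>

// Throughput in MPixel/s for a given frame rate and image size.
inline double mpixel_per_s(double fps, int width, int height) {
    assert(width > 0 && height > 0);
    return fps * width * height / 1e6;
}

// Implied average power in watts, ASSUMING the E column is frames per
// second per watt.
inline double implied_power_w(double fps, double fps_per_watt) {
    assert(fps_per_watt > 0.0);
    return fps / fps_per_watt;
}
```

For the Zynq 7045 running LPHV 3x3, mpixel_per_s(333.3, 1024, 1024) comes to roughly 349.5 MPixel/s, which is consistent with the upper end of the throughput axis in the OpenCV vs. HIPAcc chart.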