
Code Generation from a Domain-specific Language for C-based HLS of Hardware Accelerators

Oliver Reiche, Moritz Schmid, Richard Membarth, Frank Hannig, and Jürgen Teich

Hardware/Software Co-Design, University of Erlangen-Nürnberg

CODES+ISSS, October 14, 2014, New Delhi

Motivation: e. g. Driver Assistance Systems

Mostly based on image feature detection:

(a) Edge detection (b) Corner detection (c) Optical flow

Where to compute features?

ECU [4] µC [5] CPU [2] GPU [3] FPGA [1]

Write once, decide later!

CODES+ISSS’14 | Oliver Reiche | Hardware/Software Co-Design | Code Generation from a DSL for C-based HLS of Hardware Accelerators

Outline

HIPAcc Framework

FPGA Targets

Results

Conclusion

HIPAcc Framework

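HIPAcc: The Heterogeneous Image Processing Acceleration Framework

Figure: HIPAcc overview. A C++ embedded DSL is translated by a source-to-source compiler based on Clang/LLVM, using domain knowledge and architecture knowledge, to one of the following targets, supported by a CUDA/OpenCL/Renderscript runtime library:

• CUDA (GPU)
• OpenCL (x86/GPU)
• Renderscript (x86/ARM/GPU)
• C/C++ (x86)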



Domain-specific Extensions

• IterationSpace: defines the ROI of the output image
• Accessor: input ROI with filtering
• BoundaryCondition: boundary handling mode
• Mask: convolution mask

Figure: Output image; crop of output image; crop of output image with offset.

Figure: Image and boundary; image crop; image crop with offset; image offset.






Figure: Boundary handling modes, shown for a 4×4 image with pixels A to P and a border of three pixels: Repeat (the image is continued periodically), Clamp (the nearest edge pixel is replicated), Mirror (the image is reflected at its edges, with the edge pixel included), and Constant (every out-of-range pixel reads a fixed value Q).
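The four boundary handling modes can be read as a per-axis coordinate mapping. The following is a minimal sketch of that reading, not HIPAcc's generated code; the names Boundary and map_index and the sentinel -1 for the Constant mode are hypothetical, chosen for illustration only.

```cpp
// Boundary handling modes as named on the slide.
enum class Boundary { Repeat, Clamp, Mirror, Constant };

// Maps a coordinate i, which may lie outside the image, onto [0, n)
// for an axis of length n > 0.
// Returns -1 for Constant when i is out of range; the caller is then
// expected to substitute the constant value instead of reading a pixel.
int map_index(int i, int n, Boundary mode) {
    if (i >= 0 && i < n) return i;  // inside the image: nothing to do
    switch (mode) {
    case Boundary::Repeat: {
        int m = i % n;              // remainder may be negative in C++
        return m < 0 ? m + n : m;   // periodic continuation
    }
    case Boundary::Clamp:
        return i < 0 ? 0 : n - 1;   // replicate the nearest edge pixel
    case Boundary::Mirror: {
        // Reflection with the edge pixel included: -1 -> 0, -2 -> 1, n -> n-1.
        int period = 2 * n;
        int m = i % period;
        if (m < 0) m += period;
        return m < n ? m : period - 1 - m;
    }
    case Boundary::Constant:
        return -1;                  // signal: use the constant value
    }
    return -1;
}
```

For a row of four pixels A B C D, this mapping sends index -1 to D under Repeat and to A under both Clamp and Mirror, which matches the border pixels drawn in the figure.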






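Figure: Surface plot of a mask function f(x, y) for x and y between −10 and 10, with values between 0 and 1, next to the resulting 3×3 mask coefficients:

0.0571  0.1248  0.0571
0.1248  0.2725  0.1248
0.0571  0.1248  0.0571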


Example: Laplacian Operator

    // coefficients for Laplacian operator
    const int coef[3][3] = { { 0,  1, 0 },
                             { 1, -4, 1 },
                             { 0,  1, 0 } };

    Mask<int> mask(coef);
    Image<uchar4> in(width, height);
    Image<uchar4> out(width, height);

    // load image data
    in = image_bits;

    // reading from in with mirroring as boundary condition
    BoundaryCondition<uchar4> bound(in, mask, BOUNDARY_MIRROR);
    Accessor<uchar4> acc(bound);

    // output image
    IterationSpace<uchar4> iter(out);

    // define kernel
    Laplacian filter(iter, acc, mask);

    // execute kernel
    filter.execute();


Example: Laplacian Operator Kernel

    class Laplacian : public Kernel<uchar4> {
      private:
        Accessor<uchar4> &input;
        Mask<int> &mask;

      public:
        Laplacian(IterationSpace<uchar4> &iter,
                  Accessor<uchar4> &input, Mask<int> &mask)
            : Kernel(iter), input(input), mask(mask) {
          addAccessor(&input);
        }

        void kernel() {
          // convolution call: sums mask coefficient times input pixel
          // over the whole mask
          int4 sum = convolve(mask, HipaccSUM, [&] () -> int4 {
            return mask() * convert_int4(input(mask));
          });
          // clamp each channel to the 8-bit range before writing
          sum = max(sum, 0);
          sum = min(sum, 255);
          output() = convert_uchar4(sum);
        }
    };
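For comparison, the same computation can be written as a plain C++ reference loop. This is only a sketch under simplifying assumptions: it handles a single 8-bit channel instead of a four-channel pixel, stores the image row-major in a std::vector, and uses the hypothetical helper names mirror and laplacian_ref. It is not the code HIPAcc generates.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reflects an out-of-range coordinate back into [0, n), edge pixel included.
static int mirror(int i, int n) {
    int period = 2 * n;
    int m = i % period;
    if (m < 0) m += period;
    return m < n ? m : period - 1 - m;
}

// Reference 3x3 Laplacian on one 8-bit channel with mirrored boundaries.
std::vector<std::uint8_t> laplacian_ref(const std::vector<std::uint8_t>& in,
                                        int width, int height) {
    static const int coef[3][3] = { { 0,  1, 0 },
                                    { 1, -4, 1 },
                                    { 0,  1, 0 } };
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Accumulate coefficient times pixel over the 3x3 neighborhood.
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = mirror(y + dy, height);
                    int xx = mirror(x + dx, width);
                    sum += coef[dy + 1][dx + 1] * in[yy * width + xx];
                }
            }
            // Clamp to the 8-bit range, as the DSL kernel does with max/min.
            out[y * width + x] =
                static_cast<std::uint8_t>(std::min(std::max(sum, 0), 255));
        }
    }
    return out;
}
```

The explicit loops, index arithmetic, and boundary handling here are the parts that the DSL description leaves to the compiler.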


FPGA Targets

Workflow for FPGA Targets

Figure: Workflow for FPGA targets, with the stages C++ embedded DSL, domain/architecture knowledge, generated C++, Vivado HLS, HDL, RTL simulation against a golden reference, and FPGA.


Generating the Streaming Pipeline

Trace host code and translate it to internal representation:

• model as a combination of processes and spaces
• create unique stream objects for each space
• identify memory reuse
• insert copy processes
• build dependency graph
• traverse in depth-first search starting from output spaces

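Figure: Two mappings of host code objects to the internal representation. Kernel, then IterSpace/Image/Accessor, then Kernel maps to process, space, process (1× stream). One kernel feeding two kernels through two Accessors maps to three spaces joined by an additional process (3× streams, 1× copy process).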
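The steps listed for building the streaming pipeline can be sketched with a small graph of processes and spaces. The sketch below is an assumption-laden illustration, not HIPAcc's internal representation: the Process struct and the functions consumers, streams_for, and schedule are hypothetical names, and the stream-counting rule (one stream per space, or one stream into a copy process plus one per reader when a space has several readers) is one reading of the "1× stream" and "3× streams, 1× copy process" annotations.

```cpp
#include <set>
#include <string>
#include <vector>

// A process (kernel) reads some spaces and writes exactly one space.
struct Process {
    std::string name;
    std::vector<std::string> inputs;  // names of the spaces read
    std::string output;               // name of the space written
};

// Number of reads of a given space across all processes.
int consumers(const std::vector<Process>& procs, const std::string& space) {
    int n = 0;
    for (const Process& p : procs)
        for (const std::string& in : p.inputs)
            if (in == space) ++n;
    return n;
}

// Streams needed for a space: a stream can be read only once, so a space
// with several readers gets one stream into a copy process plus one stream
// per reader (assumed rule, see lead-in).
int streams_for(const std::vector<Process>& procs, const std::string& space) {
    int n = consumers(procs, space);
    return n > 1 ? n + 1 : 1;
}

// Depth-first traversal from a space back towards its producers.
static void visit(const std::vector<Process>& procs, const std::string& space,
                  std::set<std::string>& visited,
                  std::vector<std::string>& order) {
    for (const Process& p : procs) {
        if (p.output != space || visited.count(p.name)) continue;
        visited.insert(p.name);
        for (const std::string& in : p.inputs) visit(procs, in, visited, order);
        order.push_back(p.name);  // emitted after all of its producers
    }
}

// Returns process names in an order where producers precede consumers.
std::vector<std::string> schedule(const std::vector<Process>& procs,
                                  const std::string& output_space) {
    std::set<std::string> visited;
    std::vector<std::string> order;
    visit(procs, output_space, visited, order);
    return order;
}
```

With three processes, dx and dy both reading a space "in" and a third process reading both of their results, schedule starting from the final output space yields dx, dy, and then the third process, and streams_for reports three streams for "in".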


Streaming Pipeline: Example

Transform sequential execution order . . .

Figure: HIPAcc’s sequential execution for the Harris corner detector (kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc between input and output)

. . . into a streaming pipeline of Vivado kernels.

Figure: Representation of Vivado kernels (the same kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc connected as a pipeline from input to output)


Mapping of Local Operators

For each local operator kernel, a separate line buffer and window buffer are allocated.

Figure: Line buffer and group delay (image data addressed by row and column, streamed through the line buffer into the window)
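A line buffer and window for a 3×3 local operator can be sketched in plain C++ as follows. This is an illustrative model only, assuming pixels arrive one at a time in raster order; it omits boundary handling and Vivado HLS pragmas, and the names Window, box_sum, and stream_local_op are hypothetical.

```cpp
#include <vector>

// 3x3 window of the most recent pixels.
struct Window { int v[3][3]; };

// Example operator: sum of all nine window entries.
int box_sum(const Window& w) {
    int s = 0;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) s += w.v[i][j];
    return s;
}

// Streams the image through two line buffers and one window.
// Only interior pixels are written; border outputs stay 0.
template <typename Op>
std::vector<int> stream_local_op(const std::vector<int>& in,
                                 int width, int height, Op op) {
    std::vector<int> out(in.size(), 0);
    std::vector<int> line0(width, 0);  // row r-2
    std::vector<int> line1(width, 0);  // row r-1
    Window win = {};
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            int px = in[r * width + c];  // one new pixel per iteration
            // Shift the window one column to the left.
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 2; ++j) win.v[i][j] = win.v[i][j + 1];
            // New right column: two buffered rows plus the incoming pixel.
            win.v[0][2] = line0[c];
            win.v[1][2] = line1[c];
            win.v[2][2] = px;
            // Move column c of the line buffers up by one row.
            line0[c] = line1[c];
            line1[c] = px;
            // Group delay: the window is centered on pixel (r-1, c-1).
            if (r >= 2 && c >= 2) out[(r - 1) * width + (c - 1)] = op(win);
        }
    }
    return out;
}
```

The result for a pixel becomes available only one row and one column after that pixel arrives, which is the group delay named in the figure caption.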


Packing Vector Types

Pack multiple vector channels into a wider stream type (e.g. uint32_t)
• only use a single window and line buffer for all channels
• apply vector operations by operator overloading

Figure: RGBA vector channels packed into single stream (input RGBA pixel, Vivado HLS stream, operator f applied to each of the channels R, G, B, and A, output RGBA pixel)

The result is essentially a SIMD unit realized in an FPGA.
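Packing and overloaded vector operations can be illustrated with a small sketch. The struct Rgba8, the functions pack, unpack, sat_add, and add_packed, the choice of R in the lowest byte, and the saturating addition are all assumptions made for the example; the slides do not specify the channel order or the generated types.

```cpp
#include <algorithm>
#include <cstdint>

// Four 8-bit channels of one pixel.
struct Rgba8 { std::uint8_t r, g, b, a; };

// Packs the four channels into one 32-bit stream word (R in the low byte).
inline std::uint32_t pack(Rgba8 p) {
    return static_cast<std::uint32_t>(p.r)
         | (static_cast<std::uint32_t>(p.g) << 8)
         | (static_cast<std::uint32_t>(p.b) << 16)
         | (static_cast<std::uint32_t>(p.a) << 24);
}

// Splits a 32-bit stream word back into its four channels.
inline Rgba8 unpack(std::uint32_t v) {
    Rgba8 p;
    p.r = static_cast<std::uint8_t>(v & 0xFF);
    p.g = static_cast<std::uint8_t>((v >> 8) & 0xFF);
    p.b = static_cast<std::uint8_t>((v >> 16) & 0xFF);
    p.a = static_cast<std::uint8_t>((v >> 24) & 0xFF);
    return p;
}

// Saturating 8-bit addition for a single channel.
inline std::uint8_t sat_add(std::uint8_t x, std::uint8_t y) {
    return static_cast<std::uint8_t>(std::min(int(x) + int(y), 255));
}

// Overloaded operator: the same operation applied to every channel.
inline Rgba8 operator+(Rgba8 x, Rgba8 y) {
    Rgba8 s;
    s.r = sat_add(x.r, y.r);
    s.g = sat_add(x.g, y.g);
    s.b = sat_add(x.b, y.b);
    s.a = sat_add(x.a, y.a);
    return s;
}

// Operates on packed words: unpack, apply the vector operator, repack.
inline std::uint32_t add_packed(std::uint32_t x, std::uint32_t y) {
    return pack(unpack(x) + unpack(y));
}
```

Because all four channels travel in one word, a single stream, window, and line buffer serve the whole pixel, while the overloaded operator keeps the kernel code written in terms of vector values.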


Results

Experimental Setup

Algorithms
Three feature detection algorithms:

(a) Laplacian operator (b) Harris corner detector (c) Census transform (optical flow)

Evaluation Environment
• Zynq 7045 Kintex FPGA
• Image size: 1024×1024 pixels
• Xilinx OpenCV: a Vivado HLS-specific image processing library


Results: Performance OpenCV vs. HIPAcc

Figure: Throughput in MPixel/s (axis from 50 to 350) of OpenCV vs. HIPAcc for LPHV 3×3, LPD 3×3, LP 5×5, and HC.


Results: Resource Usage OpenCV vs. HIPAcc

Figure: Hardware resources in % (LUT, FF, BRAM, DSP; axis from 0 to 40) and maximum clock frequency Fmax in MHz (axis from 0 to 400) of OpenCV (O) vs. HIPAcc (H) for LPHV 3×3, LPD 3×3, LP 5×5, and HC.


Results: Performance GPU vs. FPGA

Figure: Throughput in MPixel/s (logarithmic axis from 10^0 to 10^4) of Mali-T604, Zynq 7045, and Tesla K20 for LPHV 3×3, LPD 3×3, LP 5×5, HC, and OF.

All implementations stem from the exact same DSL code base.


Conclusion


Advantages of DSL-based Approach

• Productivity: compact algorithm description, less error-prone
• Performance: efficient target-specific code generation
• Portability: flexible target choice; performance portability, not just functional portability

HIPAcc DSL code serves as baseline implementation ⇒ test bench

Moving to a higher abstraction level than HLS allows design decisions to be postponed further and thereby helps to achieve higher quality.


Questions?

Thanks for listening. Any questions?

http://github.com/hipacc/hipacc-vivado


References

[1] Dake. Fpga xilinx spartan. URL: http://en.wikipedia.org/wiki/Field-programmable_gate_array#mediaviewer/File:Fpga_xilinx_spartan.jpg

[2] Henriok. LPC18A1-and-A7. URL: http://commons.wikimedia.org/wiki/Category:ARM_Cortex-M#mediaviewer/File:LPC18A1-and-A7.jpg

[3] Poeggi. NVIDIA T20 and T30 chips. URL: http://en.wikipedia.org/wiki/Tegra#mediaviewer/File:NVIDIA_T20_and_T30_chips.jpg

[4] Ildar Sagdejev. 2008-04-17 ECU. URL: http://en.wikipedia.org/wiki/Electronic_control_unit#mediaviewer/File:2008-04-17_ECU.jpg

[5] Ioan Sameli. File:Intel 8742 153056995.jpg. URL: http://en.wikipedia.org/wiki/File:Intel_8742_153056995.jpg


Backup Slides

Results: Power Consumption

Table: Comparison of throughput (TP) and energy consumption (E) for the ARM Mali-T604, Xilinx Zynq 7045, and Nvidia Tesla K20.

            Mali-T604             Zynq 7045             Tesla K20
            TP [fps]   E [fpW]    TP [fps]   E [fpW]    TP [fps]   E [fpW]
LPHV 3x3    100.4      41.8       333.3      1423.1     10000.0    74.1
LPD 3x3     62.2       25.9       324.7      1387.6     6250.0     46.3
LP 5x5      22.0       9.2        209.2      846.9      2631.6     19.5
HC          11.9       4.9        228.3      458.5      1098.9     8.1
OF          0.4        0.2        192.5      409.6      452.5      3.4
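The table reports frames per second, while the result charts use MPixel/s. The two can be related with the 1024×1024 image size from the experimental setup, as in the sketch below. The function names are hypothetical, and reading E [fpW] as frames per second per watt, so that TP divided by E gives an implied power draw, is an assumption rather than something the slides state.

```cpp
// Pixels per frame for the 1024x1024 test images.
constexpr double kPixelsPerFrame = 1024.0 * 1024.0;

// Converts a frame rate into throughput in MPixel/s.
inline double mpixel_per_s(double fps) {
    return fps * kPixelsPerFrame / 1e6;
}

// Implied power in watts, assuming E is given in frames per second per watt.
inline double implied_watts(double fps, double frames_per_watt) {
    return fps / frames_per_watt;
}
```

Under these assumptions, the Zynq 7045 entry for LPHV 3x3 (333.3 fps) corresponds to roughly 349 MPixel/s, which is consistent with the upper end of the throughput axis in the OpenCV vs. HIPAcc chart.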
