
  • Code Generation from a Domain-specific Language for C-based HLS of Hardware Accelerators

    Oliver Reiche, Moritz Schmid, Richard Membarth, Frank Hannig, and Jürgen Teich

    Hardware/Software Co-Design, University of Erlangen-Nürnberg

    CODES+ISSS, October 14, 2014, New Delhi

  • Motivation: e. g. Driver Assistance Systems

    Mostly based on image feature detection:

    (a) Edge detection (b) Corner detection (c) Optical flow

    Where to compute features?

    ECU [4] µC [5] CPU [2] GPU [3] FPGA [1]

    Write once, decide later!

    CODES+ISSS’14 | Oliver Reiche | Hardware/Software Co-Design | Code Generation from a DSL for C-based HLS of Hardware Accelerators 0

  • Outline

    HIPAcc Framework

    FPGA Targets

    Results

    Conclusion

  • HIPAcc Framework

  • HIPAcc: The Heterogeneous Image Processing Acceleration Framework

    Figure: compilation flow of the framework.

    Input: C++ embedded DSL
    Source-to-Source Compiler (Clang/LLVM), using Domain Knowledge and Architecture Knowledge
    Generated targets:
    – CUDA (GPU)
    – OpenCL (x86/GPU)
    – Renderscript (x86/ARM/GPU)
    – C/C++ (x86)
    CUDA/OpenCL/Renderscript Runtime Library

  • HIPAcc: The Heterogeneous Image Processing Acceleration Framework

    Domain-specific Extensions

    IterationSpace     defines ROI of the output image
    Accessor           input ROI with filtering
    BoundaryCondition  boundary handling modes
    Mask               convolution mask

    Figure: output image; crop of output image; crop of output image with offset.

  • HIPAcc: The Heterogeneous Image Processing Acceleration Framework

    Domain-specific Extensions

    Figure: image and boundary; image crop; image crop with offset; image offset.

  • HIPAcc: The Heterogeneous Image Processing Acceleration Framework

    Domain-specific Extensions

    Figure: the four boundary handling modes Repeat, Clamp, Mirror, and Constant, each shown on a 4×4 image with pixels A to P and its surrounding border region. Repeat tiles the image periodically, Clamp extends the nearest edge pixel, Mirror reflects the image at its edges, and Constant fills the border with a fixed value Q.
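    The four modes can be read as a mapping from an out-of-range coordinate back into the image. The following is a minimal sketch in plain C++, assuming a single image row A B C D as in the figure. The names `Boundary`, `map_index`, and `read_row` are hypothetical and chosen for illustration; they are not part of the HIPAcc API.

```cpp
#include <cassert>
#include <string>

// Hypothetical enumeration of the four modes shown in the figure.
enum class Boundary { Repeat, Clamp, Mirror, Constant };

// Map coordinate i onto [0, n). Returns -1 when the constant value applies.
int map_index(int i, int n, Boundary mode) {
    assert(n > 0);
    if (i >= 0 && i < n) return i;  // inside the image: nothing to do
    switch (mode) {
    case Boundary::Repeat: {        // periodic continuation
        int r = i % n;
        return r < 0 ? r + n : r;
    }
    case Boundary::Clamp:           // nearest edge pixel
        return i < 0 ? 0 : n - 1;
    case Boundary::Mirror: {        // reflection, edge pixel duplicated
        int period = 2 * n;
        int r = i % period;
        if (r < 0) r += period;
        return r < n ? r : period - 1 - r;
    }
    case Boundary::Constant:        // caller substitutes a constant
        return -1;
    }
    return -1;
}

// Read one pixel of a row, applying the boundary mode when i is out of range.
char read_row(const std::string& row, int i, Boundary mode, char constant = 'Q') {
    int j = map_index(i, static_cast<int>(row.size()), mode);
    return j < 0 ? constant : row[j];
}
```

    For the row A B C D, reading index -1 gives D under Repeat, A under Clamp, A under Mirror, and Q under Constant, which matches the border columns in the figure.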

  • HIPAcc: The Heterogeneous Image Processing Acceleration Framework

    Domain-specific Extensions

    Figure: surface plot of f(x,y) for x and y from −10 to 10 (values from 0 to 1), together with the resulting 3×3 convolution mask:

    0.0571  0.1248  0.0571
    0.1248  0.2725  0.1248
    0.0571  0.1248  0.0571
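    The mask values above are consistent with a normalized 3×3 Gaussian. The sketch below reproduces them under the assumption of a standard deviation of 0.8; that value is inferred from the numbers and is not stated on the slide. `Mask3x3` and `gaussian_mask` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mask3x3 = std::array<std::array<double, 3>, 3>;

// Sample exp(-(x^2 + y^2) / (2 sigma^2)) on a 3x3 grid and normalize
// the coefficients so that they sum to 1.
Mask3x3 gaussian_mask(double sigma) {
    assert(sigma > 0.0);
    Mask3x3 m{};
    double sum = 0.0;
    for (int y = -1; y <= 1; ++y) {
        for (int x = -1; x <= 1; ++x) {
            double v = std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            m[y + 1][x + 1] = v;
            sum += v;
        }
    }
    for (auto& row : m)
        for (double& v : row) v /= sum;
    return m;
}
```

    With `gaussian_mask(0.8)`, the center, edge, and corner coefficients round to 0.2725, 0.1248, and 0.0571.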

  • Example: Laplacian Operator

    // coefficients for Laplacian operator
    const int coef[3][3] = { { 0,  1, 0 },
                             { 1, -4, 1 },
                             { 0,  1, 0 } };

    Mask<int> mask(coef);
    Image<uchar4> in(width, height);
    Image<uchar4> out(width, height);

    // load image data
    in = image_bits;

    // reading from in with mirroring as boundary condition
    BoundaryCondition<uchar4> bound(in, mask, BOUNDARY_MIRROR);
    Accessor<uchar4> acc(bound);

    // output image
    IterationSpace<uchar4> iter(out);

    // define kernel
    Laplacian filter(iter, acc, mask);

    // execute kernel
    filter.execute();

  • Example: Laplacian Operator Kernel

    class Laplacian : public Kernel<uchar4> {
      private:
        Accessor<uchar4> &input;
        Mask<int> &mask;

      public:
        Laplacian(IterationSpace<uchar4> &iter,
                  Accessor<uchar4> &input, Mask<int> &mask)
            : Kernel(iter), input(input), mask(mask) {
          addAccessor(&input);
        }

        void kernel() {
          // convolution call
          int4 sum = convolve(mask, HipaccSUM, [&] () -> int4 {
            return mask() * convert_int4(input(mask));
          });
          sum = max(sum, 0);
          sum = min(sum, 255);
          output() = convert_uchar4(sum);
        }
    };
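    To make the semantics of the convolution call concrete, here is a plain C++ reference sketch of what the kernel computes per pixel: a sum over the 3×3 window of coefficient times pixel, clamped to 0..255. It assumes a single 8-bit channel instead of `uchar4` and a mirror boundary that duplicates the edge pixel. `mirror` and `laplacian_reference` are hypothetical helper names; this is not the code HIPAcc generates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirror an out-of-range coordinate back into [0, n), duplicating the
// edge pixel. Valid for -n <= i < 2n, which covers a 3x3 window.
inline int mirror(int i, int n) {
    if (i < 0) return -i - 1;
    if (i >= n) return 2 * n - 1 - i;
    return i;
}

// Single-channel reference of the Laplacian kernel shown above.
std::vector<std::uint8_t> laplacian_reference(const std::vector<std::uint8_t>& in,
                                              int width, int height) {
    static const int coef[3][3] = {{0, 1, 0}, {1, -4, 1}, {0, 1, 0}};
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);

    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;  // corresponds to the HipaccSUM reduction
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = mirror(y + dy, height);
                    int xx = mirror(x + dx, width);
                    sum += coef[dy + 1][dx + 1] * in[yy * width + xx];
                }
            }
            sum = std::max(sum, 0);    // same clamping as in kernel()
            sum = std::min(sum, 255);
            out[y * width + x] = static_cast<std::uint8_t>(sum);
        }
    }
    return out;
}
```

    Because the coefficients sum to zero, a constant image maps to an all-zero output, which makes a convenient sanity check for any generated implementation.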

  • FPGA Targets

  • Workflow for FPGA Targets

    Figure: workflow diagram with the nodes C++ Embedded DSL (with Domain/Architecture Knowledge), Vivado HLS, RTL Simulation, Golden Reference, and FPGA, connected by edges labeled C++ and HDL.

  • Generating the Streaming Pipeline

    Trace host code and translate it to internal representation:

    • model as combination of processes and spaces
    • create unique stream objects for each space
    • identify memory reuse
    • insert copy processes
    • build dependency graph
    • traverse in depth-first search starting from output spaces

    Figure: DSL objects (Kernel, IterationSpace, Image, Accessor) modeled as processes and spaces. One panel is annotated "1× stream", the other, in which two Accessors and Kernels read the same space, "3× streams, 1× copy process".
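    The steps listed above can be sketched in a few lines of C++. This is an illustrative model under stated assumptions, not HIPAcc's internal representation: a space is identified by a string, a process lists the spaces it reads and writes, and `Process`, `insert_copy_processes`, and `schedule` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Process {
    std::string name;
    std::vector<std::string> inputs;   // spaces read
    std::vector<std::string> outputs;  // spaces written
};

// A stream can be consumed only once, so every space with more than one
// reader gets a copy process that duplicates it into one space per reader.
std::vector<Process> insert_copy_processes(std::vector<Process> procs) {
    std::map<std::string, int> readers;
    for (const Process& p : procs)
        for (const std::string& s : p.inputs) ++readers[s];

    std::map<std::string, Process> copies;
    for (Process& p : procs) {
        for (std::string& s : p.inputs) {
            if (readers[s] < 2) continue;
            Process& c = copies[s];
            if (c.name.empty()) { c.name = "copy_" + s; c.inputs = {s}; }
            std::string dup = s + "_" + std::to_string(c.outputs.size());
            c.outputs.push_back(dup);
            s = dup;  // the reader now consumes its private duplicate
        }
    }
    for (const auto& kv : copies) procs.push_back(kv.second);
    return procs;
}

// Depth-first traversal starting from the output spaces; a process is
// emitted after all producers of its inputs have been emitted.
std::vector<std::string> schedule(const std::vector<Process>& procs,
                                  const std::vector<std::string>& output_spaces) {
    std::map<std::string, std::size_t> producer;
    for (std::size_t i = 0; i < procs.size(); ++i)
        for (const std::string& s : procs[i].outputs) producer[s] = i;

    std::vector<std::string> order;
    std::set<std::size_t> done;
    std::function<void(const std::string&)> visit = [&](const std::string& space) {
        auto it = producer.find(space);
        if (it == producer.end()) return;             // external input space
        if (!done.insert(it->second).second) return;  // already handled
        const Process& p = procs[it->second];
        for (const std::string& in : p.inputs) visit(in);
        order.push_back(p.name);
    };
    for (const std::string& s : output_spaces) visit(s);
    return order;
}
```

    For two kernels dx and dy that both read the space in, the sketch creates copy_in with the spaces in, in_0, and in_1, which corresponds to the "3× streams, 1× copy process" annotation in the figure.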

  • Streaming Pipeline: Example

    Transform sequential execution order ...

    Figure: HIPAcc’s sequential execution for the Harris corner detector (kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc between input and output, executed one after another)

    ... into streaming pipeline of Vivado kernels.

    Figure: Representation of Vivado kernels (the same kernels dx, dy, sx, sxy, sy, gx, gxy, gy, and hc connected as a pipeline from input to output)

  • Mapping of Local Operators

    For each local operator kernel, a separate line and window buffer is allocated.

    Figure: Line buffer and group delay (image data indexed by row and col, with pixels A, B, C, ... moving through the line buffer and the window)
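    A software model of the line and window buffer may help to read the figure. The sketch below streams pixels in raster order through two line buffers and a 3×3 window and emits a result once the window is completely filled; border handling is left out on purpose, so only interior pixels produce output. `Window` and `stream_3x3` are hypothetical names, and the code is a plain C++ model rather than generated Vivado HLS code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Window = std::array<std::array<int, 3>, 3>;

// Apply a 3x3 local operator to a pixel stream in raster order.
// The result for the pixel at (x-1, y-1) becomes available only when the
// pixel at (x, y) arrives, which is the group delay of one row plus one pixel.
template <typename Op>
std::vector<int> stream_3x3(const std::vector<int>& in, int width, int height, Op op) {
    assert(width >= 3 && height >= 3);
    assert(in.size() == static_cast<std::size_t>(width) * height);

    std::vector<int> line0(width, 0);  // row y-2
    std::vector<int> line1(width, 0);  // row y-1
    Window win{};
    std::vector<int> out;

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int px = in[y * width + x];
            // shift the window one column to the left
            for (auto& row : win) { row[0] = row[1]; row[1] = row[2]; }
            // insert the new column from the line buffers and the stream
            win[0][2] = line0[x];
            win[1][2] = line1[x];
            win[2][2] = px;
            // update the line buffers for the next row
            line0[x] = line1[x];
            line1[x] = px;
            if (y >= 2 && x >= 2) out.push_back(op(win));
        }
    }
    return out;
}
```

    Each input pixel is read exactly once from the stream; all other accesses hit the two line buffers or the window registers, which is what allows one pixel to be processed per step.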

  • Packing Vector Types

    Pack multiple vector channels into wider stream type (e.g. uint32_t)

    • only use single window and line buffer for all channels
    • apply vector operations by operator overloading

    Figure: RGBA vector channels packed into single stream (input RGBA pixel, Vivado HLS stream, operator f applied to each of the channels R, G, B, and A, output RGBA pixel)

    The result is basically a SIMD unit realized in an FPGA.
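    As an illustration of packing and operator overloading, the following sketch stores four 8-bit channels in one `uint32_t` and overloads `+` to add all channels at once with saturation. The channel layout (R in the lowest byte) and the names `PackedRgba`, `pack`, and `channel` are assumptions made for this example; the slides do not specify them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Four 8-bit channels in one 32-bit word.
// Assumed layout: R in bits 0..7, G in 8..15, B in 16..23, A in 24..31.
struct PackedRgba {
    std::uint32_t bits;
};

inline PackedRgba pack(std::uint8_t r, std::uint8_t g, std::uint8_t b, std::uint8_t a) {
    return PackedRgba{static_cast<std::uint32_t>(r) |
                      (static_cast<std::uint32_t>(g) << 8) |
                      (static_cast<std::uint32_t>(b) << 16) |
                      (static_cast<std::uint32_t>(a) << 24)};
}

inline std::uint8_t channel(PackedRgba p, int i) {
    return static_cast<std::uint8_t>((p.bits >> (8 * i)) & 0xFFu);
}

// Channel-wise saturating addition. In hardware the four iterations
// would be independent units working in parallel.
inline PackedRgba operator+(PackedRgba lhs, PackedRgba rhs) {
    PackedRgba out{0};
    for (int i = 0; i < 4; ++i) {
        unsigned s = channel(lhs, i) + channel(rhs, i);
        s = std::min(s, 255u);
        out.bits |= static_cast<std::uint32_t>(s) << (8 * i);
    }
    return out;
}
```

    Kernel code can then write `a + b` on whole pixels, while a single stream, window, and line buffer carries all four channels.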

  • Results

  • Experimental Setup

    Algorithms
    Three feature detection algorithms:

    (a) Laplacian operator (b) Harris corner detector (c) Census transform (optical flow)

    Evaluation Environment
    Zynq 7045 Kintex FPGA
    Image size 1024×1024 pixels

    Xilinx OpenCV: a Vivado HLS-specific image processing library

  • Results: Performance OpenCV vs. HIPAcc

    Figure: bar chart of throughput in MPixel/s (axis from 50 to 350) for LPHV 3×3, LPD 3×3, LP 5×5, and HC; series: OpenCV and HIPAcc.

  • Results: Resource Usage OpenCV vs. HIPAcc

    Figure: hardware resources in % (LUT, FF, BRAM, DSP; axis from 0 to 40) and clock frequency Fmax in MHz (axis from 0 to 400) for LPHV 3×3, LPD 3×3, LP 5×5, and HC; bars labeled O (OpenCV) and H (HIPAcc).

  • Results: Performance GPU vs. FPGA

    Figure: bar chart of throughput in MPixel/s (logarithmic axis from 10^0 to 10^4) for LPHV 3×3, LPD 3×3, LP 5×5, HC, and OF; series: Mali-T604, Zynq 7045, and Tesla K20.

    All implementations stem from the exact same DSL code base.

  • Conclusion

  • Conclusion

    Advantages of DSL-based Approach

    Productivity
    – compact algorithm description
    – less error-prone

    Performance
    – efficient target-specific code generation

    Portability
    – flexible target choice
    – performance portability, not just functional portability

    HIPAcc DSL code serves as baseline implementation ⇒ test bench

    Moving to a higher abstraction level than HLS allows design decisions to be postponed further and therefore achieves higher quality.

  • Questions?

    Thanks for listening. Any questions?

    http://github.com/hipacc/hipacc-vivado

    Title: Code Generation from a Domain-specific Language for C-based HLS of Hardware Accelerators

    Author: Oliver Reiche

  • References

  • References I

    [1] Dake. Fpga xilinx spartan. URL: http://en.wikipedia.org/wiki/Field-programmable_gate_array#mediaviewer/File:Fpga_xilinx_spartan.jpg

    [2] Henriok. LPC18A1-and-A7. URL: http://commons.wikimedia.org/wiki/Category:ARM_Cortex-M#mediaviewer/File:LPC18A1-and-A7.jpg

    [3] Poeggi. NVIDIA T20 and T30 chips. URL: http://en.wikipedia.org/wiki/Tegra#mediaviewer/File:NVIDIA_T20_and_T30_chips.jpg

    [4] Ildar Sagdejev. 2008-04-17 ECU. URL: http://en.wikipedia.org/wiki/Electronic_control_unit#mediaviewer/File:2008-04-17_ECU.jpg

    [5] Ioan Sameli. File:Intel 8742 153056995.jpg. URL: http://en.wikipedia.org/wiki/File:Intel_8742_153056995.jpg

  • Backup Slides

  • Results: Power Consumption

    Table: Comparison of throughput (TP) and energy consumption (E) for the ARM Mali-T604, Xilinx Zynq 7045, and Nvidia Tesla K20.

                  Mali-T604             Zynq 7045             Tesla K20
                  TP [fps]   E [fpW]    TP [fps]   E [fpW]    TP [fps]   E [fpW]
    LPHV 3x3         100.4      41.8       333.3    1423.1     10000.0      74.1
    LPD 3x3           62.2      25.9       324.7    1387.6      6250.0      46.3
    LP 5x5            22.0       9.2       209.2     846.9      2631.6      19.5
    HC                11.9       4.9       228.3     458.5      1098.9       8.1
    OF                 0.4       0.2       192.5     409.6       452.5       3.4
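    To relate the frame rates in the table to the MPixel/s axes of the throughput charts, the conversion below may be useful. It assumes that the fps values refer to the 1024×1024 pixel images named in the experimental setup, which the table itself does not state; `mpixel_per_second` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Convert a frame rate into pixel throughput in MPixel/s
// (1 MPixel = 10^6 pixels).
double mpixel_per_second(double fps, int width, int height) {
    assert(width > 0 && height > 0);
    return fps * static_cast<double>(width) * height / 1e6;
}
```

    Under that assumption, the 333.3 fps of the Zynq 7045 for LPHV 3x3 correspond to roughly 349.5 MPixel/s, which lies within the 50 to 350 range of the OpenCV vs. HIPAcc chart.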
