

Automatic Kernel Fusion for Image Processing DSLs
Bo Qiao, Oliver Reiche, Frank Hannig, Jürgen Teich

{bo.qiao, oliver.reiche, hannig, teich}@fau.de
Hardware/Software Co-Design, Department of Computer Science

Friedrich-Alexander University Erlangen-Nürnberg (FAU)

ABSTRACT
Programming image processing algorithms on hardware accelerators such as graphics processing units (GPUs) often exhibits a trade-off between software portability and performance portability. Domain-specific languages (DSLs) have proven to be a promising remedy, which enable optimizations and generation of efficient code from a concise, high-level algorithm representation.

The scope of this paper is an optimization framework for image processing DSLs in the form of a source-to-source compiler. To cope with the inter-kernel communication bound via global memory for GPU applications, kernel fusion is investigated as a primary optimization technique to improve temporal locality. In order to enable automatic kernel fusion, we analyze the fusibility of each kernel in the algorithm in terms of data dependencies, resource utilization, and parallelism granularity. By combining the obtained information with the domain-specific knowledge captured in the DSL, a method to automatically fuse the suitable kernels is proposed and integrated into an open source DSL framework. The novel kernel fusion technique is evaluated on two filter-based image processing applications, for which speedups of up to 1.60 are obtained for an NVIDIA GeForce GTX 745 graphics card target.

CCS CONCEPTS
• Computing methodologies → Image processing; Graphics processors; • Software and its engineering → Domain-specific languages;

KEYWORDS
Domain-Specific Languages, Image Processing, Kernel Fusion, GPUs

ACM Reference Format:
Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2018. Automatic Kernel Fusion for Image Processing DSLs. In SCOPES '18: 21st International Workshop on Software and Compilers for Embedded Systems, May 28–30, 2018, Sankt Goar, Germany. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3207719.3207723

This is the author's version of the work. The definitive work was published in Proceedings of the 21st International Workshop on Software and Compilers for Embedded Systems (SCOPES 2018), May 28–30, 2018, Sankt Goar, Germany.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SCOPES '18, May 28–30, 2018, Sankt Goar, Germany
© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-5780-7/18/05...$15.00
https://doi.org/10.1145/3207719.3207723

1 INTRODUCTION
Image processing is gaining importance in multiple domains, including computer vision, medical imaging, photography, artificial intelligence, and autonomous driving. Such applications are intrinsically data-intensive and demand accelerators such as GPUs and FPGAs for fast execution. Programming image processing algorithms on such architectures is challenging and often exhibits a trade-off between software portability and performance portability. DSLs have proven to be a promising remedy for this problem. By combining domain-specific and architecture knowledge, DSLs enable optimizations and the generation of efficient code for various backends from a single algorithm description.

Several high-performance image processing DSLs have been proposed in recent years. Halide [10] demonstrates the benefits of decoupling the algorithm from its schedule. Here, algorithms are specified in a functional manner, and a scheduler determines the evaluation order, storage, mapping information, etc. PolyMage [9] employs a similar algorithm representation but focuses on tiling techniques using the polyhedral model for scheduling. The functional description of algorithms in those DSLs effectively shields programmers from hardware implementation details, which yields a much more concise description. Nevertheless, such representations capture merely data flow information from the program; the burden of optimization is entirely offloaded to the scheduler. A scheduler is key for DSLs to generate efficient code. It can be specified by an architecture expert, which is costly, error-prone, and not easily portable. A better approach is to generate a schedule from the source description automatically. For example, the Halide auto-scheduler [8] performs locality- and parallelism-enhancing optimizations automatically, using techniques such as inlining, grouping, and tiling. It partitions the algorithm into groups and applies optimizations within each group. Subsequently, a schedule can be generated in seconds with reasonable performance. Nevertheless, the auto-scheduler might require user assistance for providing function bounds. Moreover, these optimizations are generic, such that all SIMD machines with caches can benefit; if the target backend is a GPU, the communication among the partitioned groups still happens via main memory.

Accelerators such as GPUs are used extensively beyond image processing, and our work is also motivated by DSLs from other domains. Wang et al. [14] initiated kernel fusion as an alternative to loop fusion for power optimization on GPUs. Here, energy consumption is reduced by balancing resource utilization, which is achieved by fusing two or more kernels; the goal is to reduce power consumption rather than execution time. Wu et al. [15] enriched the benefits of kernel fusion with a smaller data footprint and a larger optimization scope. Nevertheless, the proposed fusion method is dedicated to data warehousing applications. Filipovič et al. [2] attempted to harness the existing benefits, with a focus on


inter-kernel communications. They established a source-to-source compiler in which kernel fusion is implemented as an optimization step for linear algebra applications. The employed technique is closely related to our work, but it targets a different domain. It is well known in GPU programming that frequent data movement between host and device deteriorates performance. Therefore, techniques such as kernel fusion have become an important optimization step at the kernel level.

In this work, we investigate the automatic optimization of image processing applications on GPUs in the context of DSLs. For the first time, we propose and embed kernel fusion techniques in the open source1 compiler Heterogeneous Image Processing Acceleration (Hipacc) [7], and present the achievable speedups for two commonly used image processing applications. Our contributions are as follows:

(1) A fusibility analysis model, which automatically extracts kernel information from the Hipacc algorithm representation. The fusibility of each kernel is analyzed based on data dependencies, resource utilization, and parallelism granularity. The model is implemented based on the Clang2 abstract syntax tree (AST). Analysis and decision making are done automatically in the compiler, without additional input required from the programmer.

(2) Automatic domain-specific fusion for GPUs. This technique enables Hipacc kernel fusion on the AST level. The whole process is performed automatically by utilizing domain knowledge and GPU architecture knowledge. When enabled, readability of the emitted CUDA code is preserved after fusion.

(3) A generic optimization framework that is seamlessly integrated into Hipacc, a modular source-to-source domain-specific compiler. The framework is extensible for future optimizations without sacrificing any existing feature. Programmers only need to specify their algorithms using Hipacc, without any additional input. The optimization can be enabled or disabled by simply passing a flag during compilation.

The remainder of this paper is structured as follows: Section 2 gives a brief introduction to Hipacc, focusing on its algorithm representation. Subsequently, we elaborate why kernel fusion is investigated in this context as an important performance optimization technique. Section 3 introduces kernel fusion. First, we give an introductory example to illustrate the basics of kernel fusion and highlight its benefits as well as its challenges. Then, we present the fusibility analysis model. Finally, the proposed methods for the automatic domain-specific fusion are explained. Benchmark results are presented in Section 4 before we conclude our work in Section 5.

2 IMAGE PROCESSING DSLs
One of the benefits of using a DSL to program image processing applications is the conciseness of the algorithm description. A comprehensive understanding of the descriptors in the DSL is the premise of any optimization. In this section, we first introduce Hipacc and present the features offered by this compiler. Then, we explain why kernel fusion was chosen as the optimization technique.

1 http://hipacc-lang.org
2 http://clang.llvm.org

Figure 1: Memory access patterns for operators in Hipacc: (a) point operator, (b) local operator, (c) global operator. Top: input image; middle: SIMD processing; bottom: output image.

2.1 Hipacc
Hipacc consists of an open source image processing DSL, embedded into C++, and a source-to-source compiler. Initially developed to target GPUs [6], it was later extended to support more targets such as FPGAs [12, 11]. Within the framework, Hipacc classifies image processing algorithms based on what information contributes to the result. Three groups of operators have been identified: (a) point operators, (b) local operators, and (c) global operators, as depicted by Figure 1. To compute a pixel in the output image: if (a) one pixel is required from the input image, the operator is a point operator; if (b) a region of pixels is required, it is a local operator; and if (c) all pixels of the input image are required, it is a global operator.
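For intuition, a minimal sketch of what a point operator and a local operator correspond to in hand-written CUDA is given below; the inversion and box-filter bodies are illustrative assumptions, not code generated by Hipacc:

__global__ void point_op(const unsigned char *in, unsigned char *out,
                         int w, int h) {
    // Point operator: one input pixel per output pixel.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = 255 - in[y * w + x];  // e.g., invert
}

__global__ void local_op(const unsigned char *in, unsigned char *out,
                         int w, int h) {
    // Local operator: a 3x3 region of input pixels per output pixel
    // (boundary handling omitted here; Hipacc derives it from the
    // BoundaryCondition specification).
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= w - 1 || y < 1 || y >= h - 1) return;
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += in[(y + dy) * w + (x + dx)];
    out[y * w + x] = sum / 9;  // 3x3 box filter
}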

To illustrate briefly how operators are used in Hipacc, a Gaussian blur filter is shown as an example in Listing 1. The actual kernel definition is not important here and hence omitted.

The design of the Hipacc language aims to capture the common compute patterns in image processing. From this perspective, the language specifications of high-performance image processing DSLs such as Hipacc, Halide, and PolyMage share some similarity. For example, point operators, local operators, and global operators in Hipacc resemble the point-wise operation, stencil operation, and time-iterated operation in PolyMage, respectively. Those operators in Hipacc are used as basic building blocks to compose more complicated algorithms. For example, Figure 2 (a) depicts how the Harris corner detector [3] is implemented using Hipacc operators. Note that in other languages, an operator is often referred to as a kernel, hence the name kernel fusion. In this paper, we will use the terms operator and kernel interchangeably.

2.2 Optimization in Image Processing DSLs
Optimizations in DSLs are accomplished by combining domain-specific knowledge with architecture knowledge. To gain this knowledge, the compiler needs to analyze the programmer-specified algorithm description as well as the target hardware architecture information. Afterwards, it needs to extract information such as input size, data dependencies, memory size, and available computational resources. Finally, by combining the gained knowledge with certain metrics such as minimizing execution time by reducing

communication via global memory on a GPU, the compiler derives the optimized execution pattern and generates code.

Figure 2: Overview of kernel fusion in Hipacc. Procedure from left to right: (a) a Hipacc internal representation of a Harris corner detector specification, consisting of point operators and local operators. Buffers allocated in global memory are shown between kernel executions. (b) Our fusibility analysis model takes the representation as input, identifies fusible kernels, and generates (c) a set of fusible kernel lists, as indicated by the red dashed rectangles. Subsequently, (d) the domain-specific fuser takes the lists of fusible kernels, combines them with domain and architecture knowledge, and executes fusion for each of these lists. The outcome of fusion is (e) one fused kernel per list.

// filter mask for the Gaussian blur filter
const float filter_mask[3][3] = {
    { 0.057118f, 0.124758f, 0.057118f },
    { 0.124758f, 0.272496f, 0.124758f },
    { 0.057118f, 0.124758f, 0.057118f }
};
Mask<float> mask(filter_mask);

// input image
size_t width, height;
uchar *image = read_image(&width, &height, "input.pgm");
Image<uchar> in(width, height, image);

// reading from in with clamping as boundary condition
BoundaryCondition<uchar> cond(in, mask, Boundary::CLAMP);
Accessor<uchar> acc(cond);

// output image
Image<uchar> out(width, height);
IterationSpace<uchar> iter(out);

// instantiate and launch the Gaussian blur filter
LinearFilter Gaussian(iter, acc, mask, 3);
Gaussian.execute();

Listing 1: Instantiation of an operator for the Gaussian blur filter in the Hipacc DSL.

For image processing DSLs, we argue that there exists a trade-off between the constraints exposed to the programmer for algorithm specification and the ease of applying optimizations in the compiler, regardless of the target backend. For example, Halide allows its programmers to describe algorithms in a functional manner, which enables an effortless translation from algorithms written in pure mathematical form to the corresponding Halide representation. Nevertheless, only dependency information is encapsulated in such algorithm representations. Other information such as buffer size and location is determined during schedule generation, which generally yields a very expensive design space exploration.

In contrast, Hipacc limits the freedom exposed to its programmers to some extent. Hipacc users must specify their algorithms using the provided domain-specific operators, as introduced previously. In this way, programmers first match their algorithms with the compute patterns offered by Hipacc operators. Note that this translation generally requires no additional effort, since the language and its operators are designed to express the common compute patterns in image processing. Consequently, the algorithm specification in Hipacc is as concise as that of other functional representation languages such as Halide.

Given this trade-off, Hipacc representations have state and can encapsulate more information in addition to data dependencies, e.g., parallelism or memory size and location. This additional information can be combined with domain and architecture knowledge to enable many domain-specific optimizations, such as automatic border handling and memory coalescing.

One observation in Hipacc is that algorithms are generally composed of multiple kernels, as can be seen from the Harris corner detector example in Figure 2 (a). Unfortunately, the inter-kernel communication is typically handled via global memory, which often


leads to memory-bound execution on GPUs. Kernel fusion as an optimization technique is well-suited in this context. In the next section, we present how kernel fusion can be represented as a compiler transformation that can be automatically applied to algorithms specified in Hipacc.

3 KERNEL FUSION
In this section, we introduce kernel fusion in the context of image processing DSLs. First, we give an introductory example to illustrate the basics of kernel fusion and highlight its benefits as well as its challenges. Then, we present a fusibility analysis model that is implemented in Hipacc. Finally, the methods for the automatic fusion are discussed in detail.

The primary goal of kernel fusion is to improve the temporal locality during execution by eliminating unnecessary communication via global memory. Figure 3 depicts the general concept of kernel fusion. Given two kernels f and g with a producer-consumer data dependency, the result of f is written to global memory and later read by g, with no other modifications made to the result. Kernel fusion identifies such intermediate memory accesses as unnecessary and composes f into g, creating a new fused kernel g ∘ f and reducing the amount of unnecessary global memory transfers.
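As a concrete (hypothetical) instance of this concept, the sketch below shows two point-wise CUDA kernels f and g with a producer-consumer dependency, and the fused kernel in which the intermediate pixel lives in a register; the brighten/threshold bodies are our own assumptions, not Hipacc output:

__global__ void f(const unsigned char *in, unsigned char *tmp, int n) {
    // Producer: result goes to an intermediate buffer in global memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = (unsigned char)min(in[i] + 10, 255);  // brighten
}

__global__ void g(const unsigned char *tmp, unsigned char *out, int n) {
    // Consumer: reads the intermediate buffer back from global memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] > 128 ? 255 : 0;               // threshold
}

__global__ void g_of_f(const unsigned char *in, unsigned char *out, int n) {
    // Fused kernel g ∘ f: the intermediate pixel lives in a register,
    // saving one global store and one global load per pixel.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int t = min(in[i] + 10, 255);  // body of f
        out[i] = t > 128 ? 255 : 0;    // body of g
    }
}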

Figure 3: A simple example of kernel fusion. Left: kernels f and g each perform a load-compute-store cycle through global memory. Right: the fused kernel g ∘ f performs a single load-compute-store cycle.

3.1 Benefits
There are many benefits to optimizing programs using kernel fusion. In this section, we highlight three key benefits that are particularly important for GPU targets:
• Faster execution: accessing global memory on a GPU requires hundreds of cycles per transaction, which is costly compared to arithmetic computations and easily becomes a memory bandwidth bottleneck. If two kernels can be fused, the intermediate data is kept locally or even at the register level. Consequently, the expensive memory accesses are eliminated.
• Larger optimization scope: compared with the kernels before fusion, the fused kernel typically has a larger code body, which may offer more opportunities for optimization. This is well understood as a general-purpose compiler technique [1]. Hipacc is a source-to-source compiler, which generates CUDA code and then passes it to NVCC for further compilation. After kernel fusion, the generated CUDA code has a larger code body. Thus, certain optimization passes in NVCC such as common subexpression elimination might become applicable, and the final code is potentially more efficient.
• Smaller data footprint: if every pixel of an intermediate image can be directly produced and consumed by the same computation unit, e.g., a thread, then no buffer needs to be allocated in global memory during the whole execution. This may also be beneficial from an energy perspective.

For kernel fusion, two questions need to be answered before one can harness the benefits introduced above: Given an algorithm specified in Hipacc, which kernels can be fused? And how can the compiler fuse them automatically? In the remainder of this section, we provide answers to these two questions.

3.2 Fusibility Analysis
A fusibility analysis model is used to determine which kernels in a Hipacc representation can be fused. In this subsection, we illustrate the procedure for identifying fusible kernels.

3.2.1 Internal Representation. The fusibility analysis starts by constructing an AST-like internal representation in Hipacc. First, our model takes the AST representation of the input algorithm (generated by the Clang frontend) and performs an in-order traversal. It tracks kernel executions, memory accesses, and buffer allocations. During the traversal, dependencies between buffers and kernels are recorded. Then, a directed acyclic graph (DAG) is built based on the obtained AST information. In the DAG, kernels and buffers are the vertices, represented as processes and spaces, respectively. Data dependencies are captured by edges. Figure 2 (a) depicts such a representation for the Harris corner detector specification.

Given such an internal representation G = (V, E), where G is a DAG, V is the set of vertices3, and E is the set of edges in G, the next step is to generate an initial set of fusible kernel lists by analyzing the dependency information of the DAG.

3.2.2 Dependency Constraint. Data dependencies are considered the primary constraint in our fusibility analysis model. We generate an initial set S' of fusible kernel lists by examining the data dependencies in G.

Fusing multiple kernels can be seen as a reduction operation on fusing pairs of kernels in a DAG. Therefore, we examine any two kernels in G that share a producer-consumer dependency. Three scenarios can be identified, as depicted by Figure 4. Given f as the producer kernel, the result of which is used by the consumer kernel g: scenario (a) is identical to the introductory example presented in Figure 3, where no external dependency exists; hence, it is safe to execute fusion. If the consumer kernel g demands other input images in addition to the producer result, as depicted in scenario (b), i.e., an external dependency exists, then fusing f with g might not improve temporal locality. Furthermore, if the producer result img is consumed by other kernels in addition to the consumer kernel g, as depicted in scenario (c), then fusing f with g is prohibited, because Hipacc kernels can only produce a single output image.

3 For notation simplicity, we regard each process and its output image as a single vertex. Thus, the DAG only has one type of vertex. This is valid for all the analysis in this paper since, in Hipacc, each kernel has only one output image; Hipacc currently does not support generating multiple output images per kernel.

Figure 4: Dependency scenarios for kernel pairs: (a) none, (b) external input, (c) external output.

In this work, we therefore tighten our dependency constraint to scenario (a) only. Eq. (1) gives the dependency constraint definition for our analysis model:

C_{\mathrm{dep}}(v_i, v_j) = (v_i, v_j) \in E \;\land\; \deg^+(v_i) = 1 \;\land\; \deg^-(v_j) = 1 \qquad (1)

where C_dep denotes the dependency constraint on a pair of kernels, and deg^- and deg^+ denote the indegree and outdegree of a node, respectively.

Subsequently, we examine every pair of kernels in G according to Eq. (1). Pairs that satisfy C_dep are put into a temporary set of fusible kernel pairs S'. After all the pairs have been examined in G, we visit the collected kernel pairs in S' and concatenate all the dependent pairs into new lists. For example, given two pairs of kernels in S' as (p, q) and (q, r): the consumer kernel q in the first pair is also the producer kernel in the second pair. Hence, these two pairs are concatenated into one list with three kernels, namely {p, q, r}. This step is executed in place and recursively in S' for all the pairs. Finally, S' is updated and becomes a set that contains one or more fusible kernel lists. Assume the obtained set S' consists of n fusible kernel lists {L_1, L_2, ..., L_n}; each of these lists L_i for 1 ≤ i ≤ n has the following properties:
• L_i = (V_i, E_i) with V_i ⊆ V, E_i ⊆ E. Namely, L_i is also a DAG, which is a subgraph of G.
• |V_i| ≥ 2. Namely, L_i should contain at least two kernels: there should be at least a start kernel and an end kernel in each fusible kernel list.
• L_i ∩ L_j = ∅ for i ≠ j. Namely, neither kernels nor edges can be shared among different lists. Each list can later be fused independently, resulting in one fused kernel.
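The following host-side C++ sketch mirrors this pair-collection step; the DAG type, its fields, and the list-building strategy are our own simplified assumptions (Hipacc operates on the Clang AST and its recorded dependencies):

#include <utility>
#include <vector>

struct DAG {
  std::vector<std::pair<int, int>> edges;  // (producer, consumer) kernel ids
  int outdeg(int v) const {
    int d = 0;
    for (const auto &e : edges) d += (e.first == v);
    return d;
  }
  int indeg(int v) const {
    int d = 0;
    for (const auto &e : edges) d += (e.second == v);
    return d;
  }
};

// Eq. (1): (vi, vj) is fusible if it is an edge, vi has no other
// consumer, and vj has no other input.
bool cdep(const DAG &g, int vi, int vj) {
  bool isEdge = false;
  for (const auto &e : g.edges) isEdge |= (e.first == vi && e.second == vj);
  return isEdge && g.outdeg(vi) == 1 && g.indeg(vj) == 1;
}

// Collect fusible pairs and concatenate dependent ones, e.g. (p,q)
// and (q,r) become {p,q,r}. Since Cdep forces simple chains, it is
// enough to append/prepend; a complete version would also merge two
// previously separate lists.
std::vector<std::vector<int>> dependencyAnalysis(const DAG &g) {
  std::vector<std::vector<int>> lists;
  for (const auto &e : g.edges) {
    if (!cdep(g, e.first, e.second)) continue;
    bool placed = false;
    for (auto &l : lists) {
      if (l.back() == e.first)        { l.push_back(e.second); placed = true; }
      else if (l.front() == e.second) { l.insert(l.begin(), e.first); placed = true; }
    }
    if (!placed) lists.push_back({e.first, e.second});
  }
  return lists;
}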

Figure 2 (c) depicts the dependency analysis outcome on the internal representation of the Harris corner detector application. As can be seen from the figure, three fusible kernel lists are obtained for this application, each of which consists of two kernels. In our fusibility analysis, the number of kernels in each list is not limited: a list can contain many kernels that satisfy the dependency constraint and that are fused into a single kernel. Nevertheless, fusing more kernels does not always guarantee a performance improvement, due to the limited amount of resources and the variable granularity of different kernels. Next, we elaborate and impose two additional constraints in our fusibility analysis. They are used to safeguard the performance improvement of each fusible kernel list before the actual fusion starts.

3.2.3 Resource Estimation. Kernel fusion potentially increases the utilization of registers and shared memory4. If a fusible kernel list contains many kernels, it might not be beneficial to fuse them all without considering resource usage.

The absolute resource usage of a CUDA kernel, such as the number of used registers, can only be checked by inspecting the output of the NVCC compiler. Doing so requires that the kernel has already been translated and its code generated by Hipacc, i.e., that the source-to-source compilation has already finished. Therefore, it is not possible to obtain accurate information on resource usage during the execution of the source-to-source compiler.

In this work, we therefore compute an estimated resource utilization by extracting buffer information from the Hipacc kernels. Currently, we limit the estimation to shared memory only. Since modern GPUs generally have cached global memory access, data might still be served from the L1 cache when the kernel code causes register spills; and even if shared memory is used instead, the communication is still faster than through global memory. Therefore, we consider shared memory usage more critical than register usage during kernel fusion. Estimating register usage is left for future work.

In Hipacc, the size of the allocated shared memory depends on the given work-group configuration, the local operator size, and the size of the output image. Here, we omit the details behind this size calculation; interested readers are referred to [7]. This size information can be easily extracted from all the kernels, thanks to Hipacc's DSL.
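As an illustration only, a plausible shape of such a size estimate for a local operator is sketched below; the tile-plus-halo formula is our assumption of the typical pattern, and the exact Hipacc calculation is given in [7]:

#include <cstddef>

// Estimated shared-memory bytes for one kernel: the work-group's
// output tile enlarged by the filter halo, times the pixel size.
size_t estimateSharedBytes(int blockX, int blockY, int wx, int wy,
                           size_t pixelBytes) {
  return static_cast<size_t>(blockX + wx - 1) *
         static_cast<size_t>(blockY + wy - 1) * pixelBytes;
}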

Let s(k) be the buffer size (in bytes) allocated in kernel k ∈ V, and M_shared the total amount of shared memory per block for the current target GPU. The resource constraint used by the fusibility analysis model is defined by Eq. (2):

C_{\mathrm{rc}}(L) = \left( \sum_{k_i \in L} s(k_i) \right) < M_{\mathrm{shared}} \qquad (2)

For each fusible kernel list L, Eq. (2) is evaluated to guarantee that the available shared memory is not exceeded during kernel fusion. If a list utilizes more shared memory than available, the end kernel is removed from the list, and the new list is re-evaluated. These steps are repeated until Eq. (2) evaluates to true. The kernels dropped from the same list are re-concatenated to form another new list, which must also be evaluated against and satisfy the resource constraint.
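A sketch of this trimming loop, reusing the illustrative types from the previous sketches (sharedBytes is an assumed stand-in for the per-kernel estimate above):

#include <cstddef>
#include <vector>

// Illustrative stand-in; in Hipacc, this is derived from the kernel's
// work-group configuration, local operator size, and pixel type.
size_t sharedBytes(int kernel) { (void)kernel; return 4096; }

// Enforce Eq. (2): split a fusible list until every piece fits into
// the target's shared memory budget M_shared.
std::vector<std::vector<int>> resourceAnalysis(std::vector<int> list,
                                               size_t M_shared) {
  std::vector<std::vector<int>> result;
  while (!list.empty()) {
    std::vector<int> dropped;
    size_t total = 0;
    for (int k : list) total += sharedBytes(k);
    // Drop end kernels and re-evaluate until the list fits (a single
    // kernel is kept even if it exceeds the budget on its own).
    while (total >= M_shared && list.size() > 1) {
      total -= sharedBytes(list.back());
      dropped.insert(dropped.begin(), list.back());
      list.pop_back();
    }
    result.push_back(list);
    list = dropped;  // dropped kernels form a new list, checked next
  }
  return result;
}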

The above constraint can be loosened by applying additional optimizations, e.g., shared memory reuse. In Hipacc, kernels depend on each other by sharing the same input or output images; buffers allocated inside each kernel are not shared and are thus local to each kernel. This raises an opportunity to reuse certain buffers during kernel fusion: for each fusible kernel list, the buffer with the largest size can be shared and reused by all the kernels in the same list. This optimization will also be incorporated in future work.

3.2.4 Granularity Constraint. Another important concern in our fusibility analysis is the granularity of parallelism. For kernel fusion, granularity indicates the number of threads that are used by the producer and consumer kernel to write and read the intermediate buffer.

4 The term shared memory is used as CUDA terminology; it is equivalent to local memory in OpenCL. We refer to the fast on-chip scratchpad memory on GPUs. This memory is shared between all threads of a thread block and private to each block. Threads from other blocks cannot access the shared memory.

Figure 5: Granularity scenarios for point operator pairs: (a) same, (b) offset, (c) stride.

Figure 5 depicts three scenarios for a point-to-point kernel pair. In the context of Hipacc, the access pattern for output images is identical among kernels; the scenarios in the figure are distinguished by the access pattern of the consumer kernel. Scenario (a) illustrates the basic use case where the fused kernels share the same granularity. In this case, registers can be used to buffer the intermediate data. Scenarios (b) and (c) depict situations in which the user input requires an offset or strided access pattern. In these cases, the intermediate data is produced and consumed by different threads. Shared memory might need to be used to store the data, and further analysis is required to detect, for example, whether all the accessing threads are in the same work-group. Moreover, whenever a buffer in shared memory is written and read by different threads, we have a race condition; hence, synchronization is required among the threads. For scenario (c), if the three active threads are in the same warp as the other four inactive threads, if-statements are needed to explicitly disable the execution of the inactive threads. This introduces branch divergence and greatly degrades the performance of the fused kernel. In this work, we constrain our granularity analysis to scenario (a). Eq. (3) is used by the fusibility model to define the granularity constraint C_gl:

C_{\mathrm{gl}}(L) = \bigwedge_{v_i, v_j \in L} \left( (v_i, v_j) \in E \implies \mathrm{AccPattern}_a(v_j) \right) \qquad (3)

Here, AccPattern_a(v_j) evaluates to true if the access pattern of the consumer kernel v_j adheres to pattern (a) as shown in Figure 5.
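A sketch of how the per-pair check in Eq. (3) could look; the KernelInfo fields are illustrative assumptions, standing in for the accessor offsets that Hipacc extracts from the DSL:

#include <utility>
#include <vector>

struct KernelInfo {
  // Offsets/strides with which this kernel reads its producer's output;
  // assumed to be extracted from the Hipacc accessors.
  std::vector<std::pair<int, int>> readOffsets;  // (dx, dy) per access
  int strideX = 1, strideY = 1;
};

// Pattern (a) of Figure 5: the consumer reads the intermediate image
// exactly at its own pixel position, with unit stride.
bool accPatternSame(const KernelInfo &consumer) {
  if (consumer.strideX != 1 || consumer.strideY != 1) return false;  // (c)
  for (const auto &o : consumer.readOffsets)
    if (o.first != 0 || o.second != 0) return false;                 // (b)
  return true;
}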

Analogous to the previous resource constraint, Eq. (3) is also evaluated on the fusible kernel lists: all the fusible kernel pairs in a list must satisfy the granularity constraint. Both constraints attempt to eliminate kernels from the fusible lists that would degrade the performance after fusion. The execution steps are identical for both constraints, hence omitted here.

Algorithm 1: Fusibility analysis algorithm.
1  function FusibilityAnalysis(G)
2      S' ← DependencyAnalysis(G)          // dependency constraint
3      S ← {}                              // set of fusible kernel lists
4      forall l ∈ S' do
5          l' ← ResourceAnalysis(l)
6          l'' ← l' ∩ GranularityAnalysis(l)
7          S ← S ∪ {l''}
8      end
9      return S
10 end

The overall procedure is summarized as follows: First, G is traversed in the analysis phase. Processes and spaces that satisfy the dependency constraint C_dep are recorded and used to construct an initial set S' of fusible kernel lists. Then, for each list in the set S', the resource constraint C_rc and the granularity constraint C_gl are examined to further reduce the fusible kernels, if applicable. Finally, the lists that pass all the constraints are handed to our domain-specific fusion algorithm, resulting in one fused kernel per list. The entire procedure is shown in Algorithm 1.

So far, we have described all the building blocks of our fusibility analysis model. By combining that knowledge during compilation, we generate the fusible kernel lists. In the next subsection, we explain how each of those lists is fused into a single kernel.

3.3 Domain-specific Fusion
In this subsection, we present the implementation of the automatic kernel fusion in Hipacc. Given a fusible kernel list L produced by our analysis model, we currently consider any two kernels that share a producer-consumer dependency to have one of the following combinations: (a) a point operator followed by a point operator, (b) a local operator followed by a point operator, and (c) a point operator followed by a local operator. For each of these combinations, we discuss the locality improvement, the fusing procedure, and the computation and memory access estimations.

3.3.1 Point-to-Point Fusion. For two subsequent point operators in the list, the temporal locality can be significantly improved by using registers to buffer the intermediate data, as depicted in Figure 6. The left side shows the compute pattern before fusion. Initially, three buffers need to be allocated in global memory; the pink regions represent image pixels that are read or written.

The locality of the intermediate data (pixels) is determined based on the threads that produce them and the threads that consume them. In Figure 6, we can observe that each pixel in the intermediate image is produced and consumed by the same thread. This indicates that registers should be used to improve locality.

Figure 6: Point-to-point kernel fusion.

Table 1: Computation and global memory access estimation for point-to-point fusion for an image with N pixels.

                Load    Store   Compute
Before fusion   2·N     2·N     2·N
After fusion    N       N       2·N

The procedure to execute point-to-point fusion works as follows: First, the producer kernel is encountered during AST traversal. The body of the kernel is cloned, and the output image assignment is replaced by a register assignment. Second, the consumer kernel is encountered. The kernel body is cloned, and the input image read is replaced with a read of that register. Third, the bodies of both kernels are extracted from their AST nodes and concatenated to form the newly fused kernel body. Fourth, the entire fused kernel body is traversed, and variables are renamed to avoid repeated-declaration errors. Finally, a new AST node is created by constructing a new kernel with the fused body and the corresponding argument list. The procedure is similar to inlining; nevertheless, the whole process is automated at the AST level.
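As an illustration of what these steps produce, the following sketch shows a plausible fused point-to-point kernel; the kernel bodies and the _k1 renaming scheme are our own assumptions about the generated CUDA code, not Hipacc's actual output:

__global__ void fused_point_point(const unsigned char *in,
                                  unsigned char *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    // Producer body: its output image store was replaced by a
    // register assignment (steps 1 and 3).
    float pixel_k1 = in[y * w + x] * 0.5f;

    // Consumer body: its input image load was replaced by a read of
    // that register; variables were renamed to avoid clashes (step 4).
    out[y * w + x] = (unsigned char)(pixel_k1 + 16.0f);
}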

The right-hand side of Figure 6 depicts the computation pattern after fusion. We can observe that the intermediate load and store operations are eliminated. We perform a worst-case estimation of the number of global memory accesses and computations before and after kernel fusion. Assume two kernels execute point-wise operations on an image of size N. Without considering caching and memory coalescing, Table 1 depicts the estimation result. Note that we count all the operations to produce one output pixel as one computation.

3.3.2 Local-to-Point Fusion. Local-to-point fusion in Hipacc closely resembles the previous point-to-point scenario. Figure 7 depicts the compute pattern before and after fusion. Although the local operator requires more pixels as input, the intermediate pixel remains produced and consumed by the same thread. Hence, the point-to-point locality analysis still applies. Moreover, the fusion procedure is also analogous, since registers are used for intermediate storage.

Table 2 presents the estimation results for this scenario. We assume a filter of width wx = 3 for the local operator, which means the producer kernel needs 3 loads per pixel. Consequently, after fusion, the fused kernel also needs 3 loads per pixel; all other estimations remain the same. We can observe that the percentage of memory accesses saved is smaller compared to the point-to-point scenario. Nevertheless, the gain might still be significant for large images. A sketch of such a fused kernel is shown below.
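Under the same assumptions as before, a fused local-to-point kernel could look as follows (illustrative 1×3 producer filter, boundary threads skipped for brevity):

__global__ void fused_local_point(const unsigned char *in,
                                  unsigned char *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= w - 1 || y >= h) return;  // borders omitted

    // Producer (local operator): wx = 3 loads per pixel, result kept
    // in a register instead of an intermediate global buffer.
    float blur = (in[y * w + x - 1] + in[y * w + x] + in[y * w + x + 1]) / 3.0f;

    // Consumer (point operator): reads the register, not global memory.
    out[y * w + x] = (unsigned char)fminf(blur * 1.2f, 255.0f);
}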

Figure 7: Local-to-point kernel fusion with a local operator of width wx = 3.

Table 2: Computation and global memory access estimation for local-to-point fusion for an image with N pixels.

                Load        Store   Compute
Before fusion   (wx+1)·N    2·N     2·N
After fusion    wx·N        N       2·N

3.3.3 Point-to-Local Fusion. Whenever a local operator serves as the consumer kernel in a fusible pair, the locality analysis becomes challenging.

Figure 8 depicts the compute pattern before and after fusion. On the left side, we observe that the intermediate pixels are no longer produced and consumed by the same thread. Every three pixels depicted in the intermediate buffer are produced by three threads but consumed by the middle thread only. This prohibits the use of registers for buffering, since registers on GPUs are private per thread: the middle thread cannot access the registers of its neighboring threads and thereby cannot execute the filter operation. The next candidate for locality improvement is shared memory. Buffering data in shared memory enables access by multiple threads, but only within the same work-group. Threads in one work-group cannot access the shared memory of another work-group, so the boundary threads in any work-group are likely to stall execution due to data unavailability. Using shared memory only for certain threads easily yields branch divergence, and synchronization is required. These constraints disqualify the shared memory option and leave no room to improve locality.

Due to the characteristics of the GPU's SIMD execution and memory hierarchy, there is an opportunity to trade redundant computation for better temporal locality. In the remainder of this subsection, we discuss this trade-off in detail and elaborate on its benefits in the point-to-local fusion scenario.

As depicted in Figure 8, to produce one pixel in the output image, each thread executing the local operator (consumer kernel) demands three pixels as input. Instead of fetching them from an intermediate buffer, the thread can load the three pixels from the input image that are used to produce these intermediate pixels, as shown on the right side. After fusion, each thread executes not only the computation of the original local operator but also three times the computation of the original point operator. In this way, all the computations are kept in the same thread, and consequently all the intermediate pixels are buffered in registers.
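A sketch of this recompute-based point-to-local fusion under the same assumptions (wx = 3; the 0.5f scaling stands in for an arbitrary point operator):

__global__ void fused_point_local(const unsigned char *in,
                                  unsigned char *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= w - 1 || y >= h) return;  // borders omitted

    // Producer (point operator), executed wx = 3 times per thread:
    // each intermediate pixel is recomputed in a register instead of
    // being exchanged between threads via shared or global memory.
    float t0 = in[y * w + x - 1] * 0.5f;
    float t1 = in[y * w + x    ] * 0.5f;
    float t2 = in[y * w + x + 1] * 0.5f;

    // Consumer (local operator of width wx = 3) over the registers.
    out[y * w + x] = (unsigned char)((t0 + t1 + t2) / 3.0f);
}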

Figure 8: Point-to-local kernel fusion with a local operator of width wx = 3.

Table 3: Computation and global memory access estimation for point-to-local fusion for an image with N pixels.

                Load        Store   Compute
Before fusion   (wx+1)·N    2·N     2·N
After fusion    wx·N        N       (wx+1)·N

For the computation and global memory access estimation, we can compare with the previous local-to-point scenario. Before fusion, both scenarios have a point operator and a local operator; the order is not important since they are estimated individually. Therefore, the numbers of load, store, and compute operations before fusion are the same in both scenarios, as can be seen in Tables 2 and 3. After fusion, the load and store patterns for both scenarios are also the same, as depicted by the right sides of Figures 7 and 8. Regarding computation, as previously explained, each thread in the current scenario executes one computation for the consumer kernel and three other computations for the producer kernel. Therefore, the number of computations is doubled after fusion. Since the bottleneck in image processing is usually the available memory bandwidth, the cost of executing additional arithmetic instructions can almost be neglected, and we can still gain a large margin of improvement, which will be shown in the next section.

The procedure to execute point-to-local fusion differs from the previous cases: First, the producer kernel is encountered during AST traversal. The AST of the kernel body is cloned, and both the input and output image assignments are replaced by register assignments. Second, the consumer kernel is encountered. The kernel body AST is cloned, and the input image read that uses an intermediate buffer is replaced by another input image read that uses the input buffer. Third, the AST of the producer kernel is inserted after the just-replaced input assignment: at this location, the original intermediate image load is replaced with an input image load plus a point operator execution. Fourth, the result of the previous assignments is put in another register and concatenated with the rest of the consumer kernel. Fifth, the whole consumer kernel body AST is traversed, and variables are renamed to avoid repeated-declaration errors. Finally, a new AST node is created by constructing a new kernel with this fused AST and the corresponding argument list. The whole process is automated by an AST-Fuser class implemented in Hipacc at the AST level.

As can be seen from the computation and global memory access estimation tables, all the scenarios mentioned above are expected to benefit from kernel fusion. In general, kernel pairs with point operators as consumers can yield higher performance improvements than those with local operators. Nevertheless, by exploring the trade-off between locality and redundant computation, local operators as consumer kernels can also benefit from fusion. Local-to-local operator fusion is not addressed in this paper, as it is planned for future work.

To sum up, our fusibility analysis model parses a DAG that represents the input algorithm and generates a set of fusible kernel lists. Then, the domain-specific fusion is performed automatically, fusing each list into one fused kernel as a source-to-source transformation. In the following section, we present results on the performance improvement.

4 EVALUATION AND RESULTS
In this section, we analyze the speedups achievable by performing kernel fusion as an optimization technique on two image processing applications. First, the evaluation environment is introduced, then the algorithms are described. Finally, the benchmark results are discussed.

4.1 Environment
Our evaluation is based on Hipacc [7] (main branch) version 38, which depends on Clang/LLVM 3.8. The supported backends are described in Section 2.1. In this work, we focus on generating CUDA code for NVIDIA GPUs; nevertheless, the technique is applicable to other target languages as well. To further compile the generated CUDA code, we use NVCC release V7.5.17.

The hardware accelerator used in the evaluation is a GeForce GTX 745 graphics card from NVIDIA. The card is built on the Maxwell architecture and has compute capability 5.0. It features 384 CUDA cores with a base clock of 1033 MHz, a 900 MHz memory clock, and a 128-bit memory bus. The total amount of shared memory per block is 48 KB, the total number of registers available per block is 65536, and the warp size is 32. We use a constant block size of 128 by 1 throughout the evaluation.

All presented results are the median of 50 executions.

4.2 Applications
We chose two image processing applications to benchmark our optimization technique. Both applications are based on image filters; this type of stencil operation is common in many image processing algorithms. Next, we briefly explain the implemented algorithms.

4.2.1 Harris Corner Detector. The Harris corner detector [3] is a classic filter for image preprocessing and low-level feature extraction. The Hipacc implementation requires a combination of point and local operators to form the whole pipeline, as depicted by Figure 2 (a). Overall, nine kernel invocations are needed to process one input image. The local operators in Hipacc access neighboring pixels, which requires boundary handling; in this evaluation, we specify clamp as the boundary handling mode in Hipacc. Furthermore, the filter is computed on a gray-scale image.

4.2.2 Night Filter. The night filter [4, 13] is chosen as a representative of a widely adopted postprocessing filter. First, it executes bilateral filtering by iteratively applying the 'à trous' (with holes) algorithm with different sizes (3×3, 5×5). Then, the actual tone mapping curve is applied. The Hipacc implementation of this filter is quite straightforward: it consists of three kernels Atrous0, Atrous1, and Scoto, which are linearly dependent and executed in sequential order. Furthermore, the filter is computed on RGB images.

4.3 Results
The results for the Harris corner detector are depicted in Figure 9. We execute the algorithm with an input image of size 2048 by 2048 and a filter of size 3 by 3. Figure 2 depicts the complete optimization pipeline for the Harris corner algorithm, including which kernels are identified and fused.

Figure 9 compares the execution time of the fused kernels against their non-fused counterparts. Table 4 presents the detailed execution times for all the kernels as well as the whole pipeline. Note that the kernels dx, dy, and hc are categorized as unfusible and hence are not shown in Figure 9. The execution time of those three kernels remains approximately unchanged compared to the unfused version, as shown in Table 4. The other kernels in the Harris corner detector pipeline are identified as fusible and are grouped into three fusible kernel lists. From Figure 9 and Table 4, we achieve a speedup of 1.6 by fusing sx with gx, and sy with gy. By applying kernel fusion, we are able to speed up the entire Harris corner detector by a factor of 1.23.

Figure 9: Execution time comparison (in ms) for the fusible kernels of the Harris corner detector (sx+gx, sy+gy, sxy+gxy), non-fused vs. fused.

Table 4: Execution time (ms) for the pipeline of the Harris corner detector

Kernels      Before fusion   After fusion   Speedup
dx           2.18            2.16           –
dy           2.16            2.19           –
sx + gx      3.79            2.38           1.59
sy + gy      3.80            2.37           1.60
sxy + gxy    4.49            3.90           1.15
hc           2.95            2.79           –
Overall      19.37           15.80          1.23

Table 5: Execution time (ms) for the night filter

Kernels          Before fusion   After fusion   Speedup
Atrous0          7.68            7.69           –
Atrous1+Scoto    9.29            8.84           1.05
Overall          16.97           16.53          1.03

Another observation is that fusing kernel sxy with kernel gxy yields less improvement. This is because kernel sxy requires two input images, as can be seen from Figure 2. In the worst-case global memory access estimation, the extra input image reduces the amount of improvement from kernel fusion. Here, we further extend the analysis of Table 3 as follows: before fusion, the number of global memory loads is (wx + 1) · N for a single input image. If an additional input image is required, the first point operator must load two images; hence, the number of loads becomes (wx + 2) · N. After fusion, this number is wx · N for a single input image and 2 · wx · N for two input images. Now, assuming the employed filter has a width of 3, the numbers of global memory loads before and after fusion are 4 · N and 3 · N, respectively; hence, 25% of the original global memory loads can be saved. For the two-input-image case, however, the numbers of global memory loads before and after fusion are 5 · N and 6 · N, respectively. Without considering optimizations or caching effects, fusion might thus increase the number of global memory loads, although the number of stores is still reduced by half. Depending on the image size, further optimizations, and hardware specifications, memory loads might be cached, and the actual numbers might differ from the above analysis. Nevertheless, we can conclude that the number of input images can potentially degrade the performance, and fusion might not always be beneficial in the point-to-local scenario. The number of input images of the producer kernel should be considered during fusibility analysis.

For the night filter, we compile and run the fused code with an input image of size 1920 by 1200 and two filters of size 3 by 3 and 5 by 5, respectively. Our fusibility analysis model identifies the last two kernels as fusible; hence, Atrous1 and Scoto are fused during optimization. Table 5 presents the detailed execution times for all the kernels as well as the whole pipeline. By applying kernel fusion, we gain a speedup of only 1.03. This improvement is minor compared to the Harris corner detector, which is due to the difference in arithmetic intensity of the kernels. The Atrous and Scoto kernels in the night filter are much more expensive to compute than the kernels of the Harris corner detector. In addition to the many instructions per kernel, the RGB image representation triples the number of computations compared to the gray-scale representation. This shows that the complexity of the kernels can affect the optimization performance. It is desirable to identify applications that are memory-bound during optimization. As a potential remedy, the number of arithmetic instructions in a kernel can be estimated as an indication of arithmetic intensity. Since kernel fusion focuses on inter-kernel communication, kernels with high compute intensity are unlikely to benefit. This should be built as a constraint into our fusibility analysis framework and will be addressed in future work.

5 CONCLUSION
We presented an automatic optimization framework that employs kernel fusion in the context of image processing DSLs. We seamlessly integrated the proposed technique into the Hipacc framework and thus enabled inter-kernel optimizations during source-to-source compilation for GPUs. Specifically, we delivered a fusibility analysis model, which benefits from the Hipacc kernel-type representations and extracts information at the AST level during compilation. The presented domain-specific fusion method enables the automatic fusion of Hipacc operators during code generation. Results show that we can achieve speedups of up to 1.23 for common filter-type image processing pipelines and up to 1.60 for the fusible kernels.

This work stands at the beginning of compiler-based automatic optimization in the context of image processing DSLs. In the future, we will enlarge the fusion scope of our analysis model by loosening the constraints and considering more scenarios, such as local-to-local fusion. Moreover, we plan to support kernel fusion for other Hipacc backends, such as FPGAs [5, 11].

REFERENCES
[1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006. ISBN: 0321486811.
[2] J. Filipovič, M. Madzin, J. Fousek, and L. Matyska. Optimizing CUDA code by kernel fusion: Application on BLAS. The Journal of Supercomputing, 71(10):3934–3957, Oct. 2015. DOI: 10.1007/s11227-015-1483-z.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the Fourth Alvey Vision Conference (AVC), Manchester, UK, pages 147–151, Sept. 1988. DOI: 10.5244/C.2.23.
[4] H. W. Jensen, S. Premoze, P. Shirley, W. B. Thompson, J. A. Ferwerda, and M. M. Stark. Night Rendering. Technical report UUCS-00-016, Computer Science Department, University of Utah, Aug. 2000.
[5] D. Koch, F. Hannig, and D. Ziener, editors. FPGAs for Software Programmers. Springer, June 2016. ISBN: 978-3-319-26406-6. DOI: 10.1007/978-3-319-26408-0.
[6] R. Membarth, F. Hannig, J. Teich, M. Körner, and W. Eckert. Generating device-specific GPU code for local operators in medical imaging. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, pages 569–581. IEEE, May 2012. DOI: 10.1109/IPDPS.2012.59.
[7] R. Membarth, O. Reiche, F. Hannig, J. Teich, M. Körner, and W. Eckert. HIPAcc: A domain-specific language and compiler for image processing. IEEE Transactions on Parallel and Distributed Systems, 27(1):210–224, Jan. 2016. DOI: 10.1109/TPDS.2015.2394802.
[8] R. T. Mullapudi, A. Adams, D. Sharlet, J. Ragan-Kelley, and K. Fatahalian. Automatically scheduling Halide image processing pipelines. ACM Transactions on Graphics, 35(4):83:1–83:11, July 2016. DOI: 10.1145/2897824.2925952.
[9] R. T. Mullapudi, V. Vasista, and U. Bondhugula. PolyMage: Automatic optimization for image processing pipelines. ACM SIGARCH Computer Architecture News, 43(1):429–443, Mar. 2015. DOI: 10.1145/2786763.2694364.
[10] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Seattle, WA, USA, pages 519–530. ACM, 2013. DOI: 10.1145/2491956.2462176.
[11] O. Reiche, M. Özkan, R. Membarth, J. Teich, and F. Hannig. Generating FPGA-based image processing accelerators with Hipacc. In Proceedings of the International Conference on Computer Aided Design (ICCAD), Irvine, CA, USA, pages 1026–1033. IEEE, Nov. 2017. DOI: 10.1109/ICCAD.2017.8203894.
[12] O. Reiche, M. Schmid, F. Hannig, R. Membarth, and J. Teich. Code generation from a domain-specific language for C-based HLS of hardware accelerators. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), New Delhi, India, pages 17:1–17:10. ACM, Oct. 2014. DOI: 10.1145/2656075.2656081.
[13] M. J. Shensa. The discrete wavelet transform: Wedding the à trous and Mallat algorithms. IEEE Transactions on Signal Processing, 40(10):2464–2482, Oct. 1992. DOI: 10.1109/78.157290.
[14] G. Wang, Y. Lin, and W. Yi. Kernel fusion: An effective method for better power efficiency on multithreaded GPU. In Proceedings of the 2010 IEEE/ACM International Conference on Green Computing and Communications & International Conference on Cyber, Physical and Social Computing (GreenCom-CPSCom '10), pages 344–350, Washington, DC, USA. IEEE Computer Society, 2010. DOI: 10.1109/GreenCom-CPSCom.2010.102.
[15] H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, and S. Chakradhar. Optimizing data warehousing applications for GPUs using kernel fusion/fission. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 2433–2442, May 2012. DOI: 10.1109/IPDPSW.2012.300.

[15] H. Wu, G. Diamos, J. Wang, S. Cadambi, S. Yalamanchili, andS. Chakradhar. Optimizing data warehousing applicationsfor GPUs using kernel fusion/fission. In Proceedings of theIEEE 26th International Parallel and Distributed ProcessingSymposium Workshops & PhD Forum (IPDPSW), pages 2433–2442, May 2012. doi: 10.1109/IPDPSW.2012.300.