

Design-Space Exploration of Embedded Hardware

Accelerators for Image Processing Applications

by

Onur Can Ulusel

B.S., Sabanci University; Istanbul, Turkey, 2008

Sc.M., Sabanci University; Istanbul, Turkey, 2010

A dissertation submitted in partial fulfillment of the

requirements for the degree of Doctor of Philosophy

in The School of Engineering at Brown University

PROVIDENCE, RHODE ISLAND

May 2016

© Copyright 2016 by Onur Can Ulusel


This dissertation by Onur Can Ulusel is accepted in its present form

by The School of Engineering as satisfying the

dissertation requirement for the degree of Doctor of Philosophy.

Date

Iris Bahar, Ph.D., Advisor

Recommended to the Graduate Council

Date

Sherief Reda, Ph.D., Reader

Date

Benjamin Kimia, Ph.D., Reader

Approved by the Graduate Council

Date

Peter Weber, Dean of the Graduate School



Vitae

Onur Can Ulusel was born in Balıkesir, Turkey on November 14, 1986. He received his B.Sc. and M.Sc. in Electronics Engineering from Sabanci University in 2008 and 2010. He then came to Brown University, Providence, Rhode Island to pursue a Doctor of Philosophy in Engineering.

His research interests include the exploration of parallel computing techniques for low-power embedded systems, power reduction techniques for reconfigurable computing, and the development of design methods to accelerate image processing systems.

onur [email protected]

Brown University, RI, USA


Acknowledgements

First and foremost, I thank my advisor, Professor Iris Bahar, for her patience, encouragement and guidance throughout my studies. She has been a great mentor to me and I feel privileged to be her student. I would also like to thank Professor Sherief Reda, who has provided great insight and invaluable feedback throughout my studies.

I am thankful to Professor Benjamin Kimia for agreeing to serve as a member of my dissertation committee despite the hardship involved. His comments and questions have made this work better.

I also would like to thank all my friends and colleagues from the third floor of Barus and Holley. Thanks to them, my stay at Brown has been an even more pleasant experience. I would like to thank Kumud, Marco, Dimitra, Kapil, Anıl, Fırat, Osman, Octi, Brandon, Soheil, Chhay, Reza, Xin, Cesare, Monami, Chris P. and Chris H.

Last but not least, I deeply thank my family. The unwavering support and relentless encouragement of my fiancée Sema have helped me immensely during this journey. I would also like to express my deepest gratitude to my beloved parents, Enis and Cigdem, and to my sister Melis, who have always believed in me and have always been there to support me.


Abstract of "Design-Space Exploration of Embedded Hardware Accelerators for Image Processing Applications" by Onur Can Ulusel, Ph.D., Brown University, May 2016

Computer vision applications have gained significant popularity in their use for mobile, battery-powered devices. These devices range from everyday smartphones to autonomously navigating unmanned aerial vehicles (UAVs). While the image processing required by these applications may be transferred to the cloud or other off-device computing engines, real-time computing requirements and limited data transfer capabilities make it desirable for computation to be handled locally whenever possible. However, local computation can be quite challenging for mobile and embedded systems due to the highly computationally intensive nature of computer vision algorithms, and it requires careful consideration of the target design constraints and possible design parameters.

In this dissertation work, we first implement two real-time image processing accelerators as test cases to be used for fast design space exploration: one for image deblurring and one for block matching. For these designs, we identify both algorithmic and hardware parameters that optimize these accelerators and demonstrate the performance, power and accuracy trade-offs of our target applications on FPGAs.

For the second part of this dissertation, we present a power and performance evaluation of several low-cost feature detection and description algorithms implemented on various embedded systems platforms (embedded CPUs, GPUs and FPGAs). We present a streamlined FPGA implementation for feature detection which includes a pre-processing stage to eliminate unnecessary computation and a computation flow which makes maximum utilization of pixel proximity and avoids down-time after the initial loading of image pixels. In addition, we present a combined FPGA implementation on low-cost Zynq SoC FPGAs which pipelines feature detection with feature description, realizing increased efficiencies in performance, power dissipation and energy consumption compared to other embedded platforms. We show that, despite the high level of parallelization that embedded GPU platforms like the NVIDIA Jetson TK1 provide, the computation of multiple kernels is highly bounded by the kernel scheduler and by memory bottlenecks, reducing the GPUs' effectiveness, whereas the customization of FPGAs on multiple layers can handle the operation of multiple kernels much more efficiently.


Contents

Vitae iv

Acknowledgments v

1 Introduction 1
  1.1 Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators 4
  1.2 Hardware Acceleration on Low-power Embedded Platforms 6
  1.3 Thesis Contributions 8

2 Background and Previous Work 10
  2.1 Design Space Exploration 11
  2.2 Image Processing Applications 17
    2.2.1 Image Deblurring 17
    2.2.2 Block Matching 20
    2.2.3 Feature Detection 24
    2.2.4 Feature Description 25
    2.2.5 Hardware Acceleration of Image Processing Kernels 30

3 Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators 34
  3.1 Modeling and Optimization Methodology 38
  3.2 Image Processing Applications 39
    3.2.1 Image Deblurring 40
    3.2.2 Block-Matching 46
  3.3 Experimental Results 49
    3.3.1 Modeling Results 51
  3.4 Summary and Discussion 62

4 Hardware Acceleration on Low-power Embedded Platforms 64
  4.1 Selection of Feature detection and description algorithms 65
  4.2 Platform Implementations 68
    4.2.1 FPGA Architecture 69
    4.2.2 GPU Architecture 74
  4.3 Results 75
  4.4 Summary and Discussion 80

5 Summary of Dissertation and Possible Future Extensions 82
  5.1 Summary of Results 83
  5.2 Future Work 84


List of Tables

3.1 Data flow of block-matching PEs 48
4.1 Instruction number comparison between GPU and FPGA implementations 77
4.2 Resource utilization on the Zynq FPGA 79


List of Figures

2.1 Pareto efficiency in a design space [72]. 11
2.2 Design space exploration framework proposed by Palermo et al. [51]. 12
2.3 Technology mapping: (a) An original netlist (b) is segmented into possible covering and (c) mapped into LUTs [42]. 16
2.4 Unmanned air vehicle system to be used with our deblur accelerator. 18
2.5 (a) Example of a blurred image taken by aerial photography and (b) deblurred image using the Landweber algorithm. 20
2.6 The search patterns for (a) Three Step Search, (b) Diamond Search, and (c) Hexagonal Search [59]. 21
2.7 Computation of motion vectors of a given image block in a reference frame using the Block Matching algorithm [55]. 22
2.8 Application of the Block Matching algorithm over two consecutive frames and the resulting motion vectors. 23
2.9 The Bresenham circle is used to determine if interest point p is a corner feature. Figures taken from [57]. 25
2.10 Various sampling patterns used for the BRIEF descriptor. Figures taken from [12]. 28
2.11 The sampling pattern proposed for BRISK with N = 60 points. The blue circles correspond to the points of interest detected by the feature detector algorithm and the surrounding red circles represent the standard deviation of the Gaussian smoothing kernel applied over the interest points. Figure taken from [39]. 29
2.12 Illustration of (a) FREAK sampling patterns and (b) the human retina. The receptive cells in the retina are clustered into four areas with different densities, which is replicated in the FREAK sampling pattern. In (a), each circle represents an image block that requires smoothing with its corresponding Gaussian kernel. Figure taken from [3]. 30
3.1 Illustration of the idea of using regression-based modeling for design space exploration and finding important designs based on objectives and constraints. Each star on the graph on the right represents a design variant and the dashed line represents the Pareto frontier. Designs shown in dashed yellow boxes represent optimal designs given by the optimization framework while the ones in blue represent the training set. 36
3.2 Top-level block diagram for deblur architecture. 40
3.3 Architecture of a single row of pixel array. 43
3.4 Comparison of DSP pipeline depth of (a) 6 and (b) 3. 44
3.5 Time-division multiplexing for a factor of 2. 45
3.6 Top-level block diagram for block-matching architecture. 46
3.7 Power measurement setup using external digital multimeter. 51
3.8 Error percentage of power model over explored design space percentage. 52
3.9 Sensitivity of different parameters over the power estimation. 54
3.10 Comparison of mean error percentage using different model fits for power estimation, area and arithmetic accuracy models for the image deblur algorithm. 55
3.11 Comparison of mean error percentage using different model fits for power estimation, area and arithmetic accuracy and throughput models for the block-matching algorithm. 58
3.12 Trade-off between power and arithmetic inaccuracy of the image deblurring system. 58
3.13 Trade-off between area and power of the image deblurring system. 59
3.14 Trade-off between arithmetic inaccuracy and area of the block matching system. 60
4.1 The precision/recall rate and the run-time comparison of feature descriptors on an Intel i7 CPU. 65
4.2 Flowchart for feature detection and description. 67
4.3 Top-level block diagram for FPGA implementation with FAST feature detection and BRIEF/BRISK/FREAK feature description. 70
4.4 Issue Stall Reasons for FAST and BRIEF implementations on GPU. 76
4.5 Run-time and power results for FAST feature detection and BRIEF/BRISK/FREAK feature description algorithms over various embedded systems. 78


Chapter 1

Introduction

Visualization and communication technology is growing at a rapid rate. The availability of a camera and a large number of sensors in mobile devices has changed our expectations in all aspects of our lives, from healthcare to education and from defense to entertainment. The technological enablers for this change have been the rising trends in semiconductor technology, with an increase in the number of transistors on chips in alignment with Moore's Law [47], and in computer architecture, which managed to keep the power density of these chips relatively constant in accordance with Dennard scaling [22]. However, despite the ever-increasing expectations of end users and designers, the advancements in semiconductor and computer architecture technology can no longer sustain this demand alone. Similarly, simply increasing clock frequencies to speed up digital designs is becoming increasingly difficult [36, 45, 71]. As we try to transfer state-of-the-art computer vision algorithms designed for high-performance desktop computers onto more modestly performing, energy-efficient mobile platforms, designers are expected to increase throughput per Watt in order to achieve maximum performance while still meeting low power budgets [26, 6].

In the past decade, computer vision applications have gained significant popularity in their use for mobile, battery-powered devices. These devices range from everyday smartphones to autonomously navigating unmanned aerial vehicles (UAVs). While the image processing required by these applications may be transferred to the cloud or other off-device computing engines, real-time computing requirements and limited data transfer capabilities make it desirable for computation to be handled locally whenever possible. However, local computation can be quite challenging for mobile and embedded systems due to the highly computationally intensive nature of computer vision algorithms. Even a typical digital camera capturing VGA-resolution (640×480) video at a rate of 30 frames per second requires the processing of 27 million pixels per second [30]. In addition, the limited size, weight, and battery lifetime of these systems impose further constraints.

Embedded system solutions based on reconfigurable logic, such as Field Programmable Gate Arrays (FPGAs), or on Graphics Processing Units (GPUs) have been especially sought after for real-time computer vision applications because of their high throughput and computation capabilities. These systems are ideal as inexpensive prototyping platforms for implementing high-throughput solutions, where iterative refinement and validation of a design implementation can be performed until the desired performance goals are achieved. Examples of such platforms are described in [36, 45, 71]. However, adding more hardware resources to solve the throughput problem may not always lead to a feasible real-time solution.

In this thesis, we explore the design space of various computer vision algorithms for embedded systems, namely FPGA- and GPU-based systems. We analyze the impact of algorithmic and design-level implementation decisions on metrics such as throughput, power, design area and arithmetic accuracy. We observe how different design decisions lend themselves better to certain embedded system platforms and how we can generalize these techniques for efficient acceleration of other computer vision algorithms. These generalized techniques can then be used to formulate regression-based mathematical models that speed up the design space exploration process by discovering optimal custom-designed solutions for specific computer vision applications.

With this thesis work, we aim to help designers make more educated choices during the early design phase for any given implementation. We will first try to answer how regression-based fast design space exploration models can be applied specifically to computer vision algorithms and what types of algorithmic and architectural design decisions should be made to satisfy different design constraints. Then we will expand our design space exploration to guide designers in selecting the optimal embedded system platform based on the algorithmic characteristics of the desired applications.

Chapter 2 provides background and related work on design space exploration and the acceleration of selected computer vision algorithms. Several comparisons of embedded systems and various related work on their design space are presented. In the following chapters, we discuss a number of computer vision algorithms and implementations, as well as the techniques we have developed to efficiently accelerate computer vision algorithms for embedded systems. We also discuss how we can formulate our findings to speed up the design space exploration for computer vision algorithms.

1.1 Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators

The ease of use and reconfigurability of FPGAs make them an attractive platform for accelerating algorithms, and FPGA-based accelerators are therefore widely used in real-time image processing. With the level of customization provided via programmable logic elements, lookup tables (LUTs), Block RAMs (BRAMs), and digital signal processor (DSP) blocks, FPGAs can achieve high throughput and computation capabilities while providing a faster time-to-prototype cycle compared with Application Specific Integrated Circuits (ASICs). Due to the real-time requirements of our high-throughput test cases, we have elected to demonstrate the performance, power and accuracy trade-offs of our target applications on FPGAs.

We observe that FPGA-based accelerators, especially those that can be used for image processing, offer many algorithmic and hardware design parameters which, when properly chosen, can lead to outcomes with the desired throughput, power, design area and arithmetic accuracy. However, compared with standard-cell-based ASICs, LUT-based logic implementation is inefficient in terms of power consumption, and programmable switches have higher power consumption because of large output capacitances. As low power has become an important design metric, designers should now consider the impact of their design decisions not only on speed and area, but also on power consumption throughout the entire design process [1, 21].

In Chapter 3, we discuss the exploration of algorithmic and design-level decisions we have applied for FPGA-based hardware acceleration. We propose techniques for fast design exploration and multi-objective optimization to quickly identify both algorithmic and hardware parameters that optimize these accelerators. We also show the regression-based modeling we have applied to our design decisions to accelerate the design space exploration process on the given parameter space of the algorithms.

To demonstrate the effectiveness of our methodology, we have selected the image deblurring and block-matching algorithms as our test cases for hardware acceleration in Chapter 3. Both operations are fundamental components of many image processing applications. Image deblurring is the process of restoring blurred images, where image blur is a form of bandwidth reduction caused by the imperfect nature of the image capturing process. It can be caused by the relative motion between the camera and the original scene, or by an optical system that is out of focus. Even slight camera shake under low-light conditions may cause image blur, as can atmospheric turbulence in aerial photography [9, 75]. Restoration of blurred images is therefore an essential initial step in many image processing applications.

Our secondary benchmark, block-matching, is a sliding window operation performed over video sequences and is commonly used in motion estimation and in video compression applications such as the H.264 and MPEG-4 standards. Block-matching is used to reduce the bit-rate in video compression systems by exploiting the temporal redundancy between successive frames, and it is used to enhance the quality of displayed images in video enhancement systems by extracting the true motion information. Finding matching blocks in successive frames allows the information to be compressed as motion vectors and pixel intensity differences instead of sending the raw image data [55].

We will use both the image deblurring and block-matching designs to analyze the effectiveness of algorithmic and hardware-level design choices and show the impact of these choices on our regression-based models.

This work has allowed us to generate methodological formulations to predict the impact of various design choices on desired design metrics such as area, throughput and power. Using such fast design space exploration techniques, custom designs can be adjusted to target specific design constraints without designers enumerating all permutations of the design space explicitly. However, the methods presented require extensive knowledge of the potential design parameters and the target application domain, and they are limited to the domain of FPGAs. In our next work, we have expanded into a different set of image processing applications that share major components in order to explore the design space of various embedded systems.

1.2 Hardware Acceleration on Low-power Embedded Platforms

Moving beyond the core image processing operations of image deblurring and block-matching, we turn to the core kernels of the next part of this thesis: feature detection and feature description. These are key building blocks of many computer vision algorithms, including image retrieval, biometric identification, visual odometry [50], object detection, tracking, motion estimation and 3D reconstruction. Efficient feature extraction and description are crucial due to the real-time requirements of such applications over a constant stream of input data. High-speed computation typically comes at the cost of high power dissipation, yet embedded systems are often highly power constrained, making the discovery of power-aware solutions especially critical for these systems. Therefore, a computationally efficient means of detecting and analyzing image features is a critical first step in the development of energy-efficient, single-chip solutions for these applications.


In Chapter 4, we introduce our comparative study of embedded platforms and show how the potential of different embedded platforms can be maximized through application-specific customization. We present a power and performance evaluation of several low-cost feature detection and description algorithms implemented on various embedded systems. We evaluate these algorithms in terms of run-time performance, power dissipation and energy consumption. In particular, we compare embedded CPU-based, GPU-accelerated, and FPGA-accelerated embedded platforms and explore the implications of various architectural features for the acceleration of these fundamental computer vision algorithms. We show that FPGAs in particular offer attractive solutions for both performance and power, and we describe several design techniques utilized to accelerate feature extraction and description algorithms on low-cost Zynq system-on-chip (SoC) FPGAs.

In our analysis we customize off-the-shelf implementations of our algorithms of interest to target both embedded CPUs and GPUs. Our FPGA-accelerated implementations of feature detection and description, however, take advantage of the highly customizable logic fabric to realize significant improvements in both run-time and power dissipation compared to the other embedded solutions.

In addition, we discuss the design techniques applied to obtain high-throughput solutions and hardware-specific power reductions. We conclude that, due to its extra customization and flexibility, our FPGA-accelerated implementation is a promising way forward for the development of low-power, energy-efficient platforms capable of providing real-time performance for complex computer-vision-based applications such as autonomous navigation.

Under this research we provide a comprehensive comparison between embedded CPU, GPU and FPGA implementations of feature detection and description algorithms, evaluating their power and performance trade-offs. We propose a streamlined FPGA implementation for feature detection which includes a pre-processing stage to eliminate unnecessary computation and a kernel that uses a zig-zag pattern for image masks, which makes maximum utilization of pixel proximity and avoids down-time after the initial loading of image pixels. In addition, we propose a combined FPGA implementation which pipelines feature detection with feature description, realizing increased efficiencies in performance, power dissipation and energy consumption compared to other embedded platforms.

1.3 Thesis Contributions

To summarize, in this thesis we will investigate different hardware accelerator platforms specifically targeted at real-time image processing applications. We will explore various options for algorithmic as well as architectural design decisions and use them to train design space exploration models that designers can use to create optimal accelerators under a range of constraints. We will explore different embedded systems and accelerate various image processing algorithms in order to demonstrate the applicability of algorithms to specific platforms. We will use the design techniques we have developed to present streamlined FPGA implementations utilizing deep pipelining, continuous filter flow, and pre-computation steps.

Each design presents a vast number of parameters from which to select, and enumerating each permutation of these parameters is simply not possible. This thesis work provides the necessary tools for a designer to make educated design decisions during the early phase of the design process. We will show the interdependence of design parameters and present constraint-specific guidelines for our selections. Our analysis covers power-driven as well as performance-driven constraints, in contrast to other design space exploration work found in the literature.

This thesis is organized as follows. In Chapter 2, we present the necessary background and related work in the field of design space exploration. We describe the image processing applications used in this work in detail and also discuss the previous work in the literature on hardware acceleration of such image processing applications. In Chapter 3, we discuss the exploration of algorithmic and design-level decisions we have applied to the image deblurring and block-matching algorithms for FPGA-based hardware accelerators. We also show the regression-based modeling we have applied to our design decisions to accelerate the design space exploration process on the given parameter space of the algorithms. Chapter 4 presents our comparative study of embedded platforms and shows how each algorithm maps to various embedded systems differently based on the underlying architecture of the platform as well as the characteristics of the algorithms themselves. Finally, Chapter 5 presents our conclusions and potential future projects that can build upon the presented work.


Chapter 2

Background and Previous Work

In this chapter, we will discuss design-space exploration for hardware-accelerated computer vision algorithms and the current trends and prospects of available systems. We will present some fundamentals of design-space exploration and review various techniques proposed in prior literature. We will start out by describing some of the metrics that are important from the design point of view. We will then address specific methodologies to obtain designs that are considered optimal in terms of these design metrics. We will investigate specific cases of optimization done so far for hardware accelerators, with greater focus on analytical modeling of the design space as well as inexact circuits and approximate computing as a means for obtaining low-area/low-power circuit alternatives.


Figure 2.1: Pareto efficiency in a design space [72].

2.1 Design Space Exploration

Previous work on accelerating design space exploration mainly follows two different approaches: reducing the number of configurations to be evaluated, and design space evaluation via modeling. Publications that follow the former approach include the work by So et al. [66], where design space exploration options are automatically explored by their own FPGA synthesis compiler. They suggest that a key step to fast design space exploration is to automate it using a high-level programming paradigm coupled with compiler technology oriented towards FPGA designs. Their proposed compiler tool analyzes a given design and makes pre-defined transformations, such as loop unrolling and array renaming, to automatically optimize it according to compiler-defined criteria. Their automated tool can find the optimal design by searching through only 0.3% of the design space, yet their optimization criterion is solely driven by minimizing the execution time of the given algorithm while staying under the area budget, and it cannot be changed based on designer input.


Figure 2.2: Design space exploration framework proposed by Palermo et al. [51].

Palermo et al. [51, 2] propose finding approximate Pareto points over the design space as a means for efficient design space exploration. Pareto efficiency is a commonly used term in optimization that represents a state where it is impossible to improve one design objective without making at least one other design objective worse off. All design permutations that are in a Pareto-efficient state are considered Pareto points, and the collection of all Pareto points is called the Pareto curve. A simple illustration of Pareto efficiency is given in Figure 2.1, where the red points represent the Pareto points of a design space with two objectives. The design space exploration framework proposed by Palermo et al. is given in Figure 2.2. The System Description Modules are the inputs to their Design Space Evaluation Modules, carrying information about the target design space and the application domain. The Optimizer module selects a set of candidate optimal points to be evaluated in terms of the evaluation functions. Each selected point is mapped to a target architecture and then evaluated using the executable model. The results are evaluated by the Optimizer to estimate the Pareto curve using heuristic algorithms such as Random Search Pareto [76] and Pareto Simulated Annealing [19]. Despite performing a multi-objective exploration, they present a limited design space mainly composed of transformations applied to specifically modified source code.
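To make the Pareto-point notion concrete, the following minimal sketch filters a set of two-objective design evaluations down to its Pareto points; the function name and the sample (power, latency) values are illustrative assumptions, not part of any framework discussed above.

```python
def pareto_points(designs):
    """Return the designs not dominated by any other design.

    Each design is a tuple of objective values to be minimized, e.g.
    (power, latency). A design is dominated if some other design is no
    worse in every objective and differs in at least one.
    """
    return [
        d for d in designs
        if not any(
            all(o <= v for o, v in zip(other, d)) and other != d
            for other in designs
        )
    ]

# Illustrative two-objective design space: (power in W, latency in ms).
designs = [(1.0, 9.0), (2.0, 5.0), (3.0, 4.0), (2.5, 6.0), (4.0, 4.5)]
print(pareto_points(designs))  # -> [(1.0, 9.0), (2.0, 5.0), (3.0, 4.0)]
```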


The work by Sheldon and Vahid [62] uses the Design of Experiments paradigm to generate the Pareto points that are of most interest to the designer. Design of Experiments (DoE) [46] is a statistical paradigm whose objective is to design a small set of experiments that provides maximum information on how the experimental parameters influence the experimental output and interact with one another. Three statistics can be learned through DoE: (1) the positive or negative impact of a given parameter on the output, (2) whether a parameter is beneficial to the output, and (3) how each parameter interacts with the others. Sheldon and Vahid propose a DoE-based Pareto point generation using a multi-phase approach. The first phase automatically generates a parameter interdependency graph, a weighted graph whose edges show the dependencies between the parameters. Each parameter is initially assumed to be independent, and for each potential dependency, tests are evaluated in which each parameter is first changed individually and then together with the rest of the parameters. The accuracy of these estimates is used to compute the pairwise edge errors and update the edge values in the generated graph. The second phase of the algorithm then generates Pareto points from the weighted parameter interdependency graph, starting from the node pairs with the highest edge value. The DoE approach presented by Sheldon and Vahid is effective for a small number of parameters; however, due to the parameter-dependency generation phase, it has a quadratic time complexity (O(n^2), where n is the number of parameters) and is therefore inherently slow.
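As a minimal illustration of the statistic behind such interdependency graphs, the sketch below estimates the pairwise interaction of two design parameters by comparing their joint effect on a metric against the sum of their individual effects; eval_design, the parameter names, and the toy area metric are hypothetical stand-ins for a real synthesize-and-measure flow.

```python
def interaction(eval_design, base, p1, v1, p2, v2):
    """Estimate the pairwise interaction of two parameters.

    Measures each parameter's individual effect on the output metric and
    the effect of changing both together; a large residual suggests the
    parameters are interdependent and should share a weighted edge.
    """
    y0 = eval_design(base)
    y1 = eval_design({**base, p1: v1})           # change p1 alone
    y2 = eval_design({**base, p2: v2})           # change p2 alone
    y12 = eval_design({**base, p1: v1, p2: v2})  # change both together
    return (y12 - y0) - ((y1 - y0) + (y2 - y0))

# Hypothetical metric in which depth and width interact multiplicatively.
area = lambda cfg: cfg["depth"] * cfg["width"]
print(interaction(area, {"depth": 2, "width": 8}, "depth", 4, "width", 16))  # 16
```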

Similarly, the work by Givargis and Vahid [25, 24] proposes finding all Pareto-optimal configurations of parameterized SoC architectures using pre-identified interdependencies among the design parameters, captured in parameter interdependency graphs. They simulate the design space of embedded SoC architectures over a parameterized SoC platform called Platune using a MIPS processor. Various components of their system are configurable, such as the size of the caches or the width of the buses. They also explore the voltage levels of the MIPS processor as a design parameter and explore the design space in terms of power along with performance metrics. Each of the design parameters is searched exhaustively and local Pareto points are identified. Using the pre-identified interdependencies among the design parameters, these local Pareto points are merged to generate the system-level Pareto curve. This is one of the earliest works targeting embedded system design space exploration, and it focuses on the design space of the target platforms rather than the application domain. In addition, it relies heavily on designer input to generate the interdependencies of the parameter space. Our aim in this thesis is to fully consider the applications themselves as part of the optimization process.

Instead of the previously mentioned exhaustive search approaches, several randomized search approaches have been proposed to find the Pareto curve of a given design space [52, 5, 61]. These works perform design space exploration inspired by genetic algorithms and iterate through the design space by evolving the design parameter permutations. Each parameter permutation can be mapped to a chromosome whose genes define the parameters of the system. The design space is explored via mutation and crossover operators, where mutation refers to the random modification of a parameter and crossover is the random exchange of parameters between two chromosomes, i.e., parameter permutations. Although these genetic algorithms can explore a design space with minimal designer input, they have very long run times.
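A minimal sketch of the evolutionary step these works describe, under the assumption of a small hypothetical parameter space: each configuration plays the role of a chromosome whose genes are the design parameters.

```python
import random

# Hypothetical parameter domains for an accelerator configuration.
DOMAINS = {"unroll": [1, 2, 4, 8], "bit_width": [8, 12, 16], "bram_banks": [1, 2, 4]}

def mutate(cfg):
    """Mutation: randomly modify one parameter (gene) of a configuration."""
    child = dict(cfg)
    gene = random.choice(list(DOMAINS))
    child[gene] = random.choice(DOMAINS[gene])
    return child

def crossover(a, b):
    """Crossover: randomly exchange parameters between two configurations."""
    return {gene: random.choice([a[gene], b[gene]]) for gene in DOMAINS}

parent1 = {"unroll": 4, "bit_width": 16, "bram_banks": 2}
parent2 = {"unroll": 1, "bit_width": 8, "bram_banks": 4}
print(crossover(mutate(parent1), parent2))  # one evolved candidate design
```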

The use of analytical models based on design parameters to evaluate design metrics for a large design space has been discussed in other literature [64, 20, 29, 37]. The works by Smith et al. and Das et al. present technology mapping models that relate architectural parameters to the speed of FPGAs, where both the architectural parameters and the metric defining the speed of the FPGA differ between the two works. The process of technology mapping is an FPGA-specific problem concerning the mapping of a given circuit netlist into lookup tables (LUTs), as illustrated in Figure 2.3. Smith et al. [64] present models that estimate the average post-placement, pre-routing wire length of an implementation using architectural parameters such as the number and positioning of logic blocks and pins, whereas Das et al. [20] present models that estimate the depth of a circuit using architectural parameters such as lookup-table size, cluster size, and number of inputs per cluster. Jiang et al. [29] use a least squares regression analysis to estimate the power and area consumption of specific computation units of an implementation, such as logical operators (e.g., AND/OR) and arithmetic operators (e.g., multiply/add). Input bit widths are used as the sole design parameters for their proposed area model, while average input transition density and average input spatial correlation are used to generate the power models. Analytical models can lead to very fast and accurate design space exploration; however, the selection of the parameters and the identification of the fitting model based on parameter interactions is a crucial step, since it determines the limits of the design space that can be explored. In this thesis work we will expand the number of parameters and design constraints that can be used in analytical models and explore the design space with consideration of parameter sensitivity.

Prior work has also been done on optimizing certain design metrics after co-exploration, especially for throughput or power. For instance, Irturk et al. [27] propose a tool that generates a variety of architectures specifically for matrix inversion and finds the optimum parameters for area and throughput constraints. The approach of Chen et al. [16] aims to minimize power dissipation for an FPGA implementation through careful allocation of functional units and registers.

Figure 2.3: Technology mapping: (a) An original netlist (b) is segmented into possible covering and (c) mapped into LUTs [42].

Other related work by Singh and Yajun optimizes the FPGA architecture for performance

and power by allowing the designer to specify various parameters for the routing architectures [63]. Tsoi and Luk [68] conduct power profiling and optimization for heterogeneous multi-core systems (CPUs, GPUs and FPGAs) using on-board power measurements. All of these works specifically target either architectural or algorithmic design parameters available in a system, yet they still need to explore a large set of the design space for each platform to be able to perform their optimization via interpolation of the measured data. In this thesis we expand the design space to include architectural, algorithmic and target-platform-level parameters while at the same time reducing the design space that needs to be sampled with the help of L1 regularization.

Compared to these previous techniques, our methodology for design space exploration is novel in multiple ways. We propose an approach for model generation using L1 regularization with traditional least squares regression. Using L1 regularization leads to more accurate models that are identified in an entirely automated way. We perform accelerator optimization by using the developed mathematical models directly in numerical multi-objective optimization formulations. We also combine both algorithmic and hardware design considerations in the exploration and optimization framework. We exploit the reconfigurability of FPGA platforms and tie it to mathematical analysis for a swift, accurate and, more importantly, optimizable accelerator implementation under various objectives and constraints (e.g., power, area, throughput, and arithmetic accuracy).
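A minimal sketch of this model-generation step, under stated assumptions: a handful of sampled design points with parameter-derived features and one measured metric, fit with L1-regularized least squares (scikit-learn's Lasso) so that coefficients of uninformative terms are driven to zero. The feature choices and numbers are illustrative, not measurements from our accelerators.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative training set: each row is a sampled design point described
# by features such as (pipeline depth, bit width, depth * width), and y is
# the metric measured for that design (e.g., power in mW).
X = np.array([[3, 8, 24], [3, 12, 36], [6, 8, 48], [6, 12, 72], [9, 12, 108]])
y = np.array([41.0, 55.0, 68.0, 96.0, 135.0])

# The L1 penalty shrinks coefficients of uninformative features to exactly
# zero, selecting the model terms automatically.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_, model.intercept_)  # sparse coefficient vector
print(model.predict([[9, 8, 72]]))    # estimate an unsampled design point
```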

2.2 Image Processing Applications

In order to establish a preliminary understanding of the image processing algorithms used in this thesis work, this section first presents an overview of these algorithms. Our implementations of the image deblurring and block matching algorithms will be discussed in detail in Chapter 3 to present our design space exploration methodology based on the algorithmic and architectural design parameters of these algorithms. Then, in Chapter 4, we will discuss our implementations of feature detection and description, evaluating their performance and power trade-offs for low-power embedded systems.

2.2.1 Image Deblurring

Image deblurring is performed by a filtering operation over a given image, which is one of the fundamental operations of image processing applications [9]. Our target accelerator is designed to be deployed within a real-life image processing system mounted on an unmanned aerial vehicle (UAV) for surveillance, as shown in Figure 2.4. As the UAV platform moves, the sensor tracks a point on the ground so that the center pixel Sc stays fixed on ground sample Gc. For pixels away from the center, the pose and position changes of the sensor mean that a periphery pixel S0 is composed of a number of ground samples G0 to Gn. The deblurring accelerator will be used to offset blur effects created during image capture, mainly caused by camera shake during aerial transportation. The real-life setting of the accelerator puts tight requirements on its throughput, power, area, and arithmetic accuracy, which motivated the need for our proposed modeling and multi-objective optimization methodology.

Figure 2.4: Unmanned air vehicle system to be used with our deblur accelerator.

Image blur can be modeled by the convolution of an unblurred image and a blur kernel:

$$I_b(x, y) = \sum_{dx,\,dy} I_0(x + dx,\; y + dy)\, H(dx, dy),$$

where $I_b(x, y)$ and $I_0(x, y)$ are the blurred and original pixel intensities at coordinate $(x, y)$, and $H(dx, dy)$ is the blur kernel value. The kernel is uniquely determined by the motion blur, which is a 2D vector. In our application, different parts of the image have different amounts and directions of blur; however, the blur in a given region can be treated as uniform. In addition, although the kernels of different pixels differ, locally they can be approximated as the same.

One of the most commonly used solutions to image blur is the iterative Landweber method [9, 41]:

$$I^{(0)} = I_b, \qquad I^{(n+1)} = I^{(n)} + \alpha H^T * (I_b - H * I^{(n)}),$$

where $(I_b - H * I^{(n)})$ is the residue denoting the error of the deblurred image and $\alpha$ is the step size. An example of applying the Landweber method to images captured by aerial photography can be seen in Figure 2.5, where Figure 2.5(a) shows the blur effects caused by the image capturing process and Figure 2.5(b) shows the deblurred image. The images are displayed at various zoom levels for ease of visualization. Due to the high throughput requirements of our application and the need for heavy parallelization, instead of applying an iterative blur kernel $H$ we use an estimated deblur kernel $K$ directly, such that

$$I = I_b * K.$$

Figure 2.5: (a) Example of a blurred image taken by aerial photography and (b) deblurred image using the Landweber algorithm.

This concept has been proven to work for small blurs, as presented in [79]. The generation of the deblur kernel has been done in collaboration with Object Video, and our accelerator uses pre-determined deblur kernels with sizes up to 13×7 or 7×13 and input (Ib) and output (I0) images with 12-bit pixels. Each kernel computation therefore requires 91 multiply-and-add operations.
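A minimal software model of this direct deblurring step, assuming a pre-determined 13×7 kernel as described above; scipy's convolve2d stands in for the accelerator's multiply-and-add datapath, and the kernel coefficients are placeholders rather than the kernels generated for our system.

```python
import numpy as np
from scipy.signal import convolve2d

def deblur(blurred, K):
    """Direct deblurring I = Ib * K: a single 2D convolution with a
    pre-determined deblur kernel instead of iterated Landweber updates."""
    out = convolve2d(blurred, K, mode="same", boundary="symm")
    return np.clip(out, 0, 4095)  # model the accelerator's 12-bit pixel range

# Placeholder 13x7 kernel: each output pixel then costs 13 * 7 = 91
# multiply-and-add operations, matching the count given above.
K = np.full((13, 7), 1.0 / 91.0)
blurred = np.random.randint(0, 4096, size=(480, 640)).astype(float)
deblurred = deblur(blurred, K)
```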

2.2.2 Block Matching

Block-matching is a sliding window operation performed over video sequences and is commonly used in video compression applications such as MPEG-4 and H.264 [55]. Block matching partitions a given frame into non-overlapping N × N rectangular blocks and tries to find the block from the reference frame, within a given search range, that best matches the current block.

Figure 2.6: The search patterns for (a) Three Step Search, (b) Diamond Search, and (c) Hexagonal Search [59].

Among the block matching algorithms, the full search algorithm finds the reference block that best matches the current block among all possible locations by exhaustively comparing each candidate block. As a result, full search achieves the best performance among block matching algorithms at the cost of having the highest computational complexity. Various fast search algorithms have been developed for block matching that reduce the number of reference block comparisons instead of performing an exhaustive search. Some such algorithms use window strides larger than 1 and switch to a stride of 1 only at their final comparison step [40, 78, 77, 7]. Three Step Search [40] initially evaluates the reference frame exhaustively using block strides of 4. In the next step, the best matching block on average is re-evaluated using block strides of 2, and the final step re-evaluates the best matching block using strides of 1. The Diamond Search [78] and Hexagon-Based Search [77] algorithms both search over a reference frame using strides of 2; however, instead of using exhaustive search they search over a diamond or hexagonal pattern. This allows the algorithms to move adaptively towards a region of best potential blocks; however, they might get stuck in local minima during the search. The search patterns for these algorithms are illustrated in Figure 2.6. Despite the potential savings in computation time, full search has remained a popular candidate for hardware acceleration because of its regular dataflow and good compression performance [59].

Figure 2.7: Computation of motion vectors of a given image block in a reference frame using the Block Matching algorithm [55].

For our design, we perform full-search block-matching over a search window in a reference frame to determine the best match for a block in the current frame. As shown in Figure 2.7, the location of a block in a frame is given by the (x, y) coordinates of the top-left corner of the block. The search window in the reference frame is the [−p, p] region around the location of the current block in the current frame.

The most commonly used matching criteria are the mean square error (MSE), the sum of square error (SSE) and the sum of absolute differences (SAD). The SAD approach provides a fairly good match at a lower computational requirement due to the lack of a multiplier, and because of this SAD is the most commonly used criterion for block matching [48]. The SAD value for a current block in the current frame and a candidate block in the reference frame is calculated by accumulating the absolute differences of corresponding pixels in the two blocks:

$$\mathrm{SAD}_{B_{m \times n}}(d) = \sum_{x=1}^{m} \sum_{y=1}^{n} \left| c(x, y) - r(x + d_x,\, y + d_y) \right|, \tag{2.1}$$

where $B_{m \times n}$ is a block of size $m \times n$, $d = (d_x, d_y)$ is the motion vector, and $c$ and $r$ are the current and reference frames, respectively. SAD is an extremely fast metric due to its simplicity. It is very effective for a wide motion search over many different blocks. SAD is also easily parallelizable since it analyzes each pixel separately, making it

easily implementable with hardware and software coders [34].

Figure 2.8: Application of the Block Matching algorithm over two consecutive frames and the resulting motion vectors.

Since a motion vector expresses the relative motion of the current block in the reference frame, motion vectors are specified in relative coordinates. If the location of the best matching block in the reference frame is (x + u, y + v), then the motion vector is expressed as (u, v). Motion estimation is performed on the luminance (Y) component of a YUV image, and the resulting motion vectors are also used for the chrominance (U and V) components. An example of block matching, showing two consecutive frames of a video sequence as the current and reference frames and the resulting motion vector for each image block, is given in Figure 2.8.
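A minimal software reference for Equation 2.1 and the full-search procedure, assuming single-channel luminance frames stored as numpy arrays; the accelerator evaluates the same arithmetic with parallel processing elements.

```python
import numpy as np

def full_search(cur, ref, x, y, N=16, p=16):
    """Full-search block matching: return the motion vector (u, v) that
    minimizes the SAD of Eq. 2.1 between the N x N current block at (x, y)
    and every candidate block in the [-p, p] search window."""
    block = cur[y:y + N, x:x + N].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for v in range(-p, p + 1):
        for u in range(-p, p + 1):
            yy, xx = y + v, x + u
            if yy < 0 or xx < 0 or yy + N > ref.shape[0] or xx + N > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[yy:yy + N, xx:xx + N].astype(np.int32)
            sad = int(np.abs(block - cand).sum())  # Eq. 2.1
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (u, v)
    return best_mv
```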

For a typical block size of 16 × 16 and a reference window size ±p = 16, the full-search block matching algorithm requires 16 × 16 = 256 absolute difference operations per block comparison and a total of 256 block comparisons. Given that an adder tree is used to compute the SAD from the generated absolute differences, (256 × 2 + 255) × 256 = 196,352 adder/subtractor operations are required per block search. Therefore a wide-VGA resolution (480 × 800) image requires approximately 300 million adder/subtractor operations to be processed.
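Spelling that count out from the numbers above:

$$\frac{480 \times 800}{16 \times 16} = 1500 \text{ blocks}, \qquad 1500 \times 196{,}352 \approx 2.9 \times 10^{8} \text{ adder/subtractor operations per frame.}$$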


2.2.3 Feature Detection

Feature detection is a low-level processing operation for identifying pixels of interest in an image which correspond to some elements of a scene that can be reliably located in different views of the same scene. Corners and edges are typical examples of features.

Previous work on feature detection algorithms primarily involve studies which

attempt to detect the highest number of valid features with the least amount of

computational effort. The Scale-Invariant Feature Transform (SIFT) algorithm [43]

is the most prominent algorithm used for feature detection and has been used as

a baseline for most feature detection algorithms since it was first published over

10 years ago. As a more efficient alternative to SIFT, both in terms of speed and

computational complexity, the Features from Accelerated Segment Test (FAST) al-

gorithm was proposed [57]. FAST uses corner-based feature detection, as opposed

to the Difference of Gaussian (DoG) approach used by SIFT and its faster successor,

the Speeded Up Robust Features (SURF) [8] algorithm. It should be noted that the

use of corner based features had been previously proposed in other widely-accepted

algorithms, such as Harris Corner detection and Smallest Univalue Segment Assim-

ilating Nucleus (SUSAN) corner detection [65]. However, FAST was the first corner-

based feature detector to greatly reduce the computation requirements of feature de-

tectors, achieving a speedup of 169× over SIFT and 89× over SURF algorithms [44].

Analysis done by Canclini et al. [13] on low complexity feature detectors demon-

strates definitively the strength of corner-based feature detectors over DoG-based detectors.

Corner-based feature detectors are derived from the idea of finding rapid changes

in direction on image edges to determine a unique image region of interest.

Figure 2.9: The Bresenham circle is used to determine if interest point p is a corner feature. Figures taken from [57].

In order to identify whether a pixel p with an intensity value Ip is a corner, the FAST detector

analyzes a 16-pixel Bresenham circle surrounding p. The Bresenham circle is an approximation of a circle around the center pixel, as shown in Figure 2.9. A positive detection occurs if n points of this circle form a contiguous segment that is either darker or brighter than the center pixel by more than a pre-defined threshold T.

Algorithm 1 shows the pseudo code for FAST.

Algorithm 1: FAST Feature Detection
1  For each pixel p in an image, assume the intensity of the pixel to be Ip;
2  Define a threshold intensity value T;
3  Define a 16-pixel Bresenham circle of radius 3 centered around p, where each pixel corresponds to I1, I2, ..., I16;
4  for each contiguous arc of n = 12 circle pixels Ii, ..., Ii+11 do
5      if all arc pixels satisfy I + T < Ip, or all arc pixels satisfy I − T > Ip then
6          p is a corner;
7      end
8  end
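A straightforward software rendering of this segment test is sketched below. The circle offsets are the standard radius-3 Bresenham ring; the function name and the wrap-around handling are illustrative:

    # Offsets (dx, dy) of the 16 pixels on a radius-3 Bresenham circle, clockwise.
    OFFSETS = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
               (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

    def is_fast_corner(img, x, y, T, n=12):
        # Segment test: p = (x, y) is a corner if some arc of n contiguous circle
        # pixels is entirely brighter than Ip + T or entirely darker than Ip - T.
        Ip = int(img[y, x])
        ring = [int(img[y + dy, x + dx]) for dx, dy in OFFSETS]
        ring += ring[:n - 1]  # replicate the start so arcs may wrap past pixel 16
        for start in range(16):
            arc = ring[start:start + n]
            if all(v > Ip + T for v in arc) or all(v < Ip - T for v in arc):
                return True
        return False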

2.2.4 Feature Description

Feature extraction involves computing a unique and identifying descriptor from the

pixels in the region around each point of interest. Descriptors are used to uniquely

identify each feature and match regions of features between two or more images.

The SIFT and SURF algorithms also generate descriptors for the features they


detect. These descriptors are represented by Histograms of Gradients (HoG). As more efficient alternatives, binary feature descriptor algorithms such as BRIEF [12], BRISK [39] and FREAK [3] have been proposed to further improve computational efficiency. Binary descriptors significantly decrease the computation requirements of feature matching by enabling the use of the Hamming distance to measure the similarity of any two descriptors: because the Hamming distance equals the population count of the XOR of the two binary strings, matching between two patch descriptions can be done with a single XOR and population-count instruction. As a result, BRIEF feature descriptors can be computed 118× faster than SIFT and 31× faster than SURF descriptors [44].
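For descriptors held as plain integers, this matching step reduces to a one-liner (a sketch; real implementations typically XOR packed machine words and use a hardware population count):

    def hamming(d1, d2):
        # Hamming distance of two N-bit descriptors: popcount of their XOR.
        return bin(d1 ^ d2).count("1")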

Binary descriptors are composed of 3 main components: a sampling pattern, orientation compensation, and sampling pairs. A region centered around a detected corner needs to be described as a binary string. Given a sampling pattern, pick N pairs of points on the pattern, determine whether the intensity at the first element of each pair is greater than that at the second, and encode the pair as binary 1 or 0 accordingly. The resulting N-bit vector is the feature descriptor for that point, to be

used for feature matching.

BRIEF is the first binary descriptor published. It has a simple pattern and does

not offer orientation invariance for the detected features. The pseudo-code for the


Algorithm 2: BRIEF Feature Description
1  For each interest point p in an image, define a region of interest S × S centered around p;
2  Apply a Gaussian smoothing filter over the region to reduce the camera noise;
3  Use any of the sampling patterns given in Figure 2.10 to generate a pair of arrays Xi and Yi, where i ∈ {1, ..., N};
4  for each i do
5      if I(Xi) > I(Yi) then
6          Di = 1;
7      else
8          Di = 0;
9      end
10 end

algorithm is given in Algorithm 2. The descriptors are computed using N neighboring

pixel pairs around the given feature location, denoted as Xi and Yi. The resulting N-bit descriptor vector is computed by 1-to-1 comparisons of Xi and Yi and denoted

as Di. Due to the use of raw pixel intensities for each pixel, a smoothing filter needs

to be applied to the image as a pre-processing step. The BRIEF algorithm presents

5 different methods to select the vectors X and Y as visualized in Figure 2.10 and

described as follows:

I Xi and Yi are randomly and uniformly sampled around a pre-defined region

S × S centered around interest point p.

II Xi and Yi are randomly sampled using a Gaussian distribution of distances

to interest point p, such that points close to the center are more likely to be

selected as a sample pair.

III Xi is first randomly sampled using a Gaussian distribution of distances to

interest point p, then Yi is randomly sampled using a Gaussian distribution of

distances to the pairing on Xi.

IV Xi and Yi are randomly sampled from discrete locations of a coarse polar grid.


V For each i, Xi is (0, 0) and Yi takes all possible values on a coarse polar grid.

Figure 2.10: Various sampling patterns used for BRIEF descriptor. Figures taken from [12].

Once the N-bit descriptor vectors are identified, the number of differing bits between any two feature vectors is used as the distance between the features.
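A compact sketch of the descriptor computation is given below, assuming sampling method II with an illustrative patch size S = 31 and N = 256 pairs; the image is assumed to be pre-smoothed, as Algorithm 2 requires:

    import numpy as np

    S, N = 31, 256  # assumed patch size and descriptor length
    rng = np.random.default_rng(0)
    # Method II: pair coordinates drawn once from an isotropic Gaussian (sigma = S/5),
    # clipped to the patch; the same fixed pattern is reused for every interest point.
    PAIRS = np.clip(rng.normal(0.0, S / 5.0, size=(N, 4)), -(S // 2), S // 2).astype(int)

    def brief_descriptor(smoothed, y, x):
        # N-bit BRIEF string for the patch centered at (y, x): bit i is 1 iff the
        # intensity at Xi exceeds the intensity at Yi (cf. Algorithm 2).
        bits = 0
        for i, (xa, ya, xb, yb) in enumerate(PAIRS):
            if smoothed[y + ya, x + xa] > smoothed[y + yb, x + xb]:
                bits |= 1 << i
        return bits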

BRISK differs from BRIEF by the custom sampling pattern applied to compute the binary descriptor vector. Each sampling point lies on one of several concentric rings,

as shown in Figure 2.11, where red circles represent the standard deviation of the

Gaussian smoothing kernel applied over the interest points.

Unlike BRIEF, BRISK is an orientation invariant feature descriptor, meaning

that it estimates the orientation of the interest point from the selected sampling

pairs and rotates the sampling pattern to neutralize the effect of rotation. This is

done by distinguishing the sampling pairs as short pairs and long pairs, where long

pairs are used to determine orientation and short pairs are used for the intensity

comparisons that build the descriptor, as in the BRIEF algorithm. Short pairs are pairs of sampling points whose distance is below a certain threshold, and long pairs are pairs of sampling points whose distance is above a different

threshold.


Figure 2.11: The sampling pattern proposed for BRISK with N = 60 points. The blue circles correspond to the points of interest detected by the feature detector algorithm and the surrounding red circles represent the standard deviation of the Gaussian smoothing kernel applied over the interest points. Figure taken from [39].

The orientation of the sampling pattern is estimated by summing all the local gradients of the long pairs and computing the ratio of the y component of the summed gradient over its x component. Short pairs are used, as in BRIEF, to generate the N-bit descriptor vector, and the distance between two descriptor vectors is computed by means of the XOR operation between them.
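The sketch below mirrors that orientation estimate; normalizing each local gradient by the squared pair distance follows the BRISK formulation, while the function boundaries and the long-pair list are illustrative:

    import math

    def brisk_orientation(smoothed, long_pairs):
        # Average the local gradients g = (I(p2) - I(p1)) * (p2 - p1) / ||p2 - p1||^2
        # over all long pairs, then take the angle of the summed gradient, i.e. the
        # ratio of its y component over its x component.
        gx = gy = 0.0
        for (x1, y1), (x2, y2) in long_pairs:
            d2 = float((x2 - x1) ** 2 + (y2 - y1) ** 2)
            g = (float(smoothed[y2, x2]) - float(smoothed[y1, x1])) / d2
            gx += g * (x2 - x1)
            gy += g * (y2 - y1)
        return math.atan2(gy, gx)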

Similar to BRISK, the FREAK algorithm also uses a custom sampling pattern

that is based on the model of a human retina. As shown in Figure 2.12, the suggested

sampling pattern corresponds to the distribution of receptive regions over the retina.

This results in a higher density of points near the center (feature coordinate), with

exponentially decreasing density as one moves away from the center.

Unlike the other two descriptors, FREAK sampling pairs are learned by maxi-

mizing the variance of the pairs and selecting uncorrelated pairs. This process results

in 4 sampling patterns of 128 pairs each. All 512 sampling pairs need to be evalu-

ated in order to generate the descriptor; however, descriptor matching for FREAK


Figure 2.12: Illustration of (a) FREAK sampling patterns and (b) the human retina. The receptive cells in the retina are clustered into four areas with different densities, which is replicated in the FREAK sampling pattern. In (a), each circle represents an image block that requires smoothing with its corresponding Gaussian kernel. Figure taken from [3].

descriptors can be applied over each group of pairs sequentially, and if the accumulated distance already exceeds a given threshold, the subsequent groups do not need to be further evaluated.
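A sketch of this cascaded matching is shown below; holding the 512-bit descriptor as one Python integer and using a single rejection threshold for all four groups are simplifying assumptions:

    def freak_match(d1, d2, threshold):
        # Accumulate the Hamming distance over the four 128-pair groups in order,
        # rejecting as soon as it exceeds the threshold so that the remaining
        # groups never need to be evaluated.
        dist = 0
        for g in range(4):
            group_mask = ((1 << 128) - 1) << (128 * g)
            dist += bin((d1 ^ d2) & group_mask).count("1")
            if dist > threshold:
                return None  # early reject
        return dist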

The orientation computation of FREAK descriptors is very similar to BRISK.

The only difference is that FREAK uses a pre-defined set of orientation pairs instead

of long pairs, as done for BRISK.

2.2.5 Hardware Acceleration of Image Processing Kernels

As mentioned earlier in Chapter 1, multimedia applications running on portable

devices have gained huge popularity in the 21st century. This increased demand also

comes with higher expectations in application accessibility and quality. Therefore

hardware acceleration is an important tool for computer architects to create high-

throughput computer vision systems that can sustain high image resolutions in real-

time. Thankfully, computer vision kernels are highly data parallel and map well

to streaming architectures. In this section we will discuss some of the hardware

accelerated computer vision systems proposed in the literature.


Different computational architectures present different opportunities to accelerate

feature detection and descriptor generation. This is particularly true of feature

detectors where every pixel is examined in order to determine if it meets the criteria

of a feature. These detectors tend to have few data dependencies, limited branching,

and few control dependencies. As noted by Che et al. [15], kernels with these

characteristics are strong candidates for GPU acceleration.

Feature descriptor generation, however, requires computation across non-contiguous

image sub-regions. This results in more irregular memory access patterns compared

to feature detectors. In addition, feature description algorithms tend to have higher

complexity and require pre-computation with less data parallelism, which can nevertheless benefit from design techniques such as pipelining. This class of algorithms is a

stronger candidate for acceleration via FPGAs as compared to GPUs.

There has been a significant amount of research into hardware acceleration of

feature detection algorithms. For instance, Bouris et al. [10] and Svab et al. [67]

both presented FPGA implementations for the SURF algorithm. Both works em-

phasize the power efficiency gained over GPU implementations; however, they fall

short in terms of comparing FPGA run times to GPU versions [17]. Indeed, GPU

implementations in the literature, such as [73] and [53], show the performance poten-

tial of feature detection working on high-end GPUs. Hongtao et al. [73] presents

a Harris-Hessian corner detector optimized for CUDA GPUs that can reach up to

20× faster computation speeds than CPUs. They also compare their implementa-

tion with SURF and report 1.25× speed-up. Phull et al. [53] also proposes a novel

low cost corner based detector algorithm optimized for CUDA GPUs that can run

14× faster than CPU implementations of the algorithms and 2× faster than Harris

corner detector implementations for GPUs. However, these works do not discuss the

power and energy repercussions of using such high-end GPUs, which dissipate power


in the range of 200W, and the challenges of running these algorithms on tightly

constrained embedded platforms. We emphasize that for embedded platforms, both

runtime performance and power need to be optimized.

On the other hand, a fully customized ASIC implementation of SURF on a 28nm

CMOS process was recently presented [28], where a very low power dissipation of 2.8mW

was reported. Despite the high computation requirement of the SURF algorithm

due to the use of the DoG approach, this work demonstrates the gains that can be

achieved by hardware-oriented design optimizations.

There exist other publications in the literature that suggest the use of FPGAs for feature detection applications. Schaeferling and Kiefer [60] present a SoC fea-

ture detection system that uses a Xilinx Virtex 5 FPGA to accelerate the SURF feature

detector algorithm. They propose a Flex-SURF+ customizable accelerator IP that

can be easily configured for algorithmic parameter changes in their design. Wal et

al. [70] presents a distributed feature detector algorithm implementation for low-cost

Zynq FPGAs. The proposed distributed feature detector first finds all features with

strength above a certain threshold, and then divides the image into multiple small

tiles. Then the best features from each tile are returned irrespective of the relative

strength of features between the tiles. Their approach does not return the strongest

features over an image, but a well-distributed set of features. In addition, an FPGA

implementation for FAST feature detection was proposed by Kraft et al. [33], where

a baseline FAST implementation is presented. In the work by Rublee et al. [58],

the FAST feature detector is extended by adding the rotational BRIEF feature de-

scriptor to form the Oriented FAST Rotated BRIEF (ORB) feature detection and

description algorithm. Lee et al. [38] and Kulkarni et al. [35] each presented com-

plete hardware implementations for ORB, which emphasized run-time performance

gains over their hardware accelerated SIFT and SURF counterparts. The work done


by Fularz et al. in [23] goes a bit further and implements a hardware accelerated

architecture for real-time image feature detection (FAST), description (BRIEF) and

matching (Hamming distance) on an FPGA. Their performance measurements, how-

ever, are limited to runtime in terms of frames per second, and resource utilization

(LUTs, FFs, and BRAMs) in one embedded environment. In one of the more recent

publications on hardware accelerated feature detection, Chang et al. [14] presents an

implementation of Harris corner detection on an IBM POWER8 system integrated

with Altera FPGAs. While all of these works offer promising results, they all ignore

the power aspect of the presented systems and do not provide any analysis of power

dissipation.

As observed from these works in the literature, acceleration of image processing algorithms is an emerging topic of interest with a wide range of application domains and varying design constraints. In order to create constraint-optimal accelerators,

a very large design space needs to be considered. In the following chapters, we will

first present configurable accelerator designs for image processing that can be used

to explore this design space and come up with mathematical formulations to speed

up the design space exploration. Then we present multiple implementations using

the know-how we have established for efficient acceleration design and explore the

design space of various embedded platforms for such algorithms.


Chapter 3

Performance, Power and Accuracy Trade-offs in FPGA-based Accelerators

Chapters 1 and 2 discussed the main motivations for design space exploration of

hardware accelerated systems for real-time image processing applications. We have

presented the current trends in the literature for design space exploration and also defined

the image processing algorithms we will be using as test cases for acceleration in this

study. Before looking into various embedded systems as a design space, in this

chapter we discuss our regression based technique for fast design space exploration

and multi-objective optimization for FPGA-based hardware accelerators.

FPGA-based accelerators are becoming widely used in real-time image process-

ing, with applications in scientific research, smart camera technologies and the automotive industry [31], among others. FPGAs are ideal as inexpensive prototyping platforms


to implement high-throughput solutions. Their reconfigurability allows for iterative

refinement and validation of a design implementation until desired goals are achieved.

With programmable logic elements, registers, lookup tables (LUTs), Block RAMs

(BRAMs), DSP blocks, and digital clock managers, FPGAs, by themselves or as

parts of a heterogeneous system, have the capability of parallelizing algorithms on

various hardware modules, making them well suited to implementing tasks that require

high throughput.

Many of these high-performance platforms are also used in highly resource con-

strained environments where reduced power consumption becomes imperative. As

such, care must be taken to increase parallelism while at the same time minimiz-

ing energy consumption. Indeed, simply adding more hardware resources (whether

through FPGA logic or other computing fabrics) to solve the throughput problem

will not necessarily lead to a feasible solution for power/energy constrained systems.

Fortunately, we have observed that FPGA-based accelerators, especially those that

can be used for image processing, offer many algorithmic and hardware design pa-

rameters, which, when properly chosen, can lead to outcomes with the desired design

metrics of throughput, power, design area and arithmetic accuracy.

While having flexibility in both algorithmic and hardware design parameters

will increase the possibility of creating hardware accelerators that meet all design

constraints, it raises the question of how one should go about discovering an optimal

design among all possible design choices. Indeed, even the choice of relatively few

parameters can lead to hundreds or even thousands of designs. Therefore, effective

design space exploration techniques are critical for efficiently navigating the large

design space. To speed up design exploration, we propose to sample the large design

space and then use regression models and statistical inference from the samples to

create mathematical models that estimate the target metrics over the entire design


space.

Figure 3.1: Illustration of the idea of using regression based modeling for design space exploration and finding important designs based on objectives and constraints. Each star on the graph on the right represents a design variant and the dashed line represents the Pareto frontier. Designs shown in dashed yellow boxes represent optimal designs given by the optimization framework while the ones in blue represent the training set.

Our approach aims to identify both algorithmic and hardware parameters that

optimize hardware accelerators. This information is used to run regression analysis

and train mathematical models within a non-linear optimization framework in order

to identify the optimal algorithm and design parameters under various objectives and

constraints. To automate and improve the model generation process, we propose the

use of L1-regularized least squares regression techniques. We implement two real-

time image processing accelerators as test cases: one for image deblurring and one

for block matching.

This work has been done in collaboration with my colleague Kumud Nepal. Im-


plementation of parameterized hardware accelerators and selection of algorithmic and

hardware parameters that optimize our accelerators will be the main focus of this

thesis. We will then discuss how these parameters were applied for our L1-regularized

design space exploration. More details on the modeling and multi-objective optimiza-

tion methodologies can be found in our previously published work [69] and Kumud

Nepal’s PhD Thesis [49].

A simplified illustration of our problem statement and proposed solution is pre-

sented in Figure 3.1, where we have a system with three design parameters: x, y

and z. The total design space for this system consists of every combination of these

three design parameters. Each design dissipates power differently and has a certain

arithmetic accuracy. Assume we are interested in figuring out which design gives

us the best trade-off between power and accuracy, i.e., we want to achieve a system

that dissipates as little power as possible while its accuracy is still maintained at an

acceptable level. To find the optimal design variants, the results from the predicted

accuracy and power metrics from regression modeling are fed into an optimization

framework. The optimization framework presents a subset of those variants that

create a Pareto frontier (dashed green line on the graph), where the frontier points

do not dominate each other in both accuracy and power, but dominate other non-

frontier points. The Pareto frontier points represent the optimal trade-off between

arithmetic accuracy and power. It is up to the designer to pick from these opti-

mal designs (marked as dashed orange boxes in the design space) depending on the

allowed accuracy and/or power budget.
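As a small illustration of the frontier extraction itself (not of the regression framework), the sketch below filters a set of hypothetical (power, inaccuracy) points down to their non-dominated subset:

    def pareto_frontier(designs):
        # designs: iterable of (power, inaccuracy) tuples, both to be minimized.
        # After sorting by power, a design is on the frontier only if its
        # inaccuracy is strictly lower than that of every design kept so far.
        frontier = []
        for power, error in sorted(designs):
            if not frontier or error < frontier[-1][1]:
                frontier.append((power, error))
        return frontier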


3.1 Modeling and Optimization Methodology

A hardware accelerator design implementation of an algorithm can have tens of al-

gorithmic and physical design parameters, with potentially a large range of possible

values for each parameter. The combinations of these parameters create a design

space that grows exponentially as a function of the number of parameters. As a

result, explicit enumeration of every possible design choice is impossible, as it entails

creating, compiling, and programming the design for every combination choice in the

register-transfer level (RTL) flow. Nevertheless, designers are interested in exploring

this design space in order to identify the optimal values for these design parame-

ters that meet the target metrics such as throughput, power, area, and arithmetic

accuracy.

To evaluate the design metrics for any combination of parameters, the models

are queried with the values of the parameters. Following Design of Experiments [46]

techniques, it is important to sample a small portion of the design space, but in a uniform way, to

capture the essential features of the design. We do this by selecting design combi-

nations randomly from within the design space. We incorporate possible minimum

and maximum configurations of each of the parameters in our training samples so

that we consider the full range of the design space. In this way, the models out-

put predictions that span the range of possible designs such that our optimization

framework can identify the configurations that lead to optimal designs.

These sample combinations are implemented in the design and the resultant

metrics are characterized from real measurements (e.g., the deblur accelerator) and/or

from synthesis tool results (e.g., block matching accelerator). The characterized

results are then used as a training set to generate the scalable models.
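To make the flow concrete, the sketch below mimics this train-then-query procedure with scikit-learn's Lasso (an L1-regularized least squares solver) on a tiny, entirely invented data set; the parameter columns, metric values, and the fixed alpha standing in for the λ sweep described later are all illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import PolynomialFeatures

    # Hypothetical training set: rows are sampled design variants, columns are
    # parameters (kernel bit-width, kernel size, DSP pipeline depth, time-mux
    # factor); y holds the measured metric, e.g. board power in watts.
    X = np.array([[8, 15, 3.3, 1], [18, 91, 11.5, 4], [14, 35, 6.3, 2],
                  [10, 63, 5.3, 1], [16, 49, 7.0, 2]], dtype=float)
    y = np.array([0.85, 1.60, 1.10, 0.95, 1.25])

    # Expand to quadratic and pair-wise interaction terms, then fit an
    # L1-regularized least-squares model; L1 drives irrelevant coefficients to zero.
    poly = PolynomialFeatures(degree=2, include_bias=False)
    model = Lasso(alpha=0.1, max_iter=100000).fit(poly.fit_transform(X), y)
    estimate = model.predict(poly.transform([[12.0, 65.0, 6.3, 2.0]]))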


Once the best model representing the design objective is obtained, we are able

to estimate each of our design metrics (e.g., power, area, arithmetic accuracy) for

a given set of design parameters. These models can also be used with non-linear

optimization formulations with an objective and under certain model constraints,

giving designers the ability to target their design with a focus on a desired design

variable. Such multi-objective optimization problems can be solved using standard

non-linear optimizing techniques, as presented in the work by Byrd et al. [11].

To demonstrate the effectiveness of our methodology, we focus on algorithms

relevant for image processing as test cases due to their high suitability and adoption

for acceleration in FPGAs.

3.2 Image Processing Applications

To use as test cases for our methodology, we have implemented two image processing

applications for FPGAs: image deblurring and block matching. We have identified

several design parameters, both algorithmic and architectural, and created parame-

terized architectures for each system. These designs were realized on and optimized

for the Xilinx Virtex 6 FPGA, which impacted our design options, especially for

the architecture parameters such as the frequency limitations of the dedicated DSP

blocks.

Image deblurring and block matching algorithms use sliding window operations

at their core. For both of our implementations, we have operated over still images preloaded into BRAMs. The image pixel intensity data is transferred to the image

deblurring and block matching designs row by row as a stream.


Figure 3.2: Top-level block diagram for the deblur architecture: 12-bit pixel and 18-bit kernel input streams feed a 12-line line buffer and a 13x7 kernel buffer, and a 13x13 pixel array supplies the processing elements and adder tree under the control and address-generation logic.

3.2.1 Image Deblurring

Images are produced to store and display useful information. However, captured images almost always represent flawed replicas of the original scene due to the imperfect nature of the capturing process. Image blur is one of the major degradations caused during this process, and therefore image deblurring is an essential component of many image processing applications. Image deblurring is performed by a filtering

operation over the image. The accelerator is deployed within a system requiring

real-time image processing (for instance, mounted on an unmanned aerial vehicle

(UAV) used for surveillance). The real-time processing requirements of the accel-

erator in this environment have put tight constraints on its throughput, power, and

area; especially since the accuracy of the image deblurring algorithm itself must be

kept within acceptable margins. The need to meet all the requirements motivated

our proposed modeling and multi-objective optimization methodology.

Our implementation was targeted for an ultra high-throughput application where


each frame is divided into 368 sections of smaller images with pixel resolution

2592×1944. With a required 10 frames per second rate, 18.5 GigaPixels per second

must be processed. Each input pixel is 12 bits. The kernel size can vary throughout

the computation of a single frame from 3×3 to 13×7 and each kernel data is 18 bits.

This system was targeted to be run on a highly parallelized platform with 20 FPGAs

where we implement the prototype system running on each FPGA. In order to ac-

commodate the required processing constraints, we have designed our accelerator to process 8 pixels per cycle at a frequency of 125 MHz, providing a processing rate of 1 GigaPixel per second, which can reach the target constraint with a theoretical 20 GigaPixel per second throughput running over 20 FPGAs.

The high-throughput requirement of our application required us to create a highly

parallelized architecture, which was enabled by the sliding window structure of the image deblurring algorithm. We have used the dedicated DSP blocks on the

FPGA to perform the multiply-add operations required for each kernel multiplica-

tion.

The block diagram of the deblurring hardware is given in Figure 3.2. The image

pixels and kernel values stored on BRAMs of a Xilinx Virtex 6 FPGA board are fed

into the deblur system as video streams to be stored in line buffers. The image values

and the kernel values are then processed in the processing elements (PEs) and the

output is stored in yet another set of BRAMs. Dedicated DSP units on the FPGA

are used as PEs. The kernel control module reads in the kernel values and updates

the kernel buffer accordingly, and the control module reads in the video stream data

and feeds the data samples into the pixel array.

The deblurring algorithm uses masks that have a maximum size of 13×7 which

requires being able to access 13 rows of data at a given time. The data_in input


provides data from a row of the video frame; however, the previous 12 rows need to

be stored as line buffers. The oldest line buffer is updated with the incoming

row information at the same time. This architecture deblurs 8 pixels/cycle running

at 125 MHz on our FPGA board.

The block diagram for a single row of our pixel array is given in Figure 3.3. Each

BRAM is connected to a 12-to-1 multiplexer that feeds the incoming pixel stream

to a single line of the register array. The multiplexers are controlled by the modulo

12 row counter to accommodate the changing tag of each BRAM as new row data is acquired. The boundary conditions in our system are handled using the border replication method [54], which requires extending the image

boundaries with a copy of the closest pixel intensity. Extension of boundaries on the

y-axis of an image is performed by the use of boundary select and boundary address

signals shown in Figure 3.3. The boundary address always corresponds to the BRAM

holding either the first or last row of an image and is selected when the current deblur

mask goes over the image boundaries on the y-axis. Replication of boundaries on

the x-axis of the image is handled by broadcasting the first pixel of each image row

to all registers during computation. To be able to sustain constant 8-pixel throughput in our register array, we utilize 2 delay registers and 6 temporary registers along with 20 (13 + 7) registers directly connected to PEs. The delay and temporary

registers are used to make sure 2 cycles of 8 pixel inputs provide enough data to fill

up our PEs for computation at the beginning of a row of pixels.

Our image deblurring accelerator has a number of algorithmic and hardware

design parameters, the values of which determine its final metrics (e.g., power, de-

sign area, and arithmetic accuracy). Parameters that are chosen by the designer

are expected to have an impact on the constraint metrics and thus their selection


Figure 3.3: Architecture of a single row of the pixel array. A single row of the register array is shown; 13 rows are used in total, arranged as a non-uniform 13x13 register array that allows easy transition between 13x7 and 7x13 kernels. Apart from 6 temporary registers used to handle boundary conditions, 16 (13+3) or 20 (13+7) registers feed the processing elements for 4- and 8-pixel throughput, respectively.

requires an understanding of the inherent nature of the algorithm and the design

(e.g., parameters that affect the number of critical resources in the hardware or the

accuracy of the algorithm in the software). However, the designer does not need to

understand exactly how these parameters may affect the design constraints. Our

goal is to simplify the parameter selection process by enhancing the least squares

based modeling methodology with an L1 regularization process. L1 regularization, as we have previously mentioned, suppresses irrelevant parameters and interaction terms between them, selecting only those that have an impact on the constraint metrics, thereby highlighting design choices that may not have been obvious to the designer. The parameters selected for the image deblurring test case are as follows:

Kernel Bit-Width (algorithm parameter): The fixed point kernel bit-width is

explored in the range of 8 to 18 bits as a design choice. Different bit-width selections

do not have any effect on the area and throughput of the design due to the fixed

width allocated for DSP inputs; however, both the power and accuracy of the design vary with different bit-widths.

Kernel Size (algorithm parameter): The kernel size used in the design can be

dynamically changed to be of any size up to 13×7. Any kernels smaller than 13×7

will be padded with zeros so that the kernel input for corresponding DSP blocks is

equal to zero and any switching activity due to changing pixel values does not propagate through the processing elements. Thus both the accuracy and power of our design vary with changing kernel size, but area remains unaffected.

Figure 3.4: Comparison of DSP pipeline depths of (a) 6 and (b) 3.

DSP Pipeline Depth (design parameter): The architecture needs to account

for the maximum possible kernel size of 13×7; therefore, a total of 13 × 7 = 91

multiplications need to be performed. However, this processing element array can be implemented using DSP pipelines of varying depths. As the depth of each DSP pipeline decreases, the number of required DSP groups increases to perform the same

number of computations. The smaller pipeline depths require fewer delay registers

for synchronization but use extra DSP slices for the addition of computed partial

sums, as illustrated in Figure 3.4. We use the average DSP pipeline depth as a design

variable.

Time-Division Multiplexing (design parameter): The provision that DSP

blocks are allowed to run at potentially faster clock frequencies than the rest of the

FPGA system enables time-division multiplexing of image data to be employed [74].

Figure 3.5: Time-division multiplexing for a factor of 2.

For a time-multiplexing factor of n, n sets of pixels need to be available every system

cycle and the DSP blocks process n sets sequentially at an increased frequency. The

kernel inputs to the DSP blocks are constant and do not need to change in order to

compute different sets of image masks. Figure 3.5 illustrates the DSP usage for the

case when the time-division multiplexing factor is 2. This design choice can either

lead to an increase in system throughput by a factor of 2 or a decrease in the number

of DSPs used for computation by half. For our work, we prefer decreasing the number of DSPs rather than increasing throughput when applying time-division multiplexing.

Our experiments allowed this multiplexing parameter to be 1, 2, or 4. Higher levels

of time-division multiplexing are not used due to achievable frequency limitations

within the DSPs.


Figure 3.6: Top-level block diagram for the block-matching architecture.

3.2.2 Block-Matching

We have chosen the block matching algorithm to further test our design space ex-

ploration methodology. As detailed in Chapter 2, the highly computation-intensive and data-parallel nature of block matching proved to be a very good match for hardware acceleration. We have selected CIF (352×288) size images as our

benchmark [4] and kept throughput as a design constraint dependent on the design

parameters.

The block diagram of the block-matching hardware is given in Figure 3.6. Each

block from the current frame is loaded into the processing elements (PEs). Meanwhile

the search window is loaded into a buffer to be read by corresponding PEs. Each PE

calculates the absolute difference between a pixel in the current block and a pixel in

the search window block. The SAD of a search location is calculated by adding the

absolute differences calculated by PEs using an adder tree.

The implemented hardware traverses the search locations in the search window


row by row in a zigzag pattern. This allows continuous computation of SADs for

each block of the search window no matter the direction of the search by utilizing the

overlap of search blocks within a given search window. For each new search window,

16 new pixels are required. This zigzag flow is enabled by the use of PEs that are

capable of shifting data up, left and right, and also by the use of 16 8-bit temporary

registers that store the values for an up shift of the search block. These temporary

registers are also filled using the corresponding values of the search window.

For a block matching algorithm using block sizes of N×N and search range [−p, p],

the dataflow of the PEs using zig-zag flow is shown in Table 3.1. Sx,y is a search

window pixel and current block pixels are not shown in the table since they are

not modified (e.g. PE0,0 stores C0,0) throughout the operation of the current search

block. The search window buffer feeds new pixel information to the first column of

PEs every cycle. For a block size of N×N, N cycles are required to fully fill up the

PEs as the pixel information is propagated within each row of PEs. An extra BRAM

is used to perform a reverse shift in the PE array, after all the search locations

in a column are searched in N + 2p cycles. The extra BRAM is connected to the

temporary registers and by the time the search window needs to move in the opposite

direction, the temporary registers have the necessary pixel data to shift through the

PEs. This enables the zig-zag search flow to proceed uninterrupted during vertical

shift of the search window.

Similar to the image deblurring implementation, we have identified a number of

algorithmic and hardware design parameters that create our design space.

Pixel Truncation (algorithm parameter): The SAD calculation is approximated by the pixel truncation parameter. By truncating the least significant bits (varied from 0 to 5 bits) of the pixels in image blocks, a reduction in hardware area and power dissipation is achieved in both the SAD computation (PEs and adder tree) and the comparison blocks. However, this comes at the cost of arithmetic accuracy.
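A one-function sketch of the resulting approximation is given below; the shift amount t plays the role of the pixel truncation parameter, and the array names are illustrative:

    import numpy as np

    def truncated_sad(block, candidate, t):
        # SAD computed on pixels whose t least significant bits are truncated
        # (t between 0 and 5), trading arithmetic accuracy for smaller PEs and
        # a narrower adder tree.
        a = block.astype(np.int32) >> t
        b = candidate.astype(np.int32) >> t
        return int(np.abs(a - b).sum())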


Table 3.1: Data flow of block-matching PEs

Clock      First Row of PEs             Last Row of PEs              Temp Column
Cycle      PE1,N    ...   PE1,1         PEN,N     ...   PEN,1        FFN       ...   FF1
0          S1,1     ...   nop           SN,0      ...   nop          SN+1,0    ...   nop
1          S1,2     ...   nop           SN,1      ...   nop          SN+1,1    ...   nop
...        ...                          ...                          ...
N          S1,N     ...   S1,1          SN,N      ...   SN,1         SN+1,N    ...   SN+1,1
N+1        S1,N+1   ...   S1,2          SN,N+1    ...   SN,2         SN+1,N+1  ...   SN+1,2
N+2        S1,N+2   ...   S1,3          SN,N+2    ...   SN,3         SN+1,N+2  ...   SN+1,3
...        ...                          ...                          ...
N+2p       S1,N+2p  ...   S1,2p         SN,N+2p   ...   SN,2p        SN+1,N+2p ...   SN+1,2p
N+2p+1     S2,N+2p  ...   S2,2p         SN+1,N+2p ...   SN+1,2p      nop       ...   nop
...        ...                          ...                          ...
N+4p       S1,N     ...   S1,1          SN+1,N    ...   SN+1,1       SN+2,1    ...   SN+2,1


Search Window Size (algorithm parameter): We define a search window of

size ±p×p if the maximum and minimum distance from the current block is bounded

by ±p on both the x and y axes. By increasing the search window size, we allow the design to find a potential best matching block further away from the original block; however, this comes at the cost of decreased throughput due to the increased number

of search locations.

Block Size (algorithm parameter): The size of non-overlapping N ×N rectan-

gular blocks (N) is defined as an algorithm parameter. Larger block sizes require

evaluation of a larger area per comparison but reduce the number of blocks to

be evaluated per frame. The granularity of the comparison directly affects the accu-

racy of the system and also provides a trade-off between area and power dissipation.


Number of Processing Elements (design parameter): We vary the number of

PEs used to compute the SAD for a single N×N block. Since the number of absolute

difference computations per block is fixed at N × N , we can have a maximum of

N ×N PEs. By reducing the number of PEs by a power of 2, we end up requiring

more cycles to compute the SAD of a given block, thus allowing a trade-off between

the area and throughput of the implementation.

3.3 Experimental Results

Our hardware accelerator prototypes use a 40 nm Xilinx XC6VLX240T FPGA with

240,000 logic elements and 768 DSP blocks. Xilinx ISE Design Suite 12.4 is used

for physical synthesis and Mentor Graphics Modelsim 10.1b is used for functional

and timing simulations of the design. MATLAB is used for regression and optimiza-

tion. To evaluate our accelerator performance for the image deblurring system, we

use a number of sample images that are captured from the aerial vehicle platform.

For evaluation of the block matching architecture, results are computed using the

Foreman sequence [4]. The design metrics are estimated as follows:

• Throughput: For the deblur design, throughput is measured in terms of the

number of pixels deblurred per cycle. In all our design variants, we ensure the

design meets an operating frequency of 125 MHz and deblurs 8 pixels/cycle. We

relax this limitation for the block-matching algorithm and make throughput

a variable that depends on design parameters. Throughput for this particular

example is measured in terms of frame rate — how many frames we can perform

block matching on, per second.


• Area: Since DSPs are the most critical resource, area for the deblur example

is measured by the number of DSP blocks used by the accelerator. For the

block matching architecture, since DSPs are not used, we measure the area

metric using total number of LUTs. Each logic element in the FPGA used is

composed of 8 registers and 4 LUTs. For purposes of uniformity, we convert

the number of registers to equivalent LUTs and use the total number of LUTs

as our measurement for area of the design.

• Accuracy: To estimate the inaccuracy of a particular deblur accelerator vari-

ant, we compute the mean square error (MSE) between a sample image and its deblurred result from the accelerator, i.e., the average of the squared differences between corresponding pixels of the two. In the case of the block-matching algorithm, we use the MSE between the reference block and the current block relative

to the results obtained from the base implementation: 32×32 window size, no

pixel truncation and 64 PEs.

• Power: To estimate the power dissipation of the accelerator, we followed two

approaches. The first approach executes Modelsim on the routed design to

obtain signal activity values and then feeds these values to the Xilinx XPower

tool to estimate power. The second approach measures the incremental power

consumption of our prototype board directly using an external digital multime-

ter (e.g., Agilent 34410) where the incremental power is the difference between

the reset state power and the execution state power of the design. Our setup

is displayed in Figure 3.7. The first approach, which estimates power from

timing simulation information, computes the power dissipated by the archi-

tecture only, while the second approach gives more representative results as it

accounts for all the additional system power (e.g., FPGA and memory) that

is associated with the computations of our accelerator. The incremental system power is the real cost that the end user incurs. Between the two test cases, we use the XPower results for our block matching architecture analysis and the board results for the deblur architecture. Consistency within each example ensures that the validity and accuracy of our methodology is not compromised.

Figure 3.7: Power measurement setup using an external digital multimeter.

Experimental results for each of our test cases are discussed hereafter:

3.3.1 Modeling Results

Image Deblurring

For the image deblurring accelerator, we implement the design parameters as men-

tioned in Section 3.2.1. We use factors of 1, 2, and 4 for time-division multiplexing

which correspond to a DSP clock frequency of 125, 250 and 500 MHz. We also make

four different choices of average DSP pipeline depths between 3.3 and 11.5. These

depths are calculated by dividing the total number of DSPs used in all pipeline blocks

by the number of blocks used. For kernel bit-width, we vary the parameter from 8

bits to 18 bits. We also pick four random kernel sizes between 5×3 and 13×7. The


combinations of these parameters create a design space with 3 × 8 × 11 × 45 = 11,880 possible design points that potentially lead to different accelerator variants.

Figure 3.8: Error percentage of the power model versus the percentage of the design space explored.

Full physical synthesis (which includes placement and routing) of an accelerator

variant takes about two hours on our quad-core based system, which puts limitations

on the ability to execute a brute-force exploration of all accelerator variants. This

motivates the need for fast design space exploration and optimization. To obtain

our samples, we fully synthesize and implement 50 deblur accelerator variants with

different parameter permutations; i.e., we only sample 50/11,880 = 0.42% of the entire

design space. These design points are selected randomly across the entire design

space with the condition that the minimum and maximum configuration for each

parameter is used at least once. This guarantees that any data point estimated by

our predictor lies within the space covered by the training set, and that the training

set is representative of the entire design space.

We first analyze the precision of different regression models in estimating power,

area, arithmetic accuracy, and throughput against the measurements we obtained


from our samples. We show the comparison of the precision of the models as an

illustration of how our approach stands against traditional regression models (linear,

quadratic, etc.). All these models without L1 regularization have similar run times

(i.e., less than 0.2 seconds). The additional step of finding the right λ parameter

that minimizes the prediction error in the L1 regularization methodology involves

searching over a range and trying out candidate values before a specific value is chosen.

In our approach, we varied λ from 0.0001 to 1000 and randomly picked 25 values

within this range. Of these 25 values, we chose the one that leads to the lowest

error. This whole process took about 25 mins; however, it should be noted that

this is a one-time overhead required for accurate model generation. Once the model

is derived, the process of physically implementing and synthesizing a design can be

eliminated — thus allowing for what would take two hours to be completed in less

than a second.

To train and evaluate the aforementioned model, we split our samples into two

subsets: a training subset is used to learn the model parameters, and a query subset

is used to validate the closeness of the model to the true measurements by taking the

average absolute error between the model predictions and the actual measurements.

To evaluate the quality of the results, we follow the repeated random sub-sampling

validation methodology [32] and repeat our training and query set selection 100

times so that any training bias is eliminated. For the purpose of this particular

implementation, we randomly chose 35 samples as training and the remaining 15 as

query for each iteration. Evaluation of predicted values from the model is validated

by averaging over these 100 runs.
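In code, the validation protocol amounts to the following sketch, where fit and error_of are placeholder callables standing in for the regression fit and its mean absolute query error:

    import numpy as np

    def sub_sampling_validation(X, y, fit, error_of, n_train=35, runs=100, seed=0):
        # Repeated random sub-sampling validation: average the query-set error of
        # a model over `runs` random train/query splits of the sampled designs.
        rng = np.random.default_rng(seed)
        errors = []
        for _ in range(runs):
            idx = rng.permutation(len(y))
            train, query = idx[:n_train], idx[n_train:]
            model = fit(X[train], y[train])
            errors.append(error_of(model, X[query], y[query]))
        return float(np.mean(errors))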

To gain further insights into the effectiveness of various models, we evaluate the

relation between the mean error generated by the different models as a function of

the training subset size. For example, we plot the results for the power model used in


the deblur example in Figure 3.8. The plot shows that the model obtained using L1-regularization requires a slightly larger training set to stabilize due to the presence of higher-order terms. However, it performs better and stabilizes after only 0.3% of the design space has been explored, for both accelerator setups.

Figure 3.9: Sensitivity of the power estimation to the different design parameters.

The model coefficients obtained from L1-regularization reveal the impact and sig-

nificance of different algorithm-design variables on the final outcome of the design.

However, because of the presence of quadratic terms and interactions, numerical

comparisons are not straightforward. To carry out an accurate evaluation, we con-

sider the sensitivity of the L1 model to different variables using MATLAB’s response

surface modeling toolbox. We plot the results, again as given for the power model

for the deblur example in Figure 3.9. The solid line represents estimated power for

each parameter variation, given that all other parameters are kept constant. The

dotted lines show the 95% prediction bands for the power value estimated, showing

that if the prediction was repeated with different samples, the estimated value would

lie within the specified range 95% of the time. It can be inferred from the figure that the power estimate is most sensitive to the average DSP pipeline depth,

and the power values vary the most for this parameter.

Figure 3.10: Comparison of mean error percentage using different model fits for the power, area, and arithmetic accuracy models for the image deblur algorithm.

The trend observed for DSP pipeline depth reflects the trade-off obtained by varying the parameter: smaller pipeline depths require more DSP groups to perform the same number of computations, but with fewer delay registers for synchronization. Therefore both small

and large DSP pipeline depths benefit from this trade-off from different ends in terms

of power. The time-division multiplexing factor affects power the most after aver-

age DSP pipeline depth, followed by kernel bit-width and size, which have similar

impacts as time-division multiplexing. Time-division multiplexing has a quadratic

relationship with respect to power which is caused by the trade-off between the num-

ber of DSPs used and their running frequencies. Lower values require more DSPs,

while larger time-division multiplexing factors require fewer DSPs running at higher

frequencies. As expected, kernel bit-width has a linear interaction with power, since

larger complexity in pixel arithmetic always results in higher power dissipation. Ker-

nel size also has a quadratic relationship with power and power saturates for very

large kernel sizes. The sensitivity results also align with the individual trends we

observed in our measurements.


The estimation accuracy of the different models used for power, area and arith-

metic accuracy metrics is given in Figure 3.10 for training sizes of 36, 9, and 23 samples, respectively.

The results show that the models obtained from L1 regularization outperform other

models.

Block Matching

Similarly, for the block-matching accelerator, we use multiple search window sizes of

±4×4 to ±32×32. The number of least significant bits truncated from pixel values

for computation defines our second design parameter. We vary this from 0 to 5. For

the block size (N ×N), we use 4× 4, 8× 8 and 16× 16 as three different sizes; and

for the number of PEs, we use N×N, (N×N)/2 and (N×N)/4. As in the deblur test case,

combinations of these parameters create a design space with 29 × 6 × 3 × 3 = 1566

possible variants of the accelerator.

Likewise, we only synthesize 18 variants, and hence sample a mere 18/1566 = 1.1%

of the block-matching design space and use sets of training and query data points to

model regression behaviors and cross-validate their performance.

Predictably, the models obtained from L1-regularization, as seen in Figure 3.11, are superior to any other model tried and predict values very close to those obtained from XPower-generated data for all metrics (power, area, arithmetic accuracy and throughput).

The L1-regularized model coefficients provide insight into the interaction of de-

fined parameters with the constraint metrics. In this example, the parameters and

interactions that were dominant in our model for each metric are as follows:


• Throughput: As expected, the two parameters that are dominant for this

metric are window size and the number of PEs. The terms that involve only the

number of PEs have the most impact among all parameters, with the quadratic

term of number of PEs having a larger weight than any other term. These are

followed by the pair-wise interaction terms between the number of PEs and

the window size.

• Area: The area model is dominated by the number of PEs and the pair-wise

interactions involving it. It is interesting to observe that the contribution of

pixel truncation and window size is mainly through their pair-wise interactions

with number of PEs.

• Accuracy: The quadratic term for the search window parameter and its high-

order interaction with pixel truncation are the main contributors to the ac-

curacy model. Similar to the area model, it is surprising to see that pixel

truncation does not impact accuracy on its own but rather mainly through in-

teraction with the window size.

• Power: The only parameter that influences power alone is pixel truncation

and the dominant terms observed are all interaction terms. The term with

the largest weight is the pair-wise interaction between window size and pixel

truncation followed by the pair-wise interaction between pixel truncation and

number of PEs.

It is observed that the non-suppressed terms that appear for each model match

our expectations. However, certain interaction terms have a much larger impact on

the models than may be intuitively obvious to the designer. This demonstrates the

usefulness of our methodology at automatically detecting the combined effects of

parameters on design constraints without relying on user input.


Figure 3.11: Comparison of mean error percentage using different model fits (linear, interaction, quadratic, pure quadratic, and L1-regularized) for the power, area, arithmetic accuracy, and throughput models of the block-matching algorithm.

Figure 3.12: Trade-off between power and arithmetic inaccuracy of the image deblurring system. (Axes: power in W versus arithmetic inaccuracy in MSE; labeled Pareto points include time-mux 4 with 18-bit kernels, time-mux 1 with 14-bit kernels, and time-mux 1 with 8-bit kernels, all with 13×13 kernels.)

Case Study Results

The mathematical models obtained through L1-regularization enable us to create a numerical optimization framework to optimize the accelerator designs with respect to certain selected metrics while imposing constraints on other metrics. These chosen metrics and their values are selected to optimize the design within the specifications of its target deployment. We have verified the optimization methodology presented by Kumud Nepal [49] by applying the parameters detected by the optimization framework on our target accelerators.

Figure 3.13: Trade-off between area and power of the image deblurring system. (Axes: number of DSPs versus power in W; Pareto points at average DSP pipeline depths of 5.3, 6.3, and 7.0.)

Figure 3.12 shows the data points from our design space for the image deblurring example. Each data point corresponds to a set of parameters of a design yielding the least amount of power for a given arithmetic inaccuracy constraint. The data points marked with red filling are the Pareto points of our design space, as no point in the design space can improve either the arithmetic inaccuracy or the power dissipation without making the other design metric worse. We observe that for the image deblurring algorithm, arithmetic inaccuracy is heavily dependent on the kernel bit-width parameter, whereas higher time-multiplexing results in lower power dissipation when the full bit-width of the 18-bit kernels is used.
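For illustration, the Pareto points themselves can be extracted with a few lines (an illustrative Python helper, not part of the thesis framework; the sample values are made up to resemble the figure's axis ranges):

    def pareto_points(points):
        """Keep (mse, power) pairs not dominated in both metrics (lower is better)."""
        frontier, best_power = [], float("inf")
        for mse, power in sorted(points):   # sweep by increasing inaccuracy
            if power < best_power:          # strictly better power than all before
                frontier.append((mse, power))
                best_power = power
        return frontier

    designs = [(0.00062, 1.25), (0.00070, 0.95), (0.00070, 1.10), (0.00090, 0.78)]
    print(pareto_points(designs))
    # [(0.00062, 1.25), (0.0007, 0.95), (0.0009, 0.78)]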

Figure 3.14: Trade-off between arithmetic inaccuracy and area of the block matching system. (Axes: area in LUTs versus arithmetic inaccuracy in MSE; Pareto points at pixel truncations of 0, 4, and 5, each with a ±4×4 search window and 64 PEs.)

Similarly, Figure 3.13 shows the data points from our design space that reflect the trade-off between area and power dissipation of the image deblurring design in terms of the pipeline depth parameter. The data points with red fillings represent the Pareto points on our design space, and each Pareto point corresponds to a different pipeline depth value for our system. Having this behavior represented in our prediction modeling framework, a designer can identify the area vs. power trade-off that is most suitable for a given application domain and use the corresponding pipeline depth as a parameter.

We observe the trade-off between area and arithmetic inaccuracy for our block matching design in Figure 3.14. The sole parameter that controls this trade-off is pixel truncation, since the data points that result in minimum area for a given arithmetic inaccuracy constraint always share the same values for the rest of the parameters. Each Pareto point marked with red filling on the plot corresponds to a different pixel truncation parameter; therefore a designer only needs to adjust the value of pixel truncation to generate a design with the required area vs. arithmetic inaccuracy trade-off.


Significance of Results

Compared to previous regression techniques proposed in the literature, L1 regularization enables automatic discovery of the exact mathematical dependency between the variables and the desired model outcome with no need for guesswork from the designers. This exact dependency results in improved correspondence between the results of the model and the actual measurements.

These benefits of automated model generation, and the resulting correlation of our models to the measurements, enable us to use the models to query directly for non-sampled design points, obtaining a dramatic speed-up in design space exploration. Since a full synthesis plus place and route for our deblur design takes two hours, and our model achieves its least error using only 35 samples (0.3%) of the full 11880-point design space, our L1-based model is able to achieve approximately a 340× speedup in design exploration with estimation errors of 7.48%, 2.38%, and 9.22% for power, area, and accuracy respectively. With the block matching algorithm, we use only 18 samples from the entire design space for training, so our speedup is approximately 90×. We report speedup for both examples as the ratio of the time it would have taken to implement and synthesize the entire design space to the time it takes to implement and synthesize only a few sample points for training and query.

The non-linear optimization framework implemented in MATLAB after the models

are generated for each metric takes about 0.1 seconds to run and is almost negligible

in comparison to the runtime taken for synthesis of the designs.
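A sketch of this optimization step under stated assumptions is shown below; it substitutes scipy.optimize.minimize for the MATLAB framework, and the two model functions are hypothetical stand-ins where the fitted L1-regularized polynomials would be plugged in:

    import numpy as np
    from scipy.optimize import minimize

    def power_model(x):          # x = [time_mux, bit_width, kernel_size]
        t, b, k = x              # hypothetical stand-in for the fitted model
        return 0.5 + 0.02 * b + 0.005 * k * k - 0.03 * t + 0.004 * t * t

    def mse_model(x):            # hypothetical arithmetic-inaccuracy model
        t, b, k = x
        return 0.05 * np.exp(-0.4 * b) + 1e-5 * k

    # Minimize predicted power subject to mse(x) <= 0.001 and parameter bounds.
    res = minimize(power_model, x0=[2.0, 12.0, 9.0],
                   bounds=[(1, 8), (8, 18), (3, 13)],
                   constraints=[{"type": "ineq",
                                 "fun": lambda x: 0.001 - mse_model(x)}])
    print(res.x)  # continuous optimum; round to the nearest valid design point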

Our optimization framework presents us with the ideal parameters to adjust in

order to obtain the trade-offs between two design constraints. Using this information,

we are able to identify the range of the design metrics most crucial to our target

application.


3.4 Summary and Discussion

In this chapter we explored, via two different architecture setups, techniques for fast design space exploration for FPGA-based accelerators. The reconfigurability of FPGAs enabled us to implement a small fraction of a large design space and apply regression analysis to obtain analytical information and formulate scalable models to predict values for various design metrics such as power, arithmetic accuracy, performance, and area. We proposed automatic techniques to devise the best model using regression analysis such as L1-regularized least squares estimation. We created a case study for an image deblurring accelerator. For the accelerator design, the proposed models predict the implementation metrics within 8% of measured power values, within 10% of the output arithmetic accuracy, and within 3% of actual FPGA resources used. We also studied a block-matching accelerator as a second test case and introduced system throughput as an additional variable dependent on the design parameters. This second accelerator design confirmed our findings about the benefits of the L1-based modeling methodology. Our predictions were again fairly close to the actual measurements of the metrics: within 10%, 4%, 6%, and 4% for power, area, arithmetic accuracy, and throughput respectively. With these accurate models in hand, we are able to expedite immensely the design space exploration process: a 340× speedup for the image deblurring test case and a 90× speedup for the block matching test case were achieved. We are also able to formulate numerical optimizations that directly give the optimal design parameters under various objectives and constraints.

Despite the huge speedups we have gained in design space exploration and constraint-driven optimization, the very long synthesis and placement times needed to gather our samples were the limiting factors in the generation of our models. More importantly, the time to prototype these designs is fairly long, mainly due to the waveform-simulation-based verification and debugging of these implementations. During the course of this work, we aimed to test our experience in the domain of power-aware implementation of image processing algorithms on FPGAs by entering the first of the now-annual low-power image recognition challenges (LPIRC). Due to the tight schedule of the competition we had to forgo our FPGA implementation and proceed to the competition with a software solution. We observed that none of the FPGA-focused teams managed to produce a working system by the time of the competition, which was heavily dominated by groups using embedded GPUs, specifically the NVIDIA Jetson TK1. The GPUs performed very well in terms of power efficiency compared with low-power CPU-based systems, and provided an easy-to-implement solution compared with hardware accelerators.

Given our experience participating in the low-power image recognition challenge, going forward, we have decided to explore the design space of other low-power embedded platforms, mainly embedded GPUs, in order to better understand their impact on area, runtime, and power dissipation of various algorithms.


Chapter 4

Hardware Acceleration on Low-power Embedded Platforms

In the previous chapter, we presented an approach for design space exploration using

analytical models. Our design space is composed of design configurations that use

both algorithmic level design parameters (e.g., input bit-width and kernel size) and

hardware level design parameters (e.g., time-division multiplexing and DSP pipeline

depth) for FPGA based accelerators.

In this chapter, we present a comparative study of feature detection and description algorithms across various embedded platforms. We evaluate these algorithms in terms of run-time performance, power dissipation, and energy consumption. In particular, we compare embedded CPU-based, GPU-accelerated, and FPGA-accelerated embedded platforms and explore the implications of various architectural features for the acceleration of these fundamental computer vision algorithms.

Figure 4.1: The precision/recall rate and the run-time comparison of feature descriptors (SIFT, SURF, BRIEF, BRISK, FREAK) on an Intel i7 CPU. (Left axis: precision/recall in %; right axis: run-time in ms, log scale.)

Feature detection and description algorithms form the basis for the majority of present-day computation-intensive computer vision applications such as 3D mapping, object detection and tracking, and motion and camera pose estimation. This section provides an overview of these algorithms, a discussion of their amenability to hardware acceleration on three different low-power embedded systems (ARM CPU, GPU, and FPGA), and an overview of the metrics utilized to characterize the performance of hardware-accelerated detection and description algorithms.

4.1 Selection of Feature Detection and Description Algorithms

Feature descriptors based on Histogram of Gradients (HoG) such as SIFT and SURF

require computing the gradient of the image in the region of each feature, which is

a very costly process. The SURF algorithm speeds up this process via the use of

integral images; however, it is still not efficient enough to be used for real-time

embedded applications.

Page 78: Design-Space Exploration of Embedded Hardware Accelerators

66

As shown in Figure 4.1, SIFT feature descriptors perform best among commonly used feature descriptors in terms of precision and recall rates, where precision is the number of relevant detected features over the total number of detected features and recall is the number of relevant detected features over the total number of relevant features. However, when comparing the run time of HoG-based and binary feature descriptors, we observe that SIFT (as a HoG-based descriptor) has a computation time two orders of magnitude greater than that of binary descriptors (540ms vs. 3.5ms for BRISK, the slowest binary descriptor). The SURF algorithm, which is also HoG-based, improves computation time through the use of integral images; however, it is still an order of magnitude slower than the binary descriptor algorithms, which is insufficient for real-time use.

The flowchart of the feature detection and description framework is given in Figure 4.2. The program starts by reading the input frame from memory. The FAST feature detection algorithm is then applied over each pixel (p) of the frame as a sliding window operation, as detailed in Chapter 2. Despite the non-standard mask of the FAST algorithm, the Bresenham circle, in which the immediate neighbors of the center pixel are not used for computation, spatial locality can still be exploited to parallelize the algorithm. The high spatial locality of sliding window operations lends itself to parallelization: a pixel is more likely to be accessed when its neighboring pixel has already been accessed, so each memory read for a group of pixels is likely to fetch multiple values for immediate use. Both instruction-level and thread-level parallelization techniques can be applied to the algorithm. The pixels on the Bresenham circle are transferred into 16-element arrays, where instruction-level parallelism is utilized to compare the values on the circle with the center pixel. Once the binary comparison array is generated, the existence of a continuous string of 12 bits is checked. If this check returns true, then the center pixel is declared a corner feature. This process is repeated for each pixel of the input frame and can be parallelized over multiple computation units or threads. Once a certain p is found to be a corner, N sampling pairs X_i and Y_i (∀i ∈ N) surrounding p are evaluated to describe the feature using an N-bit descriptor vector D. Each bit of vector D is assigned true (1) or false (0) based on the evaluation of X_i > Y_i.

Figure 4.2: Flowchart for feature detection and description.
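The flow just described can be summarized in a short behavioral sketch (illustrative Python, not the Verilog or OpenCV implementations evaluated in this chapter; the threshold value and the sampling-pair table are assumptions):

    import numpy as np

    # Offsets of the 16 pixels on the radius-3 Bresenham circle.
    CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
              (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

    def is_corner(img, y, x, t=20):
        """FAST segment test: 12 contiguous circle pixels all brighter or all darker."""
        c = int(img[y, x])
        ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
        for sign in (+1, -1):                  # brighter run, then darker run
            hits = [sign * (p - c) > t for p in ring]
            hits = hits + hits                 # duplicate to catch wrap-around runs
            run = 0
            for h in hits:
                run = run + 1 if h else 0
                if run >= 12:
                    return True
        return False

    def describe(img, y, x, pairs):
        """One descriptor bit per sampling pair: 1 if the first sample is brighter."""
        return [1 if img[y + ay, x + ax] > img[y + by, x + bx] else 0
                for (ax, ay, bx, by) in pairs]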


4.2 Platform Implementations

We have targeted three distinct low-power embedded platforms for evaluating the

various feature descriptor and detector algorithms. For the low-power GPU and

embedded CPU platforms, we used the Jetson TK1 development kit. The Jetson

board has a 28nm Tegra K1 SoC with an integrated Kepler GPU with 192 CUDA

cores that run at 950MHz and a quad-core ARM Cortex A15 CPU that runs at

2.5GHz. The system has 2GB of on-board memory. For the FPGA design we use a

MicroZED development board featuring a 28nm Zynq 7020 SoC, which integrates

an Artix-7 FPGA with a dual-core ARM Cortex A9 CPU, and a 1GB DDR3. This

platform is logically divided into a Processing System (PS) side containing the ARM

CPUs and the Programmable Logic (PL) side with the FPGA and associated support

logic. The DDR3 on the FPGA board is used to store input image data, output coordinates,

and feature descriptor vectors. A 32-bit AXI Central Interconnect module is used

to interface our custom IP module with the DDR3 via the ARM AMBA AXI4-Lite

protocol standard at an effective frequency of 111MHz.

The Zynq FPGA uses a bare-metal configuration, and one of two available ARM

Cortex-A9 CPUs was used for debugging and initialization purposes. The Zynq

FPGA monitors the address space of the DDR3 via the Memory Interface to read

and verify its contents while our design is running. Our custom IP module and

AXI interconnect module are located on the PL side. The memory interface which

directly connects the ARM CPUs to the DDR3 is located on the PS side. The

different image processing algorithms were simulated and implemented on Vivado

2015.2 IDE and run on the Zynq MicroZED board using Xilinx SDK. The GPU

and embedded CPU implementations were tested using the Ubuntu Linux for Tegra

distribution and OpenCV version 2.4.10. All of our implementations were tested


using an 800×480 Wide VGA (WGA) image resolution, which is commonly used for

high-quality hand-held devices and CMOS image sensors used in robotics.

Both the Zynq and Tegra SoCs utilize the same critical feature size (28nm), thus

making the architectures the primary differentiator in our experimental evaluation.

4.2.1 FPGA Architecture

The block diagram of the FPGA hardware is given in Figure 4.3. We rely on the

AXI4-Lite protocol to transfer data between our custom IP and the DDR3 where

our image input data and output memory reside. Depending on the algorithm under

evaluation, the custom IP module contains the Verilog implementation of FAST,

the integrated versions of FAST+BRIEF, FAST+BRISK or FAST+FREAK. In our

design, the Zynq processing system has the dual role of initiating the main system

clock and monitoring the contents of the DDR3. The custom IP module behaves as a

master and initiates memory-mapped reads and writes to the DDR3, which behaves as a slave in accordance with the AXI4-Lite protocol.

To initiate a memory read, our custom IP waits for the DDR3 to assert a ready

signal. This signal is not asserted every clock cycle; hence, to guarantee a continuous

flow of input data from the DDR3 to the custom IP, our block diagram, shown in

Figure 4.3, uses a 10-line word buffer to deliver buffered pixel data at a rate of

1 byte per clock cycle as a continuous stream input to FAST, FAST+BRIEF or

FAST+BRISK. The output coordinates and/or descriptor vectors produced by the

algorithms are then written to the DDR3 one word at a time.

Figure 4.3: Top-level block diagram for the FPGA implementation with FAST feature detection and BRIEF/BRISK/FREAK feature description.

FAST feature detection uses the Bresenham circle mask to traverse an image frame. Even though each mask operation uses 16 pixels, the whole mask size is

7×7, and the whole mask region must be provided to the FAST feature detection

computational logic to sustain a single pixel per cycle throughput, requiring a high

memory bandwidth. However, on an FPGA implementation, available logic elements

can be freely traded for additional storage or computation, allowing us to create line

buffers to reuse overlapping pixels for subsequent masks and effectively increasing

the bandwidth of our computational logic.

Each of the line buffers is used to store the contents of a single row of the input image data using address-accessible 1-D register arrays. For a mask size of N × N, N line buffers are utilized. Each subsequent row of the input image overwrites the contents of the oldest line buffer.
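A software analogue of this buffering scheme is sketched below for intuition (a behavioral model only, scanning in simple raster order rather than the zigzag pattern described next; it keeps N-1 line buffers and uses the live pixel stream as the bottom window row):

    import numpy as np

    def sliding_windows(stream, width, n=7):
        bufs = np.zeros((n - 1, width), dtype=np.uint8)   # previous n-1 image rows
        window = np.zeros((n, n), dtype=np.uint8)         # n x n register array
        for i, p in enumerate(stream):                    # one pixel per "cycle"
            row, col = divmod(i, width)
            # Window column: oldest buffered row first, live pixel last. The
            # oldest entry is read before it is overwritten below.
            colvec = [bufs[(row - k) % (n - 1), col] for k in range(n - 1, 0, -1)]
            colvec.append(p)
            window[:, :-1] = window[:, 1:]                # shift register array left
            window[:, -1] = colvec
            bufs[row % (n - 1), col] = p                  # recycle oldest line buffer
            if row >= n - 1 and col >= n - 1:
                yield window.copy()                       # full n x n mask is valid

    img = np.arange(100, dtype=np.uint8).reshape(10, 10)
    assert (next(sliding_windows(img.ravel(), width=10)) == img[:7, :7]).all()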

The image corresponding to the 7×7 mask size of the Bresenham circle is stored

in a 7×7 register array made up of shift registers. With each pixel read from the input

image, the bottom row of the register array is updated with the new data, shifting

all the other pixel values horizontally. Meanwhile all other rows of the register

array are updated reading the corresponding pixel value from the corresponding line

buffer. The matching of the line buffers with the row of the register array is done

by utilizing pointers that keep shifting whenever a new line buffer is being written.

This architecture enables us to reduce the memory bandwidth requirement of the

FPGA design to one pixel per cycle despite the size of the computational mask.

For the reading of the line buffers, we have applied a zigzag access pattern similar

to the implementation used for our block matching hardware presented in Chapter 3.

This approach allows continuous computation of the mask by utilizing the overlap of

pixels in a given image block even through row changes. The filtering starts from the

top left pixel location of the input I and proceeds to the following pixel on the same


row until the last possible location of this row is filtered. Then, the filter operation

continues with the last pixel location of the next row and proceeds with the previous

pixel until the first search location of this new row is parsed. Only as many pixels

as the width of the filter are required by the computation units in each cycle to

calculate the result of filtering regardless of its position in the image. This zigzag

flow is enabled by the use of computation units that are capable of shifting data up,

left and right, and also by the use of temporary registers that store the values for an

up shift of the image block.
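The traversal order itself is easy to illustrate (illustrative Python with hypothetical dimensions); odd rows are visited right-to-left so consecutive mask positions always overlap, even across row changes:

    def zigzag_positions(rows, cols):
        for r in range(rows):
            cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
            for c in cs:
                yield r, c

    print(list(zigzag_positions(3, 4)))
    # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (1, 2), (1, 1), (1, 0), (2, 0), ...]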

For each valid state of the 7×7 register array, the pixels that correspond to the Bresenham circle around the center pixel are evaluated. An initial comparison of 4 pixels intersecting the axes of the circle (labeled as 1, 5, 9, and 13 in Figure 2.9) is performed to cut down the total number of comparisons, as described in [56]. If this pre-computation returns false, then there is no need to further compare the remaining 12 pixel points of the Bresenham circle, since there is no chance of obtaining a continuous 12-pixel ring around the center. The reduction in the number of comparisons does not impact the delay of the computation, since the pixels need to be registered in either scenario to sustain constant throughput; however, the reduction of redundant comparison operations and bit propagation reduces the dynamic power dissipation of the circuit.

The feature description in our Zynq FPGA system has been implemented and evaluated with three different feature descriptor algorithms: BRIEF, BRISK, and FREAK. All three algorithms share the same basic framework presented in Figure 4.2. A region centered around a detected corner p needs to be described as a binary string. Given a sampling pattern, the feature descriptor generates N sampling pairs for each detected feature, determines whether the first or the second element of each pair is greater than the other, and defines the pair as binary 1 or 0 correspondingly. The resulting N-bit vector D is the feature descriptor for that point, to be used for feature matching. Here N is equal to 512 for all of our descriptor implementations.
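Although matching itself lies outside the accelerator, the intended use of these vectors is easy to sketch (illustrative Python; packing the 512 bits into integers is an assumption): comparison reduces to an XOR followed by a population count.

    def hamming(d1: int, d2: int) -> int:
        return bin(d1 ^ d2).count("1")      # XOR, then popcount

    def best_match(query, candidates):
        return min(range(len(candidates)),
                   key=lambda i: hamming(query, candidates[i]))

    a = int("1011" * 128, 2)                # hypothetical 512-bit descriptors
    b = int("1010" * 128, 2)
    print(hamming(a, b))                    # 128: one differing bit per 4-bit group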

The descriptor algorithms are implemented as additional pipeline stages in the FAST feature detection implementation. The line buffers used in FAST are increased in number to cover 31 rows of input image data and are utilized for both the detection and description phases. Once the current pixel coordinate is computed to be a feature, sampling pairs are generated from a 31 × 31 sampling window around its location, where the sampling window is stored in a 31×31 register array and traversed in a zigzag manner using the technique presented with the block matching design in Chapter 3.

The sampling pairs for descriptor implementations are fixed according to the

algorithm descriptions. Therefore the pairs can be compared to generate the 512-bit

binary feature descriptor vector directly from the 31×31 register array. This register

array needs to be updated with each new data read; however, feature description

computation only takes place after a pixel coordinate is already computed as a

feature coordinate.

Unlike BRIEF, the BRISK algorithm is an orientation-invariant feature descriptor

and estimates the orientation of detected features from the selected sampling pairs

and rotates the sampling pattern to neutralize the effect of rotation. The sampling

points are distinguished as short pairs and long pairs based on their distance to each

other, where long pairs are used to determine orientation and short pairs are used

for the intensity comparisons that build the descriptor. For BRIEF, a preliminary

smoothing operation is applied over the whole image I before the sampling pairs are

chosen and compared. The smoothing is applied over the 31×31 sampling window


for each detected feature.

4.2.2 GPU Architecture

The Tegra K1 SoC combines four ARM Cortex A15 CPU cores with a Kepler class

GPU with 192 CUDA cores on the same die. The Kepler GPU has a separate 64kB

on-chip L1 cache and a 128kB on-chip L2 cache, while the CPUs and GPU share

2GB of off-chip memory. GPU programming with the Tegra K1 is dominated by two

factors common to all GPUs. The first is the highly parallel SIMD nature of GPU

programming. The second is the unique memory model.

The highly parallel nature of GPU programming makes large amounts of branch and control logic expensive to implement. GPU threads on the K1 are organized into

blocks that can access global GPU memory or a memory space shared only within

the thread block. Shared memory on the K1 is implemented with the 64kB L1 cache

so most memory accesses for the kernels under study will be to the 2GB off-chip

global memory space.

We implement our GPU kernels using version 2.4.10 of the OpenCV computer

vision library and NVIDIA’s CUDA API. The implementations of FAST feature

detector and BRIEF and BRISK feature descriptors for both the CPU and GPU

OpenCV library are publicly available and highly optimized by the community to be

used for performance-critical applications. We have excluded the analysis of the FREAK descriptor on the GPU due to the lack of an optimized GPU implementation.

The FAST feature detector kernel can be mapped almost entirely to the Kepler

GPU. This kernel contains a modest amount of branches and no loops. However,


the feature descriptor implementations of BRISK and BRIEF have large sections of initialization code which do not map well to a GPU architecture because of the non-uniform access pattern to the image data. These code sections are executed on a single ARM core, limiting the achievable speedup. It is possible that some initialization code could be mapped to other CPU cores, achieving a small additional speedup. However, the additional speedup was judged to be too modest to warrant inclusion in our analysis.

4.3 Results

The run-time, power, and energy comparisons of the FAST feature detector with and without the BRIEF/BRISK feature description algorithms on various embedded systems are given in Figure 4.5. As mentioned earlier, both the CPU and GPU implementations were run using the Jetson TK1 development board. Jetson contains a 192-core NVIDIA Kepler GK20a GPU, which can process up to 300 gigaflops, and an ARM quad-core Cortex A15 CPU running at 2.3GHz. The FPGA implementation results were taken from the MicroZED development board running at 111 MHz. For all platforms, we have measured the power by intercepting the current between the power supply and the system with a 1 mΩ shunt resistor. The voltage across the resistor is measured using a multimeter to calculate the total power of each system.
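As a worked example of this measurement (the readings are illustrative):

    R_SHUNT = 0.001       # ohms (the 1 mOhm shunt)
    V_SUPPLY = 12.0       # volts, hypothetical supply rail
    v_shunt = 0.00052     # volts measured across the shunt

    current = v_shunt / R_SHUNT    # 0.52 A drawn by the system
    power = V_SUPPLY * current     # 6.24 W total
    print(f"I = {current:.2f} A, P = {power:.2f} W")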

As a sliding window operation, FAST has a highly predictable data access pattern

traversing each input image frame. On the other hand, feature description algorithms

traverse over a list of detected features and accesses the sampling window around

each feature coordinate, resulting in an irregular access pattern. This irregularity

impacts the run-time of the CPU the most, whereas the GPU architecture can handle


Figure 4.4: Issue stall reasons (Execution Dependency, Pipe Busy, Memory Throttle, Not Selected) for the FAST feature detection and BRIEF feature description implementations on the GPU.

irregular accesses to 2-D data relatively well. For the FPGA implementation, the flexibility to trade off resources for performance allows us to fully pipeline the detection and description computation, completely eliminating the need for additional memory accesses and thereby retaining the same throughput and thus very similar run-times.

We have used the NVIDIA Visual Profiler [18] to identify the performance bottlenecks for our GPU implementations and analyze the underlying reasons for execution stalls. Figure 4.4 displays the source of execution stalls for both our FAST (detection) and BRIEF (description) implementations. The dominant causes of execution stalls include Execution Dependency, Pipe Busy, Memory Throttle, and Not Selected. Execution dependency stalls occur when an instruction is waiting for at least one of its inputs to be computed by earlier instructions. Pipe busy stalls are observed when the computation unit required for the instruction is not available. Memory throttle stalls arise when more memory requests are issued than the load/store unit can accommodate in time. Not selected stalls occur when the warp scheduler gives priority to another kernel over the computation kernel and thus does not allocate the required resources.

Table 4.1: Instruction count comparison between GPU and FPGA implementations

                       Load/Store    Floating Point and Integer    Control
    FAST FPGA          384000        2688000                       12288000
    FAST+BRIEF FPGA    384000        3456000                       17280000
    FAST GPU           304629        46310713                      12000
    FAST+BRIEF GPU     2560475       64262713                      1113277

We observe that the FAST feature detection algorithm is heavily stalled due to execution dependencies compared to the description computation. On the other hand,

the description computation is heavily stalled due to memory throttle. This shows

us that the FAST feature detection algorithm could benefit greatly from further

instruction-level parallelism, whereas the BRIEF feature descriptor algorithm would

benefit from better data management.

The total instruction counts for our GPU and FPGA implementations are given in Table 4.1. As before, the GPU instruction counts were taken from the NVIDIA Visual Profiler tool. The corresponding instruction counts for the FPGA were estimated from the design architecture and system-level simulation. We estimate the Load/Store instruction count as the number of accesses to the DDR3 and disregard the accesses to the internal register buffers. The number of integer instructions is estimated based on the total number of cycle operations the architecture takes when all the pipeline stages are unrolled. Therefore, each cycle of operation a single feature candidate or descriptor goes through is evaluated as an instruction. There can be multiple instructions in different pipeline layers, giving an instruction count metric comparable to that of a CPU or GPU. The control instructions were estimated via a code-level analysis of branch instructions. Similar to the cycle instruction totals, the estimates are made by multiplying the number of branches per mask operation by the total number of mask operations applied over an image; thus each pipeline stage is considered as a separate MIMD instruction.

Figure 4.5: Run-time, power, and energy results for FAST feature detection and BRIEF/BRISK/FREAK feature description algorithms over various embedded systems. (Panels compare energy in mJ, run-time in ms, and power in W across an Intel i7 CPU, the ARM CPU on Jetson, the Tegra GPU on Jetson, and the Zynq FPGA.)

We observe that the feature detection kernel (i.e., FAST) corresponds to the majority of the floating point and integer instructions, making up over 70% of the total

instructions issued. On the other hand, the number of load/store instructions is significantly higher for the feature description kernel (i.e., BRIEF), specifically for the GPU. For the FPGA implementation, the image is read from the external memory into our line buffers only once for both the detection and description computations, drastically reducing the number of memory accesses. Thereafter the corresponding pixel data is propagated through our computation logic until it is discarded.

As seen in Figure 4.5, the power consumption of the feature description algorithms is very similar for our embedded CPU and GPU implementations, whereas FAST feature detection consumes significantly less power on the GPU. The sliding window operation for the detection part is efficiently distributed to the low-power cores of the GPU, whereas the feature description suffers from the bottleneck of irregular memory accesses in terms of power consumption.

The resource utilization of our FPGA implementations is reported in Table 4.2. The additional logic for description computation in the FPGA implementation is readily available to compute the description at each pixel coordinate; however, unnecessary bit propagation is eliminated by registering the inputs in order to minimize the power cost of the extra circuitry.

Table 4.2: Resource utilization on the Zynq FPGA

                    LUTs     FFs    BRAMs
    FAST            4564    1551        8
    FAST+BRIEF     14398    2093       11
    FAST+BRISK     25575    7115       11
    FAST+FREAK     28684    7935       11

The run-time performances of the GPU and FPGA are more comparable. However, the highly customizable architecture of the FPGA lends itself much better to optimization of the FAST+BRIEF and FAST+BRISK implementations. When FAST feature detection is pipelined with description, our results show that the GPU lags behind in performance due to a lack of MIMD capabilities. Combined with the lower power dissipation overhead of the FPGA boards, we can see a clear advantage of FPGAs over the other platforms in terms of energy consumption, with measurements of 705mJ, 174mJ, and 16mJ for feature detection and description of WGA-size frames on embedded CPUs, GPUs, and FPGAs respectively, giving the FPGA a 98% advantage over the CPU implementation and a 90% advantage over the GPU implementation.

4.4 Summary and Discussion

In this chapter we presented a comparative analysis of the FAST feature detection algorithm along with the BRIEF and BRISK feature description algorithms on various embedded systems (embedded CPUs, GPUs, and FPGAs) in terms of run-time performance, power, and energy. We determined that the utilization of hardware-oriented and power-aware design decisions such as deep pipelining, continuous filter flow, and pre-computation steps allows high-throughput FPGA implementations to outperform state-of-the-art embedded CPUs and GPUs in terms of both power and performance. We show that despite the high level of parallelization GPUs provide, computation of multiple kernels is highly bounded by the kernel scheduler and memory bottlenecks, whereas the layer-by-layer customization of FPGAs can tackle the operation of multiple kernels much more efficiently. We have shown that initial profiling of GPU implementations can allow designers to identify bottlenecks in a design and deduce whether these bottlenecks can yield performance gains through the custom hardware programmability of FPGAs. This analysis constitutes a first step toward high-performance computer vision based embedded systems. Future work will build upon these results by integrating real-time image sensor data and adding additional hardware-accelerated kernels such as those necessary for autonomous navigation and mapping applications.


Chapter 5

Summary of Dissertation and Possible Future Extensions

With the rising complexity of image processing and computer vision applications,

more pressure is being placed on designing architectures that can effectively deal

with their high throughput requirements. Additionally, the use of such systems in

highly resource-constrained mobile environments makes the trade-offs between design

constraints such as area and power even more impactful.

In this thesis, we have investigated different hardware accelerator platforms

specifically targeted for real-time image processing applications and explored how

smart algorithmic and architectural choices can lead to optimal designs that meet

specific constraints in area, power, or performance. We explored different embedded

systems and accelerated various image processing algorithms in order to demonstrate

the impact of various design choices.


5.1 Summary of Results

In Chapter 3 we explored techniques for fast design space exploration and multi-objective design optimization for FPGA-based accelerators using two different image processing applications. We utilized the reconfigurability of FPGAs to implement a small fraction of a large design space and applied regression analysis to obtain analytical information and formulate scalable models that predict various design metrics such as power, arithmetic accuracy, performance, and area.

As our first case study, we implemented an image deblurring accelerator for FPGAs. For this accelerator design, the proposed models predict the implementation metrics within 8% of measured power values, within 10% of the output arithmetic accuracy, and within 3% of actual FPGA resources used. We used a full-search block matching algorithm as a secondary test case and introduced system throughput as an additional variable dependent on the design parameters. This second accelerator design confirmed our findings about the benefits of the L1-based modeling methodology. Our predictions were fairly close to the actual measurements of the metrics, within 10%, 4%, 6%, and 4% for power, area, arithmetic accuracy, and throughput respectively. Using these predictions, we were able to accelerate the design space exploration process by 340× for the image deblurring test case and by 90× for the block matching test case. We have also explored finding the optimal design parameters under various objectives and constraints.

The work presented in Chapter 3 led us to expand our design parameters beyond the algorithmic and architectural design choices and to use the embedded system itself as a design parameter as well. In Chapter 4, we presented a comprehensive comparison between embedded CPU, GPU, and FPGA implementations of the FAST feature detection and the BRIEF, BRISK, and FREAK feature description algorithms, evaluating their power and performance trade-offs while exploring the architectural advantages and limitations of the acceleration platforms. We determined that the utilization of hardware-oriented and power-aware design decisions such as deep pipelining, continuous filter flow, and pre-computation steps allows high-throughput FPGA implementations to outperform state-of-the-art embedded CPUs and GPUs in terms of both power and performance. We showed that despite the high-level parallelization GPUs provide, computation of multiple kernels is highly bounded by kernel scheduler and memory bottlenecks, whereas the layer-by-layer customization of FPGAs can tackle the operation of multiple kernels much more efficiently. We have shown that initial profiling of GPU implementations can allow designers to identify bottlenecks in a design and deduce whether these bottlenecks can be reduced with the custom hardware programmability of FPGAs.

5.2 Future Work

The work presented in this dissertation can be extended in a few directions. Our exploration and optimization methodology presented in Chapter 3 can be extended to ASICs, which would open up a wider range of architectural design decisions. Our methodology should work equally well or even better for ASICs, as they enable greater customization of the hardware designs. By combining the modeling presented in Chapter 3 and the embedded system exploration in Chapter 4, a more methodological approach to selecting embedded systems can be explored. The embedded system can be used as a tertiary design choice in addition to the algorithmic and architectural design options presented, which would also enable a wider range of architectural design decisions to be explored for a wider range of embedded systems. Another direction would be to combine both approaches from Chapters 3 and 4 by training our regression-based design space exploration methodology on the software-driven and instruction-level analysis presented in Chapter 4 and making predictions for hardware accelerators across a range of design decisions.

We have made heavy use of the designer's knowledge of the algorithm and the various architectural platforms to perform our design space explorations. In the future, standardizing the architectural blocks for the image processing domain and building a larger variety of algorithms using these standardized blocks would enable our approach to make even earlier predictions on a wider variety of parameters. With heavier use of standardized blocks, an automated parameter selection process could be devised to further reduce the need for user input in our L1 regularization methodology. Reducing the dependency on the designer's own knowledge before making design-level decisions would also accelerate the prototyping of such algorithms for real-time use.

Finally, our work presented in Chapter 3 can also be expanded to devise separate

regression models for different parts of the design. For instance, if the control and

data sections of a design were each mapped to separate regression models, this may

lead to higher precision for predicting the various design metrics.

The analysis presented in Chapter 4 constitutes a first step toward high-performance computer vision based embedded systems. Future work could build upon these results by considering additional hardware-accelerated kernels, such as those necessary for autonomous navigation and mapping applications, and by exploring a larger system to be used on multiple embedded platforms, combining the advantages provided by them.


Bibliography

[1] Nabil Abdelli, A-M Fouilliart, Nathalie Julien, and Eric Senn. High-level power estimation of FPGA. In 2007 IEEE International Symposium on Industrial Electronics, pages 925–930. Institute of Electrical & Electronics Engineers (IEEE), June 2007.

[2] Giovanni Agosta, Gianluca Palermo, and Cristina Silvano. Multi-objective co-exploration of source code transformations and design space architectures for low-power embedded systems. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04, pages 891–896, New York, NY, USA, 2004. ACM.

[3] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. FREAK: Fast retina keypoint. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 510–517. IEEE, 2012.

[4] Arizona State University. YUV test sequences. http://trace.eas.asu.edu/yuv/index.html.

[5] Giuseppe Ascia, Vincenzo Catania, and Maurizio Palesi. A multiobjective genetic approach for system-level exploration in parameterized systems-on-a-chip. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 24(4):635–645, April 2005.

[6] Luna Backes, Alejandro Rico, and Bjorn Franke. Experiences in speeding up computer vision applications on mobile computing platforms. In Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, pages 1–8. Institute of Electrical & Electronics Engineers (IEEE), July 2015.

[7] Xuan-Quang Banh and Yap-Peng Tan. Adaptive dual-cross search algorithm for block-matching motion estimation. IEEE Transactions on Consumer Electronics, 50(2):766–775, May 2004.

[8] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Computer Vision–ECCV, pages 404–417. Springer, 2006.

[9] Jan Biemond, Reginald L. Lagendijk, and Russell M. Mersereau. Iterative methods for image deblurring. Proceedings of the IEEE, 78(5):856–883, May 1990.


[10] Dimitris Bouris, Antonis Nikitakis, and Ioannis Papaefstathiou. Fast and efficient FPGA-based feature detection employing the SURF algorithm. In Field-Programmable Custom Computing Machines (FCCM), pages 3–10. Institute of Electrical & Electronics Engineers (IEEE), May 2010.

[11] H. Richard Byrd, Charles Jean Gilbert, and Jorge Nocedal. A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1):149–185, 2000.

[12] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. Computer Vision–ECCV 2010, pages 778–792, 2010.

[13] Antonio Canclini, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, Joao Ascenso, and Rodrigo Cilla. Evaluation of low-complexity visual feature detectors and descriptors. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–7, July 2013.

[14] Hua-Yu Chang, Iris Hui-Ru Jiang, H. Peter Hofstee, Damir Jamsek, and Gi-Joon Nam. Feature detection for image analytics via FPGA acceleration. IBM Journal of Research and Development, 59(2/3):8:1–8:10, March 2015.

[15] Shuai Che, Jie Li, Jeremy W. Sheaffer, Kevin Skadron, and John Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Application Specific Processors, 2008. SASP 2008. Symposium on, pages 101–107. Institute of Electrical & Electronics Engineers (IEEE), June 2008.

[16] Deming Chen, Jason Cong, Yiping Fan, and Zhiru Zhang. High-level power estimation and low-power design space exploration for FPGAs. In Proceedings of the 2007 Asia and South Pacific Design Automation Conference, ASP-DAC '07, pages 529–534, Washington, DC, USA, 2007. IEEE Computer Society.

[17] Nico Cornelis and Luc Van Gool. Fast scale invariant feature detection and matching on programmable graphics hardware. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–8. Institute of Electrical & Electronics Engineers (IEEE), June 2008.

[18] NVIDIA Corporation. NVIDIA Visual Profiler. https://developer.nvidia.com/nvidia-visual-profiler.

[19] Piotr Czyzak and Adrezej Jaszkiewicz. Pareto simulated annealing - a meta-heuristic technique for multiple-objective combinatorial optimization. Journal of Multi-Criteria Decision Analysis, 7(1):34–47, 1998.

[20] Joydip Das, Steven J. E. Wilton, Philip Leong, and Wayne Luk. Modeling post-techmapping and post-clustering FPGA circuit depth. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 205–211, Aug. 31 – Sept. 2, 2009.

[21] Vijay Degalahal and Tim Tuan. Methodology for high level estimation of FPGA power consumption. In Proceedings of the 2005 Asia and South Pacific Design Automation Conference, ASP-DAC '05, pages 657–660, New York, NY, USA, 2005. ACM.


[22] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. Leblanc. Design of ion-implanted MOSFET's with very small physical dimensions. Proceedings of the IEEE, 87(4):668–678, April 1999.

[23] Michal Fularz, Marek Kraft, Adam Schmidt, and Andrzej Kasinski. A high performance FPGA based image feature detector and matcher based on the FAST and BRIEF algorithms. International Journal of Advanced Robotic Systems, 12(141), 2015.

[24] Tony Givargis and Frank Vahid. Platune: a tuning framework for system-on-a-chip platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(11):1317–1327, Nov 2002.

[25] Tony Givargis, Frank Vahid, and Jorg Henkel. System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip. In Computer Aided Design, 2001. ICCAD 2001. IEEE/ACM International Conference on, pages 25–30, 2001.

[26] Mark D. Hill. 21st century computer architecture: A community white paper. Technical report, The ACM Special Interest Group on Computer Architecture, 2012.

[27] Ali Irturk, Bridget Benson, Shahnam Mirzaei, and Ryan Kastner. GUSTO: An automatic generation and optimization tool for matrix inversion architectures. ACM Trans. Embed. Comput. Syst., 9:32:1–32:21, April 2010.

[28] Dongsuk Jeon, M.B. Henry, Yejoong Kim, Inhee Lee, Zhengya Zhang, D. Blaauw, and D. Sylvester. An energy efficient full-frame feature extraction accelerator with shift-latch FIFO in 28 nm CMOS. Solid-State Circuits, IEEE Journal of, 49(5):1271–1284, May 2014.

[29] Tianyi Jiang, Xiaoyong Tang, and Prith Banerjee. Macro-models for high level area and power estimation on FPGAs. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI, GLSVLSI '04, pages 162–165, New York, NY, USA, 2004. ACM.

[30] Nasser Kehtarnavaz and Mark Noel Gamadia. Real-time Image and Video Processing: From Research to Reality. Morgan & Claypool Publishers, 2006.

[31] Branislav Kisacanin, Shuvra S. Bhattacharyya, and Sek Chai. Embedded Computer Vision. Springer London, 2009.

[32] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'95, pages 1137–1143, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[33] Marek Kraft, Adam Schmidt, and Andrzej J Kasinski. High-speed image feature detection using FPGA implementation of FAST algorithm. VISAPP (1), 8:174–9, 2008.

[34] Murali E. Krishnan, E. Gangadharan, and Nirmal P. Kumar. H.264 motion estimation and applications. Technical report, InTech, 2012.


[35] A. V. Kulkarni, J. S. Jagtap, and V. K. Harpale. Object recognition with ORB and its implementation on FPGA. International Journal of Advanced Computer Research, 3(3):164–169, 2013.

[36] Tadahiro Kuroda. CMOS design challenges to power wall. In Microprocesses and Nanotechnology Conference, 2001 International, pages 6–7, Oct 2001.

[37] Benjamin C. Lee and David M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. SIGOPS Oper. Syst. Rev., 40:185–194, October 2006.

[38] Kwang-Yeob Lee. A design of an optimized ORB accelerator for real-time feature detection. International Journal of Control & Automation, 7(3), 2014.

[39] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart. BRISK: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2548–2555. IEEE, 2011.

[40] Reoxiang Li, Bing Zeng, and M. L. Liou. A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 4(4):438–442, Aug 1994.

[41] Lei Liang and Yuanchang Xu. Adaptive Landweber method to deblur images. IEEE Signal Processing Letters, 10(5):129–132, May 2003.

[42] Alexander Ling, Dhirendra Pratap Singh, and Stephen D. Brown. FPGA technology mapping: a study of optimality. In Proceedings. 42nd Design Automation Conference, 2005, pages 427–432, June 2005.

[43] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004.

[44] Ondrej Miksik and Krystian Mikolajczyk. Evaluation of local detectors and descriptors for fast feature matching. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2681–2684, Nov 2012.

[45] Akihiko Miyoshi, Charles Lefurgy, Eric Van Hensbergen, Ram Rajamony, and Raj Rajkumar. Critical power slope: Understanding the runtime effects of frequency scaling. In Proceedings of the 16th International Conference on Supercomputing, ICS '02, pages 35–44, New York, NY, USA, 2002. ACM.

[46] Douglas C. Montgomery. Design and Analysis of Experiments. Wiley, 2012.

[47] Gordon E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82–85, Jan 1998.

[48] D. Nagamalai, E. Renault, and M. Dhanuskodi. Advances in Digital Image Processing and Information Technology: First International Conference on Digital Image Processing and Pattern Recognition, DPPR 2011, Tirunelveli, Tamil Nadu, India, September 23-25, 2011, Proceedings. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2011.

[49] Kumud Nepal. New Directions for Design-Space Exploration of Low-Power Hardware Accelerators. PhD thesis, Brown University, 2015.


[50] David Nister, Oleg Naroditsky, and James Bergen. Visual odometry. In Com-puter Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the2004 IEEE Computer Society Conference on, volume 1, pages I–652–I–659 Vol.1,June 2004.

[51] Gianluca Palermo, Cristina Silvano, and Vittorio Zaccaria. Multi-objective de-sign space exploration of embedded systems. J. Embedded Comput., 1(3):305–316, August 2005.

[52] Giuseppe Ascia Vincenzo Catania Maurizi Palesi. A framework for design spaceexploration of parameterized VLSI systems. In Proceedings of the 2002 Asiaand South Pacific Design Automation Conference, ASP-DAC ’02, pages 245–250, Washington, DC, USA, 2002. IEEE Computer Society.

[53] Rajat Phull, Pradip Mainali, Qiong Yang, Patrice Rondao Alface, and Henk Sips. Low complexity corner detector using CUDA for multimedia applications. MMEDIA 2011, 2011.

[54] Ab Al-Hadi Ab Rahman, R. Thavot, M. Mattavelli, and P. Faure. Hardware and software synthesis of image filters from CAL dataflow specification. In Ph.D. Research in Microelectronics and Electronics (PRIME), 2010 Conference on, pages 1–4, July 2010.

[55] Iain E. Richardson. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. Wiley, 2003.

[56] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1508–1515, Oct 2005.

[57] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In Proceedings of the 9th European Conference on Computer Vision - Volume Part I, ECCV ’06, pages 430–443, Berlin, Heidelberg, 2006. Springer-Verlag.

[58] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on, ICCV ’11, pages 2564–2571, Washington, DC, USA, Nov 2011. IEEE Computer Society.

[59] Maria Santamaria and Maria Trujillo. A comparison of block-matching motion estimation algorithms. In Computing Congress (CCC), 2012 7th Colombian, pages 1–6. IEEE, Oct 2012.

[60] Michael Schaeferling and Gundolf Kiefer. Object recognition on a chip: A complete SURF-based system on a single FPGA. In Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, pages 49–54. IEEE, Nov 2011.

[61] Benjamin Carrion Schafer and Kazutoshi Wakabayashi. Machine learning predictive modelling high-level synthesis design space exploration. Computers & Digital Techniques, IET, 6(3):153–159, May 2012.


[62] David Sheldon and Frank Vahid. Making good points: application-specific Pareto-point generation for design space exploration using statistical methods. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’09, pages 123–132, New York, NY, USA, 2009. ACM.

[63] Lee Chee Sing and Ha Yajun. Design space exploration for arbitrary FPGA architectures. In Proceedings of the Second International Conference on Embedded Software and Systems, ICESS ’05, pages 269–275, Washington, DC, USA, 2005. IEEE Computer Society.

[64] Alastair M. Smith, Steven J.E. Wilton, and Joydip Das. Wirelength modeling for homogeneous and heterogeneous FPGA architectural development. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’09, pages 181–190, New York, NY, USA, 2009. ACM.

[65] Stephen M. Smith and J. Michael Brady. SUSAN - a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, 1997.

[66] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, PLDI ’02, pages 165–176, New York, NY, USA, 2002. ACM.

[67] Jan Svab, Tomas Krajnik, Jan Faigl, and Libor Preucil. FPGA based speeded up robust features. In Technologies for Practical Robot Applications, 2009. TePRA 2009. IEEE International Conference on, pages 35–41. IEEE, Nov 2009.

[68] Kuen Hung Tsoi and Wayne Luk. Power profiling and optimization for heterogeneous multi-core systems. SIGARCH Comput. Archit. News, 39:8–13, December 2011.

[69] Onur Ulusel, Kumud Nepal, R. Iris Bahar, and Sherief Reda. Fast design exploration for performance, power and accuracy tradeoffs in FPGA-based accelerators. ACM Trans. Reconfigurable Technol. Syst., 7(1):4:1–4:22, February 2014.

[70] Gooitzen van der Wal, David Zhang, Indu Kandaswamy, James Marakowitz, Kevin Kaighn, Joe Zhang, and Sek Chai. FPGA acceleration for feature based processing applications. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 42–47. IEEE, June 2015.

[71] Rick Weber, Akila Gothandaraman, Robert J. Hinde, and Gregory D. Peterson. Comparing hardware accelerators in scientific applications: A case study. IEEE Transactions on Parallel and Distributed Systems, 22(1):58–68, Jan 2011.

[72] Wikipedia. Pareto efficiency, 2016. [Online; accessed 31-May-2016].


[73] Hongtao Xie, Ke Gao, Yongdong Zhang, Jintao Li, and Yizhi Liu. GPU-based fast scale invariant interest point detector. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 2494–2497, March 2010.

[74] Xilinx. ML605 Hardware User Guide, 2011.

[75] Lu Yuan, Jian Sun, Long Quan, and Heung-Yeung Shum. Image deblurring with blurred/noisy image pairs. ACM Trans. Graph., 26(3), July 2007.

[76] Anatoly A. Zhigljavsky. Theory of Global Random Search, volume 65 of Mathematics and its Applications. Springer Netherlands, 1st edition, 1991.

[77] Ce Zhu, Xiao Lin, and Lap-Pui Chau. Hexagon-based search pattern for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology, 12(5):349–355, May 2002.

[78] Shan Zhu and Kai-Kuang Ma. A new diamond search algorithm for fast block-matching motion estimation. IEEE Transactions on Image Processing, 9(2):287–290, Feb 2000.

[79] Xiang Zhu and Peyman Milanfar. Restoration for weakly blurred and strongly noisy images. In Applications of Computer Vision (WACV), 2011 IEEE Workshop on, pages 103–109. IEEE, Jan 2011.