
Microprocessors and Microsystems 37 (2013) 270–286


An FPGA based high performance optical flow hardware design for computer vision applications

Gokhan Koray Gultekin *, Afsar Saranli
Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey

Article info

Article history: Available online 23 January 2013

Keywords: FPGA; Embedded machine vision; Optical flow; Real time image processing; Horn and Schunck algorithm

0141-9331/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.micpro.2013.01.001

* Corresponding author. Tel.: +90 3122104510. E-mail address: [email protected] (G.K. Gultekin).

Abstract

Optical Flow (OF) information is used in higher level vision tasks in a variety of computer vision applications. However, its use in resource constrained applications such as small-scale mobile robotic platforms is limited because of the high computational complexity involved. The inability to compute the OF vector field in real-time is the main drawback which prevents these applications from efficiently utilizing some successful techniques from the computer vision literature. In this work, we present the design and implementation of a high performance FPGA hardware with a small footprint and low power consumption that computes OF at a speed exceeding real-time performance. A well known OF algorithm by Horn and Schunck is selected for this baseline implementation. A detailed multiple-criteria performance analysis of the proposed hardware is presented with respect to computation speed, resource usage, power consumption and accuracy compared to a PC based floating-point implementation. The implemented hardware computes the OF vector field on 256 × 256 pixels images in 3.89 ms, i.e., 257 fps. Overall, the proposed implementation achieves a superior performance in terms of speed, power consumption and compactness while there is minimal loss of accuracy. We also make the FPGA design source available in full for research and academic use.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Optical flow (OF) is a popular algorithmic tool in machine vision that is used in a variety of applications. These include collision detection, motion segmentation, tracking, background subtraction, visual odometry and video compression among others, and can be used in domains like industrial inspection, robotics and space missions. Some of these domains, including mobile robotics and space applications, have severe power and computational resource constraints. Optical flow computation, like many other vision algorithms, unfortunately suffers from high computational complexity. For a typical camera with sufficient resolution, there is a prohibitive amount of data to be processed in an image sequence, and optical flow computation itself involves complex and time consuming operations. Unlike some applications where offline computation is possible, robotic and space applications generally require the computation of optical flow in real time. Furthermore, optical flow may not stand alone but rather be a pre-processing part of a larger process, hence sharing available computational resources.

These resources traditionally take the form of one or more general purpose CPUs, often full embedded computers. However, these architectures, including digital signal processors (DSPs), are poorly suited to the structure of image processing and vision algorithms. Since they lack suitable special instructions and hardware sub-systems, the demanded computational power generally exceeds the available one [1]. In addition, because of their high clock rates, they consume a large amount of electrical power. This requires a larger power supply and also causes high heat dissipation. Both the extra power requirement and the excess heat can be a significant problem in these resource constrained applications. For example, power autonomy requires the ability to operate from a battery or other on-board power source for an extended amount of time. Excess heat often needs to be expelled from the system. To achieve this, extra cooling equipment may be required, which in turn increases system size and weight. In aerial, space and robotic platforms, the available amount of power, space and weight is limited.

Yet another problem with high clock rate CPUs is that they are more susceptible to radiation in space [2], making lower clock rates more desirable. Hence, for image processing and vision applications on these platforms, general purpose, high end processors and computers are usually not the most effective choices. These weaknesses of widely available computational hardware have made the use of optical flow computation methods on lightweight, power autonomous platforms quite difficult.

There exists a large body of literature on developing new methods for accurate and efficient computation of optical flow. In a number of these studies, the available methods are compared in terms of their accuracy, vector field density and computational complexity [3,4]. Despite the improvements in accuracy and the efforts to come up with numerically efficient methods, the overall computational performance of these algorithms on general purpose CPU architectures is low. Real-time computation of optical flow for a realistic resolution and frame rate video is often not possible.

While improvements in optical flow computation continue to appear, there has recently been a noticeable interest in application specific alternative computational platforms [2]. Since software based computation of image processing algorithms is rather inefficient, in many studies in the literature the workaround is to utilize hardware acceleration [5,6]. Application specific integrated circuits (ASICs) are fully custom hardware designs, usually mass produced to satisfy the needs of a specific application in terms of computational speed, space and power consumption. The hardware can be designed to process large amounts of data in parallelized and/or pipelined structures in a single clock cycle, while general purpose sequential processors require a large number of clock cycles for the same task. ASICs are therefore much more efficient than conventional processors. However, ASIC design and manufacturing is a long and difficult process. Single instances of a design prototype are prohibitively expensive to manufacture and the solution becomes feasible only in mass production. In many applications, a relatively small number of systems are required, hence making ASICs unsuitable. Another drawback of ASICs is that once a chip is manufactured, it is impossible to modify. Any design change is costly and requires re-production and replacement of the existing hardware. Therefore, ASIC designs are only suitable for mature and mass produced products and are not suitable for academic research, prototyping and/or small quantity applications.

To overcome the static structure of ASICs, field programmable gate arrays (FPGAs) were developed. This class of hardware can be re-programmed multiple times while exhibiting performance comparable to ASICs in terms of computational speed, size and power dissipation. This property makes FPGAs flexible platforms which enable modifications to the design in a matter of hours, and even the ability to be re-configured (re-programmed) in the field. These advantages of FPGAs make them an accessible platform for prototyping-intensive academic research and have hence attracted recent interest for hardware based algorithm development. Although FPGAs have been in use since the 1980s, their incorporation into academic studies is relatively new. Some recent publications study the favorable potential of FPGAs in particular for vision research [2].

In the present study, we first review in Section 2 the fundamental algorithmic approaches to computing optical flow as well as the state-of-the-art research in hardware based approaches to optical flow. In Section 3 we provide an overview of the baseline Horn and Schunck optical flow algorithm, which is followed by the requirements analysis and detailed description of our FPGA based hardware design in Section 4. The paper then presents in Section 5 a careful performance evaluation of the proposed design. We conclude in Section 6 by also outlining directions for our future research.

2. Literature survey

Studies in optical flow calculation date back to the 1980s and there are many alternative methods proposed for optical flow computation. They can be grouped as gradient based, correlation based, energy based and phase based methods [4]. Gradient-based methods depend on the evaluation of spatio-temporal derivatives. The earliest two gradient based methods were presented by Horn and Schunck [7] and Lucas and Kanade [8]. [7] presents a method assuming that the optical flow vector field is smooth and introduces a global smoothness term as a second constraint. The method in [8], on the other hand, depends on the assumption that a point's neighboring pixels move with it, implying a locally constant flow. This introduces additional constraint equations and the solution is obtained by least squares estimation. Since gradient based methods became quite popular, many other gradient based methods followed [9–11]. Horn and Schunck's method in particular has relatively low computational complexity compared to many other methods and provides high density optical flow vectors with a reasonable accuracy.

The dominant implementation approach for optical flow methods in the literature is to code and test them on general purpose computers. The first motivation is the wide availability of such hardware and the necessary programming experience. Another motivation is to be able to comparatively evaluate the performance with those in the literature. Although PC hardware configurations vary, they are standardized and it is possible to generate approximate benchmarks. Although optical flow algorithms have improved over time, the PC based implementation performance remained below the requirements of real-time applications. On the other hand, it is known that for many computer vision algorithms, parallelized and pipelined implementations on FPGAs can produce significantly better performance [2]. Despite the need for higher computational performance, the relative difficulty and the specific expertise needed have resulted in few hardware implementation examples in the literature [12–17].

An interesting point should be mentioned concerning the choice of method for a hardware based implementation: although this is not discussed much in the existing body of literature, the performance of a given method on a sequential general purpose computer does not give a clear idea about its performance on parallelized and pipelined architectures, such as ASICs, FPGAs or GPUs. A particular method that does not seem efficient for a PC based implementation may have favorable properties that make it a good candidate for a hardware based implementation. In the present work, we adopt the baseline method of Horn and Schunck for our implementation. It will be demonstrated in the present work that this method has these favorable properties because accuracy can be preserved when computations are performed using a fixed point representation with small word lengths. We also show that this algorithm can be implemented efficiently using a parallel and pipelined architecture, leading to high throughput computation.

FPGA implementation of the Horn and Schunck method is not entirely new. It was first presented in [12], implemented using two Xilinx FPGAs (XC4020E and XC4005H). Their hardware is able to process 19 fps for 50 × 50 pixel image sequences. Another implementation from the same authors uses the Camus correlation method [18], yielding a performance of 25 fps on 100 × 100 pixels images on an ALTERA EPF10K50RC240-3 FPGA operating at 15 MHz, where the computation performance of the same method on a PC could only achieve 6 fps [13]. This corresponds to a more than 4× speed-up with respect to the PC performance. They also implement a method presented in [19]. None of their publications give quantitative measures on the accuracy or make comparisons with other studies. Later, they presented studies on utilizing their implementations in real-time applications. One such example is lane departure detection [20]. Another hardware implementation of the Horn and Schunck method is presented in [21]. The study claims processing of 256 × 256 pixels images at 60 fps on an ALTERA EP20K300EQC240-2 FPGA. There is however no information, discussion or comparison on the power consumption, error rates and accuracy of the implemented system. A recent work is presented in [22] which uses the census transformation. They claim a processing speed of 45 fps on images with 640 × 480 resolution. They implement the system using a Xilinx XC2VP30 FPGA with two embedded PowerPC processor cores. The FPGA hardware operates at 100 MHz and the PowerPC operates at 300 MHz. They also report the power consumption of their hardware as 10 W. The comparison is made with a Core 2 Duo 1.86 GHz PC. Their hardware operates at an 18× lower clock frequency and achieves a 2× speed-up with 6× lower power consumption. Another well known gradient based optical flow method, presented by Lucas and Kanade, is implemented in [23] using a Xilinx XC2V1000-4 FPGA. Their design processes 2857 Kpps (30 fps of 340 × 280 pixel images) at a clock speed of 40 MHz. The accuracy of their hardware implementation is only 2.48× worse than the 64 bit software implementation. In [24], the same optical flow method is implemented on a relatively new FPGA board, the Xilinx Virtex II XUPV2P. The presented hardware can process the Yosemite sequence (316 × 256 pixels) in 8 ms, which corresponds to 125 fps at a maximum frequency of 55 MHz. An implementation of a tensor based optical flow method is presented in [25]. Their pipelined hardware design operates at 100 MHz and can process 640 × 480 pixels images at 64 fps. Their chosen FPGA board is the same as the one used in [24], the Xilinx XUPV2P. They report an overall average angular error (AAE) of 12.9° for the computed optical flow field. However, the contribution of the designed hardware to the overall error is not mentioned.

There are also alternative computational platforms other than FPGAs that can be used for high performance optical flow computation. GPUs (Graphical Processing Units) are one of these platforms. There are a few studies in the literature that report the performance of optical flow computation on GPUs. One such study is given in [26]. A tensor based method is implemented and a 2.8 times speed-up compared to a Pentium 4 2.8 GHz PC implementation is achieved. However, there is no discussion on the accuracy of the computed field and the platform consumes significant power. The only GPU implementation of Horn and Schunck's method is presented in [27]. They use a multiresolution variant with two levels. The computation of 316 × 252 images is achieved at 2 fps in multi-scale and 333 fps in single-scale. In [28], there is a comparison between an FPGA and a GPU implementation of optical flow computation. They use a tensor based optical flow method and claim a processing speed of 320 × 240 images at 538 fps. They discuss the advantages and drawbacks of GPUs over FPGAs and point out the complexity of the design process for FPGAs. They claim that FPGAs require 12 times more development time. However, GPUs consume significantly more power than FPGAs and require a host PC for operation. FPGAs, on the other hand, can be used as a stand-alone platform. These requirements of GPUs make them infeasible for applications with severe power and space constraints such as small scale mobile robotics.

To the best of our knowledge, there is no work discussing the accuracy, power consumption and logic resource usage of Horn and Schunck's method implemented on an FPGA. Studies on this method concentrate mainly on the computation speed. Another important aspect of an FPGA implementation that is not discussed in the existing literature is the speed-accuracy trade-off resulting from the use of fixed-point arithmetic to implement optical flow in FPGA hardware. There are usually major speed disadvantages to using floating point operations on FPGAs [2], while using fixed-point decreases the accuracy of the computations. We strongly believe that speed, power consumption and accuracy are all key performance parameters to assess the success of an implementation.

The primary contributions of the present paper are twofold. Firstly, we present a high performance reference FPGA design for an important vision algorithm from the literature. We make the full hardware design source available for interested academic researchers to enable further studies and comparative evaluation, as explained in Section 7. This is a low cost, low power design which achieves over real-time performance for computing the optical flow vector field. We believe sharing the source of the design with the academic community is important, not only to expose the design considerations involved but also because such a design, to the best of our knowledge, is not available from any other non-commercial source. As a second contribution, we provide a careful characterization of the implementation with respect to its accuracy, FPGA resource utilization, power consumption and computation speed.

In the following section, we continue with a brief outline of optical flow computation in general and the Horn and Schunck method in particular.

3. Optical flow computation

Optical flow is defined as the distribution of apparent velocities of brightness patterns in an image [7]. This apparent velocity is induced by the relative motion of the objects in the scene with respect to the observer (usually a camera), or by the motion of the camera itself. Optical flow is a vector field and there are various methods for its computation. Among those, differential methods are relatively popular. They depend on the use of spatio-temporal intensity gradients. They have lower computational complexity as compared to other methods and provide a high density field with a reasonable precision. They are also suitable for FPGA based parallel hardware implementations.

A significant approach belonging to the category of differential methods is the Horn and Schunck optical flow computation algorithm [7]. Horn and Schunck's method depends on the optical flow constraint equation and the assumption of smoothness of the optical flow vector field [7]. The optical flow constraint equation is derived based on the assumption of brightness constancy for pixels over time. For a given image pixel with brightness E(x,y) located at coordinates (x,y), this can be expressed as

E_x u + E_y v + E_t = 0,   (1)

where u = dx/dt and v = dy/dt are the scalar horizontal and vertical components of the optical flow vectors respectively [7], and E_x, E_y and E_t are the partial derivatives of the intensity with respect to all of its variables. The subscript based shortcut notation for partial derivatives is for notational clarity and will be used throughout the rest of the paper.

It can be seen from the above equation that there are two unknown optical flow components u and v in a single linear equation. This is an ill-posed problem and this equation by itself is not sufficient to find a unique solution for the optical flow vectors. To make the problem well posed and generate a unique solution, Horn and Schunck introduce an additional smoothness constraint equation. This second constraint is developed based on the observation that often, in typical video sequences, the neighbors of a given image pixel are subject to a similar motion as the pixel itself. This behavior results in a smoothly varying vector field and the smoothness is represented by this constraint [7]. The constraint term is obtained by defining a suitable cost function. The first component of the cost represents the amount of local change in the image as a function of image position (x,y). It is computed from the squared magnitudes of the gradients of the optical flow components and is given by

E_s^2 = u_x^2 + u_y^2 + v_x^2 + v_y^2.   (2)

Another component of the cost function is about the brightness being constant over the sequence of image frames. This is given by

E_b = E_x u + E_y v + E_t.   (3)


Given (2) and (3), the problem can be formulated as a minimization of a combined cost function computed over the entire image frame as

E^2 = \iint \left( E_b^2 + \alpha^2 E_s^2 \right) dx \, dy.   (4)

Here, α is a weight term used to adjust the relative contribution of the two terms in the cost function. It is possible to adjust the smoothness of the computed optical flow field by adjusting α. Increasing the value of this parameter will increase the weight of the smoothness term in the cost function and therefore the algorithm will yield a smoother optical flow field. This parameter plays a significant role only in areas where the brightness gradient is small, preventing haphazard adjustments to the estimated flow velocity occasioned by noise in the estimated derivatives [7]. This formulation accepts that during the minimization, the brightness change will be minimized but may not become exactly zero. This is not a problem because in practical images the brightness values may slightly change due to image noise or quantization errors.

The (u,v) pair that minimizes the cost function given in (4) is the computed optical flow vector field. The problem can be reformulated as in (5) using the calculus of variations.

E_x^2 u + E_x E_y v = \alpha^2 \nabla^2 u - E_x E_t
E_x E_y u + E_y^2 v = \alpha^2 \nabla^2 v - E_y E_t   (5)

The Laplacians of u and v can be approximated using the expressions given in (6), where \bar{u} and \bar{v} correspond to local averages of u and v respectively.

\nabla^2 u = (\bar{u}_{i,j,k} - u_{i,j,k})
\nabla^2 v = (\bar{v}_{i,j,k} - v_{i,j,k})   (6)

Replacing the Laplacians of u and v with the approximations given in (6), the equations in (5) can be rewritten as in (7).

(\alpha^2 + E_x^2) u + E_x E_y v = \alpha^2 \bar{u} - E_x E_t
E_x E_y u + (\alpha^2 + E_y^2) v = \alpha^2 \bar{v} - E_y E_t   (7)

Rearranging the terms in (7) yields

(\alpha^2 + E_x^2 + E_y^2) u = (\alpha^2 + E_y^2) \bar{u} - E_x E_y \bar{v} - E_x E_t
(\alpha^2 + E_x^2 + E_y^2) v = -E_x E_y \bar{u} + (\alpha^2 + E_x^2) \bar{v} - E_y E_t   (8)

Solving for u and v from the equation set in (8), the optical flow vectors are obtained as in (9).

Fig. 1. (a) Numerical computation of E_x, E_y and E_t using first order differences of eight pixels. The i, j, k are the row, column and frame numbers respectively. (b) 3 × 3 weight matrix for estimating local averages \bar{u} and \bar{v} of optical flow vectors.

u = \frac{(\alpha^2 + E_y^2)\,\bar{u} - E_x E_y \bar{v} - E_x E_t}{\alpha^2 + E_x^2 + E_y^2},
v = \frac{-E_x E_y \bar{u} + (\alpha^2 + E_x^2)\,\bar{v} - E_y E_t}{\alpha^2 + E_x^2 + E_y^2}.   (9)

In practice, the solution is obtained numerically using the iterative Gauss–Seidel method, which is explained in detail in [29]. At each iteration, we have the new vector field estimate

u^{n+1} = \bar{u}^n - E_x \, \frac{E_x \bar{u}^n + E_y \bar{v}^n + E_t}{\alpha^2 + E_x^2 + E_y^2},   (10)

v^{n+1} = \bar{v}^n - E_y \, \frac{E_x \bar{u}^n + E_y \bar{v}^n + E_t}{\alpha^2 + E_x^2 + E_y^2},   (11)

where n represents the iteration number. To perform the computations in (10) and (11), the gradients and the Laplacian must be estimated numerically. This is possible in a number of different ways. For gradient computation, Horn and Schunck [7] utilize the first order differences of the eight pixel values given in Fig. 1a. The Laplacian is estimated by subtracting the value at a point from a weighted average of the values at neighboring points. They approximate the Laplacians of the field vectors by using the 3 × 3 mask given in Fig. 1b for the computation of local averages of the optical flow vectors. The i, j, k subscripts represent the row, column and frame numbers respectively.

Using the mask given in Fig. 1a, the computation of E_x can be obtained as in (12). Using the corresponding masks given in Fig. 1, the other image gradients E_y, E_t and the optical flow vector averages \bar{u}, \bar{v} can easily be derived similarly to the equation given in (12).

E_x = \frac{1}{4}\left(E_{i,j+1,k-1} - E_{i,j,k-1} + E_{i+1,j+1,k-1} - E_{i+1,j,k-1} + E_{i,j+1,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i+1,j,k}\right)   (12)
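For reference, the following is a minimal floating point NumPy sketch of the computations in (10)–(12). This is our own illustration, not the FPGA design: the function names, the α² value, the iteration count, and the border handling (np.roll wraps around; convolve reflects) are arbitrary choices made here for brevity.

```python
import numpy as np
from scipy.ndimage import convolve

def gradients(f1, f2):
    """Spatio-temporal gradients from first order differences of eight
    pixel values (Eq. (12) and Fig. 1a). f1, f2: consecutive grayscale
    frames as float arrays; s(a, di, dj) reads pixel (i+di, j+dj)."""
    s = lambda a, di, dj: np.roll(np.roll(a, -di, axis=0), -dj, axis=1)
    Ex = 0.25 * ((s(f1, 0, 1) - f1) + (s(f1, 1, 1) - s(f1, 1, 0))
               + (s(f2, 0, 1) - f2) + (s(f2, 1, 1) - s(f2, 1, 0)))
    Ey = 0.25 * ((s(f1, 1, 0) - f1) + (s(f1, 1, 1) - s(f1, 0, 1))
               + (s(f2, 1, 0) - f2) + (s(f2, 1, 1) - s(f2, 0, 1)))
    Et = 0.25 * ((f2 - f1) + (s(f2, 0, 1) - s(f1, 0, 1))
               + (s(f2, 1, 0) - s(f1, 1, 0)) + (s(f2, 1, 1) - s(f1, 1, 1)))
    return Ex, Ey, Et

# Weighted averaging mask of Fig. 1b: 1/6 on edges, 1/12 on corners.
MASK = np.array([[1/12, 1/6, 1/12],
                 [1/6,  0.0, 1/6 ],
                 [1/12, 1/6, 1/12]])

def horn_schunck(f1, f2, alpha2=100.0, n_iter=100):
    """Iterate Eqs. (10) and (11) starting from a zero flow field."""
    Ex, Ey, Et = gradients(f1.astype(float), f2.astype(float))
    u = np.zeros_like(Ex)
    v = np.zeros_like(Ex)
    denom = alpha2 + Ex**2 + Ey**2
    for _ in range(n_iter):
        ubar = convolve(u, MASK)          # local averages of u and v
        vbar = convolve(v, MASK)
        common = (Ex * ubar + Ey * vbar + Et) / denom
        u = ubar - Ex * common
        v = vbar - Ey * common
    return u, v
```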

4. Proposed FPGA hardware design for optical flow

4.1. Requirements of the FPGA hardware platform

To be able to implement a digital hardware based vision algorithm, an appropriate FPGA with a complementary development board is required. This board provides the chosen FPGA chip as well as the peripheral hardware device infrastructure with sufficient resources.


Fig. 2. High level block diagram of the designed FPGA based OF hardware.


Such hardware is often tailored to specific application areas with specific peripheral and resource choices.

For our mobile robotic visual processing application domain, a suitable board should have low power consumption and moderate performance. It should preferably have enough extra resources to allow future improvements and modifications of the design.

To properly choose the FPGA and the associated board, we need to consider the required amount of computational resources (represented by FPGA logic elements, LEs), memory resources (internal and external) and external communication (interface) options.

Each FPGA has a different number of LEs that are used to implement user logic. The available amount often determines the complexity of the design that can be realized by the hardware and is therefore a main design choice. It also has a linear impact on the power consumption that should be taken into account.

The FPGAs also vary in the amount of internal memory that they provide, and the associated board similarly may contain a certain amount of external memory in the form of Static (SRAM) and Dynamic (DRAM) memory chips. The choice between using internal and external memory resources is made according to the required capacity and speed. Capacities of external memory devices are much larger than what is available internally, but they have a considerably lower bandwidth and slower access times. As is common in the computer vision domain, the implementation of the optical flow algorithm is data intensive and requires significant storage. For the computation to begin, we require two entire image frames to be stored in memory. The amount of data prevents it from being stored in internal memory blocks. Therefore, external memory should be utilized despite the overall lower bandwidth. In terms of performance, SRAMs are faster but have lower capacity compared to DRAMs, which are also harder to interface to. To prevent a memory access bottleneck and unnecessary design complexity, we selected SRAM as our primary external frame buffer.

External interfaces are required to send and receive image and optical flow data between the PC and the FPGA board. A standardized communication interface should preferably be available on the development board. Since our focus is on the optical flow computation algorithm, for simplicity, we preferred the RS232 interface. Despite the very low communication bandwidth, this choice conveniently allowed us to perform our performance tests while focusing on our computational hardware design. In practice, our design would be used with a parallel camera interface and a higher bandwidth PC interface such as Ethernet or USB.

In light of the above considerations, we have chosen a low cost, low power EP2C70 Cyclone-2 FPGA chip from Altera on a DE2-70 board. This board is also attractive because it is widely used in a number of universities around the world for both education and research. This allows our design to be comparatively evaluated by other researchers.

4.2. High level design

We will now describe our design, which follows a top down design methodology, in some detail. The high level block diagram of the hardware modules of our designed system is shown in Fig. 2.

The source image frames are stored on the PC and transferred to the board through an RS232 interface. On the FPGA side, the communication protocol is handled by an RS232 controller module. The data received from the PC is then transferred into the correct memory locations of the SSRAM by the RS232 to SSRAM module. The SSRAM read/write operations are handled by the SSRAM controller module. The source data for computing OF vectors is read from the appropriate locations of the SSRAM by the Direct Memory Access (DMA) module and stored temporarily in its FIFO line buffers. These buffers increase performance by reducing memory access instances.

The Gradient and Average Computer (GAC) module reads the required data from the DMA buffers and computes the spatio-temporal gradients of the image and the local averages of the OF vectors. The output is used by the Optical Flow Computer (OFC) module to produce the OF vectors, which are then written back to the SSRAM through the DMA module. Finally, the results are read back from the SSRAM by the SSRAM to RS232 module and written to the RS232 controller module to be sent to the PC. Since our current focus is on the actual optical flow computation, we did not implement a high speed interface between the FPGA and the experiment PC. We also do not have an FPGA-camera interface.

The system is designed using multiple clock domains and in a parallel and pipelined structure, which increases the system throughput dramatically. The memory interface operates at a higher clock rate than the computation modules. This helps to overcome the memory bottleneck in the design.

Each of the individual sub-blocks is implemented following the iterative sequence of design, implementation and test stages. After satisfactorily testing each of the individual design blocks for functional and timing requirements, they are connected and integrated in accordance with the data flow sequence given in Fig. 3.

The performance of the design is proportional to the clock frequency it operates at. The limiting factor for the clock rate is the highest delay between registers, which lies on the critical path. To decrease the limiting effect of the critical path, the parts of the design with lower delay are clocked at higher rates and paths with higher delay are clocked at lower rates. This technique is called multiple clock domain design [30].

Our design operates using two different clocks of 50 MHz and 200 MHz. The operating frequencies of the individual design modules are given in Table 1. The main clock source of the system is the external 50 MHz crystal oscillator. The 50 MHz and 200 MHz clocks are generated internally by a phase locked loop (PLL) circuit from this input clock. Signal transmission between the two clock domains is handled by using dual clock FIFO buffers and synchronization stages to prevent timing problems.

4.3. Direct Memory Access (DMA) module

As illustrated in Fig. 1, the computation of one OF vector for the k'th frame requires an input data packet composed of three parts: the first two parts of the data are the four neighboring pixel intensity values from the current (k'th) and previous ((k − 1)'th) image frames. The final part is the set of eight neighboring flow vectors from the previous iteration (i.e., for the (k − 1)'th frame).


Fig. 3. Data flow diagram of the design.

Table 1. Operating clock frequencies of modules. 200 MHz is used for the modules that access the SSRAM memory to reduce the memory bottleneck.

Module                          Clk. Freq.
Optical flow computer           50 MHz
Gradient and average computer   50 MHz
RS232 controller                50 MHz
SSRAM to RS232                  50 MHz
RS232 to SSRAM                  50 MHz
SSRAM controller                200 MHz
SSRAM DMA                       200 MHz


In the first iteration of our design, we used a monolithic approach: the data packet was read, the gradients and OF vectors computed, and the results written back to the SSRAM all by the same module. The drawback of this design approach was its low throughput. The reason is that the OFC module has blocks with mathematical operations that introduce long delays. It can therefore operate at most at 50 MHz, whereas the SSRAM read/write operations are much faster and can operate at 200 MHz. A monolithic design results in the whole module operating at the lowest common clock of 50 MHz, hence making it impossible for any sub-block to operate at a higher clock rate. We propose the current design to overcome this difficulty and increase throughput.

The input/output structure of the proposed DMA module design is given in Fig. 4. The module operates at 200 MHz and fetches the required data in the order required for processing. Data is stored in dual clock FIFO buffers to be read by the gradient and the OFC modules. When the OFC module computes a flow vector, it writes the result to the write FIFO buffer. The DMA module reads back these results from the write FIFO and writes them to the predetermined address location in the SSRAM. The layout of the data in the SSRAM is given in Fig. 5.

The DMA module has five read ports and one write port, each with a dedicated FIFO buffer. Two read ports are used to read pixels from two consecutive lines of the image frames F1 and F2. The other three read ports are used to read OF vectors from three consecutive lines of the OF vector field. Fig. 6 illustrates the pixel FIFO buffers containing the pixel intensity data to be processed. More specifically, we have two read FIFO buffers: FIFO1 (256 × 32 bits, corresponding to 2 × 256 pixels from F1 and 2 × 256 from F2) and FIFO2 (128 × 32 bits, corresponding to 256 pixels from F1 and 256 from F2). When pixel data reading starts, the first half of FIFO1 is filled with pixel data from the first row of the image, while the second half of FIFO1 together with FIFO2 is filled with pixel data from the second line of the image. This data duplication in the buffer is performed due to the one row data overlap between OF computations for successive image rows, which would otherwise result in an unnecessary SSRAM read duplication for the next row of data. Due to the FIFO operation, when we finish computing the first row of OF vectors, the second row image pixel data is already in the first half of FIFO1. Therefore, for all subsequent rows, we just need to read the next row's pixel data into FIFO2 and the second half of FIFO1. Recall that all this is done for data from both the F1 and F2 frames concurrently.

To reduce the memory accesses, we only read pixels from the SSRAM for the first read FIFO. When the pixels from the second line of the image are read from the SSRAM, they are also copied to the end of the first FIFO. However, with this method, the pixels required to compute the first OF vector are available only after the whole first line of the image has been read from memory. Therefore, the gradient and vector average computation module must start with a latency equal to the time required to read one row of the two frames from the memory. This phase difference also requires the length of the first read buffer to be one row more than the second read buffer.
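As an illustration of this row reuse, the following small Python sketch (ours, not the RTL) models the pixel-buffer bookkeeping and shows that each image row is fetched from SSRAM exactly once, even though consecutive OF rows overlap by one image row:

```python
from collections import deque

def simulate_pixel_buffers(num_rows=256):
    """Model of the two pixel read FIFOs: FIFO1 holds two image rows
    (previous and current), FIFO2 holds the current row only. A row
    fetched from SSRAM is written to both buffers, so the one-row
    overlap between successive OF rows never causes a second fetch."""
    fetches = 1                       # row 0 is read into FIFO1 only
    fifo1 = deque([0], maxlen=2)
    fifo2 = deque(maxlen=1)
    for row in range(1, num_rows):
        fifo1.append(row)             # a single fetch feeds both buffers
        fifo2.append(row)
        fetches += 1
        # OF row (row - 1) is computed from the row pair now in FIFO1.
        assert list(fifo1) == [row - 1, row]
    return fetches

print(simulate_pixel_buffers())       # 256 fetches for 255 OF rows
```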

The same idea used for the pixel read buffers is also applied to the OF vector read buffers. However, to compute their average, eight flow vectors from three consecutive rows of the vector field are required. This time, there are three FIFO buffers to handle the two rows of data overlap between successive OF vector rows. As illustrated in Fig. 7, the FIFO1 buffer is long enough to hold three rows of OF vectors, FIFO2 can hold two rows and FIFO3 can hold one row of data.

The DMA module consists mainly of a finite state machine (FSM) operating at 200 MHz and the read/write dual clock FIFO buffers. The FSM serves the FIFO buffers using round robin scheduling. Since read operations outnumber write operations, the read buffers have higher priority than the write buffer.
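One possible reading of this arbitration policy, sketched in Python for clarity (the class and method names are ours; the actual FSM behavior is described in the following paragraphs):

```python
class DmaArbiter:
    """Round robin over the five read ports, with the single write
    port served only when no read port needs service, reflecting the
    read-over-write priority described above."""
    def __init__(self, n_read=5):
        self.n = n_read
        self.ptr = 0                          # rotating pointer

    def next_port(self, read_req, write_req):
        for k in range(self.n):
            i = (self.ptr + k) % self.n
            if read_req[i]:                   # read FIFO i wants data
                self.ptr = (i + 1) % self.n
                return f"read{i}"
        return "write" if write_req else None

arb = DmaArbiter()
print(arb.next_port([False, True, False, True, False], True))   # read1
print(arb.next_port([False, False, False, True, False], True))  # read3
print(arb.next_port([False] * 5, True))                         # write
```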


Fig. 4. Design module terminals. (a) SSRAM DMA module, (b) gradient and average computer module, (c) SSRAM to RS232 module, (d) RS232 to SSRAM module, and (e) optical flow computer module.

Fig. 5. Layout of data in SSRAM for two frames of 256 × 256 pixels. Image frames and OF vectors are stored in SSRAM starting from address locations 0x00000 and 0x40000 respectively.


The DMA FSM is in the idle state until the trigger signal arrives. After being triggered, the FSM controls all data read/write operations that are required for the optical flow computation. To read the input data, the FSM puts out the read address of the pixel data for FIFO1 and checks whether FIFO1 is full or not. If not full, it issues the read command to the SSRAM controller. After three wait cycles, the data becomes ready on the write port of FIFO1 and is latched by a signal from the FSM. For row indexes larger than one, the write command for FIFO2 is also issued. The same procedure is applied to the other read FIFOs. After serving the read FIFOs, the FSM checks whether the write FIFO buffer is empty. If the write FIFO is not empty, the FSM puts out the data and the SSRAM address for the data to be written and issues a write command to the SSRAM controller. After one cycle the write operation completes. Then the FSM either continues to serve the read FIFOs or switches to the idle state. If the address of the last data written is equal to the last element of the OF vector field, then the FSM stays in the idle state, else it continues to serve the FIFO buffers.

The FIFO buffers operate in show-ahead mode. In this mode, the data at the front of the buffer is continuously held on the read data port and can be read at any time without a prior read signal. After reading the data, the read command should be issued. One clock cycle after the read command, the current data is removed and the next data in the FIFO is put on the read data port. Before issuing the read command, the empty flag must also be checked to be low.

4.4. Spatio-temporal gradient and local average of optical flow vectors

The optical flow algorithm requires, as an initial stage, the computation of the spatio-temporal image gradients and the local averages of the OF vectors, which then constitute the input for the OF vector field computation. Horn and Schunck's algorithm approximates the numerical computation of these quantities using multiply-accumulate operations with the coefficient masks given in Fig. 1.


Fig. 6. Layout of pixel FIFO buffers. Each location is 32 bits in width and stores two consecutive pixels from frame F1 and two consecutive pixels from frame F2. The subscripts represent the pixel row and column indexes.

Fig. 7. Layout of OF vectors in FIFO buffers to handle two lines of OF vector data overlap between successive row computations. The subscripts represent the vector row and column indexes.


Each pixel value used in the computations is a grayscale intensity represented as an unsigned 8 bit number. This data is read from the FIFO buffers of the DMA module that are connected to the iPIX_DATA1 and iPIX_DATA2 ports. Derivative computation requires the summation of eight pixel values read from the pixel buffers, which is then to be divided by four. Pixels corresponding to negative mask coefficients are converted to signed two's complement format and summed. The result of this summation before division requires 11 bits to prevent overflow. The subsequent division by four can easily be done with a 2 bit arithmetic right shift. To prevent loss of precision in the result of this division, we represent the spatio-temporal gradient result in an 11 bit fixed point format with 2 fraction bits. In this case, the division amounts to shifting the fraction point to the left, i.e., so that it ends up between the second and third bits.
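The following Python fragment (our illustration; the function and variable names are ours) shows why this costs no precision: the 11 bit sum is kept unchanged and only the interpretation of its two least significant bits changes.

```python
def fixed_point_gradient(pos, neg):
    """Gradient per Eq. (12): pos and neg hold the four pixel values
    (0..255 each) with +1 and -1 mask coefficients respectively. The
    signed sum fits in 11 bits (range -1020..1020); dividing by 4 is
    just declaring the two LSBs to be fraction bits."""
    FRAC = 2
    raw = sum(pos) - sum(neg)        # 11 bit signed word, unshifted
    return raw, raw / 2**FRAC        # same word, value = raw / 4

raw, value = fixed_point_gradient([200, 10, 30, 40], [100, 5, 20, 35])
print(raw, value)                    # 120 30.0
```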

The second stage executed by this module is the calculation of the optical flow vector averages. Again, the necessary mask for the approximate computation of these local averages is given in Fig. 1. The computation of these quantities in the FPGA requires divide by 6 and divide by 12 operations. However, an FPGA implementation of a general purpose division operation is slow and requires high resource usage. One efficient approach is to use a sequence of multiplication, shifting and addition operations to implement division by a specific constant. Another popular approach is to approximate the division operation using methods such as a look-up table or division by powers of two. We elected the latter approach for our implementation. Our results show that the division approximation error is far less than the error introduced by the fixed point implementation. We hence modified Horn and Schunck's numerical approximation for the local averages of the OF vectors by replacing all mask coefficients given in Fig. 1b with 1/8 terms, corresponding to a divide by 8 operation. In the FPGA this is a simple 3 bit arithmetic right shift. The summation of the product terms is again done using two's complement signed representation. The resulting OF vector averages are represented in an 11 bit word length fixed point format with 3 fraction bits, again to preserve precision.
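A quick NumPy experiment of the kind that motivates this choice (a sketch with our own synthetic test field, not the authors' evaluation) compares the original Fig. 1b mask against the all-1/8 hardware variant:

```python
import numpy as np
from scipy.ndimage import convolve

# Original averaging mask of Fig. 1b: 1/6 on edges, 1/12 on corners.
mask_hs = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6 ],
                    [1/12, 1/6, 1/12]])

# Hardware variant: every neighbor weighted 1/8, i.e. sum the eight
# neighbors and apply a 3 bit arithmetic right shift. Both masks sum
# to one, and both return the exact center value on a linear field.
mask_hw = np.full((3, 3), 1/8)
mask_hw[1, 1] = 0.0

y, x = np.mgrid[0:64, 0:64]
u = 0.02 * x + 0.01 * y              # smooth synthetic flow component
err = np.abs(convolve(u, mask_hs) - convolve(u, mask_hw))
print(err.max())  # zero in the interior for this linear field;
                  # only tiny boundary-reflection effects remain
```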


All computations and data flow in the FPGA are managed by a dedicated FSM. After a reset, the FSM first initializes registers such as the pixel counters, read/write address counters, state variables, and the data valid and control registers. It stays in the idle state until the receipt of a trigger signal. The trigger signal is generated by another module called the Trigger Delay Module. This module generates trigger signals with a predetermined timing to start all modules in a specific order. For example, the DMA module should be triggered earlier than the gradient and vector average computation module in order to fill the data buffers in advance, so that the gradient and vector average computation module can start operating.

Prior to the computation, the buffers are checked for being empty and the required data is captured from the appropriate buffers. Computations corresponding to the first column of each row require eight pixels for the gradient and nine OF vectors for the vector average to be read from the buffers. The other columns partially reuse the previously fetched data and require much less: for the remaining columns, four new pixel values are fetched for the gradient and three new vector values are fetched for the OF vector averages. This flow is regulated by the FSM using data counters. The result is then put on the output ports with the appropriate data valid signals and becomes available for the OFC module.

4.5. Optical flow computation

The final stage of the optical flow computation is carried out in the OFC module, whose I/O ports are illustrated in Fig. 4. This module implements Eqs. (10) and (11).

The OFC module is designed with a high throughput pipelined architecture that operates at a 50 MHz clock. The spatio-temporal gradients and local OF vector averages are the inputs to the pipeline. These signals are subjected to the operations illustrated in Fig. 8 during their flow through the pipeline. The data valid signals corresponding to these inputs also flow within the pipeline stages in accordance with the time required for each stage. We also have delay registers in each stage where a signal is passed to the next stage without any computation being performed. These are not shown in order not to unnecessarily clutter the flow diagram.

Fig. 8. Optical Flow Computer (OFC) module data flow diagram.

The OFC pipeline consists of 15 stages. The first stage is a buffer only stage where registers are used to prevent data input line glitches and guarantee the stable operation of the pipeline. At the next stage, the squares of the spatial gradients E_x^2 and E_y^2 are computed in parallel with the multiplications E_x \bar{u} and E_y \bar{v}. The third stage corresponds to the parallel summations E_x \bar{u} + E_y \bar{v} + E_t and \alpha^2 + E_x^2 + E_y^2. In the implementation, instead of squaring \alpha in hardware, the \alpha^2 parameter is set directly as a value. \alpha^2 is represented in hardware in a 22 bit fixed point format including 4 fraction bits. Therefore, \alpha^2 can be set within the range 0–262,144 with 0.0625 resolution. At the fourth stage, the sums from the previous stage are multiplied with the operands E_x and E_y respectively. The following 10 stages implement two parallel division operations which compute the expressions given in (13) and (14).

U_1 = \frac{E_x \, (E_x \bar{u} + E_y \bar{v} + E_t)}{\alpha^2 + E_x^2 + E_y^2}   (13)

U_2 = \frac{E_y \, (E_x \bar{u} + E_y \bar{v} + E_t)}{\alpha^2 + E_x^2 + E_y^2}   (14)

As discussed in Section 4.4, the division operation is costly in terms of the delay it introduces to the circuit. The 32 bit division operation can only work at a maximum frequency of 9 MHz, hence the division cannot be implemented as a single pipeline stage at 50 MHz. We implement this division as a sequence of 10 pipeline stages (stages 5 to 14) that fit into our 50 MHz pipeline. Finally, at the 15th stage, the OF vectors u and v are computed by two parallel subtraction operations as u = \bar{u} - U_1 and v = \bar{v} - U_2. The output data valid signal is set high to indicate stable result vector values. The computed vectors are written to the write FIFO buffer and then written back to the OF vector result locations in the SSRAM by the DMA module.

Throughout the computations in this module, the word lengths are readjusted at every operation to preserve accuracy. The division operation at the end is computed using a 32 bit numerator and a 24 bit denominator. However, the result has a finite fraction and it stores the true error-free result only if the remainder is zero. If the remainder is nonzero, the result is rounded to the nearest fixed point number.
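In Python terms, the α² quantization and the remainder-based rounding can be sketched as follows (our illustration for nonnegative operands; the constant and function names are ours):

```python
FRAC = 4                     # fraction bits of alpha^2 and the OF result
WIDTH_A2 = 22                # total word length of the alpha^2 register

def encode_alpha2(a2):
    """Quantize a desired alpha^2 value into the 22 bit register:
    resolution 1/16 = 0.0625, maximum just below 2**18 = 262,144."""
    word = round(a2 * 2**FRAC)
    assert 0 <= word < 2**WIDTH_A2
    return word

def divide_round_nearest(num, den):
    """Fixed point quotient with FRAC fraction bits, rounded to the
    nearest representable value using the division remainder."""
    q, r = divmod(num * 2**FRAC, den)
    if 2 * r >= den:
        q += 1
    return q / 2**FRAC

print(encode_alpha2(100.0))          # 1600
print(divide_round_nearest(10, 3))   # 3.3125, nearest 1/16 to 3.333...
```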

4.6. PC communication

Our FPGA based OF vector field computation hardware is tested on a number of well known image sequences available in the research literature.


To enable testing, image frames should be sent to the FPGA to be stored in the SSRAM, while OF results from the SSRAM need to be sent back to the computer to be analyzed for performance. Therefore, a two way communication link between an experiment PC and the FPGA is required. On the selected FPGA development board, RS232, USB and Ethernet connectors are available. Ethernet and USB provide high speed communication capability with 100 Mb/s and 480 Mb/s theoretical limits. However, their implementations require corresponding hardware controllers to be designed within the FPGA, hence needing considerable FPGA design effort.

RS232 can provide a communication bandwidth of 920 kb/s, which is much lower. However, the implementation effort is minimal, with a free-of-charge controller module available from the FPGA manufacturer. Considering the fact that our focus is not on the design of a high speed PC interface but rather on the performance of the OF computer, we elected to use the RS232 interface to communicate with the experiment PC. Although the amount of data to be transferred between the PC and the FPGA is large, the transfer can be done off-line for our testing purposes. One should however note that a fully on-line use of our design, depending on the application requirements, still requires a high speed camera interface or a PC interface or both. Nevertheless, both of these are rather standard applications for which designs are readily available.

4.6.1. UART controller
The RS232 UART controller used in the design is provided by Altera. The controller signal interface is compatible with the Altera specific bus structure called the Avalon bus. The sent and received data are stored in 128 byte single clock FIFO buffers. The maximum baud rate is adjustable up to 115,200 off-the-shelf. To increase the data throughput we modified the module to work at 460,800. Along with the baud rate increase, the 128 byte FIFO buffers were also increased to 256 bytes to prevent data loss caused by buffer underrun. At this baud rate the transmission of a 256 × 256 pixels image from the PC (Matlab) to the FPGA takes about 1.4 s, which is reasonable for off-line testing. An attempt to further increase the baud rate increased the communication error rate substantially. Apparently the USB to RS232 converter that we used supported a maximum of 460,800 baud.
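The quoted 1.4 s is consistent with a quick check, assuming the usual 10 line bits per byte (8 data bits plus start and stop bits):

```python
frame_bytes = 256 * 256          # one 8 bit grayscale frame
line_bits_per_byte = 10          # start + 8 data + stop (8N1 framing)
baud = 460_800
print(frame_bytes * line_bits_per_byte / baud)   # ~1.42 s
```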

4.6.2. RS232 to SSRAM data transfer module
The computation of the optical flow algorithm requires two image frames to be processed concurrently. In our design, the image frames are expected at predefined address locations of the SSRAM and are read by the DMA module. This makes the design flexible enough to process images from different sources: the frame data in the SSRAM can be captured from a camera, copied from another mass storage device or transferred from a PC. In the present design, we implemented a module that receives the image frames from the PC through RS232 communication and stores them in the SSRAM for further processing. We also designed a preliminary module to capture images from a camera and store them to the SSRAM. We plan to use this module in our future work.

The image sequence is sent frame by frame through RS232 by a small script written in Matlab. The received data is written to the RX FIFO buffer by the RS232 controller module. The RS232 to SSRAM transfer module periodically polls the RS232 receive buffer counter to check if there is data available. The polling period is determined by the polling clock divider shown in Fig. 4. The buffer counter signal has a latency of two clock cycles, so the polling period should be more than two clock cycles, else the buffer counter may indicate a wrong number of data items available in the buffer.

The module runs another dedicated FSM. The FSM starts in the idle state when the module recovers from reset. At the rising edge of the polling clock, the FSM switches to the polling state to check the RS232 receive buffer counter. If there is data available in the buffer, it is latched and the read command is sent to the RS232 controller to delete that data from the buffer. Then the received data counter is incremented and the data is put on the output data port with the data valid signal asserted. When the received data counter is equal to the predefined maximum counter value, the packet count is incremented, indicating that one data packet (one frame in our case) has been completely received. The packet count and the received data count are also used to compute the address location of the data to be written in the SSRAM. The layout of the frame data stored in the SSRAM is given in Fig. 5.

4.6.3. SSRAM to RS232 data transfer module
The final module of our design that we discuss is the SSRAM to RS232 data transfer module. After the OF computation process has terminated, the OF vector field data stored in the SSRAM is transferred to the PC (Matlab) through RS232 communication. The interface between the SSRAM controller and the RS232 controller is handled by this module. The I/O port diagram of the module can be seen in Fig. 4.

The module is designed around a dedicated FSM having 13 states. After the reset signal, the FSM starts in the idle state and stays there until the receipt of the trigger signal. Then the module starts reading the given address interval from the SSRAM. Each address location of the SSRAM stores 4 bytes of data, all of which are fetched at once by the read operation. Then the bytes selected by the "byteenable" port of the module are written to the RS232 buffer. In our design all 4 bytes are used to store OF vector data, so every read command from the SSRAM is followed by 4 write commands to the RS232 controller. Before issuing the write command to the RS232 buffer, the write slots available in the buffer should be checked. One important point to consider is that the write space signal has a latency of 2 clock cycles. The buffer may already be full when the write space signal indicates that there are two slots available. So, the write space signal should be checked to indicate at least four spaces available in the buffer to prevent buffer overflow. The FSM continues to read and send the data until the last address location of the SSRAM in the specified interval has been written to the RS232 buffer. It then returns to the idle state until the receipt of the next trigger signal.
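For illustration, the per-word unpacking amounts to the following (the byte order here is our assumption; the hardware selects bytes via the byteenable port):

```python
def word_to_bytes(word):
    """Split one 32 bit SSRAM word into the four bytes that are
    written to the RS232 buffer, least significant byte first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

print([hex(b) for b in word_to_bytes(0x11223344)])
# ['0x44', '0x33', '0x22', '0x11']
```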

5. Hardware design performance analysis and test results

In this section we present the performance analysis of the proposed hardware design in terms of accuracy, resource usage, power consumption and computation speed. Standard test image sequences from the open research literature are used for evaluating the performance of the hardware implementation.

5.1. Standard test sequences for performance evaluation

We used three different image sequences that are frequently used in the literature. The first test sequence is called "Rubik's cube" and is illustrated in Fig. 9a. This image sequence consists of 20 frames with 256 × 256 pixels resolution. The scene consists of a counter-clockwise rotating Rubik's cube on a turntable. The motion vector field induced by the rotation of the cube generates pixel velocities that are between 1.2 and 1.4 pixels/frame on the edge of the turntable and between 0.2 and 0.5 on the cube.

The second sequence is the "Hamburg Taxi" sequence, which is illustrated in Fig. 9b. The frames have a resolution of 256 × 190 pixels. It is recorded by a fixed camera looking at a street scene with three moving cars and one walking pedestrian. A car on the left is driving towards the right and a van on the right is driving towards the left, both at a speed of approximately 3 pixels/frame. The taxi in the middle is turning the corner at a speed of about 1 pixel/frame and there is a pedestrian walking at about 0.3 pixels/frame.

The third sequence is a synthetic one called the "Translating Tree" sequence. It includes 40 image frames with 150 × 150 pixels resolution. The 8th frame of this sequence can be seen in Fig. 9c. In this sequence the motion is based on the movement of a camera from right to left while looking at a static scene with a tree in the foreground. This movement yields a motion field between 1.73 and 2.26 pixels/frame.

5.2. Evaluation of OF computation accuracy

In evaluating the accuracy of the results from the hardware computation of OF on standard image sequences, we used a set of performance measures presented in [4,31]. The first error measure, called the angular error (AE), defines the angular error between a test OF vector v = (u,v) and the reference OF vector v_r = (u_r,v_r). The angular error between the two vectors can be computed by

AE = \arccos\left( \frac{1 + u\,u_r + v\,v_r}{\sqrt{1 + u^2 + v^2}\,\sqrt{1 + u_r^2 + v_r^2}} \right).   (15)

This measure is computed for all vectors in the OF vector field of an image sequence. Averaging the angular errors over the entire frame gives an overall measure called the average angular error (AAE) for the frame. In order to evaluate the performance over the entire sequence, we also consider what we define as the sequence AAE (SAAE). The standard deviation of these errors is also used for evaluation.

Another criterion of accuracy is the endpoint error (EE), which compares the distances between the endpoints of the flow vectors [31]. It is given by

EE = \sqrt{(u - u_r)^2 + (v - v_r)^2}.   (16)

This measure is also computed for the frame (AEE) as well as for the entire video sequence (SAEE). In our evaluation of accuracy, we also investigate how the mask coefficient approximation and the fixed point implementation contribute to the total values of these two types of errors.

We should also point out that in the present study, these measures are used to evaluate the accuracy of the hardware implementation rather than the performance of the OF algorithm. Therefore, we analyze the discrepancy between the fixed point FPGA hardware implementation and a reference floating point PC implementation.
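As a concrete reference for how these measures are computed, a minimal NumPy sketch of Eqs. (15) and (16) follows (function and variable names are ours; the actual evaluation compares the FPGA output against the floating point PC result):

```python
import numpy as np

def flow_errors(u, v, u_ref, v_ref):
    """Per-pixel angular error (AE, Eq. (15), in degrees) and endpoint
    error (EE, Eq. (16)) between a test flow field (u, v) and a
    reference field (u_ref, v_ref), all given as 2-D arrays."""
    num = 1.0 + u * u_ref + v * v_ref
    den = np.sqrt(1.0 + u**2 + v**2) * np.sqrt(1.0 + u_ref**2 + v_ref**2)
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))  # clip guards rounding
    ee = np.hypot(u - u_ref, v - v_ref)
    return ae, ee

# Frame averages of ae and ee give AAE and AEE; averaging those over all
# frames of a sequence gives SAAE and SAEE.
```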

Our accuracy evaluation consists of computing the OF vector field for the image sequences presented. Frame-averaged performance measures are computed for each frame and their evolution as a function of frame number is investigated. The global average of the measures over all frames of a sequence is also computed.

Fig. 9. Example frames from the "Rubik's cube" (a), "Hamburg Taxi" (b) and "Translating Tree" (c) video sequences.

The final OF field output of our design for the Rubik's Cube sequence is illustrated in Fig. 10 for the first frame pair in the sequence. Parts of the frame exhibiting significant motion are shown with magnified versions. Note that an OF vector is computed for every pixel in the image frame.

The frame averaged overall errors (AAE and AEE) are presented in Table 2, where we also decompose the error to show the component caused by the approximation of the OF local average computation mask coefficients alone.

It can easily be seen that the contribution of the mask coefficient approximation is negligible compared to the overall errors. Therefore, the main source of error in the FPGA implementation is the fixed point implementation. We should also note that the fixed point implementation errors are entirely due to the division operation explained in detail in Section 4.5; the other computational blocks do not introduce any loss of precision due to the fixed point representation.

The sequence averaged overall errors (SAAE and SAEE) are also shown in Table 3. This table indicates the mean and STD values of the errors over the entire test video sequence. The test sequence consists of video frames recorded over a very short time interval, which results in very similar motion fields throughout the video frames. Therefore, the standard deviation of the errors over the whole sequence is very small, as expected.

It is also instructive to analyze the distribution of the accuracy loss over the entire frame sequence with respect to its magnitude expressed in image pixels. The histogram of the accuracy errors is shown in Fig. 11. Histogram bars indicate the number of flow vectors that fall within the error intervals specified on the x-axis; each bin extends ±0.005 around the value specified on the x-axis. As can be observed from this histogram plot, the errors are concentrated within ±0.005 pixels, a range that contains approximately 60% of the total image pixels. The maximum error is ±0.03125 pixels.
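The binning convention of Fig. 11 can be reproduced as follows (a sketch under the same conventions; the function name is ours and `err` would be the signed per-pixel difference between the hardware and reference flow components):

```python
import numpy as np

def error_histogram(err, step=0.01, max_center=0.05):
    """Counts of flow-vector errors in bins of width `step` centered at
    multiples of `step` (i.e. ±step/2 around each center), as in Fig. 11."""
    centers = np.arange(-max_center, max_center + step / 2, step)
    edges = np.append(centers - step / 2, centers[-1] + step / 2)
    counts, _ = np.histogram(err, bins=edges)
    return centers, counts
```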

Recall that the above results are for OF vectors represented as fixed point numbers with a 4-bit fraction and with rounding based on the remainder of the division. We further analyzed the accuracy loss when the vectors are represented with fewer fraction bits or entirely as integers with no fraction. Fig. 12 shows the implementation error rates for the Rubik's Cube sequence versus the number of fraction bits used for the fixed point representation.
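The effect of the fraction-bit budget can be reproduced with a simple quantization model (our own sketch; the hardware's remainder-based rounding behaves like round-to-nearest here):

```python
import numpy as np

def quantize_flow(x, frac_bits, rounding=True):
    """Quantize flow components to a fixed point grid with `frac_bits`
    fractional bits (step 2**-frac_bits), with rounding or truncation."""
    scale = 2.0 ** frac_bits
    q = np.round(x * scale) if rounding else np.trunc(x * scale)
    return q / scale

# With 4 fraction bits and rounding, the worst-case representation error is
# half the quantization step: 0.5 * 2**-4 = 0.03125 pixels, which matches the
# maximum error observed in the histogram of Fig. 11.
```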

As can be observed from the plots, integer representation yields high error rates because most of the optical flow vectors are smaller than 1 pixel/frame.


Fig. 10. Optical flow vectors computed on FPGA hardware for Rubik’s cube sequence.

Table 2
Error rates of Rubik's cube sequence calculated for the first two frames.

Error measure              Mask appr.          Overall
                           Mean      STD       Mean      STD
Average angular error      0.079°    0.234°    1.002°    0.546°
Average endpoint error     0.002     0.007     0.018     0.010

Table 3
Sequence averaged overall errors SAAE and SAEE for Rubik's cube sequence.

Error measure    Mean      STD
SAAE             1.0045°   0.0025°
SAEE             0.0180    6.17 × 10⁻⁵


As expected, increasing the number of fraction bits decreases the error rates exponentially. For the present frame resolutions, the number of fraction bits has an upper limit on the present development board, mainly due to the available SSRAM: larger word lengths with more fraction bits require more memory for storage than is available. In our case, we are constrained by the storage of data in SSRAM.

The pipelined architecture achieves a considerable performance improvement over a non-pipelined design. There is a potential computational performance gain in using fewer bits for the representation, and this gain mainly results from faster memory operations.

Fig. 11. Error histogram of optical flow vectors for (a) "Rubik's cube", (b) "Hamburg Taxi" and (c) "Translating Tree" sequences. Error values indicate the center points of ±0.005 pixels error intervals.

However, also due to the pipelined architecture, realizing this gain would require significant design changes in the form of different (smaller) register sizes and a corresponding memory organization.

A similar accuracy analysis was also carried out for the "Hamburg Taxi" sequence, which is a video of a real-life scene as given in Fig. 9b. The OF FPGA implementation results for an example frame pair are given in Fig. 13. The error histogram of the OF vectors can be seen in Fig. 11b. One can note that in this case, the error values are larger than those reported for the Rubik's cube sequence.

This sequence has a comparatively larger variance in the OF vector magnitudes. Flow vectors with small magnitudes yield more error in the computation because of the limited resolution of the fixed point representation. The corresponding angular error and endpoint error results are listed in Table 4.

The last test sequence we use is the synthetic Translating Tree sequence. In this sequence the camera translates at a constant distance and speed with respect to the scene. The optical flow vectors of the Translating Tree sequence are close to integer displacements, which reduces the error caused by the fixed point representation of the vectors. Therefore, the resulting angular error and endpoint error for this sequence are lower than for the Hamburg Taxi sequence. The angular and endpoint error rates are shown in Table 4. The error histogram of the flow vectors is given in Fig. 11c. The OF vectors computed on FPGA are visualized in Fig. 14.

Because of the purely translational motion field in the x direction, the apparent motion in the y direction is zero. Therefore, the error rates of the v vectors are lower than those of the u vectors. Ideally, if the algorithm could estimate the ground truth motion, all computed v vectors would be zero, and since zero is represented without error in our fixed point format, the corresponding error rates would also be zero. However, the algorithm yields nonzero values for some v vectors, which therefore carry representation errors.


Fig. 13. OF vectors computed on FPGA hardware for Hamburg Taxi sequence.


Fig. 12. AAE and AEE versus number of fraction bits used to represent the OF vectors.

Fig. 14. Optical flow vectors computed on FPGA hardware for Translating Tree sequence. Four regions are zoomed in to provide a closer view of the optical flow field computed on FPGA.


Table 4
Error rates of Hamburg Taxi and Translating Tree sequences.

Test sequence       Error measure             Mean     STD
Hamburg Taxi        Average angular error     1.319°   0.509°
                    Average endpoint error    0.024    0.009
Translating Tree    Average angular error     1.045°   0.550°
                    Average endpoint error    0.023    0.009

Table 5
Resource usage of the overall design and available resources on the FPGA device.

                         Design usage   EP2C70 resources   Utilization percentage (%)
Logic elements           8086           68,416             11.9
Embedded memory bits     151,772        1,152,000          13.2
Hardware multipliers     6              150                4.0
PLL blocks               1              4                  25.0
I/O pin count            262            622                42.1

Fig. 15. LE usage percentage of modules.



Table 6
Total dynamic, static and I/O thermal power dissipation.

Core dynamic power dissipation   216.55 mW
Core static power dissipation    165.79 mW
I/O power dissipation            462.04 mW
Total power dissipation          844.38 mW

5.3. Analysis of FPGA resource usage

The resource usage of a design is one of the performance measures of the hardware. It consists of programmable logic elements (LEs) for implementing logic functions, embedded memory blocks for data storage, embedded multiplier blocks for the dedicated implementation of fast multiplications, PLL blocks to generate clock signals for the operation of synchronous design modules and I/O pins used for interfacing with devices outside the FPGA chip.

The resources available on a particular FPGA determine which designs can fit into a single chip. A more compact design often implies a more cost-effective and lower-power FPGA selection. Even if a larger chip is available, a compact design allows subsequent vision tasks to fit into the same FPGA chip, since OF is generally a pre-processing step of a computer vision system. The performance of our design in terms of the utilization of the selected low-cost, low-power FPGA's resources is summarized in Table 5.

It can be observed from this table that our proposed design utilizes the available resources on the EP2C70 very effectively. This gives our design a considerable power consumption and cost advantage. We will demonstrate that the design can still achieve exceptional processing performance for a frame size typical of the related literature.

Typically, one would also like to know how the utilization of logic elements is distributed among the design modules. Fig. 15 illustrates this for the proposed design. The highest number of LEs is used by the OFC module, which consumes nearly half of the LEs used by the whole design. This is followed by the GAC module. This is due to the fact that these two modules are arithmetic operation intensive: since the word lengths of the operands are kept large to preserve accuracy, the mathematical operations need a large amount of logic resources to be implemented.

The rest of the utilization can be summarized as follows: embedded memory bits are mostly used by the FIFO buffers of the DMA module for storing data transferred to/from the SSRAM. The hardware multipliers are used by arithmetic operations in the OFC module. Most of the I/O pins of the device are used for interfacing with the external SSRAM chip; the rest are used by the RS232 communication, clock inputs, LEDs, switches and push-buttons.

5.4. Analysis of power consumption

Power consumption is also often an important performance measure of a hardware design, in particular for mobile applications with portable power sources. It is a function of the resource utilization (and hence the choice of FPGA) but also of the particulars of the design, since two designs having the same "size" can have different power profiles. All power consumption in an FPGA based design is thermal power dissipation, converted to heat within the chip. This performance aspect of the proposed design is analyzed in terms of dynamic, static and I/O block consumption.

Dynamic power consumption is caused by the transition of logic signals and is therefore higher in modules that operate at high switching rates and high frequency. Static power is due to the leakage current in transistors and is mainly affected by the number of logic elements used in the design. Finally, I/O power consumption is related to the load of the external circuitry. We summarize the static, dynamic and I/O comparison in Table 6 and the distribution among design modules in Fig. 16.

In Table 6, one can observe that the I/O power dissipation is significant compared to that dissipated by the core of the FPGA. This is because the external interfaces operate at a higher voltage (3.3 V) than the core (1.2 V) and also mostly at high frequency (200 MHz) due to the high bandwidth memory interface. External capacitances are also higher with larger drive transistors, increasing both dynamic and static power consumption. The table also shows that the total power requirement is well under 1 W, which illustrates a striking advantage over a general purpose CPU (35 W at 1.66 GHz) as well as a high performance parallel GPU architecture (50–100 W for a GeForce 7800).

The pie chart in Fig. 16 illustrates the distribution of power consumption among the design modules. The DMA module has the highest power consumption in terms of both static and dynamic consumption. It is not the largest module in the design and hence does not use the largest amount of resources; however, its dynamic consumption is high due to its high operating frequency of 200 MHz. The static power consumption is also high since a large amount of memory resources is used and these have considerable leakage losses.


Fig. 16. The total power consumption distribution among the design modules.


Using that much memory also requires many routing connections between the module and the memory blocks. This additional circuitry, combined with the high operating frequency, also dissipates considerable dynamic power.

Another module that operates at 200 MHz is the SSRAM controller. However, it has the lowest dynamic and static power consumption. This module mostly contains routing interfaces between the SSRAM ports and the modules accessing the SSRAM, and it uses a very small number of LEs.

The rest of the modules in the design operate at 50 MHz. The gradient and optical flow function computation modules are the two most power consuming modules amongst them. Although they operate at a lower frequency, their high resource usage in both combinational and sequential structures increases their static and dynamic consumption.

5.5. Computation time

Real-time computer vision requires a minimum of 30 video frames to be processed per second. This allows approximately 33 ms of processing time per frame for all operations that must be performed on a frame. Pipelined architectures allow for longer duration operations, provided each pipeline stage can execute within this interval and the pipeline delay is acceptable.

In our work, we measure execution speed in terms of the maximum sustained number of frames that can be processed per second for OF computation, expressed in frames per second (fps). A speed much higher than the aforementioned minimum allows other higher level vision operations to fit into an overall single-chip design.

Both the "Rubik's Cube" sequence (256 × 256 pixels) and the "Hamburg Taxi" sequence (256 × 190 pixels) are tested to determine the frame rate. Since the implementation has no non-deterministic components, the frame rate does not vary over the image. For different resolutions, a different frame rate is possible but requires register and memory re-optimization. The two test sequences have close enough resolutions that this re-optimization is not justified; we simply zero-pad the "Hamburg Taxi" sequence to 256 × 256 pixels, resulting in the same computation speed for both sequences.

For both sequences, the computation of the frame OF vector field is completed in 3.89 ms. This corresponds to a frame rate of 257 fps, which is well above the real-time requirement. The measured computation time includes the whole optical flow computation process together with memory access times from/to SSRAM. However, the transfer of the image sequence from the PC to the FPGA board, and of the computation results from the FPGA board back to the PC, is not included in the computation time, since it depends on the communication interface used rather than on the design itself.
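As a cross-check of these figures (the pixel-throughput value is our own derivation from the numbers above, not stated in the original):

$$\frac{1}{3.89\ \mathrm{ms}} \approx 257\ \mathrm{fps}, \qquad \frac{256 \times 256\ \mathrm{pixels}}{3.89\ \mathrm{ms}} \approx 16.8\ \mathrm{Mpixels/s}.$$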

For comparison, our reference PC implementation of the algorithm on the Matlab R2008b platform with fully vectorized code achieves approximately 0.57 s of processing time per frame (about 2 fps). The PC hardware platform has an Intel Core2 Duo processor operating at 1.66 GHz and 1 GB of memory at a 667 MHz clock rate. The processor has a power rating of 34 W.

In the literature, the computational speed of optical flow algorithms on PC hardware is often reported for Matlab or other programming language implementations such as C/C++. Although it can be argued that a Matlab implementation may not be indicative of the best that can be done on a PC, it is still far from achieving real-time performance. There is also a clear and significant power consumption advantage of an FPGA implementation.

6. Conclusions and discussion of results

In this study, we presented the design and implementation of a high performance FPGA hardware with a small footprint and low power consumption that is capable of providing optical flow data at rates exceeding real time. The motivation behind this work is the lack of a suitable hardware, available to the computer-vision research community, that is capable of computing the optical flow vector field in real time. To the best of our knowledge, a consistent multi-criteria evaluation of performance is also not available. Low power real-time performance is especially important for mobile robotic platforms, to enable many successful computer vision algorithms and applications on these platforms.

The well known reference optical flow algorithm proposed by Horn and Schunck is implemented in hardware and yields a high density OF vector field with reasonable accuracy. We discuss different aspects and performance dimensions of the proposed design and attempt to present insight that is applicable to implementations of other vision algorithms. There are different compromises in an FPGA based design, from the selection of the FPGA chip to the way the design is put together. We put our emphasis on a low power and compact design while providing over real-time performance. The design uses a power efficient low end FPGA (EP2C70) from the Cyclone II family and occupies a small fraction of this chip. If a custom board is designed, smaller devices such as the EP2C35-EP2C15 from the same family can be used for lower cost and power consumption while still providing sufficient chip resources. To realize the advantages of an FPGA based hardware design and achieve the maximum possible execution speed, a parallel, pipelined architecture is used. Utilizing a multi clock design method also boosts performance dramatically by operating the memory interfaces at a higher (200 MHz) clock frequency and hence saving significantly on memory read/write times. The computational modules operate at up to 50 MHz using a 15-stage pipeline; the division operation is subdivided into multiple stages to achieve this throughput.

Besides discussing the particulars of the design, we also present comprehensive testing of the proposed design using video test sequences that are frequently used for performance evaluation of optical flow methods in the literature.

The FPGA hardware implementation achieves an accuracy of 1.319° average angular error with 0.509° standard deviation and a maximum of 0.024 pixels average endpoint error as compared with the floating point PC implementation of the same algorithm. The hardware can compute the OF vector field on a 256 × 256 image pair in 3.89 ms, which corresponds to a frame rate of 257 fps. This is approximately 146 times faster than the reference PC implementation. The FPGA implementation consumes only 844.38 mW of power, which is around 1/40 of the power consumed by the 1.66 GHz PC processor.


In conclusion, the presented FPGA implementation of optical flow computation can provide over real-time performance, and hence hardware acceleration for vision applications on mobile platforms, while delivering low power consumption and reasonable accuracy at an affordable cost.

Besides the promising results reported in the present paper, there is considerable complementary work to consider when one is committed to using an FPGA based vision system in practice. Our test setup takes the input image sequence from a PC for testing purposes and hence has a slow RS232 interface. A high speed interface to feed the image sequence directly from a CMOS camera into the dedicated SSRAM is part of our ongoing work. This needs to be complemented by a high speed interface between the FPGA board and the PC responsible for higher level vision functions.

Also, the FPGA development board we utilized is a general purpose board designed for education and research purposes. An actual vision application would often require a dedicated board with carefully selected memory size and sub-systems to decrease the board size and power consumption.

Another aspect that needs to be considered for a full OF implementation is the pre-processing filters that may increase the accuracy of the algorithm. Since filtering operations take less time than the OF computation and are not the critical bottleneck, we have excluded the implementation of these pre-processing filters; FPGA based filter implementation is also simpler. Our design leaves enough room on the chip to integrate all these steps on a single chip.

7. Access to hardware source code

In performing this work and sharing it with the scientific community, we were motivated partly by our application domain of legged mobile robotics and partly by the lack of an open-source hardware design that researchers in academia can build on. As an integral part of this paper, the source code of our hardware design is made available in full to the scientific community for research and academic use. A research proposal form and a non-disclosure agreement need to be submitted to the authors for individual access to the source code. These forms can be obtained directly from the authors by e-mail correspondence.

Acknowledgement

This work was supported by TUBITAK (The Scientific and Technological Research Council of Turkey) under Project 110E120.

References

[1] T. Browne, J. Condell, G. Prasad, T. McGinnity, An investigation into optical flow computation on FPGA hardware, in: International Machine Vision and Image Processing Conference, IMVIP '08, September 2008, pp. 176–181.
[2] W. MacLean, An evaluation of the suitability of FPGAs for embedded vision systems, in: Conference on Computer Vision and Pattern Recognition Workshops, June 2005.
[3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, R. Szeliski, A database and evaluation methodology for optical flow, in: IEEE 11th International Conference on Computer Vision (ICCV), October 2007, pp. 1–8.
[4] J. Barron, D. Fleet, S. Beauchemin, Performance of optical flow techniques, International Journal of Computer Vision (IJCV) 12 (1994) 43–77.
[5] S.-Y. Chien, L.-G. Chen, Reconfigurable morphological image processing accelerator for video object segmentation, Journal of Signal Processing Systems 62 (2011) 77–96.
[6] A. Lopich, P. Dudek, Hardware implementation of skeletonization algorithm for parallel asynchronous image processing, Journal of Signal Processing Systems 56 (2009) 91–103.
[7] B. Horn, B. Schunck, Determining optical flow, Artificial Intelligence 17 (1981) 185–203.
[8] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of Imaging Understanding Workshop, 1981, pp. 121–130.
[9] H.-H. Nagel, On a constraint equation for the estimation of displacement rates in image sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (1989) 13–30.
[10] M. Black, P. Anandan, A framework for the robust estimation of optical flow, in: Fourth International Conference on Computer Vision, May 1993, pp. 231–236.
[11] M. Proesmans, L. van Gool, E. Pauwels, A. Oosterlinck, Determination of optical flow and its discontinuities using non-linear diffusion, in: European Conference on Computer Vision, 1994, pp. 295–304.
[12] P.C. Arribas, F.M.H. Macia, FPGA implementation of the Horn and Schunck optical flow algorithm for motion detection in real time images, in: Design of Circuits and Integrated Systems Conference, 1998, pp. 616–621.
[13] P.C. Arribas, F.M.H. Macia, FPGA implementation of Camus correlation optical flow algorithm for real time images, in: Vision Interface Proceedings, 2001, pp. 7–9.
[14] P.C. Arribas, F.M.H. Macia, FPGA board for real time vision development systems, in: Proceedings of the Fourth IEEE International Caracas Conference on Devices, Circuits and Systems, 2002, pp. T021-1–T021-6.
[15] P.C. Arribas, F.M.H. Macia, FPGA implementation of Santos-Victor optical flow algorithm for real-time image processing: an useful attempt, in: VLSI Circuits and Systems, vol. 5117, 2003, pp. 23–32.
[16] H. Niitsuma, T. Maruyama, High speed computation of the optical flow, in: International Conference on Image Analysis and Processing (ICIAP), 2005, pp. 287–295.
[17] Z. Wei, M. Martineau, D.-J. Lee, A fast and accurate tensor-based optical flow algorithm implemented in FPGA, in: IEEE Workshop on Applications of Computer Vision, February 2007.
[18] T.A. Camus, Real-Time Optical Flow, PhD thesis, Brown University, Providence, RI, USA, 1994.
[19] J. Santos-Victor, G. Sandini, Uncalibrated Obstacle Detection Using Normal Flow, Tech. Rep., University of Genova, Italy, 1996.
[20] P.C. Arribas, F.J. Alonso, FPGA real time lane departure warning hardware system, in: Computer Aided Systems Theory (EUROCAST), 2007, pp. 725–732.
[21] J. Martin, A. Zuloaga, C. Cuadrado, J. Lazaro, U. Bidarte, Hardware implementation of optical flow constraint equation using FPGAs, Computer Vision and Image Understanding 98 (2005) 462–490.
[22] C. Claus, A. Laika, L. Jia, W. Stechele, High performance FPGA based optical flow calculation using the census transformation, in: IEEE Intelligent Vehicles Symposium, June 2009, pp. 1185–1190.
[23] J. Diaz, E. Ros, F. Pelayo, E. Ortigosa, S. Mota, FPGA-based real-time optical-flow system, IEEE Transactions on Circuits and Systems for Video Technology 16 (2006) 274–279.
[24] B. Pyda, R. Brindha, A novel high speed L-K based optical flow computation, in: International Conference on Communication and Computational Intelligence, 2010, pp. 104–108.
[25] Z. Wei, D.-J. Lee, B.E. Nelson, FPGA-based real-time optical flow algorithm design and implementation, Journal of Multimedia 2 (2007) 38–45.
[26] R. Strzodka, C. Garbe, Real-time motion estimation and visualization on graphics cards, in: Proceedings of the IEEE Conference on Visualization, 2004, pp. 545–552.
[27] Y. Mizukami, K. Tadamura, Optical flow computation on compute unified device architecture, in: Proceedings of the 14th International Conference on Image Analysis and Processing, 2007, pp. 179–184.
[28] J. Chase, B. Nelson, J. Bodily, Z. Wei, D.-J. Lee, Real-time optical flow calculations on FPGA and GPU architectures: a comparison study, in: 16th International Symposium on Field-Programmable Custom Computing Machines, April 2008, pp. 173–182.
[29] R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second ed., SIAM, 1994.
[30] S. Kilts, Advanced FPGA Design: Architecture, Implementation, and Optimization, Wiley, 2007.
[31] M. Otte, H.-H. Nagel, Optical flow estimation: advances and comparisons, in: European Conference on Computer Vision, 1994, pp. 51–60.

Gökhan Koray Gültekin is currently a Ph.D. student in the Department of Electrical and Electronics Engineering at Middle East Technical University (METU), Ankara, Turkey, where he also received his M.Sc. degree in 2010. He received the B.S. degree in 2007 from the Department of Electrical and Electronics Engineering at Baskent University, Ankara, Turkey, where he worked as a research and teaching assistant for one year in 2008. He has been working as a research and teaching assistant at METU since 2009. He has taken part in two research projects on legged robotics funded by TUBITAK (The Scientific and Technological Research Council of Turkey) since 2007. He was the leader of the Baskent University Robotics Team, which won the first prize in the IEEE Region 8 Student Robotics Contest held at the University of Twente, Netherlands in 2006. He also holds four other prizes from national robotics contests held in Turkey. His research interests include machine vision and embedded hardware and software design in robotics systems.

Afşar Saranlı received his B.S. degree in 1993 with Honors and his Ph.D. degree in 2000, both from the Department of Electrical and Electronics Engineering, Middle East Technical University, Ankara, Turkey. His M.Sc. degree is with Distinction, received in 1994 from the Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, London, England. During the 1995–1999 period, he was also working part-time as a science consultant to STFA Savronik Inc., a defence electronics company in Ankara, Turkey. He joined IPS Automation Inc., Toronto, Canada in 2000 as a Senior Computer Scientist, and later Photon Dynamics Canada Inc., Toronto, Canada in 2003 in the same capacity, where he worked until 2005. His focus was on signal and image processing as well as on control algorithm development for vision based automated inspection systems for the automotive glass and LCD manufacturing industries. He then returned to Turkey to join the Department of Electrical and Electronics Engineering, Middle East Technical University as an Assistant Professor, where he is currently directing the Laboratory of Robotics and Autonomous Systems (RoLab). His current research interests include estimation and tracking for radar and other sensor systems, sensor based mobile robotics, and dynamics and control of legged locomotion for robotic platforms. He is a co-author of four US and international patents in the field of automated inspection and computer vision, as well as author or co-author of a number of scientific publications.