2009 IEEE Symposium on Computational Intelligence for Image Processing (CIIP), Nashville, TN



AN EFFICIENT ARCHITECTURE FOR HARDWARE IMPLEMENTATIONS OF IMAGE PROCESSING ALGORITHMS

Farzad Khalvati and Hamid R. Tizhoosh

Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada

ABSTRACT

This work presents a new performance improvement

technique for hardware implementations of non-recursive convolution-based image processing algorithms. It combines an advanced data flow technique (instruction reuse) proposed in modern microprocessor design with the value locality of image data to develop a method, window memoization, that increases throughput with minimal cost in area and accuracy. We implement window memoization as a 2-wide superscalar pipeline such that it consumes significantly less area than conventional 2-wide superscalar pipelines. As a case study, we have applied window memoization to the Kirsch edge detector. The average speedup factor was 1.76 with only 25% extra hardware.

I. INTRODUCTION

Image processing algorithms are widely used in real-time applications such as medical imaging (e.g. MRI and ultrasound), quality control, navigation, security, and multimedia. These systems must process a huge amount of data in real time, which makes it both challenging and crucial to optimize the image processing algorithms. From a hardware design perspective, the size of the circuitry is another crucial factor. Hardware designers face the increasing challenge of designing faster and smaller image processing circuitry.

Two common methods for performance improvement in hardware design are pipelining and parallel processing. Both methods speed up the operations at a cost in hardware area. In this paper, we present window memoization, a performance optimization technique for hardware implementations of non-recursive convolution-based algorithms, which are widely used in image processing. Window memoization combines instruction reuse with the repetitive nature of image data to speed up the operations at significantly less cost in area than the conventional methods. Instruction reuse is a data flow technique, proposed in modern microprocessor design, that exploits the value locality of data in computer programs to improve performance by reusing the results of previously computed instructions.

Window memoization employs a reuse buffer (RB) to store previously computed operations on a window of pixels (a parcel) and the corresponding results. Subsequent parcels are compared against the reuse buffer; if a matching parcel is found, the previously computed result is reused and the actual computation is skipped.

To apply window memoization to image processing algorithms, we implement the design as a 2-wide superscalar pipeline rather than as a scalar pipeline. Conventional 2-wide superscalar pipelines require twice the hardware needed by scalar pipelines. In contrast, our superscalar pipeline only needs extra hardware to implement the reuse mechanism rather than replicating the original hardware. Our design achieves high speedups by employing several methods: multi-thresholding for parcel matching, which increases speedup with an insignificant loss in result accuracy, an efficient RB address generator, and a bypass path.

We have analyzed the performance improvement of window memoization for 52 natural images of 512×512 pixels with 256 gray levels. As a case study, we have applied our technique to the Kirsch algorithm [1], a commonly used edge detector. It performs convolutions of 3×3 parcels with 8 masks, calculating 8 gradients. The pixel in the center of the parcel is identified as an edge if the maximum of the 8 gradients is greater than a threshold.
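As a point of reference, the Kirsch detector described above can be sketched in software as follows. This is a sketch only: the specific compass mask values and the threshold value are assumptions, since the text only states that 8 masks yield 8 gradients whose maximum is thresholded.

```python
import numpy as np

def _rotate45(mask):
    # rotate the 8 border elements of a 3x3 mask one step clockwise
    idx = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    vals = [mask[i] for i in idx]
    out = mask.copy()
    for (r, c), v in zip(idx, vals[-1:] + vals[:-1]):
        out[r, c] = v
    return out

def kirsch_edges(image, threshold=300):
    """Convolve each 3x3 parcel with 8 masks; mark the center pixel
    as an edge if the maximum gradient exceeds the threshold."""
    mask = np.array([[5, 5, 5], [-3, 0, -3], [-3, -3, -3]])  # classic Kirsch north mask (assumed)
    masks = []
    for _ in range(8):
        masks.append(mask)
        mask = _rotate45(mask)
    img = image.astype(np.int64)
    h, w = img.shape
    edges = np.zeros((h, w), dtype=np.uint8)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            parcel = img[i - 1:i + 2, j - 1:j + 2]
            gmax = max(int((parcel * m).sum()) for m in masks)
            edges[i, j] = 255 if gmax > threshold else 0
    return edges
```

A flat image yields no edges, while a vertical step produces a response along the boundary.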

The performance of window memoization is highly dependent on the amount of data locality in an image. Images with less detail tend to have more data locality, and hence benefit more from window memoization. To analyze our results more clearly, we have defined an algorithm to calculate the detailedness of images, based on which we categorize images as Low, Medium, or High. In addition, we have defined efficiency as a measure of effectiveness, which in addition to speedup considers the extra hardware required by window memoization.

The outline of the rest of the paper is as follows. In section II, we describe related research on different reuse techniques proposed in hardware. In section III, we introduce a detailedness algorithm, which categorizes images based on the level of their complexity. In section IV, we present background on instruction reuse as proposed in microprocessor design. In section V, we introduce our technique, window memoization, and in section VI, results are presented and analyzed. Finally, we present the conclusion in section VII.

978-1-4244-2760-4/09/$25.00 ©2009 IEEE

II. RELATED WORK

Memoization is a technique, introduced by Michie [2], which is used to speed up the calculations in a computer program by reusing the results of previous computations. In recent years, a few techniques based on memoization, in both hardware and software, have been proposed to take advantage of the value locality of data in computer programs. Value locality is defined as the possibility of repeatedly encountering previously seen data upon which the same calculation is to be performed [3].

Richardson [4] proposed to embed a memoization technique in microprocessors to look up the results of a set of targeted operations that are redundant.

Sodani and Sohi [5] proposed instruction reuse, which reduces the number of instructions that have to be executed dynamically. Instruction reuse benefits from the value locality of instructions and operands to produce the result of an instruction as soon as it is fetched, without executing it. This reduces the number of cycles that the pipeline has to stall for data dependencies and hence increases the throughput.

Citron et al. [6] proposed a technique that enables executing multi-cycle operations in a single cycle by adding a MEMO-TABLE to each computation unit in the microprocessor. In another work, Citron and Feitelson [7] proposed adding two new instructions to the instruction set to look up and update a generic MEMO-TABLE. This makes it possible to memoize multi-cycle mathematical and trigonometric functions, most of which are not included in microprocessors' instruction sets.

Kavi and Chen [8] studied the possibility of improving performance by reusing the results of previous function invocations. It was concluded that there is great potential for exploiting function reuse and that functions with fewer arguments have a higher probability of reuse. Huang and Lilja [9] proposed the block reuse technique, which applies value reuse to blocks rather than single instructions.

As a hardware design technique specific to an image processing algorithm (mathematical morphology), Chien et al. [10] proposed a method that reuses the results of overlapping operations in a pixel neighborhood to improve performance.

Although the simulation results for the different proposed reuse architectures show significant performance improvement, none of them has been implemented in a real design yet. The reasons are that implementing these techniques requires significant modifications to the existing control and datapath circuitry in microprocessors, and that designers in industry are not convinced that the performance gain will justify the cost of design modifications [11].

Our technique benefits from the value locality of image data to optimize the hardware implementations of non-recursive convolution-based image processing algorithms. Once the control circuitry for the reuse mechanism is designed for a particular algorithm, it can be applied to different image processing algorithms with minor modifications. The fact that, in image processing, differences between the result and reference images that cannot be detected by an observer are tolerated leads to the idea of multi-thresholding for parcel matching, which enables window memoization to achieve high speedups. This is contrary to memoization techniques in microprocessors, which always require 100% accuracy for instruction results, limiting reuse to only the instructions or functions with identical operands.

In previous work, we have applied window memoization to several case studies and implemented them in software, where typical speedups in the range of 1.42 to 2.8 have been achieved [12] [13].

III. IMAGE DETAILEDNESS

To analyze the speedups achieved for different images, we classify images based on their complexity. The classification will give us an estimation of the performance gain for an image before applying the performance improvement technique to it. In addition, it may help us customize the technique for individual classes of images. We present an algorithm that calculates how detailed an image is. For a given image, the algorithm generates n×n seed pixels spread across the image at equal distances (horizontally and vertically) from each other. Four derivatives are calculated on a 3×3 neighborhood around each seed pixel, and the percentage of instances in which the maximum of the four derivatives is larger than a threshold, ε, is calculated. The final result of the algorithm for an image, η, is a number between 0 (the least detailed) and 100 (the most detailed) indicating the level of complexity of the image. We have experimentally determined n and ε to be 102 and 15, respectively. Table I shows the detailedness algorithm.

Table I. Algorithm for calculation of detailedness

1. input an image I
2. initialize counter k
3. generate n × n seed pixels, which are σ pixels (horizontally and vertically) apart from each other
4. for each seed pixel at (i, j) calculate:
   Δ1 = abs(I(i−1, j) − I(i+1, j))
   Δ2 = abs(I(i, j−1) − I(i, j+1))
   Δ3 = abs(I(i−1, j−1) − I(i+1, j+1))
   Δ4 = abs(I(i−1, j+1) − I(i+1, j−1))
5. Δmax = max(Δ1, Δ2, Δ3, Δ4)
6. if Δmax > ε, then increment k
7. calculate detailedness η: η = k / n² × 100%
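A direct software transcription of Table I may make the procedure concrete. This is a sketch: the exact seed placement near the image borders is not specified in the text, so here the seeds are spread uniformly while staying one pixel inside the border.

```python
import numpy as np

def detailedness(image, n=102, eps=15):
    """Detailedness per Table I: percentage of n x n seed pixels whose
    maximum 3x3 derivative exceeds eps. Returns a value in [0, 100]."""
    h, w = image.shape
    img = image.astype(np.int32)
    # n x n seed pixels at equal spacing, kept 1 pixel inside the border
    ys = np.linspace(1, h - 2, n).astype(int)
    xs = np.linspace(1, w - 2, n).astype(int)
    k = 0
    for i in ys:
        for j in xs:
            d1 = abs(img[i - 1, j] - img[i + 1, j])
            d2 = abs(img[i, j - 1] - img[i, j + 1])
            d3 = abs(img[i - 1, j - 1] - img[i + 1, j + 1])
            d4 = abs(img[i - 1, j + 1] - img[i + 1, j - 1])
            if max(d1, d2, d3, d4) > eps:
                k += 1
    return 100.0 * k / (n * n)
```

A flat image scores 0, while an image whose gray level changes sharply everywhere scores 100.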

We verified that the detailedness algorithm produces intuitively reliable results by applying it to different natural images and presenting the results to 5 observers.

Using the detailedness algorithm, we are able to classify images based on the level of variation of their gray levels. We will use this classification in analyzing our proposed performance improvement technique in the upcoming sections.

As input images for our simulations, we randomly chose 52 different images of 512×512 pixels and ran the detailedness algorithm. The detailedness results are between 1.38% and 77.65%.

To categorize the images, we calculated the average detailedness minus/plus the standard deviation as two boundary points, which gave us 17% and 55%, respectively. As a result, all images with detailedness below 17% were categorized as class Low. Class Medium contains the images with detailedness between 17% and 55%, and finally, class High includes the images with detailedness higher than 55%. Table II shows the range of detailedness, average detailedness, and number of images, and Figure 1 shows typical images for each class of detailedness.

Table II. Detailedness classes for 52 images

Class    range of η       Average η   images
Low      η < 17%          9%          11
Medium   17% ≤ η < 55%    38%         34
High     η ≥ 55%          66%         7

Fig. 1. Left to right: classes Low, Medium and High

IV. BACKGROUND ON INSTRUCTION REUSE

Despite the innovations made in modern microprocessor design, performance is essentially limited by two program characteristics: the control flow limit and the data flow limit. The control flow limit can cause control hazards, which arise from the speculative execution of instructions. The data flow limit, on the other hand, is caused by data hazards, which are due to unhandled data dependencies between consecutive instructions. Instruction reuse [5] attempts to break the data flow limit and hence reduce the performance loss due to data hazards.

True data dependencies between instructions cause a scalar pipeline to stall frequently, and as a result, it never reaches its ultimate throughput of one instruction per cycle. Instruction reuse introduces additional instruction-level parallelism by reusing the result of an instruction without executing it. The technique benefits from the value locality of the data fed into in-flight instructions. Value locality is caused by many factors, but among them data redundancy is a significant one. Data redundancy is due to the fact that many programs use data that differ only slightly.

[Figure 2 diagram: a scalar pipeline (Fetch, Decode, Issue, Execute, Commit stages shown) augmented with RB Access and Reuse Test stages; a reused instruction bypasses the intermediate stages and commits directly to the register file.]

Fig. 2. Instruction reuse for microprocessors

Instruction reuse tries to produce the result of an instruction, on which subsequent instructions have true data dependencies, as soon as possible by using the results of previously executed instructions. Hence, it reduces the number of cycles that the pipeline has to stall and consequently increases the throughput up to one instruction per machine cycle.

Figure 2, from [14], shows how instruction reuse is implemented in a six-stage scalar pipeline for microprocessors. Instruction reuse adds two modules to the pipeline: reuse buffer (RB) access and reuse test. Based on the fetched instruction's program counter, the reuse buffer is searched in RB access. If a matching instruction is found, the reuse test verifies whether its future result is going to be the same as the previous one, in which case the instruction reuses its result and commits it directly to the register file, bypassing all the intermediate stages (i.e. Decode to Execute).

V. WINDOW MEMOIZATION TECHNIQUE

We introduce window memoization as a performance-improving technique for non-recursive convolution-based image processing algorithms. The main idea is that if a calculation has been performed on a window of pixels, which we call a parcel, then when we encounter an identical parcel in the future, we can reuse the previously computed result. The goal is to increase performance by processing as many parcels as possible per clock cycle. To maximize the performance improvement, we want to maximize the percentage of parcels that are able to reuse previously computed results (the reuse rate) and minimize the cost (hardware) of reusing a result.

As with reuse techniques in software or hardware, window memoization uses a memory array (the reuse buffer (RB)) to store the parcels and their results after performing the calculations. From a high-level model perspective, the pixels of each new parcel are compared to the pixels of the parcels stored in the RB. If the new parcel matches a parcel stored in the RB (a hit), the result is looked up from the RB and the calculations for the new parcel are skipped. Otherwise (a miss), the calculations are performed on the new parcel and the RB is updated with the produced result. Implementing the reuse mechanism in hardware requires a design method, which will be discussed shortly.
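The hit/miss flow just described can be modeled at a high level in a few lines. This sketch uses a Python dict in place of the hardware RB and folds in the multi-thresholding idea discussed later (matching on the 3 MSBs of each 8-bit pixel, i.e. 8 gray levels); the class and function names are illustrative, not from the paper.

```python
def make_key(parcel, bits=3):
    # multi-thresholding: match parcels on the top `bits` MSBs of each pixel
    return tuple(p >> (8 - bits) for p in parcel)

class ReuseBuffer:
    """High-level model of window memoization: a hit skips the core."""
    def __init__(self, core):
        self.core = core      # the actual (expensive) computation
        self.table = {}       # stands in for the hardware RB
        self.hits = 0
        self.misses = 0

    def process(self, parcel):
        key = make_key(parcel)
        if key in self.table:             # hit: reuse the stored result
            self.hits += 1
            return self.table[key]
        self.misses += 1                  # miss: run the core, update the RB
        result = self.core(parcel)
        self.table[key] = result
        return result
```

Note that near-identical parcels that agree in their MSBs also hit, which is exactly the accuracy/reuse-rate trade-off of multi-thresholding.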

For non-recursive algorithms there is no data dependency between operations in the pipeline. Therefore, the throughput of a scalar pipeline for such algorithms is 1, which is the maximum throughput of scalar pipelines.

To tackle the performance limitation that the scalar pipeline imposes, we design window memoization as a 2-wide superscalar pipeline such that it increases the throughput up to 2, but with extra hardware significantly less than that required by a conventional 2-wide pipeline. Generally, a 2-wide superscalar pipeline is obtained by duplicating the hardware of the scalar pipeline (Figure 3, left). For conventional 2-wide superscalar pipelines with no hazards (e.g. conventional implementations of image processing algorithms), doubling the area in hardware will double the throughput of the pipeline.

We have implemented a 2-wide pipeline (Figure 3, right) that accepts two pixels as inputs at each clock cycle. Two parcels, A and B, are created and sent through the pipeline at each clock cycle. Parcel A goes through the core and parcel B tries to find a matching parcel in the RB. If a matching parcel is found (a hit), the result for parcel B is looked up while the result for parcel A is calculated at the core. As a result, two parcels exit the pipeline in one clock cycle. If parcel B is not able to locate a matching parcel in the RB (a miss), the pipeline stalls and parcel B is sent to the core following parcel A, and the RB is updated with its result. In this case, one parcel per cycle will exit the pipeline.

As Figure 3 shows, instead of duplicating the core, we have only added the RB, an address generator for the RB (RB-addr-gen), a matching block (lookup), and a fifo. The RB stores the parcels and their results. RB-addr-gen generates the address for incoming parcels. Lookup determines whether a hit or miss occurs and which parcel should enter the core (resolving the structural hazard). The fifo ensures that outputs exit the pipeline in the correct order.

Two parameters that define the efficiency of window memoization are speedup and the hardware area growth rate, which we call sprawl. Speedup depends on the reuse rate. The reuse rate has a direct relation with the RB depth and an inverse relation with result accuracy, i.e. the number of gray levels used for parcel matching (multi-thresholding). Sprawl, on the other hand, depends on the size of the RB-addr-gen and lookup stages, the RB and fifo width (accuracy), the fifo depth (core latency), and the RB depth. Figure 4 shows the design process for window memoization. RB depth and accuracy are independent parameters that can be adjusted for each application to reach the desired optimal point where the accuracy and speedup requirements are met.

With 256 gray levels, it is unlikely that a parcel of, for instance, 9 pixels will exactly match a previously encountered parcel (i.e. a low reuse rate). We increase the reuse rate

[Figure 3 diagrams: left, the conventional 2-wide pipeline (two incoming pixels, create parcels A and B, two cores, two outputs); right, our 2-wide pipeline with a single core plus generate-RB-address, lookup (hit/miss, parcel A or B?), the reuse buffer (RB) with its update path, and a fifo.]

Fig. 3. Left: Conventional 2-wide pipeline. Right: Our 2-wide pipeline for window memoization

[Figure 4 diagram: the independent parameters (RB depth, number of gray levels) determine the reuse rate, accuracy, RB width, fifo width and depth (via core latency), and the RB-addr-gen and lookup size, which in turn determine speedup and sprawl; direct and inverse relations are marked.]

Fig. 4. Design process for window memoization

through multi-thresholding: rather than requiring that the pixels match exactly, we reduce the number of gray levels of the parcels when doing the comparison. Decreasing the precision of the match (using fewer gray levels when comparing pixels) increases the reuse rate and thereby the performance, at the potential cost of an insignificant loss in result accuracy. To minimize the loss of accuracy, we reduce the number of gray levels only for matching the parcels; the actual calculations are done with the full range of 256 gray levels. Fewer gray levels for matching lead to fewer bits to represent each parcel, and hence the RB and fifo widths decrease.

To find the optimal number of gray levels for matching for our case study (the Kirsch edge detector), we evaluated the performance and the result accuracy with respect to different numbers of gray levels for matching (2 to 256). We found that using 8 gray levels for comparing the parcels leads to high reuse rates (and thus high speedups) while preserving the accuracy of the results.

RB-addr-gen plays an important role in producing high reuse rates. It converts each incoming parcel to a number, the full-key, which is used for matching. The full-key is generated by combining the n MSBs of each pixel in the parcel, where n depends on the number of gray levels selected for matching (e.g. n = 3 bits for 8 gray levels). The RB-address is then generated from the full-key. If the RB size is chosen to be a prime number, high reuse rates will be gained using the mod function: RB-address = full-key mod RB-size. However, it is too expensive to implement the mod function in hardware. Instead, we select a subset of the full-key bits as an address for the RB, based on the RB size.

We tested different combinations of bits selected from the full-key as the RB-address. For a given RB size, from each pixel component in the full-key, we experimented with picking the middle bits, the left-hand-side bits, bits from the two ends, and XORing each pair of bits. The highest reuse rate, which is less than 5% lower than the mod-function result, is achieved by choosing at least one bit from each pixel component in the full-key, starting from the right-hand-side bit. For example, for a parcel of 8 pixels with 8 gray levels for matching, the full-key will have 24 bits. For a 64K-entry RB, we choose 2 bits from the right-hand side of each pixel's 3 bits of the full-key, which gives a 16-bit address (Figure 5). Our address generator yields a 5% to 15% higher reuse rate than a naive RB-addr-gen that simply chooses the LSBs of the full-key as the RB address (i.e. bits 0, 1, 2, ..., 15 of the full-key).

[Figure 5 diagram: the 24-bit full-key (bits 23..0; three bits per pixel, pixel7 down to pixel0) mapped to a 16-bit RB-address by taking two bits from each pixel's 3-bit field.]

Fig. 5. RB-address generator
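The full-key construction and bit-selection scheme can be illustrated as follows. The scheme (two right-hand bits of each pixel's 3-bit field, giving a 16-bit address) follows the example above; the ordering of the selected bits within the address is an assumption.

```python
def full_key(parcel, bits=3):
    """Concatenate the top `bits` MSBs of each 8-bit pixel.
    For 8 pixels and 8 gray levels (3 bits each) this gives 24 bits."""
    key = 0
    for p in parcel:
        key = (key << bits) | (p >> (8 - bits))
    return key

def rb_address(key, pixels=8, field=3, take=2):
    """Select the `take` right-hand bits of each pixel's field in the
    full-key, yielding a 16-bit address for a 64K-entry RB."""
    addr = 0
    for i in range(pixels):
        f = (key >> (i * field)) & ((1 << field) - 1)   # pixel i's 3-bit field
        addr |= (f & ((1 << take) - 1)) << (i * take)   # keep its 2 low bits
    return addr
```

An all-white parcel maps to the all-ones key and address, and an all-black parcel to zero.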

The RB is a dual-port memory, which performs read and write operations in the same clock cycle. If a new parcel generates the same address as the parcel that exits the core and intends to update the RB, a conflict will occur. Some memory architectures handle the conflict by allowing the read to be done before the write (or vice versa), while others generate corrupted data on the read port when a conflict happens. Our simulations show that the reuse rate will increase (by approximately 2%) if, in case of a conflict, either the read is performed before the write operation or a bypass path is added from the output of the core to RB-addr-gen. With the bypass path, if the new parcel matches the parcel that just exited the core, the new parcel will directly pick up the calculated result without trying the RB.

The RB must be updated only when the missed parcel has gone through the core and its result is produced. However, we found that the reuse rate can be increased by 1% to 2% by updating the RB at every clock cycle with the output of the core, regardless of whether the parcel exiting the core is in fact the missed parcel (parcel B) or the one that has directly entered the core (parcel A in Figure 3, right).

In order to build two parcels at each clock cycle, two pixels must be read from the memory that holds the image. To minimize the number of registers that hold the two parcels throughout the pipeline, we read two consecutive pixels from the image at each clock cycle, and therefore two overlapping parcels are created. For example, for 3×3 parcels, at each pipe stage, instead of 144 bits for two separate parcels, 96 bits are needed for two overlapping parcels.

The Kirsch edge detector requires 8 pixels out of the 9 in a 3×3 window to perform the calculations. Thus, to store a parcel in the RB, 8 × log2(GL) bits are required, where GL is the number of gray levels for matching, which was chosen to be 8. Therefore each parcel requires 3 Bytes to be stored in the RB. Each fifo element will also be 3 Bytes, since it carries the full-keys until the corresponding result is ready from the core. Afterward, the RB is updated with the full-key and its result. Therefore every memory entry contains 3 Bytes.
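The storage figures quoted in the last two paragraphs can be checked with a few lines of arithmetic:

```python
import math

# RB / fifo entry size: 8 pixels matched at GL = 8 gray levels
GL = 8
bits_per_parcel = 8 * int(math.log2(GL))   # 8 x log2(8) = 24 bits
bytes_per_entry = bits_per_parcel // 8     # = 3 Bytes per RB / fifo entry

# register saving from overlapping parcels (3x3 windows, 8-bit pixels)
two_separate_parcels = 2 * 9 * 8           # 144 bits
two_overlapping_parcels = 3 * 4 * 8        # 96 bits over the shared 3x4 region
```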

VI. RESULTS

We define the efficiency of design2 versus design1 as:

efficiency = speedup / sprawl    (1)

For our analysis, the maximum frequency and the number of parcels processed by the two designs are the same. Therefore, speedup is cycles1/cycles2 and sprawl is defined as area2/area1.

To calculate efficiency (Equation 1), we need to calculate the sprawl and speedup for window memoization. We compare our design (a 2-wide superscalar pipeline with window memoization) against a conventional design with a scalar pipeline. For sprawl, we calculate the area growth rate for FPGA cells as well as for the memory elements. A conventional scalar pipeline for image processing consists of three stages: make 1 parcel (mp1), core, and a memory (mem) to hold the image. A superscalar pipeline for window memoization contains the following stages: make 2 parcels (mp2), core, RB-addr-gen, lookup, RB, fifo, and a memory (mem) to hold the image. To calculate the total sprawl, one solution is to calculate the sprawl for FPGA cells and memory elements separately and then combine them by averaging. We take a more conservative approach by choosing the maximum of the sprawls as the total sprawl:

sprawl_total = max(sprawl_cell, sprawl_mem)    (2)
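As a quick illustration of Equation (1), plugging in the headline numbers from the abstract (an average speedup of 1.76 with 25% extra hardware, which we read as a sprawl of 1.25) yields an efficiency of about 1.41:

```python
speedup = 1.76       # average speedup reported in the abstract
sprawl = 1.25        # 25% extra hardware -> area2 / area1
efficiency = speedup / sprawl   # = 1.408, i.e. ~1.41
```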

We have implemented the conventional Kirsch and Kirsch with window memoization on an Altera APEX20KE. The maximum frequency for the two designs was 93.55 MHz. The hardware area, in terms of FPGA cells and memory elements, for the different portions of the two designs is presented in Tables III and IV.

Table III. FPGA cells for the two designs

mp1   mp2   core   RB-addr-gen + lookup
240   300   963    235

Table IV. Memory elements (Bytes) for the two designs

mem        RB             fifo
512 × 512  RB-depth × 3   core latency × 3

We ran the simulations for the 52 images of the 3 classes of detailedness, for different RB sizes ranging from 4 up to 64K entries (each entry = 3 Bytes). Figure 6 shows the average reuse rates versus the RB size for each class of detailedness.

[Figure 6 plot: reuse rate (%), 0 to 50, versus reuse buffer entries, 128 to 64K (log2 scale), for classes Low, Medium, and High.]

Fig. 6. Reuse rate vs. RB size

Figure 7 shows the average speedups and efficiencies versus the RB size for each class of detailedness. The reuse rate and speedup are proportional to the RB size. However, for the range of RB sizes used in our simulations, efficiency has a maximum value for each class of detailedness; increasing the RB size further will cause the efficiency to decrease. Table V shows the maximum efficiency values, the corresponding RB sizes, and the accuracy for each class of images. It must be mentioned that for a conventional 2-wide pipeline, both speedup and sprawl are 2, which means that the efficiency of such a design is always 1.

For each class of images, a different RB size yields the maximum efficiency. By averaging the efficiency values of the three classes, we found that the maximum efficiency for all 52 images is 1.41, with a 16K RB size and 98.28% accuracy.

One of the parameters that efficiency depends on is the size of the core (e.g. the Kirsch algorithm). Larger cores will benefit more from window memoization. Figure 8 shows how the core size affects the efficiency for the three classes of images. The minimum core size (FPGA cells) to gain

[Figure 7 plots: (a) speedup and (b) efficiency, both on a 1.0 to 2.0 scale, versus reuse buffer entries, 128 to 64K (log2 scale), for classes Low, Medium, and High.]

Fig. 7. Speedup and efficiency vs. RB size

Table V. Max efficiency, RB size and accuracy for each class of images

Class    Max efficiency   RB size     Accuracy
Low      1.50             8K or 16K   99.50%
Medium   1.41             16K         98.20%
High     1.29             16K         96.76%

Fig. 8. Efficiency vs. core size (efficiency from 0.8 to 2 against core size in FPGA cells, 128 to 32K on a log2 scale) for the low, medium, and high detailedness classes

an efficiency larger than 1 for the three classes are: low, 128; medium, 256; and high, 256.
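The trend in Figure 8 follows directly from defining efficiency as speedup divided by sprawl, where sprawl is the area of the memoized design relative to the original core: a reuse mechanism of roughly fixed size is amortized better by a larger core. The sketch below uses hypothetical cell counts; the 128-cell memoization overhead is illustrative, and the 1.76 speedup is the paper's reported average.

```python
def efficiency(speedup, core_cells, memo_cells):
    # Efficiency = speedup / sprawl, where sprawl is the area of the
    # memoized design (core + reuse mechanism) over the core alone.
    sprawl = (core_cells + memo_cells) / core_cells
    return speedup / sprawl

# A fixed-size reuse mechanism (hypothetical 128 cells) amortizes
# better over larger cores, so efficiency rises with core size:
for core in (128, 512, 2048):
    print(core, round(efficiency(1.76, core, 128), 2))
# prints: 128 0.88 / 512 1.41 / 2048 1.66
```

At a 25% area overhead (core 512, memo 128), the average 1.76 speedup gives an efficiency of 1.76 / 1.25 = 1.41, matching the reported figure; below a certain core size the overhead dominates and efficiency drops under 1, consistent with the minimum core sizes quoted above.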

Figure 9 shows the results of the Kirsch edge detector without window memoization (the conventional algorithm) and with window memoization for sample images. The accuracy of the results is so high that the differences between the two sets of edge maps can hardly be distinguished by the human eye.

Fig. 9. Edge maps of sample images generated by the conventional algorithm (top) and by the algorithm with window memoization (bottom)

VII. CONCLUSION

The goal of the work presented here was to develop a performance optimization technique (window memoization) and apply it to image processing algorithms. Our technique has been inspired by instruction reuse, which has been proposed for microprocessors but not yet implemented in real designs. We chose image processing as our target because we anticipated that image data has high value locality, which is a key factor for our technique. Moreover, real-time image processing faces serious challenges in design performance, which can be addressed effectively by our technique.

We implemented window memoization as a 2-wide superscalar pipeline with significantly less area compared to a conventional 2-wide superscalar pipeline. We discussed the key issues (e.g. multi-thresholding, the RB address generator, and the bypass path) that affect the performance improvement gained by window memoization. We examined the effect of the RB size on the reuse rate, speedup, and efficiency. We also found the maximum value of efficiency and the corresponding RB size for each class of images. Moreover, the effect of different core sizes on efficiency was explored. Our simulations showed an efficiency of up to 1.50 with less than 4% loss in accuracy.

We expect that any non-recursive convolution-based algorithm in the spatial domain can benefit from window memoization. However, the efficiency and result accuracy might not be identical for different algorithms, depending on the core size and whether or not the results are binary. We also anticipate that the window memoization technique can be modified to be applicable to hardware implementations of recursive image processing algorithms, where the pipeline throughput is limited by data dependencies among the parcels in the pipeline.

VIII. REFERENCES

[1] R. Kirsch, “Computer Determination of the Constituent Structure of Biological Images,” Computers and Biomedical Research, vol. 4, pp. 315–328, 1971.

[2] D. Michie, “Memo functions and machine learning,” Nature, vol. 218, pp. 19–22, 1968.

[3] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen, “Value locality and load value prediction,” in ASPLOS-VII, 1996, pp. 138–147.

[4] S. E. Richardson, “Exploiting trivial and redundant computation,” in ARITH 11, 1993, pp. 220–227.

[5] A. Sodani and G. S. Sohi, “Dynamic instruction reuse,” in ISCA, 1997, pp. 194–205.

[6] D. Citron, D. Feitelson, and L. Rudolph, “Accelerating multi-media processing by implementing memoing in multiplication and division units,” in ASPLOS-VIII, 1998, pp. 252–261.

[7] D. Citron and D. Feitelson, “Hardware memoization of mathematical and trigonometric functions,” Technical report, Hebrew University of Jerusalem, 2000.

[8] K. M. Kavi and P. Chen, “Dynamic function result reuse,” in ADCOM 11, 2003.

[9] J. Huang and D. J. Lilja, “Extending value reuse to basic blocks with compiler support,” IEEE Trans. on Computers, vol. 49, pp. 331–347, 2000.

[10] S. Chien, S. Ma, and L. Chen, “A partial-result-reuse architecture and its design technique for morphological operations,” in ICASSP, 2001, pp. 1185–1188.

[11] J. P. Shen and M. H. Lipasti, Modern Processor Design, McGraw-Hill, 2004.

[12] F. Khalvati, M. D. Aagaard, and H. R. Tizhoosh, “Accelerating image processing algorithms based on the reuse of spatial patterns,” in CCECE, 2007, pp. 172–175.

[13] F. Khalvati, H. R. Tizhoosh, and M. D. Aagaard, “Opposition-based window memoization for morphological algorithms,” in CIISP, 2007, pp. 425–430.

[14] A. Sodani and G. S. Sohi, “Understanding the Difference between Value Prediction and Instruction Reuse,” in Micro, 1998, pp. 205–215.