Processor Array Synthesis from Shift-Variant Deep Nested Do Loops
TRANSCRIPT
The Journal of Supercomputing, 24, 229-249, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Processor Array Synthesis from Shift-Variant Deep Nested Do Loops
SURIN KITTITORNKUN [email protected]
Department of Computer Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok 10520, Thailand
YU HEN HU [email protected]
Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706
Abstract. The consolidation of Internet devices into a universal/portable device will soon be accomplishable through the incorporation of reconfigurable computing in system-on-a-chip (SOC). At any particular moment, such a device could be a video/audio mobile phone, an MP3 player, or another appliance. The basic construct of these multimedia processing algorithms can be described as deep nested Do loops. They are considered the most demanding data-intensive algorithms and hence ideal candidates for an array of reconfigurable nanoprocessors. Therefore, an algorithm-to-hardware synthesis methodology is important for efficient exploitation of both spatial parallelism and temporal pipelining. In this paper, we propose a processor array synthesis methodology. It can map an n-level nested Do loop, represented by a nonuniform or shift-variant data dependence graph, to a near-optimal one- or two-dimensional processor array under the available resource constraints, to satisfy high-throughput computation demands.
Keywords: reconfigurable computing, FPGA, nested loop, systolic mapping, motion estimation
1. Introduction
In the near future, a reconfigurable SOC (system-on-a-chip) will enable a tremendous number of applications, such as software radio mobile phones [1], digital video/picture cameras, digital MP3 music players, personal digital assistants (PDAs), and so on, to be consolidated into a single wireless Internet device. The notion of reconfigurable computing (RC) is based on programmable logic device technology, such as field-programmable gate arrays (FPGAs). An RC architecture is a dynamically programmable array of nanoprocessors accelerating computation- and data-intensive tasks for a general-purpose control processor. Depending on the functional granularity of each nanoprocessor, it is capable of operating on either word-level [2], [3] or bit-level [4] data. Current commercial examples of reconfigurable SOC are the A7 Configurable SOC family from Triscend Corp. [5], and the CS2000 Reconfigurable Communications Processor from Chameleon Systems Inc. [6].
In a hardware/software codesign environment of embedded SOC, several synthesis tools try to convert the loop body into a pipelined hardware counterpart. For example, the Cameron project [7] is retargetable to many RC platforms. The algorithm must be written in a single-assignment C language. The tool can extract
the loop body to a data flow graph (DFG) and generate the corresponding hardware in VHDL. Pipelining the loop body can eventually enhance the computation throughput. However, spatial parallelism in the form of parallel processing offered by RC can be exploited further. Specifically, DG2VHDL [8] is a VHDL generator of the hardware counterpart of a nested loop algorithm, based on systolic array mapping [9].
Due to the identical functionality of each loop body, the mapping results in an array of processing elements (PEs) interconnected with localized and pipelined communication buses or wires. Systolic arrays can make the most of both spatial parallelism, as an array of identical PEs, and temporal pipelining, from ubiquitous on-chip registers. For instance, a high-speed 118-MHz FIR (finite impulse response) filter has been implemented on the Xilinx Virtex FPGA by Martinez et al. [10], as part of a front-end radar signal processor. Furthermore, the size of the configuration bitstream can be significantly reduced because of the identical functionality of each PE and the regular interconnect structures.
Another ideal candidate for processor array implementation is the demanding data-intensive multimedia processing expressed in nested Do loop algorithms. Many of these algorithms are characterized as uniform recurrence equations [11], for example, vector/matrix multiplication. Nevertheless, nonuniform recurrence equations do exist; such algorithms include full search block matching motion estimation [12]. In the past, a nonuniform or shift-variant dependence graph (DG) had to be uniformized [13] prior to systolic mapping. As a result, very few design methodologies [14] were developed. Since no systematic formulation of shift-variant DG mapping exists, optimization of its corresponding structurally time-varying signal flow graph (SFG) [9] has not yet been investigated.
Therefore, the contribution of this paper is the systematic formulation of processor array synthesis, consisting of shift-variant DG mapping and time-varying SFG optimization. Along with the synthesis process, a number of new objective functions and realistic constraints are included to heuristically discover near-optimal solutions. The methodology is targeted at fine-grained RC such as FPGAs. Nevertheless, the final array structure can fit into the framework of coarse-grained RC as well. Ultimately, the proposed methodology can be incorporated into reconfigurable SOC computer-aided design tools.
The rest of this paper is organized as follows. Based on the systolic array mapping in Section 2, the synthesis methodology for shift-variant DGs is developed in Section 3. Next, Section 4 illustrates an application of the proposed synthesis methodology using a 6D motion estimation algorithm in digital video coding. The paper is concluded in Section 5.
2. Systolic array and reconfigurable computing
In this section, we describe the motivation for our research. Inspired by the classical systolic array, this paper presents a methodology for mapping a shift-variant DG to a regular array structure of PEs. It overcomes the limitations of traditional systolic mapping while achieving high computing throughput.
A. Systolic array or space-time mapping
We will use the example of a matrix-matrix multiplication algorithm to illustrate the basic notations and formulations of systolic, or space-time, mapping.
Example 1 Matrix-matrix multiplication C = A × B:

    c_{i,j} = sum_{k=1}^{K} a_{i,k} b_{k,j},

where A = [a_{i,j}], B = [b_{i,j}], and C = [c_{i,j}] are matrices of appropriate dimensions. The corresponding nested loop algorithm can be expressed as:
Listing 1 Matrix–Matrix Multiplication
Do i = 1 to M
  Do j = 1 to N
    c[i,j] = 0
    Do k = 1 to K
      c[i,j] = c[i,j] + a[i,k] * b[k,j]
    EndDo k
  EndDo j
EndDo i
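Listing 1 translates directly into executable form; the following Python sketch (matrix sizes and values chosen here only for illustration, with 0-based indices) mirrors the triple loop:

```python
def matmul(A, B):
    """Triple nested loop of Listing 1: C = A x B."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):          # Do i
        for j in range(N):      # Do j
            for k in range(K):  # Do k
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```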
where i, j, and k are loop indices. Together, they form an (iteration) index space, where each point (i, j, k) within the loop bounds corresponds to a single execution of the loop body.
Let i_m be the loop index of the mth-level loop nest. We denote i = (i_1, i_2, ..., i_n)^t ∈ Z^n to be an n-dimensional (n-D) column index vector of an n-level nested Do loop, where Z is the set of integers and a^t denotes the transpose of a. Then, the n-D index space J^n can be expressed as:

    J^n = { i = (i_1, i_2, ..., i_n)^t | i_1, i_2, ..., i_n ∈ Z }    (1)
In this example, the loop body consists of a single recurrence equation,
    c[i,j] = c[i,j] + a[i,k] * b[k,j],
where a[i,k] and b[k,j] are input variables whose values are needed to execute this loop, and c[i,j] is an output variable whose value will be computed.
In Listing 1, the innermost k-loop is used to realize the summation of the K product terms a[i,k] * b[k,j], 1 ≤ k ≤ K. While c[i,j] is the final result, it is also used to store intermediate results before the last iteration. In other words, the same memory address designated to c[i,j] is assigned new values K times during the execution of the algorithm. In a single-assignment formulation [9], a set of new intermediate variables is introduced to store the intermediate results. As such,
every variable can be assigned a new value at most once during the execution of the algorithm.
In this example, the input variable a[i,k] will be used in each of the j loops, and b[k,j] will be used in each of the i loops. In particular, a[i,k] will be made available to iterations with indices {(i, j, k)^t | 1 ≤ j ≤ N}, and b[k,j] will be made available to iterations with indices {(i, j, k)^t | 1 ≤ i ≤ M}, in the index space J^3 defined in Equation (1). In a parallel computing platform, if different iterations are executed on different processors, these input variables, a and b, must be propagated or broadcast to different processors to facilitate the computation. The use of intermediate variables c[i,j,k] ensures that the single-assignment constraint is satisfied. With the introduction of these intermediate variables, every variable associated with a particular iteration will have the full set of indices. For example, the matrix-matrix multiplication loop body can now be rewritten as:
Listing 2 Single-Assignment Matrix–Matrix Multiplication
Do i = 1 to M
  Do j = 1 to N
    Do k = 1 to K
      a3[i,j,k] = a3[i,j-1,k]  if j > 0
                = a[i,k]       if j = 0
      b3[i,j,k] = b3[i-1,j,k]  if i > 0
                = b[k,j]       if i = 0
      c3[i,j,k] = c3[i,j,k-1] + a3[i,j,k] * b3[i,j,k]  if k > 0
                = 0                                    if k = 0
      c[i,j] = c3[i,j,k]  if k = K
    EndDo k
  EndDo j
EndDo i
In the above listing, a3 and b3 are the transmittal variables of a and b, respectively, and c3 is the computation variable of c. We may now define an inter-iteration dependence vector as the index difference between the output of each iteration (on the left-hand side of each equation) and its input (on the right-hand side of the equation). For example, the dependence vectors of variables a, b, and c are d_a = (0, 1, 0)^t, d_b = (1, 0, 0)^t, and d_c = (0, 0, 1)^t, respectively.
The plot of a graph representing the index points of the algorithm in the index space J^3, together with the dependence vectors, is called the dependence graph (DG) [9]. The DG of Listing 2 is plotted in Figure 1. In other words, the DG (also known as the iteration space data dependence graph [15]) is a graphical representation of the data dependencies among loop iterations of a nested Do loop. It consists of a set of nodes (vertices) and a set of edges. Each node corresponds to a loop index i ∈ J^n, or the innermost loop body, regardless of its complexity. Each directional edge represents either a propagation or a computation dependence vector.
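The single-assignment form of Listing 2 can be checked against Listing 1 in a few lines; in this Python sketch (0-based indices, with the j = 0, i = 0, and k = 0 branches serving as the input boundaries), the dictionaries a3, b3, and c3 carry full (i, j, k) keys so that each entry is written exactly once:

```python
def matmul_sa(A, B):
    """Single-assignment matrix multiply (Listing 2, 0-based indices)."""
    M, K, N = len(A), len(B), len(B[0])
    a3, b3, c3 = {}, {}, {}
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                # transmittal variables: a propagates along j, b along i
                a3[i, j, k] = a3[i, j - 1, k] if j > 0 else A[i][k]
                b3[i, j, k] = b3[i - 1, j, k] if i > 0 else B[k][j]
                # computation variable: accumulate along k
                acc = c3[i, j, k - 1] if k > 0 else 0
                c3[i, j, k] = acc + a3[i, j, k] * b3[i, j, k]
            C[i][j] = c3[i, j, K - 1]   # output when k reaches its bound
    return C

print(matmul_sa([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Every dictionary key is assigned once, which is precisely the single-assignment property the transmittal variables buy us.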
Figure 1. 3 × 3 matrix-matrix multiplication in a 3D dependence graph (a), and its node (loop body) (b).
A loop nest is called a set of uniform recurrence equations (URE) [11] if its dependence vectors are independent of the loop index i. In addition, a URE's loop bounds are known constants before the execution of the algorithm. Most of the data-intensive multimedia algorithms formulated as nested loops can be expressed as uniform recurrence equations.
In a URE-formulated algorithm, each loop has a set of uniform dependence vectors that can be described by a dependence matrix D_V,

    D_V = [ d_b  d_a  d_c ] =
          [ 1 0 0
            0 1 0
            0 0 1 ],
where V = {a, b, c} is the set of variables. Hence, the DG is called shift-invariant. Because of the shift-invariance property of the n-D DG, the task of scheduling and allocating (assigning) each index (loop body) to be executed on a particular processor at a particular clock cycle can be solved using algebraic projection. This projection is called systolic, or space-time, mapping. In other words, systolic mapping is limited to nested loop algorithms with uniform recurrence equations. The mapping matrix T [9] consists of a scheduling vector s and a processor allocation matrix P. In this example, s = (0, 0, 1)^t in (i, j, k) coordinates, and P = [ p_1 p_2 ... p_{n-1} ], where p_m is an n-element column vector, 1 ≤ m ≤ n - 1. T must be a nonsingular matrix so
that the mapping has no conflicts, i.e., T p ≠ T q whenever p ≠ q, for all p, q ∈ J^n:

    T = [ s^t ; P^t ] =
        [ 0 0 1
          0 1 0
          1 0 0 ].
The mapping of the dependence matrix D_V results in a delay-edge matrix,

    T D_V = [ s^t D_V ; P^t D_V ] = [ r_V ; e_V ]
          = [ r_b r_a r_c ; e_b e_a e_c ]
          = [ 0 0 1
              0 1 0
              1 0 0 ].    (2)
The scalar-valued delay, or register count, r_v is associated with an edge (interprocessor link) e_v = (x, y)^t. For example, the edge vector of variable c, e_c = (0, 0)^t, indicates a self-loop, where r_c = 1 indicates one register (delay element) to store the intermediate results, as shown in Figure 2.
Figure 2. A 2D processor array for 3 × 3 matrix multiplication, after mapping with scheduling vector s = (0, 0, 1)^t (D indicates a register or delay element).
In addition to the resulting processor array, a parallel execution schedule, t(J^n) ⊂ Z, and a processor allocation space, J^{n-1}, of the index space J^n can be obtained by

    T J^n = [ s^t J^n ; P^t J^n ] = [ t(J^n) ; J^{n-1} ].    (3)
In other words, T maps the execution of an index q ∈ J^n to processor P^t q at clock cycle t(q) = s^t q. It can be observed that the 3D DG has been projected directly along the k axis onto a 2D processor array with a 1D execution schedule.
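The mapping of this 3 × 3 example can be replayed mechanically; a small Python sketch (0-based indices) assigns each index (i, j, k) to a cycle and a processor, verifies conflict-freedom, and reproduces the delay-edge columns of Equation (2):

```python
# Space-time mapping of the 3x3 matrix-multiply DG (0-based indices):
# s^t = (0, 0, 1) schedules index (i, j, k) at cycle k; the allocation
# rows of P^t place it on processor (j, i).
s = (0, 0, 1)
P = ((0, 1, 0), (1, 0, 0))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

seen = set()
for i in range(3):
    for j in range(3):
        for k in range(3):
            q = (i, j, k)
            t = dot(s, q)                          # clock cycle s^t q
            pe = tuple(dot(row, q) for row in P)   # processor P^t q
            assert (t, pe) not in seen             # no computation conflict
            seen.add((t, pe))

# Delay-edge columns T d for d_b = (1,0,0), d_a = (0,1,0), d_c = (0,0,1):
# b -> r=0, e=(0,1); a -> r=0, e=(1,0); c -> r=1, e=(0,0) (self-loop).
for d in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
    print(dot(s, d), tuple(dot(row, d) for row in P))
```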
B. Systolic array vs. reconfigurable computing
The existing systolic mapping methodology is limited to shift-invariant DGs, as shown earlier. An unreasonable assumption that was always made is that input data are always available. For instance, this causes the number of input/output (I/O) ports to grow as a function of the problem size. Such a large number of I/O ports puts more pressure on the memory bandwidth needed to sustain the high computation throughput. This may be one of the many reasons that systolic arrays were not a commercial success.
In this paper, we assume that a modern multi-million-gate FPGA, such as the
Xilinx Virtex family [16, 17], can host a fine-grained reconfigurable processor array tailored to computation- and data-intensive algorithms. The Xilinx Virtex Portable Interface (XVPI) [18] has been developed to support run-time configuration. The resulting processor array can effectively exploit both loop-level (inter-iteration) pipelining and parallelism. Each processor will consist of word-level cores provided by FPGA manufacturers, such as the Xilinx Core Generator system. These cores are placed and routed to exploit the faster local interconnects.
3. Processor array synthesis
The synthesis of a processor array from a shift-variant DG can be decomposed into two major steps: mapping an n-D shift-variant DG to a 1D SFG (processor array), and optimizing the resulting 1D time-varying SFG. The advantages of 1D array mapping are threefold. First, the final 1D array can easily be adjusted to fit the chip area. Second, the I/O ports are already on the array boundary. Third, the array is easy to rearrange into a 2D array, which can be performed by the SFG optimization described next in Section 3B.
A. 1D processor array mapping
In contrast to the traditional systolic mapping [9], the integral mapping matrix T_1 [19] consists of a scheduling vector s and an allocation vector P,

    T_1 = [ s^t ; P^t ] = [ s_1 s_2 ... s_n
                            p_1 p_2 ... p_n ],    (4)
where s_i, p_i ∈ Z, i = 1, ..., n, and Z is the set of integers. Similar to the case of a shift-invariant DG, the mapping of the dependence matrix D_V and the index space J^n by T_1 also results in a delay-edge matrix similar to Equation (2),

    T_1 D_V = [ r_V ; e_V ] = [ r_{v1} r_{v2} ...
                                e_{v1} e_{v2} ... ],
where V = {v_1, v_2, ...} is the set of variables in the algorithm. Equivalent to Equation (3), the mapping yields

    T_1 J^n = [ s^t J^n ; P^t J^n ] = [ t(J^n) ; J^1 ],
where t(J^n), J^1 ⊂ Z are a 1D execution schedule and a 1D processor space, respectively.
The shift-invariant DG of the previous section can be extended to handle the shift-variant case [9] as follows. We denote by K^n an n-D shift-variant DG. In other words, K^n is a set of nodes, where each node is attributed with the following fields of information: n-D index, dependence matrix, input vector, output vector, and terminal vector. Thus, an n-D DG node k^n ∈ K^n is a tuple

    k^n : ( i, D_V, I_V, O_V, F_V ).

Resembling the C++ object-oriented programming language, we use "." to access a particular field. For example, k^n.i denotes an n-D index, k^n.i = (j_1, ..., j_n)^t, l_i ≤ j_i ≤ u_i, where l_i, u_i ∈ Z, i = 1, 2, ..., n, are the lower and upper bounds of the ith loop, respectively. The dependence matrix k^n.D_V = [ d_{v1} d_{v2} ... ] is associated with each node k^n ∈ K^n. The field k^n.I_V is a row vector of input data, k^n.I_V = [ i_{v1} i_{v2} ... ]. Similarly, k^n.O_V is a row vector of output data, k^n.O_V = [ o_{v1} o_{v2} ... ]. Moreover, the terminal vector is a row vector specifying whether the node is the I/O terminal for each variable v_i ∈ V: k^n.F_V = [ f_{v1} f_{v2} ... ], where f_{vi} ∈ {0, 1}. Each f_{vi} = 1 indicates that node k^n is variable v_i's I/O terminal. After applying the mapping matrix T_1, K^n is mapped to a 1D array K^1 by

    K^1 = T_1(K^n),
where T_1(·) denotes the 1D array mapping defined in Equation (4).
The 1D, or linear, processor array representation K^1 captures all the information in terms of PE location and a schedule of delay, edge (interconnect), and input/output/terminal vectors. A processor k^1 ∈ K^1 is a tuple

    k^1 : ( J^1, t(i), r_V, e_V, I_V, O_V, F_V ).

Input/output/terminal vectors are allocated to a PE and organized as an array indexed by k^1.t(i). Each processor k^1 at PE location k^1.J^1 is the resulting projection of a set of n-D DG nodes, { k^n | k^1.J^1 = P^t (k^n.i), ∃ k^n ∈ K^n }, where PE_min ≤ k^1.J^1 ≤ PE_max and

    PE_max = max{ P^t q | q ∈ J^n },
    PE_min = min{ P^t q | q ∈ J^n }.
A.1 Mapping objective functions.
• Number of execution cycles (N_cycle): The total parallel execution time, t_total, can be computed by

    t_total = N_cycle × t_cycle,

where t_cycle is the PE's longest critical propagation delay, depending solely on the PE architecture and interconnect topology. N_cycle, on the other hand, is the duration (in clock cycles) between the first and the last computation indices [9]:

    N_cycle = max_{p, q ∈ J^n} { s^t (p - q) } + 1.    (5)

• Number of PEs (N_PE): the number of distinct projections in J^1,

    N_PE = PE_max - PE_min + 1.    (6)
It can be observed that N_cycle and N_PE can be expressed in closed form in terms of the scheduling vector s and the allocation vector P only. An optimal processor array based on N_cycle or N_PE, or both, can be sought subject to the following constraints.
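For small loop bounds, both objectives can be evaluated by brute force over the index space; a Python sketch (the candidate s, P, and bounds below are illustrative only):

```python
from itertools import product

def evaluate(s, P, bounds):
    """N_cycle (Eq. 5) and N_PE (Eq. 6) for the 1D mapping T1 = [s; P].

    bounds is a list of inclusive (lower, upper) loop bounds."""
    times, pes = [], []
    for q in product(*[range(l, u + 1) for l, u in bounds]):
        times.append(sum(si * qi for si, qi in zip(s, q)))  # s^t q
        pes.append(sum(pi * qi for pi, qi in zip(P, q)))    # P^t q
    return max(times) - min(times) + 1, max(pes) - min(pes) + 1

# 3-level example: schedule along k, allocate by i + j, 1 <= i,j,k <= 3
print(evaluate((0, 0, 1), (1, 1, 0), [(1, 3)] * 3))  # (3, 5)
```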
A.2 Mapping constraints. The space-time mapping can be viewed as an optimization problem whose constraints arise from both the available resources and computation conflicts.
• Available resources: Interprocessor communication is constrained by the available routing resources to be within a maximum radius e_max. Likewise, pipelining is subject to the number of available registers r_max. As a result, the optimal solutions must satisfy

    r_v ≥ 0,
    e_{v,min} ≤ |e_v| ≤ e_max,
    sum_{v ∈ V} r_v ≤ r_max,    (7)
    Rank(T_1) = 2,
    ∀ v ∈ V,

where r_v and e_v are the delay and edge of variable v, respectively. The inequality r_v ≥ 0 is also known as the causality constraint [9], [19]. Regardless of the PE's complexity, the length of an edge, |e_v|, must be greater than or equal to the minimum link length e_{v,min}.
• Computation conflicts: Due to insufficient rank of T_1, a conflict occurs when T_1 p = T_1 q, p ≠ q, ∃ p, q ∈ J^n. In other words, a pair of iteration indices may be mapped to the same PE at the same cycle. To detect computation conflicts efficiently, a 2D integer array A with N_cycle rows and N_PE columns can be used to determine whether

    no computation conflict  <=>  A[T_1 q] ≤ 1, ∀ q ∈ J^n,    (8)

where each array element A[s^t q, P^t q] is initialized to zero and incremented once for every index q ∈ J^n mapped onto it. After this single pass, a computation conflict is detected if any array element is greater than one. Hence, the worst-case time complexity of this method is O(N^n), whereas that of Lee and Kedem [19] is O(N^{2n}). In addition, no communication conflict [19], [20] is introduced by this methodology, because the threads of execution are not allowed to interfere with one another.
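The conflict check of Equation (8) amounts to a single pass over J^n with one counter per (cycle, PE) pair; a Python sketch (a dictionary stands in for the 2D array A, and the example mappings are illustrative only):

```python
from itertools import product

def has_conflict(s, P, bounds):
    """Detect a computation conflict (Eq. 8): two indices mapped to the
    same PE at the same cycle. One pass over J^n, i.e., O(N^n) work;
    a dictionary stands in for the N_cycle x N_PE counter array A."""
    A = {}
    for q in product(*[range(l, u + 1) for l, u in bounds]):
        key = (sum(si * qi for si, qi in zip(s, q)),   # cycle s^t q
               sum(pi * qi for pi, qi in zip(P, q)))   # PE    P^t q
        A[key] = A.get(key, 0) + 1
        if A[key] > 1:
            return True
    return False

bounds = [(1, 3)] * 3
print(has_conflict((0, 0, 1), (1, 1, 0), bounds))  # True: (1,2,k) and (2,1,k) collide
print(has_conflict((1, 3, 9), (1, 1, 0), bounds))  # False: conflict-free
```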
A.3 Heuristic search. From the definition of T_1 in Equation (4), each of its elements can be searched within a candidate set, for example, {0, ±1, ±l_i ± 1, ±u_i ± 1, ±u_i u_j}, where l_i and u_j ∈ Z are the lower bound of the ith loop and the upper bound of the jth loop, respectively. Because the evaluations of candidate matrices T_1 are independent of one another, the search iterations can be distributed over a network of workstations and the performance evaluations compared at the end. From another perspective, the problem size can be scaled down to reduce the computational complexity.
In the next design step, the time-varying SFG representation is formulated based on the information available from the 1D array K^1.
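A minimal version of this search, in Python: enumerate the entries of s and P from a small candidate set, discard matrices that violate causality, rank, or the conflict constraint, and keep the lowest N_cycle × N_PE. The candidate set, bounds, and dependence vectors below are illustrative only, not those of any particular benchmark:

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search_T1(bounds, deps, candidates):
    """Exhaustive search for a feasible 1D mapping T1 = [s; P] that
    minimizes N_cycle * N_PE (a stand-in for Eqs. (5)-(8))."""
    n = len(bounds)
    points = list(product(*[range(l, u + 1) for l, u in bounds]))
    best, best_cost = None, None
    for s in product(candidates, repeat=n):
        for P in product(candidates, repeat=n):
            # causality: every dependence needs delay r_v = s^t d >= 0
            if any(dot(s, d) < 0 for d in deps):
                continue
            # Rank(T1) = 2: reject s and P that are linearly dependent
            if all(s[i] * P[j] == s[j] * P[i]
                   for i in range(n) for j in range(n)):
                continue
            st = [dot(s, q) for q in points]
            pe = [dot(P, q) for q in points]
            if len(set(zip(st, pe))) < len(points):
                continue  # computation conflict (Eq. 8)
            cost = (max(st) - min(st) + 1) * (max(pe) - min(pe) + 1)
            if best_cost is None or cost < best_cost:
                best, best_cost = (s, P), cost
    return best, best_cost

deps = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # matrix-multiply dependence vectors
best, cost = search_T1([(1, 3)] * 3, deps, (0, 1, 2, 3))
print(best, cost)  # cost 27: all 27 index points packed without idle slots
```

Since each (s, P) candidate is evaluated independently, the two outer loops are exactly what would be sharded across a network of workstations.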
B. Time-varying signal flow graph optimization
In this subsection, a time-varying signal flow graph (SFG) is formulated. The graph (PE array) is then optimized to satisfy an additional set of objective functions subject to the available resource constraints in Equation (7).
A time-varying SFG, G = (W, L), is a directed graph consisting of a set of nodes, or processors, W = { w_i | 0 ≤ i < N_PE }, and a set of links, or interconnects, L = { L_v | ∀ v ∈ V }. An SFG node w ∈ W encapsulates the following information, available from k^1 ∈ K^1:

    w : ( Z2, t(i), I_V, O_V, F_V ),

where Z2 represents the 2D Cartesian coordinate, to support the 1D-to-2D transformation, and I_V, O_V, and F_V form a schedule of I/O data and terminal vectors. The initial
horizontal coordinate of an SFG node is assigned from its node number as follows: w_i.Z2 = (x, y) = (i, 0), where i = k^1.J^1 - PE_min. On the other hand, L_v represents the set of variable v's links, where each link l_v ∈ L_v is attributed with source/destination nodes and a schedule of delays and edges,

    l_v : ( Z2, t(i), r_V, e_V ).

A link of variable v connecting SFG nodes w_i and w_j at clock cycle τ is denoted by its Z2 field as l_v.Z2[τ] = (i, j), where j = i + l_v.e_v[τ], w_i, w_j ∈ W, τ ∈ w_i.t(i).
B.1 Assumptions. The SFG representation is close to its physical implementation in several senses. Therefore, its optimization should be based on the following assumptions. First, each SFG node (PE) is a unit-wide, unit-long square box regardless of its complexity. Second, every link is routed between the centers of its source and destination nodes, so that the distance between two adjacent PEs is one unit; link length follows the Manhattan distance. Third, long pipelined interconnects are supported by current FPGA technology, such as the Xilinx Virtex-II [17]. Finally, input values that can be generated or stored locally, e.g., 0, ±∞, etc., are considered virtual data.
B.2 Objective functions. Based on the previous assumptions, the SFG optimization remains subject to the remaining registers and the maximum length of available wires. The current objective, however, is to minimize the following cost functions.
• Number of v’s I/O Ports: The set of SFG nodes performing I/O at cycle 4 is
IOv�4� ={2i � 2i�fv 4� = 1� 4 ∈ 2i�t��i�� 2i ∈ W
}� (9)
Therefore, the number of variable v’s I/O ports is given by
#IOv = max4∈t�Jn�
�IOv�4�� (10)
where |IO_v(τ)| denotes the size of IO_v(τ). Because of the limited number of I/O pins on the reconfigurable fabric, one of the goals is to minimize the number of I/O ports and keep them on the boundary of the array, using a procedure called I/O interlacing, which makes use of virtual or known input data.
• Memory bandwidth: The memory bandwidth associated with variable v, B_v, is the number of I/O instances via a particular input/output port per unit time; here, the unit time is a clock cycle. From Equation (9), B_v is the total number of input/output occurrences averaged over N_cycle,

    B_v = ( sum_{τ ∈ t(J^n)} |IO_v(τ)| ) / N_cycle.    (11)
Figure 3. Processor array synthesis: shift-variant DG mapping and time-varying SFG optimization.
• Interconnect: Interconnection is another limited resource in RC, so it must be utilized efficiently. The total length of variable v's interconnections is denoted by

    |L_v| = sum_{l_v ∈ L_v} |l_v.e_v|,    (12)

where |l_v.e_v| = max_{τ ∈ l_v.t(i)} |l_v.e_v[τ]| denotes the length of each of variable v's links. Minimizing the total interconnection length can often be achieved by a geometric transformation from a 1D to a 2D array.
C. Summary
The proposed processor array synthesis can be decomposed into two major steps: mapping an n-D shift-variant DG to a 1D time-varying SFG (processor array), and optimizing the 1D processor array according to realistic objective functions, as shown in Figure 3. Although the space-time mapping step is the same as in the shift-invariant case, the resulting time-varying SFG makes the problem more challenging, and it had not previously been formulated. From another viewpoint, a shift-variant DG should be mapped in such a way that fewer time-varying properties are present.
In the next section, an application of our design methodology is demonstrated.
4. Application: Digital video coding
In digital video coding standards, such as MPEG-1, MPEG-2, and ITU H.26x, a sequence of pictures is coded by either block-based discrete cosine transform (DCT) coding or motion-compensated differential coding. The motion-compensated frame is reconstructed from motion-estimated blocks of pixels. Every pixel in each block is assumed to be displaced by the same 2D displacement, called the motion vector. Motion vectors are obtained by the so-called motion estimation (ME) algorithm, which is one of the most time-consuming tasks in digital video coding.
A. Full-search block matching motion estimation (FSBM ME)
As a major step toward a systematic approach, Yeo and Hu [12] formulated a six-level nested Do loop algorithm to represent single-frame, multiple-block motion
estimation. A similar work by Chen and Kung [21] treated the DG as a shift-invariant one. Their methodology is based solely on multiprojection [9]. Multiprojection is essentially a series of space-time mappings that lowers the dimension of the SFG step by step (from a 3D DG to a 2D SFG, and from a 2D SFG to a 1D SFG) while maintaining a 1D time schedule. It is easy and intuitive for shallow nested Do loops, but rather tedious and error-prone for deep nested Do loops: the deeper the loop nest, the higher the mapping complexity. Furthermore, the solution may be far from optimal, due to unknown constraints on the intermediate dimensions prior to reaching the final SFG dimension.
A typical video frame consists of N_h × N_v square blocks of pixels, where N_h
is the number of blocks in each row, and N_v is the number of rows in each frame. Using FSBM ME, the motion vector (m, n) of a block of N × N pixels that yields the minimum mean absolute distortion (MAD) between the current block and the (2p + 1)^2 candidate blocks in the search area can be obtained as MV = arg min MAD[m, n] - (p, p), 0 ≤ m, n ≤ 2p, where p is the search range in pixels and is usually less than or equal to N. The corresponding MAD of a candidate (m, n) is

    MAD[m, n] = (1 / N^2) sum_{i=0}^{N-1} sum_{j=0}^{N-1} | x[i, j] - y[i + m - p, j + n - p] |,    (13)
where x[i, j] and y[i, j] are the luminance pixels of the current and previous frames, respectively. In practice, the scaling factor 1/N^2 in Equation (13) is omitted. The six-level nested Do loop, MAD-based FSBM algorithm is shown in Figure 4, where Dmin is the minimum distortion measured using MAD.
After converting this algorithm to the single-assignment form shown in Figure 5, a number of subscripted variables are introduced to distinguish the original variables from the new 6D ones. Note that this formulation is one of many possible ones, owing to the large number of times a particular pixel is reused. The dependence vectors of MV are omitted because they are identical to those of Dmin. As a result, the 6D DG of the previous-frame pixels, y6, is portrayed in Figure 6. This DG is obviously shift-variant; in other words, the dependence vectors vary from node (iteration index) to node.
Figure 4. Six-level nested Do loop FSBM motion estimation algorithm.
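Behaviorally, the loop nest of Figure 4 is a plain full search over the (2p + 1)^2 candidates; a reference Python sketch for a single block (0-based loops, the 1/N^2 factor omitted as in practice, and the frame assumed large enough that all candidate accesses are in bounds):

```python
def fsbm(x_block, y_frame, bx, by, N, p):
    """Full-search block matching for the N x N block whose top-left
    pixel is at (bx, by) in the current frame.

    Returns the motion vector (m - p, n - p) minimizing the unscaled
    sum of absolute differences of Eq. (13), together with Dmin."""
    best_mv, d_min = None, float("inf")   # Dmin initialized to infinity
    for m in range(2 * p + 1):            # vertical candidates
        for n in range(2 * p + 1):        # horizontal candidates
            mad = 0
            for i in range(N):
                for j in range(N):
                    y = y_frame[by + i + m - p][bx + j + n - p]
                    mad += abs(x_block[i][j] - y)
            if mad < d_min:
                d_min, best_mv = mad, (m - p, n - p)
    return best_mv, d_min

# toy data: the current block equals the previous frame shifted by (1, 0)
y_prev = [[10 * r + c for c in range(4)] for r in range(4)]
x_cur = [[y_prev[2 + i][1 + j] for j in range(2)] for i in range(2)]
print(fsbm(x_cur, y_prev, bx=1, by=1, N=2, p=1))  # ((1, 0), 0)
```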
Figure 5. Single-assignment, 6-level nested Do loop FSBM motion estimation algorithm.
As a simple illustration, we let N_v = 3, N_h = 2, and p = N/2 = 1. The 6D index space J^6 is {(h, v, m, n, i, j)^t | 0 ≤ h < N_h, 0 ≤ v < N_v, 0 ≤ m, n ≤ 2p, 0 ≤ i, j < N}. Owing to limited space, certain aspects of this shift-variant DG mapping are elaborated as if it were shift-invariant. From Figure 5, each node of K^6 may possess one or both dependence vectors of x, where

    D_x = [ 0 0 0 1   0 0 ;
            0 0 1 -2p 0 0 ]^t.
Similarly, the dependence matrices of y and MAD are

    D_y = [ 0 0 1 0 -1 0 ;
            0 0 0 1 0 -1 ]^t

and

    D_MAD = [ 0 0 0 0 1 -N+1 ;
              0 0 0 0 0 1 ]^t,
respectively. According to the algorithm in Figure 5, the motion vector MV and the minimum distortion Dmin share the same dependence matrix,

    D_MV = D_Dmin = [ 0 0 0 1   0 0 ;
                      0 0 1 -2p 0 0 ]^t.
Figure 6. 6D dependence graph of the previous-frame pixels y, Nv = 3, Nh = 2, p = N/2 = 1.
In summary, the dependence matrix D_V of the 6-level nested Do loop motion estimation can be expressed as

    D_V = [ D_x | D_y | D_MV | D_MAD ].
B. Constraints set #1
The search for T_1 is first conducted for minimum N_cycle and N_PE with no resource constraints, to determine whether any solution exists at all, as follows.
Algorithm 1 Find T_1 in Equation (4) by enumerating s_m, p_m ∈ {0, ±1, u_i ± 1, (u_i ± 1)(u_j ± 1), (u_i ± 1)(u_j ± 1)(u_k ± 1)}, i, j, k, m = 1, 2, ..., n, such that both N_cycle and N_PE are minimized, subject to

    r_v ≥ 0,
    |e_v| ≥ 0,
    Rank(T_1) = 2,
    ∀ v ∈ {x, y, MAD, Dmin}.
Eventually, the space-time mapping matrix T_{1,#1} for mapping the 6D index space J^6 to a 1D PE array is

    T_{1,#1} = [ s^t ; P^t ] = [ N^2  N_h N^2  2p+1  1  N  1 ;
                                 0    0        2p+1  1  0  0 ].
Its mapping yields the delay-edge matrix

    [ r_V ; e_V ] = [ 1  1  2p+1-N  0  1  1  1  1 ;
                      1  1  2p+1    1  1  1  0  0 ].
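This delay-edge matrix can be reproduced mechanically by multiplying T_{1,#1} with each dependence column; a Python check for concrete values (here N = 2, N_h = 2, p = 1 are illustrative sizes only, with columns ordered x, y, MV, MAD as in D_V):

```python
N, Nh, p = 2, 2, 1   # illustrative sizes only

s = (N * N, Nh * N * N, 2 * p + 1, 1, N, 1)   # scheduling vector s^t
P = (0, 0, 2 * p + 1, 1, 0, 0)                # allocation vector P^t

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

D = {  # dependence columns, ordered as D_V = [Dx | Dy | DMV | DMAD]
    "x":   [(0, 0, 0, 1, 0, 0), (0, 0, 1, -2 * p, 0, 0)],
    "y":   [(0, 0, 1, 0, -1, 0), (0, 0, 0, 1, 0, -1)],
    "MV":  [(0, 0, 0, 1, 0, 0), (0, 0, 1, -2 * p, 0, 0)],
    "MAD": [(0, 0, 0, 0, 1, -N + 1), (0, 0, 0, 0, 0, 1)],
}
for v, cols in D.items():
    for d in cols:
        print(v, dot(s, d), dot(P, d))   # delay r_v and edge e_v
```

With N = 2 and p = 1, the printed delays are (1, 1, 1, 0, 1, 1, 1, 1) and the edges (1, 1, 3, 1, 1, 1, 0, 0), matching the symbolic entries 2p+1-N = 1 and 2p+1 = 3 above.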
Because T_{1,#1} uses minimal resources in terms of on-chip storage, and shows desirable properties such as high processor utilization and a low I/O port count, it is considered an optimal solution under the current formulation and constraints.
As shown in Figure 7, the resulting 1D SFG is structurally time-varying: an SFG node represents a PE, an edge is a bus, and D is a delay element (pipeline register). Each 8-bit current-frame pixel, x, is input and propagated from left to right. Each 8-bit previous-frame pixel, y, either propagates or enters the array at certain PEs at certain clock cycles. Each MV's 16-bit MAD is accumulated every cycle in each PE. The final MAD is compared with Dmin every N^2 = 4 cycles and propagated/pipelined from left to right, as is x. It can be noticed that only the propagation of y changes from clock cycle to clock cycle. After a multi-pass time-varying SFG optimization to achieve the satisfactory objective functions described in Section 3-B.2, the final array is equivalent to that of [22].
Figure 7. 1D time-varying signal flow graph of nine PEs (0–8) with delay registers D, inputs x and y, and output MV: (a) at τ = 8 with y inputs y13, y43; (b) at τ = 9 with y inputs y14, y43, y51; τ ∈ t(J6).
C. Constraints set #2
As indicated earlier in Section 3A, T1 must be searched to obtain a final 1D array that satisfies numerous constraints. The previous solution T1,#1 was unable to reuse the pixels y(i, j) efficiently, and thus resulted in two y input ports. With the same values of Ncycle and NPE, our current objective is to equalize the bandwidths of the current-frame pixel x and the previous-frame pixel y, based on the new single-assignment algorithm and a new set of constraints:
Algorithm 2. Find T1 in Equation (4) and optimize the resulting 1D processor arrays, according to Figure 3, such that By/Bx = 1 subject to

1 ≤ rv ≤ 2
0 ≤ |ev| ≤ 1
Ncycle = 256
NPE = 25
IOx = 1
IOy = 1
Rank(T1) = 2
∀ v ∈ {x, y, MAD, Dmin}.
Thus,

T1,#2 =
  [ N^2   NhN^2   2p+1   2   N   1 ]
  [  0      0     2p+1   1   0   0 ]          (14)

was obtained.
Figure 8. 2D full search block matching motion estimation array of 25 PEs (PE 0–PE 24), N = 4, p = N/2 = 2.
As depicted in Figure 8, the 1D array is optimized to become 2D, where the smaller squares indicate the input ports to the array. The schedules of the current-frame pixel x(i, j) and the previous-frame pixel y(i, j) at τ = 35, 36, and 37 are listed in Figure 9 (a), (b), and (c), respectively. Each x(i, j) propagates from the left PE to the right PE every two cycles, and to the next row every five cycles in the first column only. The propagation of Dmin follows a similar pattern, and the final result is the minimum over all rows. The previous-frame pixels y(i, j) propagate downward every cycle, and to the right PE in the first row only. Although two y(i, j) ports are drawn in the figure, each y(i, j) is fetched alternately through one of the two ports and reused throughout the entire array.
D. Discussion
Table 1 quantitatively compares 2D arrays in terms of search range, NPE, 1/throughput (cycles per block), memory bandwidth ratio By/Bx, and fan-out (number of capacitive outputs). We are able to achieve a unit By/Bx bandwidth ratio. It can be noticed that our By/Bx ratio is superior to the others, at the cost of a few more registers that compensate for the huge demand on y's memory bandwidth. The lower the demand on memory bandwidth, the less pressure on the on- and off-chip
Figure 9. Schedules of full search block matching motion estimation: (a) at cycle τ = 35, (b) τ = 36, and (c) τ = 37 (Nh = 3, Nv = 3, N = 4, p = N/2 = 2), τ ∈ t(J6).
memory becomes. In contrast to that of Chen and Kung [21], the #2 array eliminates both redundant memory fetches and additional pixel caches, while keeping the number of memory (I/O) transactions, and hence the power consumption, at the same level.
E. Hardware implementation
To verify the figure of merit of this ME array, the PE was implemented using subtractors, adders, and other components from the Xilinx Core Generator System.
Table 1. Performance comparison of 2D FSBM ME arrays, N = 4, p = N/2 = 2

Type          Search range   NPE   1/Throughput     Registers   By/Bx   Fan-out
                                   (cycles/block)   (bytes)
[23] AB2         −2/+2        16        40              94        10        0
[24] Type 1      −1/+1        16        40             180         2      100
[25]             −2/+2        16        65              94         4        0
[12]             −2/+1        16        16              64         2       16
[21]             −2/+2        16        25             200         1        4
[22]             −2/+2        25        16             152         2        8
Ours #1          −2/+2        25        16             125         2        6
Ours #2          −2/+2        25        16             164         1        0
Despite the absence of manual optimization, the minimum propagation delay along the critical path is 25 nanoseconds in the Xilinx XCV800-4 Virtex FPGA prototyping system. As a result, the maximum clock frequency estimated by the Xilinx Timing Analyzer is 40 MHz. At this clock speed, the array can achieve up to 96.5 frames per second at CCIR Rec. 601 (720 × 576) resolution. Moreover, the complexity of each PE is fairly low in both cases: the equivalent gate count is 1,500–2,000 gates per PE. A rough estimate for an ME array with an N × N = 8 × 8 block size and p = N/2 = 4 is 200,000 gates. At 40 MHz, the 9 × 9-PE array delivers up to 3.34 GOPS (giga operations per second) of sum-of-absolute-difference computation.
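The quoted frame rate follows directly from the 16 cycles per 4 × 4 block listed in Table 1. A quick arithmetic check, with Python used purely as a calculator, assuming one pass over all blocks of a 720 × 576 frame with no extra overhead:

```python
clock_hz = 40e6                               # Timing Analyzer estimate
cycles_per_block = 16                         # 1/throughput of arrays #1, #2
blocks_per_frame = (720 // 4) * (576 // 4)    # CCIR 601 frame, N = 4 blocks
cycles_per_frame = blocks_per_frame * cycles_per_block
fps = clock_hz / cycles_per_frame             # about 96.45 frames per second
```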
5. Conclusion
Wireless Internet devices with application-on-demand capability will eventually become feasible thanks to the reconfigurable system-on-a-chip. Most of the popular applications involve some form of multimedia processing, whose basic construct is simply a nested Do loop. Therefore, nested-Do-loop-to-reconfigurable-array synthesis is crucial to the underlying architecture. To exploit both the spatial parallelism of the nanoprocessor array and temporal pipelining, the proposed methodology can serve as a general nested Do loop mapping computer-aided design tool.

Our methodology can map deep nested loop algorithms to 1D and 2D processor arrays. Essentially, it can deal with shift-variant dependence graphs and the corresponding structurally time-varying signal flow graphs, neither of which had been formally investigated before. Subject to various realistic design constraints, the methodology can heuristically discover near-optimal solutions according to several criteria, including the number of clock cycles, the number of processing elements, memory bandwidth, the number of I/O ports, and so on. An example of full-search block matching motion estimation, one of the most computation- and data-intensive tasks in digital video coding, has been elaborated.

From a compiler perspective, space-time mapping of an n-level nested Do loop is similar to loop unrolling with known loop bounds. The difference lies in how the unrolled loop is executed by a number of processors in a parallel and pipelined fashion. Time-varying signal flow graph optimization, on the other hand, comprises a number of procedures that regularize the execution flow so that the array is suitable for a reconfigurable architecture.
References
1. M. Cummings and S. Haruyama. FPGA in the software radio. IEEE Communications Magazine, 37(2):108–112, 1999.
2. E. Mirsky and A. DeHon. MATRIX: A reconfigurable computing architecture with configurable instruction distribution and deployable resources. In Proc. IEEE Symposium on FPGAs for Custom Computing Machines, pp. 157–166, 1996.
3. H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Trans. on Computers, 49(5):465–481, 2000.
4. T. Nishitani. An approach to a multimedia system on a chip. In IEEE Workshop on Signal Processing Systems, pp. 13–21, 1999.
5. Triscend Corporation. Mountain View, CA. http://www.triscend.com.
6. Chameleon Systems Inc. San Jose, CA. http://www.chameleonsystems.com.
7. R. Rinker, M. Carter, A. Patel, M. Chawathe, C. Ross, J. Hammes, W. A. Najjar, and W. Bohm. An automated process for compiling dataflow graphs into reconfigurable hardware. IEEE Trans. on Very Large Scale Integration, 9(1):130–139, 1999.
8. A. Stone and E. S. Manolakos. DG2VHDL: To facilitate the high level synthesis of parallel processing array architectures. J. of VLSI Signal Processing, 24(1):99–120, 2000.
9. S. Y. Kung. VLSI Array Processors. Prentice Hall, Englewood Cliffs, New Jersey, 1988.
10. D. R. Martinez, T. J. Moeller, and K. Teitelbaum. Application of reconfigurable computing to a high performance front-end radar signal processor. J. of VLSI Signal Processing, 28(1/2):65–83, 2001.
11. R. M. Karp, R. E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. J. ACM, 14(3):563–590, 1967.
12. H. Yeo and Y. H. Hu. A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Trans. on Circuits and Systems for Video Technology, 5(5):407–416, 1995.
13. V. Van Dongen and P. Quinton. Uniformization of linear recurrence equations: A step towards the automatic synthesis of systolic arrays. In Proceedings of the International Conference on Systolic Arrays, pp. 473–482, 1988.
14. F. M. El-Hadidy and O. E. Herrmann. Generalized methodology for array processor design of real-time systems. In IEEE Asia-Pacific Conference on Circuits and Systems, pp. 145–150, 1994.
15. M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.
16. Xilinx Inc. Virtex 2.5V field programmable gate arrays. May 2000.
17. Xilinx Inc. Virtex-II 1.5V field programmable gate arrays. Jan. 2000.
18. P. Sundararajan and S. A. Guccione. XVPI: A portable hardware/software interface for Virtex. In Proceedings of SPIE, 4212:90–65, 2000.
19. P. Lee and Z. M. Kedem. Synthesizing linear array algorithms from nested for loop algorithms. IEEE Trans. on Computers, 37(12):1578–1598, 1988.
20. W. Shang and J. A. B. Fortes. On time mapping of uniform dependence algorithms into lower dimensional processor arrays. IEEE Trans. on Parallel and Distributed Systems, 3(3):350–363, 1992.
21. Y.-K. Chen and S. Y. Kung. A systolic methodology with applications to full-search block matching architectures. J. of VLSI Signal Processing, 19(1):51–77, 1998.
22. S. Kittitornkun and Y. H. Hu. Frame-level pipelined motion estimation array processor. IEEE Trans. on Circuits and Systems for Video Technology, 11(2):248–251, 2001.
23. T. Komarek and P. Pirsch. Array architectures for block matching algorithms. IEEE Trans. on Circuits and Systems, 36(10):1301–1308, 1989.
24. L. D. Vos and M. Stegherr. Parameterizable VLSI architectures for the full-search block-matching algorithm. IEEE Trans. on Circuits and Systems, 36(10):1309–1316, 1989.
25. C. H. Hsieh and T. P. Lin. VLSI architecture for block-matching motion estimation algorithm. IEEE Trans. on Circuits and Systems for Video Technology, 2(2):169–175, 1992.