
ORIGINAL PAPER

Integrated temporal partitioning and partial reconfiguration techniques for design latency improvement

Ramzi Ayadi • Bouraoui Ouni • Abdellatif Mtibaa

Received: 5 February 2013 / Accepted: 29 May 2013 / Published online: 16 June 2013

© Springer-Verlag Berlin Heidelberg 2013

R. Ayadi (✉) · B. Ouni · A. Mtibaa
Laboratory of Electronic and Microelectronic, Faculty of Science of Monastir, University of Monastir, Monastir 5019, Tunisia
e-mail: [email protected]

Evolving Systems (2014) 5:133–141. DOI 10.1007/s12530-013-9082-9

Abstract In this paper, we present a novel temporal partitioning methodology that temporally partitions a data flow graph on a reconfigurable system. Our approach optimizes the whole latency of the design. This aim is reached by minimizing the latency of the graph and the reconfiguration time at the same time. Consequently, our algorithm starts from an existing temporal partitioning, which is the result of a whole-latency optimization algorithm. Next, our approach builds the architecture, on a partially reconfigurable FPGA, that gives the lowest value of reconfiguration time. The proposed methodology was tested on several examples on the Xilinx Virtex-II Pro. The results show a significant reduction in design latency compared with other well-known approaches used in this field.

Keywords Temporal partitioning · Partially reconfigurable FPGA · Data flow graph · VLSI applications

1 Introduction

Temporal partitioning has become an essential issue for several important VLSI applications. An application with several tasks entails problem complexities that are unmanageable for existing programmable devices. Thus, temporal partitioning is used to divide the application into smaller, more manageable components, with traditional goals such as latency optimization or communication cost optimization. In the literature, many methods have been used to solve the temporal partitioning problem. Many authors (Ouni and Ayadi 2008; Trimberger 1998; Cardoso 2003; Mtibaa et al. 2007) have used the list scheduling algorithm. Others have extended existing scheduling techniques from high-level synthesis (Vasiliko and Ait-Boudaoud 1996; Spillane and Owen 1998; Ouni et al. 2011a). Integer linear programming (ILP) has also been used in Kaul et al. (1998), Byungil (1999) and Wu et al. (2001) to solve the temporal partitioning problem. The general problem of ILP approaches for partitioning a graph is their high execution time: the size of the computation model grows very fast and, therefore, the algorithm can only be applied to small examples. To overcome this problem, some authors reduce the size of the model by reducing the set of constraints in the problem formulation, but the number of variables and precedence constraints to be considered still remains high. Further, the network flow methodology has been used (Liu and Wong 1998a, b) and improved in Jiang and Wang (2007). The main goal of the network flow algorithm is the minimization of the communication overhead among the partitions, which also means the minimization of the communication memory; this goal is formulated as the minimization of the overall cut-size among the partitions. In Ouni et al. (2009) the authors combined the force directed scheduling (FDS) algorithm and the network flow algorithm to reduce the whole latency and the communication cost at the same time. In Ouni et al. (2011b) the authors combined temporal partitioning and temporal placement to reduce the communication overhead of the design. In Ouni et al. (2011c) the authors proposed a mathematical algorithm to reduce the whole latency of the design. In conclusion, most of the authors in the literature have proposed algorithmic approaches to solve the temporal partitioning problem at the behavioral level. The majority of these studies have not taken into account the technological evolution of reconfigurable architectures. In fact, in the literature, temporal partitioning has been applied only to fully dynamic architectures (Ouni et al. 2011c; Ayadi et al. 2012). In this work, however, we exploit the advantages of dynamically and partially reconfigurable FPGAs to build a fully dynamic architecture. In fact, this paper shows how to combine temporal partitioning algorithms with the static and reconfigurable modules of a partially reconfigurable FPGA to implement the design while optimizing its whole latency. In summary, we present a new algorithmic-architectural methodology to solve the temporal partitioning problem.

2 Partially reconfigurable FPGA

In the last few years, a new method for reconfiguring a Field Programmable Gate Array (FPGA) has been proposed. This method is known as partial and dynamic FPGA reconfiguration. Figure 1 shows an example of such a new FPGA (e.g. Virtex-II, Virtex-II Pro and Virtex-4); it includes two subsets of resources: the system resources and the operational resources. The system resources have been developed and optimized by the manufacturer. They may be composed of processors (PowerPC), block RAMs and other internal peripherals (RS232 interface, etc.). The operational resources are zones of the device reserved for the user; they may be used to build modules. There are two kinds of modules: static and reconfigurable modules. The static modules accommodate the static part of the design, such as the configuration controller and the logic required for data and interface management. The dynamic (reconfigurable) modules accommodate the dynamic part; they are loaded and unloaded on the target device as requested by the system scheduler. A reconfigurable module gives the FPGA location where the design should be mapped; it might be seen as a user constraint file for the design. As shown in Fig. 2, a module is a rectangle characterized by two parameters, Wm and Hm, where Wm (width) is the number of columns consumed by the module M and Hm (height) is the number of lines consumed by M. Further, the new FPGA families are capable of self-reconfiguration under software control through a reconfiguration hardware interface called the Internal Configuration Access Port (ICAP) (OPB HWICAP 2004; Lysaght et al. 2006). Consequently, the system may use the on-chip PowerPC core and FPGA logic to automatically reconfigure itself with bitstreams stored in an external memory, such as CompactFlash.

3 Data flow graph

A data flow graph (DFG) is a directed acyclic graph G = (V, E), where V is the set of nodes, with |V| = n, and E is the set of edges. A directed edge e_{ij} ∈ E represents the data dependency between the nodes (T_i, T_j). We assume that each node has an equivalent hardware implementation, which occupies a rectangular area on the chip. Therefore, the nodes as well as the edges in a DFG have characteristics such as area, latency and width that are derived from the hardware resources used later to implement those tasks.

3.1 Task and edge parameters

Given a node T_i ∈ V and an edge e_{ij} ∈ E:

• A(T_i) denotes the area of T_i.
• The latency Lat(T_i) of T_i is the time needed to execute T_i.
• For a given edge e_{ij}, which defines a data dependency between T_i and T_j, we define the weight a_{ij} of e_{ij} as the amount of data transferred from T_i to T_j, and the latency r_{ij} of e_{ij} as the time needed to transfer data from T_i to T_j.
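For illustration, a minimal C sketch of how such a DFG and its task and edge parameters might be represented follows; the type and field names are our own, not the paper's.

    #include <stddef.h>

    /* A task (node) of the DFG: hardware area in CLBs and execution
     * latency, matching the parameters of Sect. 3.1. */
    typedef struct {
        int    id;
        double area;     /* A(Ti)                             */
        double latency;  /* Lat(Ti)                           */
    } Task;

    /* A directed edge e_ij: data dependency from task src to task dst. */
    typedef struct {
        int    src, dst;
        double weight;   /* a_ij, amount of data transferred  */
        double latency;  /* r_ij, time needed to transfer it  */
    } Edge;

    /* The graph G = (V, E). */
    typedef struct {
        Task  *tasks;  size_t n_tasks;
        Edge  *edges;  size_t n_edges;
    } DFG;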

Fig. 1 New FPGA device (block diagram showing the PowerPC, the PLB and OPB buses with a PLB/OPB bridge, UART, SRAM controller, Block RAM, ICAP controller and ICAP, and two reconfigurable modules inside the FPGA)


3.2 Temporal partitioning of data flow

A temporal partitioning P of the graph G = (V, E) is its division into disjoint partitions such that P = {P_1, ..., P_k}. A temporal partitioning is feasible with respect to a reconfigurable device H with area A(H) and pin count T(H) (number of programmable input/outputs (I/Os) per device) if the two following conditions are verified:

\forall P_i \in P, \quad A(P_i) \le A(H) \qquad (1)

TCCost = \sum_{m=1}^{K} CCost(P_m) = \sum_{m=1}^{K} \; \sum_{T_i \in P_m,\, T_j \notin P_m} a_{i,j} \le T(H), \qquad (2)

where

• CCost(P_m) is the communication cost of partition P_m,
• TCCost is the total communication cost across partitions.

In the rest of this section we explain these two conditions.

Given a temporal partitioning P = {P_1, ..., P_k} of the data flow graph G = (V, E), the area constraint is satisfied if and only if

\forall P_i \in P, \quad A(P_i) \le A(H),

where A(P_i) is the area of partition P_i and A(H) is the area of the device. The area of partition P_i equals the sum of the areas of the nodes that belong to P_i:

A(P_i) = \sum_{T_i \in P_i} A(T_i).

Combining this with Eq. (1), the area constraint becomes

\forall P_i \in P, \quad \sum_{T_i \in P_i} A(T_i) \le A(H).

Given a temporal partitioning P = {P_1, ..., P_k} of the data flow graph G = (V, E), the pin constraint is satisfied if and only if

TCCost = \sum_{m=1}^{K} CCost(P_m) = \sum_{m=1}^{K} \; \sum_{T_i \in P_m,\, T_j \notin P_m} a_{i,j} \le T(H).

In fact, a_{i,j} ≠ 0 signifies that T_j depends on T_i. When node T_i is placed in partition P_m and T_j is placed outside P_m, the data communicated between them have to be stored in memory. Consequently, the sum of all the data communicated across all partitions should be less than the pin constraint.
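As a minimal sketch of how Eqs. (1) and (2) might be checked in code, assuming the DFG structures introduced in Sect. 3.1 and a partition represented simply as an array of task indices (both of which are our own illustrative choices):

    #include <stdbool.h>
    #include <stddef.h>

    /* A partition Pm: indices of the tasks it contains. */
    typedef struct { int *task_ids; size_t n; } Partition;

    static bool contains(const Partition *p, int task_id) {
        for (size_t i = 0; i < p->n; ++i)
            if (p->task_ids[i] == task_id) return true;
        return false;
    }

    /* Checks Eq. (1): every partition fits on the device, A(Pi) <= A(H),
     * and Eq. (2): the total cross-partition communication does not
     * exceed the pin budget T(H). */
    bool partitioning_is_feasible(const DFG *g, const Partition *parts, size_t k,
                                  double area_H, double pins_H)
    {
        double total_ccost = 0.0;
        for (size_t m = 0; m < k; ++m) {
            double area = 0.0;
            for (size_t i = 0; i < parts[m].n; ++i)
                area += g->tasks[parts[m].task_ids[i]].area;
            if (area > area_H) return false;            /* Eq. (1) violated */

            /* CCost(Pm): data produced inside Pm and consumed outside Pm. */
            for (size_t e = 0; e < g->n_edges; ++e) {
                const Edge *ed = &g->edges[e];
                if (contains(&parts[m], ed->src) && !contains(&parts[m], ed->dst))
                    total_ccost += ed->weight;
            }
        }
        return total_ccost <= pins_H;                   /* Eq. (2) */
    }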

4 Problem formulation

In this paper we aim to solve the following problem: given a data flow graph G = (V, E), a temporal partitioning P of the graph and a partially reconfigurable device with Na CLBs, find the way of executing the graph on the FPGA that optimizes the whole latency of the graph while respecting all constraints.

Given an existing temporal partitioning P = {P_1, ..., P_k} of the graph G = (V, E): in Ouni et al. (2011c) the authors presented a temporal partitioning algorithm that optimizes the whole latency of the design; this temporal partitioning gives the best solution in terms of whole latency. The optimal whole latency of the design P (Optimal_Lat(P)) has been calculated as follows (Ouni et al. 2011c):

Optimal\_Lat(P) = K \cdot CT + D(G), \qquad (3)

where

• D(G) is the total latency of the design,
• ||P_i|| is the latency of partition P_i,
• K is the number of partitions,
• CT is the time needed to configure the device.

4.1 How to calculate the total latency of the design

The total latency of the design is calculated as follows (Ouni et al. 2011c):

D(G) = \sum_{i=1}^{k} \lVert P_i \rVert \qquad (4)

Let C = \{ C_1^{P_i}, C_2^{P_i}, \ldots, C_n^{P_i} \} be the set of paths in the partition P_i. Hence

\lVert P_i \rVert = \max_{1 \le j \le n} \left\lVert C_j^{P_i} \right\rVert \qquad (5)

We apply Eq. (6) to all couples (T_i, T_j) such that T_j depends on T_i:

Lat(T_i) = Lat(T_i) + r_{ij} \qquad (6)

Using Eq. (6):

\left\lVert C_j^{P_i} \right\rVert = \sum_{T_m \in C_j^{P_i}} Lat(T_m). \qquad (7)
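A possible sketch of Eqs. (4)–(7), reusing the structures sketched above: the transfer latency r_{ij} of each edge is folded into the path latency (Eq. 6), the latency of a partition is taken as the longest path through its sub-graph (Eqs. 5 and 7), and the design latency is the sum over partitions (Eq. 4). This assumes the sub-graph of each partition is acyclic, as for any DFG.

    /* Longest path (in time) starting at task t, using only successors inside
     * partition p. memo[] caches one value per task; -1.0 means "not computed".
     * Relies on the Task/Edge/DFG and Partition types and contains() above. */
    static double longest_from(const DFG *g, const Partition *p, int t, double *memo)
    {
        if (memo[t] >= 0.0) return memo[t];
        double best_tail = 0.0;
        for (size_t e = 0; e < g->n_edges; ++e) {
            const Edge *ed = &g->edges[e];
            if (ed->src == t && contains(p, ed->dst)) {
                /* Eq. (6): add the transfer latency r_ij along the path. */
                double tail = ed->latency + longest_from(g, p, ed->dst, memo);
                if (tail > best_tail) best_tail = tail;
            }
        }
        memo[t] = g->tasks[t].latency + best_tail;   /* Eq. (7) for this path */
        return memo[t];
    }

    /* ||Pi||, Eq. (5): latency of the longest path inside partition p. */
    double partition_latency(const DFG *g, const Partition *p, double *memo)
    {
        for (size_t i = 0; i < g->n_tasks; ++i) memo[i] = -1.0;
        double max_path = 0.0;
        for (size_t i = 0; i < p->n; ++i) {
            double l = longest_from(g, p, p->task_ids[i], memo);
            if (l > max_path) max_path = l;
        }
        return max_path;
    }

    /* D(G), Eq. (4): sum of the partition latencies. */
    double total_latency(const DFG *g, const Partition *parts, size_t k, double *memo)
    {
        double d = 0.0;
        for (size_t m = 0; m < k; ++m)
            d += partition_latency(g, &parts[m], memo);
        return d;
    }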

Fig. 2 A module, a rectangle of width Wm and height Hm


4.2 How to calculate CT

Given a device with Na CLBs that operates at an execution frequency Fe, we call \tau_E = 1/F_e the execution period. We assume that the device is able to configure Nr CLBs per configuration cycle, at a configuration frequency Fr; we call \tau_R = 1/F_r the configuration period. The relationship between \tau_R and \tau_E is given by:

\tau_R = \left( \frac{F_e}{F_r} \right) \tau_E. \qquad (8)

Then, the configuration time CT is calculated as follows (Benoit et al. 2002):

CT = \left( \frac{N_a}{N_r} \right) \tau_R = \left( \frac{N_a}{N_r} \right) \left( \frac{F_e}{F_r} \right) \tau_E. \qquad (9)

Also CT can be written as follows:

CT ¼ Area ðDÞNr

� �sR ¼

W � HNr

� �sR; ð10Þ

where Area (D) = W�H, represent the total area of the

device and W, H are the width and the height of the device.

Hence for all modules Mn ðWn;HnÞ, if

ðAreaðMnÞ\AreaðDÞÞ then:

CT ¼ Area ðDÞNr

� �sR ¼

W � HNr

� �sR [ CT0

¼ Area ðMnÞNr

� �sR ¼

Mn � Hn

Nr

� �sR

Then CT [ CT0 ð11Þ

Therefore, for all modules with Area(M_n) < Area(D):

Optimal\_Lat(P) = K \cdot CT' + D(G). \qquad (12)

Hence, based on Eq. (12), we can further reduce the value of Optimal_Lat(P) by minimizing the term CT'. Therefore, the true optimal solution in terms of whole latency is

Optimal\_Lat(P) = K \cdot CT'_{min} + D(G), \qquad (13)

where CT'_{min} is the lowest possible value of CT'.

Based on Eqs. (11) and (13), the whole latency minimization problem can be expressed as follows: given a graph G = (V, E) and a temporal partitioning that is optimal in terms of whole latency, find the way of executing the graph on the FPGA such that the product (W_n × H_n) has the lowest possible value while respecting all constraints.
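To make Eqs. (9)–(13) concrete, here is a small hedged sketch that turns device and module parameters into a configuration time and a whole latency; the function and parameter names are ours.

    /* Eqs. (9)-(10): configuration time of a region of `area` CLBs on a device
     * that configures Nr CLBs per configuration cycle of period tau_R = 1/Fr. */
    double config_time(double area, double Nr, double Fr)
    {
        double tau_R = 1.0 / Fr;
        return (area / Nr) * tau_R;
    }

    /* Eqs. (12)-(13): whole latency of a K-partition design executed on a
     * reconfigurable module of Wn x Hn CLBs, given the graph latency D(G)
     * computed as in Sect. 4.1. */
    double whole_latency(int K, double Wn, double Hn, double Nr, double Fr, double D_G)
    {
        double ct_prime = config_time(Wn * Hn, Nr, Fr);   /* CT' for the module */
        return (double)K * ct_prime + D_G;
    }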

5 Proposed architecture

As shown in Eq. (13), the whole latency minimization problem can be solved by choosing the lowest possible value of the product (W_n × H_n). In this section, we show how to get this lowest value. Given a temporal partitioning P of the graph G = (V, E) into K disjoint temporal partitions P = {P_1, P_2, ..., P_k}, and based on Eq. (1):

\forall P_i \in P, \quad A(P_i) \le Max(P_i) \le (W_n \times H_n), \qquad (14)

where Max(P_i) denotes the area of the largest partition. Hence, the lowest possible value of the product (W_n × H_n) should be close to Max(P_i):

(W_n \times H_n) \approx Max(P_i). \qquad (15)

Hence, the graph should be executed on a reconfigurable area such that A(H) = Max(P_i). As shown in Sect. 2, we can build reconfigurable and static modules on the device. Since a module is characterized by its width W_n and its height H_n, we need to build a reconfigurable module on the partially reconfigurable FPGA such that (W_n × H_n) = Max(P_i). Figure 3 shows our proposed architecture, which is composed of three main blocks: a static module, a reconfigurable module and a block RAM. The static module includes the system resources needed to control the execution steps of the application. The static module can be seen as a controller: it interacts with both the reconfigurable hardware and the memory, and is used to load new configurations and to control the memory.

Fig. 3 Proposed architecture
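A brief sketch of this sizing rule, again reusing the structures from Sect. 3: the module only needs to cover the largest partition, so its width can be derived from Max(P_i) once a module height has been fixed (how the height H_n is chosen depends on the target device and is not specified here; treat it as a parameter).

    #include <math.h>

    /* Max(Pi), Eq. (14): area of the largest partition, in CLBs. */
    double max_partition_area(const DFG *g, const Partition *parts, size_t k)
    {
        double max_a = 0.0;
        for (size_t m = 0; m < k; ++m) {
            double a = 0.0;
            for (size_t i = 0; i < parts[m].n; ++i)
                a += g->tasks[parts[m].task_ids[i]].area;
            if (a > max_a) max_a = a;
        }
        return max_a;
    }

    /* Eq. (15): smallest module width Wn such that Wn * Hn covers Max(Pi). */
    int module_width(double max_area, int Hn)
    {
        return (int)ceil(max_area / (double)Hn);
    }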

5.1 How our architecture runs

We build the finite state machine (FSM) of Fig. 4 to manage our architecture. The FSM is executed by the PowerPC. Initially, the system is in state A; next, if the switches start and C0 are activated, the system goes to state B. State B configures the bitstream of the system resources and of partition P0 from the CompactFlash. Next, if the guard T is true, the system goes to state C, which allows the execution of partition P0. Partition P0 can read/write data from the RAM through the bus macro. After that, if partition P0 finishes its execution, the system goes to state E. Next, if the switch Ci (i from 1 to k) is activated, the system goes to state F. State F configures the bitstream of partition Pi (i from 1 to k) from the CompactFlash into the reconfigurable module. After that, if the guard T is true, the system goes from state F to state D, which executes partition Pi. Partition Pi can read/write data from the RAM through the bus macro. If partition Pi finishes its execution, the system goes to state E. In the same way we execute partition P(i+1), and the process is repeated until all partitions have been executed. In the end, the guard C (C = C0 & C1 & ... & Ci & ... & Ck) becomes true and the system comes back to state A.

Fig. 4 Proposed FSM
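As noted in Sect. 6.1, this FSM is implemented as embedded C code running on the PowerPC. The following is a simplified, hypothetical sketch of that control loop; read_switch, start_pressed, configure_partition, run_partition, config_done and execution_done are placeholder helpers of our own, not functions from the paper or from a Xilinx library.

    #include <stdbool.h>

    typedef enum { STATE_A, STATE_B, STATE_C, STATE_D, STATE_E, STATE_F } State;

    /* Hypothetical platform helpers. */
    extern bool start_pressed(void);         /* switch "start"               */
    extern bool read_switch(int i);          /* switch Ci                    */
    extern void configure_partition(int i);  /* load Pi's bitstream via ICAP */
    extern bool config_done(void);           /* guard T                      */
    extern void run_partition(int i);        /* start execution of Pi        */
    extern bool execution_done(void);        /* Pi finished                  */

    void control_fsm(int k)                  /* partitions P0 .. Pk          */
    {
        State s = STATE_A;
        int   i = 0;
        for (;;) {
            switch (s) {
            case STATE_A:                    /* wait for start and C0        */
                if (start_pressed() && read_switch(0)) s = STATE_B;
                break;
            case STATE_B:                    /* configure system + P0        */
                configure_partition(0);
                while (!config_done()) { }   /* wait for guard T             */
                s = STATE_C;
                break;
            case STATE_C:                    /* execute P0                   */
                run_partition(0);
                while (!execution_done()) { }
                s = STATE_E;
                break;
            case STATE_E:                    /* guard C: all partitions run? */
                if (i == k) { s = STATE_A; i = 0; break; }
                if (read_switch(i + 1)) { ++i; s = STATE_F; }
                break;
            case STATE_F:                    /* configure partition Pi       */
                configure_partition(i);
                while (!config_done()) { }   /* wait for guard T             */
                s = STATE_D;
                break;
            case STATE_D:                    /* execute Pi                   */
                run_partition(i);
                while (!execution_done()) { }
                s = STATE_E;
                break;
            }
        }
    }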

6 Implementation of proposed architecture

Xilinx offers FPGA design tools that support a self-reconfiguration design flow. The available synthesis tools have improved vastly, both in speed and in the quality of results. The following paragraphs provide a detailed description of the design flow needed to implement our application. The proposed design flow includes four main steps:

– building the static module,
– looking for FPGA free space,
– building the reconfigurable module,
– assembly phase.

6.1 Building static module

This step aims to build the static module. In our approach the static module includes the system resources, such as the PowerPC, UART, OPB bus, PLB bus and ICAP interface. The Xilinx Embedded Development Kit (EDK) may be used to build the system resources efficiently. The system resources use a set of compiler tools to handle high-level languages such as C, C++ or assembly language, and produce reliable code for their embedded target. In this context we developed embedded C code to implement the FSM of Fig. 4.

6.2 Looking for free space

After building the static module, we look for free space inside the FPGA in which to build the reconfigurable module. Figure 5 shows the netlist of the static module and the places where the reconfigurable module may be built.

6.3 Building reconfigurable modules

The goal of the budgeting phase is to determine the size and location of the reconfigurable module and to lock down the placement of the bus macros. The budgeting phase can be done manually; the process, however, is laborious, and instead many of the steps have been automated with a tool called PlanAhead. The PlanAhead tool allows building modules anywhere inside the device (Fig. 6). After this step a user constraints file (UCF file) is automatically generated.

Fig. 5 Free space inside the device

Fig. 6 Implementation of modules

After building the reconfigurable and static modules inside the device, we must provide communication means between them. Bus macros may be used to maintain correct connections between the modules by spanning the boundaries of these rectangular regions. The locations of the bus macros should be close to the module locations. Further, the system needs to communicate with its environment through the FPGA pins, so we used the data sheet of the target FPGA to find appropriate pins.

Fig. 7 XUP Virtex-II Pro

6.4 Assembly phase

The last phase of the flow is the assembly of the static and reconfigurable parts. The final bitstream is generated as a full bitstream; after loading the full bitstream, it is possible to reconfigure the module region of the device with the partial bitstream of each partition.
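For illustration only, a hedged sketch of how a runtime might drive this last step: the full bitstream is assumed to be already loaded, and the partial bitstream of partition Pi is streamed from the CompactFlash into the ICAP. The two helpers cf_read_partial_bitstream and icap_write_words are hypothetical placeholders standing in for the file-system and HWICAP driver calls of a real system; they are not APIs defined in the paper or by Xilinx.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical platform helpers (not real Xilinx driver calls):
     *  - cf_read_partial_bitstream: reads partition i's partial bitstream from
     *    CompactFlash into buf and returns its size in 32-bit words (0 on error).
     *  - icap_write_words: pushes configuration words into the ICAP, 0 on success. */
    extern size_t cf_read_partial_bitstream(int i, uint32_t *buf, size_t max_words);
    extern int    icap_write_words(const uint32_t *buf, size_t n_words);

    /* Reconfigure the reconfigurable module with partition Pi. */
    int reconfigure_with_partition(int i, uint32_t *buf, size_t max_words)
    {
        size_t n = cf_read_partial_bitstream(i, buf, max_words);
        if (n == 0) return -1;             /* bitstream missing or too large */
        return icap_write_words(buf, n);
    }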

7 Experiments

The hardware architecture on which the design flow is mapped, the XUP Virtex-II Pro board, is presented in Fig. 7. The XUP Virtex-II Pro FPGA development system can be used at virtually any level of the engineering curriculum, from introductory courses through advanced research projects. In our experiments, we used four approaches: list scheduling (Trimberger 1998), the initial network flow (Liu and Wong 1998a, b), the improved network flow (Jiang and Wang 2007) and our proposed approach, and we evaluated the performance of each approach in terms of whole latency. Figure 8 shows the extraction of the color layout descriptor (CLD), a low-level visual descriptor that can be extracted from images or video frames. The process of CLD extraction consists of four stages: image partitioning, selection of a single representative color for each block, DCT transformation, and non-linear quantization with zig-zag scanning.

Since the DCT is the most computationally intensive part of the CLD algorithm, it has been chosen for hardware implementation, and the rest of the subtasks (partitioning, color selection, quantization, zig-zag scanning and Huffman encoding) were chosen for software implementation. The model proposed by Mtibaa et al. (2007) is based on 16 vector products.

Table 1 Benchmark characteristics

DFGs          Nodes   Edges   Area (CLB)
DCT 4 × 4       224     256        8,045
DCT 16 × 16   1,929   2,304       13,919

Table 2 Design results: initial network flow vs. proposed approach

4 × 4 DCT task graph
  Number of partitions:  initial network flow 9, proposed approach 9
  Whole latency (ms):    initial network flow 234.005770, proposed approach 102.964770, improvement 56 %
16 × 16 DCT task graph
  Number of partitions:  initial network flow 15, proposed approach 15
  Whole latency (ms):    initial network flow 390.009710, proposed approach 160.686610, improvement 58.90 %
Average improvement in whole latency: 57.45 %

Fig. 8 Block diagram of the CLD extraction

Fig. 9 Vector products


Thus, the entire DCT is a collection of 16 tasks, where each task is a vector product, as presented in Fig. 9. There are two kinds of tasks in the task graph, "T1" and "T2", whose structure is similar to a vector product but whose bit widths differ. Table 1 gives the characteristics of the 4 × 4 DCT and 16 × 16 DCT task graphs.

Table 3 Design results: improved network flow vs. proposed approach

4 × 4 DCT task graph
  Number of partitions:  improved network flow 9, proposed approach 9
  Whole latency (ms):    improved network flow 234.004000, proposed approach 102.964770, improvement 55.80 %
16 × 16 DCT task graph
  Number of partitions:  improved network flow 15, proposed approach 15
  Whole latency (ms):    improved network flow 390.006000, proposed approach 160.686610, improvement 58.70 %
Average improvement in whole latency: 57.25 %

Table 4 Design results: list scheduling vs. proposed approach

4 × 4 DCT task graph
  Number of partitions:  list scheduling 9, proposed approach 9
  Whole latency (ms):    list scheduling 234.0043800, proposed approach 102.964770, improvement 58.90 %
16 × 16 DCT task graph
  Number of partitions:  list scheduling 15, proposed approach 15
  Whole latency (ms):    list scheduling 390.773000, proposed approach 160.686610, improvement 58.80 %
Average improvement in whole latency: 58.85 %

Tables 2, 3 and 4 give the different solutions provided by the list scheduling, the initial network flow technique, the enhanced network flow and the proposed approach. The results of Table 2 show an average improvement of 57.45 % in whole design latency compared with the initial network flow; Tables 3 and 4 show improvements of 57.25 % and 58.85 % compared with the improved network flow and the list scheduling, respectively. These results show the magnitude of the benefit of adopting our approach. Therefore, our algorithm achieves a good trade-off between the latency of the graph and the reconfiguration overhead. Hence, our approach can be qualified as a good temporal partitioning candidate. In fact, an optimal partitioning approach needs to balance the computation required for each partition and reduce the reconfiguration overhead so that mapped applications can be executed faster on dynamically reconfigurable hardware.
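As a quick arithmetic check on the reported figures, the 4 × 4 DCT entry of Table 2 gives an improvement of

\frac{234.005770 - 102.964770}{234.005770} \approx 0.56 = 56\,\%,

which matches the value reported in the table.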

8 Conclusion

Today's large and complex designs are now commonly implemented in FPGAs; however, the designer suffers principally from the time needed to reconfigure the device, which is still relatively high and in some cases consumes more than 70 % of the whole design latency. A high reconfiguration time may lead to an impractical design, mainly when the designer focuses on minimizing the overall latency of the application. This problem can be addressed by using the proposed approach; in fact, there is always a significant gain in reconfiguration latency compared with other approaches.

References

Ayadi R, Ouni B, Mtibaa A (2012) A partitioning methodology that optimizes the communication cost for reconfigurable computing systems. Int J Autom Comput 9(3):280–287

Benoit P, Torres L, Robert M, Cambon G, Sassatelli G, Gil T (2002) Caractérisation d'architectures reconfigurables. Un exemple : le Systolic Ring [Characterization of reconfigurable architectures. An example: the Systolic Ring]. In: Francophone Days on Adequacy Algorithm Architecture JFAAA'2002, 16–18 December 2002, Monastir, Tunisia, pp 30–34

Cardoso JMP (2003) On combining temporal partitioning and sharing of functional units in compilation for reconfigurable architectures. IEEE Trans Comput 52(10):1362–1375

Jeong B (1999) Hardware software partitioning for reconfigurable architectures. M.S. Thesis, School of Electrical Engineering, Seoul National University

Jiang Y-C, Wang J-F (2007) Temporal partitioning data flow graphs for dynamically reconfigurable computing. IEEE Trans Very Large Scale Integr Syst 15(12):1351–1361

Kaul K, Vermuri R, Govindarajan S, Ouaiss I (1998) An automated temporal partitioning tool for a class of DSP applications. In: Workshop on reconfigurable computing, international conference on parallel architectures and compilation techniques (PACT), pp 22–27

Liu H, Wong DF (1998a) Network flow based circuit partitioning for time-multiplexed FPGAs. In: Proceedings of the IEEE/ACM international conference on computer-aided design, pp 497–504

Liu H, Wong DF (1998b) Network flow based multi-way partitioning with area and pin constraints. IEEE Trans Comput Aided Design Integr Circuits Syst 17(1):50–59

Lysaght P, Blodget B, Manson J, Young J, Bridgford B (2006) Invited paper: enhanced architectures, design methodologies and CAD tools for dynamic reconfiguration of Xilinx FPGAs. In: International conference on field programmable logic and applications (FPL), pp 1–6

Mtibaa A, Ouni B, Abid M (2007) An efficient list scheduling algorithm for time placement problem. Comput Electr Eng 33(4):285–298

OPB HWICAP (2004) Product specification datasheet DS 280 (v1.3), March

Ouni B, Ayadi R, Abid M (2008) Novel temporal partitioning algorithm for run time reconfigured systems. J Eng Appl Sci 3(10):335–340

Ouni B, Mtibaa A, Bourennane E-B (2009) Scheduling approach for run time reconfigured systems. Int J Comput Sci Eng Syst 3(4):335–340

Ouni B, Ayadi R, Mtibaa A (2011a) Partitioning and scheduling technique for run time reconfigured systems. Int J Comput Aided Eng Technol 3(1):77–91

Ouni B, Ayadi R, Mtibaa A (2011b) Combining temporal partitioning and temporal placement techniques for communication cost improvement. Adv Eng Softw 42(7):444–451

Ouni B, Ayadi R, Mtibaa A (2011c) Temporal partitioning of data flow graph for dynamically reconfigurable architecture. J Syst Archit 57(8):790–798

Spillane J, Owen H (1998) Temporal partitioning for partially-reconfigurable field-programmable gate arrays. In: Reconfigurable architectures workshop, IPPS/SPDP'98

Trimberger S (1998) Scheduling designs into a time-multiplexed FPGA. In: Proceedings of the ACM international symposium on field programmable gate arrays, pp 153–160

Vasiliko M, Ait-Boudaoud D (1996) Architectural synthesis for dynamically reconfigurable logic. In: International workshop on field-programmable logic and applications, FPL'96, Darmstadt, Germany

Wu GM, Lin JM, Chang YW (2001) Generic ILP-based approaches for time-multiplexed FPGA partitioning. IEEE Trans Comput Aided Des 20(10):1266–1274
