

Int. J. Computational Science and Engineering, Vol. 9, Nos. 1/2, 2014 21

Copyright © 2014 Inderscience Enterprises Ltd.

Temporal partitioning of data flow graphs for reconfigurable architectures

Bouraoui Ouni* and Abdellatif Mtibaa Laboratory of Electronic and Microelectronic, Faculty of Sciences, University of Monastir, Monastir 5000, Tunisia E-mail: [email protected] E-mail: [email protected] *Corresponding author

Abstract: In this paper, we present the famous temporal partitioning algorithms that temporally partition a data flow graph on reconfigurable systems. We have classified these algorithms into four classes: 1) whole latency optimisation algorithms; 2) whole communication cost optimisation algorithms; 3) whole area optimisation algorithms; 4) whole latency-communication cost optimisation algorithms. These algorithms can be used to solve the temporal partitioning problem at the behaviour level.

Keywords: temporal partitioning; reconfigurable architecture; FPGA-engineering; VLSI applications; computer aided design; data flow graph.

Reference to this paper should be made as follows: Ouni, B. and Mtibaa, A. (2014) ‘Temporal partitioning of data flow graphs for reconfigurable architectures’, Int. J. Computational Science and Engineering, Vol. 9, Nos. 1/2, pp.21–33.

Biographical notes: Bouraoui Ouni is currently an Associate Professor at the National Engineering School of Sousse. He has obtained his PhD entitled ‘Synthesis and temporal partitioning for reconfigurable systems’ in 2008 from the Faculty of Sciences at Monastir. Currently, he is preparing his university habilitation entitled ‘Optimisation algorithm for reconfigurable architectures’. Hence, his researches interest cover: models, methods, tools, and architectures for reconfigurable computing; simulation, debugging, synthesis, verification, and test of reconfigurable systems; field programmable gate arrays and other reconfigurable technologies; algorithms implemented on reconfigurable hardware; hardware/software codesign and cosimulation with reconfigurable hardware; and high performance reconfigurable computing.

Abdellatif Mtibaa is currently a Professor in Micro-Electronics and Hardware Design with the Electrical Department at the National School of Engineering of Monastir and the Head of Circuits Systems Reconfigurable-ENIM-Group at Electronic and Microelectronic Laboratory. He received his Diploma in Electrical Engineering in 1985 and his PhD in Electrical Engineering in 2000. His current research interests include system on programmable chip, high level synthesis, rapid prototyping and reconfigurable architecture for real-time multimedia applications. He has authored/co-authored over 100 papers in international journals and conferences. He served on the technical programme committees for several international conferences. He also served as a co-organiser of several international conferences.

1 Introduction

The significance of the reconfiguration system can be illustrated through the following example: many applications have heterogeneous nature and comprise several sub tasks with different characteristics. For instance a multimedia application may include a data parallel task, a bit level task, irregular computation, high precision word operation and real time component. For such application the ASICs approach would lead to uneconomical size and a large number of separates chips. However, the reconfigurable architecture may solve this

problem and meet the performance constraints. Indeed, using reconfigurable architecture many tasks can be performed by the same logic. This is the key of reconfigurable architecture. That is why; reconfigurable architecture has become an essential issue for several important VLSI applications. In fact, application with several tasks has entailed problem complexities that are unmanageable for existing programmable devices. Thus, such application should be temporally partitioned into smaller, more manageable components, this process is called temporal partitioning (Bobda, 2007; Cardoso, 2003; Jiang and Lai, 2007).


2 Definitions

2.1 Definition 1: data flow graph (DFG)

A DFG is a directed acyclic graph G = (V, E) where V is the set of nodes |V| = n = number of nodes in G and E is the set of edges. A directed edge ei,j ∈ E represents the data dependency between nodes (Ti, Tj). We assume that each node has an equivalent hardware implementation. Therefore, the nodes as well as the edges in a DFG have some characteristics such as area, latency and width that are derived from the hardware resources used later to implement those nodes.

2.1.1 Node and edge parameters

Given a node Ti ∈ V and eij ∈ E.

• a (Ti) denotes the area of node Ti

• the latency Llat(Ti) of Ti is the time needed to execute of node Ti

• for a given edge eij which defines a data dependency between Ti and Tj, we define the weight σij of eij as the amount of data transferred from Ti to Tj and the latency σij of eij as the time needed to transfer data from Ti to Tj.

2.2 Definition 2: temporal partitioning

Given a data flow graph G(V, E) and a reconfigurable processing unit H, a temporal partition of G on H is an ordered partition P of G for the RPU H if a set S = (H1, H2, .. Hn) of identical devices H available on which the partitions can be mapped then we have a multi-RPU or multi-FPGA temporal partition, otherwise it is a single-RPU or single-FPGA temporal partition. In a single-FPGA temporal partitioning, only one partition can be downloaded into the FPGA at a time, while in a multi-FPGA temporal partitioning many partitions can be downloaded at same time into the available FPGAs.

2.3 Definition 3: graph connectivity

Given a data flow graph G(V, E), we define the connectivity of G as the ratio of the number of edges in E to the number of all edges that can be built with the nodes of G:

Con(G) = 2 N_E / (n² − n)    (1)

where N_E is the number of edges in G. For a given subset V′ of V, the connectivity of V′ is defined analogously as the ratio of the number of edges connecting the nodes of V′ to the number of all edges that can be built with the nodes of V′.

2.4 Definition 4: quality

Given a temporal partitioning of G = (E, V) into k disjoint partitions P = {P1, P2, …, Pk}, we define the quality of partition P as follows:

Q(P) = (1/k) Σ_{i=1}^{k} Con(P_i)    (2)
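Definitions 3 and 4 can be checked with a short sketch (plain Python; the five-node DFG and the two-partition split below are invented for illustration):

```python
# Connectivity and quality of a partitioning, following equations (1) and (2).
def connectivity(nodes, edges):
    n = len(nodes)
    # ratio of existing edges to all possible node pairs
    return 2 * len(edges) / (n * n - n)

def quality(partitions, edges):
    # mean connectivity of the k partitions, equation (2)
    cons = []
    for part in partitions:
        internal = [e for e in edges if e[0] in part and e[1] in part]
        cons.append(connectivity(part, internal))
    return sum(cons) / len(partitions)

nodes = ["T1", "T2", "T3", "T4", "T5"]
edges = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4"), ("T4", "T5")]
print(connectivity(nodes, edges))                       # 2*5/(25-5) = 0.5
print(quality([{"T1", "T2", "T3"}, {"T4", "T5"}], edges))
```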

2.5 Definition 5: weighted adjacency matrix

Given a data flow graph G = (E, V), we define the (n × n) weighted adjacency matrix W(G) as follows;

, and i j ij i j ijW e W α= = (3)

, 0 i iW = (4)

n is the number of nodes in G.

2.6 Definition 6: Degree matrix

Given a graph G = (E, V), we define the (n × n) degree matrix D(G) as follows:

D_{i,i} = Σ_{j=1}^{n} W_{i,j}    (5)

D_{i,j} = 0 for i ≠ j    (6)

where n is the number of nodes in G.

2.7 Definition 7: Laplacian matrix

Given a graph G = (E, V), we define the (n × n) Laplacian matrix of G as follows:

L(G) = D(G) − W(G)    (7)

where n is the number of nodes in G.
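Definitions 5 to 7 translate directly into code. The sketch below (plain Python; the four-node graph and its weights are invented) builds W, D and L; note that the adjacency matrix is symmetrised here, a common convention when the Laplacian is used for spectral partitioning:

```python
# Build the weighted adjacency, degree, and Laplacian matrices of
# equations (3)-(7).  weights[(i, j)] = alpha_ij for edge e_ij.
def build_matrices(n, weights):
    W = [[0.0] * n for _ in range(n)]
    for (i, j), a in weights.items():
        W[i][j] = a
        W[j][i] = a            # symmetrised so L(G) = D(G) - W(G) is the usual Laplacian
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = sum(W[i])    # equation (5)
    L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]  # equation (7)
    return W, D, L

weights = {(0, 1): 2.0, (0, 2): 1.0, (1, 3): 3.0, (2, 3): 1.0}
W, D, L = build_matrices(4, weights)
for row in L:
    print(row, sum(row))       # every Laplacian row sums to 0
```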

2.8 Definition 8: cut size

Given a temporal partitioning of G = (E, V) into k disjoint partitions P = {P1, P2…Pk}; we define the cut size, Cut(Pm), of partition Pm as follows:

Cut(P_m) = Σ_{Ti ∈ Pm, Tj ∈ P̄m} W_{i,j}    (8)

This implies:

T_Cut(G) = Σ_{m=1}^{K} Cut(P_m) = Σ_{m=1}^{K} Σ_{Ti ∈ Pm, Tj ∈ P̄m} W_{i,j}    (9)

where T_Cut(G) is the total cut size of the graph G and P̄_m = P − P_m.
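A direct reading of equations (8) and (9) (plain Python; the weighted edges and the two-partition split are invented):

```python
# Cut size of a partition and total cut size, equations (8)-(9).
def cut(part, others, weights):
    # sum of weights of edges crossing between `part` and `others`
    return sum(a for (i, j), a in weights.items()
               if (i in part and j in others) or (j in part and i in others))

def total_cut(partitions, weights):
    total = 0.0
    for m, part in enumerate(partitions):
        others = set().union(*(p for k, p in enumerate(partitions) if k != m))
        total += cut(part, others, weights)
    return total

weights = {("T1", "T2"): 2.0, ("T1", "T3"): 1.0, ("T2", "T4"): 3.0, ("T3", "T4"): 1.0}
P1, P2 = {"T1", "T2"}, {"T3", "T4"}
print(cut(P1, P2, weights))          # edges (T1,T3) and (T2,T4) cross: 1 + 3 = 4
print(total_cut([P1, P2], weights))  # each crossing edge counted from both sides: 8
```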


2.9 Definition 9: communication cost

Given a temporal partitioning of G = (E, V) into k disjoint partitions P = {P1, P2, …, Pk}, the communication cost of partition Pm has been defined as follows (Biswal et al., 2008):

Com_Cost(P_m) = ( Σ_{Ti ∈ Pm, Tj ∉ Pm} W_{ij} ) / |P_m|    (10)

For two partitions this implies:

Com_Cost(P1, P2) = W_{12}/|P1| + W_{12}/|P2| = W_{12} (|P1| + |P2|) / (|P1| |P2|) = W_{12} |V| / (|P1| |P2|)

and, generalising to K partitions:

T_Com_Cost = (1/2) Σ_{m=1}^{K} |V| W_m / (|P_m| |P̄_m|)    (11)

with W_m = Σ_{Ti ∈ Pm, Tj ∉ Pm} W_{ij}; m = 1, 2, …, K.

2.10 Definition 10: latency cost

Given a temporal partitioning P of the graph G = (V, E) into K disjoint temporal partitions P = {P1, …, Pk}, the whole latency of P (Lat_P) is calculated as follows:

Lat_P = K C_T + D(G)    (12)

where C_T is the time needed to reconfigure the device and

D(G) = Σ_{i=1}^{k} ||P_i||    (13)

where ||P_i|| is the latency of partition P_i.

3 Lemmas

3.1 Lemma 1

To calculate the lower bound on the number of partitions required to obtain a solution, we divide the total area of all nodes by the available reconfigurable resource. In other words, given a graph G = (V, E) partitioned into K disjoint temporal partitions P = {P1, …, Pk}, the lower bound on the number of temporal partitions is (Ouni et al., 2011):

K_min = ⌈A(G) / A(H)⌉    (14)

where A(G) is the area of the graph and A(H) is the area of the device.

3.2 Lemma 2

Given indicator vectors X_m = [X_m(1), X_m(2), …, X_m(n)]^t, where m = 1, 2, …, k and i = 1, 2, …, n, defined as X_m(i) = 1/|P_m| if T_i ∈ P_m, and 0 otherwise. We have (Ramzi et al., 2012):

T_Com_Cost = Σ_{m=1}^{K} X_m^t L X_m    (15)

3.3 Lemma 3

Given a temporal partitioning of G(V, E) into k disjoint partitions P = {P1, P2, …, Pk} (Mohar and Poljak, 1990):

Cut(P_m) ≤ λ_Max |V| / 4    (16)

where λ_Max is the highest eigenvalue of the matrix L(G).
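Lemma 3 can be verified numerically. The sketch below (plain Python; a made-up four-node path graph with unit weights) estimates λ_Max of L(G) by power iteration, then checks the bound for one split:

```python
# Numerical check of Lemma 3 on a small unit-weight graph.
def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def lambda_max(L, iters=200):
    # Power iteration: L is symmetric positive semi-definite, so the
    # dominant eigenvalue is the largest Laplacian eigenvalue.
    x = [1.0, -1.0, 0.5, -0.5][:len(L)]
    for _ in range(iters):
        y = matvec(L, x)
        norm = max(abs(v) for v in y) or 1.0
        x = [v / norm for v in y]
    y = matvec(L, x)
    return sum(y[i] * x[i] for i in range(len(x))) / sum(v * v for v in x)

# path graph T0-T1-T2-T3 with unit weights
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(4)] for i in range(4)]
lam = lambda_max(L)
cut_P1 = 1  # the split {T0, T1} | {T2, T3} cuts the single edge T1-T2
print(lam, lam * 4 / 4)            # bound of equation (16): lambda_max * |V| / 4
assert cut_P1 <= lam * 4 / 4 + 1e-9
```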

3.4 Lemma 4

Given an (n × n) matrix M defined as follows: M_{ij} = 1/|P_m| if T_i and T_j ∈ P_m, and 0 otherwise (Ramzi et al., 2012):

X(G) X^t(G) = M    (17)

where X(G) is the matrix that contains the k indicator vectors, as defined in Lemma 2, as columns.

3.5 Lemma 5

Given a temporal partitioning of G(V, E) into k disjoint partitions P = {P1, P2, …, Pk} and given K indicator vectors X_m = [X_m(1), X_m(2), …, X_m(n)], where m = 1, 2, …, k and i = 1, 2, …, n (Ramzi et al., 2012):

X_m(i) = |V| − |P_m| if T_i ∈ P_m, and −|P_m| otherwise; |P_m| is the number of nodes in partition P_m. Then:

|V|² Cut(P_m) = X_m^t L(G) X_m    (18)
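The identity of Lemma 5 is easy to verify numerically (plain Python; the unit-weight four-cycle and the split below are invented):

```python
# Check |V|^2 * Cut(Pm) = Xm^t L(G) Xm for the indicator of Lemma 5.
n = 4
W = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]  # 4-cycle, unit weights
L = [[(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)] for i in range(n)]

Pm = {0, 1}                      # |Pm| = 2; cut edges (0,2) and (1,3), so Cut = 2
Xm = [(n - len(Pm)) if i in Pm else -len(Pm) for i in range(n)]  # Lemma 5 indicator

quad = sum(Xm[i] * L[i][j] * Xm[j] for i in range(n) for j in range(n))
cut = sum(W[i][j] for i in Pm for j in range(n) if j not in Pm)
print(quad, n * n * cut)         # both sides of equation (18)
```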

4 Constraints

4.1 A(H): the resource capacity of the reconfigurable device

The sum of the resource costs of all the tasks mapped to a temporal partition must be less than the overall resource constraint of the available reconfigurable area. A temporal partitioning is feasible with respect to a reconfigurable device H with area A(H) if:

∀ P_i ∈ P: A(P_i) = Σ_{Ti ∈ Pi} a(T_i) ≤ A(H)    (19)

4.2 Precedence constraint

This constraint can be defined as follows: if a task T2 depends on a task T1, then T1 must be placed in the same partition as T2 or in an earlier one.

This constraint can be formulated as follows: ∀ T1, T2 such that T2 depends on T1; ∀ P2, 1 ≤ P2 ≤ N:

X_{T2 P2} + Σ_{P2 ≺ P1 ≤ N} X_{T1 P1} ≤ 1    (20)

where X_{TP} = 1 if the task T is placed in partition P, otherwise X_{TP} = 0.

4.3 Uniqueness constraint

According to this constraint, every task should be placed in a unique partition. This constraint can be formulated by the following equation:

∀ T ∈ V: Σ_{P=1}^{k} X_{TP} = 1    (21)

4.4 Mmax: the temporary on-board memory

A weight W_{ij} ≠ 0 signifies that T_j depends on T_i. When node T_i is placed in partition P_m and T_j is placed outside P_m, the data communicated between them has to be stored in the temporary on-board memory. Consequently, the sum of all the data communicated across all partitions should be less than the memory constraint M_max:

Σ_{m=1}^{K} Σ_{Ti ∈ Pm, Tj ∈ P̄m} W_{ij} ≤ M_max    (22)

5 Temporal partitioning algorithms

5.1 Whole latency temporal partitioning algorithms

This kind of algorithm can be formulated as follows: given a data flow graph, a reconfigurable device and a set of constraints, find a way of partitioning the graph on the device that minimises the whole latency of the design while satisfying all constraints.

5.1.1 ILP algorithm

Integer linear programming (ILP) (Byungil, 1999; Kaul and Vermuri, 1999; Wu et al., 2001) is one of the famous approaches widely used to solve the temporal partitioning problem. To build the ILP model, we need to determine the number of partitions for which a solution is to be obtained. We start by getting a lower bound on the number of partitions for the particular problem: we sum the resources for all tasks and divide this value by the available resource; the result is the minimum number of partitions required to obtain a solution. Next, we introduce a 0–1 variable X_{TP}: X_{TP} = 1 if the node T is placed in partition P, otherwise X_{TP} = 0. The minimisation goal is stated as:

Minimise (K * C_time + Σ_{i=1}^{N} L(P_i))    (23)

that is:

K * C_time + Σ_{i=1}^{N} L(P_i) => min    (24)

∀ p, 1 ≤ p ≤ k; ∀ (T_i →p T_j) ∈ P_rl    (25)

Minimise ( Σ_{p=1}^{k} Σ_{T ∈ (Ti →p Tj)} X_{TP} * lat(T) )    (26)

where K is the number of partitions, C_time the reconfiguration time, T_i →p T_j a directed path from T_i to T_j, and P_rl the set of paths from root tasks to leaf tasks.

Hence, the ILP formalism can be formulated as:

Minimise ( Σ_{p=1}^{k} Σ_{T ∈ (Ti →p Tj)} X_{TP} * lat(T) )    (27)

Subject to:

∀ p ∈ P: Σ_{T ∈ V} X_{TP} * a(T) ≤ A(H) (area constraint)    (28)

∀ T1 → T2; ∀ P2, 1 ≤ P2 ≤ N: X_{T2 P2} + Σ_{P2 ≺ P1 ≤ N} X_{T1 P1} ≤ 1 (precedence constraint)    (29)

∀ T ∈ V: Σ_{P=1}^{k} X_{TP} = 1 (uniqueness constraint)    (30)

The main problem of the ILP approach to partitioning a graph is its high execution time. In fact, the size of the computation model grows very fast; therefore the algorithm can only be applied to small examples. To overcome this problem, some authors reduced the size of the model by reducing the set of constraints in the problem formulation, but the number of variables and precedence constraints to be considered still remains high.
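For intuition, the ILP above can be emulated by exhaustive search on a tiny instance (plain Python; the four-task graph, areas, latencies and device parameters are all invented). Real solvers use branch-and-bound, but enumeration makes constraints (28)-(30) explicit:

```python
from itertools import product

# Tiny made-up instance: 4 tasks with areas, latencies and precedences.
area = {"T1": 2, "T2": 2, "T3": 2, "T4": 2}
lat = {"T1": 1, "T2": 2, "T3": 2, "T4": 1}
deps = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4")]
A_H, K, C_time = 4, 2, 10   # device area, partition count (Lemma 1), reconfig time

def part_latency(tasks, deps, lat):
    # latency of one partition = longest dependency chain inside it
    tasks = set(tasks)
    memo = {}
    def longest(t):
        if t not in memo:
            preds = [longest(a) for a, b in deps if b == t and a in tasks]
            memo[t] = lat[t] + max(preds, default=0)
        return memo[t]
    return max((longest(t) for t in tasks), default=0)

best = None
for assign in product(range(K), repeat=len(area)):
    placing = dict(zip(area, assign))           # uniqueness (30) holds by construction
    if any(sum(area[t] for t, p in placing.items() if p == q) > A_H
           for q in range(K)):                  # area constraint (28)
        continue
    if any(placing[a] > placing[b] for a, b in deps):  # precedence (29)
        continue
    cost = K * C_time + sum(
        part_latency([t for t, p in placing.items() if p == q], deps, lat)
        for q in range(K))                      # objective of equation (23)
    best = cost if best is None else min(best, cost)
print(best)   # 26 here: 2 reconfigurations plus two partitions of latency 3
```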

5.1.2 Eigenvectors-based Latency_algorithm

This algorithm (Ouni et al., 2011) shows that to reduce the whole latency of the graph, we need to maximise the cut size between design partitions. Next, the algorithm uses the graph’s eigenvectors to maximise the cut size.

Temporal partitioning of data flow graphs for reconfigurable architectures 25

Given a temporal partitioning P of the graph G = (V, E) into K disjoint temporal partitions P = {P1, …, Pk}, the whole latency of P (Lat_P) is calculated as follows:

Lat_P = K C_T + D(G)    (31)

where C_T is the time needed to reconfigure the device and

D(G) = Σ_{i=1}^{k} ||P_i||    (32)

Let C^{Pi} = {C_1^{Pi}, …, C_{ni}^{Pi}} be the set of paths in the partition P_i. Hence:

||P_i|| = max_{1 ≤ j ≤ ni} (C_j^{Pi})    (33)

We apply equation (34) to every couple (T_i, T_j) such that T_j depends on T_i:

L_lat(T_i) = L_lat(T_i) + σ_{ij}    (34)

Using equation (34):

C_j^{Pi} = Σ_{Tm ∈ C_j^{Pi}} L_lat(T_m)    (35)

Using equations (31) and (35), the whole latency can be expressed as follows:

Lat_P = K C_T + Σ_{i=1}^{k} max_{1 ≤ j ≤ ni} Σ_{Tm ∈ C_j^{Pi}} L_lat(T_m)    (36)

Based on equation (36), to minimise the whole latency we need to minimise the term max_{1 ≤ j ≤ ni} Σ_{Tm ∈ C_j^{Pi}} L_lat(T_m). One solution is to minimise the number of nodes in every path C_j^{Pi} ∈ C^{Pi}. On the other hand, given two nodes T1, T2 ∈ V: if T1, T2 ∈ C_j^{Pi} then there is an edge e_12 between T1 and T2. For that reason, to minimise the number of nodes in every path C_j^{Pi}, one solution is to minimise the number of edges in every path C_j^{Pi}. This aim can be reached by minimising the number of edges in partition P_i, which in turn can be reached by maximising the number of edges between the different partitions. As a result, the whole latency minimisation problem can be expressed as follows: given a graph G(V, E), find a way of partitioning the graph such that the number of edges between design partitions is highest while respecting all constraints.

The above problem can be stated as: how to maximise the total cut size (T_Cut) between the design partitions. Hence, the latency minimisation problem can be expressed as follows:

Minimise(Lat_P) => Maximise(T_Cut(G))    (37)

Hence, based on equation (37), the whole latency minimisation problem can be solved by maximising the total cut size between design partitions. In this section, we present a good solution for this problem.

We introduce the graph complement G′(V, E′) of G(V, E) as follows: the complement of a graph G is a graph G′ on the same nodes such that two nodes of G′ are connected if and only if they are not connected in G.

Given a temporal partitioning of G(V, E) into k disjoint partitions P = {P1, P2…Pk}, we have:

Cut(P_m)(G) + Cut(P_m)(G′) = |P_m| (|V| − |P_m|)    (38)

=> Cut(P_m)(G) = |P_m| (|V| − |P_m|) − Cut(P_m)(G′)    (39)

Based on equation (39), ∀ P_m ∈ P: if P_m has the highest cut in the graph G then it has the lowest cut in the graph G′. Hence, if T_Cut(G) has the highest value then T_Cut(G′) has the lowest value. Therefore, the whole latency minimisation problem reduces to minimising the total cut T_Cut(G′) of the graph G′ instead of maximising T_Cut(G) of the graph G. As a result, to minimise the whole latency of the graph G, we need to minimise the total cut size of the graph G′.

Based on Lemma 5, we have:

T_Cut(G′) = Σ_{m=1}^{K} Cut(P_m)(G′) = (1/|V|²) Σ_{m=1}^{K} X_m^t L(G′) X_m = (1/|V|²) trace[X^t(G′) L(G′) X(G′)]    (40)

Therefore, to minimise T_Cut(G′), we need to minimise:

trace[X^t(G′) L(G′) X(G′)]    (41)

In Alpert and Yao (1995), the authors showed that the lowest value of trace[X^t(G′)L(G′)X(G′)] is obtained when the matrix X(G′) contains the k eigenvectors corresponding to the k lowest eigenvalues of the matrix L(G′) as columns. In conclusion, the whole latency minimisation problem can be solved by choosing an assignment matrix X(G′) that contains the eigenvectors of the k lowest eigenvalues of L(G′) as columns.
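The construction can be sketched with NumPy (assumed available; the six-node DFG below is invented). It builds the complement graph G′, its Laplacian, and takes the eigenvectors of the k lowest eigenvalues as the assignment matrix X(G′):

```python
import numpy as np

# Spectral step of the latency algorithm: partition via the complement graph.
n, k = 6, 2
A = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]:  # made-up DFG edges
    A[i, j] = A[j, i] = 1

A_comp = (1 - A) - np.eye(n)                    # complement graph G' (no self-loops)
L_comp = np.diag(A_comp.sum(axis=1)) - A_comp   # Laplacian L(G')

vals, vecs = np.linalg.eigh(L_comp)   # eigh returns ascending eigenvalues
X = vecs[:, :k]                       # k eigenvectors of the k lowest eigenvalues
print(np.round(vals, 3))
# a simple rounding of X into two clusters (sign of the second eigenvector)
groups = (X[:, 1] >= 0).astype(int)
print(groups)
```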

5.1.3 Dynamic algorithm

This temporal partitioning algorithm (Ouni et al., 2004) divides the graph into an optimal number of partitions and associates each task with the most appropriate partition, so that the latency of the graph is optimal while satisfying all constraints. The following steps summarise the functioning of this algorithm (Figure 1):

Step 1 The algorithm computes the two unconstrained schedules: the ASAP and the ALAP scheduling.


Step 2 This step finds the fixed nodes. To find the fixed nodes we use the following formulas:

First(n_i) = ALAP(n_i)    (42)

If First(n_i) = 1 then node n_i should be placed within the first partition; else we calculate

Last(n_i) = N_cont − ASAP(n_i)    (43)

If Last(n_i) = 0 then node n_i should be placed in the last partition; else, if First(n_i) and Last(n_i) are both non-zero, node n_i can be placed in any P_i ∈ [P_1, P_k]. N_cont is the number of control steps.

Step 3 The algorithm calculates the minimum number of partitions Nmin, as shown in Lemma 1.

Step 4 This step is the key of the algorithm: it generates all possible schedules that may exist on Nmin partitions, based on an iterative process illustrated in Ouni et al. (2004).

Step 5 During this step the algorithm verifies all constraints. If the constraints are satisfied for at least one schedule, then it keeps Nmin as the number of partitions and generates further schedules. Else it relaxes the number of partitions by one (Nmin = Nmin + 1) and goes back to Step 2.

Step 6 Once all possible schedules have been generated, we choose the schedule that has the lowest latency.
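Steps 1 and 2 can be sketched as follows (plain Python; the four-node DFG is invented, and control steps are unit-latency). ASAP and ALAP are computed, then First and Last identify the fixed nodes per equations (42)-(43):

```python
# ASAP/ALAP scheduling and fixed-node detection (Steps 1-2 of the dynamic algorithm).
deps = {"T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}   # predecessors per node
nodes = ["T1", "T2", "T3", "T4"]                          # topological order

asap = {}
for t in nodes:
    asap[t] = 1 + max((asap[p] for p in deps.get(t, [])), default=0)

n_steps = max(asap.values())            # Ncont: number of control steps
succs = {t: [c for c, ps in deps.items() if t in ps] for t in nodes}
alap = {}
for t in reversed(nodes):
    alap[t] = min((alap[s] - 1 for s in succs[t]), default=n_steps)

for t in nodes:
    first, last = alap[t], n_steps - asap[t]      # equations (42)-(43)
    if first == 1:
        print(t, "-> fixed in the first partition")
    elif last == 0:
        print(t, "-> fixed in the last partition")
    else:
        print(t, "-> free, mobility", alap[t] - asap[t])
```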

Figure 1 Dynamic algorithm (see online version for colours)

[Flowchart: compute ASAP and ALAP (Step 1); find the fixed nodes (Step 2); estimate Nmin and set i = 1 (Step 3); generate schedule (i) (Step 4); on constraint violation (Step 5), either keep Nmin and generate schedule (i + 1), or set Nmin = Nmin + 1 and i = 1; when generation ends, look for the optimum schedule (Step 6) and exit.]


The execution time of this algorithm depends on the connectivity of the graph. In fact, in some cases Steps 1 and 2 of the algorithm may decrease the number of possible schedules. This leads to a significant gain in execution time; that is why this algorithm, like ILP, always gives the optimal solution, but with a shorter CPU time.

5.2 Whole communication cost temporal partitioning algorithms

5.2.1 Enhanced list scheduling algorithm

This approach (Ouni et al., 2011) focuses on minimising the required data transfer between different temporal partitions of the design. The goal to be reached during partitioning is the minimisation of the communication overhead among the partitions, which also means the minimisation of the communication memory. The goal is formulated as the minimisation of the overall cut-size among the partitions. A small cut size among the partitions means fewer edges connecting the partitions, less communication and therefore a good partitioning quality. This algorithm is based on the following steps:

Step 1 Calculation of ASAP and ALAP scheduling

In this step, the algorithm calculates the as-soon-as-possible (ASAP) and the as-late-as-possible (ALAP) schedules of the tasks of the input task graph model.

Step 2 Calculation of mobility

After computing the ASAP and ALAP schedules, the algorithm calculates the mobility of each task as follows:

Mobility(node(n_i)) = ALAP(node(n_i)) − ASAP(node(n_i))    (44)

where ALAP(node(n_i)) is the control step of node n_i according to the ALAP schedule, and ASAP(node(n_i)) is the control step of node n_i according to the ASAP schedule.

Step 3 Building a principal list

In this step, the algorithm calculates the priority on the urgency (PUr(n_i)) and the priority on the mobility (PMo(n_i)) of each node. Then, it calculates the priority of each node in the input graph. Finally, the algorithm places each node on the list according to its priority; the node having the highest priority is placed first on the list.

The priority of each node is calculated as follows:

P_r(n_i) = P_Ur(n_i) + P_Mo(n_i)    (45)

The priority on the urgency of each node is calculated as follows:

P_Ur(n_i) = N_Cstep − ALAP(n_i)    (46)

where N_Cstep is the number of control steps.

The priority on the mobility of each node is calculated as follows:

P_Mo(n_i) = 1 / (1 + Mobility(node(n_i)))    (47)

As shown, given two nodes n_i and n_j, if P_Ur(n_i) is greater than P_Ur(n_j), then P_r(n_i) is greater than P_r(n_j) regardless of the P_Mo of each node. Thus, the dependency constraint is always satisfied.

The priority on the mobility is added to put nodes with low mobility (especially nodes without mobility) in the list as soon as all predecessors have been placed on the list.

Step 4 Building a dependency list for each node

In this step, the algorithm builds the dependency matrix M_d, defined as follows: given a graph G(E, V), the dependency matrix D of G is D = (d_{i,j}) with 1 ≤ i, j ≤ |V|, where d_{i,j} = 1 if v_j depends on v_i, otherwise d_{i,j} = 0. Then the algorithm builds the dependency list of each node n_i:

Dep_list(n_i) = {n_j | Φ_{i,j} = 1, 1 ≤ j ≤ |V|}    (48)

where Φ_{i,j} = 1 if node n_j depends on node n_i; otherwise Φ_{i,j} = 0.

In the same way, the node having the highest priority is placed first in the dependency list of node n_i, n_i ∈ V.

Step 5 Building partition

The algorithm puts the first node from the list in the first partition, and it moves nodes from its dependency list to this partition until the size of the reconfigurable area is reached. Next, the algorithm puts the first unplaced node from the principal list in a new partition, and then moves nodes from its dependency list to this partition until the size of the reconfigurable area is reached. This process is repeated until all nodes are placed in partitions.
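Steps 2 and 3 of the list construction can be sketched as follows (plain Python; the ASAP/ALAP values of a made-up four-node graph are given directly):

```python
# Priority list of the enhanced list scheduling, equations (44)-(47).
asap = {"T1": 1, "T2": 2, "T3": 2, "T4": 3}
alap = {"T1": 1, "T2": 2, "T3": 2, "T4": 3}
n_cstep = 3

def priority(n):
    mobility = alap[n] - asap[n]            # equation (44)
    p_ur = n_cstep - alap[n]                # urgency priority, equation (46)
    p_mo = 1.0 / (1 + mobility)             # mobility priority, equation (47)
    return p_ur + p_mo                      # equation (45)

principal_list = sorted(asap, key=priority, reverse=True)
print([(n, priority(n)) for n in principal_list])   # T1 first, T4 last
```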

5.2.2 Network flow algorithm

The network flow methodology (Liu and Wong, 1998a, 1998b; Ouni et al., 2008) has been used to reduce the communication cost across temporal partitions. The method is a recursive bipartition approach that successively partitions a set of remaining nodes into two sets, one of which is a final partition, whereas a further partitioning step must be applied to the second one. Figure 2 presents the network flow algorithm as presented in Liu and Wong (1998a, 1998b).


Figure 2 Network flow algorithm

Begin
1 Construct G′ from G by net modelling
2 Pick a pair of nodes s and t in G′ as source and sink
3 Find a min cut C in G′; let X be the sub-circuit reachable from s through augmenting paths, and X′ be the rest
4 If (lr ≤ w(X) ≤ ur) then stop and return C as answer
5 If (w(X) < lr) then collapse all nodes in X to s; pick a node v in X′, and collapse v to s; go to Step 3
6 If (w(X) > ur) then collapse all nodes in X′ to t; pick a node v in X, and collapse v to t; go to Step 3
End

where w(X) is the total area of all nodes in X; lr = (1 − ε) Rmax, with Rmax the area of the device; ur = (1 + ε) Rmax; ε = 0.05; s is the source node and t is the sink node.
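The min-cut step of Figure 2 can be sketched with a plain-Python Edmonds-Karp max-flow (the tiny capacitated graph below is invented; a real implementation would also need the net-modelling and node-collapsing steps of the full algorithm):

```python
from collections import deque

# Edmonds-Karp max-flow; the min cut is the set X of nodes still
# reachable from s in the residual graph, as in Step 3 of Figure 2.
def min_cut(cap, s, t):
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in range(n):
                if v not in parent and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t                 # augment along the shortest path found
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
    return set(parent)                  # source side X of the min cut

# 0 = s, 3 = t; the bottleneck sits after node 1
cap = [[0, 5, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
X = min_cut(cap, 0, 3)
print(sorted(X))   # [0, 1]: s and node 1 stay on the source side, cut value 2
```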

5.2.3 Graph eigenvectors-based communication Cost_algorithm

This algorithm (Ramzi et al., 2012) is based on a typical mathematical flow to solve the temporal partitioning problem. The algorithm (Figure 3) optimises the transfer of data required between design partitions and the reconfiguration overhead.

Figure 3 Graph eigenvector-based CC_algorithm

1 Compute the minimum number of partitions K = Min_Part = ⌈Area(G) / Area(H)⌉
2 Compute the Laplacian matrix L(G) of G
3 Compute the k lowest eigenvalues of L(G)
4 Construct X, an n × k matrix whose columns are the k eigenvectors
5 Compute Z = X X^t
6 Construct the matrix M = (M_{i,j}) from Z: M_{i,j} = 1 if Z_{i,j} ≥ 1/n, 0 otherwise, where n is the number of nodes
7 Generate the initial partitioning using the M matrix

5.3 Whole area cost temporal partitioning algorithms

5.3.1 List scheduling algorithm

The list scheduling algorithm has been used in Trimberger (1998), Cardoso (2003) and Mtibaa et al. (2007). The main idea of this method consists in placing all nodes of the graph on a list. The first partition is built by removing nodes from the list to the partition until the size of the target area is reached. Then, a new partition is built and the process is repeated until all nodes are placed in partitions. This technique is often guided by the as-soon-as-possible (ASAP) and/or the as-late-as-possible (ALAP) scheduling. The mobility of a given node, i.e., the difference between its ALAP value and its ASAP value, can be used as its priority. At any time step t, the so-called ready set, that is the set of operations ready to be scheduled, is constructed. The ready set contains operations whose predecessors have already been scheduled early enough to complete their execution at time t. The algorithm checks whether there are enough resources of a given type k to implement all the operations of type k. If so, the operations are assigned the resources; otherwise, higher-priority nodes are assigned the available resources and the rest of the operations will be scheduled later, when some resources become available. If the mobility of a node is used as the priority criterion, it is possible that all operators in the ready list are on critical paths, which means that their mobility is zero. As a consequence, the schedule depth of each such operator is increased by one, thus increasing the latency of the graph's execution. The main advantage of the list scheduling technique is its very fast run time.

5.4 Whole latency-communication cost optimisation algorithms

This approach (Ouni et al., 2009) (Figure 4) introduces a new temporal partitioning algorithm that combines the integer linear programming and network flow techniques to solve the temporal partitioning problem. The algorithm puts each task in the appropriate partition in order to decrease the transfer of data required between partitions while handling the whole latency of the design. First, we start from an existing partitioning; here, the partitioning given by the ILP technique is chosen, because it always gives the optimal solution in terms of whole latency. Then, we allow an augmentation of the latency lat(p_i) of partition p_i by a value of (τ * lat(p_i)). This implies that we allow lat(p_i) to deviate from lat(p_i) to (1 + τ) lat(p_i); in other words, lat(p_i) < D_pi < (1 + τ) lat(p_i), where D_pi is the new latency value of partition p_i. The value of τ is fixed by the user; in this paper we consider τ = 5%. Starting from the ILP partitioning, the technique balances tasks from partition p_i to p_j or inversely. The balance of tasks is based on the force F(Ti, p_i → p_j) associated with partition p_i on a node Ti to be scheduled into partition p_j, and on the force F(Ti, p_j → p_i) associated with partition p_j on a node Ti to be scheduled into partition p_i. For instance, let us assume that p_i < p_j; p_i, p_j ∈ P. These forces are calculated as follows:

F(Ti, pi → pj) = δ(Ti) * OF(Ti)   (49)

δ(Ti) = 0 if there is a task Tj ∈ pi that is an input of Ti; otherwise, δ(Ti) = 1.


Figure 4 Latency-communication cost algorithm

Get the temporal partitioning solution of G by using the ILP technique
for i = 1; i ≤ N; i = i + 1 do        // N is the number of partitions
  for j = i; j ≤ N; j = j + 1 do
    List_nodes = generate node list with associated forces
    ni = first non-removed node from List_nodes
    while List_nodes ≠ Φ do
      if ni ∈ pi then
        if (Dpi < (1 + τ) * lat(pi)) and (size_partition(i) + size_node(ni) ≤ Rmax) then
          // Rmax is the available reconfigurable area
          partition(j) <= ni
        end if
      end if
      if ni ∈ pj then
        if (Dpj < (1 + τ) * lat(pj)) and (size_partition(j) + size_node(ni) ≤ Rmax) then
          partition(i) <= ni
        end if
      end if
      ni = next non-removed node from List_nodes
    end while
  end for
end for
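The feasibility test at the heart of the loop above, i.e., moving a node only when the destination partition's latency slack (τ) and the device area (Rmax) both allow it, can be sketched as follows. The data structures and the exact form of the slack check are our assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of the balancing step of Figure 4: a node is
# moved into a neighbouring partition only if the added latency fits
# in the tau-slack of the destination and the destination still fits
# in the reconfigurable area Rmax. Names and structures are assumed.

def try_move(node, src, dst, partitions, lat, size, tau, rmax):
    """Move `node` from partition `src` to `dst` if constraints hold."""
    # Assumed slack check: the node's latency must fit in the
    # tau * lat(dst) latency budget granted to the destination.
    slack_ok = lat[node] <= tau * sum(lat[n] for n in partitions[dst])
    # Area check: destination plus the node must not exceed Rmax.
    area_ok = (sum(size[n] for n in partitions[dst]) + size[node]) <= rmax
    if slack_ok and area_ok:
        partitions[src].remove(node)
        partitions[dst].append(node)
        return True
    return False
```

With τ = 5% the move of a 10 ns node into a 100 ns partition is rejected (10 > 5), while with τ = 20% and enough free area it is accepted.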

OF(Ti) = Nu(Ti) / (Nu(Ti) + 1)

where, given tasks Ti, Tj ∈ pi, Nu(Ti) = Σ_{Tj ∈ pi} αij, with αij = 1 if Tj is an input of Ti, and 0 otherwise.

F(Ti, pj → pi) = δ(Ti) * InF(Ti)   (50)

InF(Ti) = Nq(Ti) / (Nq(Ti) + 1)

where, given tasks Ti, Tj ∈ pj, Nq(Ti) = Σ_{Tj ∈ pj} αij, with αij = 1 if Tj is an output of Ti, and 0 otherwise.

In general, due to the scheduling of one node, the schedules of other nodes will also be affected. At each iteration, the force of every node being scheduled in every possible partition is computed. Then, the distribution graph is updated and the process repeats until no more nodes remain to be scheduled.

6 Experiments

6.1 DCT data flow graph

Since the DCT is the most computationally intensive part of the CLD algorithm, it has been chosen for hardware implementation, and the remaining subtasks (partitioning, colour selection, quantisation, zig-zag scanning and Huffman encoding) were chosen for software implementation. The model proposed by Mtibaa et al. (2007) is based on 16 vector products. Thus, the entire DCT is a collection of 16 nodes, where each node is a vector product, as presented in Figure 5.
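To illustrate the vector-product decomposition, the sketch below builds an N-point DCT as N vector-product nodes, one dot product of the input with a cosine basis row per output coefficient. It is a generic illustration of the idea, not the exact model of Mtibaa et al. (2007).

```python
# Sketch of the vector-product view of the DCT: each output
# coefficient of an N-point DCT-II is the dot product of the input
# vector with one row of the cosine basis matrix, so an N-point DCT
# decomposes into N vector-product nodes.
import math

def dct_basis(n):
    """Rows of the orthonormal DCT-II basis matrix."""
    rows = []
    for k in range(n):
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def vector_product(row, x):
    # One DFG node: a single dot product.
    return sum(a * b for a, b in zip(row, x))

def dct(x):
    # One vector-product node per output coefficient.
    return [vector_product(row, x) for row in dct_basis(len(x))]
```

For a constant input such as [1, 1, 1, 1], only the DC coefficient is non-zero, as expected of a DCT.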

Figure 5 Vector products

There are two kinds of nodes in the graph, 'T1' and 'T2'. The structure of 'T1' and 'T2' is similar to a vector product, but with different bit widths. Table 1 gives the characteristics of the 4 × 4 DCT and 16 × 16 DCT graphs; the 16-FFT and 64-FFT graphs are characterised in Table 4.

Table 1 DCT benchmark characteristics

DFGs          Nodes   Edges   Area (CLB)
DCT 4 × 4     224     256     8,045
DCT 16 × 16   1,929   2,304   13,919

Table 2 summarises the design results given by each algorithm.

Design results show that the graph eigenvectors-based Latency_algorithm offers a good trade-off between the latency of the graph and the number of partitions. Hence, this algorithm qualifies as a good temporal partitioning candidate. In fact, an optimal partitioning algorithm needs to balance the computation required for each partition and reduce the number of partitions so that mapped applications can execute faster on dynamically reconfigurable hardware.


Table 2  Design results

4 × 4 DCT graph:

Algorithm                               Number of partitions   Latency D(G) (ns)   Run time
List scheduling                         9                      4,770               0.2 sec
ILP                                     –                      –                   Infeasible
Initial network flow                    9                      4,395               0.12 sec
Improved network flow                   9                      4,570               0.12 sec
Graph eigenvectors-based L_algorithm    7                      3,492               0.3 sec
Graph eigenvectors-based CC_algorithm   7                      5,770               0.2 sec

16 × 16 DCT graph:

Algorithm                               Number of partitions   Latency D(G) (ns)   Run time
List scheduling                         15                     6,610               1.5 sec
ILP                                     –                      –                   Infeasible
Initial network flow                    15                     6,420               1.5 sec
Improved network flow                   15                     7,730               1.5 sec
Graph eigenvectors-based L_algorithm    12                     5,196               1.5 sec
Graph eigenvectors-based CC_algorithm   11                     8,420               2 sec

Figure 6 Intra prediction graph (see online version for colours)

[Figure: flow graph of the H.264/AVC intra prediction encoder. A frame is read from the input video file; 16 × 16 and 4 × 4 macroblocks are selected and their neighbouring pixels fetched; the nine 4 × 4 prediction modes (vertical, horizontal, DC, diagonal down-left, diagonal down-right, vertical-right, horizontal-down, vertical-left, horizontal-up) and the four 16 × 16 modes (vertical, horizontal, DC, plane) are evaluated; the SAD is computed for each mode, the minimum-SAD mode is chosen for 4 × 4 and for 16 × 16, and the best of the two is kept; the residue is then computed and passed through DCT, quantisation, inverse quantisation, inverse DCT and entropy coding.]


Table 3  Design results (H.264 AVC graph)

Algorithm                              Number of partitions   Latency D(G) (ns)   Run time
List scheduling                        18                     9,710               1.8 sec
ILP                                    –                      –                   Infeasible
Initial network flow                   18                     8,940               1.9 sec
Improved network flow                  17                     8,535               1.9 sec
Graph eigenvectors-based L_algorithm   13                     6,743               1.7 sec

6.2 Intra prediction block

The prediction algorithms are the main elements of the H.264 algorithm. Indeed, 'inter' prediction exploits the temporal correlation between successive images, while 'intra' prediction exploits the spatial correlation within the same image. These two prediction modes allow a considerable gain in quality and compression ratio. In intra mode, shown in Figure 6, the predicted block is formed from previously encoded blocks. This predicted block is subtracted from the current block prior to encoding. For the luminance (luma) samples, the predicted block may be formed as a 4 × 4 sub-block or as a 16 × 16 macroblock. There are nine optional prediction modes for a 4 × 4 luma block and four optional modes for a 16 × 16 luma block.
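The SAD-driven mode choice described above can be sketched as follows; the block contents and mode names are illustrative, not taken from the encoder.

```python
# Sketch of SAD-based intra-mode selection: compute the sum of
# absolute differences between the current block and each candidate
# prediction, then keep the minimum-SAD mode.

def sad(block, prediction):
    """Sum of absolute differences between two equal-sized 2-D blocks."""
    return sum(abs(a - b) for row_b, row_p in zip(block, prediction)
               for a, b in zip(row_b, row_p))

def best_mode(block, candidates):
    """candidates: dict mapping mode name -> predicted block."""
    return min(candidates, key=lambda m: sad(block, candidates[m]))
```

A mode whose prediction matches the current block exactly has SAD 0 and is selected.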

Table 3 gives the different solutions provided by the ILP algorithm, the list scheduling, the initial network flow technique, the enhanced network flow and the proposed algorithm. Results show that the graph eigenvectors-based Latency_algorithm always has the lowest number of partitions, with a latency improvement of 30.55% compared to list scheduling, 24.57% compared to the network flow algorithm, and 20.99% compared to the enhanced network flow algorithm.

Table 4  FFT benchmark characteristics

DFGs     Nodes   Edges   Area (CLBs)
16-FFT   203     228     5,226
64-FFT   1,287   664     10,098

6.3 Fast Fourier transform

The fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. 16-FFT and 64-FFT are 16-point and 64-point FFTs, respectively; they play important roles in the analysis, design, and implementation of discrete-time signal processing algorithms and systems. Table 4 gives the characteristics of the 16-FFT and 64-FFT task graphs.
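The divide-and-conquer structure that makes the FFT efficient can be sketched as follows: each N-point transform recurses into two N/2-point transforms plus N/2 butterfly operations, which is what shapes the FFT task graphs above. This is a generic textbook sketch, not the benchmark implementation.

```python
# Minimal radix-2 decimation-in-time FFT: the DFT recurses into an
# even-index and an odd-index half-size transform, combined by N/2
# butterfly nodes per level.
import cmath

def fft(x):
    n = len(x)                       # n must be a power of two
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):          # one butterfly per pair of outputs
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

For a constant input, all energy lands in the DC bin, e.g. a 4-point FFT of [1, 1, 1, 1] yields [4, 0, 0, 0].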

Table 5 gives the different solutions provided by the list scheduling, the network flow technique, and the graph eigenvectors-based communication Cost_algorithm.

6.4 Differential equation

A numerical method for solving a differential equation as described in Bobda (2003) is considered. Figure 7 shows the DFG for solving a differential equation of the form y″ + 3xy′ + 3y = 0 in the interval [x0, a], with step size dx and initial values y(x0) = y0, y′(x0) = u0, using Euler's method. The multiplication requires 100 CLBs, and the addition, the comparison and the subtraction require 50 CLBs each.
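The Euler iteration behind the DFG of Figure 7 can be sketched as follows, assuming the classic HLS 'diffeq' formulation: u stands for y′, and each loop body performs the multiplications, additions, subtractions and the comparison that appear as the nodes T1–T11 of the graph.

```python
# Euler's method for y'' + 3xy' + 3y = 0 on [x0, a] with step dx.
# Substituting u = y' gives the two first-order updates below; the
# arithmetic mirrors the multiplier, adder, subtractor and comparator
# nodes of the DFG.

def diffeq(x, y, u, dx, a):
    while x < a:                                # comparison node
        x1 = x + dx                             # addition
        y1 = y + u * dx                         # multiply + add
        u1 = u - 3 * x * u * dx - 3 * y * dx    # multiplies + subtractions
        x, y, u = x1, y1, u1
    return y
```

For example, three Euler steps from x0 = 0, y0 = 1, u0 = 0 with dx = 0.1 and a = 0.3 give y ≈ 0.9109.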

Table 6 gives the different solutions provided by the ILP technique, the network flow technique and the communication-latency algorithm for the temporal partitioning problem. In each case we evaluate the performance of each algorithm in terms of the cut size across each partition, the whole latency of the application and the run time of the algorithm. The host computer was an Intel Pentium III. Results show that our technique provides a very good compromise between the two principal constraints of the application, the communication cost and the whole latency. For the differential equation task graph, the ILP-FDS algorithm provides a scheduling solution with a whole latency of 2,570 ns and a total cut size of 6. Note that, for these results, we only considered the latency (execution time) of the tasks when computing the whole latency of the design; the time needed for communication between the design partitions was not considered. We believe that this solution is better than the solution given by the network flow algorithm (latency of 5,620 ns, cut size of 4), and better than the solution given by the ILP algorithm (latency of 2,810 ns, cut size of 11).


Table 5  Design results

16-FFT task graph:

                       Graph eigenvectors-based   Network   List         Improvement    Improvement
                       CC_algorithm               flow      scheduling   vs. network    vs. list
                                                                         flow           scheduling
Number of partitions   5                          6         6
T.C cost               392                        412       488          4.85%          19.6%
M.C cost               77                         82        93
Whole latency          4,730 ns + 5*CT ≅ 5*CT                            16%            16%
Run time               0.2 sec                    0.09 sec  0.09 sec

64-FFT task graph:

                       Graph eigenvectors-based   Network   List         Improvement    Improvement
                       CC_algorithm               flow      scheduling   vs. network    vs. list
                                                                         flow           scheduling
Number of partitions   9                          11        11
T.C cost               1,689                      1,863     1,976        9.33%          14.42%
M.C cost               185                        205       229
Whole latency          7,250 ns + 9*CT ≅ 9*CT                            18%            18%
Run time               2 sec                      1.31 sec  1.31 sec

Figure 7 Task graph of solution of differential equation

[Figure: DFG with eleven nodes: multipliers T1, T2, T3, T4, T6 and T7; adders T5 and T8; comparator T9; subtractors T10 and T11; operating on the inputs 3, x, u, dx, y and a.]


Table 6  Design results (task graph of the solution of the differential equation)

                       Communication-latency   Network        ILP
                       algorithm               flow
Area (CLB)             400                     400            400
Number of partitions   3                       3              3
Cut size across P1     2                       1              4
Cut size across P2     3                       2              4
Cut size across P3     1                       1              3
Number of cuts         6                       4              11
Whole latency (ns)     2,965 + 3*CT            5,620 + 3*CT   2,810 + 3*CT

7 Conclusions

In this paper, we have formulated the temporal partitioning problem and introduced a set of algorithms, indicating their advantages and drawbacks. We have devoted a large part of this paper to our contribution in this field, and we have compared our methods with other techniques proposed in the literature.

References

Alpert, C. and Yao, S-Z. (1995) ‘Spectral partitioning: the more eigenvectors, the better’, 32nd Design Automation Conference, pp.195–200.

Biswal, P., Lee, J.R. and Rao, S. (2008) ‘Eigenvalue bounds, spectral partitioning, and metrical deformations via flows’, Proceedings of the 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pp.751–760.

Bobda, C. (2003) ‘Synthesis of dataflow graphs for reconfigurable systems using temporal partitioning and temporal placement’, Thesis, Faculty of Computer Science, Electrical Engineering and Mathematics of the University of Paderborn.

Bobda, C. (2007) Introduction to Reconfigurable Computing Architectures, Algorithms, and Applications, 1st ed., Springer, 9 November, ISBN-10: 1402060882/ ISBN-13: 978-1402060885.

Byungil, J. (1999) ‘Hardware software partitioning for reconfigurable architectures’, MS theses School of Elec. Eng., Seoul National University.

Cardoso, J.M.P. (2003) ‘On combining temporal partitioning and sharing of functional units in compilation for reconfigurable architectures’, IEEE Transactions on Computers, October, Vol. 52, No. 10, pp.1362–1375.

Jiang, Y-C. and Lai, Y-T. (2007) ‘Temporal partitioning data flow graphs for dynamically reconfigurable computing’, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, December, Vol. 15, No. 12, pp.210–218.

Kaul, K. and Vermuri, R. (1999) ‘Integrate block processing and design space exploration in temporal partitioning for RTR architecture’, International Reconfigurable Architecture Workshop, RAW ‘99, Springer Publication, pp.606–615.

Liu, H. and Wong, D.F. (1998) ‘Network flow based circuit partitioning for time-multiplexed FPGAs’, in Proc. IEEE/ACM Int. Conf. Comput. – Aided Des., pp.497–504.

Liu, H. and Wong, D.F. (1998) ‘Network flow based multi-way partitioning with area and pin constraints’, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, January, Vol. 17, No. 1, pp.50–59.

Mohar, B. and Poljak, S. (1990) ‘Eigenvalues and the max-cut problem’, Czechoslovak Mathematical Journal, Vol. 40, No. 2, pp.343–352.

Mtibaa, A., Ouni, B. and Abid, M. (2007) ‘An efficient list scheduling algorithm for time placement problem’, Computers & Electrical Engineering, pp.285–298.

Ouni, B., Ayadi, R. and Abid, M. (2008) ‘Novel temporal partitioning algorithm for run time reconfigured systems’, Journal of Engineering and Applied Sciences, October, Vol. 3, pp.766–773.

Ouni, B., Ayadi, R. and Mtibaa, A. (2011) ‘Partitioning and scheduling technique for run time reconfigured systems’, Int. J. Computer Aided Engineering and Technology, Vol. 3, No. 1, pp.77–91.

Ouni, B., Ayadi, R. and Mtibaa, A. (2011) ‘Temporal partitioning of data flow graph for dynamically reconfigurable architecture’, Journal of Systems Architecture, Vol. 57, No. 8, pp.790–798, Elsevier Publisher, ISSN: 1383-7621.

Ouni, B., Mtibaa, A. and Abid, M. (2004) ‘Synthesis and time partitioning for reconfigurable systems’, Journal Design Automation for Embedded Systems, September, Vol. 9, No. 3, pp.177–191, Springer Publishers.

Ouni, B., Mtibaa, A. and Bourennane, E-B. (2009) ‘Scheduling approach for run time reconfigured systems’, International Journal of Computer Sciences and Engineering Systems, Vol. 4, pp.336–334.

Ramzi, A., Bouraoui, O. and Abdellatif, M. (2012) ‘A partitioning methodology that optimizes the communication cost for reconfigurable computing systems’, International Journal of Automation Computing, Vol. 9, No. 3, pp.280–287.

Trimberger, S. (1998) ‘Scheduling designs into a time-multiplexed FPGA’, in Proc. ACM Int. Symp. Field Program. Gate Arrays, pp.153–160.

Wu, G.M., Lin, J.M. and Chang, Y.W. (2001) ‘Generic ILP-based approaches for time-multiplexed FPGA partitioning’, IEEE Trans. Comput.-Aided Des., October, Vol. 20, No. 10, pp.1266–1274.