
High-Level Synthesis Algorithms

Temporal Partitioning & Scheduling

• Scheduling:
  Inputs:
  − A DFG
  − An architecture (i.e. a set of processing elements)
  Output:
  − The starting time of each node on a given resource

• Temporal partitioning:
  Inputs:
  − A DFG
  − A reconfigurable device
  Outputs:
  − A set of partitions
  − The starting time of each node, which is the starting time of the partition to which it belongs

• Solution approaches:
  − List scheduling
  − Integer linear programming (an exact method)
  − Network flow
  − Spectral method
    * Recursive bi-partitioning approaches

Unconstrained Scheduling

• Unconstrained scheduling:
  − Assumption: an unlimited amount of resources, i.e. a device with unlimited size.
  − Usually used as a pre-processing step for other algorithms, e.g. for computing the upper and lower bounds on the starting times of operations:
    Lower bound: the earliest time at which a module can be scheduled.
    Upper bound: the latest time at which a module can be started.

Unconstrained Scheduling

• ASAP (as soon as possible):
  − Defines the earliest starting time for each node in the DFG.
  − Computes the minimal latency.
• ALAP (as late as possible):
  − Defines the latest starting time for each node in the DFG according to a given latency.
• Mobility of a node: (ALAP starting time) − (ASAP starting time)
  − Mobility = 0 ⇒ the node is on a critical path.

ASAP Example

• Unconstrained scheduling with optimal latency L = 4.

[Figure: the example DFG (multiplications, additions/subtractions, and a comparison) ASAP-scheduled over time steps 0 through 4.]

ASAP Example

• Assumptions: multiplication has a latency of 100 clocks, addition/subtraction take 50 clocks, and the data transmission delay is neglected.

[Figure: the ASAP-scheduled DFG, annotated with the computation delay of each predecessor node and each node's starting time as computed by the algorithm.]

ASAP Algorithm

ASAP(G(V,E), d) {
  FOREACH (v_i without predecessor)
    s(v_i) := 0;
  REPEAT {
    choose a node v_i whose predecessors are all planned;
    s(v_i) := max_{j:(v_j,v_i)∈E} {s(v_j) + d_j};
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}
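The pseudocode maps directly onto executable code. A minimal Python sketch (the dict-based DFG representation and the example node names are assumptions for illustration):

# Minimal ASAP scheduler sketch. The DFG is given as a dict mapping each
# node to its list of predecessors, plus a dict of node delays.
def asap(preds, delay):
    """Return the earliest starting time s(v) for every node."""
    s = {}
    remaining = set(preds)
    while remaining:
        # Pick any node whose predecessors are all planned.
        v = next(v for v in remaining if all(p in s for p in preds[v]))
        # Nodes without predecessors start at time 0.
        s[v] = max((s[p] + delay[p] for p in preds[v]), default=0)
        remaining.remove(v)
    return s

# Tiny example with the latencies from the slides (mult: 100, add/sub: 50):
preds = {"m1": [], "m2": [], "a1": ["m1", "m2"]}
delay = {"m1": 100, "m2": 100, "a1": 50}
print(asap(preds, delay))  # {'m1': 0, 'm2': 0, 'a1': 100}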

ALAP Example

• Unconstrained scheduling with optimal latency L = 4.

[Figure: the same DFG scheduled with ALAP over time steps 0 through 4; compared to ASAP, operations off the critical path are pushed to later time steps.]

Mobility

[Figure: the DFG annotated with each node's mobility (ALAP starting time minus ASAP starting time) over time steps 0 through 4. Nodes on the critical path have mobility 0; the remaining nodes have mobility 1 or 2.]

ALAP Example

• Assumptions: multiplication has a latency of 100 clocks, addition/subtraction take 50 clocks, and the overall computation time (the given latency bound) is 250 clocks.

[Figure: the ALAP-scheduled DFG, annotated with the computation delay of each predecessor node and each node's starting time as computed by the algorithm.]

ALAP Algorithm

ALAP(G(V,E), d, L) {
  FOREACH (v_i without successor)
    s(v_i) := L − d_i;
  REPEAT {
    choose a node v_i whose successors are all planned;
    s(v_i) := min_{j:(v_i,v_j)∈E} {s(v_j)} − d_i;
  } UNTIL (all nodes v_i are planned);
  RETURN s;
}
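A matching Python sketch, reusing the DFG representation (and the asap function) from the ASAP sketch above; mobility then falls out as the difference of the two schedules:

# Minimal ALAP scheduler sketch: succs maps each node to its successors,
# delay gives node latencies, L is the given overall latency bound.
def alap(succs, delay, L):
    """Return the latest starting time s(v) for every node."""
    s = {}
    remaining = set(succs)
    while remaining:
        v = next(v for v in remaining if all(u in s for u in succs[v]))
        if succs[v]:
            s[v] = min(s[u] for u in succs[v]) - delay[v]
        else:
            s[v] = L - delay[v]  # sinks finish exactly at the latency bound
        remaining.remove(v)
    return s

# Mobility = ALAP start - ASAP start; 0 means the node is on a critical path.
def mobility(preds, succs, delay, L):
    early, late = asap(preds, delay), alap(succs, delay, L)
    return {v: late[v] - early[v] for v in preds}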

Constrained Scheduling

• Constrained scheduling: only a fixed set of resources is available (as in an ASIC).
  − Many tasks compete for a given resource: one of them must be chosen according to a given criterion, and the rest are scheduled later.
1. Extended ASAP/ALAP (a simplified sketch follows below):
  − Compute ASAP or ALAP.
  − Assign the tasks earlier (ASAP) or later (ALAP) until the resource constraints (e.g. area) are fulfilled.
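A simplified Python sketch of extended ASAP; the slides do not fix the details, so the unit-delay model and the tie-breaking are assumptions made for illustration:

# Extended-ASAP sketch under a unit-delay assumption: start from the ASAP
# schedule, then push surplus operations (those exceeding the available
# resource count in a step) to later steps, restoring precedence afterwards.
# Assumes navail[rtype] >= 1 for every resource type that occurs.
def extended_asap(preds, ntype, navail, s_asap):
    s = dict(s_asap)
    t = 0
    while t <= max(s.values()):
        for rtype, count in navail.items():
            here = sorted(v for v in s if s[v] == t and ntype[v] == rtype)
            for v in here[count:]:        # surplus ops move to the next step
                s[v] = t + 1
        changed = True
        while changed:                    # re-enforce precedence to a fixpoint
            changed = False
            for v in s:
                for p in preds[v]:
                    if s[v] < s[p] + 1:   # a consumer may not start before
                        s[v] = s[p] + 1   # its producer has finished
                        changed = True
        t += 1
    return s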

Extended ASAP

• Constraint: 2 multipliers, 2 ALUs (+, −, <)

[Figure: the example DFG rescheduled with extended ASAP under this constraint, over time steps 0 through 4.]

Constrained Scheduling

• List scheduling:
  − Sort the nodes in topological order.
  − Assign a priority to each node. Criteria can be (two of them are sketched after this list):
    − number of successors,
    − depth (length of the longest path from the inputs),
    − latency-weighted depth (w: latency of the operation executed by the nodes on the path),
    − mobility,
    − connectivity,
    − ...
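A small Python sketch of two of these criteria, using the dict-based DFG representation assumed in the earlier sketches:

# Two priority criteria: number of successors and latency-weighted depth
# (the longest delay-weighted path from a node to any output).
from functools import lru_cache

def priorities(succs, delay):
    @lru_cache(maxsize=None)
    def depth(v):
        return delay[v] + max((depth(u) for u in succs[v]), default=0)
    return ({v: len(succs[v]) for v in succs},   # number of successors
            {v: depth(v) for v in succs})        # latency-weighted depth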

Constrained Scheduling

• At any time step t:
  − A ready set L is constructed (operations ready to be scheduled).
    − L contains the operations whose predecessors have all been scheduled early enough to complete their execution by time t.
  − Tasks are placed in L in decreasing priority order.
  − At a given step, each free resource is assigned the ready task with the highest priority (see the sketch below).
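A compact Python sketch of this loop, with the number of successors as the priority criterion; the resource-type and delay dictionaries are assumptions carried over from the earlier sketches:

# Constrained list scheduling. Each node has a type (e.g. "mul", "alu"),
# a delay, and predecessors; navail[rtype] = number of resources of that type.
def list_schedule(preds, succs, ntype, delay, navail):
    s, t = {}, 0                                  # start times, current step
    prio = {v: len(succs[v]) for v in preds}      # priority: # of successors
    while len(s) < len(preds):
        done = {v for v in s if s[v] + delay[v] <= t}   # ops finished by t
        ready = sorted((v for v in preds if v not in s
                        and all(p in done for p in preds[v])),
                       key=lambda v: -prio[v])    # decreasing priority
        for rtype, count in navail.items():
            # Resources of this type still busy at time t.
            busy = sum(1 for v in s
                       if ntype[v] == rtype and s[v] <= t < s[v] + delay[v])
            free = max(0, count - busy)
            for v in [v for v in ready if ntype[v] == rtype][:free]:
                s[v] = t                          # schedule highest-priority ops
        t += 1                                    # advance to the next step
    return s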

Constrained Scheduling (Example)

• Criterion: number of successors
• Resources: 1 multiplier, 1 ALU (+, −, <)

[Figure: the example DFG with each node labelled by its priority (number of successors), ranging from 3 down to 0.]

Constrained Scheduling (Example)

[Figure: the resulting list schedule. With only one multiplier and one ALU available, the operations are serialized over time steps 0 through 7.]

List Scheduling: Example

• Resources: 1 multiplier, 1 adder
• Latency: multiplication 100 clocks, addition/subtraction 50 clocks

[Figure: the resulting schedule, with the multiplications serialized on the single multiplier.]

Temporal Partitioning vs. Constrained Scheduling

• In RCS (reconfigurable computing systems):
  − Resource types are not important; the amount of basic resources is.
  − Operators do not compete for resources; they compete for area.
  − Usually, only the starting time and the end time of the complete partition are considered.

Temporal Partitioning in RCS

• Temporal partitioning works like list scheduling.
  − Assignment criterion: there must be enough space left on the device to accommodate the new component.
• Algorithm: list scheduling for reconfigurable devices (a Python sketch follows below):

  sort the nodes of V according to their priorities
  P_0 := Ø
  while V ≠ Ø do
    select a vertex v ∈ V with highest priority and whose predecessors are all placed
    if (a partition P_i exists with s(P_i) + s(v) ≤ s(H)) then
      P_i := P_i ∪ {v}
    else
      create a new partition P_{i+1} and set P_{i+1} := {v}
    end if
  end while
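A minimal Python sketch of this algorithm; as one reading of the pseudocode, it places each node in the most recent partition when it fits, which keeps predecessors in the same or an earlier partition (s_device corresponds to s(H); the priority dict is assumed precomputed, e.g. as the number of successors):

# List-scheduling-based temporal partitioning sketch. preds maps nodes to
# predecessors, size gives each node's area, prio its priority.
def temporal_partition(preds, size, prio, s_device):
    partitions, used, placed = [], 0, set()   # used: area of current partition
    pending = set(preds)
    while pending:
        # Highest-priority node whose predecessors are all placed.
        v = max((u for u in pending if all(p in placed for p in preds[u])),
                key=lambda u: prio[u])
        if partitions and used + size[v] <= s_device:
            partitions[-1].append(v)          # fits in the current partition
            used += size[v]
        else:
            partitions.append([v])            # open a new partition
            used = size[v]
        placed.add(v); pending.remove(v)
    return partitions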

Temporal Partitioning vs. Constrained Scheduling

• Criterion: number of successors
• size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10
• Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6
• Quality: 0.28 (checked in the sketch below)

[Figure: the DFG, with nodes labelled by priority, divided into three partitions P1, P2, P3 by the list-scheduling-based algorithm.]
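The connectivity figures above match the usual definition c(P) = 2·|E_P| / (|V_P|·(|V_P| − 1)), with the quality of a partitioning taken as the average connectivity of its partitions. A quick check in Python; the per-partition node and edge counts are inferred from the fractions shown, not stated on the slide:

# Connectivity of a partition P and quality of a partitioning.
def connectivity(n_nodes, n_edges):
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# This slide (inferred): P1 = 4 nodes/1 edge, P2 = P3 = 3 nodes/1 edge.
cs = [connectivity(4, 1), connectivity(3, 1), connectivity(3, 1)]
print(round(sum(cs) / len(cs), 2))   # 0.28

# Improved partitioning (later slide): P1 = 5 nodes/2 edges, P2 = P3 = 3 nodes/2 edges.
cs = [connectivity(5, 2), connectivity(3, 2), connectivity(3, 2)]
print(round(sum(cs) / len(cs), 2))   # 0.51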

Improvement

• Best criterion: the total computation time of the DFG (a worked instance follows below):

  t_DFG = n × C_H + Σ_{i=1..n} t_Pi

  − C_H: reconfiguration time of device H
  − t_Pi: computation time of partition P_i
  − n: number of partitions
• Optimization:
  − If C_H is too large, the optimization will tend to minimize the number of partitions.
  − If C_H « t_P, the algorithm will tend to avoid long paths within partitions.
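A quick worked instance with hypothetical numbers shows the trade-off: for n = 3 partitions, C_H = 20 ms, and partition computation times of 2 ms, 3 ms and 2 ms, t_DFG = 3 × 20 + (2 + 3 + 2) = 67 ms. Reconfiguration dominates, so reducing n (merging partitions) saves far more than shortening the paths inside the partitions.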

Improvement

• Advantages of LS-based temporal partitioning:
  − Fast (a linear-time algorithm)
  − Local optimization is possible (e.g. configuration switching)
• Disadvantage: levelization.
  − Modules are assigned to partitions based on their level number rather than on their interconnectivity with other components.
  − Interconnectivity (data exchange) must be optimized.

[Figure: a levelized DFG with its operations arranged in levels 0 through 3.]

LS-Based Temporal Partitioning

• Criterion: number of successors
• size(FPGA) = 250, size(mult) = 100, size(add) = size(sub) = 20, size(comp) = 10
• Connectivity: c(P1) = 1/6, c(P2) = 1/3, c(P3) = 2/6
• Quality: 0.28

[Figure: the same three-partition result as before, shown as the baseline for the improvement on the next slide.]

Improved Temporal Partitioning

• Connectivity: c(P1) = 2/10, c(P2) = 2/3, c(P3) = 2/3
• Quality: 0.51 — better than the 0.28 of the plain LS-based partitioning.

[Figure: the DFG, with nodes labelled by priority, repartitioned into P1, P2, P3 with higher intra-partition connectivity.]

Improved List Scheduling

• Pairwise interchange: nodes are exchanged between partitions when the swap improves interconnectivity without violating the constraints (sketched below).
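A minimal sketch of such an interchange pass, assuming a helper quality(parts) (e.g. average connectivity, as above) and given module sizes; precedence checking is omitted for brevity, so this is an illustration of the local-search idea rather than a complete implementation:

# Pairwise-interchange sketch: try swapping nodes between two partitions and
# keep a swap when it improves quality while both partitions still fit.
def pairwise_interchange(parts, size, s_device, quality):
    def swapped(a, b, u, v):
        trial = [p[:] for p in parts]
        trial[a][trial[a].index(u)] = v
        trial[b][trial[b].index(v)] = u
        return trial
    improved = True
    while improved:
        improved = False
        for a in range(len(parts)):
            for b in range(a + 1, len(parts)):
                for u, v in [(u, v) for u in parts[a] for v in parts[b]]:
                    trial = swapped(a, b, u, v)
                    if (sum(size[x] for x in trial[a]) <= s_device
                            and sum(size[x] for x in trial[b]) <= s_device
                            and quality(trial) > quality(parts)):
                        parts, improved = trial, True
                        break   # restart the scan from the new partitioning
                if improved:
                    break
            if improved:
                break
    return parts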

Temporal Partitioning – ILP

• With ILP (integer linear programming), the temporal partitioning constraints are formulated as equations, which are then solved by an ILP solver.
• The constraints usually considered are:
  − Uniqueness constraint
  − Temporal order constraint
  − Memory constraint
  − Resource constraint
  − Latency constraint
• Notation:
  − y_vi = 1 if v ∈ P_i, and 0 otherwise
  − w_uv = 1 if u ∈ P_i, v ∈ P_j and P_i ≠ P_j (i.e. the edge (u, v) crosses partitions), and 0 otherwise

Temporal Partitioning – ILP

• Unique assignment constraint: each task must be placed in exactly one partition (m = number of partitions):

  ∀v ∈ V:  Σ_{i=1..m} y_vi = 1

• Precedence constraint: for each edge e = (u, v) in the graph, u must be placed either in the same partition as v or in an earlier partition than that in which v is placed:

  ∀(u, v) ∈ E:  Σ_{i=1..m} i · y_ui ≤ Σ_{i=1..m} i · y_vi

Temporal Partitioning – ILP

• Resource constraint: the sum of the resources needed to implement the modules in one partition must not exceed the total amount of available resources (a PuLP sketch of these constraints follows below).
  − Device area constraint (s = area):

    ∀P_i:  Σ_{v ∈ V} y_vi · s(v) ≤ s(device)

  − Device terminal constraint (T = size of the communication memory):

    ∀P_i:  Σ_{(u,v): u ∈ P_i, v ∉ P_i} w_uv + Σ_{(u,v): u ∉ P_i, v ∈ P_i} w_uv ≤ T(device)
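A sketch of the uniqueness, precedence and area constraints in Python using the PuLP modelling library; the tiny DFG, the module sizes, m = 3 and the objective are illustrative assumptions, not the example of the following slides:

# Temporal partitioning constraints expressed with PuLP.
import pulp

nodes = ["v1", "v2", "v3"]
edges = [("v1", "v3"), ("v2", "v3")]          # (u, v): u must precede v
size = {"v1": 100, "v2": 100, "v3": 50}       # area of each module
s_device, m = 200, 3                          # device area, partition count

prob = pulp.LpProblem("temporal_partitioning", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", (nodes, range(1, m + 1)), cat="Binary")

for v in nodes:       # uniqueness: each task in exactly one partition
    prob += pulp.lpSum(y[v][i] for i in range(1, m + 1)) == 1
for u, v in edges:    # precedence: u in the same or an earlier partition
    prob += (pulp.lpSum(i * y[u][i] for i in range(1, m + 1))
             <= pulp.lpSum(i * y[v][i] for i in range(1, m + 1)))
for i in range(1, m + 1):   # area: each partition must fit on the device
    prob += pulp.lpSum(size[v] * y[v][i] for v in nodes) <= s_device

# One possible objective: minimize the index of the last used partition.
last = pulp.LpVariable("last", 1, m)
for v in nodes:
    prob += last >= pulp.lpSum(i * y[v][i] for i in range(1, m + 1))
prob += last          # set "last" as the objective
prob.solve()
print({v: next(i for i in range(1, m + 1) if y[v][i].value() == 1)
       for v in nodes})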

Temporal Partitioning by ILP: Example

• Assignment constraints:
  y11 + y12 + y13 = 1
  y21 + y22 + y23 = 1
  …
  y71 + y72 + y73 = 1
• Resulting assignment:
  − Partition P1: y22 = y23 = 0, y21 = 1; y32 = y33 = 0, y31 = 1; y42 = y43 = 0, y41 = 1
  − Partition P2: y11 = y13 = 0, y12 = 1; y51 = y53 = 0, y52 = 1; y61 = y63 = 0, y62 = 1
  − Partition P3: y71 = y72 = 0, y73 = 1

Temporal Partitioning by ILP: Example

• Precedence constraints: one inequality per DFG edge (u, v), of the form

  Σ_{i=1..3} i · y_ui ≤ Σ_{i=1..3} i · y_vi

Temporal Partitioning by ILP: Example

• Resource constraint: a device with a size of 200 LUTs; 100 LUTs for the multiplication and 50 LUTs each for the addition and the comparison. For each partition P_i:

  Σ_{u ∈ V} y_ui · s(u) ≤ 200

Temporal Partitioning by ILP: Example

• Communication memory constraint: assume a memory of 50 bytes is available for communication and each datum is 32 bits (4 bytes) wide; at most ⌊50/4⌋ = 12 data items can therefore cross a partition boundary at a time.

Multi-Context FPGAs

• Reconfiguration time:
  − Can be high compared to the computation time.
  − Inside a loop, it leads to too many reconfigurations and a high total computation time.
• Solutions [Trimberger97]:
  − Multi-context FPGAs
  − Partial reconfiguration
  − Pipeline reconfiguration

Multi-Context FPGAs

• Advantages:
  − Fast switching between stored configurations (some devices switch in a single clock cycle), dramatically reducing the reconfiguration overhead when the next configuration is already present in one of the alternate contexts.
  − Background loading of configuration data during circuit operation, overlapping computation with reconfiguration.

Multi-Context FPGAs

[Figure: multi-context FPGA organization; see p. 99 of [Hauck08].]

Multi-Context FPGAs

• Problems of multi-context devices:
  − The stored contexts consume valuable area that could otherwise be used for logic.
  − Either all needed contexts must fit in the available hardware, or some control logic must determine when contexts should be loaded from external memory.
  − The additional configuration data and the required multiplexing occupy valuable area that could otherwise be used for logic or routing.
  − Never commercialized? [Bobda07] An eight-context DRFPGA was fabricated by NEC [Fujii99].

Partial Reconfiguration

• Partial reconfiguration: only some part of the device is reconfigured.
  − Can decrease the reconfiguration time, especially if only a small part needs to change (e.g. in a cryptography system where only the key is exchanged).
  − Allows multiple independent configurations to be swapped in and out independently.

Partial Reconfiguration

• Devices:
  − Xilinx 6200 family (1997): each logic block could be programmed individually.
  − Atmel AT40K (1999)
  − Xilinx Virtex FPGA family: reconfigures logic blocks in groups called frames.
    − Virtex II (2004): frame = a full column
    − Virtex 5 (2006): frame = a partial column (41 32-bit words)

Virtex Devices

• Partial reconfiguration in Virtex devices works on frames:
  − A frame is the smallest unit of reconfiguration.
  − Virtex, Virtex II, Virtex II Pro: a frame is a whole column.
  − Virtex 4, Virtex 5, Virtex 6: a frame is only a complete tile.
  − Frame width and height differ between devices.

[Figure: two tasks (TASK 1, TASK 2) placed on a CLB array, communicating through a logical shared memory. [Banerjee07]]

Partial Reconfiguration

• Problems:
  − If configurations occupy large areas, the time spent transmitting configuration addresses may exceed the time saved on transmitting configuration data; serially loading a full configuration is then better.
  − If the full configuration sequence is not known at compile time, configurations may overlap.
    − Solution: de-fragmentation.

Pipeline Reconfiguration

• Pipeline reconfiguration uses a series of physical pipeline stages.
  − The number of virtual stages is generally not constrained by the number of physical stages.
  − Example: PipeRench (2000).

[Figure: PipeRench reconfiguration schedule; the numbers in the boxes are pipeline stages, and the shaded boxes mark the stage being reconfigured in the given cycle.]

Pipeline Reconfiguration

• Problem: data can only propagate forward through the pipeline stages.
  − Any feedback connection must be completely contained within a single stage.

References

[Bobda07] C. Bobda, "Introduction to Reconfigurable Computing: Architectures, Algorithms and Applications," Springer, 2007.

[Hauck08] S. Hauck and A. DeHon, "Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation," Morgan Kaufmann, 2007.

[Fujii99] T. Fujii et al., "A dynamically reconfigurable logic engine with a multi-context/multi-mode unified-cell architecture," in Proc. IEEE Int. Solid-State Circuits Conf., 1999, pp. 364–365.

[Mehdipour06] F. Mehdipour, M. Saheb Zamani, and M. Sedighi, "An integrated temporal partitioning and physical design framework for static compilation of reconfigurable computing systems," Microprocessors and Microsystems, Elsevier, vol. 30, 2006, pp. 52–62.