cosynthesis algorithms partitioning
-
8/13/2019 CoSynthesis Algorithms Partitioning
Winter-Spring 2001 Codesign of Embedded Systems 2
Topics
  Introduction
  Preliminaries
  Hardware/Software Partitioning
  Distributed System Co-Synthesis
-
Topics
  Introduction
  A Classification
  Examples
    Vulcan
    Cosyma
-
Introduction to HW/SW Partitioning
  The first variety of co-synthesis applications
  Definition
    A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
  Usually
    Multiprocessor architecture = one CPU + some ASICs on the CPU bus
-
Introduction to HW/SW Partitioning (contd)
  Terminology
    Allocation
      Synthesis methods that design the multiprocessor topology along with the PEs and the SW architecture
    Scheduling
      The process of assigning PE (CPU and/or ASIC) time to processes so that they get executed
-
Introduction to HW/SW Partitioning (contd)
  In most partitioning algorithms
    The type of CPU is fixed and given
    ASICs must be synthesized
      What function to implement on each ASIC?
      What characteristics should the implementation have?
    The problems solved are single-rate synthesis problems
    A CDFG is the starting model
-
HW/SW Partitioning (contd)
  Normal use of architectural components
    CPU performs the less computationally intensive functions
    ASICs are used to accelerate core functions
  Where to use?
    High-performance applications
      No CPU is fast enough for the operations
    Low-cost applications
      ASIC accelerators allow use of a much smaller, cheaper CPU
-
A Classification
  Criterion: Optimization Strategy
    Trade-off between performance and cost
  Primal Approach
    Performance is the primary goal
    First, all functionality in ASICs. Progressively move more to the CPU to reduce cost.
  Dual Approach
    Cost is the primary goal
    First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.
-
A Classification (contd)
  Classification by optimization strategy (contd)
    Example co-synthesis systems
      Vulcan (Stanford): Primal strategy
      Cosyma (Braunschweig, Germany): Dual strategy
-
Co-Synthesis Algorithms: HW/SW Partitioning
  HW/SW Partitioning Examples: Vulcan
-
Partitioning Examples: Vulcan
  Gupta, De Micheli, Stanford University
  Primal approach
    1. All-HW initial implementation.
    2. Iteratively move functionality to the CPU to reduce cost.
  System specification language
    HardwareC
    Is compiled into a flow graph
-
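Partitioning Examples: Vulcan (contd)
  [Figure: HardwareC fragments and the flow graphs they compile into]
    HardwareC: x=a; y=b;
      Flow graph: a nop node with two successor nodes, x=a and y=b
    HardwareC: if (c>d) x=e; else y=f;
      Flow graph: a cond node with two branch nodes, x=e (taken when c>d) and y=f (taken otherwise)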
-
Partitioning Examples: Vulcan (contd)
  Flow Graph
    is executed repeatedly at some rate
    can have initiation-time constraints for each node
      $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
    can have rate constraints on each node
      $m_i \le R_i \le M_i$
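The two kinds of flow-graph constraints can be checked mechanically. The following is a minimal sketch, assuming the initiation-time constraint bounds the start of node v_j between t(v_i) + l_ij and t(v_i) + u_ij, and the rate constraint bounds the rate R_i of a node between m_i and M_i. The function names and the dictionary layout are hypothetical and chosen only for illustration; they are not part of Vulcan.

```python
def initiation_ok(t, min_max):
    """Check min/max initiation-time constraints.

    t maps a node name to its initiation time.
    min_max maps a pair (i, j) to the bounds (l_ij, u_ij).
    """
    return all(t[i] + l <= t[j] <= t[i] + u for (i, j), (l, u) in min_max.items())


def rate_ok(rates, bounds):
    """Check rate constraints.

    rates maps a node name to its execution rate R_i.
    bounds maps a node name to the pair (m_i, M_i).
    """
    return all(bounds[v][0] <= r <= bounds[v][1] for v, r in rates.items())


# v2 must start between 2 and 5 time units after v1.
times = {"v1": 0, "v2": 3}
constraints = {("v1", "v2"): (2, 5)}
print(initiation_ok(times, constraints))      # True, since 0 + 2 <= 3 <= 0 + 5
print(rate_ok({"v1": 10}, {"v1": (5, 20)}))   # True, since 5 <= 10 <= 20
```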
-
Partitioning Examples: Vulcan (contd)
  Vulcan Co-synthesis Algorithm
    Partitioning quantum is a thread
      Algorithm divides the flow graph into threads and allocates them
      Thread boundary is determined by
        1. (always) a non-deterministic delay element, such as a wait for an external variable
        2. (on choice) other points of the flow graph
    Target architecture
      CPU + Co-processor (multiple ASICs)
-
Partitioning Examples: Vulcan (contd)
  Vulcan Co-synthesis algorithm (contd)
    Allocation
      Primal approach
    Scheduling
      is done by a scheduler on the target CPU
      the scheduler is generated as part of the synthesis process
      schedules all threads (both HW and SW threads)
      cannot be static, due to some threads' non-deterministic initiation times
-
Partitioning Examples: Vulcan (contd)
  Vulcan Co-synthesis algorithm (contd)
    Cost estimation
      SW implementation
        Code size
          relatively straightforward
        Data size
          Biggest challenge.
          Vulcan puts some effort into finding bounds for each thread
      HW implementation
        ?
-
Partitioning Examples: Vulcan (contd)
  Vulcan Co-synthesis algorithm (contd)
    Performance estimation
      Both SW and HW implementations
        From the flow graph and basic execution times for the operators
-
Partitioning Examples: Vulcan (contd)
  Algorithm Details
    Partitioning goal
      Allocate each thread to one of two partitions
        CPU set: $F_S$
        Co-processor set: $F_H$
      Required execution rate must be met, and total cost minimized
-
Partitioning Examples: Vulcan (contd)
  Algorithm Details (contd)
    Algorithm steps
      1. Put all threads in the $F_H$ set
      2. Iteratively do
        2.1. Move some operations to $F_S$.
          2.1.1. Select a group of operations to move to $F_S$.
          2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph given the new thread times
          2.1.3. Do the move, if feasible
        2.2. Incrementally update the cost function to reflect the new partition
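The steps above can be sketched as a greedy loop. This is a minimal illustration of the primal strategy under simplifying assumptions, not Vulcan's implementation: the worst-case delay is modeled as the plain sum of thread times (as if threads ran one after another), the cost is just the remaining hardware area and is computed once at the end rather than incrementally, and threads are tried in order of decreasing hardware cost. The function name, the parameter names, and the numbers are all hypothetical.

```python
def primal_partition(threads, hw_time, sw_time, hw_cost, deadline):
    """Start all-HW, then move threads to SW while the deadline still holds."""
    in_hw = set(threads)   # step 1: every thread starts in the co-processor set
    in_sw = set()
    # Hypothetical ordering: try the most expensive hardware threads first.
    for th in sorted(threads, key=lambda x: hw_cost[x], reverse=True):
        trial_sw = in_sw | {th}   # step 2.1.1: candidate move
        # Step 2.1.2: toy worst-case delay, assuming fully serialized threads.
        delay = sum(sw_time[x] if x in trial_sw else hw_time[x] for x in threads)
        if delay <= deadline:     # step 2.1.3: commit only if still feasible
            in_sw = trial_sw
            in_hw = in_hw - {th}  # no backtracking: th never returns to HW
    cost = sum(hw_cost[x] for x in in_hw)  # toy cost: remaining HW area
    return in_hw, in_sw, cost


threads = ["a", "b", "c"]
hw_time = {"a": 1, "b": 2, "c": 1}
sw_time = {"a": 4, "b": 10, "c": 3}
hw_cost = {"a": 5, "b": 8, "c": 3}
in_hw, in_sw, cost = primal_partition(threads, hw_time, sw_time, hw_cost, deadline=10)
print(sorted(in_hw), sorted(in_sw), cost)   # ['b'] ['a', 'c'] 8
```

In this run, moving thread b to software would push the delay to 12, so it stays in hardware, while a and c can both move without missing the deadline of 10.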
-
Partitioning Examples: Vulcan (contd)
  Algorithm Details (contd)
    Vulcan cost function
      $f(w) = c_1 S_h(F_H) - c_2 S_s(F_S) + c_3 B - c_4 P + c_5 |m|$
      $c_i$: weight constants
      $S_h()$, $S_s()$: size functions
      B: Bus utilization (
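To see how the weights trade the terms off against each other, the weighted sum c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m| can be evaluated directly. This is a minimal sketch: the surviving text does not define P or |m|, so they are treated here as opaque numeric inputs, and the function name, parameter names, default weights, and sample values are assumptions made for illustration.

```python
def vulcan_cost(size_hw, size_sw, bus_util, p, m_count,
                weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Evaluate c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m|.

    size_hw and size_sw stand for the size functions already applied to
    the two partitions; p and m_count are passed through as plain numbers
    because their definitions are not given in the surviving text.
    """
    c1, c2, c3, c4, c5 = weights
    return c1 * size_hw - c2 * size_sw + c3 * bus_util - c4 * p + c5 * m_count


# With unit weights: 100 - 20 + 0.5 - 0.25 + 4
print(vulcan_cost(100, 20, 0.5, 0.25, 4))   # 84.25
```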
-
Partitioning Examples: Vulcan (contd)
  Algorithm Details (contd)
    Complementary notes
      A heuristic to minimize communication
        Once a thread is moved to $F_S$, its immediate successors are placed in the list for evaluation in the next iteration.
      No backtracking
        Once a thread is assigned to $F_S$, it remains there
      Experimental results
        The designs produced are considerably faster than all-SW implementations, but much cheaper than all-HW designs
-
Co-Synthesis Algorithms: HW/SW Partitioning
  HW/SW Partitioning Examples: Cosyma
-
Partitioning Examples: Cosyma
  Rolf Ernst, et al.: Technical University of Braunschweig, Germany
  Dual approach
    1. All-SW initial implementation.
    2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective.
  System specification language
    Cx
    Is compiled into an ESG (Extended Syntax Graph)
    The ESG is much like a CDFG
8/13/2019 CoSynthesis Algorithms Partitioning
25/29
Winter-Spring 2001 Codesign of Embedded Systems 25
Partitioning Examples:
Partitioning Examples: Cosyma (contd)
  Cosyma Co-synthesis Algorithm
    Partitioning quantum is a Basic Block
      A Basic Block is a branch-free block of program
    Target Architecture
      CPU + accelerator ASIC(s)
    Scheduling
    Allocation
    Cost Estimation
    Performance Estimation
    Algorithm Details
-
Partitioning Examples: Cosyma (contd)
  Cosyma Co-synthesis Algorithm (contd)
    Performance Estimation
      SW implementation
        Done by examining the object code for the basic block generated by a compiler
      HW implementation
        Assumes one operator per clock cycle.
        Creates a list schedule for the DFG of the basic block.
        Depth of the list gives the number of clock cycles required.
      Communication
        Done by data-flow analysis of the adjacent basic blocks.
        In Shared-Memory
          Proportional to number of variables to be accessed
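The hardware and communication estimates can be illustrated with a small sketch. It assumes every operator takes one clock cycle and that there is no limit on the number of functional units, so the schedule depth equals the longest dependency chain in the DFG; a real list scheduler would also respect resource limits. The function names, the DFG encoding, and the per-access cost are hypothetical.

```python
def hw_cycles(dfg):
    """Estimate clock cycles for a basic block from its data-flow graph.

    dfg maps each operator to the list of operators whose results it uses.
    Each operator is placed one cycle after its latest predecessor.
    """
    level = {}

    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(p) for p in dfg[op]), default=0)
        return level[op]

    return max((depth(op) for op in dfg), default=0)


def comm_cost(shared_vars, cost_per_access=1):
    """Shared-memory communication cost, proportional to the variable count."""
    return cost_per_access * len(shared_vars)


# (a*b + c*d) - e: two multiplies in parallel, then an add, then a subtract.
dfg = {"mul1": [], "mul2": [], "add": ["mul1", "mul2"], "sub": ["add"]}
print(hw_cycles(dfg))          # 3
print(comm_cost({"x", "y"}))   # 2
```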
-
Partitioning Examples: Cosyma (contd)
  Experimental Results
    By moving only basic blocks to HW
      Typical speedup of only 2x
    Reason:
      Limited intra-basic-block parallelism
    Cure:
      Implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
      Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
    Result:
      Speedups: 2.7 to 9.7
      CPU times: 35 to 304 seconds on a typical workstation
-
What we learned today
  HW/SW Partitioning: One broad category of co-synthesis algorithms
  Criteria by which a co-synthesis algorithm is categorized