where is it used? accelerators - university of torontojzhu/courses/behsyn10/hls-handout.pdf · a...
TRANSCRIPT
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 1
Introduction to Introduction to CC--Based High Level SynthesisBased High Level Synthesis
JianwenJianwen ZhuZhuElectrical and Computer EngineeringElectrical and Computer Engineering
University of TorontoUniversity of Toronto
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 2
Where is it Used? Accelerators
ARM11 DSP
2D/3D GraphicsAccelerator
Imaging VideoAccelerator
Shared Memory Controller/DMA
Timers, Interrupt Controller, RTC
Internal SRAM, Boot/Secure ROM
Security
OMAP2420
Pow
er
USBOTG
PeripheralsPeripherals Pe
riphe
rals
ARM11 DSP
2D/3D GraphicsAccelerator
Imaging VideoAccelerator
Shared Memory Controller/DMA
Timers, Interrupt Controller, RTC
Internal SRAM, Boot/Secure ROM
Security
OMAP2420
Pow
er
USBOTG
PeripheralsPeripherals Pe
riphe
rals
ARM11 DSP
2D/3D GraphicsAccelerator
Imaging VideoAccelerator
Shared Memory Controller/DMA
Timers, Interrupt Controller, RTC
Internal SRAM, Boot/Secure ROM
Security
OMAP2420
Pow
er
USBOTG
PeripheralsPeripherals Pe
riphe
rals
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 3
How many of them: a lot!
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 4
When is it used? Design Flow
algorithm selectionoptimization
algorithm selectionoptimization
functional modelHW/SW partitioning
behavior mappingarchitecture exploration
Protocol generationTopology synthesis
HW/SW partitioningbehavior mapping
architecture explorationProtocol generationTopology synthesis architecture model
INPUT OUTPUT
RTL model
application requirements
High level synthesisHigh level synthesis
CPU
CPU S
physical implementation
M
IP
IP M
S
CPU
CPU S
M
IP
IP M
S
RTL-to-GDSIIRTL-to-GDSII
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 5
Why C-based
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 6
AgendaAgendaHLS: What is it?
Decision flowDesign representationsQuality metrics
HLS: AlgorithmsProblem formulationsKey algorithms
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 7
What is HLS?What is HLS?
A process of converting behavioural level design to
register transfer level
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 8
HLS Input: HLS Input: BehavioralBehavioral level designlevel designImperative program:
Statement and expression operating on program variables
CharacteristicsUntimedCaptures the function of the designNo hardware implementation detail
int A[100], B[100];int sum, i;
sum = i = 0;while( i < 100 ) {
sum = sum + A[i] * B[i];i = i + 1;}
Dot product algorithm
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 9
HLS Output: RTL designHLS Output: RTL designFinite state machine with datapath(FSMD)
Datapath performs computations (what)FSM generates control flow (when)
pstate
Next State L
ogic
`status0
sel6
sel4
sel2
sel0
we2
we0
func0
sel5
sel3
sel1
we3
we1
ControlL
ogic
Controller Datapath
R3R2R1R0
ADD MUL
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 10
Why HLS?
Automate the task of RTL designSimilar to the transition from assembly programming to high level programming in software
Guarantee the correctness of generated HDLCorrect by construction
Value of productivity gainSignificant reduction of development and verification costSignificant reduction of time-to-market
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 11
BackendBackend
C/C++Code
Verilog/VHDLCode
FrontendFrontend
OptimizerPartial redundancy
elimination,Strength reduction,
Dead code elimination,Constant propagation,
Code motion,Tree height reduction,
Algebraic optimization,Loop unrolling,
Software pipelining•…
OptimizerPartial redundancy
elimination,Strength reduction,
Dead code elimination,Constant propagation,
Code motion,Tree height reduction,
Algebraic optimization,Loop unrolling,
Software pipelining•…
IntermediateRepresentation
IntermediateRepresentation
Code GeneratorCode Generator
RTLRepresentation
RTLRepresentation
AllocationAllocation
SchedulingScheduling
BindingBinding
HLS Design FlowHLS Design Flow
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 12
Frontend and OptimizerFrontend and OptimizerFrontend: Textual program → Intermediate Representation (IR)
Perform lexical and syntactical analysisOptimizer: IR → IR
Common sub-expression eliminationDead code eliminationTree height reductionStrength reduction...
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 13
BackendBackendHLS backend : IR → RTL
Allocation – Determine the hardware resources requirementScheduling– Map each operation into a control step– Exploit the parallelism in sequential code– Maintain data dependenciesBinding – Map operations to function units– Map values to storage resources– Exploit hardware sharing (minimize area)
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 14
Code GeneratorCode GeneratorRTL → HDL
Translate key decisions made by backend to HDL acceptable by logic synthesis toolsDerive datapath from allocation and bindingDerive FSM from scheduling and binding
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 15
Design RepresentationsDesign RepresentationsDesign representations
Capture designs with varying detailsSimplified representations for teaching
TinyC: simplified C languageTinyIR: simplified IRTinyRTL: simplified RTL representation
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 16
TinyCTinyCProgram state defined by a set of variablesActions defined by statementsSimplifying assumptions
All statements are in a single procedure, and there are no procedure callsOnly support int and bool primitive typesOnly support one-dimensional arrayNo pointers are supported
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 17
TinyCTinyC SyntaxSyntaxDefined by a set of production rules in Backus-Naur Form (BNF)Key constructs
Declarations: define scalars or array variablesStatements: assignment or control flow statementsExpressions: transformations of scalar, defined primitive or program variable values
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 18
program:declaration* statement*
statement:variable '=' expression ';',
'if' '(' expression ') ' statement( 'else' statement )* ,
'while' ' (' expression ') ' statement ,'break' '; ','{' declaration* statement* '}'
declaration:type identifier ['=' expression] '; ',type identifier '[' expression ']' '; '
type:'int' , 'bool'
expression:'-' expression , '!' expression ,expression '+' expression ,expression '-' expression ,expression '*' expression ,expression '/' expression ,expression '^' expression ,expression '>>' expression ,expression '
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 19
NotationsNotations
A data type T corresponds to a set T, in particular the integer type Z corresponds to set ZA linked list or arrays whose elements are of type T corresponds to the power set of T, or the set of all subsets of T, denoted as T[]A record with fields a of type A, and b of type B corresponds to a set of named tuples, denoted as A graph R whose nodes are of type A corresponds to a relation R: A x AA hash table or dictionary F that maps a value of type A to a value of type B corresponds to a function F: A |--> B
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 20
TinyIRTinyIR
Why IR?Decouple optimization algorithms from input languages and target architectures
Definition: A TinyIR is a tuple with the following elements:A set O = {lds, sts, lda, sta, ba, br, cnst, +, -, *, /, , …} of operation codes, which corresponds to the set of all virtual instruction types.A set S of symbols, which corresponds to the scalar and array variablesA set V: of virtual instructions, which corresponds to the expressions and control transfers in the program.A set B: V[] of basic blocks, each containing a sequence of virtual instructions
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 21
From From TinyCTinyC to to TinyIRTinyIRConstructs in TinyC have equivalent representation in TinyIR
Declarations correspond to symbolsStatements and expressions correspond to virtual instructionsVirtual instructions are grouped within different basic blocks
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 22
Dot product in TinyIR
scalar sum;scalar i;array A[100];array B[100];
B1:(0) cnst 0(1) sts (0), sum(2) sts (0), i
B2:(3) lds i
(4) lda (3), A(5) lda (3), B(6) * (4) (5)(7) lds sum(8) + (6) (7)(9) sts (8), sum(10) cnst 1(11) + (3) (10)(12) sts (11), i(13) cnst 100(14) < (11) (13)(15) bt (14), B2
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 23
TinyRTLTinyRTLCapture key micro architectural information
Computational resourcesStorage resources
Instruction in TinyRTL is referred to as register transfer and is annotated with micro architecture resource useRegister transfer is cycle accurate
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 24
TinyRTLTinyRTL DefintionDefintionDefinition: A TinyRTL is a tuple with the following
elements:A set M of memories used to store scalar and array variablesA set R of registers used to store scalar variables or temporary instruction resultsA set U of functional units, such as adders, subtractors, multipliers, shifters, etc.A set I: of register transfers, each of which uses a functional unit to transform values, which are either constants or retrieved from registers, and stores the result back to a register.A set C: I[] of control steps, each of which contains a set of register transfer
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 25
Dot Product in Dot Product in TinyRTLTinyRTL5 control step: C0-C4Each step contains one or more register transfers
register R0, R1, R2, R3;memory M;unit U0, U1;
C0: sts 0, R0; sts 0, R1;C1: M.lda R2, R1, A;C2: M.lda R3, R1, B; U0.+ R1, R1, 1;C3: U1.* R2, R2, R3; U0.< R3, R1, 100;C4: U0.+ R0, R0, R2; bt C1, R3;
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 26
FSMDFSMD
Structured hardware descriptionsDescribed in
SchematicsVerilogHDL
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 27
Dot Product in FSMD: ControllerDot Product in FSMD: Controller
C0 C1 C2 C3 C4
T
F
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 28
Dot Product in FSMD: ControllerDot Product in FSMD: Controller // control signals always@( pstate ) begin we0 = 1'b0; we1 = 1'b0; we2 = 1'b0; we3 = 1'b0; sel0 = 1'bx; sel1 = 1'bx; sel2 = 1'bx; sel3 = 1'bx; sel4 = 1'bx; sel5 = 1'bx; sel6 = 2'bxx; func0 = 1'bx; case( pstate ) `C0:begin we0 = 1'b1; we1 = 1'b1; sel0 = 1'b0; sel1 = 1'b0; end `C1:begin we2 = 1'b1; sel2 = 1'b0; sel4 = 1'b0; end `C2:begin we1 = 1'b1; we3 = 1'b1; sel1 = 1'b1; sel3 = 1'b0; sel4 = 1'b1; sel5 = 1'b0; sel6 = 2'b00; func0 = 1'b0; end `C3:begin we2 = 1'b1; we3 = 1'b1; sel2 = 1'b1; sel3 = 1'b1; sel5 = 1'b0; sel6 = 2'b01; func0 = 1'b1; end `C4:begin we0 = 1'b1; sel0 = 1'b1; sel5 = 1'b1; sel6 = 2'b10; func0 = 1'b0; end endcase end
endmodule
`define C0 3'b000`define C1 3'b001`define C2 3'b010`define C3 3'b011`define C4 3'b100
module ctrl( clk, rst, status0, we0, we1, we2, we3, sel0, sel1, sel2, sel3, sel4, sel5, sel6, func0 );
input clk, rst;input status0;output we0, we1, we2, we3;output sel0, sel1, sel2, sel3, sel4, sel5;output [1:0] sel6;output func0;
reg [2:0] pstate, nstate;reg we0, we1, we2, we3;reg sel0, sel1, sel2, sel3, sel4, sel5;reg [1:0] sel6;reg func0;
// present state register always@( posedge clk or negedge rst )
if( !rst )pstate
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 29
Dot Product in FSMD: Dot Product in FSMD: DatapathDatapath
R0(sum)
R1(i)
R2
R3
U1(MUL)
U0(ADD/COMP)
M
1 100
0 0
sel0 sel1
we0
sel2 sel3
we1 we2
sel5 sel6
func0
we3
status0
A B
sel4
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 30
Dot Product in FSMD: Dot Product in FSMD: DatapathDatapath
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 31
Quality metricsQuality metricsWhen using HLS, it is important to
Establish required performance Measure implementation performance
Metrics need to be established to quantify both
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 32
Required PerformanceRequired PerformanceNaturally by input/output bandwidth in
Bits per secondTriangles per second...
More often by speed inMillion Instructions per Second (MIPS)
The work, measured by the number of TinyIRinstructions required for the computation an algorithm
applies per unit of data
workbandwidthTinyIRPerf ×=)(
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 33
Implementation PerformanceImplementation Performance
Use “back-of-the-envelope” method to quickly assess quality metrics directly from the RTL implementation
CycleCount is number of clock cycles it takes to complete the algorithmCycleTime is the shortest period for correct operation of the synthesized circuit
)(*)()(
TinyRTLCycleTimeTinyRTLCycleCountworkTinyRTLPerf =
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 34
Cycle CountCycle CountEstimate CycleCount(TinyRTL)
Statically examining the RTL
– Example: Consider the dot product algorithm in TinyRTL. It can be shown that C0 will be executed once, while C1-C4 will be executed 100 times. Therefore Note that this is significantly less than the total number of virtual instructions, which can be calculated from TinyIR as 3 + 13*100 = 1303. This type of speedup is achieved by executing multiple instructions in parallel in RTL.
401100*41)( =+=TinyRTLCycleCount
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 35
Cycle TimeCycle TimeThe worst-case delay along all register-to-register paths
Process-independent delay unit
)()( rtDelayMAXTinyRTLCycleTime Trt∈=
drawnLumnsFO */36.04 =
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 36
Cycle TimeCycle TimeDot product example, find all true register-to-register paths in FSMD model
Path 1: src1→mux1→u→muxd→destPath 2: src2→mux2→u→muxd→destPath 3: pstate→c1→mux1→u→muxd→destPath 4: pstate→c2→mux2→u→muxd→destPath 5: pstate→cu→u→muxd→destPath 6: pstate→cd→muxd→dest
Note: on some cases, mux1, mux2, muxd, c1, c2, cd and cu may not be needed
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 37
Cycle TimeCycle TimeTime has to be reserved for the correct functioning of the register
Now, the worst case delay of all paths in a register transfer can be calculated as:
skewCQsetup TTTdSeqOverhea ++=
)()(
),()(
),2()2(),1()1(
)(
muxdDelaycdDelay
uDelaycuDelay
muxDelaycDelaymuxDelaycDelay
MaxMax
dSeqOverheartDelay
+
⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢
⎣
⎡+
⎥⎥⎥
⎦
⎤
⎢⎢⎢
⎣
⎡++
+=
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 38
Cycle TimeCycle TimeHow to estimate the delay of the multiplexers and the control logic?
There are cases where some multiplexers or control logic are not neededThe delay of existing multiplexers depend on the number of inputs they have
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 39
Unit Control Signal DelayUnit Control Signal DelayAssuming the delay of all control signals assume a
value , we then have:
Where
cT
⎩⎨⎧
>=
=1)(1)(0
)(uOpcodesTuOpcodes
cuDelayc
[ ]{ }uurtTrtopcodertuOpcodes =∈∀= ..|.)(
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 40
MuxMux Control Signal DelayControl Signal DelaySimilarly, we have
where⎩⎨⎧
>=
=
⎩⎨⎧
>=
=
⎩⎨⎧
>=
=
1)(1)(0
)(
1)(21)(20
)2(
1)(11)(10
)1(
destSrcdTdestSrcd
cdDelay
uSrcTuSrc
cDelay
uSrcTuSrc
cDelay
c
c
c
[ ]{ }[ ]{ }[ ]{ }...|.)(
..|2..)(2..|1..)(1
destdestrtTrturtdestSrcduurtTrtsrcrtuSrc
uurtTrtsrcrtuSrc
=∈∀==∈∀==∈∀=
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 41
Unit and Unit and MuxMux DelayDelay
Process independent delay unitPre-characterize commonly used components
FO4 FO4 FO4 FO4setup 2.0 adder 10.0 2-input 2.4 1KB 10.5hold 0.0 comparator 6.0 4-input 3.2 4KB 12.0
clock skew/jitter 4.0 multiplier 35.0 8-input 4.8 8KB 12.8clock to Q 4.0 inverter 1.0 16-input 8.0 64KB 15.0
multiplexerfunctional unitregister SRAM
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 42
Adding UpAdding UpExample: To estimate the cycle time of the dot product example, we assume = 5 FO4. With the method developed previously, we can find the delay of each register transfer and determine that the cycle time of the RTL is 47.4 FO4. In 90nm technology, this is equivalent to 32.4ps*47.4=1.54ns. In other words, the maximum speed at which the circuit can run is 649Mhz. Recall that the cycle count is 401, and the work (number of virtual instructions) is 1303, we can conclude that
MIPSTinyRTLPerf 210854.1*401
1303)( ==
Register transferSeq
overhead u muxd cd cu c1 c2 mux1 mux2 Delaysts 0, R0 10.0 0.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 17.4sts 0, R1 10.0 0.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 17.4
M.lda R2, R1, A 10.0 10.5 2.4 5.0 0.0 0.0 5.0 0.0 2.4 30.3M.lda R3, R1, B 10.0 10.5 2.4 5.0 0.0 0.0 5.0 0.0 2.4 30.3U0.+ R1, R1, 1 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2
U1.* R2, R2, R3 10.0 35.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 47.4U0.< R3, R1, 100 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2U0.+ R0, R0, R2 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 43
High Level Synthesis AlgorithmsHigh Level Synthesis AlgorithmsZooming in…
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 44
High-level synthesis algorithmsFocus on
the backend component (i.e. core synthesis algorithms)Input: IR in terms of virtual instrctionsOutput: RTL in terms of register transfers
On top of functionally correct RTL, best quality of result (QoR) needs to be met as well.
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 45
Problem Formulation
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 46
Simplifying AssumptionsUser-provided functional unit allocation (i.e. the HW resources used in the implementation) → design explorationPartially finished storage allocation
– Array → a single memory– Scalar → separate registers– Temp value → ONLY thing needs to be mapped
Each virtual inst has one and only one register transferEach register transfer takes a single clock
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 47
Refined Problem
3 key decisionsScheduling: Map each inst to a control step in SchedRegister binding: Map the value computed by each inst to a registerFunctional unit binding: Map each inst to a functional unit
Must satisfy the resource constraint# of inst in each ctrl step
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 49
Dependency Test for TinyIR
No pointersdependency test only consists of comparing the symbol names
With pointersPointer/alias analysis has to be performed
Test result: a precedence graph for each basic block
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 50
Precedence Graph Example
TinyC(a) Code to TinyIR(b)
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 51
Precedence Graph ExampleFor a basic block, for each instruction, draw its dependencies as edges in Precedence Graph.E.g. (28) → (26) →(24),(25)Instructions outside the basic block will be names “s” and “t”
(a) Chain of instructions(b) Precedence Graph of B4
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 52
Unconstrained Scheduling
Assume an unlimited # of functional unitsProblem
Total # of steps = S(t) – S(s)Schedule of source must be earlier and sink must be later than any nodes in the basic block.
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 53
As Soon As Possible (ASAP) Scheduling
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 54
ASAP in EnglishIterative approachIn each iteration
a set of “ready” nodes (inst) are scheduled to a control step.A node is ready when all its predecessors are scheduled
Key: how to efficiently decide if a node is ready?Naïve approach
for each node being scheduled, visit each successor,check all predecessors of the successor
Better approach: keep a counter for each node representing # of predecessors
For each node being scheduled, visit each successorDecrement the counter of the successorWhen counter = 0, the node is ready.O(|V|+|E|)
ALAP(As Late As Possible) algorithm starts to schedule in reverse order (from sink)
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 55
ASAP and ALAP Example
step 0
step 1
step 2
step 3
s
15 16 24 25
17 26
18
20
15 16
24 25
17
26
18
20
15 16 17 18 20 24 25 26
(a) (b) (c)
t t
s
(a) ASAP (b) ALAP (c) MobilityMobility shows the flexibility of scheduling for each node.(difference between ASAP step and ALAP step.
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 56
Resource-constrained Scheduling
Now consider with limited # of functional units:New constraint:
# of instructions scheduled at any control step must
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 57
List Scheduling
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 58
List Scheduling in EnglishModified version of ASAP.Like ASAP, list of nodes ready are maintainedUnlike ASAP
Unit occupancy for current control step needs to be maintained: reservation station (restab)Only a subset of ready nodes can be scheduled.
Key question: How to choose the subset for better performance?Classic NP-complete problem
Solution:Assign priority to ready nodesPriority determined by heuristics
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 59
List Scheduling Heuristics
Less-Flexible-First assigns higher priority to nodes have smaller mobility Mobility = ALAP - ASAP
Distance to sink# successors
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 60
Example(a) Uses less-flexible-first priority(b) Random1 – did not choose 16 at step 0 (extra step)(c) Ramdom2 – did not choose 17 at step 2 (extra step)
(a)
t
18*
20+
26*
s
15 + 16-
25 *
17*
24+
(b)
*15 24 25
16 26
18
20
t
s
17
+ +
+
- *
*
*
(c)
26
18
20
t
+
*
*
s
15 + 16-
25 *
17*
24+
step 0
step 1
step 2
step 3
step 4
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 61
Register BindingVirtual instructions compute values
Values must be stored for later useStore values in registers
QuestionsHow many registers are needed?How can values be bound to registers?
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 62
Register Binding
Objective:Minimize the number of registersMaximize the sharing of registers between values
Observation:Two values can only share the same register if they are not in use at the same time
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 63
Liveness analysis
Objective: Determine when variables are in useWhat does it mean to be live?Definition 5.5 A value (instruction) v is live at a control step s1 if there exists another control step s2 reachable from s1, such that v is used as an operand by one of the instructions scheduled at s2. A live set at control step s1 is the set of all values alive at s1.
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 64
Example
step 0
step 1
step 2
step 3
6 4
15 16
24
25
17
26
18
20
{4, 6}
{4, 15, 16, 25}
{17, 24, 25}
{18, 24, 25}
{20, 26}
15 16 17 18 20 24 25 264 6
(a) (b)
Live values
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 65
Live(s), Def(s) and Use(s)
Liveness analysis is used to compute when variables are aliveUses three sets
Live(s): Set of live values at the beginning of step sDef(s): Set of values defined at step sUse(s): Set of values used at step s
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 66
Basic Block Liveness Analysis
Relationship between variablesLive(s) = Use(s) U [Live(s+1) – Def(s)]
Backward scan: Since each step requires information from the next stepIdea:
Used variables become liveDefined variables become dead
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 67
Basic Block Liveness Analysis
{20, 26}4
{18, 24, 25}{20, 26}3
{17}{18}2
{4, 15, 16}{17, 24}1
{4, 6}{15, 16, 25}0LiveUseDefSTEP
{18, 24, 25}
{18, 24, 25} U[ - ]{20, 26}{20, 26}
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 68
Basic Block Liveness Analysis
{20, 26}4
{18, 24, 25}{20, 26}3
{17}{18}2
{4, 15, 16}{17, 24}1
{4, 6}{15, 16, 25}0LiveUseDefSTEP
{18, 24, 25}
{17} U[ - ]{18}{18,24,25}
{17, 24, 25}
{4, 15, 16, 25}
{4, 6}
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 69
Basic Block Liveness Analysis
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 70
Liveness Analysis Algorithm
This algorithm can be extended to whole control flow graphCaveats:
CFG has backward edgesBB can have multiple successors
ModificationsRepeatedly traverse graph until solution convergesUnion all the live sets from predecessors before applying equation
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 71
Liveness Analysis Algorithm
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 72
Interference Graph
Liveness of variables has been computedConstruct a graph describing interference of two variables
i.e. both variables are alive at the same timeNodes: represent variablesEdges: represent two variables alive at the same time
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 73
Interference Graph
4
6
15
17
18
20
26
1624
25
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 74
Interference Graph Algorithm
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 75
Register binding by coloring
Recall: We want to assign variables to registers to minimize the number of registers usedCorresponds to the graph coloring problem
Given a graph color each node such that no edge joins nodes of the same color using the fewest number of color
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 76
Graph Coloring
Graph color is NP-completeNeed heuristics to complete problem within reasonable timeOverview
Given a order which to visit nodes (i.e. vertex elimination order)Visit each node in that orderPick a color such that no neighbor has the same one
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 77
16
25
Example
16
25
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 78
Example
16
25 1515
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 79
4
Example4
15
16
25
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 80
Example4
15
16
25
2424
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 81
Example4
15
16
25
24
17
1818
17
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 82
Example4
6
15
16
26
25
24
20 17
18
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 83
Register Binding By Coloring
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 84
Vertex Elimination Order
In previous example, assumed that order was givenFor basic block
To determine order use left-edge algorithmOptimal for the interval graph
GenerallyUse heuristic: less-flexible firsti.e. pick nodes with the most neighbors first
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 85
Vertex Elimination
To generate vertex orderVisit node with fewest neighbors and push them onto a stackRepeat until all edges visited
Stack will now contain vertex order starting with the top node on the stack
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 86
Example4
6
15
16
26
25
24
20 17
18 62026
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 87
Example4
15
16
25
24
17
18 620261718
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 88
Example4
15
16
25
24
62026171824
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 89
Example4
15
16
25
620261718244
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 90
Example
15
16
25
620261718244
15
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 91
Example
16
25
620261718244
151625
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 92
Functional Unit Binding
Register count and functional binding affects number of multiplexersDesirable have operands share registers
Common source: If operands for two instructions share the same register, desirable to map both instructions to the same functional unitCommon destination: Same as common source but applied to destination register of instruction
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 93
Impact of Register Binding on Mux
t
t
t0 t3 t1 t4
2 = t0 + t1;
5 = t3 + t4;
..
. MUX
R0 R1 R2 R3
R4 R5
t2 t5
MUX R0 R1
R3
t 0 / t3 t 1 / t 4
t2 /t 5
adder adder
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 94
Impact of Unit Binding on Mux
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 95
15 16
20
24
15 16
20
24
Compatibility GraphDuality with interference graphClique
“Complete” subgraph of a graphGraph coloring = Clique partitioning
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 96
Clique Partitioning Algorithm
Given the compatibility graphIteratively contract an edge
Merge two nodes (u and v) of the selected edgeUpdate the graph
– Preserve only edges to nodes that are neighbors to both to u and v
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 97
Unit Binding Problem Refined
Establish compatibility graphNodes represent instructionsEdges created between nodes such that
– Opcodes are of the same class– Instructions are not scheduled at the same time
Edges are weighted – To help identify common source/destination
Solve weighted clique partitioning problem
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 98
Second Iteration
INSTRUCTION SRC 1 SRC 2 DEST
15 R3 R0 R2
16 R3 R0 R1
20 R2 R1
24 R3 R1
15 16
20
24
0 1
1 2
1
Iteration 1
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 99
Third Iteration
1516
24
20
0 1
Iteration 2
INSTRUCTION SRC 1 SRC 2 DEST
15 R3 R0 R2
16 , 24 R3 R0 R1
20 R2 R1
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 100
End Result
16
24
20
15
After iteration 2
INSTRUCTION SRC 1 SRC 2 DEST
15 R3 R 0 R2
16 , 24 , 20 R3 , R2 R 0 R1
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 101
Random Unit Binding
(a) (b) (c)
1520
1624
15 16
20
24
0 1
1 2
1
1520
2 216
24
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 102
Resultant Datapath
Compare resultWeighted clique partitioningRandom clique partitioning
On #mux inputs1720
(a)
MUL ADD/SUB 0 ADD/SUB 1
R0 R1 R2 R3
12 13 1612
(4)
(6)
sel0 sel1
sel2 sel3 sel4 sel5
we0 we1 we2 we3
func
(b)
MUL ADD/SUB 0 ADD/SUB 1
R0 R1 R2 R3
12 13 16
(4)
(6)
sel0
sel3 sel3 sel4
we0 we1 we2 we3
func
sel2sel1
sel5 sel6
12
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 103
Acknowlegement
This powerpoint is created with the help of
Alex ChoongZefu DaiNick NiWai Sum Mong
From University of Toronto
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 104
Historical NotesMaterial due to pioneering contributions of
McFarland, Don Thomas et al (CMU)De Man et al (IMEC)Parker et al (USC)Gajski et al (UIUC, UC Irvine)Composano et al (IBM)Paulin et al (Carleton)And numerous others…
Closely related evolutionsHardware/Software CodesignApplication-specific instruction processorMPSoC / NoC
-
ECE1769
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 105
Commercial EffortsEarly efforts
Synopsys Behavioral CompilerY Explorations…
Current effortsForteSynforaMentor Graphics CatapultCadence C2SAutoESL…
EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 106
What Went Wrong and Future Top 5
5: Architecture does not scaleFSMD limits to ILP
4: Application does not scaleInput limits in size
3: Poor link with logic/physical synthesis2: Methodology does not scale
Function-centric -> Architecture-centric -> Application centric
1: Identity crisis: what problem to solve best and first
Solvable w/t Good R&D
RequiresCommunityEffort