where is it used? accelerators - university of torontojzhu/courses/behsyn10/hls-handout.pdf · a...

53
ECE1769 EE141 Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 1 Introduction to Introduction to C- Based High Level Synthesis Based High Level Synthesis Jianwen Jianwen Zhu Zhu Electrical and Computer Engineering Electrical and Computer Engineering University of Toronto University of Toronto EE141 Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 2 Where is it Used? Accelerators ARM11 DSP 2D/3DGraphics Accelerator ImagingVideo Accelerator SharedMemoryController/DMA Timers,InterruptController,RTC InternalSRAM,Boot/Secure ROM Security OMAP2420 Power USB OTG Peripherals Peripherals Peripherals ARM11 DSP 2D/3DGraphics Accelerator ImagingVideo Accelerator SharedMemoryController/DMA Timers,InterruptController,RTC InternalSRAM,Boot/Secure ROM Security OMAP2420 Power USB OTG Peripherals Peripherals Peripherals ARM11 DSP 2D/3D Graphics Accelerator ImagingVideo Accelerator SharedMemoryController/DMA Timers,InterruptController,RTC InternalSRAM,Boot/Secure ROM Security OMAP2420 Power USB OTG Peripherals Peripherals Peripherals

Upload: others

Post on 20-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 1

    Introduction to Introduction to CC--Based High Level SynthesisBased High Level Synthesis

    JianwenJianwen ZhuZhuElectrical and Computer EngineeringElectrical and Computer Engineering

    University of TorontoUniversity of Toronto

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 2

    Where is it Used? Accelerators

    ARM11 DSP

    2D/3D GraphicsAccelerator

    Imaging VideoAccelerator

    Shared Memory Controller/DMA

    Timers, Interrupt Controller, RTC

    Internal SRAM, Boot/Secure ROM

    Security

    OMAP2420

    Pow

    er

    USBOTG

    PeripheralsPeripherals Pe

    riphe

    rals

    ARM11 DSP

    2D/3D GraphicsAccelerator

    Imaging VideoAccelerator

    Shared Memory Controller/DMA

    Timers, Interrupt Controller, RTC

    Internal SRAM, Boot/Secure ROM

    Security

    OMAP2420

    Pow

    er

    USBOTG

    PeripheralsPeripherals Pe

    riphe

    rals

    ARM11 DSP

    2D/3D GraphicsAccelerator

    Imaging VideoAccelerator

    Shared Memory Controller/DMA

    Timers, Interrupt Controller, RTC

    Internal SRAM, Boot/Secure ROM

    Security

    OMAP2420

    Pow

    er

    USBOTG

    PeripheralsPeripherals Pe

    riphe

    rals

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 3

    How many of them: a lot!

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 4

    When is it used? Design Flow

    algorithm selectionoptimization

    algorithm selectionoptimization

    functional modelHW/SW partitioning

    behavior mappingarchitecture exploration

    Protocol generationTopology synthesis

    HW/SW partitioningbehavior mapping

    architecture explorationProtocol generationTopology synthesis architecture model

    INPUT OUTPUT

    RTL model

    application requirements

    High level synthesisHigh level synthesis

    CPU

    CPU S

    physical implementation

    M

    IP

    IP M

    S

    CPU

    CPU S

    M

    IP

    IP M

    S

    RTL-to-GDSIIRTL-to-GDSII

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 5

    Why C-based

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 6

    AgendaAgendaHLS: What is it?

    Decision flowDesign representationsQuality metrics

    HLS: AlgorithmsProblem formulationsKey algorithms

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 7

    What is HLS?What is HLS?

    A process of converting behavioural level design to

    register transfer level

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 8

    HLS Input: HLS Input: BehavioralBehavioral level designlevel designImperative program:

    Statement and expression operating on program variables

    CharacteristicsUntimedCaptures the function of the designNo hardware implementation detail

    int A[100], B[100];int sum, i;

    sum = i = 0;while( i < 100 ) {

    sum = sum + A[i] * B[i];i = i + 1;}

    Dot product algorithm

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 9

    HLS Output: RTL designHLS Output: RTL designFinite state machine with datapath(FSMD)

    Datapath performs computations (what)FSM generates control flow (when)

    pstate

    Next State L

    ogic

    `status0

    sel6

    sel4

    sel2

    sel0

    we2

    we0

    func0

    sel5

    sel3

    sel1

    we3

    we1

    ControlL

    ogic

    Controller Datapath

    R3R2R1R0

    ADD MUL

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 10

    Why HLS?

    Automate the task of RTL designSimilar to the transition from assembly programming to high level programming in software

    Guarantee the correctness of generated HDLCorrect by construction

    Value of productivity gainSignificant reduction of development and verification costSignificant reduction of time-to-market

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 11

    BackendBackend

    C/C++Code

    Verilog/VHDLCode

    FrontendFrontend

    OptimizerPartial redundancy

    elimination,Strength reduction,

    Dead code elimination,Constant propagation,

    Code motion,Tree height reduction,

    Algebraic optimization,Loop unrolling,

    Software pipelining•…

    OptimizerPartial redundancy

    elimination,Strength reduction,

    Dead code elimination,Constant propagation,

    Code motion,Tree height reduction,

    Algebraic optimization,Loop unrolling,

    Software pipelining•…

    IntermediateRepresentation

    IntermediateRepresentation

    Code GeneratorCode Generator

    RTLRepresentation

    RTLRepresentation

    AllocationAllocation

    SchedulingScheduling

    BindingBinding

    HLS Design FlowHLS Design Flow

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 12

    Frontend and OptimizerFrontend and OptimizerFrontend: Textual program → Intermediate Representation (IR)

    Perform lexical and syntactical analysisOptimizer: IR → IR

    Common sub-expression eliminationDead code eliminationTree height reductionStrength reduction...

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 13

    BackendBackendHLS backend : IR → RTL

    Allocation – Determine the hardware resources requirementScheduling– Map each operation into a control step– Exploit the parallelism in sequential code– Maintain data dependenciesBinding – Map operations to function units– Map values to storage resources– Exploit hardware sharing (minimize area)

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 14

    Code GeneratorCode GeneratorRTL → HDL

    Translate key decisions made by backend to HDL acceptable by logic synthesis toolsDerive datapath from allocation and bindingDerive FSM from scheduling and binding

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 15

    Design RepresentationsDesign RepresentationsDesign representations

    Capture designs with varying detailsSimplified representations for teaching

    TinyC: simplified C languageTinyIR: simplified IRTinyRTL: simplified RTL representation

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 16

    TinyCTinyCProgram state defined by a set of variablesActions defined by statementsSimplifying assumptions

    All statements are in a single procedure, and there are no procedure callsOnly support int and bool primitive typesOnly support one-dimensional arrayNo pointers are supported

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 17

    TinyCTinyC SyntaxSyntaxDefined by a set of production rules in Backus-Naur Form (BNF)Key constructs

    Declarations: define scalars or array variablesStatements: assignment or control flow statementsExpressions: transformations of scalar, defined primitive or program variable values

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 18

    program:declaration* statement*

    statement:variable '=' expression ';',

    'if' '(' expression ') ' statement( 'else' statement )* ,

    'while' ' (' expression ') ' statement ,'break' '; ','{' declaration* statement* '}'

    declaration:type identifier ['=' expression] '; ',type identifier '[' expression ']' '; '

    type:'int' , 'bool'

    expression:'-' expression , '!' expression ,expression '+' expression ,expression '-' expression ,expression '*' expression ,expression '/' expression ,expression '^' expression ,expression '>>' expression ,expression '

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 19

    NotationsNotations

    A data type T corresponds to a set T, in particular the integer type Z corresponds to set ZA linked list or arrays whose elements are of type T corresponds to the power set of T, or the set of all subsets of T, denoted as T[]A record with fields a of type A, and b of type B corresponds to a set of named tuples, denoted as A graph R whose nodes are of type A corresponds to a relation R: A x AA hash table or dictionary F that maps a value of type A to a value of type B corresponds to a function F: A |--> B

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 20

    TinyIRTinyIR

    Why IR?Decouple optimization algorithms from input languages and target architectures

    Definition: A TinyIR is a tuple with the following elements:A set O = {lds, sts, lda, sta, ba, br, cnst, +, -, *, /, , …} of operation codes, which corresponds to the set of all virtual instruction types.A set S of symbols, which corresponds to the scalar and array variablesA set V: of virtual instructions, which corresponds to the expressions and control transfers in the program.A set B: V[] of basic blocks, each containing a sequence of virtual instructions

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 21

    From From TinyCTinyC to to TinyIRTinyIRConstructs in TinyC have equivalent representation in TinyIR

    Declarations correspond to symbolsStatements and expressions correspond to virtual instructionsVirtual instructions are grouped within different basic blocks

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 22

    Dot product in TinyIR

    scalar sum;scalar i;array A[100];array B[100];

    B1:(0) cnst 0(1) sts (0), sum(2) sts (0), i

    B2:(3) lds i

    (4) lda (3), A(5) lda (3), B(6) * (4) (5)(7) lds sum(8) + (6) (7)(9) sts (8), sum(10) cnst 1(11) + (3) (10)(12) sts (11), i(13) cnst 100(14) < (11) (13)(15) bt (14), B2

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 23

    TinyRTLTinyRTLCapture key micro architectural information

    Computational resourcesStorage resources

    Instruction in TinyRTL is referred to as register transfer and is annotated with micro architecture resource useRegister transfer is cycle accurate

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 24

    TinyRTLTinyRTL DefintionDefintionDefinition: A TinyRTL is a tuple with the following

    elements:A set M of memories used to store scalar and array variablesA set R of registers used to store scalar variables or temporary instruction resultsA set U of functional units, such as adders, subtractors, multipliers, shifters, etc.A set I: of register transfers, each of which uses a functional unit to transform values, which are either constants or retrieved from registers, and stores the result back to a register.A set C: I[] of control steps, each of which contains a set of register transfer

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 25

    Dot Product in Dot Product in TinyRTLTinyRTL5 control step: C0-C4Each step contains one or more register transfers

    register R0, R1, R2, R3;memory M;unit U0, U1;

    C0: sts 0, R0; sts 0, R1;C1: M.lda R2, R1, A;C2: M.lda R3, R1, B; U0.+ R1, R1, 1;C3: U1.* R2, R2, R3; U0.< R3, R1, 100;C4: U0.+ R0, R0, R2; bt C1, R3;

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 26

    FSMDFSMD

    Structured hardware descriptionsDescribed in

    SchematicsVerilogHDL

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 27

    Dot Product in FSMD: ControllerDot Product in FSMD: Controller

    C0 C1 C2 C3 C4

    T

    F

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 28

    Dot Product in FSMD: ControllerDot Product in FSMD: Controller // control signals always@( pstate ) begin we0 = 1'b0; we1 = 1'b0; we2 = 1'b0; we3 = 1'b0; sel0 = 1'bx; sel1 = 1'bx; sel2 = 1'bx; sel3 = 1'bx; sel4 = 1'bx; sel5 = 1'bx; sel6 = 2'bxx; func0 = 1'bx; case( pstate ) `C0:begin we0 = 1'b1; we1 = 1'b1; sel0 = 1'b0; sel1 = 1'b0; end `C1:begin we2 = 1'b1; sel2 = 1'b0; sel4 = 1'b0; end `C2:begin we1 = 1'b1; we3 = 1'b1; sel1 = 1'b1; sel3 = 1'b0; sel4 = 1'b1; sel5 = 1'b0; sel6 = 2'b00; func0 = 1'b0; end `C3:begin we2 = 1'b1; we3 = 1'b1; sel2 = 1'b1; sel3 = 1'b1; sel5 = 1'b0; sel6 = 2'b01; func0 = 1'b1; end `C4:begin we0 = 1'b1; sel0 = 1'b1; sel5 = 1'b1; sel6 = 2'b10; func0 = 1'b0; end endcase end

    endmodule

    `define C0 3'b000`define C1 3'b001`define C2 3'b010`define C3 3'b011`define C4 3'b100

    module ctrl( clk, rst, status0, we0, we1, we2, we3, sel0, sel1, sel2, sel3, sel4, sel5, sel6, func0 );

    input clk, rst;input status0;output we0, we1, we2, we3;output sel0, sel1, sel2, sel3, sel4, sel5;output [1:0] sel6;output func0;

    reg [2:0] pstate, nstate;reg we0, we1, we2, we3;reg sel0, sel1, sel2, sel3, sel4, sel5;reg [1:0] sel6;reg func0;

    // present state register always@( posedge clk or negedge rst )

    if( !rst )pstate

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 29

    Dot Product in FSMD: Dot Product in FSMD: DatapathDatapath

    R0(sum)

    R1(i)

    R2

    R3

    U1(MUL)

    U0(ADD/COMP)

    M

    1 100

    0 0

    sel0 sel1

    we0

    sel2 sel3

    we1 we2

    sel5 sel6

    func0

    we3

    status0

    A B

    sel4

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 30

    Dot Product in FSMD: Dot Product in FSMD: DatapathDatapath

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 31

    Quality metricsQuality metricsWhen using HLS, it is important to

    Establish required performance Measure implementation performance

    Metrics need to be established to quantify both

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 32

    Required PerformanceRequired PerformanceNaturally by input/output bandwidth in

    Bits per secondTriangles per second...

    More often by speed inMillion Instructions per Second (MIPS)

    The work, measured by the number of TinyIRinstructions required for the computation an algorithm

    applies per unit of data

    workbandwidthTinyIRPerf ×=)(

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 33

    Implementation PerformanceImplementation Performance

    Use “back-of-the-envelope” method to quickly assess quality metrics directly from the RTL implementation

    CycleCount is number of clock cycles it takes to complete the algorithmCycleTime is the shortest period for correct operation of the synthesized circuit

    )(*)()(

    TinyRTLCycleTimeTinyRTLCycleCountworkTinyRTLPerf =

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 34

    Cycle CountCycle CountEstimate CycleCount(TinyRTL)

    Statically examining the RTL

    – Example: Consider the dot product algorithm in TinyRTL. It can be shown that C0 will be executed once, while C1-C4 will be executed 100 times. Therefore Note that this is significantly less than the total number of virtual instructions, which can be calculated from TinyIR as 3 + 13*100 = 1303. This type of speedup is achieved by executing multiple instructions in parallel in RTL.

    401100*41)( =+=TinyRTLCycleCount

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 35

    Cycle TimeCycle TimeThe worst-case delay along all register-to-register paths

    Process-independent delay unit

    )()( rtDelayMAXTinyRTLCycleTime Trt∈=

    drawnLumnsFO */36.04 =

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 36

    Cycle TimeCycle TimeDot product example, find all true register-to-register paths in FSMD model

    Path 1: src1→mux1→u→muxd→destPath 2: src2→mux2→u→muxd→destPath 3: pstate→c1→mux1→u→muxd→destPath 4: pstate→c2→mux2→u→muxd→destPath 5: pstate→cu→u→muxd→destPath 6: pstate→cd→muxd→dest

    Note: on some cases, mux1, mux2, muxd, c1, c2, cd and cu may not be needed

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 37

    Cycle TimeCycle TimeTime has to be reserved for the correct functioning of the register

    Now, the worst case delay of all paths in a register transfer can be calculated as:

    skewCQsetup TTTdSeqOverhea ++=

    )()(

    ),()(

    ),2()2(),1()1(

    )(

    muxdDelaycdDelay

    uDelaycuDelay

    muxDelaycDelaymuxDelaycDelay

    MaxMax

    dSeqOverheartDelay

    +

    ⎥⎥⎥⎥⎥

    ⎢⎢⎢⎢⎢

    ⎡+

    ⎥⎥⎥

    ⎢⎢⎢

    ⎡++

    +=

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 38

    Cycle TimeCycle TimeHow to estimate the delay of the multiplexers and the control logic?

    There are cases where some multiplexers or control logic are not neededThe delay of existing multiplexers depend on the number of inputs they have

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 39

    Unit Control Signal DelayUnit Control Signal DelayAssuming the delay of all control signals assume a

    value , we then have:

    Where

    cT

    ⎩⎨⎧

    >=

    =1)(1)(0

    )(uOpcodesTuOpcodes

    cuDelayc

    [ ]{ }uurtTrtopcodertuOpcodes =∈∀= ..|.)(

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 40

    MuxMux Control Signal DelayControl Signal DelaySimilarly, we have

    where⎩⎨⎧

    >=

    =

    ⎩⎨⎧

    >=

    =

    ⎩⎨⎧

    >=

    =

    1)(1)(0

    )(

    1)(21)(20

    )2(

    1)(11)(10

    )1(

    destSrcdTdestSrcd

    cdDelay

    uSrcTuSrc

    cDelay

    uSrcTuSrc

    cDelay

    c

    c

    c

    [ ]{ }[ ]{ }[ ]{ }...|.)(

    ..|2..)(2..|1..)(1

    destdestrtTrturtdestSrcduurtTrtsrcrtuSrc

    uurtTrtsrcrtuSrc

    =∈∀==∈∀==∈∀=

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 41

    Unit and Unit and MuxMux DelayDelay

    Process independent delay unitPre-characterize commonly used components

    FO4 FO4 FO4 FO4setup 2.0 adder 10.0 2-input 2.4 1KB 10.5hold 0.0 comparator 6.0 4-input 3.2 4KB 12.0

    clock skew/jitter 4.0 multiplier 35.0 8-input 4.8 8KB 12.8clock to Q 4.0 inverter 1.0 16-input 8.0 64KB 15.0

    multiplexerfunctional unitregister SRAM

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 42

    Adding UpAdding UpExample: To estimate the cycle time of the dot product example, we assume = 5 FO4. With the method developed previously, we can find the delay of each register transfer and determine that the cycle time of the RTL is 47.4 FO4. In 90nm technology, this is equivalent to 32.4ps*47.4=1.54ns. In other words, the maximum speed at which the circuit can run is 649Mhz. Recall that the cycle count is 401, and the work (number of virtual instructions) is 1303, we can conclude that

    MIPSTinyRTLPerf 210854.1*401

    1303)( ==

    Register transferSeq

    overhead u muxd cd cu c1 c2 mux1 mux2 Delaysts 0, R0 10.0 0.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 17.4sts 0, R1 10.0 0.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 17.4

    M.lda R2, R1, A 10.0 10.5 2.4 5.0 0.0 0.0 5.0 0.0 2.4 30.3M.lda R3, R1, B 10.0 10.5 2.4 5.0 0.0 0.0 5.0 0.0 2.4 30.3U0.+ R1, R1, 1 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2

    U1.* R2, R2, R3 10.0 35.0 2.4 5.0 0.0 0.0 0.0 0.0 0.0 47.4U0.< R3, R1, 100 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2U0.+ R0, R0, R2 10.0 10.0 2.4 5.0 5.0 5.0 5.0 2.4 3.2 33.2

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 43

    High Level Synthesis AlgorithmsHigh Level Synthesis AlgorithmsZooming in…

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 44

    High-level synthesis algorithmsFocus on

    the backend component (i.e. core synthesis algorithms)Input: IR in terms of virtual instrctionsOutput: RTL in terms of register transfers

    On top of functionally correct RTL, best quality of result (QoR) needs to be met as well.

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 45

    Problem Formulation

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 46

    Simplifying AssumptionsUser-provided functional unit allocation (i.e. the HW resources used in the implementation) → design explorationPartially finished storage allocation

    – Array → a single memory– Scalar → separate registers– Temp value → ONLY thing needs to be mapped

    Each virtual inst has one and only one register transferEach register transfer takes a single clock

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 47

    Refined Problem

    3 key decisionsScheduling: Map each inst to a control step in SchedRegister binding: Map the value computed by each inst to a registerFunctional unit binding: Map each inst to a functional unit

    Must satisfy the resource constraint# of inst in each ctrl step

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 49

    Dependency Test for TinyIR

    No pointersdependency test only consists of comparing the symbol names

    With pointersPointer/alias analysis has to be performed

    Test result: a precedence graph for each basic block

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 50

    Precedence Graph Example

    TinyC(a) Code to TinyIR(b)

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 51

    Precedence Graph ExampleFor a basic block, for each instruction, draw its dependencies as edges in Precedence Graph.E.g. (28) → (26) →(24),(25)Instructions outside the basic block will be names “s” and “t”

    (a) Chain of instructions(b) Precedence Graph of B4

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 52

    Unconstrained Scheduling

    Assume an unlimited # of functional unitsProblem

    Total # of steps = S(t) – S(s)Schedule of source must be earlier and sink must be later than any nodes in the basic block.

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 53

    As Soon As Possible (ASAP) Scheduling

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 54

    ASAP in EnglishIterative approachIn each iteration

    a set of “ready” nodes (inst) are scheduled to a control step.A node is ready when all its predecessors are scheduled

    Key: how to efficiently decide if a node is ready?Naïve approach

    for each node being scheduled, visit each successor,check all predecessors of the successor

    Better approach: keep a counter for each node representing # of predecessors

    For each node being scheduled, visit each successorDecrement the counter of the successorWhen counter = 0, the node is ready.O(|V|+|E|)

    ALAP(As Late As Possible) algorithm starts to schedule in reverse order (from sink)

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 55

    ASAP and ALAP Example

    step 0

    step 1

    step 2

    step 3

    s

    15 16 24 25

    17 26

    18

    20

    15 16

    24 25

    17

    26

    18

    20

    15 16 17 18 20 24 25 26

    (a) (b) (c)

    t t

    s

    (a) ASAP (b) ALAP (c) MobilityMobility shows the flexibility of scheduling for each node.(difference between ASAP step and ALAP step.

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 56

    Resource-constrained Scheduling

    Now consider with limited # of functional units:New constraint:

    # of instructions scheduled at any control step must

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 57

    List Scheduling

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 58

    List Scheduling in EnglishModified version of ASAP.Like ASAP, list of nodes ready are maintainedUnlike ASAP

    Unit occupancy for current control step needs to be maintained: reservation station (restab)Only a subset of ready nodes can be scheduled.

    Key question: How to choose the subset for better performance?Classic NP-complete problem

    Solution:Assign priority to ready nodesPriority determined by heuristics

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 59

    List Scheduling Heuristics

    Less-Flexible-First assigns higher priority to nodes have smaller mobility Mobility = ALAP - ASAP

    Distance to sink# successors

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 60

    Example(a) Uses less-flexible-first priority(b) Random1 – did not choose 16 at step 0 (extra step)(c) Ramdom2 – did not choose 17 at step 2 (extra step)

    (a)

    t

    18*

    20+

    26*

    s

    15 + 16-

    25 *

    17*

    24+

    (b)

    *15 24 25

    16 26

    18

    20

    t

    s

    17

    + +

    +

    - *

    *

    *

    (c)

    26

    18

    20

    t

    +

    *

    *

    s

    15 + 16-

    25 *

    17*

    24+

    step 0

    step 1

    step 2

    step 3

    step 4

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 61

    Register BindingVirtual instructions compute values

    Values must be stored for later useStore values in registers

    QuestionsHow many registers are needed?How can values be bound to registers?

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 62

    Register Binding

    Objective:Minimize the number of registersMaximize the sharing of registers between values

    Observation:Two values can only share the same register if they are not in use at the same time

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 63

    Liveness analysis

    Objective: Determine when variables are in useWhat does it mean to be live?Definition 5.5 A value (instruction) v is live at a control step s1 if there exists another control step s2 reachable from s1, such that v is used as an operand by one of the instructions scheduled at s2. A live set at control step s1 is the set of all values alive at s1.

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 64

    Example

    step 0

    step 1

    step 2

    step 3

    6 4

    15 16

    24

    25

    17

    26

    18

    20

    {4, 6}

    {4, 15, 16, 25}

    {17, 24, 25}

    {18, 24, 25}

    {20, 26}

    15 16 17 18 20 24 25 264 6

    (a) (b)

    Live values

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 65

    Live(s), Def(s) and Use(s)

    Liveness analysis is used to compute when variables are aliveUses three sets

    Live(s): Set of live values at the beginning of step sDef(s): Set of values defined at step sUse(s): Set of values used at step s

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 66

    Basic Block Liveness Analysis

    Relationship between variablesLive(s) = Use(s) U [Live(s+1) – Def(s)]

    Backward scan: Since each step requires information from the next stepIdea:

    Used variables become liveDefined variables become dead

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 67

    Basic Block Liveness Analysis

    {20, 26}4

    {18, 24, 25}{20, 26}3

    {17}{18}2

    {4, 15, 16}{17, 24}1

    {4, 6}{15, 16, 25}0LiveUseDefSTEP

    {18, 24, 25}

    {18, 24, 25} U[ - ]{20, 26}{20, 26}

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 68

    Basic Block Liveness Analysis

    {20, 26}4

    {18, 24, 25}{20, 26}3

    {17}{18}2

    {4, 15, 16}{17, 24}1

    {4, 6}{15, 16, 25}0LiveUseDefSTEP

    {18, 24, 25}

    {17} U[ - ]{18}{18,24,25}

    {17, 24, 25}

    {4, 15, 16, 25}

    {4, 6}

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 69

    Basic Block Liveness Analysis

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 70

    Liveness Analysis Algorithm

    This algorithm can be extended to whole control flow graphCaveats:

    CFG has backward edgesBB can have multiple successors

    ModificationsRepeatedly traverse graph until solution convergesUnion all the live sets from predecessors before applying equation

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 71

    Liveness Analysis Algorithm

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 72

    Interference Graph

    Liveness of variables has been computedConstruct a graph describing interference of two variables

    i.e. both variables are alive at the same timeNodes: represent variablesEdges: represent two variables alive at the same time

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 73

    Interference Graph

    4

    6

    15

    17

    18

    20

    26

    1624

    25

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 74

    Interference Graph Algorithm

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 75

    Register binding by coloring

    Recall: We want to assign variables to registers to minimize the number of registers usedCorresponds to the graph coloring problem

    Given a graph color each node such that no edge joins nodes of the same color using the fewest number of color

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 76

    Graph Coloring

    Graph color is NP-completeNeed heuristics to complete problem within reasonable timeOverview

    Given a order which to visit nodes (i.e. vertex elimination order)Visit each node in that orderPick a color such that no neighbor has the same one

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 77

    16

    25

    Example

    16

    25

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 78

    Example

    16

    25 1515

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 79

    4

    Example4

    15

    16

    25

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 80

    Example4

    15

    16

    25

    2424

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 81

    Example4

    15

    16

    25

    24

    17

    1818

    17

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 82

    Example4

    6

    15

    16

    26

    25

    24

    20 17

    18

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 83

    Register Binding By Coloring

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 84

    Vertex Elimination Order

    In previous example, assumed that order was givenFor basic block

    To determine order use left-edge algorithmOptimal for the interval graph

    GenerallyUse heuristic: less-flexible firsti.e. pick nodes with the most neighbors first

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 85

    Vertex Elimination

    To generate vertex orderVisit node with fewest neighbors and push them onto a stackRepeat until all edges visited

    Stack will now contain vertex order starting with the top node on the stack

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 86

    Example4

    6

    15

    16

    26

    25

    24

    20 17

    18 62026

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 87

    Example4

    15

    16

    25

    24

    17

    18 620261718

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 88

    Example4

    15

    16

    25

    24

    62026171824

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 89

    Example4

    15

    16

    25

    620261718244

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 90

    Example

    15

    16

    25

    620261718244

    15

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 91

    Example

    16

    25

    620261718244

    151625

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 92

    Functional Unit Binding

    Register count and functional binding affects number of multiplexersDesirable have operands share registers

    Common source: If operands for two instructions share the same register, desirable to map both instructions to the same functional unitCommon destination: Same as common source but applied to destination register of instruction

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 93

    Impact of Register Binding on Mux

    t

    t

    t0 t3 t1 t4

    2 = t0 + t1;

    5 = t3 + t4;

    ..

    . MUX

    R0 R1 R2 R3

    R4 R5

    t2 t5

    MUX R0 R1

    R3

    t 0 / t3 t 1 / t 4

    t2 /t 5

    adder adder

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 94

    Impact of Unit Binding on Mux

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 95

    15 16

    20

    24

    15 16

    20

    24

    Compatibility GraphDuality with interference graphClique

    “Complete” subgraph of a graphGraph coloring = Clique partitioning

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 96

    Clique Partitioning Algorithm

    Given the compatibility graphIteratively contract an edge

    Merge two nodes (u and v) of the selected edgeUpdate the graph

    – Preserve only edges to nodes that are neighbors to both to u and v

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 97

    Unit Binding Problem Refined

    Establish compatibility graphNodes represent instructionsEdges created between nodes such that

    – Opcodes are of the same class– Instructions are not scheduled at the same time

    Edges are weighted – To help identify common source/destination

    Solve weighted clique partitioning problem

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 98

    Second Iteration

    INSTRUCTION SRC 1 SRC 2 DEST

    15 R3 R0 R2

    16 R3 R0 R1

    20 R2 R1

    24 R3 R1

    15 16

    20

    24

    0 1

    1 2

    1

    Iteration 1

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 99

    Third Iteration

    1516

    24

    20

    0 1

    Iteration 2

    INSTRUCTION SRC 1 SRC 2 DEST

    15 R3 R0 R2

    16 , 24 R3 R0 R1

    20 R2 R1

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 100

    End Result

    16

    24

    20

    15

    After iteration 2

    INSTRUCTION SRC 1 SRC 2 DEST

    15 R3 R 0 R2

    16 , 24 , 20 R3 , R2 R 0 R1

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 101

    Random Unit Binding

    (a) (b) (c)

    1520

    1624

    15 16

    20

    24

    0 1

    1 2

    1

    1520

    2 216

    24

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 102

    Resultant Datapath

    Compare resultWeighted clique partitioningRandom clique partitioning

    On #mux inputs1720

    (a)

    MUL ADD/SUB 0 ADD/SUB 1

    R0 R1 R2 R3

    12 13 1612

    (4)

    (6)

    sel0 sel1

    sel2 sel3 sel4 sel5

    we0 we1 we2 we3

    func

    (b)

    MUL ADD/SUB 0 ADD/SUB 1

    R0 R1 R2 R3

    12 13 16

    (4)

    (6)

    sel0

    sel3 sel3 sel4

    we0 we1 we2 we3

    func

    sel2sel1

    sel5 sel6

    12

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 103

    Acknowlegement

    This powerpoint is created with the help of

    Alex ChoongZefu DaiNick NiWai Sum Mong

    From University of Toronto

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 104

    Historical NotesMaterial due to pioneering contributions of

    McFarland, Don Thomas et al (CMU)De Man et al (IMEC)Parker et al (USC)Gajski et al (UIUC, UC Irvine)Composano et al (IBM)Paulin et al (Carleton)And numerous others…

    Closely related evolutionsHardware/Software CodesignApplication-specific instruction processorMPSoC / NoC

  • ECE1769

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 105

    Commercial EffortsEarly efforts

    Synopsys Behavioral CompilerY Explorations…

    Current effortsForteSynforaMentor Graphics CatapultCadence C2SAutoESL…

    EE141Introduction to C-based High Level Synthesis Jianwen Zhu © 2009 - P. 106

    What Went Wrong and Future Top 5

    5: Architecture does not scaleFSMD limits to ILP

    4: Application does not scaleInput limits in size

    3: Poor link with logic/physical synthesis2: Methodology does not scale

    Function-centric -> Architecture-centric -> Application centric

    1: Identity crisis: what problem to solve best and first

    Solvable w/t Good R&D

    RequiresCommunityEffort