Sections 3.2 and 3.3: Dynamic Scheduling – Tomasulo’s Algorithm


Post on 24-Jan-2016







EEF011 Computer Architecture. Sections 3.2 and 3.3: Dynamic Scheduling – Tomasulo’s Algorithm. 吳俊興, Department of Computer Science and Information Engineering, National University of Kaohsiung. October 2004.


  • Sections 3.2 and 3.3: Dynamic Scheduling – Tomasulo's Algorithm

    October 2004, EEF011 Computer Architecture

  • A Dynamic Algorithm: Tomasulo's Algorithm
    For the IBM 360/91 (before caches!), 3 years after the CDC 6600
    Goal: high performance without special compilers
    The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
        This led Tomasulo to figure out how to get more effective registers: renaming in hardware!
    Why study a 1966 computer?
        Its descendants have flourished: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

  • Example: eliminating WAR and WAW hazards by register renaming
    Original:
        DIV.D  F0, F2, F4
        ADD.D  F6, F0, F8
        S.D    F6, 0(R1)
        SUB.D  F8, F10, F14
        MUL.D  F6, F10, F8
    WAR hazard between ADD.D and SUB.D (on F8); WAW hazard between ADD.D and MUL.D (on F6)
    (These arise because DIV.D takes many cycles to produce F0, so ADD.D is still waiting while the later instructions could proceed)
    With register renaming (S and T are temporary names):
        DIV.D  F0, F2, F4
        ADD.D  S, F0, F8
        S.D    S, 0(R1)
        SUB.D  T, F10, F14
        MUL.D  F6, F10, T
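
The renaming on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of instructions, the function name `rename`, and the temporary names `T0`, `T1`, ... are hypothetical, and unlike the slide (which renames only the two conflicting destinations) the sketch gives every destination a fresh name.

```python
from itertools import count

def rename(program):
    """Give every destination register a fresh name.

    program: list of (opcode, dest, sources); dest is None for stores.
    After renaming no name is written twice, so WAR and WAW hazards
    disappear and only true (RAW) dependences remain.
    """
    fresh = (f"T{i}" for i in count())   # hypothetical pool of temporary names
    current = {}                         # architectural name -> latest renamed name
    renamed = []
    for op, dest, srcs in program:
        # Each source reads the most recent renamed version, if one exists.
        new_srcs = [current.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]
renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

In the output, SUB.D writes a name that ADD.D never reads (no WAR), and ADD.D and MUL.D write different names (no WAW), while MUL.D still reads the result of SUB.D (the RAW dependence is preserved).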

  • Tomasulo Algorithm
    Register renaming is provided
        by reservation stations, which buffer the operands of instructions waiting to issue
        by the issue logic
    Basic idea:
        A reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register (WAR)
        Pending instructions designate the reservation station that will provide their input (RAW)
        When successive writes to a register overlap in execution, only the last one is actually used to update the register (WAW)
    As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming
    There can be more reservation stations than real registers

  • Properties of Tomasulo Algorithm
    Control & buffers distributed with Function Units (FU)
        Hazard detection and execution control are distributed
        FU buffers are called reservation stations (RS); they hold pending operands
    Registers in instructions are replaced by values or pointers to reservation stations
        A form of register renaming that avoids WAR and WAW hazards
    Bypassing: results are passed directly to FUs from RSs, not through registers, over the Common Data Bus (CDB)
        The CDB broadcasts results to all FUs, so all units waiting for an operand can be loaded simultaneously
    Loads and stores are treated as FUs with RSs as well
    Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

  • Figure 3.2 Basic structure of a MIPS floating-point unit using Tomasulo's algorithm
    Load buffers:
        hold components of the effective address
        track outstanding loads that are waiting on the memory
        hold the results of completed loads that are waiting for the CDB
    Store buffers:
        hold components of the effective address
        hold the destination memory addresses of outstanding stores that are waiting for the data value to store
        hold the address and value to store until the memory unit is available

  • Three Stages of Tomasulo Algorithm
    1. Issue: get an instruction from the head of the instruction queue
        If a reservation station is free (no structural hazard), control issues the instruction with the operand values (renames registers)
        No free RS => there is a structural hazard
        If the operands are not in the registers, keep track of the FU that will produce them
        This step renames registers, eliminating WAR and WAW hazards
    2. Execute: operate on operands (EX)
        When both operands are ready (placed into the RS), execute; if not ready, monitor the Common Data Bus for the result
        By delaying EX until the operands are available, RAW hazards are avoided
    3. Write result: finish execution (WB)
        Write on the Common Data Bus to the registers and the RSs of all awaiting units; mark the reservation station available

    Normal data bus: data + destination ("go to" bus)
    Common data bus: data + source ("come from" bus)
        64 bits of data + 4 bits of Functional Unit source address
        Write if it matches the expected Functional Unit (the one that produces the result)
        Does the broadcast
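
The three stages can be sketched as a toy Python model. It is a minimal sketch under several assumptions, not the 360/91 design: latencies and the instruction queue are omitted, the caller picks the reservation station by name, the register values are made up, and the names `Station`, `issue`, `can_execute` and `write_result` are hypothetical.

```python
class Station:
    """One reservation station: two value slots (V) and two tag slots (Q)."""
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.op = None
        self.v = [None, None]   # Vj, Vk
        self.q = [None, None]   # Qj, Qk (None means the operand is ready)

OPS = {
    "ADD.D": lambda a, b: a + b,
    "SUB.D": lambda a, b: a - b,
    "MUL.D": lambda a, b: a * b,
}

regs = {"F0": 6.0, "F2": 2.0, "F4": 4.0, "F8": 8.0}   # made-up register values
reg_status = {}   # register -> name of the station that will write it
stations = {n: Station(n) for n in ("Add1", "Add2", "Mult1")}

def issue(station_name, op, dest, srcs):
    """Stage 1: claim a station, copy ready operands, record tags for the rest."""
    rs = stations[station_name]
    if rs.busy:
        return False                      # structural hazard: stall
    rs.busy, rs.op = True, op
    for i, src in enumerate(srcs):
        if src in reg_status:             # value still being computed: keep the tag
            rs.v[i], rs.q[i] = None, reg_status[src]
        else:                             # value available: buffer it now
            rs.v[i], rs.q[i] = regs[src], None
    reg_status[dest] = rs.name            # rename: dest now means this station
    return True

def can_execute(rs):
    """Stage 2 condition: both operands have arrived."""
    return rs.busy and rs.q == [None, None]

def write_result(station_name):
    """Stage 3: broadcast (tag, value) on the CDB, then free the station."""
    rs = stations[station_name]
    assert can_execute(rs)
    value = OPS[rs.op](*rs.v)
    for other in stations.values():       # every waiting station snoops the bus
        for i in range(2):
            if other.q[i] == rs.name:
                other.v[i], other.q[i] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == rs.name:           # only the latest writer updates the register
            regs[reg] = value
            del reg_status[reg]
    rs.busy, rs.op, rs.v, rs.q = False, None, [None, None], [None, None]
    return value

issue("Mult1", "MUL.D", "F0", ["F2", "F4"])
issue("Add1", "ADD.D", "F6", ["F0", "F8"])   # waits on Mult1, buffers the old F8
issue("Add2", "SUB.D", "F8", ["F2", "F4"])   # may overwrite F8 early: no WAR stall
blocked = can_execute(stations["Add1"])       # False: F0 not yet produced
write_result("Add2")                          # completes out of order
write_result("Mult1")                         # broadcast releases Add1
write_result("Add1")
print(regs)
```

Add2 finishes before Add1 even though it was issued later, and Add1 still computes with the old value of F8 because it buffered that operand at issue time.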

  • 7 Components of Reservation Station
    Op: operation to perform in the unit (e.g., + or -)
    Qj, Qk: reservation stations producing the corresponding source operands
        Note: Qj, Qk = 0 => ready or unnecessary
        Store buffers only have Qi for the RS producing the result
    Vj, Vk: values of the source operands
        Only one of the V field or the Q field is valid for each operand
        Store buffers have a V field: the result to be stored
    A: used to hold information for the memory address calculation for a load or a store
    Busy: indicates that the reservation station or FU is busy
    Register result status Qi: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
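
As a rough illustration, the fields above map naturally onto a small record type. The sketch below is an assumption-laden Python rendering: the class name, the `ready` and `capture` helpers, and the use of `None` in place of the slide's "0" or blank are all hypothetical choices, and the operand values are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False            # Busy
    op: Optional[str] = None      # Op
    vj: Optional[float] = None    # Vj: value of first source operand
    vk: Optional[float] = None    # Vk: value of second source operand
    qj: Optional[str] = None      # Qj: producing station (None plays the role of 0)
    qk: Optional[str] = None      # Qk
    a: Optional[int] = None       # A: address information for loads and stores

    def ready(self) -> bool:
        """Both operands present, so the operation may begin executing."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, value: float) -> None:
        """Take a value broadcast on the CDB if this station is waiting for that tag."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

# Register result status (Qi): register -> station that will write it.
register_status = {"F8": "Add1"}

add1 = ReservationStation("Add1", busy=True, op="SUB.D", vj=6.0, qk="Load2")
waiting = add1.ready()        # second operand still pending on Load2
add1.capture("Load2", 2.0)    # Load2's result arrives on the CDB
print(waiting, add1.ready(), add1.vk)
```

For each operand exactly one of the V and Q fields carries information at any time, which is what `capture` maintains when it moves an operand from "pending" to "present".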

  • Tomasulo Example
    Example speed: 3 clocks for FP +, -; 10 clocks for *; 40 clocks for /

  • Tomasulo Example Cycle 1

  • Tomasulo Example Cycle 2
    Note: can have multiple loads outstanding

  • Tomasulo Example Cycle 3
    Note: register names are removed (renamed) in the reservation stations; MULT issued
    Load1 completing; what is waiting for Load1?

  • Tomasulo Example Cycle 4
    Load2 completing; what is waiting for Load2?

  • Tomasulo Example Cycle 5
    Timer starts counting down for Add1, Mult1

  • Tomasulo Example Cycle 6
    Issue ADDD here despite name dependency on F6?

  • Tomasulo Example Cycle 7
    Add1 (SUBD) completing; what is waiting for it?

  • Tomasulo Example Cycle 8

  • Tomasulo Example Cycle 9

  • Tomasulo Example Cycle 10
    Add2 (ADDD) completing; what is waiting for it?

  • Tomasulo Example Cycle 11
    Write result of ADDD here?
    All quick instructions complete in this cycle!

  • Tomasulo Example Cycle 12

  • Tomasulo Example Cycle 13

  • Tomasulo Example Cycle 14

  • Tomasulo Example Cycle 15
    Mult1 (MULTD) completing; what is waiting for it?

  • Tomasulo Example Cycle 16
    Just waiting for Mult2 (DIVD) to complete

  • Tomasulo Example Cycle 55

  • Tomasulo Example Cycle 56
    Mult2 (DIVD) is completing; what is waiting for it?

  • Tomasulo Example Cycle 57
    Once again: in-order issue, out-of-order execution and out-of-order completion.

  • Tomasulo Drawbacks
    Complexity
        Delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
    Many associative stores (CDB) at high speed
    Performance limited by the Common Data Bus
        Each CDB must go to multiple functional units => high capacitance, high wiring density
        Number of functional units that can complete per cycle limited to one!
            Multiple CDBs => more FU logic for parallel associative stores
    Non-precise interrupts!
        We will address this later

  • Tomasulo Loop Example
        Loop: LD     F0, 0(R1)
              MULTD  F4, F0, F2
              SD     F4, 0(R1)
              SUBI   R1, R1, #8
              BNEZ   R1, Loop

    This time assume multiply takes 4 clocks
    Assume the 1st load takes 8 clocks (L1 cache miss), the 2nd load takes 1 clock (hit)
    To be clear, will show clocks for SUBI, BNEZ
        Reality: integer instructions run ahead of floating-point instructions
    Show 2 iterations

  • Loop Example

  • Loop Example Cycle 1

  • Loop Example Cycle 2

  • Loop Example Cycle 3
    Implicit renaming sets up the data flow graph

  • Loop Example Cycle 4
    Dispatching SUBI instruction (not in FP queue)

  • Loop Example Cycle 5
    And BNEZ instruction (not in FP queue)

  • Loop Example Cycle 6
    Notice that F0 never sees the load from location 80

  • Loop Example Cycle 7
    Register file completely detached from computation
    First and second iterations completely overlapped

  • Loop Example Cycle 8

  • Loop Example Cycle 9
    Load1 completing: who is waiting?
    Note: dispatching SUBI

  • Loop Example Cycle 10
    Load2 completing: who is waiting?
    Note: dispatching BNEZ

  • Loop Example Cycle 11
    Next load in sequence

  • Loop Example Cycle 12
    Why not issue third multiply?

  • Loop Example Cycle 13
    Why not issue third store?

  • Loop Example Cycle 14
    Mult1 completing. Who is waiting?

  • Loop Example Cycle 15
    Mult2 completing. Who is waiting?

  • Loop Example Cycle 16

  • Loop Example Cycle 17

  • Loop Example Cycle 18

  • Loop Example Cycle 19

  • Loop Example Cycle 20
    Once again: in-order issue, out-of-order execution and out-of-order completion.

  • Why can Tomasulo overlap iterations of loops?
    Register renaming
        Multiple iterations use different physical destinations for registers (dynamic loop unrolling).

    Reservation stations
        Permit instruction issue to advance past integer control flow operations
        Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard.

    Other perspective: Tomasulo builds the data flow dependency graph on the fly.
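
The "data flow graph on the fly" view can be illustrated with a short Python sketch that issues two iterations of the loop body and records which station tag each instruction waits on. The assumptions are significant: stations are never freed or exhausted, the integer instructions (SUBI, BNEZ) are left out, and the function name `build_dataflow` and the tag scheme (unit name plus a running number) are hypothetical.

```python
def build_dataflow(instructions):
    """Return (tag, opcode, tags waited on) for each instruction, in issue order."""
    reg_status = {}   # register -> tag of the latest pending producer
    counters = {}     # functional-unit kind -> stations handed out so far
    graph = []
    for unit, op, dest, srcs in instructions:
        counters[unit] = counters.get(unit, 0) + 1
        tag = f"{unit}{counters[unit]}"          # e.g. Load1, Mult2
        # Sources with a pending producer become edges in the graph;
        # sources without one (F2 here) are read straight from the register file.
        waits_on = [reg_status[s] for s in srcs if s in reg_status]
        graph.append((tag, op, waits_on))
        if dest is not None:
            reg_status[dest] = tag               # later readers see the new tag
    return graph

loop_body = [
    ("Load",  "LD",    "F0", []),
    ("Mult",  "MULTD", "F4", ["F0", "F2"]),
    ("Store", "SD",    None, ["F4"]),
]
graph = build_dataflow(loop_body * 2)            # two iterations, as on the slides
for node in graph:
    print(node)
```

The second iteration's multiply waits on Load2 rather than Load1, so the two iterations form independent chains even though both use the architectural names F0 and F4. This is the sense in which renaming acts as dynamic loop unrolling.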

  • Tomasulo's scheme offers 2 major advantages
    (1) the distribution of the hazard detection logic
        Distributed reservation stations and the CDB
        If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB
        If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
    (2) the elimination of stalls for the WAW and WAR hazards of the scoreboard

    What you might have thought
    1. 4 stages of instruction execution
    2. Status of FU: normal things to keep track of (RAW & structural for busy):
        Fi from the instruction format of the machine (Fi is dest)
        Add unit can Add or Sub
        Rj, Rk: status of registers (Yes means ready)
        Qj, Qk: a No in Rj, Rk means waiting for a FU to write the result; Qj, Qk say which FU it is waiting for
    3. Status of register result (WAW & WAR): which FU is going to write into registers
    Scoreboard on 6600 = size of FU
    6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
    FU latencies: Add 2, Mult 10, Div 40 clocks

