Sections 3.2 and 3.3 Dynamic Scheduling: Tomasulo's Algorithm
Post on 24-Jan-2016
EEF011 Computer Architecture. Sections 3.2 and 3.3 Dynamic Scheduling: Tomasulo's Algorithm. 吳俊興, Department of Computer Science and Information Engineering, National University of Kaohsiung. October 2004.
A Dynamic Algorithm: Tomasulo's Algorithm
- For IBM 360/91 (before caches!)
  - 3 years after the CDC 6600
- Goal: high performance without special compilers
- Small number of floating-point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer?
- The descendants of this have flourished!
  - Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...
Example: eliminating WAR and WAW hazards by register renaming
Original:
    DIV.D  F0, F2, F4
    ADD.D  F6, F0, F8
    S.D    F6, 0(R1)
    SUB.D  F8, F10, F14
    MUL.D  F6, F10, F8
- WAR between ADD.D and SUB.D (on F8), WAW between ADD.D and MUL.D (on F6)
- (The hazards arise because DIV.D takes many more cycles to produce F0, which delays ADD.D)
With register renaming (S and T are temporary names):
    DIV.D  F0, F2, F4
    ADD.D  S, F0, F8
    S.D    S, 0(R1)
    SUB.D  T, F10, F14
    MUL.D  F6, F10, T
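The renaming above can be done mechanically. Below is a minimal Python sketch, not the hardware mechanism. It assumes instructions are tuples (op, dest, src1, src2) with dest = None for a store, and it gives every destination a fresh hypothetical name P0, P1, ... (more aggressive than the slide's S and T, which rename only the conflicting writes). The function names rename and name_hazards are illustrative.

```python
def rename(program):
    """Give every destination a fresh name and redirect later reads to it."""
    mapping = {}   # architectural register -> most recent fresh name
    renamed = []
    fresh = 0
    for op, dest, *srcs in program:
        # Sources read whichever name currently holds the register's value.
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dest = dest
        if dest is not None:
            new_dest = f"P{fresh}"
            fresh += 1
            mapping[dest] = new_dest
        renamed.append((op, new_dest, *new_srcs))
    return renamed


def name_hazards(program):
    """List WAR and WAW pairs (earlier index, later index) in program order."""
    hazards = []
    for i, (_, dest_i, *srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j = program[j][1]
            if dest_j is None:
                continue
            if dest_j in srcs_i:
                hazards.append(("WAR", i, j))   # j overwrites what i reads
            if dest_j == dest_i:
                hazards.append(("WAW", i, j))   # j overwrites what i writes
    return hazards


program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("S.D", None, "F6", "R1"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]
print(name_hazards(program))          # [('WAR', 1, 3), ('WAW', 1, 4), ('WAR', 2, 4)]
print(name_hazards(rename(program)))  # []
```

The check also reports a WAR between S.D and MUL.D on F6, which the same renaming removes. The true (RAW) dependences are kept: after renaming, ADD.D still reads the name that DIV.D writes.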
Tomasulo Algorithm
- Register renaming is provided
  - by reservation stations, which buffer the operands of instructions waiting to issue
  - by the issue logic
- Basic idea:
  - a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register (WAR)
  - pending instructions designate the reservation station that will provide their input (RAW)
  - when successive writes to a register overlap in execution, only the last one is actually used to update the register (WAW)
- As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming
  - more reservation stations than real registers
Properties of Tomasulo Algorithm
- Control & buffers distributed with Function Units (FU)
  - Hazard detection and execution control are distributed
  - FU buffers called reservation stations; have pending operands
- Registers in instructions replaced by values or pointers to reservation stations (RS)
  - form of register renaming to avoid WAR, WAW hazards
- Bypassing: results passed directly to FU from RS, not through registers, over the Common Data Bus
  - the bus broadcasts results to all FUs, so all units waiting for an operand can be loaded simultaneously
- Loads and stores treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Figure 3.2 Basic structure of a MIPS floating-point unit using Tomasulo's algorithm
- Load buffers:
  - hold components of the effective address
  - track outstanding loads that are waiting on the memory
  - hold the results of completed loads that are waiting for the CDB
- Store buffers:
  - hold components of the effective address
  - hold the destination memory addresses of outstanding stores that are waiting for the data value to store
  - hold the address and value to store until the memory unit is available
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from the head of the instruction queue
  - If a reservation station is free (no structural hazard), control issues the instruction with the operand values (renames registers)
  - No free RS => there is a structural hazard
  - If the operands are not in the registers, keep track of the FU that will produce them
  - This step renames registers, eliminating WAR and WAW hazards
2. Execute: operate on operands (EX)
  - When both operands are ready (placed into the RS), execute; if not ready, monitor the Common Data Bus for the result
  - By delaying EX until the operands are available, RAW hazards are avoided
3. Write result: finish execution (WB)
  - Write on the Common Data Bus to the registers and the RS of all awaiting units; mark the reservation station available
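The Issue stage is where renaming happens, so it is worth seeing in code. The sketch below covers Issue only and is an illustration, not the 360/91 logic. It assumes two-source instructions written as (op, dest, src_j, src_k), dict-based stations, and None instead of 0 for "no pending producer". The names make_station and issue are hypothetical.

```python
def make_station():
    """An empty reservation station (field names follow the slides)."""
    return {"busy": False, "op": None, "Vj": None, "Vk": None, "Qj": None, "Qk": None}


def issue(instr, stations, reg_status, regs):
    """Try to issue one instruction; return the station name, or None on a stall."""
    op, dest, src_j, src_k = instr
    free = next((name for name, rs in stations.items() if not rs["busy"]), None)
    if free is None:
        return None                          # structural hazard: no free station
    rs = stations[free]
    rs["busy"], rs["op"] = True, op
    for src, v, q in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            rs[v], rs[q] = regs[src], None   # value available: copy it now
        else:
            rs[v], rs[q] = None, producer    # pending: remember the producer's tag
    reg_status[dest] = free                  # later readers of dest wait on this station
    return free


stations = {"Add1": make_station(), "Add2": make_station()}
regs = {"F2": 0.0, "F6": 1.5, "F8": 0.0}
reg_status = {"F2": "Load2"}                 # F2 is still being loaded

print(issue(("SUBD", "F8", "F6", "F2"), stations, reg_status, regs))  # Add1
print(issue(("ADDD", "F6", "F8", "F2"), stations, reg_status, regs))  # Add2
print(issue(("ADDD", "F8", "F6", "F2"), stations, reg_status, regs))  # None
```

SUBD copies the old value of F6 into Add1 at issue, so the later ADDD can claim F6 as its destination without a WAR stall. ADDD's first source is recorded as the tag Add1 rather than the register name F8.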
- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
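The "come from" behaviour can be sketched as a broadcast keyed by the producer's tag. This is a minimal Python illustration under assumed data structures: dict-based stations with Vj/Vk/Qj/Qk fields, a reg_status table, and None for "not waiting". The function name broadcast is hypothetical.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Deliver one result on the CDB to every consumer waiting on `tag`."""
    for rs in stations.values():
        for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
            if rs[q] == tag:                 # this operand was waiting on the producer
                rs[v], rs[q] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:                  # write only if tag is still the latest writer
            regs[reg] = value
            reg_status[reg] = None


stations = {
    "Add1": {"Vj": 1.5, "Qj": None, "Vk": None, "Qk": "Load2"},
    "Mult1": {"Vj": None, "Qj": "Load2", "Vk": 2.0, "Qk": None},
}
reg_status = {"F2": "Load2", "F6": "Add2"}
regs = {"F2": 0.0, "F6": 1.5}

broadcast("Load2", 3.0, stations, reg_status, regs)
print(stations["Add1"]["Vk"], stations["Mult1"]["Vj"], regs["F2"])  # 3.0 3.0 3.0
```

Both waiting stations pick up the value in the same step, and a register is written only when its status entry still names the broadcasting unit. That second condition is how an overlapped earlier write (WAW) is dropped.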
7 Components of Reservation Station
- Op: Operation to perform in the unit (e.g., + or -)
- Qj, Qk: Reservation stations producing the corresponding source operand
  - Note: Qj, Qk = 0 => ready or unnecessary
  - Store buffers only have Qi for the RS producing the result
- Vj, Vk: Values of the source operands
  - Only one of the V field or the Q field is valid for each operand
  - Store buffers have a V field: the result to be stored
- A: used to hold information for the memory address calculation for a load or a store
- Busy: Indicates reservation station or FU is busy
Register result status Qi
- Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
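The fields above map naturally onto a record type. The following Python dataclass is one possible encoding, offered as a sketch: it uses None where the slides use 0 or blank, and the class name ReservationStation and the ready method are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False            # Busy: station (and its FU) is in use
    op: Optional[str] = None      # Op: operation to perform
    vj: Optional[float] = None    # Vj, Vk: source operand values
    vk: Optional[float] = None
    qj: Optional[str] = None      # Qj, Qk: tags of the stations producing the operands
    qk: Optional[str] = None
    a: Optional[int] = None       # A: address information for loads and stores

    def ready(self) -> bool:
        """An occupied station may execute once neither operand is pending."""
        return self.busy and self.qj is None and self.qk is None


# Register result status Qi: register name -> tag of the pending writer.
register_status = {"F8": "Add1"}

rs = ReservationStation(busy=True, op="SUBD", vj=1.5, qk="Load2")
print(rs.ready())                 # False
rs.vk, rs.qk = 3.0, None          # the value arrives, so Q is cleared
print(rs.ready())                 # True
```

For each operand, either the V field or the Q field carries information, never both, which is why the update sets one and clears the other.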
Tomasulo Example
- Example speed: 3 clocks for FP +,-; 10 for *; 40 clocks for /
Tomasulo Example Cycle 1
Tomasulo Example Cycle 2
- Note: Can have multiple loads outstanding
Tomasulo Example Cycle 3
- Note: register names are removed ("renamed") in reservation stations; MULT issued
- Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4
- Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5
- Timer starts counting down for Add1, Mult1
Tomasulo Example Cycle 6
- Issue ADDD here despite name dependency on F6?
Tomasulo Example Cycle 7
- Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8
Tomasulo Example Cycle 9
Tomasulo Example Cycle 10
- Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11
- Write result of ADDD here?
- All quick instructions complete in this cycle!
Tomasulo Example Cycle 12
Tomasulo Example Cycle 13
Tomasulo Example Cycle 14
Tomasulo Example Cycle 15
- Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16
- Just waiting for Mult2 (DIVD) to complete
Tomasulo Example Cycle 55
Tomasulo Example Cycle 56
- Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57
- Once again: in-order issue, out-of-order execution and out-of-order completion.
Tomasulo Drawbacks
- Complexity
  - delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
- Many associative stores (CDB) at high speed
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units => high capacitance, high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs => more FU logic for parallel associative stores
- Non-precise interrupts!
  - We will address this later
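The one-completion-per-cycle limit can be made concrete with a small arbitration sketch in Python. The policy here is an assumption chosen for illustration (earliest-finished unit first, ties broken by insertion order), and schedule_writes is a hypothetical name. The slides do not specify how the 360/91 arbitrated the bus.

```python
def schedule_writes(finish_cycle, num_cdbs=1):
    """Assign each finished unit the first cycle with a free CDB slot."""
    used = {}          # cycle -> broadcasts already scheduled in that cycle
    write_cycle = {}
    for tag, cycle in sorted(finish_cycle.items(), key=lambda kv: kv[1]):
        while used.get(cycle, 0) >= num_cdbs:
            cycle += 1                 # bus busy: the result waits a cycle
        used[cycle] = used.get(cycle, 0) + 1
        write_cycle[tag] = cycle
    return write_cycle


# Three units all able to write in cycle 10.
finish = {"Add1": 10, "Mult1": 10, "Load1": 10}
print(schedule_writes(finish))              # {'Add1': 10, 'Mult1': 11, 'Load1': 12}
print(schedule_writes(finish, num_cdbs=2))  # {'Add1': 10, 'Mult1': 10, 'Load1': 11}
```

With a single bus the three results are serialized over three cycles. A second bus shortens the wait, at the cost of every station comparing against two tags per cycle.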
Tomasulo Loop Example
    Loop: LD     F0, 0(R1)
          MULTD  F4, F0, F2
          SD     F4, 0(R1)
          SUBI   R1, R1, #8
          BNEZ   R1, Loop
- This time assume Multiply takes 4 clocks
- Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
- To be clear, will show clocks for SUBI, BNEZ
  - Reality: integer instructions ahead of floating-point instructions
- Show 2 iterations
Loop Example Cycle 1
Loop Example Cycle 2
Loop Example Cycle 3
- Implicit renaming sets up data flow graph
Loop Example Cycle 4
- Dispatching SUBI instruction (not in FP queue)
Loop Example Cycle 5
- And, BNEZ instruction (not in FP queue)
Loop Example Cycle 6
- Notice that F0 never sees Load from location 80
Loop Example Cycle 7
- Register file completely detached from computation
- First and second iteration completely overlapped
Loop Example Cycle 8
Loop Example Cycle 9
- Load1 completing: who is waiting?
- Note: Dispatching SUBI
Loop Example Cycle 10
- Load2 completing: who is waiting?
- Note: Dispatching BNEZ
Loop Example Cycle 11
- Next load in sequence
Loop Example Cycle 12
- Why not issue third multiply?
Loop Example Cycle 13
- Why not issue third store?
Loop Example Cycle 14
- Mult1 completing. Who is waiting?
Loop Example Cycle 15
- Mult2 completing. Who is waiting?
Loop Example Cycle 16
Loop Example Cycle 17
Loop Example Cycle 18
Loop Example Cycle 19
Loop Example Cycle 20
- Once again: in-order issue, out-of-order execution and out-of-order completion.
Why can Tomasulo overlap iterations of loops?
- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling).
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard.
- Other perspective: Tomasulo builds the data flow dependency graph on the fly.
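The "data flow graph on the fly" view can be sketched by tracking, for each register, the latest instruction that writes it. This Python illustration assumes only the floating-point instructions of two loop iterations, written as (op, dest, sources), with the R1 address arithmetic left to the integer unit. dataflow_edges is a hypothetical name.

```python
def dataflow_edges(instrs):
    """Return (producer index, consumer index) pairs in issue order."""
    producer = {}      # register -> index of the latest instruction writing it
    edges = []
    for i, (op, dest, srcs) in enumerate(instrs):
        for s in srcs:
            if s in producer:
                edges.append((producer[s], i))   # true (RAW) dependence
        if dest is not None:
            producer[dest] = i                   # later readers now point here
    return edges


one_iteration = [
    ("LD", "F0", []),
    ("MULTD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4"]),
]
two_iterations = one_iteration * 2

print(dataflow_edges(two_iterations))  # [(0, 1), (1, 2), (3, 4), (4, 5)]
```

No edge connects instructions 0-2 to instructions 3-5: although both iterations name F0 and F4, each reader is linked to the most recent writer, so the two chains are independent and can overlap.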
Tomasulo's scheme offers 2 major advantages
(1) the distribution of the hazard detection logic
  - distributed reservation stations and the CDB
  - If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
  - If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) the elimination of stalls for the WAW and WAR hazards of the scoreboard
What you might have thought
1. 4 stages of instruction execution
2. Status of FU: normal things to keep track of (RAW & structural for busy):
  - Fi from instruction format of the machine (Fi is dest)
  - Add unit can Add or Sub
  - Rj, Rk: status of registers (Yes means ready)
  - Qj, Qk: if a No in Rj, Rk, means waiting for a FU to write result; Qj, Qk say which FU it is waiting for
3. Status of register result (WAW & WAR): which FU is going to write into registers
- Scoreboard on 6600 = size of FU
- 6.7, 6.8, 6.9, 6.12, 6.13, 6.16, 6.17
- FU latencies: Add 2, Mult 10, Div 40 clocks