dynamic scheduling
DESCRIPTION
Dynamic Scheduling. The hardware rearranges the instruction execution to reduce the stalls while maintaining data flow and exception behavior. Enables handling some cases when dependences are unknown at compile time. Simplifies the compiler. - PowerPoint PPT PresentationTRANSCRIPT
CSCE 614 Fall 2009 1
Dynamic Scheduling• The hardware rearranges the instruction
execution to reduce the stalls while maintaining data flow and exception behavior.– Enables handling some cases when dependences are
unknown at compile time.– Simplifies the compiler.– Allows code that was compiled with one pipeline in
mind to run efficiently on a different pipeline.
• Needs a significant increase in hardware complexity.
CSCE 614 Fall 2009 2
Classic 5-Stage Pipeline
• Check COD or Appendix A.
IM Reg DM RegALU
IM Reg DM RegALU
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7
Time (in clock cycles)
lw $2, 200($0)
lw $3, 300($0)
Programexecutionorder(in instructions)
lw $1, 100($0) IM Reg DM RegALU
CSCE 614 Fall 2009 3
Simple Pipeline
• Uses in-order instruction issue and execution.– Instruction issue: the process of letting an
instruction move from ID stage to EX stage.
• If an instruction is stalled in the pipeline, no later instructions can proceed.
• If there are multiple functional units, these units could sit idle.
CSCE 614 Fall 2009 4
Example
DIV.D F0, F2, F4ADD.D F10, F0, F8SUB.D F12, F8, F14
The SUB.D instruction cannot execute because the dependenceof ADD.D on DIV.D causes the pipeline stall.
CSCE 614 Fall 2009 5
Dynamic Scheduling• Classic 5-stage pipeline
– IF – ID – EX – MEM – WB
• Both structural and data hazards could be checked during ID stage.
• If there is a dependence between two closely spaced instructions in the pipeline, this will lead to a hazard and a stall will result.
• Separating Issue process – Checking for any structural hazards– Waiting for the absence of a data hazard
CSCE 614 Fall 2009 6
• In-order instruction issue – Instructions are issued in program order.
• An instruction can begin execution as soon as its data operand is available.
• Out-of-order execution (out-of-order completion)– Possibility of WAR and WAW hazards
CSCE 614 Fall 2009 7
WAW / WAR
DIV.D F0, F2, F4ADD.D F6, F0, F8SUB.D F8, F10, F14MUL.D F6, F10, F8
antidependence(WAR)
output dependence(WAW)
Both hazards can be avoided by the use of register renaming.
CSCE 614 Fall 2009 8
Out-of-Order Completion• Creates major complications in handling
exceptions.– Dynamic scheduling with out-of-order
completion must preserve exception behavior in the sense that exactly those exceptions that would arise if the program were executed in strict program order actually do arise.
– Dynamic scheduling may generate imprecise exceptions.
• The processor state when an exception is raised does not look exactly as if the instructions were executed sequentially in strict program order.
CSCE 614 Fall 2009 9
Splitting ID Pipe Stage
• Issue stage– Decode instructions, check for structural
hazards.
• Read operands stage– Wait until no data hazards, then read
operands.
CSCE 614 Fall 2009 10
• The pipeline allows multiple instructions to be in execution at the same time.– An instruction is fetched into a register (IR) or
a queue of pending instructions.– IF – Issue – Read Operands – EX - …– Execution may take multiple cycles depending
on the operation.– In execution: from begin execution to
complete execution– Requires multiple functional units, pipelined
functional units, or both.
CSCE 614 Fall 2009 11
• In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue).
• They can be stalled or bypass each other in the second stage (read operands), so enter execution out-of-order.
• Scoreboarding in CDC 6600– Allows instructions to execute out of order
when there are sufficient resources and no data dependences.
• Tomasulo’s algorithm: more sophisticated.
CSCE 614 Fall 2009 12
Tomasulo’s Algorithm• Originally developed for IBM 360/91
floating-point unit by R. Tomasulo.– Only 4 double-precision floating-point registers– Long memory accesses– Long floating-point delays– Key concept: tracking instruction dependences
to allow execution as soon as operands are available and renaming registers to avoid WAR and WAW hazards.
• MIPS also uses the concept.
CSCE 614 Fall 2009 13
Dependences
• Data dependences (True data dependences) => RAW
• Name dependences– Antidependence => WAR– Output dependence => WAW
• Control dependences– Not critical
RegisterRenaming
CSCE 614 Fall 2009 14
Dynamic Pipelining
• All instructions pass through the issue stage in order (in-order issue).
• Instructions can be stalled or bypass each other in the second stage (read operands) and thus enter execution out of order.
• The processor has multiple functional units.
• Tomasulo’s algorithm
CSCE 614 Fall 2009 15
Tomasulo’s Algorithm
• Tracks when operands for instructions are available.– Minimizes RAW hazards.– Introduces register renaming to minimize
WAW and WAR hazards.
• Supports the overlapped execution of multiple iterations of a loop.
CSCE 614 Fall 2009 16
Avoiding Hazards
• RAW– Execute an instruction only when its operands
are available.
• WAW and WAR– Register renaming
CSCE 614 Fall 2009 17
Register Renaming (Example)
• Eliminates WAR and WAW hazards by renaming all destination registers.
DIV.D F0, F2, F4ADD.D F6, F0, F8S.D F6, 0(R1)SUB.D F8, F10, F14MUL.D F6, F10, F8
The renaming process can be done statically by the compiler.
CSCE 614 Fall 2009 18
DIV.D F0, F2, F4ADD.D F6, F0, F8S.D F6, 0(R1)SUB.D F8, F10, F14MUL.D F6, F10, F8
DIV.D F0, F2, F4ADD.D S, F0, F8S.D S, 0(R1)SUB.D T, F10, F14MUL.D F6, F10, T
Antidependence
Output dependence
S, T: temporary registers
CSCE 614 Fall 2009 19
Tomasulo’s Algorithm
• Register renaming is provided by the reservation stations, which buffer the operands of instructions waiting to issue, and by the issue logic.
• A reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register.
CSCE 614 Fall 2009 20
Tomasulo’s Algorithm• Pending instructions designate the
reservation station that will provide their input.
• When successive writes to a register overlap in execution, only the last one is actually used to update the register.
• As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation station, which provides register renaming.
CSCE 614 Fall 2009 21
• The information held in the reservation stations at each functional unit determine when an instruction can begin execution at that unit. (Decentralized control)
• Results are passed directly to functional units from the reservation stations where they are buffered using a common result bus called the common data bus (CDB).
CSCE 614 Fall 2009 22
MIPS FP Unit Using Tomasulo’s Algorithm
CSCE 614 Fall 2009 23
Tomasulo-Based MIPS
• Includes both the FP unit and the load-store unit.
• Each reservation station holds an instruction that has been issued and is awaiting execution at a functional unit, and either the operand values for that instruction, if they have already been computed, or else the names of the reservation stations that will provide the operand values.
CSCE 614 Fall 2009 24
• The load and store buffers hold data or addresses coming from and going to memory and behave almost exactly like reservation stations.
• The FP registers are connected by a pair of buses to the functional units (reservation stations) and by a single bus to the store buffer.
• All results from the functional units and from memory are sent on the CDB, which goes to everywhere except the load buffer.
CSCE 614 Fall 2009 25
3 Steps An Instruction Goes Through 1. Issue
- Get the next instruction from the head of the instruction queue (FIFO order).
- If there is a matching reservation station that is empty, issue the instruction.
- No empty reservation station => structural hazard
- If operands are not in registers, keep track of the functional units that will produce the operands.
- Register renaming (WAW and WAR avoided)
CSCE 614 Fall 2009 26
2. Execute- If one or more operands is not available, monitor the common data bus (CDB).- When all the operands are available, the operation can be executed at the functional unit.- By delaying instruction execution until the operands are available, RAW hazards are avoided.- If more than one instruction is ready for a single functional unit, it will choose among them.
CSCE 614 Fall 2009 27
• Loads and Stores– First compute the effective address when the
base register is available and the effective address is place in the load or store buffer.
– Loads in the load buffer execute as soon as the memory unit is available.
– Stores in the store buffer wait for the value to be stored before being sent to the memory unit.
• No instruction is allowed to initiate execution until all branches that precede the instruction in program order have completed.
CSCE 614 Fall 2009 28
3. Write Result
- When the result is available, write it on the CDB and from there into registers and into any reservation stations waiting for this result.
- Stores also write data to memory during this step.
CSCE 614 Fall 2009 29
• Once an instruction has issued and is waiting for a source operand, it refers to the operand by the reservation station number where the instruction that will write the register has been assigned.
• WAW and WAR hazards are eliminated by renaming results using reservation station numbers.– # reservation stations > # registers
CSCE 614 Fall 2009 30
Reservation Station Fields• Op: operation to perform• Qj, Qk: indicate reservation stations that
will produce the operands. (0 indicates the operand is available in Vj or Vk)
• Vj, Vk: stores the value of the operands• A: holds information for the memory
address calculation (immediate field/effective address)
• Busy: indicates this reservation station is busy.
CSCE 614 Fall 2009 31
Register File Field
• Qi: reservation station number that contains the operation whose result should be stored into this register. (0 indicates currently no active instruction is computing a result destined for this register. => the value is simply the register contents.)
CSCE 614 Fall 2009 32
Load/Store Buffers
• Have a field A– Holds the result of the effective address once
the first step of execution has been completed.
CSCE 614 Fall 2009 33
Dynamic Scheduling Example
L.D F6, 34(R2)L.D F2, 45(R3)MUL.D F0, F2, F4SUB.D F8, F2, F6DIV.D F10, F0, F6ADD.D F6, F8, F2
Only the first load has completed and written its result.
See Figure 3.3 on page 190.
CSCE 614 Fall 2009 34
Tomasulo’s Algorithm
• Advantages– Distribution of the hazard detection logic: If
multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB.
– Elimination of stalls for WAW and WAR hazards by renaming registers using the reservation stations.
CSCE 614 Fall 2009 35
Examples
• See Figure 2.10 on page 99.
• See Figure 2.11 on page 100
CSCE 614 Fall 2009 36
Tomasulo’s Algorithm
• Fig 2.12
CSCE 614 Fall 2009 37
Example (p.103)
Loop: L.D F0, 0(R1)MUL.D F4, F0, F2S.D F4, 0(R1)DADDIU R1, R1, #-8BNE R1, R2, Loop
The loop is unrolled dynamically by the hardware.
All instructions in two successive iterations issued, but noneof the FP/load/store operations completed
CSCE 614 Fall 2009 38
Load/Store
• If a load and a store access the same memory address, then either– the load is before the store in program order
and interchanging them results in a WAR hazard, or
– the store is before the load in program order and interchanging them results in a RAW hazard.
• Interchanging two stores => WAW
CSCE 614 Fall 2009 39
Simple (not Optimal) Solution
• The processor performs the effective address calculations in program order.
• We can check whether there is an address conflict by examining the A field of all active store/load buffers.
CSCE 614 Fall 2009 40
Dynamic Scheduling
• Very high performance, if branches are predicted accurately.
• Main drawback: Complexity– Large amount of hardware– Each reservation station contains associative
buffer and complex control logic.
• The performance can be limited by the single CDB.