can pipelining get us into trouble? -...
TRANSCRIPT
04.10.2010
1
Pipelining - Hazards
Can Pipelining Get Us Into Trouble?
� Yes: Pipeline Hazards� Structural hazards: attempt to use the same resource two different ways at the same time
� E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
� Control hazards: attempt to make a decision before condition is evaluated
� E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
� Branch instructions
� Data hazards: attempt to use item before it is ready� E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
� Instruction depends on result of prior instruction still in the pipeline
04.10.2010
2
Structural Hazard
� A relation between two instructions indicating that the two instructions may want to use the same hardware resource (function unit, register file port, shared bus, cache port, etc.) at the same time
� MIPS pipeline as designed so far does not have structural hazard� But we had to avoid it
� Usually occurs when a functional unit is not fully pipelined (e.g., in floating point pipeline)
Single Memory Port / Structural Hazard
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg
AL
U
RegIfetch DMem
Reg
AL
U
RegIfetch DMem
Reg
AL
U
RegIfetch DMem
Reg
AL
U
RegIfetch DMem
Reg
AL
U
RegIfetch DMem
04.10.2010
3
Single Memory Port / Structural Hazard
I
n
s
t
r.
O
r
d
e
r
Time (clock cycles)
Load
Instr 1
Instr 2
StallStall
Instr 3
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg
AL
U
DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
How do you “bubble” the pipe?
Single Memory Port / Structural Hazard
� Instead of stalling the pipeline
� Other solutions� Make dual ported memory
� Physically separate memory architecture into instruction and data (Harvard Architecture from Harvard Mark I project of IBM led by Dr. Howard Aiken)
� Another typical structural hazard� Functional unit is not fully pipelined due to cost/complexity
� Pipeline interval > 1 pipe stage
04.10.2010
4
Example: Cost of Structural Hazard
Suppose that 40% of instruction mix are loads or stores, and that the ideal CPI of the pipelined machine is 1. Assume that the machine withthe structural hazard has a clock rate that is 5% higher than the clock rate of the machine without the hazard. Which pipeline is faster, and by how much?
( )
ideal
idealidealideal
idealidealideal
hazardhazardhazardhazard
Time3.1
timeCycleCPIIC3.1
05.1
timeCycleCPI14.01IC
timeCycle CPIIC Time
×=
×××=
×××+×=
××=
Data Hazards
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
04.10.2010
5
Three Generic Data Hazards
� True (or Flow) Dependency (Read After Write, or RAW)� A later instruction tries to read operand before earlier instructions write it
I: add r1,r2,r3
J: sub r4,r1,r3
RAW Hazards
� True (value, flow) dependence between instructions i and j means iproduces a result value that j uses� This is a producer-consumer relationship
� This is a dependence based on values, not on the names of the containers of the values
� Every true dependence is a RAW hazard
� Not every RAW hazard is a true dependence� Any RAW hazard that cannot be removed by renaming is a true dependence
Original programOriginal programOriginal programOriginal program
1: A = B+C
2: A = D+E
3: G = A+H
True dependence: (2,3)RAW hazard: (1,3), (2,3)
Renamed ProgramRenamed ProgramRenamed ProgramRenamed Program
1: X = B+C
2: A = D+E
3: G = A+H
True dependence: (2,3)RAW hazard: (2,3)
04.10.2010
6
Three Generic Data Hazards
� Anti-Dependency (Write After Read, or WAR)� A later instruction tries to write operand before earlier instructions read it
� This hazard results from reuse of the same register
� Can’t happen in our simple 5 stage pipeline because:� All instructions take 5 stages, and
� Reads are always in stage 2, and
� Writes are always in stage 5
I: add r2, r1,r3
J: sub r1,r4,r3
Three Generic Data Hazards
� Output Dependency (Write After Write, or WAW)� A later instruction tries to write operand before earlier instructions write it
� This hazard results from reuse of the same register
� Can’t happen in our simple 5 stage pipeline because:� All instructions take 5 stages, and
� Reads are always in stage 2, and
� Writes are always in stage 5
I: add r1,r2,r3
J: sub r1,r4,r3
04.10.2010
7
More on WAR and WAW
� WAR and WAW hazards are name dependences� Two instructions happen to use the same register (name), although they don’t have to
� Can often be eliminated by renaming, either in software or hardware
� Implies the use of additional resources, hence additional cost� Renaming is not always possible: implicit operands such as accumulator, PC, or condition codes cannot be renamed
How to Break the Dependency
� Dependency reduces concurrency
� Can we break� True dependency (RAW)
� Name dependency or False dependency (WAR, WAW)
04.10.2010
8
� Have compiler guarantee no hazards
� Where do we insert the “nops” ?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
� Problem: this really slows us down!
Software Solution
Hardware Solution: ForwardingTime (clock cycles)
Instr
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
04.10.2010
9
Forwarding (simplified)
DataMemory
RegisterFile
MU
X
ID/EX EX/MEM MEM/WB
ALU
Forwarding Unit1. Forwarding between ALUOut and ALUMuxA
sub $2, $1, $3
and $12, $2, $5
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 =>
Use EX/MEM.ALUOut instead of ID/EX.A
a. Some instructions do not write registers
b. Every use of $0 as an operand must yield an operand value of zero
If ( EX/MEM.RegWrite &
(EX/MEM.RegisterRd ≠ 0) &
(EX/MEM.RegisterRd = ID/EX.RegisterRs) )
ForwardA= 01
04.10.2010
10
Forwarding Unit
2. Forwarding between ALUOut and ALUMuxB
sub $2, $1, $3
and $12,$5, $2
EX/MEM.RegisterRd = ID/EX.RegisterRt = $2 =>
Use EX/MEM.ALUOut instead of ID/EX.B
If ( EX/MEM.RegWrite &
(EX/MEM.RegisterRd ≠ 0) &
(EX/MEM.RegisterRd = ID/EX.RegisterRt) )
ForwardB= 01
Forwarding (from EX/MEM)
ALU
DataMemory
RegisterFile
MU
X
ID/EX EX/MEM MEM/WB
MU
XM
UX
04.10.2010
11
Forwarding Unit
3. Forwarding between ALUOut and ALUMuxA
sub $2, $1, $3
and $12, $2, $5
or $13, $2, $6
MEM/WB.RegisterRd = MEM/WB.RegisterRs = $2 =>
Use MEM/WB.ALUOut instead of ID/EX.A
If ( MEM/WB.RegWrite &
(MEM/WB.RegisterRd ≠ 0) &
(MEM/WB.RegisterRd = ID/EX.RegisterRs) )
ForwardA= 10
Forwarding Unit
4. Forwarding between ALUOut and ALUMuxB
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
MEM/WB.RegisterRd = MEM/WB.RegisterRt = $2 =>
Use MEM/WB.ALUOut instead of ID/EX.B
If ( MEM/WB.RegWrite &
(MEM/WB.RegisterRd ≠ 0) &
(MEM/WB.RegisterRd = ID/EX.RegisterRt) )
ForwardB= 10
04.10.2010
12
Forwarding (from MEM/WB)
ALU
DataMemory
RegisterFile
MU
X
ID/EX EX/MEM MEM/WB
MU
XM
UX
Forwarding (operand selection)
ALU
DataMemory
RegisterFile
MU
X
ID/EX EX/MEM MEM/WB
MU
XM
UX
ForwardingUnit
04.10.2010
13
Forwarding (operand propagation)
ALU
DataMemory
RegisterFile
MU
X
ID/EX EX/MEM MEM/WB
MU
XM
UX
ForwardingUnit
Rt
Rs
MU
X
Rd
Rt
EX/MEM Rd
MEM/WB Rd
Forwarding
04.10.2010
14
Datapath with Forwarding Unit
PCInstruc tion
m em ory
Reg iste rs
M
u
x
M
u
x
C ontro l
ALU
EX
M
W B
M
WB
W B
ID /EX
E X/M E M
M EM /W B
D ata
m em ory
M
u
x
Forw ard ing
un it
IF /ID
Ins
tru
cti
on
M
u
x
R d
EX /M EM .Registe rRd
M EM /WB .R egis terRd
R t
R t
R s
IF /ID.R eg is te rR d
IF /ID.R eg is te rR t
IF /ID.R eg is te rR t
IF /ID.R eg is te rR s
Forwarding Unit
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
If ( MEM/WB.RegWrite &
(MEM/WB.RegisterRd ≠ 0) &
(EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
(MEM/WB.RegisterRd = ID/EX.RegisterRs) )
ForwardA= 10
If ( MEM/WB.RegWrite &
(MEM/WB.RegisterRd ≠ 0) &
(EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
(MEM/WB.RegisterRd = ID/EX.RegisterRt) )
ForwardB= 10
04.10.2010
15
Some Other Data Dependencies� add $1, $1, $2 F | D | X | M | W
sw $7, 0($1) F | D | X | M | W
sw $8, 0($1) F | D | X | M | W
sw $9, 0($1) F | D | X | M | W
� add $1, $1, $2 F | D | X | M | W
sw $1, 0($7) F | D | X | M | W
sw $1, 0($8) F | D | X | M | W
sw $1, 0($9) F | D | X | M | W
� lw $1, 0($2) F | D | X | M | W
sw $1, 0($7) F | D | X | M | W
sw $1, 0($8) F | D | X | M | W
sw $1, 0($9) F | D | X | M | W
� Load word can still cause a hazard
Can't always forward
Time (clock cycles)
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Reg
AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
Reg AL
U
DMemIfetch Reg
Reg
AL
U
DMemIfetch Reg
04.10.2010
16
Data Hazard Even with Forwarding
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg
AL
U
DMemIfetch Reg
RegIfetch
AL
U
DMem RegBubble
Ifetch
AL
U
DMem RegBubble Reg
Ifetch
AL
U
DMemBubble Reg
Thus, we need a hazard detection unit to “stall” the load
instruction
Stalling
� Hazard detection unit:
� When the pipeline is stalled:
� Do not fetch a new instruction: Prevent PC and IF/ID registers from changing
� Create a “buble” in the pipeline: Set all control signals to 0 to create a “do nothing” instruction
If ( ID/EX.MemRead &
((ID/EX.RegisterRt = IF/ID.RegisterRs) |
(ID/EX.RegisterRt = IF/ID.RegisterRt) ))
stall the pipeline
04.10.2010
17
Hazard Detection Unit
P CInstruction
m e m ory
R eg is te rs
M
u
x
M
u
x
M
u
x
C ontro l
A LU
E X
M
W B
M
W B
W B
ID /E X
E X /M E M
M EM /W B
D ata
m em ory
M
u
x
H a zard
de tec tion
u nit
Forw ard ing
u n it
0
M
u
x
IF /ID
Ins
tru
cti
on
ID /EX .M e m Rea d
IF/I
DW
rit
e
PC
Wr
ite
ID /EX .R e g is te rR t
IF /ID.R eg is terR d
IF /ID.R eg is terR t
IF /ID.R eg is terR t
IF /ID.R eg is terR s
R t
R s
R d
R tEX /M EM .R eg isterR d
ME M/W B .R egisterR d
Code rescheduling to Avoid Load HazardsTry producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Compiler optimizes for performance. Hardware checks for safety.
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
04.10.2010
18
Branch in the Pipelined Datapath
In s tr uc t io n
m e m o ry
A dd re s s
4
3 2
0
A d dA d d
r es u lt
S hi f t
le ft 2
Ins
tru
cti
on
IF / ID E X / M E M M E M /W B
M
u
x
0
1
A d d
P C
0
A d dr e ss
W ri te
d at a
M
u
x
1
R e g is te r s
R e a d
d a t a 1
R e a d
d a t a 2
R e a d
re g is te r 1
R e a d
re g is te r 2
1 6S ig n
e xte n d
W rite
re g is te r
W rite
d a ta
R e a d
d a ta
D a ta
m e m o ry
1
A L U
re s u ltM
u
x
A L U
Z e ro
ID /E X
Computes branch target address
Computes branch outcome
Changes PC
� When we decide to branch, other instructions are in
the pipeline!
Branch (Control) Hazards
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch
04.10.2010
19
� Stall the pipeline until the branch is complete � Brach is detected in ID stage
� Pipeline is stalled
� Pipeline is started in IF stage� Next instruction
� Branch target
� Three clock cycles will be lost for each branch !!!
Solving Branch Hazards
Reducing Taken Branch Penalty
� Compute branch target address earlier
� Compute branch outcome earlier
04.10.2010
20
Reducing Taken Branch Penalty
� Branch is completed in ID stage
� If branch is taken, flush the pipeline� 1 cycle loss for a taken branch
Taken branch F D X M W
Branch + 1 F FL FL FL FL
Branch target F D X M W
BT + 1 F D X M W
Flushing the Instruction After Branch
PCInstruction
memory
4
Reg iste rs
Mux
Mux
Mux
ALU
EX
M
WB
M
W B
WB
ID/E X
0
EX/ME M
M EM /W B
Data
m em ory
Mux
H azard
detection
unit
Forward ing
unit
IF.F lush
IF /ID
S ign
extend
C ontro l
Mux
=
Shift
le ft 2
Mux
04.10.2010
21
� Continue execution after the branch
� If branch is not taken, no penalty
� If branch is taken, flush the pipeline and loss of 1
clock cycles
Predict–not-Taken (Predict-Untaken)
� What about Predict-Taken?
Delayed Branches
� Execution cycle with a branch delay of length n:
branch instruction
sequential successor1sequential successor2........
sequential successorn
branch target if taken
� Instructions in the branch delay slot are executed irrespective of branch outcome
Branch delay of length n
04.10.2010
22
Delayed Branches on MIPS
� One branch delay slot on MIPS� Taken and untaken branch behaviour are similar
� Compiler must fill in the branch delay slot with useful instructions
Delayed Branches
� Question: What instruction do we put in the branch delay slot?� Fill with NOP (always possible)
� Fill from before (not always possible)
� Fill from target (not always possible)
� Fill from fall-through (not always possible)
04.10.2010
23
Filling Branch Delay Slot
Make sure R7 will not be used in taken path before redefined
Filling Branch Delay Slot
04.10.2010
24
Cancelling Branches
� Improves the ability of the compiler to fill in delay slots
� Instruction includes a bit showing its predicted direction
� When branch behaves as predicted, instruction in the delay slot is executed
� When branch is incorrectly predicted, instruction in the delay slot is turned to NOP
Predict-Taken Cancelling Branch
04.10.2010
25
Summary: Pipelining� Reduce CPI by overlapping many instructions
� Average throughput of approximately 1 CPI with fast clock
� Utilize capabilities of the Datapath � Start next instruction while working on the current one� Limited by length of longest stage (plus fill/flush)� Detect and resolve hazards
� What makes it easy� All instructions are the same length� Just a few instruction formats� Memory operands appear only in loads and stores
� What makes it hard?� Structural hazards: suppose we had only one memory� Control hazards: need to worry about branch instructions� Data hazards: an instruction depends on a previous instruction