Post on 20-Dec-2015
EECE476: Computer Architecture
Lecture 19: Pipelining
Reducing Control Hazard Penalty
Chapter 6.6
The University of British Columbia EECE 476 © 2005 Guy Lemieux
Reminders
• Midterm 1: Tomorrow!
  – 50 minutes, not open book, calculator OK
  – Based on your assignments + lecture material
  – Covers EVERYTHING up to end of week 5
  – No Verilog, Altera tools on midterm
  – Try Lecture 14 study problems
  – Try extra study problems on web
• Partner signup deadline: Friday!
  – MUST work in pairs
    • EMAIL ME IMMEDIATELY IF YOU’RE STILL ALONE
  – MUST email to project TA: 2 names, stud #s, emails
  – Late penalty: 5% of final grade
Directly Reducing Penalty of Control Hazards
• Control hazards demand solutions:
  – Stalling
  – Nullify
  – Branch delay slots
  – All of these negatively impact performance (in various ways)
• Alternative
  – Can we directly reduce the negative impact of control hazards?
• Yes!
  – Execute branch/jump instruction earlier in pipeline
  – Outcome known sooner
  – Fewer instructions enter pipeline after branch (before outcome is known)
  – For BEQ, we must detect “equals” earlier
BEQ in “X” Stage Instead?
• Move logic gates from “M” stage into “X” stage
  – “AND” gate, PCSrc mux
  – These logic gates depend on results *after* ALU
• Benefit
  – Only 2 instructions follow BEQ into pipeline
  – Improves only BEQ-if-taken performance
• But…
  – ALU may be slowest part of pipeline
  – Causes longer delay path after ALU, “X” stage slower
  – Clock rate may be affected
  – This may negatively affect ALL instructions, be careful!
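The relationship between the resolving stage and the number of instructions that follow the branch into the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: it assumes the five stages I, D, X, M, W, one fetch per cycle, and a PC redirect in the same cycle the branch occupies the resolving stage. The function name `instructions_following_branch` is hypothetical.

```python
# Assumed stage order for the five-stage pipeline used in these slides.
STAGES = ["I", "D", "X", "M", "W"]


def instructions_following_branch(resolve_stage: str) -> int:
    """Count younger instructions fetched before the branch outcome is known.

    Assumption: one instruction is fetched per cycle, so every stage the
    branch passes through before resolving lets one more instruction in.
    """
    return STAGES.index(resolve_stage)


for stage in ("M", "X"):
    print(f"BEQ resolved in {stage}: "
          f"{instructions_following_branch(stage)} instruction(s) follow it")
```

Under these assumptions, resolving in “X” gives the 2 instructions stated on the slide, one fewer than resolving in “M”.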
BEQ in “D” Stage Instead?
• Move logic gates from “M” stage into “D” stage
  – But we need ALU to compute (Rs – Rt) and Zero
  – How can we do this?
• Key benefit
  – Only 1 instruction follows “BEQ” into pipeline
• Notice
  – “EQ” can be computed efficiently in “D” stage: (Rs XOR Rt) == 32’b0
    • Simple bitwise XOR
    • “== 32’b0” is simple: wide 32-input NOR gate, single output
  – No need for subtraction
    • Simple logic, no carry chain
  – Move the target-address adder (“… + SgnExt(Imm16)”) into “D” stage as well
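The equality test and the target-address adder described above can be modelled in software as a sanity check. The sketch below is illustrative only: the function names are hypothetical, and `branch_target` assumes the standard MIPS convention (target = PC + 4 + sign-extended immediate shifted left by 2), which the slide does not spell out.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with Python integers


def eq_via_xor(rs: int, rt: int) -> bool:
    """Bitwise XOR followed by a zero detect (the wide NOR gate)."""
    return ((rs ^ rt) & MASK32) == 0


def eq_via_subtract(rs: int, rt: int) -> bool:
    """ALU-style check: subtract, then test the Zero flag (needs a carry chain)."""
    return ((rs - rt) & MASK32) == 0


def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc_plus_4: int, imm16: int) -> int:
    """Assumed MIPS-style target: PC+4 plus the word offset times 4."""
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


# Both equality checks agree, but the XOR form needs no carry propagation.
for a, b in [(5, 5), (5, 6), (0xFFFFFFFF, 0), (0x80000000, 0x80000000)]:
    assert eq_via_xor(a, b) == eq_via_subtract(a, b)

print(hex(branch_target(0x1000, 7)))
```

Neither computation needs the main ALU, which is why both can sit in the “D” stage.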
Reducing Branch Penalty
• Reducing branch penalty
  – Compute (Rs == Rt) and target address in “D” stage
  – Reduces branch delay to 1 cycle
  – Works well
• But, this introduces a forwarding error!
• Suppose
    ADD $1, $2, $3    previous instruction
    BEQ $1, $2, 7     RAW hazard: needs new $1, forwarding?
    NOP               delay slot
  – Dependency causes data hazard
• Solutions?
  – Option 1: Stall until writeback of dependent instruction
  – Option 2A: Forward as much as possible (stall 1 cycle for LW)
  – Option 2B: Forward a bit less (stall 2 cycles for LW, 1 cycle for others)
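All three solutions start from the same check: does the branch in “D” read a register that an older, still-in-flight instruction writes? A minimal sketch of that check follows. The `Instr` record and `branch_has_raw_hazard` are hypothetical names for illustration, not the course's HDU design.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[int]      # register number written, or None
    srcs: Tuple[int, ...]    # register numbers read


def branch_has_raw_hazard(branch: Instr, older: Instr) -> bool:
    """True if the branch reads a register the older instruction writes.

    $0 is hardwired to zero in MIPS, so a write to it creates no dependence.
    """
    return (older.dest is not None
            and older.dest != 0
            and older.dest in branch.srcs)


add = Instr("ADD", dest=1, srcs=(2, 3))     # ADD $1, $2, $3
beq = Instr("BEQ", dest=None, srcs=(1, 2))  # BEQ $1, $2, 7
nop = Instr("NOP", dest=None, srcs=())

print(branch_has_raw_hazard(beq, add))  # BEQ reads $1, which ADD writes
print(branch_has_raw_hazard(beq, nop))
```

Whether a detected dependence turns into a stall or a forward is what separates Options 1, 2A, and 2B.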
Data Hazard with “BEQ”
• Example: Clock cycle 1 (ADD in I)
  Cycle:          1  2  3  4  5  6  7
  ADD $1,$2,$3    I  D  X  M  W
  BEQ $1,$2, 7       I  D  X  M  W
  NOP                   I  D  X  M  W
Data Hazard with “BEQ”
• Clock cycle 2 (ADD in D, BEQ in I)
  Cycle:          1  2  3  4  5  6  7
  ADD $1,$2,$3    I  D  X  M  W
  BEQ $1,$2, 7       I  D  X  M  W
  NOP                   I  D  X  M  W
Data Hazard with “BEQ”
• Clock cycle 3 (ADD in X, BEQ in D, NOP in I): normal forwarding into X doesn’t work, arrives too late!
  – BEQ needs the new $1 in D during cycle 3, but ADD is still computing it in X
  Cycle:          1  2  3  4  5  6  7
  ADD $1,$2,$3    I  D  X  M  W
  BEQ $1,$2, 7       I  D  X  M  W
  NOP                   I  D  X  M  W
Forwarding with “BEQ”
• Clock cycle 4, Option 2B: insert “bubble”, forward to D
  Cycle:          1  2  3  4  5  6  7  8
  ADD $1,$2,$3    I  D  X  M  W
  BEQ $1,$2, 7       I  D  D  X  M  W
  NOP                      I  D  X  M  W
  – Pipeline contents in cycle 4: NOP in I, BEQ held in D, bubble (“?”) in X, ADD in M
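A diagram like the one above can be generated from a list of instructions and the number of cycles each one is held in “D”. This is only a sketch under stated assumptions: stalls happen only in “D”, and a held fetch is drawn once, so the next instruction's “I” appears in the last cycle the previous instruction spends in “D”. The helper names `pipeline_rows` and `render` are hypothetical.

```python
def pipeline_rows(program):
    """program: list of (text, stall_cycles_in_D).

    Returns a list of (text, start_cycle, stages) tuples, where start_cycle
    is the cycle in which the instruction is shown in I.
    """
    rows = []
    start = 1
    for text, d_stalls in program:
        # A stalled instruction repeats D once per stall cycle.
        stages = ["I"] + ["D"] * (1 + d_stalls) + ["X", "M", "W"]
        rows.append((text, start, stages))
        # The next instruction is shown in I one cycle later, plus any stalls.
        start += 1 + d_stalls
    return rows


def render(rows):
    """Draw one line per instruction; '.' marks cycles before it is fetched."""
    lines = []
    for text, start, stages in rows:
        cells = ["."] * (start - 1) + stages
        lines.append(f"{text:<14}" + " ".join(cells))
    return "\n".join(lines)


# Option 2B: BEQ waits one cycle in D so ADD's result can be forwarded from M.
rows = pipeline_rows([("ADD $1,$2,$3", 0), ("BEQ $1,$2,7", 1), ("NOP", 0)])
print(render(rows))
```

In the generated rows, ADD reaches M in cycle 4, the same cycle as BEQ's second D, which is the forwarding path Option 2B relies on.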
Cause of the Error?
• Moved “BEQ” execution from X to D
  – PROBLEM: pipeline currently only forwards data into X stage
• To resolve, we have two options:
  – Option 1: Low performance, add Hazard Detection Unit (HDU) condition
    • BEQ depends on earlier instruction
    • Stall >= 1 cycle until dependent instruction finishes writeback
    • If not told otherwise, assume this approach
  – Option 2: Higher performance, add Forward Detection Unit (FDU) and muxes
    • Option 2A: Forward data from ALU out, DataMem out, W result into D stage
      – Avoids most stalls (no HDU needed, except for LW case again).
      – Longer delay, will probably slow clock and affect all instructions.
    • Option 2B: Stall if dependent instr is in X; forward data from M, W stages into D stage
      – HDU must now stall 1 cycle when BEQ depends on immediately prior R-type instruction.
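The stall counts for the three options can be derived from one rule: the branch waits in “D” until the producer reaches the earliest stage from which its result can be delivered to “D”. The sketch below encodes that rule. The table `READY_STAGE` and the function name are hypothetical. For Option 1 it additionally assumes the register file is written in the first half of a cycle and read in the second half, so the branch can read in the same cycle the producer is in W; without that assumption Option 1 costs one more cycle.

```python
STAGE_INDEX = {"I": 0, "D": 1, "X": 2, "M": 3, "W": 4}

# Earliest producer stage whose result can reach the D stage, per option.
# "alu" = R-type result, "load" = LW result. Option 1 row is an assumption
# (same-cycle register-file write then read).
READY_STAGE = {
    "1":  {"alu": "W", "load": "W"},   # wait for writeback
    "2A": {"alu": "X", "load": "M"},   # forward ALU out / DataMem out
    "2B": {"alu": "M", "load": "W"},   # forward from M and W stage registers
}


def branch_stall_cycles(option: str, producer_kind: str, distance: int) -> int:
    """Stall cycles for a branch in D whose producer is `distance` instructions older.

    distance=1 means the immediately preceding instruction.
    """
    # Stage the producer occupies when the branch first reaches D.
    producer_stage = STAGE_INDEX["D"] + distance
    ready = STAGE_INDEX[READY_STAGE[option][producer_kind]]
    return max(0, ready - producer_stage)


for option in ("1", "2A", "2B"):
    for kind in ("alu", "load"):
        print(option, kind, branch_stall_cycles(option, kind, distance=1))
```

For an immediately preceding producer, this reproduces the slide's figures: 2A stalls only for LW (1 cycle), and 2B stalls 1 cycle for R-type and 2 cycles for LW.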
Control Hazards Summary
• Branches/jumps cause interruptions to control flow
  – This affects the stream of instructions entering pipeline after the branch/jump
• These interruptions cause a utilization problem
  – We may fetch the wrong instruction(s) after branch/jump
    • Option 1: stall after every branch/jump
    • Option 2: nullify-if-branch-taken (small performance improvement)
    • Option 3: declare as a “delay slot”, always-execute (bad idea for future ISAs)
  – Default: assume MIPS behaviour, i.e. Option 3 with 1 delay slot
• To reduce utilization problem, move branch/jump to D stage
  – This introduces new data hazards if the branch depends upon recent instructions
• These hazards introduce a new forwarding problem
  – Branch/jump may depend on result of recent instruction(s)
    • Option 1: HDU forces stall until writeback (multiple cycles)
    • Option 2: minimal HDU stall, forward data when dependence can be resolved
      – 2A: more forwarding needed, stall only for LW, may slow down clock speed
      – 2B: less forwarding needed, stall for LW and R-type until forwarding can be used
  – Default: assume FORWARDING OPTION 1 for this course